Did You Know These SEO Tactics? - Picardes

Log File Analysis Study: Learnings from Analysing 31 Million Log File Hits

Roman Adamita — Sun, 23 Oct 2022 15:36:58 +0000

Log file analysis is something that I like the most regarding a website’s technical analysis. It makes me think and act according to the insights I get based on the log file data. Whenever I receive log files for a website, I try to understand why search engines crawl those URLs the most or less.

I’m here to share a study on the log file analysis I got in the last year. Previously, I shared all the considerable essential knowledge on the Screaming Frog blog, where I analyzed an e-commerce website with over 7 million log file events. Those log file events ended up being around 32 million.

In the meantime, I analyzed various websites’ server log files from multiple industries. Below I’ll include the learnings based on the data from 32 million log file events and other websites where I received the log files. Before starting, I’d like to clarify the context of the server log files.

What is server log file analysis?

Analyzing the server log files allows you to receive the data based on how search engines like Google crawl a website. By analyzing these log file events, you will understand where Googlebot, Yandexbot, Bingbot, or other major search engines are wasting or investing time in crawling your web pages.

In this case, the server log file analysis study is only based on search engine activities.

What kind of data is in server log file events?

You can see many things whenever you get enough log file events. But in a nutshell, you’ll be able to find:

How are search engines crawling a website?
Do search engines crawl pages that shouldn’t crawl?
Do search engines crawl the robots.txt file?
How often are the most critical pages crawled?
Do search engines crawl any broken pages?
Are there any pages with server timeout issues?
Which external resources are pointing to the pages?

The above questions and more can be found in the official log file events.

How to ensure you got the correct log file for SEO?

There might be various official and non-official log events regarding requesting for SEO purposes. For example, if you only want to receive the log file events that belong to Google, you will need to ensure that the log file lines include an IP address starting with 66.249. There are multiple user agents of Googlebot, so you must ensure that the IP addresses are official.

Non-official log file events can be due to tools that analyze the website. DeepCrawl, ContentKing, Ahrefs, and Semrush crawlers have their IP addresses, for example. From these tools, anyone can crawl a website and will be included in the log files when someone crawls a website.

Example official log file of a Googlebot event:

66.249.76.201 – [22/Aug/2022:22:31:50 +0100] “GET /landing-page HTTP/1.1” 200 889484 “-” “Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” 127 cookie:-

The above example log is only a line from the log file events from an official Googlebot user agent request.

What Data Would You Need to Analyse the Log Files?

Requested URL: this will show the URL crawled by a user agent.
Timestamp: this will show when (exact date and time) a user agent requests to crawl a URL. In the above example log line is [22/Aug/2022:22:31:50 +0100].
Remote Host: this will show which IP address is requested to crawl a URL. If you’re looking for Googlebot log file events, you’ll need to find only the IP addresses that start with 66.249. In the above example is 66.249.76.201.
User Agent: this will show what type of user agent is requested to crawl a URL. If we take the IP addresses that start with 66.249, in this case, they can be “Googlebot”, “Googlebot Smartphone”, “Googlebot Image”, and more that is related to Googlebot user agents. In the above example is “Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”.
Method: this will show if the URL was requested from a specific resource as a “GET” request method or if the URL was processed to a particular resource as a “POST” method. In many cases, the method request will be GET, which is the most common HTML request method. In the above example is indicated as “GET”.
HTTP Version: this will show the HTTP version. I’ve seen the HTTP version in many cases as either 1.1 or 2.0. In the above example is HTTP/1.1.
Response Code: this will show the URL’s returning status code, such as “200-OK”, “301 Permanent Redirect”, “404 Not Found”, “500 Server Error”, and so on.
Referer: this will show which page referred to crawl any of the linked URLs. This can be an external or internal source.

As I’ve clarified, I’ll share my latest log file analysis learning below.

Log File Analysis is Not About Getting Ranked 1st Tomorrow

Just because you found some insights on log files and took actions accordingly, this won’t help you to get ranked tomorrow.

Log files are about understanding complex technical things that can’t be easily seen without analyzing them. Log file events will allow you to see where search engines are wasting or investing time while crawling a website.

Technical SEO solutions, most of the time, are about long-term solutions and results, which can impact the whole organic performance.

More Log File Events, More Actionable Insights

Log file events don’t mean anything if you don’t get meaning from the data you receive. Analyzing the log events for only one month might mean just thousands of log file hits.

You must be on track for at least a few months to get more meaningful insights and act accordingly. In my experience, I saw many scenarios where Google crawled pages for a month, but those pages weren’t indexed yet.

Whenever a page is crawled, it doesn’t mean it will be indexed immediately, especially when it’s high with JavaScript. This blog post from Ziemek of Onely covers it in more detail: “It takes 9x more time for Google to crawl JavaScript than HTML.”

Tracking the log file hits for extended periods will allow you to learn more and understand whenever those results need action.

Try to Understand If the Log File Results Need Actions

You can have thousands of broken pages in the log file events, but:

If none of those pages is in an HTML format,
None of those pages was indexable on search engines,
None of those pages got traffic from any channels.

It’s nonsense to redirect them somewhere else. Prioritizing these broken links to redirect won’t improve the website’s crawl budget. In most cases, improving the crawl budget is about optimizing millions of URLs crawled by Googlebot but not indexed.

Google Never Forgets a URL That Ever Existed

Google never forgets a URL that ever existed. Even a script URL, as you see in the above image.

I’d strongly advise you to only pay attention to those issues that can positively impact your organic performance.

E.g.,

Broken but valuable pages should be 301-redirected for at least one year. Redirecting these pages to a relevant page is the best practice.

When you get broken pages that once were valuable for your business, try to 301-redirect all of them to another related page. Keep the 301 redirections for at least one year since Googlebot constantly crawls them. It’s also recommended by Google (Google’s John Mueller and ex-Googler Matt Cutts).

Treat Valuable Broken Pages Like Dead Backlinks from Useful Resources

Treat valuable broken pages like dead backlinks from useful resources.

You will lose the backlink’s value from that external resource if you don’t redirect the broken linked page elsewhere. It’s the same for your internal broken pages with or without backlinks.

To find broken internal links, you can use Screaming Frog’s SEO Spider broken link checker, as it’s the easiest way I’ve experienced.

You Don’t Need to Remove All the “404 Not Found” Pages Intentionally

You don’t need to intentionally remove all 404 pages (by making them return a “410 Gone” status code). I was wrong when I said Googlebot wouldn’t crawl 410 intentionally removed pages more than two times. I kept analyzing the log file events for the “410 Gone” URLs, and from what I’ve experienced, Googlebot kept crawling those URLs. The only difference I’ve seen is that “410 Gone” URLs can slightly be less crawled compared to the “404 Not Found” URLs.

Having 404 pages is also not a bad practice from a crawl budget perspective. You only lose the value of that page if you don’t redirect it.

A Better View of Log File Events

If you want a better view of log file events, export clicks from Google Search Console and page speed score from PageSpeed Insights data. To easily export this data, you can use Screaming Frog SEO Spider’s GSC and PSI integration APIs.

When you get this data, import it into Screaming Frog Log File Analyser. This will allow you to make better decisions while prioritizing the technical fixes. You can see my research and learnings in the blog post I shared with Screaming Frog.

Using WooCommerce for WordPress

If you’re using WordPress and have WooCommerce but don’t show products on every page, ensure that you disable the “AJAX add to cart buttons” option, as this will make a filter parameter and will be crawled by Googlebot every single time.

Googlebot Will Always Crawl the Ads.txt File

Ads.txt file will be crawled by Googlebot all the time. This is not a problem, even if the /ads.txt URL returns a 404 status code. It’s a part of your crawl budget; Googlebot will crawl the ads.txt file consistently to ensure that your website includes AdSense.

Adx.txt file is for the website owners to protect the earnings from Google AdSense.

An Alternative to Server Log File Analysis: Crawl Stats

If you can’t export the log files, use Crawl Stats on your Google Search Console account. This is also an alternative to log file analysis, but with a 1K row and 90-day limit.

Crawl stats have many great features you won’t see on the log files. One of the features I like the most is under the “By purpose” report, which shows if the URLs in the list were crawled for the first time (under “Discovery”) or it was crawled for refresh purposes (under “Refresh”).

Crawl stats can be found under settings in your Search Console property.

What % of Log File Analyses Result in Actionable Insights?

I’ve got this question from Andrew Shotland on Twitter, and I believe many of you could ask the same, as it was interesting for me, too.

I’ll continue studying the log file events, investigating how search engines crawl websites, experimenting with things considered “best practices” that have yet to be tested, and having fun.

In the meantime, if you want to support and keep me motivated, you can share this post on Twitter or your LinkedIn account.

The post Log File Analysis Study: Learnings from Analysing 31 Million Log File Hits appeared first on Picardes.

Which Search Engine Indexed My First Blog Post Faster?

Roman Adamita — Sun, 30 Jan 2022 17:03:12 +0000

Later this week, on January 27, I published my first blog on this website. Before that, on January 21 I completely removed all the content and started all over again. I gave search engines a little bit of time to understand what happened here and where old URLs are now 301-redirected.

If you missed the first post where I covered a quick way to find and solve out-out-stock products, here is the link to take a look.

I wanted to know how fast search engines will discover and index this blog post. So I tracked them while doing some log file analysis.

It might be a detail that matters:

Bing and Yandex bots didn’t have access to crawl my website due to the “400 Bad Request” issue. This status code usually appears when blocking a specific page to a particular thing – country, bots, IP addresses, etc. I never wanted to block Yandexbot or Bingbot. But I solved this issue on 24th January by disabling ModSecurity on cPanel.

I thought it might be helpful to know some of these details before going to the story-based SEO research.

Which things did I examine?

Which search engine bot crawled this post first?

Well, Bingbot was fast, and it’s a clear winner of this question. I published the blog post on January 27, 2022, at 3:00 PM. According to the log file events I have, Bingbot crawled it 2 hours later on the same day. This log event coincides with the Bing Webmasters tool’s URL inspection details.

5 hours later, Yandexbot crawled it 2 times repeatedly.

As far as I see, the Yandex Webmaster tool doesn’t give the exact time of its crawl statistics. But I had a chance to catch it on log file events. You might be interested to see what it looks like, so of course, I will share it with you.

How many times did Bingbot and Yandexbot crawl the post until indexing it?

Since I published the blog post, Bingbot and Yandexbot crawled just on the same date and never looked back.

In total, Bingbot crawled one time, and Yandexbot crawled two times.

Did all bots crawl the XML sitemap?

No. But in this case, it was clear that Googlebot takes care of the XML sitemap. Since January 21:

Googlebot crawled the main XML sitemap 13 times. Post-based sitemap was crawled 4 times.
On the day I published the blog post, Googlebot crawled the main XML sitemap 7 times. But the other sitemap, which included blog posts, was crawled once.
When it comes to Yandexbot, their bot crawled the main sitemap about 2 times. As well both times post-based sitemap.
Bingbot never crawled any of the XML sitemaps.

Are those log events by a verified crawler?

Yes. I set to analyze just verified bots on Screaming Frog Log File Analyser. Here are the IP addresses that I got until then.

Top 5 Yandexbot IP addresses (by the number of events):

5.255.253.114
213.180.203.5
5.255.253.124
5.255.253.110
5.45.207.99

It’s easy to find them publicly on the web so that’s why I didn’t hesitate to share them with you.

Top 5 Bingbot IP addresses (by the number of events):

40.77.167.99
157.55.39.28
207.46.13.230
207.46.13.100
40.77.167.39

You can check if the crawler IP address belongs to Bingbot or not by using Bing’s Toolbox or Bing Webmasters’ Verify Bingbot.

Findings

In this research, I discovered something weird – which might be just for me.

When I wanted to check if Googlebot even discovered my blog post on Google Search Console, it was discovered but is currently not indexed.

Well, there wasn’t any date of discovering the blog post. So I checked the “Discovered – currently not indexed” coverage report, but there’s no blog post URL.

At first look, I didn’t know the reason for it. But then I thought of two things:

Googlebot crawled my XML sitemaps.
Googlebot crawled my homepage and the category of the blog post.

I was right, and also Google by not showing the URL in the “Discovered – currently not indexed” issue reports list. When I checked Google search results by entering the page title, the homepage and author page were there.

Sometimes we need to look at complex things simply. So we can understand it faster and better.

Conclusions

Clearly, Googlebot was too busy while understanding my new website version.
Bing and Yandex indexed my blog post on the day I published it.
After three days, Google still didn’t index my blog post.

Update — February 6, 2022:

The blog post was finally indexed by Google on February 4, after 8 days.

The post Which Search Engine Indexed My First Blog Post Faster? appeared first on Picardes.

A Quick Way to Find Out-of-Stock Products on eCommerce

Roman Adamita — Thu, 27 Jan 2022 12:51:18 +0000

I was thinking about my first blog post and decided to post one of my favorite methods that I use very often.

Since I enjoy working with e-commerce websites, the most common issue I face is out-of-stock products, their impact on conversion rate, and of course, organic performance. There are not too many secrets on why you must prioritize these tasks, such as finding out-of-stock products. But I will explain a bit why you must find and solve this issue.

What’s wrong with out-of-stock products?

Out-of-stock products can be crawled as most as in-stock products. – Don’t let your website’s crawl budget be spent with these kinds of products.
They can still be indexable and found by the new users. – Your brand can create a bad reputation just because of this.
When there are a lot of them, you might face Soft 404 errors. – Google usually doesn’t like out-of-stock product pages and, in some unusual cases (see bellow), decides to remove them from search results (but still will crawl to know if there are any changes).

In most cases “Soft 404” error causes these things:

Including “This product is no longer available.” notice on the product page.
The page or category might be empty, which doesn’t meet visitors’ expectations.
Including something like “Sorry this page is not working.” would be a potential why your page was removed from Google’s search results.
When your website is small, showing “You need to enable JavaScript to run this app.” notice also could be a potential why your page is included in the Soft 404 report (when Googlebot will crawl-render it again, your website will get indexed).

You can also check on Google’s source how to fix soft 404 errors.

Now, let’s get to the point.

How to find out-of-stock products?

If your CMS has a filter for that, you can find them easily. But in most cases, it’s a bit challenging to filter and find every out-of-stock product, especially if you’re an SEO consultant asking developers to report them to you.

I usually use Screaming Frog’s SEO Spider to find it.

Find one of the out-of-stock products and copy the exact text.
Follow this menu on the top: Configuration > Custom > Search.
Enter the exact text as bellow in the Custom Search form.

You need to choose the “Page Text No Anchors” filter but not others because you need to find just the simple text containing the exact term.

In this case, the website included the “OUT OF STOCK” exact text in their pages. Remember that it’s crucial to have the same text on every stock-out product page.

Now, as you have the filter included in the Custom Search form, click OK and start the crawl for your domain.

Once the crawl is finished, follow these steps:

Go to the Custom Search tab.
You will see “All” selected under tabs on the right side. Change it to the name you gave to find the exact pages you are looking for.

What now? How to solve stock-out products?

As my point in this blog post is not to go into the details, I will explain to you quickly how to solve it.

The products are not going to be in stock again:
- Redirect the product to its category. (Home > Category <301-redirect> Out-of-Stock Product)
- Remove from the XML sitemap all of those products. But not on the same day you did the redirection. Give a little time (a week or so) to Googlebot to crawl those 301-redirects. By using this method, you’ll also be able to track how was the impact.
- If those products were created by accident or duplicates to another one, I suggest you make them “410 Gone.” Googlebot might crawl it a few times, but not as much as a “404 Not Found” status code.
- If “Other Suggested Products” under the product page is not working dynamically, check if there still are these stock-out products.
The stock of products will be updated in the meantime:
- Include a notice such as “This product will be available soon!”
- Let people enter their email addresses to notice when the product will be in stock again.
- Until the products come in stock, make the pages in the “304 Not Modified” status code as a recommendation.

I hope it makes a bit of sense why this one it’s so important to be aware of, even if you’re not an SEO professional.

Thanks to Eric Mandell for sharing a similar method with many pieces but for different perspectives. If you want to read Eric’s practice of using SEO Spider to find internal linking opportunities, I suggest looking at this blog post.

There are a ton of things that you can do with SEO Spider, so don’t hesitate to check some little tips on Twitter.

Tweet this post

The post A Quick Way to Find Out-of-Stock Products on eCommerce appeared first on Picardes.