In the previous chapter we saw the importance of crawlability, and how to start analyzing it in Google Search Console. To go further, it is necessary to examine the behavior of Googlebot through your server logs.
The Purpose of Analyzing Logs
Your logs contain the only 100% accurate data about how search engines browse your website. The logs record information you can’t find anywhere else.
Below are three more reasons to track your crawlability in both Search Console and your server logs.
1. Explore Your Crawl Budget and Where it Is Being Wasted
You can examine the crawl statistics to see whether the most important pages on your site are the ones Googlebot visits most, or whether it spends more time on older ones.
In particular, you can see:
Which pages are most often viewed by Googlebot.
The ratio between the number of pages visited by Googlebot and the total number of pages of your site.
You can also see if your site is being explored in its entirety, or if the Google spider gets stuck in some places.
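As a rough sketch of these two measurements, the snippet below counts Googlebot hits per URL and computes the share of the site's pages crawled at least once. It assumes you already have a list of paths requested by Googlebot (extracted from your logs) and a complete list of your site's pages. The function name and the sample paths are made up for illustration.

```python
from collections import Counter

def crawl_summary(crawled_paths, site_paths):
    """Return (pages ranked by Googlebot hits, share of site pages crawled)."""
    hits = Counter(crawled_paths)
    site = set(site_paths)
    # Only count pages that belong to the site, ignoring stray URLs in the logs.
    crawled_site_pages = site & set(hits)
    ratio = len(crawled_site_pages) / len(site) if site else 0.0
    return hits.most_common(), ratio

# Hypothetical data: paths Googlebot requested, and every page on the site.
crawled = ["/old-product", "/", "/old-product", "/old-product", "/"]
site = ["/", "/old-product", "/new-product", "/contact"]

ranking, ratio = crawl_summary(crawled, site)
print(ranking)
print(f"{ratio:.0%} of the site's pages were crawled")
```

A low ratio, or a ranking dominated by outdated pages, suggests that crawl budget is being spent in the wrong places.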
All this information will enable you to take the appropriate actions. Let's look at a concrete example of how this information might be useful.
You own an e-commerce site. Thanks to the log analysis, you notice that last season’s products get more views than the new ones. You can offer special discounts to sell the older inventory, then remove those items from your catalog and redirect visitors to the new stock. Or you can add links to the new product pages to encourage Google to visit them.
2. Identify Pages That Are Rarely/Never Visited
You can also sort pages by crawl frequency to see which pages on your site are the least visited.
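One possible way to build such a ranking is sketched below. It assumes the same two hypothetical inputs as a manual analysis would use: the paths Googlebot requested and the full list of site pages. Pages that never appear in the logs get a count of zero and come first.

```python
from collections import Counter

def least_crawled(crawled_paths, site_paths, limit=10):
    """Return up to `limit` (path, hits) pairs, least crawled first."""
    hits = Counter(crawled_paths)
    # A Counter returns 0 for missing keys, so never-crawled pages sort first.
    ranked = sorted(site_paths, key=lambda path: (hits[path], path))
    return [(path, hits[path]) for path in ranked[:limit]]

# Hypothetical data for illustration.
crawled = ["/old-product", "/", "/old-product", "/old-product", "/"]
site = ["/", "/old-product", "/new-product", "/contact"]

for path, count in least_crawled(crawled, site):
    print(count, path)
```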
3. Identify Orphan Pages
These are pages on your site that no other page links to. For example, Google may have found them through the sitemap, but not through menus or links on other pages or in articles.
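Detecting orphan pages comes down to a set difference, as the sketch below suggests. It assumes two inputs you would have to prepare yourself: the URLs known from the sitemap and the logs, and the URLs reachable by following internal links (for example, an export from a crawler). All names and URLs here are hypothetical.

```python
def find_orphans(known_urls, internally_linked_urls):
    """Return URLs that exist (sitemap or logs) but receive no internal link."""
    return sorted(set(known_urls) - set(internally_linked_urls))

# Hypothetical inputs: URLs from the sitemap and the logs, and URLs reached
# by following internal links starting from the home page.
sitemap_and_logs = ["/", "/blog", "/blog/old-post", "/landing-2019"]
linked = ["/", "/blog", "/blog/old-post"]

orphans = find_orphans(sitemap_and_logs, linked)
print(orphans)
```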
Retrieve Your Log File
A log file is a plain-text file with one line per request. Its exact format depends on which software the server runs and which host you use.
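To make the structure of a log line concrete, here is a minimal sketch that parses a single line. It assumes the Apache/Nginx "combined" log format, which is common but not universal, so your server may be configured differently. The sample line is invented for illustration and does not come from a real site.

```python
import re

# Pattern for the "combined" log format (an assumption about your server setup).
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None

# Hypothetical sample line, made up for illustration.
sample = (
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /mynewproduct HTTP/1.1" 200 5316 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["path"], entry["status"], entry["user_agent"])
```

Each line records who made the request, when, which URL was requested, the HTTP status code returned, and the user agent.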
Here's how to retrieve the logs:
Log in through FTP.
Go to the logs directory (top level of the directory).
The logs will have the form: access.log.XX.gz.
Go to https://logs.ovh.net/ to download it.
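Once the archives are downloaded, they can be read without unpacking them by hand. The sketch below assumes the files sit in a local folder and follow the access.log.XX.gz naming shown above. The function name and the folder name in the usage comment are hypothetical.

```python
import gzip
from pathlib import Path

def iter_log_lines(log_dir):
    """Yield every line from the gzip-compressed access logs found in log_dir."""
    for archive in sorted(Path(log_dir).glob("access.log*.gz")):
        # "rt" opens the archive as text; errors="replace" tolerates odd bytes.
        with gzip.open(archive, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")

# Usage, assuming the archives were saved in a local "logs" folder:
# for line in iter_log_lines("logs"):
#     print(line)
```

Reading the files lazily, one line at a time, keeps memory use low even when the logs cover several weeks.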
Host using cPanel
Type "log" into the search bar.
Click on Raw Access.
Check the configuration boxes if they aren't already checked. Otherwise, you will only get the current day's log file, which isn't enough.
To get your logs, click on the corresponding domain name under Domain.
Use Appropriate Tools to Analyze Your Logs
Your raw log file is not yet usable as is: it records every HTTP request made to your server, not only Googlebot’s.
Additionally, after filtering your file, you must organize the data and draw graphs before you are ready to analyze Googlebot’s behavior.
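The filtering step can be as simple as keeping the lines whose user agent mentions Googlebot, as in the minimal sketch below. The log lines are hypothetical and shortened. Keep in mind that a user-agent string can be spoofed, so stricter checks (such as the reverse DNS lookup that Google documents for verifying Googlebot) may be needed for reliable numbers.

```python
def googlebot_lines(lines):
    """Keep only the log lines whose text mentions Googlebot."""
    return [line for line in lines if "googlebot" in line.lower()]

# Hypothetical log lines for illustration.
lines = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (Windows NT 10.0)"',
    '66.249.66.1 - - [10/Oct/2023:13:56:02 +0000] "GET /blog HTTP/1.1" 200 734 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

googlebot_only = googlebot_lines(lines)
print(len(googlebot_only), "of", len(lines), "requests came from Googlebot")
```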
Here is a list of tools to make the job easier:
Oncrawl: a free and open source log analyzer.
Seolyzer: free and easy to install.
A data processing tool like Excel or Google Sheets: It is possible to analyze your logs directly in a spreadsheet. A log file is a text file, so it can easily be converted to a CSV file and imported. You are then free to use spreadsheet features (such as filters and pivot tables) to analyze your logs.
Screaming Frog SEO Log File Analyser: the most popular crawling tool and log analyzer. It has a very limited free version, as well as a paid version (£99.00).
Botify: crawler and log analyzer - premium tool with a high price tag.
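If you choose the spreadsheet route, the conversion to CSV can be scripted. The sketch below assumes each log line has already been parsed into a dictionary. The keys used here (time, path, status, user_agent) are hypothetical names chosen for illustration, not a standard.

```python
import csv
import io

def entries_to_csv(entries, fieldnames):
    """Serialize parsed log entries to CSV text with a header row."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()

# Hypothetical parsed entries.
entries = [
    {"time": "10/Oct/2023:13:55:36 +0000", "path": "/", "status": "200",
     "user_agent": "Googlebot/2.1"},
    {"time": "10/Oct/2023:13:56:02 +0000", "path": "/gone", "status": "404",
     "user_agent": "Googlebot/2.1"},
]

csv_text = entries_to_csv(entries, ["time", "path", "status"])
print(csv_text)
```

Saving the returned text to a .csv file gives you something Excel or Google Sheets can open directly, ready for filters and pivot tables.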
Example with Seolyzer
Here's a Seolyzer demo video that shows how to use it.
To recap, here are the screens I find most useful:
The HTTP Codes screen. This screen allows you to see the HTTP codes returned by your pages, and the 4xx and 5xx error pages. It is also possible to view a lot of these errors in Search Console.
Log analysis lets you go further: you can find every page that returns these errors and see exactly when each error first occurred.
The HTTP vs. HTTPS screen (if your site uses HTTPS). It allows you to check that all the pages are returned in HTTPS, and that your HTTP to HTTPS redirects are correct.
The Active Pages screen. This screen allows you to see which pages have received at least one visitor from search results.
Focus more on the inactive pages. To view them, export the list of your active pages and compare it to the complete list of pages on your site in a spreadsheet.
This lets you spot pages whose SEO needs work.
The Most Crawled screen. Here, look at the least crawled pages.
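If you work without a dedicated tool, the HTTP codes check described above can be approximated by hand. The sketch below is not how Seolyzer works internally. It simply assumes each Googlebot request is available as a (time, path, status) tuple, with the timestamp in the usual access-log style, and records when each 4xx or 5xx error first appeared. The entries are invented for illustration.

```python
from datetime import datetime

# Timestamp layout commonly found in access logs (an assumption).
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def first_errors(entries):
    """Map each (path, status) with a 4xx/5xx code to its earliest timestamp."""
    first_seen = {}
    for time_text, path, status in entries:
        if status < 400:
            continue  # Not an error response.
        when = datetime.strptime(time_text, LOG_TIME_FORMAT)
        key = (path, status)
        if key not in first_seen or when < first_seen[key]:
            first_seen[key] = when
    return first_seen

# Hypothetical entries for illustration.
entries = [
    ("11/Oct/2023:09:00:00 +0000", "/gone", 404),
    ("10/Oct/2023:13:55:36 +0000", "/gone", 404),
    ("10/Oct/2023:14:00:00 +0000", "/", 200),
    ("12/Oct/2023:08:30:00 +0000", "/checkout", 500),
]

errors = first_errors(entries)
for (path, status), when in sorted(errors.items(), key=lambda item: item[1]):
    print(when.isoformat(), status, path)
```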
Let’s look at an example:
You discover that your new product (https://mysite.com/mynewproduct) does not show up in search results. After analyzing the logs, you find that the page is inactive.
On the screen showing the most crawled pages, you notice that this page is very rarely crawled, which could explain why it doesn't show up in searches.
This pushes you to audit your site. You realize that this page is very difficult to access, because it's only available during special sales.
You can now take the necessary steps to improve your SEO!
Compare Results to What You See in Search Console
Once you have finished your analysis using one of the tools above, compare your findings with those reported in Search Console, including which pages have errors.
You will notice one of three things:
1. You find similar results
This means that you have carried out your log analysis correctly. You can now fix the errors you found.
2. You find more errors through the log analysis
Don't panic, this is quite normal: if Googlebot doesn’t crawl your entire site, it makes sense that not all errors are listed in Search Console.
3. You find fewer errors than in the Search Console
In this case, you need to check whether you made any mistakes in your analysis, or whether the period covered by your logs was too short.
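The comparison itself can be sketched with a few set operations. This example assumes you have two lists of URLs: those returning errors according to your logs, and those exported from the error report in Search Console. The function name, dictionary keys, and URLs are hypothetical.

```python
def compare_errors(log_error_urls, console_error_urls):
    """Split error URLs into those seen in both sources or in only one."""
    logs, console = set(log_error_urls), set(console_error_urls)
    return {
        "in_both": sorted(logs & console),
        "only_in_logs": sorted(logs - console),
        "only_in_search_console": sorted(console - logs),
    }

# Hypothetical error lists for illustration.
from_logs = ["/gone", "/checkout", "/old-category"]
from_console = ["/gone", "/checkout"]

report = compare_errors(from_logs, from_console)
for group, urls in report.items():
    print(group, urls)
```

Here the logs reveal one extra error, which corresponds to the second case above. A non-empty only_in_search_console list would correspond to the third case and call for a second look at your analysis.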
Logs are the only 100% accurate data source for how crawlers are behaving on your website.
By analyzing your log file, you will be able to determine whether Googlebot is crawling your website and your main pages correctly.
Remember to compare your results with those given by the Search Console.
You now know how to interpret what Googlebot is doing on your site and what this means for your crawlability. In the next chapter, you'll see how to improve crawlability and indexing!