What follows is a basic log file analysis that assumes some knowledge of SEO, Microsoft Excel formulas and Screaming Frog SEO Spider (or other crawling software).
I’ve also provided example findings from a recent log file analysis of salience.co.uk.
- Why is Log File Analysis Important?
- Software Used for Our Log File Analyses
- Log File Analysis Process
- Example Findings from salience.co.uk Log Files
- Final Thoughts & Useful Links
Why is Log File Analysis Important?
Guiding search bots (such as Googlebot and Bingbot) around a website is an important aspect of technical SEO. You want to send crawlers to the pages you want to rank in search results, while keeping them away from unimportant pages.
This can be something as simple as a well-structured main menu with a clear hierarchy that leads to products, e.g. Beds > Double Beds > Jameson Natural Pine
You don’t want search bots crawling account pages such as individual customer orders (I’ve seen this, though none had personally identifiable information).
Server log files show every hit to a website. We filter this data to show search bots only, and analyse it to understand search engine crawl behaviour. Which pages do search engines see as important, for example? It’s often not the top-level category pages you’d expect.
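As a rough illustration of what that filtering involves, here’s a Python sketch that parses a single hit from a hypothetical Apache combined-format log line (the exact format varies by server) and checks whether the user-agent belongs to a search bot:

```python
import re

# Parse one Apache combined-format line into the fields a log file
# analysis cares about. The sample line below is made up.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

line = ('66.249.66.1 - - [10/Oct/2019:13:55:36 +0000] '
        '"GET /beds/double-beds/ HTTP/1.1" 200 5120 '
        '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

hit = LINE_RE.match(line).groupdict()
is_search_bot = "Googlebot" in hit["agent"] or "bingbot" in hit["agent"].lower()
print(hit["url"], hit["status"], is_search_bot)  # /beds/double-beds/ 200 True
```

In practice, Screaming Frog Log Analyser does this parsing for you; the sketch just shows what sits behind the columns.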
Every website has a crawl budget. If this is wasted on pages that shouldn’t be crawled, it can lead to the website not being crawled or indexed as well as it should be. Generally, the more pages a website has, the greater the benefits a log file analysis can provide, though we have seen indexation improvements on websites with as few as 150 pages indexed in Google (e.g. new pages crawled and indexed more quickly after reducing crawl budget waste). Like removing thin pages during a content review, a log file analysis can find low-quality pages that search engines crawl but don’t deem worthy of decent rankings.
The order pages I mentioned numbered in the thousands, and were responsible for a significant chunk of crawl budget. According to crawls we ran on the website, the pages weren’t linked internally. They weren’t indexed in search engines either. Without the log files, they wouldn’t have been found.
Distilled has an excellent section on the benefits of log file analysis: Why should you do log analysis?
One excellent point is “By seeing where Google spends its time on your site you can prioritise the areas that will affect it the most.”
Software Used for Our Log File Analyses
- Screaming Frog SEO Spider
- Screaming Frog Log Analyser
- Microsoft Excel (may not be suitable to analyse large amounts of data – mine starts struggling at around a million rows)
Log File Analysis Process
Obtaining Log Files
Obtaining the correct log files from web developers or clients isn’t always straightforward. Screaming Frog has a template e-mail to help:
Some websites’ log files are huge, so we recommend zipping files before they’re sent and/or filtering them to the search bots we analyse.
I start a log file analysis by crawling the website with Screaming Frog SEO Spider, following all internal links (including nofollow, XML sitemaps and hreflang if relevant) and connecting to the Google Analytics API for recent organic traffic data (at least a month, over the same date range as the log file data).
Analysing Log File Data with Log Analyser and Excel
I import the log files into Screaming Frog Log Analyser and verify genuine bot hits (Project > Verify Bots if the log files are already imported), then filter to show Verified bots only:
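Log Analyser handles the verification for you, but for the curious, the underlying check is a reverse DNS lookup followed by a forward confirmation, matching Google’s published guidance on verifying Googlebot. A minimal Python sketch:

```python
import socket

def is_verified_googlebot(ip):
    """Sketch of the double-DNS check: reverse DNS the IP, confirm the
    hostname is Google's, then forward-resolve it back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup must include the original IP, ruling out spoofing.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# A spoofed "Googlebot" user-agent from a non-Google IP fails verification.
print(is_verified_googlebot("127.0.0.1"))  # False
```

This is why verification matters: anyone can send a “Googlebot” user-agent string, but they can’t fake the DNS round trip.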
All verified bots are pasted into an Excel template. If you don’t have suitable spreadsheet software, check the “Useful Links” section below for further information on how powerful Screaming Frog Log Analyser is. You can find plenty of actionable information with Log Analyser alone.
It’s worth comparing the graph on Log Analyser’s Overview tab (Events panel filtered to Verified, All Googlebots) to the Crawl Stats graph in Google Search Console. They rarely match exactly, but I’ve never seen them differ wildly. If they do, some log file data may be missing. Here’s one example of a graph matching enough to be confident the data is correct (Google Search Console on the left and Log Analyser on the right):
Added to the default columns (from the Log Analyser export) in our Excel template are Parameter, Crawl Issue, Canonical Mismatch, Indexability and Organic Traffic (see screenshot a few paragraphs below).
To isolate query strings (making it easier to scan and filter them), the following formula is in the Parameter column (assuming the first URL is in cell A3):
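Conceptually, the formula just isolates everything after the ?. As an illustration (not necessarily our exact formula), something along the lines of `=IFERROR(MID(A3,FIND("?",A3)+1,LEN(A3)),"")` would do it in Excel, and the equivalent in Python is a one-liner:

```python
from urllib.parse import urlparse

def parameter(url):
    """Return the query string (everything after '?'), or '' if there is none –
    the same value the Parameter column isolates."""
    return urlparse(url).query

print(parameter("/beds/double-beds/?sort=price&page=2"))  # sort=price&page=2
print(parameter("/beds/double-beds/"))                    # (empty string)
```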
Once the data has populated, select it and Copy > Paste Values (to speed up data filtering).
When the crawl (started earlier) finishes, it’s exported and dragged into the Imported URL Data tab on the Log Analyser interface. The crawl data automatically matches to the URLs found in the log file (Log Analyser > URLs tab > View: Matched with URL Data). This data is exported and pasted into a separate Excel tab, using a VLOOKUP to populate the Canonical Mismatch, Indexability and Organic Traffic columns at the end:
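For anyone scripting this step instead of using a VLOOKUP, the join is just a keyed lookup from the crawl export into the log data. A minimal Python sketch with made-up rows:

```python
# Hypothetical log-file URLs with bot event counts.
log_rows = [
    ("/beds/", 120),
    ("/account/order/1", 45),
    ("/contact/", 30),
]

# Hypothetical crawl export keyed by URL: (indexability, organic sessions).
crawl_data = {
    "/beds/": ("Indexable", 900),
    "/contact/": ("Indexable", 150),
}

# The dictionary lookup plays the role of the VLOOKUP: URLs the crawler
# never found (e.g. orphaned order pages) fall back to "Not in crawl".
merged = [
    (url, events, *crawl_data.get(url, ("Not in crawl", 0)))
    for url, events in log_rows
]
for row in merged:
    print(row)
```

The fall-back value is the interesting part: logged URLs with no match in the crawl data are exactly the orphaned pages a log file analysis is so good at surfacing.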
What follows isn’t an exact science; we’re looking for a good estimate of crawl budget waste. We’re only taking the number of URLs and events into account (we could, for instance, also estimate bandwidth wastage from the Average Bytes column). File size is taken into account later: if URLs with a high file size are hit by search bots often, we may recommend optimising them (e.g. they could be oversized images or PDFs).
In the Crawl Issue column of our spreadsheet, the following URLs are marked with a “y”:
- Highlighted as Non-Indexable in the Indexability column.
- Not 200 OK in the Last Response column (e.g. 301, 404).
- Contain a URL parameter (excluding correctly paginated URLs and parameters that serve content we do want indexed in search engines; the latter is rare on modern websites).
- Cart, account, wishlist, conversion, thank-you pages etc. (excluding the main login URL, as it’s often clicked through to from search engines).
- Should be nofollowed but are crawled.
- Internal search result pages.
- Duplicate URLs (e.g. lowercase and uppercase versions).
- Test or template pages.
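To give a feel for how mechanical most of this marking is, here’s a hedged Python sketch of a few of the rules above as simple predicates (the URL patterns are illustrative, not our actual rule set):

```python
import re

def crawl_issue(url, status, indexability):
    """Return 'y' if the URL should be flagged in the Crawl Issue column."""
    if indexability == "Non-Indexable":
        return "y"
    if status != 200:                                   # e.g. 301, 404
        return "y"
    if "?" in url and not re.search(r"[?&]page=\d+", url):  # allow pagination
        return "y"
    if re.search(r"/(cart|account|wishlist|thank-you|search)/", url):
        return "y"
    if url != url.lower():                              # case duplicate
        return "y"
    return ""

print(crawl_issue("/beds/double-beds/", 200, "Indexable"))    # '' – fine
print(crawl_issue("/account/order/1", 200, "Non-Indexable"))  # 'y'
print(crawl_issue("/Beds/Double-Beds/", 200, "Indexable"))    # 'y'
```

Rules like “should be nofollowed but are crawled” or “test/template pages” need crawl data or human judgement, which is why the marking is only mostly mechanical.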
As I’m doing the above, I make a note of issues to highlight or investigate later. They’re URLs we’d prefer to be crawled less but mostly not at all, or to be fixed (e.g. 404s that should return 200 OK).
It helps to filter out URLs already marked with an issue, then order by the URL and Parameter columns to spot patterns. It’s not possible to list every issue, and they won’t all fall into the list above. For example, when scanning through salience.co.uk log file data, I spotted that most URLs were duplicated with a /feed/ suffix. These URLs weren’t linked anywhere internally, yet accounted for 32% of the URLs causing crawl budget waste.
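One quick way to surface a pattern like that /feed/ one is to count how often each trailing path segment appears among the flagged URLs. A small Python sketch with made-up URLs:

```python
from collections import Counter

# Made-up flagged URLs; in practice this would be the Crawl Issue rows.
urls = [
    "/blog/post-a/", "/blog/post-a/feed/",
    "/blog/post-b/", "/blog/post-b/feed/",
    "/services/",
]

def last_segment(url):
    """Trailing path segment, e.g. '/blog/post-a/feed/' -> 'feed'."""
    return url.rstrip("/").rsplit("/", 1)[-1]

counts = Counter(last_segment(u) for u in urls)
print(counts.most_common(3))  # 'feed' tops the list
```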
On an Overview tab of our log file analysis Excel template, formulas populate pie charts summarising crawl budget waste:
The visuals always help get our points across to busy clients, and are much easier to digest than rows of data.
When prioritising issues, it can help to order them by the number of events.
I check that search bots are crawling the most important pages (usually top-level categories on e-commerce websites) most often. Even on near-perfect websites, this rarely lines up with expectations. Sometimes one page may be crawled far more than another (e.g. due to high-authority backlinks or social traction). It may be a concern if none of the most important pages are among the most frequently crawled.
The organic traffic data we collected with Screaming Frog SEO Spider can be taken into account when prioritising. For example, big gains often come from improving pages/content that receive lots of traffic. Is a page that receives lots of organic traffic not crawled often? Maybe it’s hidden deep in the website’s structure, and could receive more traffic if it were more prominent. Related product/post links can help with this.
Sometimes it’s worth breaking down findings by date such as:
- Did 404 Not Found errors increase suddenly?
- Are regularly updated pages crawled often? E.g. Category pages that frequently include new products – is Googlebot finding them quickly?
Example Findings from Salience.co.uk Log Files
Our recent domain migration from insideonline.co.uk to salience.co.uk was an opportunity to show some of the findings a log file analysis can uncover.
Most Crawled URLs
Overall, our most crawled URLs were as expected: our homepage, contact page, Insight page, and various blog posts and sector reports.
The URLs with a /feed/ folder were, according to Dave, one of our developers: “a default WordPress function to generate XML feed data on all pages/posts.” They’ve been disabled with a custom function.
In the next log file analysis, I’ll check if the number of hits to these URLs has reduced.
Interestingly, Bingbot hit some of these URLs multiple times daily, whereas Googlebot didn’t crawl some at all during the log period. I generally find Googlebot to be cleverer when figuring out whether URLs are worth crawling or not. You can see in the pie charts screenshot above that Googlebot crawl budget waste is much less than crawl budget waste from combined search bots.
Two form completion pages were crawled by bots. These aren’t linked internally or indexed in search results, and we don’t want them indexed, so we disallowed them in robots.txt.
Many WordPress /tag/ URLs that existed at insideonline.co.uk were crawled on the salience.co.uk domain by Bingbot (mostly) and Googlebot.
We deleted the tags within the WordPress dashboard so they all return 404 Not Found. Over time, search bots should crawl these URLs less (410 Gone can remove URLs from Google’s crawl schedule more quickly).
We don’t have an onsite search, yet Bingbot and Googlebot crawled /search/ URLs. We made sure these return 404 Not Found.
utm_ parameters are commonly used for tracking purposes, and should usually be disallowed in the console/webmaster tools of the relevant search engines. If such utm_ URLs are linked internally on websites, we advise clients to update them to the canonical link without the utm_ parameter. The salience.co.uk utm_ URLs are barely crawled by search bots, but to keep things clean, we disallowed them in Google Search Console:
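Where clients do need to update internal links, stripping the tracking parameters programmatically is straightforward. A hedged Python sketch using only the standard library:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_utm(url):
    """Rebuild a URL without its utm_ tracking parameters,
    leaving any other query parameters intact."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_utm("https://www.salience.co.uk/blog/?utm_source=news&utm_medium=email"))
# https://www.salience.co.uk/blog/
```

When the query string ends up empty, `urlunsplit` drops the trailing “?”, so the output is the clean canonical link.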
Final Thoughts & Useful Log File Analysis Links
The more log file analyses you do, the more patterns you’ll notice and the quicker you’ll spot issues.
It can be interesting to compare search bots. Bingbot, for example, seems much more inefficient than Googlebot. Bingbot may regularly crawl URLs that haven’t returned a 200 OK server status for years, whereas Googlebot figures out the URLs aren’t important and crawls them less frequently or not at all.
- The Ultimate Guide to Log File Analysis was my introduction to log file analysis a few years ago. It remains one of the best posts on the topic, and it’s kept up-to-date.
- 22 Ways To Analyse Logs Using Screaming Frog Log File Analyser
- How to: Read a web site log file
- A Complete Guide to Log Analysis with BigQuery – even if you don’t use BigQuery, there’s some excellent information here, including obtaining log files with the correct data.
If you have any more questions about Log File Analysis, get in touch or comment below.