What follows is a basic log file analysis that assumes some knowledge of SEO, Microsoft Excel formulas and Screaming Frog SEO Spider (or other crawling software).
I’ve also provided example findings from a recent log file analysis of salience.co.uk.
- Why is Log File Analysis Important?
- Software Used for Our Log File Analyses
- Log File Analysis Process
- Example Findings from salience.co.uk Log Files
- Final Thoughts & Useful Links
Why is Log File Analysis Important?
Guiding search bots (such as Googlebot and Bingbot) around a website is an important aspect of technical SEO. You want to send crawlers to the pages you want to rank in search engines while keeping them away from unimportant pages and especially pages that shouldn’t be indexed (such as test and template pages).
This can be as simple as a well-structured main menu with a clear hierarchy that leads to products, e.g. Beds > Double Beds > Jameson Natural Pine.
You don’t want search bots crawling account pages such as individual customer orders (I’ve seen this, though none had personally identifiable information).
Server log files show every hit to a website. We filter this data to show search bots only, and analyse the data to understand search engine crawl behaviour. Which pages do search engines see as important, for example? It’s often not the top-level category pages you’d expect.
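To make the filtering step concrete, here is a minimal Python sketch that keeps only hits whose user agent claims to be a search bot. It assumes the server writes the common "combined" log format; the sample lines, IP addresses and URLs are invented, and a user agent match alone does not prove a hit is genuine.

```python
import re

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings used to spot the bots of interest (an assumption; extend as needed).
BOT_TOKENS = ("googlebot", "bingbot")

def parse_bot_hits(lines):
    """Return one dict per log line whose user agent claims to be a search bot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        hit = match.groupdict()
        if any(token in hit["agent"].lower() for token in BOT_TOKENS):
            hits.append(hit)
    return hits

# Invented sample lines: one Googlebot hit and one ordinary browser hit.
sample = [
    '66.249.66.1 - - [10/Oct/2019:13:55:36 +0000] "GET /beds/double-beds/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Oct/2019:13:56:01 +0000] "GET /contact/ HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
bot_hits = parse_bot_hits(sample)
print([(hit["url"], hit["status"]) for hit in bot_hits])
```

Only the first sample line survives the filter. User agents are easily spoofed, so this is a first pass rather than proof that a hit came from a real search bot.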
Every website has a crawl budget. If this is wasted on pages that shouldn’t be crawled, it can lead to the website not being crawled or indexed as well as it should be. Generally, the more pages a website has, the greater the benefits a log file analysis can provide, though we have seen indexation improvements on websites with as few as 150 pages indexed in Google (e.g. new pages crawled and indexed more quickly after reducing crawl budget waste). Like removing thin pages during a content review, a log file analysis can find low-quality pages that search engines crawl but don’t deem worthy of decent rankings.
The order pages I mentioned numbered in the thousands, and were responsible for a significant chunk of crawl budget. According to crawls we ran on the website, the pages weren’t linked internally. They weren’t indexed in search engines either. Without the log files, they wouldn’t have been found.
Distilled has an excellent section on the benefits of log file analysis: Why should you do log analysis?
One excellent point is “By seeing where Google spends its time on your site you can prioritise the areas that will affect it the most.”
Software Used for Our Log File Analyses
- Screaming Frog SEO Spider
- Screaming Frog Log Analyser
- Microsoft Excel (may not be suitable to analyse large amounts of data – mine starts struggling at around a million rows)
Log File Analysis Process
Obtaining Log Files
Obtaining the correct log files from web developers or clients isn’t always straightforward. Screaming Frog has a template e-mail to help:
Some websites’ log files are huge, so we recommend zipping files before they’re sent and/or filtering them to the search bots we analyse.
I start a log file analysis by crawling the website with Screaming Frog SEO Spider, following all internal links (including nofollow, XML sitemaps and hreflang if relevant) and linking with the Google Analytics API for recent organic traffic data (at least a month, over the same date range as the log file data).
Analysing Log File Data with Log Analyser and Excel
I import the log files into Screaming Frog Log Analyser and verify genuine bot hits (Project > Verify Bots if the log files are already imported), then filter to show Verified bots only:
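Log Analyser handles verification for you, and I can't speak to how it does so internally. For reference, the general method Google documents for confirming Googlebot is a reverse DNS lookup on the IP, a check that the hostname ends in googlebot.com or google.com, then a forward lookup to confirm the hostname resolves back to the same IP. A rough Python sketch of that idea follows; the function name and the injectable lookup parameters are my own, added so the logic can be tried without network access, and the default forward lookup handles IPv4 only.

```python
import socket

# Hostname suffixes Google documents for Googlebot reverse DNS results.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the hostname, then confirm it resolves back."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward_lookup(host)
    except OSError:
        return False  # no DNS record, or the lookup failed

# Stubbed lookups with invented values, so the example runs offline.
genuine = is_verified_googlebot(
    "66.249.66.1",
    reverse_lookup=lambda addr: "crawl-66-249-66-1.googlebot.com",
    forward_lookup=lambda host: ["66.249.66.1"],
)
spoofed = is_verified_googlebot(
    "203.0.113.9",
    reverse_lookup=lambda addr: "host.example.net",
    forward_lookup=lambda host: ["203.0.113.9"],
)
print(genuine, spoofed)
```

Calling the function without the two lookup arguments performs real DNS queries, which is slow over millions of rows, so in practice you would check each distinct IP once and reuse the result.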
All verified bots are pasted into an Excel template. If you don’t use suitable spreadsheet software, check the “Useful Links” section below for further information on how powerful Screaming Frog Log Analyser is. You can find plenty of actionable information with Log Analyser alone.
It’s worth comparing the graph on Log Analyser’s Overview tab (Events panel filtered to Verified, All Googlebots) to the Crawl Stats graph in Google Search Console. They rarely match exactly, but I’ve never seen them wildly differ. If they do, maybe some log file data is missing. Here’s one example of a graph matching enough to be confident the data is correct (Google Search Console on the left and Log Analyser on the right):
Added to the default columns (from the Log Analyser export) in our Excel template are Parameter, Crawl Issue, Canonical Mismatch, Indexability and Organic Traffic (see screenshot a few paragraphs below).
To isolate query strings (making it easier to scan and filter them), the following formula is in the Parameter column (assuming the first URL is in cell A3):
Once the data has populated, select it and Copy > Paste Values (to speed up data filtering).
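If you work outside Excel, the same isolation of query strings can be sketched in Python. This is not the Excel formula from our template, just an equivalent idea; the function name and example URLs are made up for illustration.

```python
from urllib.parse import urlsplit

def parameter(url):
    """Return the query string of a URL including the leading '?', or '' if none."""
    query = urlsplit(url).query
    return "?" + query if query else ""

# Invented example URLs: one with parameters, one without.
urls = [
    "https://www.example.com/beds/?colour=pine&page=2",
    "https://www.example.com/beds/double-beds/",
]
for url in urls:
    print(url, "->", repr(parameter(url)))
```

URLs without a query string get an empty value, which makes it easy to filter the column to parameterised URLs only.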
When the crawl (started earlier) finishes, it’s exported and dragged into the Imported URL Data tab on the Log Analyser interface. The crawl data automatically matches to the URLs found in the log file (Log Analyser > URLs tab > View: Matched with URL Data). This data is exported and pasted into a separate Excel tab, using a VLOOKUP to populate the Canonical Mismatch (from the Canonical Link Element 1 column), Indexability and Organic Traffic columns:
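For anyone doing this step outside Excel, a VLOOKUP is essentially a keyed lookup, which a Python dictionary handles well. The sketch below assumes the crawl export uses an "Address" column for the URL and treats a canonical mismatch as "a canonical is set and differs from the URL"; both are assumptions on my part, and all values are invented.

```python
# Rows as they might come from the exported crawl. "Address" as the URL column
# name and all values are assumptions made for illustration.
crawl_rows = [
    {"Address": "https://www.example.com/beds/",
     "Canonical Link Element 1": "https://www.example.com/beds/",
     "Indexability": "Indexable", "Organic Traffic": 120},
    {"Address": "https://www.example.com/beds/?sort=price",
     "Canonical Link Element 1": "https://www.example.com/beds/",
     "Indexability": "Non-Indexable", "Organic Traffic": 0},
]
log_urls = [
    "https://www.example.com/beds/",
    "https://www.example.com/beds/?sort=price",
    "https://www.example.com/account/orders/123/",  # in the logs, not in the crawl
]

# Index the crawl by URL once, which is the job VLOOKUP does per row in Excel.
crawl_by_url = {row["Address"]: row for row in crawl_rows}

def lookup(url):
    """Return the three looked-up columns for a log file URL (blank if unmatched)."""
    row = crawl_by_url.get(url)
    if row is None:
        return {"Canonical Mismatch": "", "Indexability": "", "Organic Traffic": ""}
    canonical = row["Canonical Link Element 1"]
    return {
        # Assumption: a mismatch means a canonical is set and differs from the URL.
        "Canonical Mismatch": "y" if canonical and canonical != url else "",
        "Indexability": row["Indexability"],
        "Organic Traffic": row["Organic Traffic"],
    }

enriched = {url: lookup(url) for url in log_urls}
for url, columns in enriched.items():
    print(url, columns)
```

URLs that appear in the logs but not in the crawl come back blank, which is itself worth noting: they are often the unlinked pages a crawl alone would never find.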
What follows isn’t an exact science; we’re looking for a good estimate of crawl budget waste. At this stage we only take the number of URLs and events into account. We could, for example, also estimate bandwidth wastage from the Average Bytes column, but file size is taken into account later: if URLs with a high file size are hit by search bots often, we may recommend optimising them (e.g. they could be oversized images or PDFs).
In the Crawl Issue column of our spreadsheet, the following URLs are marked with a “y”:
- Highlighted as Non-Indexable in the Indexability column.
- Not 200 OK in the Last Response column (e.g. 301, 404).
- Contains a URL parameter (excluding correctly paginated URLs and parameters that are used for content we do want indexed in search engines; the latter is a rarity on modern websites).
- Cart, account, wishlist, conversion, thank you etc. (excluding the main login URL, as it’s often clicked through to from search engines).
- Should be nofollowed but are crawled.
- Internal search result pages.
- Duplicate URLs (e.g. lowercase and uppercase versions).
- Test or template pages.
As I’m doing the above, I make a note of issues to highlight or investigate later. These are URLs we’d prefer to be crawled less (or, in most cases, not at all), or URLs that need fixing (e.g. 404s that should return 200 OK).
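Several of the Crawl Issue rules listed above are mechanical enough to sketch in code. The Python below is a simplified illustration, not our Excel template: the row keys, path prefixes and login path are hypothetical, pagination exceptions are not handled, and rules that need human judgement (test or template pages, links that should be nofollowed) are left out.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical path prefixes for cart, account, wishlist and internal search pages.
PRIVATE_PREFIXES = ("/cart", "/account", "/wishlist", "/search")
LOGIN_PATH = "/account/login/"  # hypothetical main login URL, deliberately not flagged

def flag_crawl_issues(rows):
    """Add a 'crawl_issue' key ('y' or '') to each row dict and return the rows."""
    lowercase_counts = Counter(row["url"].lower() for row in rows)
    for row in rows:
        url = row["url"]
        parts = urlsplit(url)
        path = parts.path.lower()
        issue = (
            row["indexability"] == "Non-Indexable"
            or row["last_response"] != 200
            or bool(parts.query)  # pagination exceptions are not handled here
            or (path != LOGIN_PATH and path.startswith(PRIVATE_PREFIXES))
            # Assumption: the all-lowercase version is the preferred duplicate.
            or (url != url.lower() and lowercase_counts[url.lower()] > 1)
        )
        row["crawl_issue"] = "y" if issue else ""
    return rows

# Invented rows covering a clean URL and several rule matches.
rows = flag_crawl_issues([
    {"url": "https://www.example.com/beds/", "indexability": "Indexable", "last_response": 200},
    {"url": "https://www.example.com/Beds/", "indexability": "Indexable", "last_response": 200},
    {"url": "https://www.example.com/beds/?sort=price", "indexability": "Indexable", "last_response": 200},
    {"url": "https://www.example.com/account/login/", "indexability": "Indexable", "last_response": 200},
    {"url": "https://www.example.com/account/orders/123/", "indexability": "Indexable", "last_response": 200},
    {"url": "https://www.example.com/old-page/", "indexability": "Indexable", "last_response": 404},
])
print([row["crawl_issue"] for row in rows])
```

A pass like this is a starting point for the manual review, since the patterns that matter most are usually site-specific.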
It helps to filter out URLs already marked with an issue, then order by the URL and Parameter columns to spot patterns. It’s not possible to list every issue here, and not all of them will fall into the above list. For example, when scanning through the salience.co.uk log file data, I spotted that most URLs were duplicated with a /feed/ folder appended. These URLs weren’t linked anywhere internally, yet accounted for 32% of the URLs causing crawl budget waste.
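Sorting and scanning works well in a spreadsheet, but a quick count can also surface repeated suffixes. Here is a small Python sketch that tallies the last folder of each URL; the function name and URLs are invented, and this is only one way of spotting a pattern such as a trailing /feed/ folder.

```python
from collections import Counter
from urllib.parse import urlsplit

def trailing_folder_counts(urls):
    """Count the last path segment of each URL to surface repeated suffixes."""
    counts = Counter()
    for url in urls:
        segments = [segment for segment in urlsplit(url).path.split("/") if segment]
        if segments:
            counts[segments[-1]] += 1
    return counts

# Invented URLs: three of the five end in a /feed/ folder.
urls = [
    "https://www.example.com/blog/post-a/",
    "https://www.example.com/blog/post-a/feed/",
    "https://www.example.com/blog/post-b/",
    "https://www.example.com/blog/post-b/feed/",
    "https://www.example.com/services/seo/feed/",
]
print(trailing_folder_counts(urls).most_common(1))
```

Any segment that appears far more often than the rest is worth a closer look.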
On an Overview tab of our log file analysis Excel template, formulas populate pie charts summarising crawl budget waste:
The visuals always help get our points across to busy clients and are much easier to digest than rows of data.
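The figures behind charts like these are simple proportions. As a rough Python sketch of the arithmetic (the row keys and numbers are invented, and the formulas in our Excel template may be organised differently), crawl budget waste can be expressed as the share of URLs and the share of events marked with a crawl issue:

```python
def waste_summary(rows):
    """Return the percentage of URLs and of events marked as a crawl issue."""
    total_urls = len(rows)
    total_events = sum(row["events"] for row in rows)
    wasted = [row for row in rows if row["crawl_issue"] == "y"]
    wasted_events = sum(row["events"] for row in wasted)
    return {
        "url_waste_pct": round(100 * len(wasted) / total_urls, 1) if total_urls else 0.0,
        "event_waste_pct": round(100 * wasted_events / total_events, 1) if total_events else 0.0,
    }

# Invented rows: two of four URLs are flagged, carrying 35 of 100 events.
rows = [
    {"url": "/beds/", "events": 40, "crawl_issue": ""},
    {"url": "/beds/?sort=price", "events": 25, "crawl_issue": "y"},
    {"url": "/account/orders/123/", "events": 10, "crawl_issue": "y"},
    {"url": "/contact/", "events": 25, "crawl_issue": ""},
]
summary = waste_summary(rows)
print(summary)
```

Reporting both figures is useful because they can diverge: a handful of wasteful URLs may attract a large share of bot hits, or the reverse.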