What follows is a basic log file analysis that assumes some knowledge of SEO, Microsoft Excel formulas and Screaming Frog SEO Spider (or other crawling software).

I’ve also provided example findings from a recent log file analysis of salience.co.uk.


Why is Log File Analysis Important?

Guiding search bots (such as Googlebot and Bingbot) around a website is an important aspect of technical SEO. You want to send crawlers to the pages you want to rank in search results, while keeping them away from unimportant pages.

This can be something as simple as a well-structured main menu with a clear hierarchy that leads to products, e.g. Beds > Double Beds > Jameson Natural Pine.

You don’t want search bots crawling account pages such as individual customer orders (I’ve seen this, though none had personally identifiable information).

Server log files show every hit to a website. We filter this data to show search bot hits only, then analyse it to understand search engine crawl behaviour. Which pages do search engines see as important, for example? It’s often not the top-level category pages you’d expect.

Every website has a crawl budget. If it’s wasted on pages that shouldn’t be crawled, the website may not be crawled or indexed as well as it should be. Generally, the more pages a website has, the greater the benefits a log file analysis can provide, though we’ve seen indexation improvements on websites with as few as 150 pages indexed in Google (e.g. new pages crawled and indexed more quickly after reducing crawl budget waste). Like removing thin pages during a content review, a log file analysis can find low-quality pages that search engines crawl but don’t deem worthy of decent rankings.

The order pages I mentioned numbered in the thousands, and were responsible for a significant chunk of crawl budget. According to crawls we ran on the website, the pages weren’t linked internally. They weren’t indexed in search engines either. Without the log files, they wouldn’t have been found.

Distilled has an excellent section on the benefits of log file analysis: Why should you do log analysis?

One excellent point is “By seeing where Google spends its time on your site you can prioritise the areas that will affect it the most.”

Software Used for Our Log File Analyses

  • Screaming Frog SEO Spider
  • Screaming Frog Log Analyser
  • Microsoft Excel (may not be suitable to analyse large amounts of data – mine starts struggling at around a million rows)

Log File Analysis Process

Obtaining Log Files

Obtaining the correct log files from web developers or clients isn’t always straightforward. Screaming Frog has a template e-mail to help:
www.screamingfrog.co.uk/requesting-logs-for-the-log-file-analyser/.

Some websites’ log files are huge, so we recommend zipping files before they’re sent and/or filtering them to the search bots we analyse.

Website Crawl

I start a log file analysis by crawling the website with Screaming Frog SEO Spider, following all internal links (including nofollow, XML sitemaps and hreflang if relevant) and connecting to the Google Analytics API for recent organic traffic data (at least a month, covering the same date range as the log file data).

Analysing Log File Data with Log Analyser and Excel

I import the log files into Screaming Frog Log Analyser and verify genuine bot hits (Project > Verify Bots if the log files are already imported), then filter to show Verified bots only:

Screenshot from Screaming Frog Log Analyser: Verifying bots

The verified bot data is exported and pasted into an Excel template. If you don’t have suitable spreadsheet software, check the “Useful Links” section below for further information on how powerful Screaming Frog Log Analyser is; you can find plenty of actionable information with Log Analyser alone.

It’s worth comparing the graph on Log Analyser’s Overview tab (Events panel filtered to Verified, All Googlebots) to the Crawl Stats graph in Google Search Console. They rarely match exactly, but I’ve never seen them wildly differ. If they do, maybe some log file data is missing. Here’s one example of a graph matching enough to be confident the data is correct (Google Search Console on the left and Log Analyser on the right):

In our Excel template, the columns Parameter, Crawl Issue, Canonical Mismatch, Indexability and Organic Traffic are added to the default columns from the Log Analyser export (see the screenshot a few paragraphs below).

To isolate query strings (making it easier to scan and filter them), the following formula is in the Parameter column (assuming the first URL is in cell A3):

=IFERROR(RIGHT(A3,LEN(A3)-FIND("?",A3)),"")
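As a quick sanity check on the formula’s output: with a hypothetical URL such as https://www.example.com/beds?colour=white&page=2 in A3, it returns colour=white&page=2, while URLs without a query string are left blank.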

Once the data has populated, select it and Copy > Paste Values (to speed up data filtering).

When the crawl (started earlier) finishes, it’s exported and dragged into the Imported URL Data tab in the Log Analyser interface. The crawl data is automatically matched to the URLs found in the log files (Log Analyser > URLs tab > View: Matched with URL Data). This data is exported and pasted into a separate Excel tab, and a VLOOKUP populates the Canonical Mismatch (from the Canonical Link Element 1 column), Indexability and Organic Traffic columns:

Log File Analysis Excel template
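The exact lookups depend on your column layout, but as a rough sketch: assuming the Matched with URL Data export is pasted into a tab named Matched, with URLs in column A and Indexability in column H of that tab, the Indexability column on the main tab could be filled with:

=IFERROR(VLOOKUP($A3,Matched!$A:$H,8,FALSE),"")

The IFERROR leaves the cell blank for URLs that appear in the log files but weren’t found in the crawl – often an interesting list in itself. The Canonical Mismatch and Organic Traffic columns are populated the same way, pointing the lookup at the Canonical Link Element 1 and organic traffic columns of the matched export (with the canonical value compared against the URL to flag mismatches).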

What follows isn’t an exact science; we’re looking for a good estimate of crawl budget waste, taking only the number of URLs and events into account. We could also estimate bandwidth wastage from the Average Bytes column, but file size is taken into account later: if URLs with a high file size are hit by search bots often, we may recommend optimising them (e.g. they could be oversized images or PDFs).
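If you did want to put a rough figure on bandwidth waste, one formula is enough. As a sketch with a hypothetical column layout (Crawl Issue flag in column C, number of events in column H, Average Bytes in column I, data in rows 3 to 10000 – adjust to your own template), the estimated bytes wasted on flagged URLs would be:

=SUMPRODUCT((C3:C10000="y")*H3:H10000*I3:I10000)

This only becomes meaningful once the Crawl Issue column (covered next) has been filled in.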

In the Crawl Issue column of our spreadsheet, the following URLs are marked with a “y” (a starting formula for this is sketched after the list):

  • Highlighted as Non-Indexable in the Indexability column.
  • Not 200 OK in the Last Response column (e.g. 301, 404).
  • Contains a URL parameter (excluding correctly paginated URLs and parameters that are used for content we do want indexed in search engines – the latter is a rarity on modern websites).
  • Cart, account, wishlist, conversion, thank-you pages etc. (excluding the main login URL, as it’s often clicked through to from search results).
  • Should be nofollowed but are crawled.
  • Internal search result pages.
  • Duplicate URLs (e.g. lowercase and uppercase versions).
  • Test or template pages.
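The first few checks can be semi-automated. As a rough sketch (assuming, hypothetically, Parameter in column B, Indexability in column E and Last Response in column G – adjust to your own layout), a starting point for the Crawl Issue column might be:

=IF(OR(B3<>"",G3<>200,E3="Non-Indexable"),"y","")

Anything it flags still needs a sanity check (correctly paginated parameter URLs, for example, shouldn’t be marked), and categories such as cart, account, internal search and test pages usually need a manual scan of the URLs themselves.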

As I’m doing the above, I make a note of issues to highlight or investigate later. They’re URLs we’d prefer to be crawled less (in most cases, not at all), or to be fixed (e.g. 404s that should return 200 OK).

It helps to filter out URLs already marked with an issue, then order by the URL and Parameter columns to spot patterns. It’s not possible to list every issue, and they won’t all fall into the above list. For example, when scanning through the salience.co.uk log file data, I spotted that most URLs were duplicated with a /feed/ suffix. These URLs weren’t linked anywhere internally, yet accounted for 32% of the URLs causing crawl budget waste.
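Wildcard formulas make it quick to size up a pattern like this once it’s spotted. For example (again assuming URLs in column A and the number of events in column H), the following count the /feed/ URLs and the bot events they attracted:

=COUNTIF(A:A,"*/feed/")

=SUMIF(A:A,"*/feed/",H:H)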

On an Overview tab of our log file analysis Excel template, formulas populate pie charts summarising crawl budget waste:

Pie charts displaying search bot log file data

The visuals always help get our points across to busy clients and are much easier to digest than rows of data.
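The formulas behind those charts don’t need to be complicated. Sticking with the hypothetical layout above (Crawl Issue in column C, number of events in column H), the two slices of a basic waste chart could simply be:

=SUMIF(C:C,"y",H:H)

=SUMIF(C:C,"<>y",H:H)

The first sums the events hitting flagged URLs and the second everything else; a pie chart pointed at those two cells shows the proportion of crawl budget being wasted.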


Prioritising Issues

When prioritising issues, it can help to order them by the number of events. For example, if a 404 Not Found URL is hit by Googlebot 1,000 times a month, it’s a bigger waste of crawl budget than a 404 that’s barely crawled.

I check that search bots are crawling the most important pages (usually top-level categories on e-commerce websites) most often. Even on near-perfect websites, this rarely lines up with expectations. Sometimes one page may be crawled far more than another (e.g. due to high-authority backlinks or social traction). It may be a concern if none of the most important pages are among the most crawled, in which case further investigation is required. Ask yourself: are internal links hidden behind JavaScript? Is the website’s hierarchy unclear and difficult for search bots to follow?

The organic traffic data collected during the Screaming Frog SEO Spider crawl can be taken into account when prioritising. For example, big gains often come from improving pages that already receive lots of traffic. Is a page that receives lots of organic traffic not crawled often? Maybe it’s hidden deep in the website’s structure and could receive more traffic if it were more prominent. Related product/post links can help with this.

In Log Analyser, the Directories tab shows the most crawled directories. They’re often folders that contain JavaScript, CSS and images, which makes sense as bots need to crawl those files to render pages. After that, the most crawled directories should generally contain the pages that are important to rank.

Sometimes it’s worth breaking down findings by date such as:

  • Did 404 Not Found errors increase suddenly?
  • Are regularly updated pages crawled often? E.g. Category pages that frequently include new products – is Googlebot finding them quickly?

Example Findings from Salience.co.uk Log Files

Our recent domain migration from insideonline.co.uk to salience.co.uk was an opportunity to show some of the useful data provided by a log file analysis. None of the following examples was found in a normal crawl of the website (such as by using Screaming Frog SEO Spider or DeepCrawl). They’re all common issues.

Most Crawled URLs

Overall, our most crawled URLs were as expected; our homepage, contact page, Insight page, and various blog posts and sector reports.

Exceptions were old CSS and JavaScript files, which continue to be crawled by Googlebot despite consistently returning a 404 Not Found status. I’ll check the status of these in a future log file analysis.

/feed/ URLs

The URLs with a /feed/ folder were, according to Dave, one of our developers: “a default WordPress function to generate XML feed data on all pages/posts.” They’ve been disabled with a custom function.

In the next log file analysis, I’ll check if the number of hits to these URLs has reduced.

Interestingly, Bingbot hit some of these URLs multiple times a day, whereas Googlebot didn’t crawl some of them at all during the log period. I generally find Googlebot to be cleverer at figuring out whether URLs are worth crawling. You can see in the pie chart screenshot above that Googlebot’s crawl budget waste is much lower than the waste from all search bots combined.

However, Googlebot is much better at finding URLs I’d prefer it didn’t find (such as URL fragments in JavaScript).

Conversion Pages

Two form completion pages were crawled by bots. These aren’t linked internally or indexed in search results, and we don’t want them indexed.

We disallowed them in robots.txt.

Tag URLs

Many WordPress /tag/ URLs that existed at insideonline.co.uk were crawled on the salience.co.uk domain by Bingbot (mostly) and Googlebot.

We deleted the tags within the WordPress dashboard so they all return 404 Not Found. Over time, search bots should crawl these URLs less (410 Gone can remove URLs from Google’s crawl schedule quicker).

Search Pages

We don’t have an onsite search, yet Bingbot and Googlebot crawled /search/ URLs. We made sure these return 404 Not Found.

URL Parameters

utm_ parameters are commonly used for tracking purposes, and should usually be disallowed in the console/webmaster tools of the relevant search engines. If such utm_ URLs are linked internally on a website, we advise the client to update them to the canonical link without the utm_ parameter. The salience.co.uk utm_ URLs are barely crawled by search bots, but to keep things clean, we disallowed them in Google Search Console.

Some parameter use that leads to crawl budget waste is inevitable. For example, a version number on a JavaScript or CSS file (such as ?ver=1.2) could indicate fingerprinting is used for cache busting. Old versions will still be crawled for a short time, even if they 301 redirect or return 410 Gone.

Final Thoughts & Useful Log File Analysis Links

The more log file analyses you do, the more patterns you’ll notice and the quicker you’ll spot issues.

It can be interesting to compare search bots. Bingbot, for example, seems much more inefficient than Googlebot. Bingbot may regularly crawl URLs that haven’t returned a 200 OK server status for years, whereas Googlebot figures out the URLs aren’t important and crawls them less frequently or not at all.

However, Googlebot is better at finding URLs you don’t want it to find, which can be a problem. URLs hidden in JavaScript, for example.

If you have any more questions about Log File Analysis, get in touch or comment below.

Alex Harford

Alex is a Technical SEO Manager with ten years' experience. He worked as a web developer and designer for over six years and became interested in SEO when a website he recoded doubled organic traffic overnight (still his greatest achievement). He maintains a personal website, meaning he has hands-on experience of many recommendations he makes to clients. He enjoys lots of other things including the outdoors, travel, live music and various creative escapades like writing and photography. He is famous for inventing wallpaper during Imperial China's Qin dynasty and doesn't like bacon as much as some people think.
