Discover How Google Crawling Evolved in 2026

[Image: Google logo superimposed on a colorful spider web background.]

I’ve always been fascinated by how Google keeps improving its search capabilities. Recently, Gary Illyes from Google shared more about Googlebot’s operations, diving into its crawling ecosystem, fetching processes, and how it handles data.

If you’re curious, the article is aptly titled Inside Googlebot: Demystifying Crawling, Fetching, and the Bytes We Process.

Googlebot Reimagined. It’s intriguing to learn that Google uses multiple crawlers for different purposes, so referring to Googlebot as a single entity no longer captures the full picture. Google’s documentation lists the various crawlers and user agents.

Understanding Limits. During a recent discussion, Google elaborated on its crawling limits. Gary Illyes provided these insights:

  • Googlebot fetches up to 2MB for any individual URL, except for PDFs.
  • This means it crawls only the first 2MB of a resource, and the HTTP header counts toward that total.
  • For PDF files, the limit is notably higher at 64MB.
  • Image and video crawlers have varied threshold values, contingent on the product they serve.
  • By default, other crawlers have a 15MB limit, regardless of content type.
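
The limits above can be summarized as a small lookup. This is only a sketch, not Google's implementation: the function name `fetch_limit_bytes`, the crawler labels, and the choice of 1MB = 1,000,000 bytes are assumptions made for illustration (the post does not say whether MB means 10^6 or 2^20 bytes). Image and video crawlers are left out because their thresholds vary by product.

```python
MB = 1_000_000  # assumption: decimal megabytes; the source does not specify


def fetch_limit_bytes(crawler: str, content_type: str) -> int:
    """Return the per-URL fetch limit in bytes, following the list above.

    `crawler` is a hypothetical label: "googlebot" or anything else.
    """
    if crawler == "googlebot":
        # PDFs get a much higher limit than other resources Googlebot fetches.
        if content_type == "application/pdf":
            return 64 * MB
        return 2 * MB
    # Other crawlers: 15MB by default, regardless of content type.
    return 15 * MB


html_limit = fetch_limit_bytes("googlebot", "text/html")
pdf_limit = fetch_limit_bytes("googlebot", "application/pdf")
other_limit = fetch_limit_bytes("some-other-crawler", "text/html")
```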

What exactly occurs when Google initiates crawling?

  1. Partial Fetching: For HTML files exceeding 2MB, Googlebot will not dismiss the page. Instead, it halts the fetch exactly at the 2MB mark, and that limit includes the HTTP request headers.
  2. Processing the Cutoff: The downloaded section is then forwarded to Google’s indexing systems and the Web Rendering Service (WRS) as if it were the entire file.
  3. The Unseen Bytes: Any data beyond the 2MB cutoff won’t be fetched, rendered, or indexed.
  4. Resource Handling: All referenced resources in the HTML, except media, fonts, and certain other files, are fetched by WRS independently. Each has its own byte count, which does not add to the parent page’s size.
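
To make the cutoff concrete, here is a rough simulation of steps 1 to 3. It is an illustration under stated assumptions, not a model of Googlebot's internals: `simulate_partial_fetch` is a hypothetical helper, the 500-byte header size is made up, and 2MB is treated as 2,000,000 bytes.

```python
from typing import Tuple

LIMIT = 2 * 1_000_000  # assumption: 2MB counted as decimal bytes


def simulate_partial_fetch(header_bytes: int, body: bytes,
                           limit: int = LIMIT) -> Tuple[bytes, int]:
    """Return (body bytes kept, number of body bytes never seen).

    Headers count toward the limit, so the body only gets what is left.
    """
    body_budget = max(limit - header_bytes, 0)
    kept = body[:body_budget]
    return kept, len(body) - len(kept)


# A page of roughly 2.5MB: everything past the cutoff is simply dropped.
body = b"<html>" + b"x" * 2_500_000 + b"</html>"
kept, unseen = simulate_partial_fetch(header_bytes=500, body=body)
print(len(kept), unseen)  # prints: 1999500 500513
```

In this example the closing `</html>` tag falls in the unseen part, so anything placed near the end of an oversized document would never reach indexing or rendering.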

Rendering Bytes with Google. Once the crawler accesses these bytes, WRS takes over. It processes JavaScript and executes code like a modern browser to grasp the final visual and textual state of the page. This process doesn’t request images or videos but does respect the 2MB threshold for each resource.
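
The per-resource behavior can be pictured as independent counters. The sketch below is illustrative only: the resource kinds, the `SKIPPED_KINDS` set, the sample sizes, and the 2,000,000-byte reading of 2MB are assumptions, and `bytes_seen_per_resource` is a hypothetical helper rather than anything WRS exposes.

```python
LIMIT = 2_000_000  # assumption: 2MB counted as decimal bytes

# Hypothetical labels for resources the rendering step does not request.
SKIPPED_KINDS = {"image", "video", "font"}


def bytes_seen_per_resource(resources: dict, limit: int = LIMIT) -> dict:
    """Map each fetched URL to the bytes that fit under its own limit."""
    seen = {}
    for url, (kind, size) in resources.items():
        if kind in SKIPPED_KINDS:
            continue  # not requested during rendering
        # Each URL has its own counter; sizes are never summed across URLs.
        seen[url] = min(size, limit)
    return seen


page = {
    "/index.html": ("html", 150_000),
    "/app.js": ("script", 3_200_000),
    "/site.css": ("style", 40_000),
    "/hero.jpg": ("image", 900_000),
}
seen = bytes_seen_per_resource(page)
```

Here the oversized script is cut at its own 2MB limit, while the small HTML and CSS files are unaffected by it.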

Best Practices. You might want to adopt these recommended practices:

  • Streamline Your HTML: Shift large CSS and JavaScript to external files. While the main HTML document is capped at 2MB, external scripts and stylesheets can be fetched separately, under their own constraints.
  • Prioritize Content: Position crucial elements like meta tags, <title>, <link>, canonicals, and vital structured data high in the HTML to ensure they’re not overlooked.
  • Monitor Server Logs: Keep track of server response times. If your server struggles to deliver data efficiently, Google’s fetchers may slow down to avoid overloading it, which reduces crawl frequency.
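
One way to check the "prioritize content" advice is to look at where critical tags sit in the raw HTML. The following is a minimal sketch, assuming 2MB means 2,000,000 bytes; `audit_html` and the regular expressions are hypothetical and deliberately simple (they only locate the opening tag and are no substitute for a real HTML parser).

```python
import re

LIMIT = 2_000_000  # assumption: 2MB counted as decimal bytes

# Simplified patterns for the opening tags of a few critical elements.
CRITICAL_PATTERNS = {
    "title": rb"<title\b",
    "canonical": rb"""<link\b[^>]*rel=["']canonical["']""",
    "structured data": rb"""<script\b[^>]*type=["']application/ld\+json["']""",
}


def audit_html(html: bytes, header_bytes: int = 0, limit: int = LIMIT) -> dict:
    """Report whether each critical opening tag ends before the cutoff."""
    budget = max(limit - header_bytes, 0)
    report = {}
    for name, pattern in CRITICAL_PATTERNS.items():
        match = re.search(pattern, html, re.IGNORECASE)
        if match is None:
            report[name] = "missing"
        elif match.end() <= budget:
            report[name] = "within limit"
        else:
            report[name] = "beyond limit"
    return report


# A large inline <style> block pushes the canonical link past the cutoff.
html = (
    b"<html><head><title>Hi</title><style>"
    + b" " * 2_100_000
    + b'</style><link rel="canonical" href="https://example.com/"></head></html>'
)
report = audit_html(html)
```

In this made-up page the title is safe, the canonical link lands beyond the cutoff, and no structured data is found, which is the kind of result that would suggest moving the inline CSS to an external file.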

Don’t Miss the Podcast! Google also released a podcast on this topic.

Inspired by this post on Search Engine Land.


FAQs

What is the maximum fetch size for a single URL by Googlebot?

Googlebot fetches up to 2MB for any individual URL, except PDFs. This means the crawler handles at most 2MB of a resource, including the HTTP header. For PDFs, the limit is notably higher at 64MB.

What happens when an HTML file exceeds 2MB?

For HTML files exceeding 2MB, Googlebot performs a partial fetch and stops at the 2MB mark, including HTTP request headers. The downloaded section is then forwarded to Google’s indexing systems and the Web Rendering Service as if it were the entire file. Any data beyond the 2MB cutoff won’t be fetched, rendered, or indexed.

How are resources referenced in the HTML handled during crawling?

All referenced resources in the HTML, except media, fonts, and certain files, are fetched by the Web Rendering Service independently, with their own byte counts not affecting the parent page’s size.

What happens after Googlebot fetches the 2MB portion?

Once the 2MB portion is fetched, WRS processes JavaScript to render the final visual and textual state. The process doesn’t fetch images or videos but respects the 2MB limit for each resource.

What are best practices to improve crawling efficiency?

Streamline your HTML by moving large CSS and JavaScript to external files; the main HTML document remains capped at 2MB, while external scripts and stylesheets can be fetched separately. Prioritize content by placing meta tags, title, link, canonicals, and vital structured data high in the HTML so they are not overlooked. Monitor server logs; if your server struggles to deliver data efficiently, Google’s fetchers may slow down to avoid overloading it, which reduces crawl frequency.
