I’ve always been fascinated by how Google keeps improving its search capabilities. Recently, Gary Illyes from Google shared more about Googlebot’s operations, diving into its crawling ecosystem, fetching processes, and how it handles data.
If you’re curious, the article is titled “Inside Googlebot: Demystifying Crawling, Fetching, and the Bytes We Process.”
Googlebot Reimagined. It’s intriguing to learn that Google uses multiple crawlers for diverse objectives. Referring to Googlebot as a singular entity might not capture this complexity anymore. Google’s crawler documentation has more details on the various crawlers and user agents.
Understanding Limits. During a recent discussion, Google elaborated on its crawling limits. Gary Illyes provided these insights:
- Googlebot fetches up to 2MB for any individual URL, except for PDFs.
- This means it crawls only the first 2MB of a resource, and that 2MB includes the HTTP header.
- For PDF files, the limit is notably higher at 64MB.
- Image and video crawlers have varied threshold values, contingent on the product they serve.
- By default, other crawlers have a 15MB limit, regardless of content type.
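To keep these numbers straight, here is a minimal Python sketch that encodes the limits from the list above as a lookup. The crawler labels and the function name are hypothetical, chosen only for illustration, and the sketch assumes 1MB means 1024 * 1024 bytes, which the source does not specify.

```python
from typing import Optional

# Assumption: 1MB = 1024 * 1024 bytes; the source does not say which definition applies.
MB = 1024 * 1024

# Limits as reported in the list above.
GOOGLEBOT_DEFAULT_LIMIT = 2 * MB
GOOGLEBOT_PDF_LIMIT = 64 * MB
OTHER_CRAWLER_LIMIT = 15 * MB


def fetch_limit_bytes(crawler: str, is_pdf: bool = False) -> Optional[int]:
    """Return the reported fetch limit in bytes, or None when no single value applies.

    The labels "googlebot", "image", and "video" are hypothetical names for this
    sketch, not official crawler identifiers.
    """
    if crawler == "googlebot":
        return GOOGLEBOT_PDF_LIMIT if is_pdf else GOOGLEBOT_DEFAULT_LIMIT
    if crawler in ("image", "video"):
        # Thresholds vary by the product the crawler serves, so no fixed number.
        return None
    # Other crawlers default to 15MB regardless of content type.
    return OTHER_CRAWLER_LIMIT


if __name__ == "__main__":
    print(fetch_limit_bytes("googlebot"))
    print(fetch_limit_bytes("googlebot", is_pdf=True))
```

Returning `None` for image and video crawlers is a deliberate choice here: the source only says their thresholds vary, so the sketch avoids inventing a number.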
What exactly occurs when Google initiates crawling?
- Partial Fetching: For HTML files larger than 2MB, Googlebot does not reject the page. Instead, it stops the fetch exactly at the 2MB cutoff, a limit that includes HTTP request headers.
- Processing the Cutoff: The downloaded section is then forwarded to Google’s indexing systems and the Web Rendering Service (WRS) as if it were the entire file.
- The Unseen Bytes: Any data beyond the 2MB cutoff won’t be fetched, rendered, or indexed.
- Resource Handling: All referenced resources in the HTML, except media, fonts, and certain files, are fetched by WRS independently. Each has its own byte counter, which does not count toward the parent page’s size.
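The cutoff behavior described above can be sketched as a simple byte budget. This is only an illustration of the arithmetic, not Google's implementation: the function name and the `header_bytes` parameter are hypothetical, and 1MB is assumed to be 1024 * 1024 bytes.

```python
from typing import Tuple

# Assumption: 1MB = 1024 * 1024 bytes; the source does not define the unit.
MB = 1024 * 1024


def simulate_partial_fetch(
    header_bytes: int, body: bytes, limit: int = 2 * MB
) -> Tuple[bytes, int]:
    """Return (bytes kept, bytes dropped) for one resource.

    The headers are counted against the same limit as the body, following
    the description above, so a larger header leaves less room for content.
    """
    body_budget = max(limit - header_bytes, 0)
    kept = body[:body_budget]          # what would be passed on for indexing and rendering
    dropped = len(body) - len(kept)    # never fetched, rendered, or indexed
    return kept, dropped


if __name__ == "__main__":
    page = b"x" * (3 * MB)  # a hypothetical 3MB HTML document
    kept, dropped = simulate_partial_fetch(header_bytes=500, body=page)
    print(f"kept {len(kept)} bytes, dropped {dropped} bytes")
```

Each referenced script or stylesheet would get its own call with a fresh budget, which mirrors the point that resource sizes are counted separately from the parent page.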
Rendering Bytes with Google. Once the crawler accesses these bytes, WRS takes over. It processes JavaScript and executes code like a modern browser to grasp the final visual and textual state of the page. This process doesn’t request images or videos but does respect the 2MB threshold for each resource.
Best Practices. You might want to embrace these recommended practices:
- Streamline Your HTML: Shift large CSS and JavaScript to external files. While the main HTML document is capped at 2MB, external scripts and stylesheets can be fetched separately, under their own constraints.
- Prioritize Content: Position crucial elements like meta tags, <title>, <link>, canonicals, and vital structured data high in the HTML to ensure they’re not overlooked.
- Monitor Server Logs: Keep track of server response times. If your server struggles to deliver data efficiently, Google’s fetchers may slow down to avoid overloading it, reducing crawl frequency.
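One way to act on the "Prioritize Content" advice is to check where critical tags sit in the raw HTML bytes. The sketch below is a rough audit under stated assumptions: the regular expressions are simplified patterns I chose for illustration (they are not a full HTML parser), the function names are hypothetical, and 1MB is assumed to be 1024 * 1024 bytes.

```python
import re
from typing import Dict, List, Optional

# Assumption: 1MB = 1024 * 1024 bytes; the source does not define the unit.
MB = 1024 * 1024

# Simplified, illustrative patterns; real markup can be more varied than this.
CRITICAL_PATTERNS = {
    "title": rb"<title\b",
    "canonical": rb"""<link\b[^>]*\brel\s*=\s*["']?canonical""",
    "structured_data": rb"<script\b[^>]*application/ld\+json",
}


def critical_tag_ends(html: bytes) -> Dict[str, Optional[int]]:
    """Map each tag name to the byte offset where its first match ends, or None."""
    ends: Dict[str, Optional[int]] = {}
    for name, pattern in CRITICAL_PATTERNS.items():
        match = re.search(pattern, html, re.IGNORECASE)
        ends[name] = match.end() if match else None
    return ends


def tags_past_cutoff(
    html: bytes, limit: int = 2 * MB, header_bytes: int = 0
) -> List[str]:
    """List the tags whose first match ends beyond the byte budget."""
    budget = limit - header_bytes
    return [
        name
        for name, end in critical_tag_ends(html).items()
        if end is not None and end > budget
    ]


if __name__ == "__main__":
    # Hypothetical page: a title up front, then 2MB of filler before the canonical.
    html = (
        b"<html><head><title>x</title>"
        + b" " * (2 * MB)
        + b'<link rel="canonical" href="https://example.com/">'
    )
    print(tags_past_cutoff(html))
```

Tags that are missing entirely are reported as `None` by `critical_tag_ends` and are not flagged by `tags_past_cutoff`, so you may want to check for both cases when auditing a page.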
Don’t Miss the Podcast! Google also released a podcast on this topic.
Inspired by this post on Search Engine Land.

