In my experience, the open web often feels like the Wild West, especially in recent times. Many creators, myself included, have watched as our hard work is scraped and fed into large language models without any hint of permission.
This situation has become a free-for-all, leaving website owners with almost no means to opt out or safeguard their creative endeavors. There have been attempts to address this, such as Jeremy Howard’s llms.txt initiative. Much like robots.txt helps us manage site crawlers, llms.txt aims to provide guidelines for AI companies’ crawling bots.
Unfortunately, there’s little proof that AI companies actually respect llms.txt or its guidelines. Additionally, Google has clearly stated it doesn’t support llms.txt.
However, a promising new protocol is on the horizon, potentially granting site owners like myself more control over how AI firms utilize our content. It looks like this might become part of robots.txt, allowing us to set definitive rules around AI system access and usage.
IETF AI Preferences Working Group
In response to this issue, the Internet Engineering Task Force (IETF) launched the AI Preferences Working Group in January of this year. Its mission is to craft standardized, machine-readable rules that let site owners express AI usage preferences for their content.
Since its inception in 1986, the IETF has developed and maintained the specifications for core Internet protocols like TCP/IP, HTTP, DNS, and TLS. Now, it is laying the foundations for the open web’s AI era. The group is led by co-chairs Mark Nottingham and Suresh Krishnan, joined by participants from Google, Microsoft, Meta, and more.
Of particular interest is Google’s involvement via Gary Illyes, who is part of this working group.
The purpose of this group is clear:
- “The AI Preferences Working Group will standardize building blocks that allow for expressing preferences about how content is collected and processed for Artificial Intelligence (AI) model development, deployment, and use.”
What the AI Preferences Group is Proposing
This group aims to deliver new standards that let site owners determine how LLM-powered systems can use their open web content. The planned deliverables are:
- A standards-track document detailing a vocabulary to express AI-related preferences, independent of how those preferences are associated with content.
- Standards-track document(s) that explain how to associate these preferences with content using IETF-defined protocols and formats, for example, Well-Known URIs and HTTP response headers.
- A standard approach for reconciling multiple preference expressions.
At the time of writing, nothing is set in stone. Early drafts, however, offer a preview of what the standards might look like.
The working group published two key drafts in August:
- A Vocabulary For Expressing AI Usage Preferences
- Associating AI Usage Preferences with Content in HTTP (with Illyes as a contributing author)
These documents propose significant updates to the Robots Exclusion Protocol (RFC 9309), suggesting new rules and definitions enabling site owners to specify AI content usage permissions.

How It Might Work
Under the drafts, the ways automated systems use web content are grouped into categories, and each category gets a standard label. Whether there will be a directory where site owners can look up these labels remains unclear.
Currently, the defined labels include:
- search: for indexing/discoverability
- train-ai: for general AI training
- train-genai: for generative AI model training
- bots: for all types of automated processing, such as crawling and scraping
For each label, you can set one of two values:
- y to allow
- n to disallow
I found it interesting that these rules can be applied at the folder level and customized for different bots. In robots.txt, they’re implemented using a new Content-Usage field, akin to existing Allow and Disallow fields.
Here’s an example robots.txt that the working group shared in their document:
User-Agent: *
Allow: /
Disallow: /never/
Content-Usage: train-ai=n
Content-Usage: /ai-ok/ train-ai=y
Explanation
Content-Usage: train-ai=n indicates that no content on this domain may be used for training AI models, whereas Content-Usage: /ai-ok/ train-ai=y carves out an exception that permits model training on content within the /ai-ok/ folder.
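To make the folder-level behavior concrete, here is a rough Python sketch of how a crawler might read these rules. It is my own illustration, not code from the working group. The function names parse_content_usage and usage_allowed are hypothetical, the sketch ignores User-Agent grouping for brevity, and it assumes the most specific (longest) matching path wins, mirroring how Allow and Disallow are resolved in RFC 9309. The final standard may define this differently.

```python
from typing import Dict, List, Optional, Tuple

# The example robots.txt from the working group's draft.
ROBOTS_TXT = """\
User-Agent: *
Allow: /
Disallow: /never/
Content-Usage: train-ai=n
Content-Usage: /ai-ok/ train-ai=y
"""

Rule = Tuple[str, Dict[str, bool]]  # (path prefix, {label: allowed})


def parse_content_usage(robots_txt: str) -> List[Rule]:
    """Collect Content-Usage rules (User-Agent groups are ignored here)."""
    rules: List[Rule] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "content-usage":
            continue
        parts = value.split()
        # Assumption: a leading token that starts with "/" is the path;
        # without one, the rule applies to the whole site ("/").
        path = parts.pop(0) if parts and parts[0].startswith("/") else "/"
        prefs: Dict[str, bool] = {}
        for token in parts:
            label, eq, flag = token.partition("=")
            if eq and flag in ("y", "n"):
                prefs[label] = flag == "y"
        rules.append((path, prefs))
    return rules


def usage_allowed(rules: List[Rule], path: str, label: str) -> Optional[bool]:
    """Return the preference from the longest matching prefix, or None if unstated."""
    best_len = -1
    decision: Optional[bool] = None
    for prefix, prefs in rules:
        if path.startswith(prefix) and label in prefs and len(prefix) > best_len:
            best_len = len(prefix)
            decision = prefs[label]
    return decision


rules = parse_content_usage(ROBOTS_TXT)
print(usage_allowed(rules, "/blog/post", "train-ai"))
print(usage_allowed(rules, "/ai-ok/page", "train-ai"))
print(usage_allowed(rules, "/blog/post", "search"))
```

With the example file, a page under /ai-ok/ resolves to allowed for train-ai, any other page resolves to disallowed, and a label the file never mentions (such as search) comes back as None, meaning the site has stated no preference.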
Why Does This Matter?
There’s significant buzz about llms.txt within the SEO community and its use alongside robots.txt. Yet, no AI company has confirmed adherence to these guidelines, and Google disregards llms.txt.
Website owners, including myself, want more explicit control over how AI companies use our content, whether for training models or for RAG-based responses.
I feel that the IETF’s proposed standards are a step in the right direction. With Illyes as a contributing author, I remain optimistic that once they are finalized, companies like Google will adopt them and respect the new robots.txt rules when scraping content.
Inspired by this post on Search Engine Land.

