Empower Your Content with New AI Usage Standards

[Image: a holographic laptop with glowing lines of code facing three menacing digital spiders, symbolizing cyber threats.]

In my experience, the open web often feels like the Wild West, especially in recent times. Many creators, myself included, have watched as our hard work is scraped and fed into large language models without any hint of permission.

This situation has become a free-for-all, leaving website owners with almost no means to opt out or safeguard their creative endeavors. There have been attempts to address this, such as Jeremy Howard’s llms.txt initiative. Much like robots.txt helps us manage site crawlers, llms.txt aims to provide guidelines for AI companies’ crawling bots.

Unfortunately, there’s little evidence that AI companies actually honor llms.txt or its guidelines. Google, for its part, has clearly stated that it doesn’t support llms.txt.

However, a promising new protocol is on the horizon, potentially granting site owners like myself more control over how AI firms utilize our content. It looks like this might become part of robots.txt, allowing us to set definitive rules around AI system access and usage.

IETF AI Preferences Working Group

In response to this issue, the Internet Engineering Task Force (IETF) launched the AI Preferences Working Group in January of this year. Its mission is to craft standardized, machine-readable rules that let site owners state their AI usage preferences for their content.

Since its inception in 1986, the IETF has developed and maintained the standards behind core Internet protocols such as TCP/IP, HTTP, DNS, and TLS. Now it is laying the foundations for the open web’s AI era. The group is led by co-chairs Mark Nottingham and Suresh Krishnan, joined by participants from Google, Microsoft, Meta, and more.

Of particular interest is Google’s involvement via Gary Illyes, who is part of this working group.

The purpose of this group is clear:

  • “The AI Preferences Working Group will standardize building blocks that allow for expressing preferences about how content is collected and processed for Artificial Intelligence (AI) model development, deployment, and use.”

What the AI Preferences Group is Proposing

The group aims to deliver new standards that let site owners determine how LLM-powered systems may use their open web content. Its planned deliverables are:

  • A Standards Track document detailing a vocabulary to express AI-related preferences, independent of how those preferences are associated with content.
  • Standards Track document(s) that explain how to associate these preferences with content using IETF-defined protocols and formats, for example, Well-Known URIs and HTTP response headers.
  • A standard approach for reconciling multiple preference expressions.
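To make the "HTTP response headers" deliverable a little more concrete, here is a rough sketch of how a server might serialize preferences into a header value. The header name Content-Usage and the comma-separated label=value layout are assumptions borrowed from the robots.txt example further down in this post; the charter text quoted here does not specify either, and the helper name is purely illustrative.

```python
# Hypothetical sketch: the header name and value layout are assumptions,
# not something confirmed by the charter text quoted in this post.
HEADER_NAME = "Content-Usage"


def format_content_usage(preferences: dict[str, bool]) -> str:
    """Turn {"train-ai": False} into "train-ai=n" (True maps to y, False to n)."""
    return ", ".join(
        f"{label}={'y' if allowed else 'n'}"
        for label, allowed in preferences.items()
    )


headers = {HEADER_NAME: format_content_usage({"train-ai": False, "search": True})}
print(headers)  # {'Content-Usage': 'train-ai=n, search=y'}
```

Until the drafts settle, treat this as a thought experiment rather than something to deploy.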

At the time of writing, nothing is set in stone yet. Early documents, however, provide a sneak peek into potential standards.

This working group published two crucial documents in August.

These documents propose significant updates to the Robots Exclusion Protocol (RFC 9309), suggesting new rules and definitions enabling site owners to specify AI content usage permissions.

[Figure: diagram of the categories of use, showing how foundation model, AI output, and search relate under automated processing.]

How It Might Work

Under the proposal, AI systems on the web are grouped into categories of use, each with a standard label. It remains unclear whether there will be a directory where site owners can look up these labels.

Currently, the defined labels include:

  • search: for indexing/discoverability
  • train-ai: for general AI training
  • train-genai: for generative AI model training
  • bots: for all types of automated processing, such as crawling and scraping

For each label, you can set one of two values:

  • y to allow
  • n to disallow
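The labels and values above can be modeled as a small lookup plus a validator. This is a minimal sketch rather than anything from the drafts: the function name is made up, and rejecting unknown labels is a choice made for this example (a real consumer might simply ignore labels it does not recognize).

```python
# Labels and values as listed in this post; everything else is illustrative.
KNOWN_LABELS = {
    "search": "indexing and discoverability",
    "train-ai": "general AI training",
    "train-genai": "generative AI model training",
    "bots": "all automated processing, such as crawling and scraping",
}
ALLOWED_VALUES = {"y": True, "n": False}


def parse_preference(expression: str) -> tuple[str, bool]:
    """Parse one 'label=value' expression such as 'train-ai=n'."""
    label, sep, value = expression.strip().partition("=")
    if not sep:
        raise ValueError(f"expected label=value, got {expression!r}")
    label, value = label.strip().lower(), value.strip().lower()
    if label not in KNOWN_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    if value not in ALLOWED_VALUES:
        raise ValueError(f"value must be 'y' or 'n', got {value!r}")
    return label, ALLOWED_VALUES[value]


print(parse_preference("train-ai=n"))  # ('train-ai', False)
print(parse_preference("search=y"))    # ('search', True)
```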

I found it interesting that these rules can be applied at the folder level and customized for different bots. In robots.txt, they’re implemented using a new Content-Usage field, akin to existing Allow and Disallow fields.

Here’s an example robots.txt that the working group shared in their document:

User-Agent: *
Allow: /
Disallow: /never/
Content-Usage: train-ai=n
Content-Usage: /ai-ok/ train-ai=y

Explanation
Content-Usage: train-ai=n indicates that no content on this domain may be used to train AI models, whereas Content-Usage: /ai-ok/ train-ai=y makes an exception that permits model training on content within the /ai-ok/ folder.
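To see how a crawler might read those lines, here is a hedged Python sketch that extracts the Content-Usage rules from the sample and answers "may this path be used for this purpose?". It makes several assumptions that the drafts may not share: it ignores User-Agent grouping and the Allow/Disallow lines entirely, it treats a rule without a path as covering the whole site, and it lets the longest matching path prefix win. The function names are hypothetical.

```python
SAMPLE = """\
User-Agent: *
Allow: /
Disallow: /never/
Content-Usage: train-ai=n
Content-Usage: /ai-ok/ train-ai=y
"""


def parse_content_usage(robots_txt: str) -> list[tuple[str, str, bool]]:
    """Return (path_prefix, label, allowed) for every Content-Usage line."""
    rules = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "content-usage":
            continue
        parts = value.split()
        # A leading token starting with "/" is read as a path; otherwise the
        # rule is assumed to cover the whole site.
        path = parts.pop(0) if parts and parts[0].startswith("/") else "/"
        for expression in parts:
            label, _, flag = expression.partition("=")
            if flag in ("y", "n"):
                rules.append((path, label, flag == "y"))
    return rules


def is_allowed(rules, path: str, label: str):
    """Longest matching prefix wins (assumed); None means no preference."""
    matches = [r for r in rules if r[1] == label and path.startswith(r[0])]
    if not matches:
        return None
    return max(matches, key=lambda r: len(r[0]))[2]


rules = parse_content_usage(SAMPLE)
print(is_allowed(rules, "/blog/post", "train-ai"))    # False
print(is_allowed(rules, "/ai-ok/essay", "train-ai"))  # True
print(is_allowed(rules, "/blog/post", "search"))      # None
```

Returning None when no rule mentions a label keeps "no preference stated" distinct from an explicit n, which seems like the safer reading until the standard says otherwise.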

Why Does This Matter?

There’s significant buzz about llms.txt within the SEO community and its use alongside robots.txt. Yet, no AI company has confirmed adherence to these guidelines, and Google disregards llms.txt.

Website owners, including myself, crave more explicit control over how AI companies leverage our content—be it for training models or RAG-based responses.

I feel that the IETF’s new standards signify positive progress. With Illyes as a contributing author, I remain optimistic that once finalized, companies like Google will embrace these standards, respecting new robots.txt rules during content scraping.


Inspired by this post on Search Engine Land.


FAQs

What is the AI Usage Preferences protocol?

It is a proposed IETF standard to standardize how content usage preferences are expressed for AI model development, deployment, and use. The goal is to empower site owners to articulate access preferences for AI systems.

Who leads the AI Preferences Working Group?

Co-chairs are Mark Nottingham and Suresh Krishnan, with involvement from Google, Microsoft, Meta, and others. Gary Illyes from Google is also a contributor.

What documents has the AI Preferences Group published?

They published “A Vocabulary For Expressing AI Usage Preferences” and “Associating AI Usage Preferences with Content in HTTP,” with Illyes as a contributing author.

What is the purpose of these standards?

To standardize building blocks for expressing AI-related content preferences and to describe how to attach these preferences to content using HTTP and Well-Known URIs, helping site owners control AI usage.

How might these preferences be expressed in robots.txt?

The article includes a sample robots.txt showing Content-Usage fields, such as train-ai=n to disallow training and /ai-ok/ train-ai=y to allow it.
