Publishers Demand Halt in AI Data Collection by Common Crawl


Could AI be losing a crucial source of its training data? Major publishers are urging Common Crawl to stop collecting and distributing their content for AI training.

Digital Content Next (DCN) has sent a cease-and-desist letter to the Common Crawl Foundation, demanding that it stop scraping and sharing protected publisher content.

Representing leading digital publishers like the AP, the New York Times, NBC Universal, Bloomberg, NPR, and Fox, DCN is also insisting that Common Crawl remove its members’ content, including paywalled and subscriber-only news articles, from its datasets.

Concerns Over Opt-Outs: DCN questions whether Common Crawl has honored publisher opt-out requests. Specifically, DCN’s lawyers are examining whether Common Crawl’s earlier statements about compliance, which often cited technical costs and delays, were misleading.

  • The registry maintained by Common Crawl does list sites that have opted out, including several prominent news organizations.
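
The article does not say how publishers express an opt-out. One widely used mechanism, assumed here purely for illustration, is a robots.txt rule aimed at Common Crawl's crawler, which identifies itself as CCBot. The sketch below uses Python's standard `urllib.robotparser` to check whether a given robots.txt would permit a crawler to fetch a page. The robots.txt content, the `is_allowed` helper, the `ExampleBot` name, and the example.com URL are all hypothetical and not taken from any publisher named above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site (an assumption,
# not a real publisher's file): block CCBot entirely, and keep every other
# crawler out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://news.example.com/2025/some-article"
    print("CCBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "CCBot", article))
    print("ExampleBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", article))
```

With this sample file, CCBot is refused and the other crawler is allowed. A rule like this only governs future crawling by a compliant crawler. It does nothing about pages already collected, which is why DCN's demand to remove existing content from the datasets is a separate matter.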

Claims of Infringement: DCN firmly holds that copyright isn’t an opt-out system. They allege Common Crawl has been “flagrantly infringing” on publisher copyrights by distributing protected content without authorization or compensation.

  • The group also criticizes how Common Crawl shares this content with AI developers.
  • DCN’s CEO, Jason Kint, says the legal action is a stand against the notion that online content is available for unrestricted collection, storage, and reuse.

Common Crawl’s Defense: Rich Skrenta, Common Crawl’s executive director, denies that the organization bypassed paywalls or misled publishers. He points to its prompt technical response to requests to remove previously crawled content.

  • “Our removal process aligns with our dataset’s technical framework,” Skrenta explains.

Importance of This Battle: The outcome of this dispute could significantly affect how much publisher content AI search engines can use without explicit permission. If consent requirements tighten, licensed sources may win out, reducing reliance on openly available web content.

The High Stakes of AI Training: Established in 2008, Common Crawl has amassed billions of webpages in a free public repository that has become a vital resource for training AI models. Notably, The New York Times’ 2023 lawsuit against OpenAI claimed that Common Crawl made up 60% of GPT-3’s training data, as reported by Press Gazette.

  • A 2024 Mozilla Foundation paper found that generative AI would scarcely exist today without Common Crawl.
  • Common Crawl’s ongoing efforts to create AI crawling standards indicate a willingness to adapt, yet DCN calls for decisive action: a full halt to the scraping of protected content.

Inspired by this post on Search Engine Land.

