Tag: Data Scraping

Publishers Demand Halt in AI Data Collection by Common Crawl
Could AI be losing a crucial source of its training data? As a major shift looms, significant publishers are urging Common Crawl to pause its collection and distribution of their content for AI training.

Digital Content Next (DCN) has sent a cease-and-desist letter to the Common Crawl Foundation, demanding that it stop scraping and sharing protected publisher content.

Representing leading digital publishers like the AP, the New York Times, NBC Universal, Bloomberg, NPR, and Fox, DCN is also insisting that Common Crawl remove its members’ content, including paywalled and subscriber-only news articles, from its datasets.

Concerns Over Opt-Outs: Questions arise regarding Common Crawl’s adherence to publisher opt-out requests. Specifically, DCN’s lawyers are scrutinizing whether previous statements about compliance—often citing technical costs and delays—were perhaps misleading.
- The registry maintained by Common Crawl does list sites opting out, including several prominent news organizations.
Claims of Infringement: DCN firmly holds that copyright isn’t an opt-out system. They allege Common Crawl has been “flagrantly infringing” on publisher copyrights by distributing protected content without authorization or compensation.
- The group further critiques how Common Crawl shares this content with AI developers.
- DCN’s CEO, Jason Kint, says this legal action is a stand against the notion that online content is available for unrestricted collection, storage, and reuse.
Common Crawl’s Defense: Rich Skrenta, the Executive Director, denies allegations of bypassing paywalls and misleading publishers. He references a prompt and technical response to remove previously crawled content upon request.
- “Our removal process aligns with our dataset’s technical framework,” Skrenta explains.
Importance of This Battle: The outcome of this dispute could drastically influence the scope of publisher content that AI search engines use without explicit permission. Should there be heightened consent requirements, licensed sources may prevail, reducing reliance on openly available web content.

The High Stakes of AI Training: Established in 2008, Common Crawl has amassed billions of webpages to form a free public repository, a vital resource for training AI models. Notably, The New York Times’ 2023 lawsuit against OpenAI stated that Common Crawl made up 60% of GPT-3’s training data, as reported by Press Gazette.
- A 2024 Mozilla Foundation paper found generative AI would scarcely exist today without Common Crawl.
- Common Crawl’s ongoing efforts to create AI crawling standards indicate a willingness to adapt, yet DCN calls for decisive action—fully halting the scraping of protected content.
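To make the opt-out question concrete, here is a minimal sketch of how a publisher's robots.txt can be checked for a rule that blocks Common Crawl's crawler, which identifies itself as CCBot. The robots.txt content, the example.com URL, and the `is_allowed` helper are illustrative assumptions, not taken from any publisher named above, and the dispute covers removal of already-crawled content, which robots.txt does not address.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks CCBot everywhere, allows all other crawlers.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story"  # placeholder URL
    for agent in ("CCBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

A rule like this only works if the crawler chooses to honor it, which is why DCN frames copyright as something that should not depend on an opt-out.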
Inspired by this post on Search Engine Land.
June 10, 2026
SerpApi Challenges Reddit’s Allegations in Court Showdown

SerpApi is asking a federal court to dismiss Reddit’s lawsuit, which centers on the alleged scraping of Reddit content from Google Search. As I read it, SerpApi argues that Reddit is using copyright law to exert control over user posts and public search results.

Reddit amended its initial complaint in February, but SerpApi remains firm. It argues that Reddit has not adequately demonstrated copyright ownership, technical circumvention, or tangible harm resulting from these actions.

SerpApi’s argument. From a blog post by SerpApi CEO Julien Khaleghy, I gather that the lawsuit is flawed for several reasons:

Reddit, interestingly enough, does not own the majority of the content in question, as user agreements clearly state that content ownership resides with the users themselves. It’s fascinating to see that Reddit only has a non-exclusive license to these posts.

The snippets Reddit presented, including dates and short fragments, don’t appear to be copyrightable at all from what I’ve read in the claims.

SerpApi’s stance is that it accessed Google Search pages and never interfaced directly with Reddit’s platform, which I believe weakens Reddit’s argument substantially.

DMCA concerns. In what I find a compelling argument, Khaleghy asserts that Reddit’s claim of a Digital Millennium Copyright Act (DMCA) violation lacks merit. SerpApi contends that their actions parallel what any user might see when conducting a Google search. Khaleghy strongly points out that:

There’s no evidence of encryption breaches or authentication bypass by SerpApi.

Accessing publicly available web pages doesn’t constitute “circumvention” under existing DMCA guidelines.

Reddit seems to be attempting to enforce copyright claims over content that doesn’t belong to them, which is an intriguing angle to this case.

Moreover, Reddit’s privacy policy acknowledges that public posts may surface in search results, supporting SerpApi’s use of the data.

Backstory. It’s clear to me that legal conflicts surrounding search scraping and AI data have gained high stakes lately:

Oct. 22: Reddit sued SerpApi, Perplexity, Oxylabs, and AWMProxy, claiming they scraped large amounts of Reddit content through Google Search and pointing to a decoy post created solely for Google’s crawler.

Oct. 29: SerpApi’s response, branding Reddit’s allegations as inflammatory, was a critical move, showcasing their resolve to defend access to public search data.

Dec. 19: Google filed its own lawsuit against SerpApi, accusing it of bypassing bot protections to scrape licensed content from search features.

Feb. 20: SerpApi responded by asking the court to dismiss Google’s lawsuit, arguing that Google is inappropriately leveraging the DMCA to limit access to public search results.

Importance. This case captivates me as it explores whether companies can legally extract information from Google’s search results without infringing on copyright laws or the DMCA, potentially impacting SEO tools and AI data training significantly.

Looking forward. I eagerly await the court’s decision on whether Reddit’s amended complaint holds up. A dismissal with prejudice would put an end to Reddit’s claims against SerpApi in this instance, which could send ripples through the industry.

SerpApi’s blog post. Check out “Reddit’s Lawsuit is a Dangerous Attempt to Expand Platform Power” for more on SerpApi’s perspective.

Inspired by this post on Search Engine Land.

March 13, 2026
SerpApi’s Legal Battle: Challenging Google’s Scraping Lawsuit

When I first learned about SerpApi’s move to dismiss Google’s lawsuit, my immediate thought was about how bold a challenge it is. SerpApi argues that Google is twisting copyright law to restrict access to public search results, all to protect its ad revenue rather than any copyright.

The motion to dismiss was officially filed on February 20th, as mentioned in a recent blog post by SerpApi’s CEO, Julien Khaleghy. This legal battle stems from Google’s accusation in December that SerpApi bypassed security measures to scrape and resell content from Google Search.

The details: According to Khaleghy, Google is improperly applying the Digital Millennium Copyright Act (DMCA). Here’s what I found compelling:

The DMCA is meant to protect copyrighted works, not online platforms or advertising ventures. In addition, Google doesn’t actually own the content that appears in its search results, and accessing publicly available pages doesn’t qualify as “circumvention” under this law, SerpApi argues.

Google claims that SerpApi managed to evade bot-detection and crawling controls using rotating bot identities and large networks to scrape licensed content from features such as images and real-time data. However, SerpApi insists that they do not decrypt systems or breach authentication protocols, and merely gather the same data any user could see via a browser, without needing to log in.

Khaleghy also points out Google’s admission that its anti-bot systems primarily secure its advertising interests, which weakens the DMCA claim against SerpApi.

SerpApi references significant legal precedents, including the Ninth Circuit’s hiQ v. LinkedIn, which cautions against monopolizing public data, and the Sixth Circuit’s Lexmark v. Static Control Components, which it cites for the point that public-facing content shouldn’t be blocked by merely technical measures.

Catch up quick: This lawsuit is the latest in a series of escalating legal clashes over data scraping and AI usage:

On Oct. 22, 2025, Reddit filed suit against SerpApi, among others, alleging they indirectly scraped Reddit content through Google Search. Reddit claims these companies obscured their identities and operated at an “industrial scale.” In turn, SerpApi has vowed to defend itself, emphasizing that public data should remain accessible.

In December, Google escalated the situation by suing SerpApi for bypassing its security measures and reselling protected content. SerpApi stands firm, citing lawful operation and First Amendment rights to access public search data.

By the numbers: If Google’s interpretation of the DMCA holds, SerpApi suggests potential damages could reach $7.06 trillion, roughly a quarter of annual U.S. GDP. This staggering figure is a theoretical estimate based on potential penalties, not an actual demand.
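To get a feel for the scale behind that estimate, here is a rough back-of-the-envelope sketch. It assumes the figure derives from the DMCA's statutory damages range of $200 to $2,500 per act of circumvention (17 U.S.C. § 1203(c)(3)(A)). The post does not say how SerpApi arrived at its number, so the implied counts below are illustrative only, and the `implied_acts` helper is a hypothetical name.

```python
# Statutory damages range per act of circumvention, 17 U.S.C. § 1203(c)(3)(A).
MIN_PER_ACT = 200
MAX_PER_ACT = 2_500

# Theoretical total cited by SerpApi, in dollars.
CLAIMED_TOTAL = 7_060_000_000_000


def implied_acts(total: int, per_act: int) -> int:
    """Number of acts needed to reach `total` at `per_act` dollars each."""
    return total // per_act


# Assumption: the total is simply (acts) x (per-act damages).
fewest_acts = implied_acts(CLAIMED_TOTAL, MAX_PER_ACT)
most_acts = implied_acts(CLAIMED_TOTAL, MIN_PER_ACT)

print(f"At ${MAX_PER_ACT:,} per act: {fewest_acts:,} acts")
print(f"At ${MIN_PER_ACT:,} per act: {most_acts:,} acts")
```

Under that assumption, the total would correspond to somewhere between about 2.8 billion and 35.3 billion individual acts, which shows why the number is best read as a theoretical ceiling rather than a realistic award.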

What’s next: It all boils down to the court’s decision on whether Google’s claims should move forward. Depending on the outcome, this case could significantly impact how SEO platforms, AI tools, and competitive intelligence software access search results data in the future.

A win for Google might hinder third-party access to search data, while a victory for SerpApi could reinforce that publicly accessible search results are indeed fair game.

For deeper insights, I recommend reading “Google v. SerpApi: We’re filing a Motion to Dismiss. Here’s why we’re in the right.”

Don’t miss “Inside SearchGuard: How Google detects bots and what the SerpAPI lawsuit reveals” for in-depth analysis.

Inspired by this post on Search Engine Land.

February 23, 2026
Google’s Legal Battle: SerpApi Accused of Unlawful Data Scraping
Today, I came across an intriguing development: Google has sued SerpApi. The lawsuit alleges that SerpApi has been bypassing Google’s security systems to scrape and resell copyrighted content from search results.

The Allegations: According to Google, SerpApi has:
- Circumvented the security measures and standard crawling controls Google has in place.
- Ignored directives from websites that specify content accessibility.
- Employed techniques such as cloaking, rotating bot identities, and large bot networks to scrape vast amounts of content.
- Appropriated licensed content from search features such as images and real-time data, subsequently selling it for profit.
Google’s Stance: Describing SerpApi’s actions as “brazen” and “unlawful,” Google expressed concerns over how stealthy scrapers like SerpApi override crawling directives, stripping sites of their choices. Alarmingly, Google noted a significant increase in SerpApi’s activities over the last year.

Quick Update: Interestingly, Google’s lawsuit mirrors similar legal action by Reddit, which also targeted SerpApi, Perplexity, Oxylabs, and AWMProxy. Reddit accused them of scraping content via Google Search results and concealing their identities to evade restrictions.
- Reddit has licensing agreements with Google and OpenAI, and it suspects other entities of trying to get around these deals.
- They reportedly set a “trap” post, visible only to Google’s crawler, which eventually surfaced in Perplexity’s results as proof of scraping.
- SerpApi denied these allegations, claiming their operations are lawful.
SerpApi’s Previous Statements: In defense, SerpApi has maintained that “public search data should be accessible,” viewing its actions as protected by the First Amendment. They also warned that lawsuits like the one from Reddit could endanger the “free and open web.”

Why It Matters to Me: Should Google prevail in this case, reliable SERP data could become harder and more expensive to obtain. Teams that rely on services like SerpApi to track search results and performance metrics would feel this most.

Inspired by this post on Search Engine Land.
December 19, 2025