Reddit to Perplexity: Get your filthy hands off our forums

Reddit on Wednesday filed a lawsuit against Perplexity AI and three of its alleged data dealers for trafficking in unlawfully scraped information.

The complaint, filed in the Southern District of New York, claims that Oxylabs UAB, AWM Proxy, and SerpApi unlawfully bypassed Reddit’s and Google’s defenses to harvest Reddit content and related search results. It also says that Perplexity chose to purchase the purloined data rather than license it from Reddit.

Ben Lee, chief legal officer at Reddit, told The Register in an emailed statement that AI companies are desperate for quality content generated by real people and that this need is fueling an industrial-scale data-laundering economy.

“Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material,” said Lee. “Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created.”

Lee claimed that Oxylabs UAB, a data scraping business based in Lithuania, AWM Proxy, a former Russian botnet, and SerpApi, which advertises real-time access to scraped Google search results, represent textbook examples of this sort of illegal behavior.

“Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search,” said Lee. “Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”

Reddit’s complaint likens these three providers to “would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead.” Echoing Cloudflare CEO Matthew Prince’s characterization of Perplexity, the Reddit legal filing describes Perplexity as “more akin to a ‘North Korean hacker’” who will do whatever is necessary to obtain the data to fuel its AI answer engine, other than pay for a license.

Google is not participating in the lawsuit but has tried to prevent automated scraping of its search results.

The social media company contends that the defendants have violated the US Digital Millennium Copyright Act by bypassing its technological defenses against automated access to its servers. It also accuses SerpApi and Oxylabs specifically of violating the DMCA’s prohibition on trafficking in technology circumvention products or services. Other claims include unfair competition, unjust enrichment, and civil conspiracy.

Reddit is seeking damages and an injunction to halt the unwanted scraping of its content.

In June, Reddit filed a similar complaint against Anthropic after it failed to convince the AI business to enter into a content licensing deal as OpenAI has done.

Oxylabs, which advertises itself as “the largest ethical proxy network and advanced scraping solutions empowering the AI industry and beyond,” did not immediately respond to a request for comment.

“It doesn’t appear we have received any communication or service from Reddit on this,” said Ryan Schafer, customer service success director at SerpApi, in an email to The Register. “We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court. We don’t have further comments at the moment.”

A spokesperson for Perplexity told The Register, “Perplexity has not yet received the lawsuit, but we will always fight vigorously for users’ rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest.”

Reddit is not alone in its attempts to defend against its content being scraped and used to train AI models without consent. A lawsuit filed last month on behalf of two authors accuses Apple of “using Books3, a dataset of pirated copyrighted books” to train its OpenELM language models. The complaint against Apple says that the company’s AppleBot has been scraping web data for nine years and that the data is now being used to improve Apple Intelligence models.

Another case, Millette v. OpenAI (2024), contends that OpenAI scraped YouTube videos unlawfully to improve its models. The New York Times Co. v. Microsoft Corp., OpenAI (2023) makes similar allegations with regard to Microsoft’s and OpenAI’s alleged use of its news content.

In August, content delivery network Cloudflare called out Perplexity for running web scraping bots that ignore websites’ no-scraping directives. ®

Updated at 2000 with comment from SerpApi.
