AI Companies Bypass Web Standards to Scrape News Publishers' Content for Generative AI Training

On June 24, Reuters reported that the content licensing startup TollBit has issued a warning to news publishers regarding several AI companies accused of circumventing standard web protocols to scrape content. These companies allegedly use the scraped material to train their generative AI systems. This development comes amid a public dispute between AI search startup Perplexity and media outlet Forbes, centered on the adherence to web standards.

A broader debate is emerging between technology and media firms about the value of content in the generative AI landscape. TollBit aims to serve as a mediator between AI companies seeking content and publishers willing to establish licensing agreements. Forbes has accused Perplexity of plagiarizing its reporting in AI-generated summaries without proper attribution or consent.

Additionally, an investigative report by Wired indicated that Perplexity might be bypassing the Robots Exclusion Protocol (the robots.txt file through which a site tells crawlers which pages they may access) and other protective measures implemented by publishers. The News Media Alliance, representing over 2,000 U.S. publishers, has raised concerns over AI companies ignoring these "no scraping" directives. Danielle Coffey, the organization’s president, stated, "If AI companies cannot halt large-scale scraping, we won’t be able to monetize valuable content or compensate journalists."

TollBit's findings indicate that Perplexity is not the only platform flouting publishers' "no scraping" policies. Its analysis suggests many AI services are sidestepping these rules, even though some publishers have designated "whitelist" areas where scraping is allowed. TollBit remarked, "The more publisher logs we analyze, the more frequently this pattern emerges, indicating that AI platforms are retrieving content despite the robots.txt guidelines."
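To make the idea of checking publisher logs against robots.txt concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. It is not TollBit's method, which the report does not describe. The crawler name ExampleAIBot, the robots.txt rules, the /licensed/ "whitelist" path, and the log entries are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is barred from the site
# except for a "whitelist" area under /licensed/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Allow: /licensed/
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical access-log entries: (user-agent token, requested path).
LOG_ENTRIES = [
    ("ExampleAIBot", "/news/story-1"),
    ("ExampleAIBot", "/licensed/feed"),
    ("OtherBot", "/news/story-1"),
]


def find_violations(robots_txt, entries):
    """Return the log entries that the robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent, path in entries
        if not parser.can_fetch(agent, path)
    ]


violations = find_violations(ROBOTS_TXT, LOG_ENTRIES)
for agent, path in violations:
    print(f"{agent} fetched {path} despite robots.txt")
```

A check like this only works when a crawler identifies itself honestly in its user-agent string. Compliance with robots.txt is voluntary; the file states a publisher's wishes but does not technically block access.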

Prominent publishers like The New York Times have filed lawsuits against AI companies for copyright violations. Conversely, some publishers have chosen to sign licensing agreements with AI firms willing to pay for content, though disputes over the value of provided materials often arise. Many AI developers argue that acquiring content for free does not breach any laws.

This ongoing issue underscores the complex relationship between AI technology and traditional media, highlighting the urgent need for clear guidelines and compensation frameworks.
