Amazon Investigates Perplexity AI Over Allegations of Unauthorized Website Scraping

Amazon Web Services (AWS) has launched an investigation into Perplexity AI to determine whether it is violating rules related to web crawling. According to Wired, AWS is specifically looking into allegations that Perplexity operates a crawler, hosted on AWS servers, that disregards the Robots Exclusion Protocol. This web standard lets site operators place a robots.txt file on their website indicating which pages bots may access. While compliance is voluntary, most reputable crawlers have honored these instructions since the protocol’s inception in the 1990s.
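To illustrate how a compliant crawler consults robots.txt before fetching a page, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the example.com URLs, and the is_allowed helper are hypothetical and chosen only for illustration; they do not reproduce any publisher's actual file or Perplexity's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and
# keeps every other bot out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "PerplexityBot", page))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page))
```

Nothing in the protocol enforces this check. A crawler that skips it, or that identifies itself with a different user agent, can still request the page, which is why compliance is described as voluntary.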

Wired previously reported finding a virtual machine that bypassed Wired's robots.txt instructions. The machine was hosted on an AWS server at the IP address 44.221.181.252, which is believed to be associated with Perplexity. This crawler has allegedly accessed Condé Nast sites hundreds of times over the last three months and has made multiple visits to The Guardian, Forbes, and The New York Times. To verify potential content scraping, Wired tested Perplexity’s chatbot with headlines and short descriptions from its articles. The chatbot produced responses that closely mirrored the articles with minimal attribution.

In a related report, Reuters noted that Perplexity isn't the only AI company bypassing robots.txt files to collect content for training large language models. However, it appears that Wired provided AWS with information only on Perplexity's crawler. An AWS spokesperson said, "Our terms of service prohibit abusive and illegal activities, and our customers must comply with those terms. We routinely receive reports of alleged abuse and engage with our customers to address these reports." The spokesperson confirmed that AWS is investigating the claims presented by Wired.

Perplexity spokesperson Sara Platnick stated that the company has responded to AWS's inquiries, denying that its crawlers violate the Robots Exclusion Protocol. "Our PerplexityBot—operating on AWS—respects robots.txt, and we confirmed that Perplexity-controlled services do not crawl in a way that violates AWS's terms," Platnick said. She added that AWS's inquiry was standard practice for addressing potential abuse and mentioned that Perplexity had no prior notice of an investigation before Wired's contact. Notably, Platnick acknowledged that PerplexityBot may ignore robots.txt when users provide specific URLs in queries.

Aravind Srinivas, the CEO of Perplexity, also denied claims that the company is "ignoring the Robots Exclusion Protocol and lying about it." He admitted to Fast Company, however, that Perplexity employs third-party web crawlers in addition to its own, and acknowledged that the bot identified by Wired is one of these third-party tools.
