After the Success of AgentGPT, Reworkd Shifts Focus to Web-Scraping AI Agents

Reworkd's founders gained significant attention on GitHub last year with their creation, AgentGPT—an innovative, free tool for building AI agents that attracted over 100,000 daily users within just a week. This success secured them a place in Y Combinator’s summer 2023 cohort. However, the co-founders quickly recognized that the scope of developing general AI agents was too vast. Today, Reworkd has pivoted to focus on web scraping, specifically creating AI agents capable of extracting structured data from the public web.

AgentGPT offered an intuitive browser interface where users could effortlessly design autonomous AI agents. This sparked widespread enthusiasm, with many declaring that AI agents represent the future of computing.

At the time AgentGPT surged in popularity, Asim Shrestha, Adam Watkins, and Srijan Subedi were based in Canada, and Reworkd had yet to be established. The unexpected spike in users caught them off guard; Subedi, now Reworkd's COO, said the tool was racking up $2,000 a day in API costs, forcing them to incorporate and raise funding quickly. One of the most popular uses of AgentGPT turned out to be building web scrapers, a straightforward task but one needed at high volume, which led Reworkd to home in exclusively on that market.

In the AI age, web scrapers have become essential. According to Bright Data’s latest report, the primary reason organizations tap into public web data in 2024 is to enhance AI models. Traditional web scrapers, however, are labor-intensive and must be tailored for specific webpages, driving up costs. Reworkd’s AI agents, on the other hand, can efficiently scrape larger portions of the web with minimal human involvement.

Clients can provide Reworkd with lists of hundreds or even thousands of websites to target, along with specific data requirements. Reworkd’s AI agents leverage multimodal code generation to translate these requests into structured data, autonomously generating unique scraping code for each site to extract pertinent data for their clients’ use.

For instance, suppose you need statistics on every NFL player, but each team's website has a distinct layout. Given only the links and a description of the data you want, Reworkd's agents build a scraper for each site, saving you hours of work, and potentially days if the job spans thousands of sites.
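Reworkd hasn't published its pipeline, but the general idea described above (fetch a sample of each site's HTML, have a model write site-specific extraction code, then run that code) can be sketched in a few lines. Everything below, including the prompt, the field names, the `extract(html)` contract, and the example URL, is a hypothetical illustration rather than Reworkd's actual implementation.

```python
import json
import requests
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def generate_scraper(html_sample: str, fields: list[str]) -> str:
    """Ask the model to write a site-specific extract(html) function."""
    prompt = (
        "Write a Python function `extract(html)` that uses BeautifulSoup to return "
        f"a list of dicts with the keys {fields} from pages shaped like the sample "
        "below. Return only the code, with no Markdown fences.\n\n" + html_sample[:8000]
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    code = resp.choices[0].message.content
    # Strip Markdown fences in case the model adds them anyway.
    return code.strip().removeprefix("```python").removesuffix("```").strip()


def scrape(url: str, fields: list[str]) -> list[dict]:
    """Generate a scraper for one site's layout and run it on that page."""
    html = requests.get(url, timeout=30).text
    code = generate_scraper(html, fields)
    namespace: dict = {}
    exec(code, namespace)  # run the generated scraper; sandbox this in production
    return namespace["extract"](html)


if __name__ == "__main__":
    # e.g. one roster page per team, each with its own layout (placeholder URL)
    for team_url in ["https://example.com/team-a/roster"]:
        print(json.dumps(scrape(team_url, ["name", "position", "age"]), indent=2))
```

In this sketch the model is invoked once per site to produce layout-specific code, which is then reused across that site's pages; that mirrors the per-site code generation described above, though the real system presumably adds validation and sandboxing.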

Recently, Reworkd secured $2.75 million in seed funding from notable investors including Paul Graham, AI Grant (co-founded by Nat Friedman and Daniel Gross), SV Angel, General Catalyst, and Panache Ventures. This follows a $1.25 million pre-seed investment from Panache Ventures and Y Combinator last year, increasing Reworkd's total funding to $4 million.

AI Utilizing the Internet

Shortly after founding Reworkd and relocating to San Francisco, the team welcomed Rohan Pandey as a founding research engineer. Pandey, who lives at AGI House SF, a prominent hacker house in the Bay Area, has been described by some investors as a “one-person research lab within Reworkd.”

In an interview, Pandey remarked, “We envision ourselves as the realization of a 30-year dream of the Semantic Web,” referencing Tim Berners-Lee’s vision for a web where computers can comprehensively read the internet. “Even if some websites lack standard markup, LLMs can comprehend them similarly to humans, effectively transforming any website into an API. Essentially, Reworkd positions itself as the universal API layer for the internet.”

Reworkd aims to serve the long tail of customer data needs, and it is particularly strong at scraping the many smaller public websites that larger competitors often overlook. While companies like Bright Data have established scrapers for major platforms like LinkedIn and Amazon, Reworkd tackles smaller sites that may not justify the effort of building a scraper by hand.

The Definition of "Public" Web Data

Despite the long-standing presence of web scrapers, their use has recently sparked controversy in the AI realm. Massive data scraping practices have led to legal challenges for companies like OpenAI and Perplexity, as media organizations accuse them of extracting copyrighted content from behind paywalls without appropriate compensation. Reworkd takes measures to navigate these pitfalls.

Shrestha, co-founder and CEO of Reworkd, said in an interview, “We see ourselves as enhancing access to publicly available information. We strictly adhere to scraping only what is accessible—no sign-in walls or similar barriers.”

Reworkd further distinguishes its practices by opting not to scrape news content and selectively choosing collaborators. The company cites its partnership with Axis, which aids policy teams in complying with government regulations. Axis utilizes Reworkd’s AI to extract data from thousands of governmental regulation documents across various European Union countries, allowing the company to develop and refine an AI model based on that data for its clients.

Launching a web scraping venture nowadays can be viewed as entering treacherous waters, according to Aaron Fiske, a partner at Silicon Valley’s Gunderson Dettmer law firm. The legal landscape surrounding "public" web data remains murky, with ongoing debates about its use for AI applications. Nonetheless, Fiske notes that Reworkd's model—allowing customers to determine which sites to scrape—could protect them from legal repercussions.

He explained, “It’s akin to inventing a copying machine; there’s one use case for creating copies that emerges as immensely valuable, yet legally ambiguous. Web scraping for AI may not inherently pose risks, but engaging with AI companies focused on harvesting copyrighted content could be problematic.”

This is why Reworkd is deliberate about its partnerships. Web scrapers themselves have largely escaped blame in AI-related copyright disputes. In OpenAI's case, for instance, The New York Times didn't sue the web scraper that gathered its articles; it targeted the company reproducing the content. Whether OpenAI's actions constituted copyright infringement remains unresolved.

Recent court rulings indicate that web scrapers could operate within legal boundaries during this AI surge. A recent verdict favored Bright Data after it scraped user data from Facebook and Instagram; the dataset in question contained 615 million records of Instagram profiles that Bright Data markets for $860,000. Despite Meta’s claims that this violated their terms of service, the court decided that the data was public and, therefore, permissible to scrape.

Investors Support Reworkd's Potential

Reworkd has gained traction from illustrious initial investors, including Y Combinator and luminaries like Paul Graham, Daniel Gross, and Nat Friedman. Some investors believe that as technology advances, Reworkd's solutions will similarly develop and become more cost-effective. The startup claims that OpenAI’s latest GPT-4o is currently optimal for its multimodal code generation, noting that many of its technologies have only recently become feasible.

Viet Le of General Catalyst shared, “If you try to navigate the rapid pace of technological progress without leveraging it, as a founder you may face difficulties. Reworkd embraces that philosophy, aligning its solutions with ongoing advancements.”

By addressing a specific market gap, Reworkd's AI agents cater to the growing demand for data as AI technologies evolve. Fine-tuning a model requires abundant, high-quality structured data, so as more companies build bespoke AI models tailored to their needs, Reworkd expects its customer base to grow.

Reworkd touts a “self-healing” approach, ensuring that its web scrapers remain operational even when web pages undergo changes. Furthermore, the startup claims to mitigate hallucination issues often associated with AI models, as its agents autonomously create the code needed for scraping websites. While errors can occur, resulting in data inaccuracies, Reworkd has introduced Banana-lyzer—a robust open-source evaluation framework—to regularly monitor and ensure the accuracy of its outputs.
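Reworkd hasn't detailed how its “self-healing” behavior works. As a rough guess at what it could look like in miniature, a wrapper might detect when a previously generated scraper stops returning the expected fields (say, after a page redesign) and regenerate it from the fresh HTML. The helper `generate_scraper` below is the hypothetical one from the earlier sketch, not a Reworkd API.

```python
def scrape_with_healing(url: str, fields: list[str], scraper_code: str,
                        max_retries: int = 2) -> list[dict]:
    """Run an existing generated scraper; regenerate it if its output looks broken."""
    html = requests.get(url, timeout=30).text
    rows: list[dict] = []
    for _ in range(max_retries + 1):
        namespace: dict = {}
        try:
            exec(scraper_code, namespace)
            rows = namespace["extract"](html)
        except Exception:
            rows = []
        # Health check: every requested field should be present and non-empty.
        if rows and all(row.get(field) for row in rows for field in fields):
            return rows
        # The page layout likely changed; regenerate the scraper from current HTML.
        scraper_code = generate_scraper(html, fields)
    return rows
```

An evaluation harness in the spirit of Banana-lyzer would then run checks like the health test above against known-good expected outputs on a schedule, flagging sites whose scrapers are silently drifting.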

With a compact team of just four individuals, Reworkd faces considerable costs for running its AI agents. However, the startup is optimistic that competitive pricing will develop as operational expenses decrease. Recent innovations, such as OpenAI’s release of GPT-4o mini—a smaller, high-performance model—could also bolster Reworkd's competitive edge.

Paul Graham and AI Grant did not respond to requests for comment.
