Feeding the Beast: How a Booming Data Market Fuels the Unquenchable Demand for LLMs | The AI Beat

Last week, I discussed Mark Zuckerberg's insights into Meta's AI strategy, highlighting a significant advantage: a vast and continuously expanding internal dataset used to train its Llama models.

Zuckerberg stated that Facebook and Instagram host "hundreds of billions of publicly shared images and tens of billions of public videos," surpassing the size of the Common Crawl dataset. Users also share vast quantities of public text posts across these platforms.

The Insatiable Data Needs of AI

However, the data for training models like those from Meta, OpenAI, or Anthropic is just the starting point in understanding the data requirements of today’s large language models (LLMs). The ongoing demand for inference—using LLMs for various applications—is what creates a never-ending cycle of data consumption. It's akin to the classic game Hungry Hungry Hippos, where AI models relentlessly gather data to function effectively.

Specific Datasets for Effective AI Inference

Brad Schneider, founder and CEO of Nomad Data, emphasized that "[Inference is] the bigger market, I don’t think people realize that." Nomad Data operates as a data discovery platform, connecting over 2,500 data vendors to companies seeking specific datasets for their LLM inference needs.

Rather than acting as a data broker, Nomad enables companies to search for data in natural language. For instance, a user might request "a data feed of every roof undergoing construction in the US every month." Schneider explained that many users are unaware of the exact nomenclature for the datasets they need. Nomad's LLMs help identify relevant vendors who can supply the data.
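
Nomad has not published how its matching works under the hood, but the general approach is familiar: embed both the plain-language request and the vendors' dataset descriptions, then rank vendors by similarity. The sketch below illustrates that pattern with an off-the-shelf sentence-embedding model; the vendor names, dataset descriptions, and model choice are all hypothetical and not drawn from Nomad's system.

```python
# Illustrative sketch only: one common way to match a natural-language data
# request to vendor dataset descriptions, using a sentence-embedding model.
# Vendor names and descriptions below are hypothetical.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical vendor catalog: short descriptions of what each vendor offers.
vendor_catalog = {
    "AcmeRoofing Analytics": "Monthly permits and aerial imagery of roof construction projects in the US",
    "ClaimsCo": "Anonymized auto insurance claims with accident location, severity, and repair cost",
    "RetailSignals": "Weekly foot-traffic counts for US shopping malls from mobile location data",
}

# A buyer describes what they need in plain language, as in the article's example.
query = "a data feed of every roof undergoing construction in the US every month"

# Embed the query and the catalog, then rank vendors by cosine similarity.
vendors = list(vendor_catalog.keys())
descriptions = list(vendor_catalog.values())
query_emb = model.encode(query, convert_to_tensor=True)
desc_embs = model.encode(descriptions, convert_to_tensor=True)
scores = util.cos_sim(query_emb, desc_embs)[0]

for vendor, score in sorted(zip(vendors, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {vendor}")
```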

Instantaneous Data Matches

The rapid matching of demand and supply exemplifies the platform's effectiveness. Schneider recalled an insurance company that listed its data on Nomad: almost immediately, another company searched for detailed car accident data, unaware that such information fell under "insurance data."

"This is sort of the magic," Schneider noted.

The Importance of Continuous Data Feeding

While training data is essential, Schneider highlighted that models are trained infrequently, and inference happens continuously—sometimes thousands of times a minute. This ongoing demand for fresh data is crucial for companies leveraging generative AI, particularly for creating valuable insights.

"You need to feed something to it for it to do something interesting," he explained.

Identifying the right data "food" remains a challenge for large enterprises. Tapping internal data is the natural first step, but incorporating high-quality external datasets has historically been difficult, and organizations often struggled to extract useful information from vast archives such as millions of PDFs. Fortunately, LLMs can now swiftly analyze textual data from a wide range of sources, including consumer records and government filings.
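
The mechanics of that kind of extraction are now routine. Below is a minimal sketch, assuming the OpenAI Python SDK and a placeholder model name; the sample filing text and the fields being pulled out are invented for illustration and are not drawn from Nomad's pipeline.

```python
# Illustrative sketch only: using an LLM to pull structured fields out of
# unstructured document text (e.g., text extracted from a PDF filing).
# Requires the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.
# The model name and the sample filing text are placeholders.
from openai import OpenAI

client = OpenAI()

filing_text = """
On March 4, 2023, Example Manufacturing Inc. reported a product recall affecting
12,500 units of its Model X-200 heater, citing a wiring defect identified in
internal testing. Estimated remediation cost: $3.1 million.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[
        {
            "role": "system",
            "content": "Extract company, product, units_affected, and estimated_cost "
                       "from the filing text. Reply with JSON only.",
        },
        {"role": "user", "content": filing_text},
    ],
)

# Print the model's structured answer (JSON-formatted text).
print(response.choices[0].message.content)
```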

Unlocking the Value of Previously Untapped Data

Schneider likened this transformation to uncovering "buried treasure." Data once deemed useless has become highly valuable. Additionally, data is essential for training customized models. For example, to develop a model for recognizing Japanese receipts, a dataset of such receipts is necessary. Similarly, creating a model that identifies advertisements in footage of football fields requires a dataset of relevant video.
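
Schneider's receipt example maps onto a standard fine-tuning workflow: start from a pretrained model and adapt it on the domain-specific dataset you have sourced. The PyTorch sketch below shows the shape of that loop, with random tensors standing in for the hypothetical labeled receipt images; it is not a production training recipe.

```python
# Minimal fine-tuning sketch, not a production recipe: adapting a pretrained
# image model to a domain-specific dataset such as labeled Japanese receipts.
# Random tensors stand in for the real images and labels you would need to source.
import torch
from torch import nn
from torchvision import models

# Start from an ImageNet-pretrained backbone and replace the classification head
# with one sized for the new task (e.g., 2 classes: receipt vs. not-a-receipt).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch: in practice this would come from a DataLoader over receipt images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

model.train()
for step in range(3):  # a real run would loop over many epochs of real data
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.3f}")
```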

Media Companies Monetizing Their Data

Large media companies are also beginning to license their data to LLM firms. OpenAI recently partnered with Axel Springer, while its negotiations with the New York Times ended in a lawsuit. Nomad Data is actively collaborating with media outlets and other companies to expand its data vendor network. Schneider reported that Nomad has engaged several corporations, ranging from automotive manufacturers to insurance companies, that are listing their data on the platform.

The Continuous Demand for LLM Data

In essence, the LLM data supply chain is a self-reinforcing loop. Nomad Data employs LLMs to identify new data vendors and subsequently assists users in locating the data they require. This data is then utilized with LLM APIs for training and inference.

"LLMs are crucial to our business," Schneider emphasized. "As we gather more textual data, we continuously learn how to utilize these diverse datasets."

AI training data is a small fraction of the overall data market, with LLM inference and customized training presenting the most exciting opportunities. Schneider remarked, "Now I can acquire data that previously held no value, which will be instrumental in building my business, thanks to these new technologies."
