Is it feasible for an AI to be solely trained on data produced by another AI? While it may seem unconventional, this idea has gained traction as the availability of new, real data dwindles. Notably, Anthropic incorporated synthetic data to train its advanced model, Claude 3.5 Sonnet. Similarly, Meta enhanced its Llama 3.1 models with AI-generated data, and OpenAI is reportedly using synthetic data from its reasoning model, o1, for the upcoming Orion.
But why is data essential for AI? What type of data is required, and can synthetic data truly replace it?
Understanding the Importance of Annotations
AI systems operate as statistical machines. They learn patterns from numerous examples to make predictions, such as recognizing that the phrase “to whom” in an email usually precedes “it may concern.” Annotations—text labels that define the meaning or components of the data—play a pivotal role in this learning process. These labels guide the model in differentiating between objects, locations, and ideas.
For instance, a photo-classifying model trained on images labeled “kitchen” will begin to associate certain features (like fridges and countertops) with kitchens. After training, when presented with an unseen photo of a kitchen, the model should accurately identify it as such. However, if those same images had been labeled “cow” instead, the model would learn to associate kitchen features with cows, which is why accurate annotations are essential.
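To make the role of annotations concrete, here is a deliberately simplified Python sketch (the features and “photos” are invented, and real image classifiers work on pixels rather than hand-written feature lists). The toy model simply counts how often each feature co-occurs with each label, then predicts the label whose counted features best match a new example:

```python
from collections import Counter, defaultdict

# Toy labeled dataset: each "photo" is reduced to a set of visible features,
# and an annotator has attached a label to it.
training_data = [
    ({"fridge", "countertop", "sink"}, "kitchen"),
    ({"oven", "countertop", "cabinet"}, "kitchen"),
    ({"bed", "nightstand", "lamp"}, "bedroom"),
    ({"bed", "wardrobe", "curtains"}, "bedroom"),
]

# "Training": count how often each feature co-occurs with each label.
feature_counts = defaultdict(Counter)
for features, label in training_data:
    for feature in features:
        feature_counts[feature][label] += 1

def predict(features):
    # Score each label by how many of the new photo's features were
    # seen with that label during training.
    scores = Counter()
    for feature in features:
        scores.update(feature_counts.get(feature, Counter()))
    return scores.most_common(1)[0][0] if scores else None

# An unseen "photo" with kitchen-like features is identified as a kitchen,
# but only because the annotations above were accurate.
print(predict({"fridge", "sink", "window"}))  # kitchen
```

Swap the “kitchen” labels for “cow” in the training data and the same code will confidently call kitchens cows, which is the point: the model only knows what its annotations tell it.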
The growing demand for AI and the requisite labeled data have driven the market for annotation services to an estimated $838.2 million today, forecasted to soar to $10.34 billion within a decade. While exact figures on how many individuals are involved in labeling work are unclear, a 2022 study suggests millions are engaged in this critical task.
Organizations, ranging from startups to large enterprises, depend on data annotation firms to produce labels for AI training datasets. Some positions offer competitive wages, particularly when specialized knowledge is required (e.g., in technical fields like mathematics). Unfortunately, many annotators in developing countries earn only a few dollars an hour, often with no benefits and no guarantee of future work.
A Drying Data Well
Beyond the human element, there are practical reasons to explore alternatives to human-generated labels. Humans can only label so fast, and their biases can seep into the models trained on their annotations. They also make mistakes and can stumble over complex labeling instructions, and hiring them at scale is expensive.
Accessing data in general is costly. For instance, Shutterstock charges AI companies tens of millions of dollars for access to its extensive archives, while Reddit has profited substantially from licensing data to companies such as Google and OpenAI.
Acquiring data is becoming even more challenging. Many models are trained on vast collections of publicly available data, but data owners are increasingly restricting access out of concern over plagiarism and improper attribution. More than 35% of the world’s top 1,000 websites now block OpenAI's web scraper, and recent studies indicate that around 25% of data from reputable sources has become inaccessible to major AI training datasets. If this trend persists, the research group Epoch AI predicts that developers could run out of data to train generative AI models sometime between 2026 and 2032. That prospect, combined with fears of copyright infringement and problematic content in open datasets, has prompted a reevaluation within the AI community.
Exploring Synthetic Alternatives
Initially, synthetic data appears to offer a comprehensive solution. Need annotations? Generate them. Need more example data? Generate that, too. While this approach holds promise, it presents its own set of challenges.
“If ‘data is the new oil,’ then synthetic data serves as a biofuel, produced without the drawbacks linked to its natural counterpart,” says Os Keyes, a PhD candidate at the University of Washington focused on the ethical impact of emerging technologies. “A small dataset can be used to simulate and derive new entries.”
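In practice, “simulating and deriving new entries” often means prompting an existing model to write labeled examples. The sketch below is a hypothetical illustration using the OpenAI Python SDK; the model name, the prompt, the labels, and the topics are placeholders, not a recipe any company mentioned here has published:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_synthetic_examples(topic: str, n: int = 5) -> list[str]:
    """Ask a model to invent short, pre-labeled training examples for a topic."""
    prompt = (
        f"Write {n} short customer-support emails about '{topic}'. "
        "Label each one on its own line as URGENT or ROUTINE, "
        "followed by a colon and the email text."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Each non-empty line of the model's reply becomes one synthetic example.
    return [
        line for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]

# A small seed of topics can be expanded into many labeled examples.
synthetic_dataset = []
for topic in ["billing errors", "password resets"]:
    synthetic_dataset.extend(generate_synthetic_examples(topic))
```

A handful of seed topics can be expanded into thousands of pre-annotated examples this way, with no human labeler in the loop, which is precisely the promise, and the risk, discussed in the rest of this piece.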
The AI sector has eagerly embraced this concept. Recently, Writer, a generative AI company, launched the Palmyra X 004 model, reportedly trained almost entirely on synthetic data at a development cost of just $700,000—significantly lower than the estimated $4.6 million for similar models from OpenAI.
Additionally, Microsoft and Google have utilized synthetic data for their Phi and Gemma models, respectively. Nvidia introduced a model family aimed at generating synthetic training data, and AI startup Hugging Face has claimed to produce the largest synthetic text training dataset to date. The synthetic data generation industry is on track to reach a valuation of $2.34 billion by 2030, with Gartner expecting that 60% of data for AI and analytics projects this year will be generated synthetically.
Luca Soldaini, a senior research scientist at the Allen Institute for AI, highlights that synthetic data methods can generate training data in ways that traditional scraping or licensing cannot readily achieve. For instance, while training Movie Gen, Meta used Llama 3 to create captions for training footage, which humans then refined for detail.
OpenAI has also reported fine-tuning GPT-4o using synthetic data to enhance the Canvas feature in ChatGPT. Similarly, Amazon generates synthetic data to supplement the real-world data used in training its speech recognition models for Alexa.
"Synthetic data models can effectively extend our understanding of what data is necessary to achieve a particular model behavior," Soldaini states.
The Risks of Synthetic Data
However, synthetic data is not devoid of risks. It encounters the same “garbage in, garbage out” issue as all AI systems. If the data used to train models contains biases or limitations, the synthetic data generated will reflect those flaws. If certain groups are underrepresented in the base dataset, the synthetic outputs will perpetuate this disparity.
“You can only extrapolate so much,” warns Keyes. For example, if there are only 30 Black individuals in a dataset, extrapolation may assist, but if those individuals are all from a specific socio-economic background, the resulting synthetic data will mirror that limitation.
A 2023 study by researchers at Rice University and Stanford found that heavy reliance on synthetic data during training could yield models with diminishing quality and diversity. Sampling bias, in which parts of the real world are poorly represented, causes a model's diversity to decline after several training generations. However, the researchers also found that mixing in some real-world data helps to mitigate the problem.
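The diversity loss the researchers describe can be seen in a toy, self-contained simulation (the data and numbers below are invented for illustration and have nothing to do with the study's actual methodology). Each generation's “model” is just a word-frequency table fit to the previous generation's output, and any rare word that happens not to be sampled disappears for good:

```python
import random
from collections import Counter

random.seed(42)

# Generation 0: "real" data with a few very common words and many rare ones.
data = ["the", "and", "of"] * 100 + [f"rare_word_{i}" for i in range(50)]

for generation in range(1, 6):
    # "Train" a model on the current data: just estimate word frequencies.
    counts = Counter(data)
    words, weights = zip(*counts.items())
    # The next generation trains only on samples drawn from that model.
    data = random.choices(words, weights=weights, k=len(data))
    print(f"generation {generation}: distinct words = {len(set(data))}")
```

The count of distinct words can only shrink from one generation to the next, since each new model assigns zero probability to anything it never saw. Mixing a slice of the original, real data back into each generation would reintroduce the rare words, which mirrors the researchers' observation that real-world data mitigates the decline.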
Keyes raises concerns regarding complex models like OpenAI's o1, which may produce subtle hallucinations in their synthetic outputs, adversely affecting the accuracy of the trained models, particularly if the origins of these inaccuracies are not easily traceable. “Complex models produce hallucinations, and when built on flawed data, those models risk propagating errors,” Keyes explains.
Cascading hallucinations may result in models generating nonsensical data. Research featured in the journal Nature indicates that models trained on error-filled datasets not only reproduce inaccuracies but amplify them, leading to deteriorating future model generations. These models risk losing grasp of more nuanced knowledge over time, becoming increasingly generic and frequently providing irrelevant responses to user queries.
Soldaini agrees that raw synthetic data shouldn't be trusted without safeguards, at least not if the goal is to avoid training unresponsive chatbots and homogeneous image generators. Using it safely requires meticulous review, curation, and filtering, ideally paired with fresh, real data, just as with any other dataset.
“If not properly managed, synthetic data can lead to model collapse, diminishing creativity and increasing bias in outputs, which would severely impact functionality,” he warns. Monitoring the generation process and establishing robust quality checks are essential to prevent such risks.
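What that review-and-filtering step might look like, in very schematic form: the Python sketch below dedupes synthetic examples, drops those that fail a quality heuristic, and blends the rest with real data. The function name, the quality check, and the mixing ratio are all invented for illustration; no lab's actual curation pipeline is this simple.

```python
import random

def curate_synthetic(synthetic, real, quality_check, real_fraction=0.5, seed=0):
    """Toy curation pipeline: dedupe and filter synthetic examples,
    then blend the survivors with real examples at a fixed ratio."""
    # 1. Drop exact duplicates (generators tend to repeat themselves).
    unique = list(dict.fromkeys(synthetic))
    # 2. Keep only examples that pass a quality heuristic or review step.
    kept = [example for example in unique if quality_check(example)]
    # 3. Blend with real data so the model never trains on synthetic text alone.
    n_real = int(len(kept) * real_fraction / (1 - real_fraction))
    rng = random.Random(seed)
    mixed = kept + rng.sample(real, min(n_real, len(real)))
    rng.shuffle(mixed)
    return mixed

# Example with an invented length-based quality check.
synthetic = ["ok", "a reasonably detailed synthetic example about billing"] * 3
real = [f"real example {i}" for i in range(10)]
dataset = curate_synthetic(
    synthetic, real, quality_check=lambda ex: len(ex.split()) >= 4
)
print(len(dataset), dataset[:2])
```

The real_fraction parameter is the design choice doing the work here: it caps how much of the final mix is synthetic, echoing the point above that fresh real-world data keeps self-generated data from compounding its own flaws.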
While OpenAI CEO Sam Altman once suggested that AI might someday generate synthetic data capable of training itself, such technology is not yet available. No leading AI lab has released a model trained exclusively on synthetic data.
For the foreseeable future, it seems that human involvement is still vital to ensure the integrity of AI training processes.