Be Aware of the Risks of AI "Model Collapse"

From customer service to content creation, artificial intelligence (AI) is making significant strides across many fields. However, an emerging issue known as “model collapse” threatens to undermine these advances. Defined in a research paper published in July in the journal Nature, “model collapse” refers to the degradation that occurs when AI-generated content contaminates the datasets used to train future generations of machine learning models, severely distorting their outputs.

Reports indicate that “model collapse” is not just a technical concern for data scientists; if left unchecked, it may profoundly impact businesses, technology, and the entire digital ecosystem.

What is Model Collapse?

Most AI models, including GPT-4, are trained on massive datasets predominantly sourced from the internet. Initially, this data comes from human-generated content, showcasing the complexity and diversity of human language, behavior, and culture. AI learns from this information to produce new content. However, when AI searches for new data online to train subsequent models, it risks ingesting its own generated content, creating a feedback loop where one AI's output becomes another's input. This self-referential training can lead to outputs that drift from reality, resembling the distortion seen when a document is repeatedly copied, each iteration losing clarity and detail.
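The degradation loop described above can be illustrated with a toy simulation (a hypothetical sketch for intuition, not the setup used in the Nature study): each “generation” is trained only on samples of the previous generation's output, and rare items steadily disappear.

```python
import random

random.seed(0)

# Start with "human" data drawn from a diverse vocabulary of 1,000 words.
vocab = list(range(1000))
corpus = [random.choice(vocab) for _ in range(2000)]
initial_diversity = len(set(corpus))

# Each new generation is trained only on the previous generation's output:
# resampling with replacement silently drops rare words every round.
for generation in range(10):
    corpus = [random.choice(corpus) for _ in range(2000)]

final_diversity = len(set(corpus))
print(initial_diversity, final_diversity)  # diversity shrinks sharply
```

The point of the sketch is structural: nothing in the loop is malicious, yet self-referential sampling alone is enough to collapse the long tail of the original distribution.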

According to The New York Times, as AI models rely less on human-generated content, the quality and diversity of their outputs decline. Xiong Deyi (熊德意), an expert in the field, points out that authentic human language data typically follows Zipf's Law: a word's frequency is inversely proportional to its rank. This law reflects the long-tail character of human language, in which a wealth of diverse but low-frequency content appears. Because of sampling errors and biases, however, this long tail shrinks in AI-generated output, and the resulting loss of diversity helps precipitate “model collapse.”
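Zipf's Law can be checked numerically. In the idealized form, the word at rank r has frequency proportional to 1/r, so rank times frequency is constant, and the many rare “tail” words collectively carry nearly as much total frequency as the few common “head” words (the constant C below is an arbitrary illustrative scale, not an empirical corpus value):

```python
# Idealized Zipf distribution: frequency of the rank-r word is C / r.
C = 1000.0  # arbitrary scale factor for illustration

# rank * frequency stays constant across ranks.
for rank in (1, 2, 10, 100):
    frequency = C / rank
    assert abs(rank * frequency - C) < 1e-9

# Long tail: the 9,900 rare words (ranks 101..10,000) together carry
# almost as much total frequency as the 100 most common words.
head_mass = sum(C / r for r in range(1, 101))
tail_mass = sum(C / r for r in range(101, 10_001))
print(round(head_mass), round(tail_mass))
```

This is exactly the property that sampling-based generation erodes: when a model under-samples low-frequency words, it is discarding a mass of content comparable to the head of the distribution.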

Is AI Self-Consumption a Bad Thing?

Some experts describe “model collapse” as AI “self-consumption”: models feeding on their own output. Publications such as Forbes have reported that this issue may exacerbate biases and inequalities within AI systems. Nonetheless, not all synthetic data is detrimental. In certain scenarios, synthetic data can enhance AI learning, particularly when smaller models are trained on the outputs of larger ones, or when outputs can be verified, such as solutions to mathematical problems or strategies for games.

Is AI Dominating the Internet?

The challenges of training new AI models reveal broader issues. Scientific American has noted that AI-generated content is increasingly saturating the internet, with large language models producing vast quantities of text across numerous websites. Compared with human authors, AI can generate text far more rapidly and in far larger volumes. OpenAI CEO Sam Altman said earlier this year that the company's models generate approximately 100 billion words daily, the equivalent of more than a million novels, a significant portion of which ends up online.
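The novel comparison is simple arithmetic, assuming a typical novel of roughly 100,000 words (the exact figure varies by genre and is an assumption here, not part of Altman's statement):

```python
words_per_day = 100_000_000_000   # ~100 billion words, per Altman's figure
words_per_novel = 100_000         # assumed average novel length

novels_per_day = words_per_day // words_per_novel
print(f"{novels_per_day:,}")  # 1,000,000
```

Shorter assumed novel lengths (say 80,000 words) would push the figure above a million, consistent with the article's "over a million" phrasing.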

This influx of AI-generated content, including automated tweets, nonsensical images, and fake reviews, has fueled negative perceptions. According to Forbes, the “dead internet” theory holds that the majority of online traffic and content is now produced by bots and AI, diminishing human influence over the internet's trajectory. While this notion originally circulated in niche online forums, it has recently gained wider attention. Fortunately, experts agree that the “death of the internet” is not yet a reality: most popular posts, with their insightful opinions, sharp language, and acute observations, remain human-generated.

However, Xiong Deyi (熊德意) emphasizes that as large AI models become widespread, the proportion of AI-generated data on the internet could rise. An influx of low-quality AI synthetic data may not only trigger “model collapse” but also pose societal risks, including the potential to mislead certain groups with misinformation. The generation of AI content is therefore not merely a technical issue; it is also a societal concern that demands effective responses from both safety governance and technology.
