A recent study published in Nature highlights a serious issue with artificial intelligence (AI): training successive generations of machine learning models on AI-generated datasets can severely "contaminate" their outputs, a phenomenon known as "model collapse." The research shows that original content can degrade into irrelevant gibberish within as few as nine generations; in one example, a passage about architecture ended up as a list of rabbit names. The finding underscores the critical importance of using reliable data to train AI models.
Generative AI tools such as large language models have grown popular, and these systems have so far been trained predominantly on human-generated text. However, as AI-generated content proliferates on the internet, it may increasingly be swept up, recursively, into the training data of other AI models. A collaborative research team, including members from the University of Oxford, has been exploring this issue and discussed the concept in earlier preprints.
In their formally published paper, the team used mathematical models to demonstrate how "model collapse" can arise. They showed that models tend to lose the less common parts of their training data (such as rare text), so that each generation effectively trains itself on only a subset of the original dataset. Analyzing how models behave when fed primarily AI-generated data, the team found that such input weakens the learning ability of subsequent generations of models, ultimately leading to model collapse. Nearly all of the recursively trained language models they tested showed signs of the problem. In one test that began with text about medieval architecture, for example, the output had devolved by the ninth generation into meaningless strings of rabbit names.
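To make the mechanism concrete, here is a deliberately simplified sketch, not the experimental setup used in the paper: treat a "language model" as nothing more than a word-frequency table that, in each generation, is refit to text sampled from the previous generation's table. The vocabulary size, sample size, and Zipf-style starting distribution below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "language model": a unigram (word-frequency) distribution.
# A Zipf-like shape gives a long tail of rare words, mirroring real text.
vocab_size = 1000
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

sample_size = 5000   # size of the synthetic "corpus" drawn each generation
generations = 9

for gen in range(1, generations + 1):
    # Generate a corpus from the current model, then fit the next model
    # only on that generated corpus (recursive training).
    corpus = rng.choice(vocab_size, size=sample_size, p=probs)
    counts = np.bincount(corpus, minlength=vocab_size)
    probs = counts / counts.sum()

    surviving = np.count_nonzero(counts)
    print(f"generation {gen}: {surviving} of {vocab_size} words remain")
```

Because a word whose estimated probability falls to zero can never be sampled again, each generation can only shrink the effective vocabulary; this one-way loss of rare content is the narrowing the researchers describe as the first step toward collapse.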
The team asserts that training AI indiscriminately on data generated by earlier models inevitably leads to collapse, and they emphasize the need for rigorous data filtering. The finding also suggests that models with continued access to genuine human-generated content may prove easier to train effectively.
In many ways, "model collapse" resembles a cancer of AI, progressing through early and late stages. In the early stage, a model exposed to generated data begins to lose some of the correct information it originally held. By the late stage, it may produce wildly inaccurate output that bears little relation to the original data. Alarmingly, once collapse sets in, the model becomes entrenched in its errors, reinforcing them until it treats incorrect outputs as correct. The issue deserves careful attention from anyone working with generative AI, because it effectively poisons a model's understanding of the real world.
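The two stages can also be seen in an even simpler numerical sketch, loosely in the spirit of the paper's theoretical analysis rather than a reproduction of it: a one-dimensional Gaussian is repeatedly refit to samples drawn from the previous fit. The sample size, number of generations, and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# "True" distribution that the first generation learns from.
mean, std = 0.0, 1.0
sample_size = 20      # small samples exaggerate the effect for illustration
generations = 100

for gen in range(1, generations + 1):
    # Each generation sees only data produced by its predecessor.
    data = rng.normal(mean, std, size=sample_size)
    mean, std = data.mean(), data.std()

    if gen in (1, 10, 50, 100):
        print(f"generation {gen:3d}: mean = {mean:+.3f}, std = {std:.3f}")
```

In runs like this, the fitted spread tends to shrink toward zero while the mean drifts away from its starting point, so later generations describe a distribution that bears little resemblance to the original: a numerical analogue of the early loss of information followed by late-stage collapse described above.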