Gerstgrasser, M., Schaeffer, R., et al. (2024). Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. arXiv (Cornell University).
Abstract
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, whereas an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of language models on text corpora. We confirm that replacing the original real data with each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models is fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.
Here are some thoughts:
This research directly addresses a critical concern for psychologists and researchers who rely on AI: the potential degradation of AI models when they are trained on data generated by previous AI models, a phenomenon known as "model collapse." Prior studies, which typically assumed that old data are discarded and replaced with new AI-generated data, painted a dire picture of inevitable performance decline; this paper offers a more optimistic and arguably more realistic perspective. The authors argue that in the real world data accumulate over time: new AI-generated content is added to the existing pool of human-generated data, not substituted for it. Through experiments with language models, diffusion models for molecular conformation generation, and variational autoencoders for image generation, they demonstrate that this accumulation of data prevents model collapse. Performance remains stable across successive generations of models trained on the growing, mixed dataset. The paper further supports this finding with a mathematical proof in a simplified linear-model setting, showing that when data accumulate the test error stays below a finite bound rather than growing with each model-fitting iteration. For psychologists, this suggests that the increasing presence of AI-generated content on the internet may not catastrophically corrupt future AI tools used in research or clinical settings, as long as training datasets continue to incorporate diverse, original human data alongside synthetic content.
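To make the replace-versus-accumulate contrast concrete, here is a minimal simulation sketch in the spirit of the linear-regression framework the abstract refers to. This is not the authors' code; the dimension, sample size, noise level, and number of generations are arbitrary illustrative assumptions, and ordinary least squares stands in for the fitted model. Each generation labels fresh inputs with the previous generation's model; the "replace" setting keeps only the newest synthetic data, while the "accumulate" setting keeps everything, and the final test error against the true weights shows the growing versus bounded behavior described in the paper.

```python
# Illustrative sketch (not the authors' code): "replace" vs. "accumulate"
# in a linear-regression model-data feedback loop.
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, n_generations = 20, 200, 0.5, 30   # assumed, illustrative values

w_true = rng.normal(size=d)                     # ground-truth weights
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_true

def fit(X, y):
    """Ordinary least squares via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

def test_error(w):
    return np.mean((X_test @ w - y_test) ** 2)

# Generation 0: both settings start from the same real dataset.
X0 = rng.normal(size=(n, d))
y0 = X0 @ w_true + sigma * rng.normal(size=n)

for setting in ("replace", "accumulate"):
    X_pool, y_pool = X0.copy(), y0.copy()
    w_hat = fit(X_pool, y_pool)
    for t in range(1, n_generations + 1):
        # Each generation labels fresh inputs with the *previous* model plus noise.
        X_new = rng.normal(size=(n, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=n)
        if setting == "replace":
            X_pool, y_pool = X_new, y_new                   # discard old data
        else:
            X_pool = np.vstack([X_pool, X_new])             # keep all data so far
            y_pool = np.concatenate([y_pool, y_new])
        w_hat = fit(X_pool, y_pool)
    print(f"{setting:10s} test MSE after {n_generations} generations: {test_error(w_hat):.3f}")
```

Running this, the "replace" setting's error keeps compounding across generations, whereas the "accumulate" setting's error stays close to that of the first fit, mirroring the bounded-error result the paper proves for the accumulating-data regime.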