‘Andy Warhol is the only genius I’ve ever known with an IQ of 60.’ –Gore Vidal
AI has poisoned its own well
Replied to The Curse of Recursion: Training on Generated Data Makes Models Forget (arXiv.org)
What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. […] the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
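The mechanism the paper describes can be sketched with a toy stand-in for the LLM setting: repeatedly fit a simple model (here a Gaussian) to samples generated by the previous generation's fit. Finite-sample estimation error compounds across generations, and the fitted distribution's spread tends to drift away from the truth, so the tails of the original distribution get lost. This is only an illustration, not the paper's experiment; the generation count, sample size, and seed below are arbitrary choices, and the exact trajectory depends on the seed.

```python
import random
import statistics

def simulate_collapse(generations: int, n: int, seed: int) -> list[float]:
    """Fit a Gaussian to data sampled from the previous generation's fit.

    Returns the fitted standard deviation at each generation. The 'real'
    distribution is a standard normal; every later generation trains only
    on model-generated samples.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # ground-truth distribution
    sigmas = []
    for _ in range(generations):
        # "Generated content": n samples from the current model, not from reality
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        # Refit on generated data only; estimation error now compounds
        mu, sigma = statistics.fmean(samples), statistics.stdev(samples)
        sigmas.append(sigma)
    return sigmas

sigmas = simulate_collapse(generations=100, n=10, seed=0)
print(f"sigma after generation 1: {sigmas[0]:.3f}")
print(f"sigma after generation 100: {sigmas[-1]:.3f}")
```

With a small sample size per generation, the fitted sigma performs a multiplicative random walk rather than staying at 1.0, which is the toy analogue of the "irreversible defects" the paper reports: once a generation loses part of the distribution, later generations cannot recover it from generated data alone.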
I suspect tech companies (particularly Microsoft / OpenAI and Google) have miscalculated: in their fear of being left behind, they released their generative AI models too early and too widely. By doing so, they've effectively capped how much their products can improve, because of the threat of model collapse.[…]
They need an astronomical amount of training data to make any model better than what already exists. Because they released their models for public use now, while the models are still mediocre, people have pumped the internet full of mediocre generated content with no indication of provenance. […]
Obtaining quality training data is going to be very expensive in five years, especially if AI companies don't win all their lawsuits arguing that training on scraped data is fair use.