### Training AI on AI-Generated Data: International Research Reveals the Risk of "Model Collapse"
A recent computer science paper published in the international academic journal "Nature" reveals a new challenge in artificial intelligence (AI): training future machine learning models on AI-generated datasets can lead to "model collapse." In this phenomenon, the original content gradually degenerates into irrelevant gibberish over successive generations of AI models, underscoring the importance of ensuring high-quality AI training data.
With the widespread adoption of generative AI tools, particularly large language models, AI models have so far been trained largely on human-generated input. However, as AI-generated content spreads across the internet, it can be repeatedly reused to train other AI models, or the models themselves, forming a recursive loop. The research team simulated this process with mathematical models and showed that AI models may come to ignore certain outputs in the training data, leaving them to train themselves on only a portion of the dataset.
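To make the recursive loop concrete, the toy sketch below shows the basic mechanism on a one-dimensional Gaussian: each "generation" is fit only to a finite sample drawn from the previous generation's model, so the tails of the distribution are progressively lost. This is an illustrative assumption-based analogue, not the actual experiment or code from the Nature paper.

```python
# Toy sketch of recursive training on model-generated data ("model collapse").
# Generation 0 is "human" data from a standard normal; every later generation
# is a Gaussian fit to a small sample generated by the previous generation.
# Because each finite sample under-represents the tails, the fitted variance
# tends to shrink, so later generations see an ever-narrower slice of the data.
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0   # generation 0: parameters of the "human" data
n_samples = 20         # small synthetic training sets make the drift visible

for generation in range(1, 51):
    data = rng.normal(mu, sigma, size=n_samples)  # synthetic training set
    mu, sigma = data.mean(), data.std()           # "train" the next model
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

Running the sketch shows sigma drifting downward across generations, a simplified analogue of the degradation the article describes for recursively trained language models.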
The study found that when AI models are fed AI-generated datasets, their learning capacity diminishes over successive generations, culminating in model collapse. Repeated phrases appeared in nearly all of the recursively trained language models tested. For example, a model whose initial input was text about medieval architecture was, by the ninth generation, outputting little more than names of jackrabbits. This suggests that such models tend to lock into fixed output patterns during training rather than learning a broader, richer body of knowledge.
To address this problem, the research team recommends strict filtering of the data whenever AI models are trained on their own output. Companies that rely on human-generated content can also improve their data-filtering strategies to train more efficient and reliable AI models. The finding challenges AI researchers and developers to pay closer attention to data quality and to control of the training process when building AI systems, so as to avoid risks such as model collapse and to keep AI technology developing safely and effectively.
【Source】http://www.chinanews.com/gj/2024/07-27/10258447.shtml