## News Title: AI Giant OpenAI’s Collection of YouTube Videos to Train GPT-4 Sparks Copyright Dispute

Keywords: OpenAI, YouTube, GPT-4.

### News Content:

#### OpenAI Collects Over One Million Hours of YouTube Videos for GPT-4 Training

**AI company OpenAI has collected over one million hours of YouTube videos to train its most advanced large language model, GPT-4, sparking discussions about copyright law.**

Earlier this week, The Wall Street Journal reported that OpenAI had struggled to collect enough high-quality training data. Today, The New York Times described in more detail how OpenAI addressed the problem, using some methods that fall into a gray area of AI copyright law.

To obtain large volumes of training data, OpenAI developers used the Whisper audio transcription model to transcribe more than one million hours of YouTube videos. The scale of the effort shows how urgently OpenAI needs high-quality training data.

However, this large-scale use of copyrighted material has raised legal and ethical concerns. Some experts believe OpenAI may need permission from the videos' creators for its use of the training data to comply with copyright law. Without that permission, the practice could harm content creators' rights.

OpenAI has stated that it has taken a series of measures to ensure the data it uses complies with legal requirements. For instance, it says it has used technical means to remove speech from the videos to reduce the risk of copyright infringement. OpenAI has also said that it will work with video creators to ensure their rights are protected.

Despite the controversy, OpenAI’s practice has drawn attention and some recognition within the industry. Some experts suggest that the method could become an important trend in AI development: as the technology advances, demand for high-quality training data will keep growing, and OpenAI’s approach may offer a feasible way to meet it.

In summary, OpenAI’s collection of YouTube videos to train GPT-4 reflects its urgent need for high-quality training data and has sparked discussion about AI copyright law. As AI technology continues to advance, balancing data use against copyright protection will be a challenge that industry, academia, and governments need to resolve together.

Source: https://www.ithome.com/0/760/305.htm