Meta Embroiled in Copyright Dispute Over Unlicensed Use of "Pirated" Books to Train AI

Keywords: Copyright Controversy, Meta, AI Training

Social media giant Meta Platforms Inc. recently admitted in a lawsuit that it used books that may be protected by copyright, without authorization, to train its artificial intelligence models. Meta reportedly trained its Llama 1 and Llama 2 large language models on a dataset called Books3, which contains the plain text of nearly 200,000 books and totals about 37GB.

Books3 is a well-known, openly distributed book dataset compiled from books that are publicly available online, but its copyright status is complicated. In the lawsuit, Meta argued that it needs no "consent, licensing, or payment" to use copyrighted works to train large models, and that any unauthorized copying of copyrighted works in Books3 should be treated as "fair use".

This position has drawn wide attention and controversy across the industry and among copyright holders. Fair use is an important concept in copyright law that permits the use of a work without the copyright holder's permission under certain conditions. Whether Meta can rely on it for large-scale AI training, particularly training that involves large amounts of non-open-source, copyrighted material, has no precedent to draw on.

Meta's conduct may have violated copyright law and could have far-reaching effects on future AI research and commercial applications. On the one hand, training AI models requires large amounts of data, and that data often requires extensive copyright clearances. On the other hand, if using works to train AI is deemed fair use, that could challenge traditional copyright protection mechanisms.

As technology develops rapidly, balancing copyright protection against the use of new technologies has become a question that urgently needs an answer. Meta's conduct has undoubtedly pushed that question to the forefront.

Source: https://www.techspot.com/news/101507-meta-admits-using-pirated-books-train-ai-but.html