NVIDIA’s Secret Video Model Project Exposed: Scraping 80 Years of Video Every Day

NVIDIA’s Mysterious Video Foundation Model “Cosmos” Exposed, Built Entirely on Data Taken Without Permission

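NVIDIA’s mysterious video foundation model “Cosmos” recently came to light and has drawn widespread attention. According to internal chats, emails, and documents obtained by 404 Media, NVIDIA has been aggressively scraping video from YouTube and other sources to train its AI products.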

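According to 404 Media’s report, the project, internally named Cosmos, aims to build a state-of-the-art video foundation model that encapsulates the simulation of light transport, physics, and intelligence in one place, unlocking a range of downstream applications critical to NVIDIA. To collect training videos, NVIDIA employees used the open-source YouTube video downloader “yt-dlp” and ran it on 20 to 30 virtual machines on Amazon Web Services, downloading 80 years’ worth of video per day.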

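This practice has raised concerns among copyright holders. A Google spokesperson said that if OpenAI used YouTube videos to improve its AI video generator Sora, it would “clearly violate” YouTube’s terms of use. A Netflix spokesperson said the company has no agreement with NVIDIA on content collection and that the platform’s terms of service do not permit scraping.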

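NVIDIA nevertheless appears unconcerned. When employees working on the project raised legal questions, project managers often dismissed them, saying the decision to scrape videos without permission was an “executive decision” that employees did not need to worry about. In a May email, Ming-Yu Liu, NVIDIA’s vice president of research and head of the Cosmos project, said the team was finalizing the v1 data pipeline and securing the necessary compute resources to build a video data factory that could produce a human lifetime’s worth of visual experience in training data every day.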

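404 Media also found that NVIDIA researchers had proposed using Hollywood films such as “Avatar” and “The Lord of the Rings” to train the model, but the proposal was not carried out because of copyright sensitivities.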

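As the technology develops rapidly, how AI models obtain their training data, and the copyright questions that follow, are drawing growing attention. NVIDIA’s Cosmos project has renewed the debate over legitimate sources of training data, and it reminds both the industry and regulators that oversight of training data needs to be strengthened to balance technological innovation with copyright protection.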

English version:

Title: NVIDIA’s Secret Video Model Project Exposed: Scraping 80 Years of Video Every Day

Keywords: NVIDIA, Video Model, Data Controversy

Content:

NVIDIA’s mysterious video foundation model known as “Cosmos” has been exposed, sparking widespread attention. According to internal chats, emails, and documents obtained by 404 Media, NVIDIA is furiously scraping video data from YouTube and other sources to train its AI products.

The project, internally named Cosmos, aims to build a state-of-the-art video foundation model that encapsulates the simulation of light transport, physics, and intelligence in one place, unlocking a range of downstream applications critical to NVIDIA. To collect training videos, NVIDIA employees used the open-source YouTube video downloader “yt-dlp” and ran it on 20 to 30 virtual machines on Amazon Web Services, downloading 80 years’ worth of video per day.

This practice has raised concerns among copyright holders. A Google spokesperson said that if OpenAI used YouTube videos to improve its AI video generator Sora, it would “clearly violate” YouTube’s terms of use. A Netflix spokesperson said the company has no agreement with NVIDIA on content collection and that the platform’s terms of service do not permit scraping.

Despite these concerns, NVIDIA appears unfazed. When employees working on the project raised legal questions, project managers often dismissed them, saying the decision to scrape videos without permission was an “executive decision” that employees did not need to worry about. In a May email, Ming-Yu Liu, NVIDIA’s vice president of research and head of the Cosmos project, said the team was finalizing the v1 data pipeline and securing the necessary compute resources to build a video data factory that could produce a human lifetime’s worth of visual experience in training data every day.

404 Media also found that NVIDIA researchers had proposed using Hollywood films such as “Avatar” and “The Lord of the Rings” to train the model, but the proposal was not carried out because of copyright sensitivities.

As the technology develops rapidly, how AI models obtain their training data, and the copyright questions that follow, are drawing growing attention. NVIDIA’s Cosmos project has renewed the debate over legitimate sources of training data, and it reminds both the industry and regulators that oversight of training data needs to be strengthened to balance technological innovation with copyright protection.

Source: https://www.jiqizhixin.com/articles/2024-08-06-2