How Apple, Anthropic, and Other Companies Leveraged YouTube Videos for AI Training

Subtitles from more than 170,000 YouTube videos have been used to train AI systems for major tech companies, according to an investigation by Proof News and Wired. Companies including Apple, Anthropic, Nvidia, and Salesforce have used the “YouTube Subtitles” dataset, which was obtained without permission. The dataset comprises subtitles from over 48,000 YouTube channels, including popular creators such as MrBeast and Marques Brownlee, as well as clips from news outlets like ABC News, the BBC, and The New York Times.

Brownlee, also known as MKBHD, addressed the practice in a post on X: “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine.” He indicated that the issue is likely to persist. YouTube has not responded to the investigation.

Proof News has also published an interactive lookup tool that lets users check whether their content, or that of their favorite YouTubers, is included in the dataset. The subtitles dataset is one part of a larger collection from the nonprofit EleutherAI known as The Pile, which also includes datasets of books and Wikipedia articles. Previous analyses of datasets such as Books3 have shown which authors' works were used to train AI systems, fueling lawsuits from authors against the companies responsible.

Transparency about the data used to train AI systems remains limited, and questions about the use of YouTube content in particular have recently gained traction. When OpenAI launched its video generation tool, Sora, CTO Mira Murati sidestepped detailed questions about the training data, stating that it consisted of publicly available or licensed data. She said she was not sure whether YouTube content was included.

YouTube CEO Neal Mohan has previously said that using video content, including transcripts, to train AI breaches the platform's terms of service. Speaking on Decoder, Google CEO Sundar Pichai echoed this view, noting that training Sora on YouTube content would violate those terms. “We have terms and conditions, and we would expect people to abide by those terms and conditions when you build a product,” Pichai said.
