An investigation by Proof News has revealed that some of the largest tech companies, including Apple, NVIDIA, and Anthropic, trained their AI models on a dataset containing transcripts from over 173,000 YouTube videos, without obtaining permission from the creators. The dataset, compiled by the nonprofit EleutherAI, includes transcripts from the channels of more than 48,000 creators, among them prominent figures like Marques Brownlee and MrBeast, as well as major news organizations such as The New York Times, BBC, and ABC News.
The investigation highlights a troubling reality in AI development: much of the technology relies on data taken from creators without their consent or compensation. Although the dataset contains only transcripts, not video or images, it still draws heavily on the work of influential content creators.
Marques Brownlee expressed concerns on social media, pointing out that Apple sourced data from various companies, one of which scraped transcripts from YouTube videos, including his. He stated, “This is going to be an evolving problem for a long time,” acknowledging the complex ethical landscape surrounding data usage in AI.
A Google spokesperson said that earlier statements by YouTube CEO Neal Mohan, that companies using YouTube data to train AI models violate the platform's terms of service, still stand. Repeated requests for comment sent to Apple, NVIDIA, Anthropic, and EleutherAI have gone unanswered.
Transparency about the training data used by AI companies remains elusive. Recently, Apple faced criticism from artists and photographers for not disclosing the source of the training data for its upcoming generative AI feature, Apple Intelligence. In response, Apple clarified that its OpenELM model, created strictly for research, does not power its AI or machine learning capabilities. The company has said that its AI models are trained on "licensed data" and publicly available information collected by web crawlers.
YouTube, as the world’s largest video repository, provides an abundance of transcripts, audio, video, and images, making it an appealing resource for developing AI models. Earlier this year, OpenAI’s Chief Technology Officer, Mira Murati, avoided questions regarding whether YouTube videos were used to train Sora, OpenAI’s upcoming AI video generation tool, stating that the data was either publicly available or licensed.
To check whether subtitles from your YouTube videos, or those of your favorite channels, are included in this dataset, visit Proof News' lookup tool.