YouTuber Initiates Class Action Lawsuit Against OpenAI for Scraping Creator Transcripts

A YouTube creator is initiating a class action lawsuit against OpenAI, claiming the company trained its generative AI models using millions of transcripts from YouTube videos without informing or compensating their owners.

In a complaint filed on Friday in the U.S. District Court for the Northern District of California, attorneys representing David Millette, a Massachusetts-based YouTube creator, allege that OpenAI covertly transcribed Millette's and other creators’ videos to develop the models behind its AI chatbot platform, ChatGPT, among other generative AI tools. The complaint states that by collecting this data, OpenAI has "profited significantly" from the creators' content while breaching copyright law and violating YouTube's terms of service, which prohibit the use of videos in applications outside of its platform.

“As the sophistication of [OpenAI’s] AI products increases, their value to users grows, leading to more subscriptions to [OpenAI’s] offerings,” the complaint states. “However, much of the content in OpenAI’s training datasets comes from works that were copied without consent, credit, or compensation.”

Millette, represented by the law firm Bursor & Fisher, is seeking a jury trial and more than $5 million in damages on behalf of all YouTube users and creators whose content may have been used to train OpenAI's models.

Generative AI models, including those developed by OpenAI, do not possess true intelligence. Instead, they analyze vast amounts of input data, such as movies, voice recordings, and essays, to identify patterns and predict how likely a given piece of data is to occur in a particular context. Most training data originates from public websites and datasets, with companies often arguing that the fair use doctrine protects their indiscriminate data gathering. However, many copyright holders dispute this claim and are pursuing legal action to stop these practices.

As traditional sources of data dwindle, video transcriptions have increasingly become essential training material. Data from Originality.AI reveals that over 35% of the top 1,000 websites currently block OpenAI’s web crawler. A study by MIT’s Data Provenance Initiative found that around 25% of data from “high-quality” sources is now restricted from major datasets used for training AI models. If the current trend continues, Epoch AI predicts developers could exhaust available training data for generative AI models between 2026 and 2032.

In April, The New York Times reported that OpenAI developed its first speech recognition model, Whisper, intended for transcribing audio from videos to gather additional training data. An OpenAI team, including President Greg Brockman, transcribed over a million hours of YouTube video content using Whisper and subsequently used these transcripts to enhance its text-generating and text-analyzing model, GPT-4. Some OpenAI employees reportedly expressed concerns that this approach could violate YouTube's policies.

In July, Proof News reported that companies including Anthropic, Apple, Salesforce, and Nvidia had drawn from a dataset called The Pile, which includes subtitles from hundreds of thousands of YouTube videos, to train generative AI models. Many YouTube creators whose subtitles were incorporated into The Pile were unaware of this use and had not consented to it. Apple later clarified that it had no intention of leveraging these models for AI-related features in its products.

Additionally, Google, YouTube's parent company, has sought to use transcripts to train its own models. Last year, Google revised its terms of service (ToS), in part, to allow for increased access to user data for generative AI model training. Under the previous terms, it was unclear whether Google could use YouTube data to build products beyond the video platform; the revised terms are considerably more permissive.

We have reached out to OpenAI and Google for comment on the class action lawsuit and will update this article if we receive a response.

OpenAI has faced challenges recently, with Tesla and X CEO Elon Musk filing a new lawsuit against the company and CEO Sam Altman. Musk's complaint accuses OpenAI of straying from its original nonprofit mission by reserving some of its advanced technology for commercial clients. This echoes a similar lawsuit Musk filed against OpenAI in February, but the new filing additionally alleges that the company is involved in racketeering activities.
