In an exclusive interview with the Wall Street Journal, OpenAI CTO Mira Murati discussed the company's Sora text-to-video model, which she suggested could be available to the public within months. The demo featured clips that were by turns impressive and endearing, leaving viewers intrigued and amused in equal measure.
However, the conversation took a turn when Murati was asked about Sora's training data. She stated, "We used publicly available and licensed data," but struggled to clarify whether that included content from YouTube, Facebook, or Instagram. She acknowledged using Shutterstock content, yet her uncertainty about other platforms raised eyebrows: on YouTube, her answer amounted to "I'm not actually sure," and on Facebook and Instagram she would say only that there "might be" publicly available videos in the mix, confirming no specifics.
This ambiguity likely did not please OpenAI's public relations team, especially given the copyright lawsuits the company already faces, including one from the New York Times. The details of the training data matter to many stakeholders, among them authors, photographers, and artists, who want clarity on what content was used to build models like Sora. As reported by The Information, OpenAI allegedly drew on data from a range of online sources, intensifying scrutiny of the company's practices.
The implications of training data extend beyond legal exposure; they touch on trust and transparency. If OpenAI trained on content it deemed "publicly available," what happens when the people who shared that content had no idea it would be used this way? Other tech giants, such as Google and Meta, likewise draw on publicly shared content from platforms they own. That may be legally permissible, but the FTC's recent warning that quietly changing terms of service could be unfair or deceptive suggests regulators are asking the same question about public awareness.
The discourse surrounding training data is foundational to generative AI, and a reckoning looms, not just in the courts but in public perception. As previously noted, these models depend on vast and diverse datasets, and that dependence directly affects the people whose creative work ends up inside them.
Historically, data collection for marketing operated as a give-and-take: users handed over data in exchange for better experiences, even if that exchange disproportionately benefited data brokers. Generative AI shifts the dynamic. Many creators see the use of their publicly shared work as exploitative, a threat to both their livelihoods and their craft.
Experts advocate well-curated training datasets as a way to improve models, framing their value in terms of research rather than commercial exploitation. But as people learn how their content is used to train profit-driven systems, the question remains: will acceptance wane once they discover their videos have fed commercial AI outputs?
As the landscape evolves, companies like OpenAI, Google, and Meta may well capitalize on their early lead. But the unresolved questions around AI training data could carry long-term costs, turning today's head start into a far more complicated bargain.