As artificial intelligence (AI) technology advances rapidly, data has become a critical driver of AI model development. However, a recent report by The Wall Street Journal highlights the unprecedented difficulty AI companies face in acquiring high-quality training data. Today, The New York Times examines the strategies these companies use to work around the problem, and in particular how they navigate the unsettled area of copyright law as it applies to AI.
OpenAI, a leader in the AI sector, has a particularly pressing need for training data. According to reports, the company used its Whisper audio transcription technology to transcribe over a million hours of YouTube videos, then used the resulting text to train its advanced GPT-4 large language model. OpenAI has also drawn on other data sources, including code from GitHub, chess move databases, and educational content from Quizlet.
This approach has sparked legal controversy. OpenAI asserts that its use of the data qualifies as fair use, but The Times reports that OpenAI President Greg Brockman was personally involved in collecting the videos, which further complicates the copyright question.
In a statement to The Verge, an OpenAI spokesperson said that the company curates unique datasets for each of its models, with the aim of improving the models' understanding of the world and keeping its research globally competitive. The spokesperson added that OpenAI is exploring the generation of synthetic data to reduce its dependence on external data sources.
Google has expressed concern over OpenAI's practices. A spokesperson said via email that the company has seen unconfirmed reports of OpenAI's activities, and emphasized that both Google's robots.txt files and its terms of service prohibit unauthorized scraping or downloading of YouTube content.
In a recent interview, YouTube CEO Neal Mohan said that although there is no direct evidence that OpenAI used YouTube videos to train its Sora model, doing so would violate YouTube's terms of service.
Meanwhile, Meta is grappling with its own data shortage. According to The Times, as Meta's AI team works to catch up with OpenAI, it has considered using copyrighted works without authorization. To expand its datasets, Meta has gone through a vast array of English-language books, essays, poetry, and news articles, and has discussed paying for book licenses or acquiring large publishers outright.
These developments underscore the legal and ethical challenges the AI industry faces in collecting and using data. As the technology progresses, an urgent question arises: how can AI models continue to improve while respecting copyright protections? Moving forward, AI companies and regulators will need to work together to establish clearer, fairer rules that support the healthy and sustainable development of AI technology.