OpenAI is developing a new tool designed to give creators greater control over how their content is used to train generative AI models. Called Media Manager, the tool will let creators and content owners identify their works within OpenAI's systems and specify whether they want those works included in or excluded from AI research and training.
OpenAI aims to roll out Media Manager by 2025 as part of its collaborative efforts with “creators, content owners, and regulators” to establish standards—potentially through the industry steering committee it has recently joined. In a blog post, OpenAI explained, “This requires cutting-edge machine learning research to create a unique tool that will help us identify copyrighted text, images, audio, and video across various sources while reflecting the preferences of creators. Over time, we plan to introduce additional choices and features.”
The introduction of Media Manager appears to be OpenAI's response to mounting scrutiny of its data practices, which have relied heavily on scraping publicly available data from the web. Recently, eight major U.S. newspapers, including the Chicago Tribune, sued OpenAI for intellectual property infringement, alleging that the company improperly used their articles to train generative AI models that it then commercialized without compensation or credit.
Generative AI models, including those developed by OpenAI, are trained on vast datasets sourced primarily from public websites. Proponents argue that fair use, the legal doctrine that allows copyrighted works to be used to create new, transformative works, protects their practice of scraping data for model training. This interpretation is contested, however, and critics maintain that the use of copyrighted materials for training should be more strictly regulated.
In a bid to address these concerns and head off legal challenges, OpenAI has taken steps to find common ground with content creators. Last year, the company allowed artists to “opt out” of having their work used in the training datasets for its image-generating models by submitting individual images for removal. Website owners can also use the robots.txt protocol to tell OpenAI's web crawlers, such as GPTBot, whether content on their sites may be scraped for AI training (see the sketch below). And OpenAI continues to pursue licensing agreements with major content providers, including news organizations, stock media libraries, and Q&A platforms like Stack Overflow.
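The robots.txt mechanism is simple in practice: OpenAI has documented that GPTBot checks a site's robots.txt file and skips disallowed paths, so a site-wide rule of "User-agent: GPTBot" followed by "Disallow: /" is enough to opt a site out. As a rough sketch (the site URL below is a placeholder, not a real endpoint), Python's standard-library parser can confirm what a given robots.txt file permits:

```python
# Sketch: checking whether a site's robots.txt permits OpenAI's GPTBot
# crawler. A site that wants to opt out of scraping would publish:
#
#   User-agent: GPTBot
#   Disallow: /
#
# The URL below is a hypothetical placeholder.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

# can_fetch() returns False for any path the published rules disallow.
print(parser.can_fetch("GPTBot", "https://example.com/articles/story.html"))
```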
Despite these efforts, some content creators feel that OpenAI has not gone far enough. Artists have described the image opt-out process, which requires submitting each image individually along with a description, as onerous. Reports indicate that OpenAI pays relatively modest fees for licensed content. And, as OpenAI itself acknowledges, its current solutions do not adequately address situations where creators’ works are quoted, remixed, or shared on platforms beyond their control.
In addition to OpenAI's initiatives, several third-party companies are building universal provenance and opt-out tools for generative AI. The startup Spawning AI, which works with Stability AI and Hugging Face, offers an app that detects and blocks scraping attempts by identifying bots' IP addresses; it also maintains a registry where artists can list their works so that vendors who honor the registry exclude them from training. Other companies, such as Steg.AI and Imatag, help creators establish ownership of their images through imperceptible watermarks, while the University of Chicago's Nightshade project "poisons" image data to disrupt its utility in AI training.
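For a sense of how the scraper-blocking side works, such tools ultimately reduce to checking each incoming request against a list of known crawler IP ranges and user agents. The sketch below is a generic illustration of that technique with hypothetical blocklist entries, not a reconstruction of Spawning's actual product:

```python
# Minimal sketch of IP/user-agent based scraper blocking, the general
# technique behind tools like Spawning's. The blocklist entries here
# are hypothetical placeholders drawn from documentation IP ranges.
from ipaddress import ip_address, ip_network

BLOCKED_NETWORKS = [ip_network("203.0.113.0/24")]  # placeholder range
BLOCKED_AGENTS = {"ExampleScraper"}                # placeholder agent

def should_block(client_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a known scraping bot."""
    addr = ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    return any(agent in user_agent for agent in BLOCKED_AGENTS)

# Requests from a blocked range or with a blocked agent are rejected.
print(should_block("203.0.113.7", "Mozilla/5.0"))          # True
print(should_block("198.51.100.4", "ExampleScraper/1.0"))  # True
print(should_block("198.51.100.4", "Mozilla/5.0"))         # False
```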
Overall, these efforts illustrate the ongoing dialogue regarding the ethical use of data in AI training and the need for greater transparency and protection for creators’ rights in the evolving landscape of generative AI.