When news surfaced last year about a significant partnership between AI powerhouse OpenAI and media giant Axel Springer, it signaled potential collaboration between content creators and tech firms that leverage their work in developing and training artificial intelligence models. This agreement came shortly after OpenAI established a similar partnership with the Associated Press.
However, as the year wound down, the New York Times filed a lawsuit against OpenAI and its investor Microsoft, claiming that the generative AI models created by OpenAI were constructed by "copying and utilizing millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more." The Times alleges that OpenAI's models can produce outputs that quote its content verbatim, summarize it closely, and replicate its distinctive style, citing numerous examples to support its claims.
The Times further asserted in its lawsuit that it "objected after discovering that the defendants were using Times content without permission to develop their models and tools," and that negotiations with OpenAI have not resulted in a resolution.
How to reconcile copyright protection with the necessity for AI development is a complex issue that won’t be solved overnight. The ongoing agreements and disputes between content creators and AI companies highlight a tense moment in the industry. Tech companies are rapidly integrating new generative AI models that rely on copyright-protected material into their products, with Microsoft leading this charge. Meanwhile, media organizations that have invested heavily in building their proprietary content are understandably frustrated that their hard work is being absorbed into AI systems that yield profits without compensating the original creators.
Both perspectives are compelling. Tech companies have a long-standing practice of traversing the internet and collecting data to help users navigate information, much like traditional search engines. So why should training AI models be viewed any differently? On the flip side, media professionals have seen their industry suffer—especially in journalism, where the Times plays a significant role—and they are reluctant to witness yet another generation of tech solutions profiting off their work while offering little to no compensation.
While we might each hold personal opinions on this debate, let’s explore some of the key arguments surrounding the contentious issue of AI training data. Understanding these facets will be essential in the conversation leading into 2024. Let’s dive in!
The Times’ Argument
The lawsuit is a lengthy read, but several key points stand out.
The Times emphasizes that producing high-quality journalism incurs significant costs, and it argues that copyright is vital for safeguarding its work and sustaining its business model. The Times has a history of licensing its materials, meaning others can utilize its journalism—but they must pay for that access. The publication differentiates these licensing agreements from how it collaborates with search engines: “While The Times allows search engines to index its content for traditional search results, it has never granted permission to any entity, including the defendants, to utilize its content for generative AI purposes.”
This raises an essential question: If large language models (LLMs) are trained on vast datasets, why does it matter where each individual piece originates? The Times contends that its material was used extensively, and in a way that contributed directly to the development of commercial products sold for profit.
In the lawsuit, the Times points out that the “training dataset for GPT-2 includes an internal corpus OpenAI constructed called ‘WebText,’ which consists of ‘the text contents of 45 million links posted by users of the ‘Reddit’ social network.’” The Times is prominently featured in this dataset. This is significant because OpenAI has said that WebText was designed to emphasize content quality. By that logic, Times material was included precisely because it would improve the model's output.
The Times further references WebText2, used to train GPT-3, noting that it makes up “22% of the training mix for GPT-3, despite representing less than 4% of the total tokens.” The complaint adds that Times content, 209,707 unique URLs, “constitutes 1.23% of sources listed in OpenWebText2, an open-source re-creation of the WebText2 dataset used for training GPT-3.”
The Times underscores that even OpenAI concedes how important Times journalism was to developing some of OpenAI's best-known models, arguing that its material appeared more frequently than other sources and was weighted more heavily because of its quality.
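To put those weighting figures in perspective, here is a minimal sketch of the arithmetic, using only the numbers quoted from the complaint above. The variable and function names are illustrative. Because the quoted percentages are rounded, and the token share is given only as "less than 4%," the results are rough bounds rather than exact values.

```python
# Figures as quoted from the complaint in this article.
MIX_SHARE = 0.22          # WebText2's share of the GPT-3 training mix
TOKEN_SHARE_MAX = 0.04    # WebText2's share of total tokens ("less than 4%")
TIMES_URLS = 209_707      # unique Times URLs listed in OpenWebText2
TIMES_URL_SHARE = 0.0123  # Times share of OpenWebText2 sources (1.23%)


def oversampling_factor(mix_share: float, token_share: float) -> float:
    """How heavily a dataset is sampled relative to its raw size."""
    return mix_share / token_share


def implied_total(count: int, share: float) -> float:
    """Total number of items implied by a count and its fractional share."""
    return count / share


# The token share is an upper bound, so this factor is a lower bound.
min_factor = oversampling_factor(MIX_SHARE, TOKEN_SHARE_MAX)

# The 1.23% figure is rounded, so this total is only approximate.
implied_total_urls = implied_total(TIMES_URLS, TIMES_URL_SHARE)

print(f"WebText2 oversampled by at least {min_factor:.1f}x")
print(f"Implied OpenWebText2 source count: about {implied_total_urls:,.0f} URLs")
```

Under these assumptions, WebText2 was sampled at least 5.5 times more heavily than its share of tokens alone would suggest, and the quoted 1.23% implies a source list of roughly 17 million URLs.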
Essentially, the Times’ stance can be summarized as: “You utilized our content to enhance your product, which is now generating significant revenue, so you owe us compensation for our work.”
The Tech Perspective
In an April discussion hosted by the U.S. Copyright Office, representatives from the technology and venture capital sectors, along with rights holders, debated this very issue. The arguments made by the well-known venture firm a16z are instructive.
A16z argued that "most of the time, the output from a generative AI service is not 'substantially similar' in the copyright sense to any particular copyrighted work used for training." They further asserted that “the volume of data required for AI model creation is so immense that collective licensing is nearly impossible. When discussing large language models, we’re effectively training on the entirety of written language.”
In an October comment submitted to the U.S. Copyright Office, the firm reiterated that "copies of copyrighted works needed for developing a productive technology with non-infringing outputs have long been supported by our copyright laws through the fair use doctrine." The firm emphasized that without this doctrine, technologies such as search engines and online book searches would not exist.
They feel that training AI models should be viewed similarly: "The mass use of copyrighted works to teach an AI model—by isolating statistical patterns and non-expressive information—does not infringe copyright. Should the U.S. impose liability on AI creators for copyright infringement, it may stifle their development."
While the law around generative AI is still unsettled, the tech industry argues that precedent already allows the use of vast amounts of data, including copyright-protected material, for technological advancement without licensing fees.
Reevaluating Scale
A fascinating aspect of the debate is the question of scale. Benedict Evans, a prominent tech thinker, noted that “AI makes practical at a massive scale what was once possible only on a smaller scale.” He compared it to the difference between police carrying wanted posters in their pockets and surveillance cameras on every street corner: “this change in scale could imply a shift in principle.”
Both the Times and the tech industry are arguing over what current law allows. Evans suggests that the vast ingestion of data for AI model training could leave existing laws out of step with what society actually wants, and he points out that the law can change if elected officials choose to change it.
To summarize, the Times asserts that its data was heavily and beneficially utilized in training certain OpenAI models, justifying compensation for its use. In contrast, OpenAI and its supporters are banking on existing legal protections and fair use to minimize their liabilities while capitalizing on new technologies. There may also be a pressing need for new regulations to address these challenges, given that current laws may inadequately reflect the new scale of operations.
From my vantage point, I don’t expect OpenAI to compensate me for any of my writing it may have assimilated. However, since my work primarily belongs to my employers—who possess a substantial amount of material and greater legal resources—this situation may have broader implications down the line. Reporting on these evolving dynamics will undoubtedly be sensitive, but it may also shed light on how future AI technologies engage with copyright considerations.