OpenAI, the creator of ChatGPT, claims that training advanced AI models like GPT-4 without using copyrighted materials is fundamentally impossible. This assertion comes amid a lawsuit filed by The New York Times, accusing OpenAI and its major investor, Microsoft, of copyright infringement. The newspaper alleges that ChatGPT has been trained on its copyrighted news content and replicates it "near-verbatim."
In its defense, OpenAI emphasizes that limiting training data solely to public domain sources would hinder the development of AI technologies that meet the needs of contemporary society. The company told the U.K. House of Lords' Communications and Digital Select Committee that “copyright today encompasses virtually every form of human expression,” including blog posts, photographs, forum discussions, and even government documents. Restricting access to these materials, it argued, would critically impair AI development.
Additionally, OpenAI argues that its practices conform to existing law: copyright statutes do not explicitly prohibit training AI models, and the company frames the practice as a matter of fair use. It also highlighted mechanisms in place to respect copyright, such as allowing websites to block its web crawler, GPTBot, and providing an opt-out process for content creators who prefer their work not to be included in future training datasets.
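For context, the GPTBot opt-out relies on the standard robots.txt convention. A minimal sketch of the directive a publisher could add to block the crawler site-wide (assuming the site serves a robots.txt file at its root) looks like this:

```
# robots.txt — example only: blocks OpenAI's GPTBot crawler from the entire site
User-agent: GPTBot
Disallow: /
```

This is the same opt-out mechanism OpenAI says The New York Times exercised in August 2023.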
OpenAI expressed disappointment at having learned of the lawsuit by reading about it in The New York Times, noting that discussions about displaying real-time content with attribution in ChatGPT were progressing as recently as December 19, before the legal action was filed on December 27. The company continues to engage actively with media outlets, aiming to establish mutually beneficial arrangements. It has already forged licensing agreements with companies such as Axel Springer, publisher of Politico and Business Insider, as well as the Associated Press, and anticipates securing additional partnerships in the near future.
In its blog post, OpenAI alleges that The New York Times is not presenting the full story surrounding the lawsuit. While the Times maintains that ChatGPT reproduces its articles nearly word for word, OpenAI characterizes this behavior as a “rare bug” that it is actively addressing. The company explained that such memorization is an infrequent failure of the learning process, most likely to arise when specific content appears repeatedly in the training data across multiple public sources.
Moreover, OpenAI pointed out that instances of near-verbatim reproduction can be exacerbated by the Times' use of manipulated prompts that include extensive excerpts from its articles. Andrew Ng, founder of Google Brain, supported this view, noting that the prompts employed by the Times do not reflect typical user queries and that the observed word-for-word recreation appears to be a bug.
OpenAI stressed that The New York Times could opt out of having its content used in training, a choice the newspaper exercised in August 2023. The company remains optimistic about continuing to collaborate with news organizations to enhance the production and delivery of high-quality journalism by harnessing the transformative capabilities of AI technology.