OpenAI Asserts New York Times Copyright Lawsuit Lacks Merit

In late December, The New York Times filed a lawsuit against OpenAI and its partner Microsoft, accusing them of violating copyright laws by using the Times’ content to train their generative AI models. OpenAI has since responded, labeling the lawsuit as baseless.

In a letter published on OpenAI’s official blog this afternoon, the company reiterated its position that training AI models on publicly available data from the web, including content from The New York Times, is fair use. In OpenAI’s view, it is not obligated to license or otherwise pay for the examples used to develop generative AI systems like GPT-4 and DALL-E 3, which learn from vast datasets of images, ebooks, and articles to produce human-like text and visuals.

“We believe this principle supports creators, is essential for innovation, and is vital for U.S. competitiveness,” OpenAI stated.

The letter also addresses regurgitation, where a generative AI model reproduces its training data when prompted in a specific way, for example by generating an image identical to one by a renowned photographer. OpenAI argues that regurgitation is less likely to occur with training data from a single source, like The New York Times, and places the onus on users to use the models responsibly and avoid prompting them to regurgitate.

“It’s interesting to note that the regurgitations mentioned in The New York Times lawsuit seem to come from years-old articles that have been widely shared across various third-party sites,” OpenAI noted. “It appears that they manipulated prompts, often using extensive excerpts from articles, to compel our model to regurgitate. Even when users do this, our models typically don’t perform as the Times suggests, indicating that they may have either directed the model to reproduce or selectively chosen examples.”

OpenAI’s response comes amid a growing debate about copyright in the context of generative AI.

An article published this week in IEEE Spectrum by AI critic Gary Marcus and visual effects artist Reid Southen shows how AI systems, including DALL-E 3, can regurgitate training data even without being explicitly prompted to do so, which undercuts OpenAI’s claims. The authors point to The New York Times lawsuit, noting that the Times was able to elicit “plagiaristic” responses simply by entering the opening words of a Times article.

The Times is just one among multiple copyright holders taking legal action against OpenAI for perceived IP violations. Actress Sarah Silverman has joined lawsuits against Meta and OpenAI, claiming that her memoir was used to train their AI models without consent. Additionally, a group of authors, including Jonathan Franzen and John Grisham, alleges that OpenAI used their works in training datasets without their permission. There’s also an ongoing case involving programmers against Microsoft, OpenAI, and GitHub over Copilot, an AI-based code-generating tool that plaintiffs argue was developed using their protected code.

Conversely, some news organizations have opted for licensing agreements with generative AI companies instead of resorting to legal battles. In July, the Associated Press reached an arrangement with OpenAI, and in December, Axel Springer, the German publisher behind Politico and Business Insider, did the same. OpenAI has also secured deals with the American Journalism Project and NYU.

However, the payouts from these agreements tend to be small. According to The Information, OpenAI, whose annual revenue is reportedly about $1.6 billion, offers between $1 million and $5 million a year to license copyrighted news articles for AI training.

Until recently, The New York Times was in discussions with OpenAI for a potentially lucrative partnership centered on displaying its brand in ChatGPT, OpenAI’s AI chatbot. However, these negotiations collapsed in mid-December.

Interestingly, public sentiment appears to lean in favor of publishers. A recent poll from The AI Policy Institute revealed that 59% of respondents believed AI companies should not utilize publisher content for training, while 70% felt these companies should provide compensation if they wish to use copyrighted materials.
