Which AI Models Are Most Likely to Violate Copyrighted Content?

Recent research from the startup Patronus AI reveals that OpenAI's GPT-4 reproduces copyrighted content in a significant share of its responses. Founded by former Meta AI researchers, Patronus AI tested several popular large language models (LLMs): OpenAI's GPT-4, Anthropic's Claude 2.1, Meta's Llama 2 70B, and Mistral's Mixtral-8x7B-Instruct-v0.1. The findings show markedly different rates of copyrighted-content reproduction across these models.

In the experiments, GPT-4 replicated copyrighted material in an average of 44% of the prompts designed to evaluate content regurgitation. Comparatively, Mixtral-8x7B-Instruct-v0.1 produced copyrighted content in 22% of tested prompts, while Llama 2 70B had a much lower reproduction rate of 10%. The model with the least copyright reproduction was Claude 2.1, averaging only 8%.

Patronus AI's methodology involved crafting prompts based on text from books, such as asking for the first passage of well-known titles. For instance, inquiries about the opening of "Harry Potter and the Deathly Hallows" led to models generating exact reproductions of copyrighted material. Some responses even triggered warnings that the generated content could breach usage guidelines.
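A test of this kind can be sketched in a few lines. The prompt template, function names, and the verbatim-excerpt threshold below are illustrative assumptions, not Patronus AI's published harness; they only show the general shape of a prompt-based regurgitation check.

```python
# Hypothetical regurgitation-test sketch (assumed names and thresholds,
# not Patronus AI's actual methodology).

PROMPT_TEMPLATE = 'What is the first passage of "{title}"?'

def build_prompts(titles):
    """Build one opening-passage prompt per book title."""
    return [PROMPT_TEMPLATE.format(title=t) for t in titles]

def reproduction_rate(responses, references, min_chars=200):
    """Fraction of responses that contain a verbatim excerpt
    (at least min_chars characters) of the reference passage."""
    hits = sum(
        1 for resp, ref in zip(responses, references)
        if ref[:min_chars] in resp
    )
    return hits / len(responses)
```

In practice each prompt would be sent to the model under test, and the responses compared against the known opening passages to compute a per-model reproduction rate.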

In a timely update, Anthropic introduced Claude 3, which demonstrated improved compliance by refusing to generate complete passages of copyrighted text. Instead, it opted to summarize specific sections, reflecting a shift towards safer content generation practices.

OpenAI faces a lawsuit from The New York Times concerning allegations that ChatGPT produced unlicensed reproductions of its copyrighted work. Authors and music publishers have also raised legal challenges related to copyright infringement against various LLM developers.

As these legal issues unfold, companies in the LLM sector are actively seeking partnerships with media organizations and social media platforms to ensure their models are trained on properly licensed data. OpenAI, for example, has secured agreements with entities like Axel Springer and the Associated Press, while Google recently initiated a collaboration with Reddit.

“Though industry frontrunners like Microsoft, Anthropic, and OpenAI are developing safeguards, the risk of generating exact reproductions of copyrighted content persists,” stated Anand Kannappan, CEO and co-founder of Patronus AI. “Transparent visibility into model risk is critical, particularly as liability remains ambiguous.”

The intellectual property risk is a major concern for many businesses contemplating the adoption of generative AI. A study by GitLab revealed that 95% of companies prioritize privacy and intellectual property protections when selecting an AI tool. In response to rising concerns, OpenAI, Anthropic, Amazon, Microsoft, and Google have committed to indemnifying their clients against copyright claims.

To help identify copyright infringement, Patronus AI also announced the launch of CopyrightCatcher, a tool designed to detect when an LLM outputs copyrighted material. The application scores generated content and highlights the specific segments containing potential copyright violations. A public demo of CopyrightCatcher lets users assess its capabilities, focusing on open-source models such as Llama 2 70B, Mixtral-8x7B-Instruct-v0.1, and Vicuna-13B-v1.5; GPT-4 is not included in the demo.
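Patronus AI has not published how CopyrightCatcher computes its scores, but the general idea of scoring output for verbatim overlap can be sketched with a simple n-gram comparison. Everything below is an illustrative assumption, not the tool's actual algorithm.

```python
# Hypothetical overlap-scoring sketch: measure what fraction of a model's
# output consists of exact n-grams found in a reference passage.
# This is NOT CopyrightCatcher's method, only an illustration of the idea.

def ngrams(text, n):
    """Return all word-level n-grams of the text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def overlap_score(output, reference, n=8):
    """Fraction of the output's n-grams that appear verbatim in the reference."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    ref_grams = set(ngrams(reference, n))
    hits = [g for g in out_grams if g in ref_grams]
    return len(hits) / len(out_grams)
```

A real detector would additionally map the matching n-grams back to character offsets so the overlapping segments could be highlighted in the output, as the CopyrightCatcher demo reportedly does.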

This development underscores the increasing emphasis on intellectual property rights and the need for tools that can assist in navigating the complex landscape of generative AI.

With the interplay of technology, copyright, and enterprise concerns becoming more pronounced, it’s crucial for businesses to remain vigilant and informed as they explore the potential of AI-driven solutions.
