It’s widely recognized that the data sets used to train artificial intelligence (AI) models are significantly flawed.
Image corpora tend to be predominantly U.S.- and Western-centric, largely because Western images dominated the internet when those data sets were assembled. A recent study from the Allen Institute for AI found that large language models, such as Meta’s Llama 2, often absorb toxic language and entrenched biases.
These models can then amplify those flaws in harmful ways. In response, OpenAI has announced a new initiative to counteract them by collaborating with external organizations to develop improved data sets.
Data Partnerships, OpenAI’s new venture, will see the company team up with third-party organizations to create both public and private data sets for AI training. In an official blog post, OpenAI stated that these partnerships are designed to “enable more organizations to help steer the future of AI” and ultimately “benefit from more useful models.”
“To create AI that is safe and beneficial for humanity, it's essential for AI models to have a comprehensive understanding of various subjects, industries, cultures, and languages. This requires a training data set that is as diverse as possible,” the company explained. “Incorporating your content can enhance the helpfulness of AI models by improving their knowledge of your area of expertise.”
As part of the Data Partnerships program, OpenAI plans to gather “large-scale” data sets that accurately “reflect human society” and are not readily available online. While the initiative will encompass a wide array of data types—including images, audio, and video—OpenAI is particularly interested in data that “expresses human intention,” such as in-depth writing or conversations, across various languages, topics, and formats.
OpenAI says it will work with partner organizations to digitize training data where needed, using optical character recognition (OCR) and automatic speech recognition (ASR) tools, and will remove sensitive or personal information as necessary.
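OpenAI hasn’t described its tooling beyond naming those techniques, but a minimal Python sketch of that kind of ingestion step, assuming the open-source Tesseract (via pytesseract) and Whisper libraries plus a crude regex-based redaction pass, might look like this:

```python
import re

import pytesseract            # OCR wrapper around Tesseract (assumed installed)
import whisper                # openai-whisper speech-to-text (assumed installed)
from PIL import Image

# Very crude personal-information scrub; a real pipeline would use a proper
# PII-detection tool. These patterns are purely illustrative.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def digitize_scan(image_path: str) -> str:
    """Run OCR on a scanned page and redact the result."""
    return redact(pytesseract.image_to_string(Image.open(image_path)))

def digitize_audio(audio_path: str, model_size: str = "base") -> str:
    """Run ASR on an audio file and redact the transcript."""
    model = whisper.load_model(model_size)
    return redact(model.transcribe(audio_path)["text"])

if __name__ == "__main__":
    # Hypothetical input files, for illustration only.
    print(digitize_scan("scanned_page.png"))
    print(digitize_audio("interview.mp3"))
```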
Initially, OpenAI aims to create two categories of data sets: an open-source data set that will be publicly accessible for general AI model training, and private data sets for proprietary use. The private data sets are targeted at organizations that wish to keep their data confidential while still enhancing OpenAI’s models’ understanding of their respective fields. So far, OpenAI has collaborated with the Icelandic Government and Miðeind ehf to boost GPT-4's proficiency in Icelandic, as well as with the Free Law Project to refine its models’ comprehension of legal documents.
“Ultimately, we seek partners who want to help us teach AI to better understand our world, so it can be maximally beneficial for everyone,” OpenAI stated.
Can OpenAI do better than previous data-set-building initiatives, though? That remains to be seen; reducing data set bias is a problem that has stumped many of the field’s experts. At the very least, transparency about the process, and about the challenges encountered in creating these data sets, would be a welcome development.
Despite the ambitious tone of the blog post, there is a clear commercial interest here: making OpenAI’s models more effective, potentially at the expense of others’, and without offering compensation to the original data owners. That may be within OpenAI’s rights, but it comes across as somewhat insensitive, particularly in light of the recent open letters and lawsuits from creatives alleging that OpenAI trained its models on their work without permission or payment.