Major Expansion of One of the World’s Largest AI Training Datasets Promises Enhanced Quality and Size

Massive AI training datasets, often referred to as corpora, are considered "the backbone of large language models" (LLMs). In 2023, EleutherAI garnered attention for creating one of the world’s largest open-source text corpora, the 825 GB Pile. This organization, a grassroots nonprofit established in 2020 as a Discord collective to explore OpenAI’s GPT-3, faced scrutiny amid growing legal and ethical concerns surrounding the datasets used for training popular LLMs like OpenAI's GPT-4 and Meta’s Llama.

EleutherAI has been named in numerous lawsuits over generative AI. A notable case, filed in October 2023 by former Arkansas Governor Mike Huckabee and several authors, claimed their books were included in Books3, a contentious dataset of more than 180,000 works that formed part of the Pile. Books3 was originally uploaded in 2020 by Shawn Presser and was taken down in August 2023 following a legal notice from a Danish anti-piracy group.

Despite these challenges, EleutherAI is developing an updated version of the Pile dataset, collaborating with institutions like the University of Toronto and the Allen Institute for AI, as well as independent researchers. Stella Biderman, EleutherAI's executive director, and Aviya Skowron, head of policy and ethics, revealed in a joint interview that the new Pile is expected to be finalized in a few months.

The updated Pile will be significantly larger and "substantially better" than its predecessor, according to Biderman. "There’s going to be a lot of new data," she noted, emphasizing the inclusion of previously unseen information. The new dataset will feature more recent data compared to the original, which was released in December 2020 and used to train models like the Pythia suite and Stability AI’s Stable LM suite. With lessons learned from training nearly a dozen LLMs, Biderman highlighted improved data preprocessing methods: "When we created the Pile, we had never trained an LLM. Now, we've gained valuable insights on how to refine data for optimal use in LLMs."

The updated dataset will also emphasize better quality and diverse data inclusion. "We’re planning to incorporate many more books and a wider variety of non-academic non-fiction works," she explained.

The original Pile comprised 22 sub-datasets, including Books3, PubMed Central, arXiv, Stack Exchange, Wikipedia, YouTube subtitles, and even Enron emails. Biderman remarked that the Pile remains the most well-documented LLM training dataset globally. The initiative aimed to construct an extensive dataset consisting of billions of text passages, rivaling the scale of OpenAI's training for GPT-3.

"When introduced in 2020, the Pile played a crucial role because it was unique," Biderman stated. At the time, only one comparable large text corpus was publicly available: C4, which Google used to train various language models. "But C4 is smaller and less diverse," she asserted, describing it as a refined Common Crawl scrape.

EleutherAI's approach to crafting the Pile involved selective curation of information and topics essential for enriching model knowledge. "More than 75% of the Pile was curated from specific domains," she noted. "Our aim was to provide meaningful insights about the world."

Skowron explained EleutherAI’s stance on model training and fair use, asserting that "current LLMs rely on copyrighted data." One goal of the Pile v2 project is to address issues linked to copyright and data licensing. The new Pile dataset will include public domain works, Creative Commons licensed texts, and government documents, ensuring compliance with legal standards. Additionally, it will feature datasets for which explicit permissions from rights holders have been obtained.

Criticism of AI training datasets gained traction following the release of ChatGPT in November 2022, raising concerns about copyright infringement. The series of generative AI lawsuits that ensued came from artists, writers, and publishers, culminating in significant legal challenges, including one from The New York Times against OpenAI and Microsoft.

The debate surrounding AI training data is complex. Biderman and Skowron stressed the importance of addressing morally troubling cases, such as the discovery of child sexual abuse images in the LAION-5B dataset, which recently led to its removal. Biderman noted that the methodology used to flag such content may not be legally accessible to organizations like LAION.

Furthermore, they acknowledged the concerns of creatives whose works were used to train AI models, emphasizing that many did so under permissive licenses without anticipating AI's evolution. "In hindsight, many would have chosen different licensing options," Biderman reflected.

While AI training datasets were once primarily research tools, they have transitioned into commercial products. Biderman noted that dataset creation is now driven chiefly by product development rather than research, underscoring the growing commercial stakes of AI model training.

Interestingly, Biderman and Skowron argued that AI models trained on open datasets like the Pile are safer, as increased visibility into the data fosters ethical usage across various contexts. "To achieve many policy objectives, there must be transparency, including thorough training documentation," said Skowron.

As EleutherAI continues refining the Pile, Biderman expressed optimism about releasing the new models soon. "We've been working on this for about a year and a half, and I'm eager to see the results. I anticipate it will make a small but meaningful difference."
