RedPajama v2: Explore the Massive 30-Trillion-Token Public Dataset

AI startup Together has released a dataset containing 30 trillion tokens, roughly 20 trillion words. The release is the latest version of RedPajama, first introduced in April with a more modest 1.2 trillion tokens aimed at supporting the development of open-source large language models (LLMs).

The newly revised RedPajama comprises trillions of filtered and deduplicated tokens sourced from 84 CommonCrawl dumps, encompassing five languages: English, French, Spanish, German, and Italian. According to Together, RedPajama v2 now stands as the largest public dataset specifically tailored for LLM training, enhanced with over 40 pre-computed data quality annotations. These annotations facilitate further filtering and weighting, allowing developers to streamline their model training processes effectively.

Together argues that many publicly accessible datasets, like CommonCrawl, often suffer from quality issues. They note that the raw data is typically not ideal for direct use in LLM training because of artifacts left over from HTML-to-plain-text conversion. The updated RedPajama aims to spare developers the time-consuming, resource-intensive work of filtering raw data themselves: with its pre-computed annotations, the dataset lets developers curate their own pre-training datasets more easily, improving both the quality and the efficiency of model development.
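The curation workflow this enables can be sketched roughly as follows. Note that the signal names below (`word_count`, `duplicate_fraction`) and the record layout are illustrative placeholders, not RedPajama v2's actual schema; the point is simply that pre-computed per-document quality signals let you filter with cheap threshold checks instead of re-analyzing raw text.

```python
# Hypothetical sketch: curating a pre-training set from documents that
# carry pre-computed quality annotations, in the spirit of RedPajama v2.
# Signal names here are illustrative, not the dataset's real schema.

def passes_quality_filters(doc, min_words=50, max_dup_fraction=0.3):
    """Return True if a document's quality signals clear simple thresholds."""
    signals = doc["quality_signals"]
    if signals["word_count"] < min_words:
        return False  # too short to be useful training text
    if signals["duplicate_fraction"] > max_dup_fraction:
        return False  # mostly duplicated/boilerplate content
    return True

docs = [
    {"text": "A long, unique article ...",
     "quality_signals": {"word_count": 800, "duplicate_fraction": 0.05}},
    {"text": "short spam",
     "quality_signals": {"word_count": 2, "duplicate_fraction": 0.0}},
    {"text": "copied boilerplate ...",
     "quality_signals": {"word_count": 300, "duplicate_fraction": 0.9}},
]

curated = [d for d in docs if passes_quality_filters(d)]
print(len(curated))  # only the first document survives both filters
```

Because the expensive analysis is already baked into the annotations, different teams can apply different thresholds (or learned weightings) over the same shared pool of web data.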

While other similar datasets exist—such as C4, RedPajama-1T, RefinedWeb (Falcon), Dolma (AI2), and SlimPajama—many of these alternatives cover only a fraction of the CommonCrawl data and bake in specific filtering choices. The team behind RedPajama aims to minimize this burden on the community by providing an extensive pool of web data that serves as a foundation for high-quality LLM training datasets. The initiative is designed to encourage thorough research into LLM training data and its applications.

In addition to the initial offerings, Together plans to expand the dataset's quality annotations, making RedPajama a dynamic and evolving project. The dataset is open-source, licensed under Apache License v2, making it suitable for commercial applications. Data processing scripts can be found on GitHub, and all datasets are readily available on Hugging Face.

Developers are encouraged to enhance their data mixtures by integrating other resources, such as The Stack by BigCode for code generation and S2ORC by AI2 for scientific articles. Since its original version's launch, RedPajama has been downloaded over 190,000 times, with more than 500 implementations showcased on Hugging Face. Notable projects building on the original RedPajama include Alibaba's Data-Juicer and chat projects from the Analytics Club at ETH Zürich.
