RedPajama v2: Explore the Massive 30-Trillion-Token Public Dataset

AI startup Together has released a dataset containing 30 trillion tokens, roughly 20 trillion words. The release is the latest version of RedPajama, first introduced in April with a more modest 1.2 trillion tokens aimed at supporting the development of open-source large language models (LLMs).

The newly revised RedPajama comprises trillions of filtered and deduplicated tokens sourced from 84 CommonCrawl dumps, encompassing five languages: English, French, Spanish, German, and Italian. According to Together, RedPajama v2 now stands as the largest public dataset specifically tailored for LLM training, enhanced with over 40 pre-computed data quality annotations. These annotations facilitate further filtering and weighting, allowing developers to streamline their model training processes effectively.

Together argues that many publicly accessible datasets, like CommonCrawl, often suffer from quality issues. They note that the raw data is typically not ideal for direct use in LLM training because of artifacts left over from HTML-to-plain-text conversion. The updated RedPajama aims to spare developers the time-consuming, resource-intensive work of filtering raw data themselves: with its pre-computed annotations, the dataset lets developers curate their own pre-training datasets more easily, improving both the quality and the efficiency of model development.
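The curation workflow this enables can be sketched roughly as follows. Note that the signal names below (`word_count`, `duplicate_fraction`) and the record layout are illustrative placeholders, not RedPajama v2's actual schema; the point is simply that pre-computed per-document quality signals let you filter with cheap threshold checks instead of re-analyzing raw text.

```python
# Hypothetical sketch: curating a pre-training set from documents that
# carry pre-computed quality annotations, in the spirit of RedPajama v2.
# Signal names here are illustrative, not the dataset's real schema.

def passes_quality_filters(doc, min_words=50, max_dup_fraction=0.3):
    """Return True if a document's quality signals clear simple thresholds."""
    signals = doc["quality_signals"]
    if signals["word_count"] < min_words:
        return False  # too short to be useful training text
    if signals["duplicate_fraction"] > max_dup_fraction:
        return False  # mostly duplicated/boilerplate content
    return True

docs = [
    {"text": "A long, unique article ...",
     "quality_signals": {"word_count": 800, "duplicate_fraction": 0.05}},
    {"text": "short spam",
     "quality_signals": {"word_count": 2, "duplicate_fraction": 0.0}},
    {"text": "copied boilerplate ...",
     "quality_signals": {"word_count": 300, "duplicate_fraction": 0.9}},
]

curated = [d for d in docs if passes_quality_filters(d)]
print(len(curated))  # only the first document survives both filters
```

Because the expensive analysis is already baked into the annotations, different teams can apply different thresholds (or learned weightings) over the same shared pool of web data.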

While other similar datasets exist—such as C4, RedPajama-1T, RefinedWeb (Falcon), Dolma (AI2), and SlimPajama—many of these alternatives cover only a fraction of the CommonCrawl data and bake in specific filtering choices. The team behind RedPajama aims to minimize this burden on the community by providing an extensive pool of web data that serves as a foundation for high-quality LLM training datasets. The initiative is designed to encourage thorough research into LLM training data and its applications.

In addition to the initial offerings, Together plans to expand the dataset's quality annotations, making RedPajama a dynamic and evolving project. The dataset is open-source, licensed under Apache License v2, making it suitable for commercial applications. Data processing scripts can be found on GitHub, and all datasets are readily available on Hugging Face.

Developers are encouraged to enhance their data mixtures by integrating other resources, such as The Stack by BigCode for code generation and S2ORC by AI2 for scientific articles. Since its original version's launch, RedPajama has been downloaded over 190,000 times, with more than 500 implementations showcased on Hugging Face. Notable projects building on the original RedPajama include Alibaba's Data-Juicer and chat projects from the Analytics Club at ETH Zürich.
