Zyphra Technologies Unveils Zyda: A Groundbreaking Language Model Dataset
Zyphra Technologies has announced the launch of Zyda, an extensive dataset designed to enhance language model training. Comprising 1.3 trillion tokens, Zyda is a filtered and deduplicated collection derived from premium open datasets: RefinedWeb, StarCoder, C4, the Pile, SlimPajama, peS2o, and arXiv. Initial ablation studies indicate that Zyda outperforms the datasets it was constructed from. An early version of this dataset is already powering Zyphra’s Zamba model, and the dataset is available for download on Hugging Face.
“We created Zyda while developing a pretraining dataset for our Zamba series of models,” shares Yury Tokpanov, Zyphra’s machine learning research engineer and product lead. “This dataset provides an exceptionally high-quality resource for training language models, eliminating the need for others to recreate something like Zyda from scratch.”
Zyphra aimed to improve on existing datasets by combining several open-source collections. The team applied syntactic filtering to eliminate low-quality documents, then ran a rigorous deduplication process both within each dataset and across datasets so that each document appears only once. As Zyphra notes in a blog post, “Cross deduplication is crucial, as many datasets contain overlapping documents from common sources such as Common Crawl.”
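To make the idea of filtering plus within- and cross-dataset deduplication concrete, here is a minimal Python sketch. It is not Zyphra's pipeline: the function names, the thresholds, and the choice of exact hashing on normalized text are all assumptions made for illustration, and a production system would more likely rely on fuzzy matching to catch near-duplicates.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_syntactic_filter(
    text: str, min_words: int = 5, min_alpha_ratio: float = 0.6
) -> bool:
    """Hypothetical quality check: enough words, and mostly alphabetic characters.

    The thresholds are illustrative guesses, not values from Zyphra.
    """
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    return alpha / non_space >= min_alpha_ratio


def filter_and_dedup(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop low-quality documents, then remove exact duplicates.

    A single `seen` set is shared by every source, so a document is removed
    whether its earlier copy came from the same dataset or a different one.
    """
    seen: set[str] = set()
    kept: dict[str, list[str]] = {}
    for name, docs in datasets.items():
        kept[name] = []
        for doc in docs:
            if not passes_syntactic_filter(doc):
                continue
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept[name].append(doc)
    return kept
```

Because sources are processed in order, the first dataset to contain a document keeps it, so the ordering of the input decides which source gets credit for shared text.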
Among the seven open language modeling datasets used, RefinedWeb is the largest contributor, making up 43.6% of Zyda. Other significant sources include SlimPajama (18.7%) and StarCoder (17.8%), while the remaining four sources together account for the rest.
“In total, we discarded approximately 40% of our initial dataset, reducing its token count from an estimated 2 trillion to 1.3 trillion,” Tokpanov explains.
Because Zyda is open source, developers can use the dataset to train language models for a range of applications, from word prediction and text generation to language translation. If Zyda performs as anticipated, it could spare developers the work of assembling and cleaning their own pretraining data, reducing production time and costs.
Curious about the name Zyda? Tokpanov reveals it’s a blend of “Zyphra Dataset.”
You can download Zyda on Zyphra’s Hugging Face page.
Updated: June 7, 2024 – Corrected attribution from Krithik Puthalath to Yury Tokpanov.