Hugging Face has announced its acquisition of XetHub, a Seattle-based collaborative development platform founded by former Apple researchers. This platform aims to improve efficiency for machine learning teams working with large datasets and models.
While the financial details of the acquisition were not disclosed, CEO Clem Delangue stated in an interview with Forbes that this is Hugging Face's largest acquisition to date.
The Hugging Face team plans to integrate XetHub’s technology into its platform to enhance its storage backend. This upgrade will enable developers to host larger models and datasets with minimal effort.
CTO Julien Chaumond emphasized the significance of this acquisition in a blog post, saying, “The XetHub team will help us unlock the next five years of growth for HF datasets and models by implementing our own, improved version of LFS as the storage backend for the Hub’s repositories.”
What XetHub Adds to Hugging Face
Founded in 2021 by Yucheng Low, Ajit Banerjee, and Rajat Arya, XetHub has become known for offering enterprises a robust platform to explore and manage large models and datasets. It features Git-like version control for repositories scaling up to terabytes, allowing teams to track changes, collaborate, and maintain reproducibility throughout their machine learning workflows.
Over the past three years, XetHub has attracted notable clients, including Tableau and Gather AI, thanks to its advanced capabilities in handling complex scalability requirements. These include techniques such as content-defined chunking, deduplication, instant repository mounting, and file streaming.
With this acquisition, the XetHub platform will be discontinued, and its data and model management capabilities will enhance the Hugging Face Hub, providing a more optimized storage and versioning backend.
Currently, the Hugging Face Hub runs on a Git LFS (Large File Storage) backend that launched in 2020. However, Chaumond acknowledged that this system would eventually become insufficient due to the ever-increasing volume of large files in the AI ecosystem. The integration of XetHub marks a critical upgrade.
XetHub currently supports individual files larger than 1TB, with repository sizes exceeding 100TB. This significantly surpasses Git LFS, which has a maximum file size of 5GB and a 10GB repository limit. The added headroom will let the Hugging Face Hub host larger models and datasets.
Moreover, XetHub’s advanced storage and transfer functionalities will make collaboration more efficient. For example, its content-defined chunking and deduplication features will let users upload only the chunks of data that have changed when updating datasets, vastly reducing upload times.
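To make the idea concrete, here is a minimal sketch of content-defined chunking with deduplication. It is not XetHub's implementation: the gear-style rolling hash, the chunk-size parameters, and the names `chunk` and `upload` are all assumptions chosen for illustration, and the "store" is just an in-memory dictionary standing in for remote storage.

```python
import hashlib
import random
from typing import Dict, List, Tuple

# Hypothetical parameters, chosen only for illustration.
MASK = (1 << 11) - 1   # a boundary fires roughly once every 2 KiB of data
MIN_CHUNK = 256        # never cut a chunk shorter than this
MAX_CHUNK = 8192       # always cut once a chunk reaches this length

# Fixed table of pseudo-random 32-bit values, one per byte value (gear hash).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> List[bytes]:
    """Split data at positions decided by its content, not by fixed offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes are shifted out, so h reflects only
        # the most recent bytes seen.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def upload(data: bytes, store: Dict[str, bytes]) -> Tuple[List[str], int]:
    """Store only chunks the store has not seen; return (manifest, bytes sent)."""
    manifest, sent = [], 0
    for piece in chunk(data):
        key = hashlib.sha256(piece).hexdigest()
        if key not in store:      # deduplication: skip chunks already stored
            store[key] = piece
            sent += len(piece)
        manifest.append(key)      # ordered list of keys that rebuilds the file
    return manifest, sent


if __name__ == "__main__":
    rng = random.Random(42)
    v1 = bytes(rng.getrandbits(8) for _ in range(200_000))
    v2 = v1[:100_000] + b"edited row" + v1[100_000:]  # small edit mid-file
    store: Dict[str, bytes] = {}
    _, sent_v1 = upload(v1, store)
    _, sent_v2 = upload(v2, store)
    print(f"first upload: {sent_v1} bytes, second upload: {sent_v2} bytes")
```

Because chunk boundaries depend on the bytes themselves, an insertion in the middle of a file disturbs only the chunks near the edit, and the boundaries further along fall back into place. With fixed-size chunks, the same insertion would shift every later chunk and force nearly the whole file to be re-uploaded.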
“As the industry moves toward trillion-parameter models in the coming months, our goal is for this new technology to unlock greater scale for both our community and enterprise clients,” noted Chaumond. The two teams will work closely to introduce solutions designed to enhance teamwork on HF Hub assets and track their evolution.
Currently, the Hugging Face Hub houses 1.3 million models, 450,000 datasets, and 680,000 Spaces, amounting to around 12PB in LFS storage. With the integration of the enhanced storage backend, it will be interesting to see how these numbers grow as the Hub accommodates larger models and datasets. However, the timeline for this integration and the rollout of additional features remains uncertain.