Nous Research made waves this month with the release of its open-source Llama 3.1 variant, Hermes 3. Now, the small research team focused on developing “personalized, unrestricted AI” models has unveiled another groundbreaking innovation: DisTrO (Distributed Training Over-the-Internet). This new optimizer significantly reduces the data transfer needed between GPUs (graphics processing units) during AI model training.
DisTrO allows individuals and institutions worldwide to collaborate on training advanced AI models over consumer-grade internet connections, without needing a major corporation to oversee the process. In a recent technical paper, Nous Research reported that DisTrO delivers an 857x efficiency gain over the popular All-Reduce training algorithm, cutting the data transmitted per training step from 74.4 gigabytes to just 86.8 megabytes while giving up only a slight amount of performance. The findings are summarized in a table in the research paper.
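As a quick sanity check on those figures (our own arithmetic, not a calculation from the paper), the ratio between the two per-step transfer sizes lines up with the claimed speedup:

```python
# Back-of-the-envelope check: cutting per-step traffic from 74.4 GB to 86.8 MB
# is a reduction of roughly 857x, matching the figure Nous Research reports.
all_reduce_mb_per_step = 74.4 * 1000   # 74.4 GB expressed in MB
distro_mb_per_step = 86.8              # DisTrO's reported per-step transfer
print(all_reduce_mb_per_step / distro_mb_per_step)  # ≈ 857
```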
Ultimately, DisTrO could democratize access to powerful AI model training, allowing more people to explore and experiment without corporate barriers.
The challenge of AI training: substantial hardware demands
As previously discussed, Nvidia’s GPUs are in high demand during the generative AI boom. These expensive graphics cards offer the necessary parallel processing power for efficient and rapid AI training. The training process heavily relies on clusters of GPUs communicating to share insights learned from training datasets.
This “inter-GPU communication” requires meticulously architected GPU clusters to minimize latency and maximize throughput. Consequently, companies like Tesla are investing in physical “superclusters” consisting of thousands of GPUs housed in large facilities.
Due to these stringent requirements, training generative AI, especially the most sophisticated models, is often a capital-intensive endeavor, accessible primarily to well-funded companies such as Tesla, Meta, OpenAI, Microsoft, Google, and Anthropic.
Each of these organizations has its own training methodology, but all rely on broadly similar hardware and keep their AI training processes tightly controlled, which makes it difficult for newcomers and independent developers to train models with a comparable number of parameters.
However, Nous Research differs by advocating for accessible, powerful AI development that anyone can customize without restrictions.
What sets DisTrO apart
Conventional AI training methods require synchronizing full gradients across multiple GPUs and therefore depend on high-bandwidth connections. DisTrO, by contrast, reduces that communication overhead by four to five orders of magnitude.
The specific algorithms behind this efficiency have not been fully disclosed, though the authors say they plan to share more details soon. The reduction was achieved without relying on amortized analysis or compromising the convergence rate, allowing large-scale models to be trained over much slower internet connections: 100 Mbps download and 10 Mbps upload, speeds widely available to consumers.
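For readers unfamiliar with what "synchronizing full gradients" entails, the sketch below shows the conventional exchange in generic PyTorch terms. It is purely illustrative and is not DisTrO's (undisclosed) algorithm; DisTrO's contribution lies in shrinking or replacing exactly this kind of transfer:

```python
# Generic sketch of conventional gradient synchronization in data-parallel
# training: every GPU all-reduces its complete gradient tensors each step,
# so per-step traffic scales with the full size of the model.
import torch
import torch.distributed as dist

def sync_full_gradients(model: torch.nn.Module, world_size: int) -> None:
    """Average every parameter's gradient across all workers."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # full-size transfer
            param.grad /= world_size                           # average across GPUs
```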
The research team tested DisTrO by training a 1.2-billion-parameter large language model (LLM) based on Meta's Llama 2 architecture. The results showed training performance comparable to conventional methods while dramatically reducing the data transferred. The team notes that this is the smallest model size that works well with DisTrO, and it remains unclear how the bandwidth reduction scales as models grow.
Preliminary tests indicate a potential bandwidth reduction of 1,000x to 3,000x during pre-training and up to 10,000x during post-training, with no noticeable degradation in performance. The team also speculates that DisTrO could be applied to training large diffusion models, such as those behind Stable Diffusion and similar image-generation services.
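To put those multipliers in rough perspective (our own back-of-the-envelope estimate, not a figure from the paper), the full-precision gradients of a 1.2-billion-parameter model amount to several gigabytes per synchronization, so a 1,000x to 3,000x reduction would shrink each exchange to a few megabytes:

```python
# Rough, illustrative estimate of gradient traffic for a 1.2B-parameter model.
params = 1.2e9
bytes_per_fp32 = 4                                  # full-precision gradients
full_gradient_mb = params * bytes_per_fp32 / 1e6    # ≈ 4,800 MB per exchange
print(full_gradient_mb / 1000)                      # ≈ 4.8 MB at a 1,000x reduction
print(full_gradient_mb / 3000)                      # ≈ 1.6 MB at a 3,000x reduction
```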
The ongoing necessity for GPUs
It’s important to note that DisTrO still requires GPUs; it simply allows them to operate as a globally distributed network rather than being co-located in a single facility.
Specifically, the evaluation used 32 H100 GPUs with the Distributed Data Parallel (DDP) strategy, in which each GPU holds the entire model in VRAM. This setup allowed rigorous testing of DisTrO’s capabilities and showed that it can match the convergence rate of AdamW + All-Reduce while drastically reducing the communication required.
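For context, the sketch below shows what that AdamW + All-Reduce baseline typically looks like with PyTorch’s DistributedDataParallel, with each process holding a full model replica on its GPU. The model, data loader, and hyperparameters are placeholders; this is not Nous Research’s training code:

```python
# Minimal sketch of the DDP + AdamW baseline that DisTrO is benchmarked against.
# Launched with one process per GPU (e.g., via torchrun); full gradients are
# all-reduced across all replicas during loss.backward().
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_baseline(build_model, data_loader, steps=1000, lr=4e-4):
    dist.init_process_group("nccl")                   # one process per GPU
    device = torch.device(f"cuda:{dist.get_rank() % torch.cuda.device_count()}")

    model = build_model().to(device)                  # full replica in each GPU's VRAM
    ddp_model = DDP(model, device_ids=[device.index])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=lr)

    for _, (inputs, targets) in zip(range(steps), data_loader):
        optimizer.zero_grad(set_to_none=True)
        logits = ddp_model(inputs.to(device))
        loss = torch.nn.functional.cross_entropy(logits, targets.to(device))
        loss.backward()                               # full gradients all-reduced here
        optimizer.step()

    dist.destroy_process_group()
```

In this setup, the heavy inter-GPU traffic happens during the backward pass; DisTrO’s claim is that the same convergence can be reached while exchanging only a tiny fraction of that data.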
DisTrO could disrupt traditional training methods without sacrificing model quality, offering a scalable solution for large-scale distributed training. By lowering the need for high-speed connections, it enables collaborative model training across decentralized networks, even among users with standard internet services.
The research report further delves into the implications of DisTrO for federated learning and decentralized training. Its efficiency may also help mitigate the environmental impact of AI training by optimizing existing infrastructure and reducing reliance on large data centers.
Moreover, these innovations could shift the paradigm of large-scale model training from centralized, resource-heavy data centers to more distributed, collaborative methods that utilize diverse computing resources.
What's next for Nous Research and DisTrO?
The research team invites others to join them in exploring DisTrO’s possibilities. Preliminary reports and additional materials are available on GitHub, and they actively seek collaborators to refine and expand this innovative technology.
AI influencers, such as @kimmonismus on X, have praised this research as potentially transformative for the field, declaring, “This could change everything!”
With DisTrO, Nous Research is not only enhancing AI training capabilities but also fostering a more inclusive research ecosystem capable of unlocking significant advancements in artificial intelligence.