The emergence of ChatGPT in late 2022 ignited a competitive race among AI companies and tech giants, all aiming to lead the rapidly expanding market for large language models (LLMs). In response to this fierce competition, many firms chose to provide their language models as proprietary services, offering API access while concealing the underlying model weights and details about their training datasets and methodologies.
Contrary to the trend toward proprietary models, 2023 saw substantial growth in the open-source LLM ecosystem, highlighted by the release of models whose weights can be downloaded and customized for specific applications. This development has established open source as a significant force in the LLM landscape, one that effectively keeps pace with proprietary solutions.
Is Bigger Better?
Before 2023, the common belief was that increasing the size of LLMs was essential for improved performance. Open-source models like BLOOM and OPT, built at a scale comparable to OpenAI's 175-billion-parameter GPT-3, exemplified this approach. However, these large models required substantial computational resources and expertise to operate effectively.
This paradigm shifted dramatically in February 2023 when Meta released Llama, a family of models ranging from 7 to 65 billion parameters. Llama showed that smaller models could match, and on some benchmarks exceed, the performance of much larger counterparts, demonstrating that parameter count is not the only determinant of effectiveness.
The key to Llama's success lay in its training on a much larger dataset. While GPT-3 utilized around 300 billion tokens, Llama's models ingested up to 1.4 trillion tokens, demonstrating that training smaller models on a more extensive token dataset could be a powerful approach.
The Benefits of Open-Source Models
Llama's popularity stemmed from two main advantages: its ability to run on a single GPU and its openly released weights. This accessibility allowed the research community to quickly build upon its architecture and findings, sparking the emergence of several notable open-source LLMs, including Cerebras-GPT by Cerebras, Pythia by EleutherAI, MPT by MosaicML, XGen by Salesforce, and Falcon by the Technology Innovation Institute (TII).
In July 2023, Meta released Llama 2, which quickly became the foundation for numerous derivative models. Mistral AI also made waves with the introduction of its two models, Mistral 7B and Mixtral 8x7B, gaining acclaim for their performance and cost-effectiveness.
“Since the original Llama's release, the open-source LLM landscape has accelerated, with Mixtral now recognized as the third most helpful model in human evaluations, following GPT-4 and Claude,” stated Jeff Boudier, Head of Product and Growth at Hugging Face.
Additional models like Alpaca, Vicuna, Dolly, and Koala were developed using these foundational models, tailored for specific applications. Data from Hugging Face reveals that developers have created thousands of forks and specialized versions. Notably, there are over 14,500 results for “Llama,” 3,500 for “Mistral,” and 2,400 for “Falcon.” Despite its December 2023 release, Mixtral has already served as the basis for 150 projects.
The open-source nature of these models fosters innovation by enabling developers to create new models and combine existing ones in various configurations, enhancing the practicality of LLMs.
The Future of Open-Source Models
As proprietary models continue to evolve, the open-source community remains a formidable contender. Tech giants are increasingly incorporating open-source models into their products, recognizing their value. Microsoft, a primary backer of OpenAI, has released two open-source models, Orca and Phi-2, and has improved the integration of open-source models within its Azure AI Studio platform. Likewise, Amazon has introduced Bedrock, a cloud service designed to host both proprietary and open-source models.
“In 2023, enterprises were largely surprised by the capabilities of LLMs, particularly following the success of ChatGPT,” noted Boudier. “CEOs tasked their teams with defining Generative AI use cases, leading to rapid experimentation and proof of concept applications using closed model APIs.”
However, relying on external APIs for critical technologies poses risks, including the potential exposure of sensitive source code and customer data—an unsustainable long-term strategy for businesses focused on data privacy and security.
The emerging open-source ecosystem offers a promising path for businesses looking to implement generative AI while addressing privacy and compliance needs.
“As AI transforms technology development, just as with past innovations, organizations will need to create and manage AI solutions in-house, ensuring the privacy, security, and regulatory compliance required for customer information,” Boudier concluded. “Based on historical trends, this will likely mean embracing open-source.”