Researchers from Stanford University's Scaling Intelligence Lab have unveiled a new inference framework called Archon, designed to make large language models (LLMs) more effective at generating responses.
Archon employs an inference-time architecture search (ITAS) algorithm that boosts LLM performance without necessitating additional training. This model-agnostic, open-source framework is easily implementable with both large and small models.
Archon aims to help developers build AI systems by combining multiple inference-time techniques to streamline response generation. According to the Scaling Intelligence Lab, these techniques can significantly cut the costs of model development and inference. As LLMs grow to larger parameter counts and more sophisticated reasoning, those expenses tend to rise, even as companies like OpenAI predict greater affordability.
The researchers emphasize that Archon automatically crafts architectures that enhance task generalization, allowing models to tackle challenges beyond their original training scope. "Our Archon framework and ITAS algorithm are inspired by neural architectures and architecture search practices," the researchers explained. "Archon consists of layers of LLMs, where models within the same layer operate in parallel, while each subsequent layer processes results sequentially."
These layers apply inference techniques to transform candidate responses: generation and fusion act like linear transformations of the candidate set, while techniques such as ranking and refinement act like non-linearities.
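To make the layer analogy concrete, here is a minimal sketch of one generation layer feeding one fusion layer. This is not the official Archon API: `call_llm` is a hypothetical stub standing in for any chat-completion client, and the model names are placeholders.

```python
# Illustrative sketch of an Archon-style layer stack, under stated assumptions.
from concurrent.futures import ThreadPoolExecutor


def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion client; swap in your provider."""
    return f"<response from {model}>"


def generator_layer(models: list[str], prompt: str) -> list[str]:
    """Models in the same layer run in parallel, each producing a candidate."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda m: call_llm(m, prompt), models))


def fuser_layer(model: str, prompt: str, candidates: list[str]) -> str:
    """A subsequent layer consumes the previous layer's output sequentially,
    here fusing the candidates into one coherent response."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    fuse_prompt = (
        f"Question: {prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "Combine these candidates into a single accurate, coherent answer."
    )
    return call_llm(model, fuse_prompt)


prompt = "What is the capital of France?"
candidates = generator_layer(["model-a", "model-b", "model-c"], prompt)
print(fuser_layer("model-a", prompt, candidates))
```

The parallel step widens the pool of candidate answers, and the sequential step narrows it back down, which is the pattern the ITAS search composes into deeper stacks.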
In benchmark tests spanning MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests, Archon architectures surpassed GPT-4o and Claude 3.5 Sonnet by an average of 15.1 percentage points, and outperformed open-source LLMs by 11.2 percentage points.
Components of Archon
The ITAS algorithm consists of several key components, each executing an inference technique (a sketch of how these stages might look in code follows the list):
1. Generator: Generates candidate responses to the prompt.
2. Fuser: Combines these responses into a cohesive answer. For instance, if asked the capital of France, it synthesizes responses like “the capital of France is Paris” and “France is in Europe” into one statement: “The capital of France, a country in Europe, is Paris.”
3. Ranker: Ranks the generated answers.
4. Critic: Evaluates the quality of the ranked responses.
5. Verifier: Checks for logical consistency and correctness.
6. Unit Test Generator and Evaluator: Generates and runs small tests to check response accuracy.
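The ranking, critique, and verification stages can likewise be sketched as prompt-based wrappers. The sketch below reuses the `call_llm` stub from the earlier example; the prompts and output parsing are assumptions for illustration, not the paper's exact implementations.

```python
# call_llm: same stand-in stub defined in the earlier sketch.


def ranker(model: str, prompt: str, candidates: list[str], top_k: int = 2) -> list[str]:
    """Asks an LLM to order the candidates, keeping the top_k."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    reply = call_llm(
        model,
        f"Question: {prompt}\nCandidates:\n{numbered}\n"
        "List the candidate numbers from best to worst, comma-separated.",
    )
    order = [int(tok) - 1 for tok in reply.replace(",", " ").split() if tok.isdigit()]
    ranked = [candidates[i] for i in order if 0 <= i < len(candidates)]
    return (ranked or candidates)[:top_k]  # fall back if the reply is unparsable


def critic(model: str, prompt: str, candidate: str) -> str:
    """Produces strengths and weaknesses for one candidate."""
    return call_llm(
        model,
        f"Question: {prompt}\nAnswer: {candidate}\n"
        "List the strengths and weaknesses of this answer.",
    )


def verifier(model: str, prompt: str, candidate: str, critique: str) -> bool:
    """Checks a candidate for logical consistency and correctness."""
    reply = call_llm(
        model,
        f"Question: {prompt}\nAnswer: {candidate}\nCritique: {critique}\n"
        "Is the answer logically consistent and correct? Reply YES or NO.",
    )
    return reply.strip().upper().startswith("YES")
```

Chained together (generate, fuse, rank, critique, verify), these stages form the kind of multi-call pipeline that ITAS searches over when composing an architecture for a given task.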
This structured approach lets Archon improve the quality of LLM responses quickly, without any additional fine-tuning.
Limitations of Archon
Currently, Archon performs best with LLMs of 70 billion parameters or more, such as Meta's Code Llama 70B. The limitation stems from smaller models' weaker instruction-following, compounded by their narrower context windows. The researchers reported a significant 16% performance drop when Archon was applied to 7B models.
Moreover, Archon setups lag 15.7% behind single-turn models on single-call tasks. The Stanford lab noted that Archon is not suited to applications that demand the low latency of a single LLM call, such as chatbots: because its architecture issues multiple LLM calls per query, it is a poor fit for simple question-and-answer exchanges. Archon is likelier to excel at complex tasks with intricate instructions, such as programming or advanced customer-service scenarios.
Despite these limitations, the researchers hope Archon can accelerate the development of high-performing LLMs without additional capital spent on inference and training.