Large language models (LLMs) excel at generating text, coding, translating languages, and crafting diverse forms of creative content. However, their complex inner workings often remain opaque, presenting challenges for researchers and practitioners alike.
This lack of interpretability becomes critical in error-sensitive applications that demand transparency. In response, Google DeepMind has introduced Gemma Scope, a groundbreaking suite of tools designed to illuminate the decision-making processes of its Gemma 2 models.
Understanding LLM Activations with Sparse Autoencoders
When a language model processes input, it navigates through an intricate network of artificial neurons. The resulting values, termed "activations," represent how the model comprehends the input and forms its responses.
By analyzing these activations, researchers can glean insights into the information processing and decision-making capabilities of LLMs. Ideally, this analysis helps identify which neurons correspond to specific concepts. However, the vast number of neurons—often numbering in the billions—complicates this task. Each inference generates a complex array of activation values across multiple model layers, with myriad activations tied to various concepts.
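To make the notion of "activations" concrete, the sketch below captures the output of one layer of a toy PyTorch model with a forward hook, which is a common way researchers record intermediate values during inference. The tiny model and layer name are illustrative stand-ins, not part of Gemma 2.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer sublayer; in a real LLM this would be the
# residual-stream or MLP output at a chosen layer.
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

captured = {}

def save_activations(module, inputs, output):
    # Store a detached copy of the layer's output -- these are the "activations".
    captured["layer_0"] = output.detach()

# Register the hook on the first linear layer, run a dummy input, then clean up.
handle = model[0].register_forward_hook(save_activations)
tokens = torch.randn(1, 8, 16)   # (batch, sequence, hidden) dummy input
_ = model(tokens)
handle.remove()

print(captured["layer_0"].shape)  # torch.Size([1, 8, 64])
```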
A primary method for interpreting these activations involves sparse autoencoders (SAEs), a core tool of "mechanistic interpretability," the study of how a model's internal computations give rise to its behavior. SAEs are designed to condense input activations into a manageable set of features and reconstruct the original activations from these features, making it easier to see which features a given input triggers inside the LLM.
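The following is a minimal sketch of that encode/decode idea, not DeepMind's implementation: activations are projected into a much wider, mostly-zero feature vector and then reconstructed, with an L1 penalty (a common choice) encouraging sparsity. All dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class TinySparseAutoencoder(nn.Module):
    """Minimal SAE: expands activations into a wider, mostly-zero feature
    vector, then reconstructs the original activations from it."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Encode: project up and apply ReLU so most features stay at zero.
        features = torch.relu(self.encoder(activations))
        # Decode: reconstruct the original activations from the sparse features.
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = TinySparseAutoencoder(d_model=64, d_features=512)
acts = torch.randn(8, 64)                      # pretend layer activations
features, recon = sae(acts)
# Reconstruction error plus an L1 sparsity penalty on the features.
loss = torch.mean((recon - acts) ** 2) + 1e-3 * features.abs().sum()
```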
Introducing Gemma Scope
While previous SAE research has mainly targeted smaller models or specific layers, DeepMind's Gemma Scope adopts a holistic approach. It offers SAEs for every layer and sublayer of the Gemma 2 models, encompassing over 400 SAEs that collectively represent more than 30 million learned features. This comprehensive framework enables researchers to explore how features evolve and interact across layers, yielding a deeper understanding of the model’s decision-making process.
DeepMind emphasizes that "this tool will enable researchers to study how features evolve throughout the model and interact to form more complex features."
Gemma Scope utilizes DeepMind’s innovative JumpReLU SAE architecture. Traditional SAE architectures use a rectified linear unit (ReLU) function to enforce sparsity, zeroing out activation values below a certain threshold. While effective for identifying significant features, this approach complicates the estimation of feature strength, as lower values are discarded.
JumpReLU overcomes this limitation by allowing the SAE to learn a unique activation threshold for each feature. This adjustment makes it easier for the SAE to balance detecting which features are present with estimating their strength, while keeping the number of active features low and improving reconstruction fidelity.
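As a rough illustration of the idea described above, the sketch below zeroes out pre-activations that fall below a per-feature learned threshold and passes larger values through unchanged. It is a simplified approximation: the actual JumpReLU training relies on gradient estimators for the thresholds, which this sketch omits.

```python
import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    """Sketch of the JumpReLU idea: each feature has its own learned
    threshold; values below it are zeroed, values above it keep their
    magnitude (unlike a fixed cutoff at zero)."""

    def __init__(self, num_features: int):
        super().__init__()
        # One learnable threshold per feature, log-parameterised to stay positive.
        self.log_threshold = nn.Parameter(torch.zeros(num_features))

    def forward(self, pre_activations: torch.Tensor) -> torch.Tensor:
        threshold = self.log_threshold.exp()
        # Keep a feature's value only if it exceeds that feature's threshold.
        return pre_activations * (pre_activations > threshold).float()

jump = JumpReLU(num_features=512)
pre = torch.randn(8, 512)            # pretend SAE pre-activations
sparse_features = jump(pre)
```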
Moving Toward Robust and Transparent LLMs
DeepMind has made Gemma Scope publicly accessible on Hugging Face, fostering further interpretability research. “We hope today’s release enables more ambitious interpretability research,” DeepMind states. Such efforts hold promise for developing more robust AI systems, enhancing safeguards against model hallucinations, and mitigating risks associated with autonomous AI behavior.
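For readers who want to experiment, the weights can be fetched directly from Hugging Face with the standard huggingface_hub client. The repository and file path below are assumptions based on the public release and may differ from the exact names DeepMind uses; check the Gemma Scope model pages for the correct ones.

```python
import numpy as np
from huggingface_hub import hf_hub_download

# NOTE: repo_id and filename are illustrative assumptions, not verified paths.
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # assumed repo name
    filename="layer_20/width_16k/average_l0_71/params.npz",   # assumed file path
)

params = np.load(path)
print(list(params.keys()))  # inspect the stored SAE parameter arrays
```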
As LLMs continue to evolve and find applications across enterprises, AI labs are striving to create tools that enhance understanding and control of these models. SAEs, exemplified by those in Gemma Scope, represent a promising avenue for discovering and mitigating unwanted behavior in LLMs, such as biased content generation.
Gemma Scope's release positions researchers to address various challenges, including detecting and remedying LLM jailbreaks and steering model behavior. Other organizations, like Anthropic and OpenAI, are advancing their own SAE research, alongside exploring non-mechanistic techniques to decode LLM inner workings, such as OpenAI's recent prover-verifier approach, which trains models to produce outputs that are easier to verify and understand.