The Search for Next-Gen AI Architectures: Beyond Transformers
After years of dominance by transformer-based AI architectures, the search for compelling alternatives is heating up. Transformers underpin OpenAI's video-generating Sora as well as text-generating giants like Anthropic's Claude, Google's Gemini, and OpenAI's GPT-4o. But these models are increasingly running into technical roadblocks, particularly around computational efficiency.
Transformers struggle to process and analyze large datasets efficiently when deployed on standard hardware. Consequently, companies face unsustainable spikes in power consumption as they scale their infrastructure to meet transformers' needs.
One promising newcomer is test-time training (TTT), an architecture developed over 18 months by a collaborative team from Stanford, UC San Diego, UC Berkeley, and Meta. The researchers claim that TTT models can process far more data than transformers while consuming far less computational power.
Understanding the Hidden State in Transformers
At the core of transformers lies the "hidden state," essentially a long list of entries that the model uses to retain what it has seen. As a transformer processes input, it keeps appending to this hidden state in order to "remember" earlier context. For example, as the model works through a book, the hidden state accumulates representations of the words and phrases it has read.
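To make that "growing list" picture concrete, here is a minimal, purely illustrative Python sketch, not taken from the paper; the class name ToyTransformerState and its dimensions are hypothetical. It models a hidden state that appends one entry per token it reads, so its memory footprint grows without bound as the input gets longer.

```python
# Illustrative sketch (not the researchers' code): a transformer-style
# "hidden state" modeled as a cache that gains one entry per processed token.

import numpy as np

class ToyTransformerState:
    def __init__(self, d_model: int = 64):
        self.d_model = d_model
        self.entries = []  # one entry per token read so far; grows without bound

    def process_token(self, token_embedding: np.ndarray) -> None:
        # Each new token adds another entry the model must later scan over.
        self.entries.append(token_embedding)

    def memory_size(self) -> int:
        # Memory grows linearly with the number of tokens processed.
        return len(self.entries) * self.d_model

state = ToyTransformerState()
for _ in range(10_000):  # e.g., tokens from a long book
    state.process_token(np.random.randn(state.d_model))
print(state.memory_size())  # 640,000 floats, and still growing
```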
Yu Sun, a postdoctoral researcher at Stanford and co-author of the TTT study, explained, “If you envision a transformer as an intelligent entity, the hidden state functions as its brain.” This unique brain-like structure supports the well-known capabilities of transformers, including in-context learning.
However, the hidden state can also become a bottleneck. For a transformer to generate a simple response regarding a book it has read, it must traverse its entire lookup table, which is as computationally intensive as rereading the entire book.
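Continuing the toy example, the sketch below illustrates that bottleneck: producing a single output token means computing attention scores against every entry in the cache, so the per-token cost scales with everything read so far. This is a simplified, hypothetical rendering of dot-product attention, not the paper's implementation.

```python
# Illustrative sketch of the bottleneck described above: every generated
# token requires one pass over the *entire* cache of prior entries.

import numpy as np

def attend(query: np.ndarray, cache: list[np.ndarray]) -> np.ndarray:
    keys = np.stack(cache)                  # (n_tokens, d_model)
    scores = keys @ query                   # one dot product per cached entry
    weights = np.exp(scores - scores.max()) # softmax over the whole history
    weights /= weights.sum()
    return weights @ keys                   # weighted summary of the context

d_model, book_tokens = 64, 100_000
cache = [np.random.randn(d_model) for _ in range(book_tokens)]
query = np.random.randn(d_model)
summary = attend(query, cache)  # ~100,000 dot products for one output token
```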
Sun and his team propose replacing the hidden state with a machine learning model in its own right, which Sun likens to "nested dolls of AI": a model within a model. Unlike a transformer's ever-growing list, the TTT model's internal model doesn't expand as it processes additional data; it instead encodes what it sees into representative variables called weights. No matter how much data a TTT model handles, the size of its internal model stays fixed, which is what the researchers credit for its efficiency.
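The sketch below is one way to picture that "model within a model" idea, under my own simplifying assumptions rather than the authors' actual formulation: the hidden state is a small linear model whose fixed-size weight matrix is nudged by a gradient step on each incoming token, so memory stays constant however long the input runs.

```python
# Minimal sketch of the "nested model" idea (a simplification, not the TTT
# authors' code): the hidden state is a small linear model whose fixed-size
# weights W are updated by a gradient step per token, instead of appending
# that token to a growing cache.

import numpy as np

class ToyTTTState:
    def __init__(self, d_model: int = 64, lr: float = 0.01):
        self.W = np.zeros((d_model, d_model))  # fixed-size "inner" model
        self.lr = lr

    def process_token(self, x: np.ndarray) -> None:
        # Self-supervised inner update: nudge W so that W @ x better
        # reconstructs x (one illustrative choice of inner objective).
        error = self.W @ x - x
        grad = np.outer(error, x)
        self.W -= self.lr * grad               # memory stays d_model x d_model

    def query(self, x: np.ndarray) -> np.ndarray:
        return self.W @ x                      # constant cost per query

state = ToyTTTState()
for _ in range(100_000):                       # however long the input is...
    state.process_token(np.random.randn(64) * 0.1)
print(state.W.shape)                           # ...the state is still (64, 64)
```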
Sun is optimistic that future TTT models could efficiently analyze billions of data elements, from text to images, audio recordings, and even video, far exceeding the capabilities of current AI systems.
“Our system allows us to generate X words about a book without the computational complexity of rereading the book X times,” Sun noted. “Unlike transformer-based models such as Sora, which can only process 10 seconds of video due to their lookup table structure, our goal is to create a system capable of processing extended videos that emulate the visual experiences of a human life.”
Evaluating the Future of TTT Models
Could TTT models eventually replace transformers? It's a possibility, but it may be too soon to tell. Current TTT models are not direct substitutes for existing transformers, and researchers have only developed a couple of small models for initial investigations, making it challenging to benchmark TTT effectively against larger transformer implementations.
“I find this innovation fascinating. If future data supports the claim of efficiency gains, that would be great, but I can’t definitively say if TTT models outperform existing architectures,” remarked Mike Cook, a senior lecturer in King’s College London’s informatics department, who was not involved in the TTT research. “A former professor of mine used to joke that to solve any computer science problem, just add another layer of abstraction. A neural network within a neural network certainly echoes that sentiment.”
Still, the growing momentum behind research into transformer alternatives reflects a widening recognition that a breakthrough in AI architecture is needed.
This week, the AI startup Mistral launched Codestral Mamba, an innovative model based on state space models (SSMs), another alternative to transformers. SSMs, similar to TTT models, promise enhanced computational efficiency and scalability for larger datasets.
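For comparison, here is a rough, hypothetical sketch of the state space model idea behind architectures like Mamba: the sequence is absorbed into a fixed-size recurrent state that is updated once per token, rather than attended over as a growing cache. The parameter values below are arbitrary and purely illustrative.

```python
# Toy discretized linear state-space recurrence, illustrating why SSM-style
# models scale differently: a fixed-size state h is updated once per token.

import numpy as np

d_state, d_model = 16, 64
A = np.eye(d_state) * 0.9                    # state transition (toy values)
B = np.random.randn(d_state, d_model) * 0.1  # input projection
C = np.random.randn(d_model, d_state) * 0.1  # output projection

h = np.zeros(d_state)                        # fixed-size state, akin to TTT's weights
for x in np.random.randn(100_000, d_model):  # one cheap update per token
    h = A @ h + B @ x
    y = C @ h                                # output depends only on h, not the full history
print(h.shape)                               # (16,) no matter how long the input
```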
AI21 Labs and Cartesia are also investigating SSMs, with Cartesia having pioneered some of the earliest, including Mamba and Mamba-2. Should these efforts succeed, they could make generative AI more accessible and widespread than ever, for better and for worse.