The Evolution and Challenges of Large Language Models
Large language models (LLMs) have reshaped the narrative landscape, enabling the creation of vivid stories and sustained dialogue with users. Visual models can recognize, categorize, and generate images, earning them a reputation as digital artists, while multi-modal models process text, images, audio, and video within a single system. Built on the Transformer architecture and scaled from millions to trillions of parameters, these models are making significant impacts across many facets of human society.
However, this advancement brings new challenges, including substantial computational power requirements, high energy consumption during training and inference, and concerns over data quality. As a result, researchers find themselves at a crossroads in the development of these models.
The Need for Layered Memory in Future Models
Current large models can associate related terms and predict plausible continuations, yet they still fall short of the human brain's predictive capabilities. The difference lies in "layered memory." When different kinds of knowledge (meta-knowledge, high-frequency information, and low-frequency information) enter the brain, they are consolidated into implicit memory (conditioned reflexes), explicit memory (conscious recollections), and working memory (temporary storage). For next-generation LLMs to achieve greater intelligence, they too need a layered approach.
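One way to make the analogy concrete in software is a tiered store: a small working buffer in front of an explicit, searchable log, with "implicit" knowledge treated as a fixed lookup standing in for what a trained model keeps in its weights. The sketch below is purely illustrative; the tier names, capacity, and consolidation rule are assumptions, not a description of any existing model.

```python
from collections import deque


class LayeredMemory:
    """Illustrative three-tier memory (an assumption, not an existing design).

    Working memory : small, fast buffer, evicted first (like a context window).
    Explicit memory: append-only log that can be searched on demand.
    Implicit memory: knowledge assumed to live in frozen model weights,
                     represented here only by a read-only lookup table.
    """

    def __init__(self, working_capacity=8, implicit_facts=None):
        self.working = deque(maxlen=working_capacity)  # temporary storage
        self.explicit = []                             # conscious recollections
        self.implicit = dict(implicit_facts or {})     # conditioned reflexes

    def observe(self, item):
        """New input enters working memory; the oldest item is consolidated on overflow."""
        if len(self.working) == self.working.maxlen:
            self.explicit.append(self.working[0])      # move oldest to the explicit log
        self.working.append(item)

    def recall(self, query):
        """Check implicit knowledge first, then the explicit log, then the working buffer."""
        hits = []
        if query in self.implicit:
            hits.append(self.implicit[query])
        hits += [m for m in self.explicit if query in m]
        hits += [m for m in self.working if query in m]
        return hits
```

In a real system the explicit tier would be a retrieval index and the implicit tier the frozen model itself; the point here is only the division of labor between layers.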
"To advance the next generation of large models, we must effectively utilize comprehensive data while minimizing costs and energy consumption," stated Academician He Weinan of the Chinese Academy of Sciences. During the forum on "Transcending Boundaries: Exploring Fundamental Research in Next-Gen Large Models," He outlined a four-layer technical framework for artificial intelligence. The first layer is a universal AI database that actively engages in data analysis and decision-making, encompassing structured, unstructured, and semi-structured data. Building upon this, the second layer integrates general models with specialized knowledge bases, fostering depth and precision in various fields. These elements combine to create a third layer of intelligent agents (smaller models) that leverage the extensive capabilities of larger models.
The Three Laws Guiding Large Model Development
According to Zhou Bin, CTO of Huawei's Ascend Computing, three fundamental laws govern the development of large models:
1. Scaling Law: A model's size determines its potential capabilities. Research shows that larger models perform better, with performance improving predictably as computational power, data, and parameter counts grow together.
2. Chinchilla Law: Within a fixed computational budget, there is an optimal balance between model size and the volume of training data; model parameters and training tokens should be scaled roughly in proportion (a rough numerical sketch follows this list).
3. Emergent Abilities: Certain capabilities only appear once training compute passes a threshold. Empirical studies suggest that LLMs begin to show significant emergent abilities after roughly 10^22 floating-point operations.
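As a rough numerical illustration of laws 2 and 3, the sketch below splits a compute budget between parameters and tokens using two rules of thumb from the scaling-law literature (training compute C ~ 6·N·D, and roughly 20 training tokens per parameter), then compares the budget with the ~10^22 FLOPs emergence threshold mentioned above. The constants are approximations drawn from that literature, not figures reported in this article.

```python
import math

# Rule-of-thumb constants (assumptions from the scaling-law literature):
# training compute C ~ 6 * N * D, and a Chinchilla-style optimum of
# roughly 20 training tokens per parameter.
FLOPS_PER_PARAM_TOKEN = 6
TOKENS_PER_PARAM = 20
EMERGENCE_FLOPS = 1e22  # approximate threshold cited in the text


def chinchilla_optimal(compute_budget_flops):
    """Split a compute budget into (parameters N, tokens D) with D ~ 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(compute_budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


for budget in (1e21, 1e22, 1e24):
    n, d = chinchilla_optimal(budget)
    emergent = "above" if budget >= EMERGENCE_FLOPS else "below"
    print(f"C={budget:.0e} FLOPs -> N~{n:.2e} params, D~{d:.2e} tokens ({emergent} emergence threshold)")
```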
As models approach the trillion-parameter mark, these laws are expected to continue to hold. Architectures designed to handle ultra-long sequences, exemplified by Gemini and Sora, are anticipated to become standard in large-model design.
The Challenges of Next-Generation Computational Power
"Planning for the next generation of infrastructure innovations is crucial to enhance AI capabilities," asserts Frank Shaw, Microsoft's Chief Communication Officer. This ambition necessitates unprecedented investments, computational resources, and energy usage, presenting several challenges for the evolution of large models.
From a computational perspective, the demands of training a single model have grown exponentially: between GPT-2 and GPT-4, the required computational power surged by a factor of 3,000 to 10,000. Over the past decade, demand for training compute has risen roughly threefold each year, and by 2027 the value of a single AI cluster could reach billions of dollars.
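Taking the reported threefold annual growth at face value, a short compounding calculation shows how quickly such demand accumulates; the snippet is only an illustration of the stated rate, not independent data.

```python
# Compounding the reported ~3x annual growth in demand for training compute.
ANNUAL_GROWTH = 3.0

for years in (1, 5, 10):
    factor = ANNUAL_GROWTH ** years
    # e.g. 3^10 = 59,049x over a decade at the stated rate
    print(f"after {years} year(s): ~{factor:,.0f}x the starting compute demand")
```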
In terms of data, Zhou noted that high-quality language data is expected to be exhausted by 2026. Lower-quality data may last until around 2040, while image datasets, currently growing at roughly 18% to 31% per year, are projected to be depleted between 2030 and 2060. As AI models grow larger, the bandwidth demands of individual NPU/GPU chips rise rapidly, outpacing the growth of conventional switching chips; this escalating need for interconnect further complicates the infrastructure required for next-generation models.
Innovative Pathways to Intelligent Computing
As we ponder the next pivotal moment in AI development, Zhou Bin posits that we may enter an era defined by the automation of AI research. This involves utilizing AI to autonomously advance its own research and development. He envisions a trajectory where intelligent computing evolves beyond the limits of conventional architectures, pushing towards new computational paradigms.
Future advances will come from fundamental innovations along five dimensions: storage, transmission, computation, energy consumption, and materials. These innovations may span electronic to quantum computing and will require hybrid models that integrate specialized knowledge from diverse scientific fields, strengthening LLMs' online learning and reinforcement learning capabilities and positioning them for sustained evolution in intelligent computing.