Controversial New Finding from Google DeepMind on Transformers: Is Progress Toward AGI Delayed?

The findings of a recent study have not yet undergone extensive validation, but they have certainly captured the attention of industry experts. François Chollet, the creator of Keras, remarked that if the claims are accurate, they could significantly alter the landscape of large models. Google's Transformer architecture is foundational to the current generation of large models; the "T" in GPT refers to this very technology. Many large models demonstrate impressive in-context learning and can quickly adapt to new tasks. However, Google researchers appear to have uncovered a critical weakness: the models' performance falters on data that lies outside their training distribution or existing knowledge, leading many professionals to believe that Artificial General Intelligence (AGI) remains out of reach.

Some online commentators pointed out key details in the study that were overlooked: notably, the experiments were conducted on a GPT-2-scale model, and the training data consisted of function values rather than natural language. As more in-depth analyses of the paper appeared, a common view emerged that the research's findings are valid but that there has been a tendency to overinterpret their implications. Following the widespread discussion, one of the authors clarified two main points: first, the experiments employed a simple Transformer model, not a large model or a language model; second, the model can learn new tasks, but its ability to generalize to new task types is limited.

In one instance, a researcher named Samuel replicated the experiment in a Colab environment and obtained results that contradicted the original conclusions. Let's delve into the paper and into Samuel's divergent findings.

In the experiment, the authors used a JAX-based machine learning framework to train a decoder-only Transformer close in scale to GPT-2. The model had 12 layers, 8 attention heads, an embedding dimension of 256, and roughly 9.5 million parameters. To assess its generalization ability, the researchers trained it on linear and sine functions; both families were familiar to the model, and it predicted them well. But when the researchers formed a convex combination of the two functions, the model's performance declined sharply. The constructed function was \(f(x) = a \cdot kx + (1 - a)\sin(x)\). The operation may appear straightforward, yet for a model that had only ever seen linear and sine functions, it posed a genuinely new challenge.
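
To make the setup concrete, here is a minimal sketch (not the authors' code) of how the three function families described above could be sampled for in-context learning. The helper names, sampling ranges, and the context length of 40 points are illustrative assumptions; only the functional forms come from the paper as reported.

```python
import jax
import jax.numpy as jnp

def sample_linear(key, n_points=40):
    """In-context examples of y = k * x with a randomly drawn slope k."""
    k_key, x_key = jax.random.split(key)
    k = jax.random.uniform(k_key, (), minval=-2.0, maxval=2.0)
    x = jax.random.uniform(x_key, (n_points,), minval=-5.0, maxval=5.0)
    return x, k * x

def sample_sine(key, n_points=40, freq=1.0):
    """In-context examples of y = sin(freq * x)."""
    x = jax.random.uniform(key, (n_points,), minval=-5.0, maxval=5.0)
    return x, jnp.sin(freq * x)

def sample_convex_mix(key, a, n_points=40, freq=1.0):
    """The held-out combination y = a * k * x + (1 - a) * sin(freq * x)."""
    k_key, x_key = jax.random.split(key)
    k = jax.random.uniform(k_key, (), minval=-2.0, maxval=2.0)
    x = jax.random.uniform(x_key, (n_points,), minval=-5.0, maxval=5.0)
    return x, a * k * x + (1.0 - a) * jnp.sin(freq * x)

# The in-context prompt is the sequence of (x, y) pairs with the final y
# withheld; the Transformer is scored on how well it predicts that value.
key = jax.random.PRNGKey(0)
xs, ys = sample_convex_mix(key, a=0.5)
```

The point of the construction is that every individual component is familiar to the model, yet the mixture itself never appears in training.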

Confronted with this new function, the Transformer's predictions were far off the mark, leading the authors to conclude that the model lacks the ability to generalize across function families. To validate this, the researchers varied the weights assigned to the linear and sine components, but the model's predictive performance showed no marked improvement. The only exception was when one weight approached 1, at which point the new function essentially collapsed into one of the functions seen during training, which made the test of generalization moot.
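
A sweep over the mixing weight, as described above, could be sketched as follows. Here `model_predict` is a placeholder for the trained model's in-context prediction call (not an API from the paper), and `sample_convex_mix` is the helper from the previous sketch.

```python
import jax

def mse(pred, target):
    """Squared error between the model's guess and the true final value."""
    return float(((pred - target) ** 2).mean())

def sweep_convex_weights(model_predict, key, weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Evaluate the same trained model from pure sine (a=0) to pure linear (a=1)."""
    errors = {}
    for a in weights:
        key, subkey = jax.random.split(key)
        xs, ys = sample_convex_mix(subkey, a=a)         # helper from the sketch above
        pred = model_predict(xs[:-1], ys[:-1], xs[-1])  # predict the withheld final y
        errors[a] = mse(pred, ys[-1])
    return errors
```

On the paper's account, the error stays high except near a = 0 or a = 1, where the mixture degenerates into a function family seen in training.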

Further experiments indicated that Transformers are not only sensitive to the types of functions but also to variations in frequency within the same function. The researchers noted that altering the frequency of the sine function caused significant fluctuations in the model's predictions. Only when the frequency closely matched the functions seen in training did the model make relatively accurate predictions. Deviations in frequency, either too high or too low, led to serious inaccuracies. Thus, the authors concluded that even slight changes could hinder the large model’s performance, pointing to a lack of generalization.
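
The frequency probe could be sketched the same way; the training frequency is assumed here to be 1.0, and `model_predict`, `sample_sine`, and `mse` are the placeholders and helpers from the sketches above.

```python
import jax

def sweep_sine_frequency(model_predict, key, freqs=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Keep the task 'predict sin(freq * x)' but move freq away from training."""
    errors = {}
    for freq in freqs:
        key, subkey = jax.random.split(key)
        xs, ys = sample_sine(subkey, freq=freq)         # helper from the first sketch
        pred = model_predict(xs[:-1], ys[:-1], xs[-1])
        errors[freq] = mse(pred, ys[-1])
    return errors
```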

The authors acknowledged the limitations of their study and noted that extending these experiments to language models raises difficulties of its own, such as how to define task families and convex combinations of tasks. Meanwhile Samuel, the researcher mentioned earlier, used a much smaller model with just 4 layers and achieved generalization on the combination of linear and sine functions after a mere 5 minutes of training on Colab.

Overall, the CEO of Quora argued that the paper's conclusions appear somewhat narrow and contingent on specific assumptions. UCLA professor Gu Quanquan echoed this sentiment, stating that the paper's conclusions are valid but should not be overstated. Existing research indicates that Transformers often struggle with content that deviates significantly from their pre-training data, and the generalization ability of a large model is frequently gauged by the diversity and complexity of its tasks.

Even if large models exhibit shortcomings in generalization, how concerning is that? Jim Fan, an AI scientist at Nvidia, noted that this is not surprising. Transformer models are not universally applicable; their impressive performance hinges on training data that aligns with our needs. He likened the situation to training a visual model on a billion cat and dog photos and then expecting it to correctly identify airplanes—encountering difficulty in recognition would be expected. This isn't solely an issue for large models; humans also struggle to find solutions in unfamiliar tasks, raising the question of whether humans too lack generalization abilities.

Ultimately, whether for large models or for humans, the overarching goal is solving problems; generalization ability is merely a means to that end. One might even quip that if generalization falls short, the fix is simply to fold the previously unseen data into training. What are your thoughts on this research?
