Convolutional Neural Networks (CNNs) have long been the dominant architecture for computer vision tasks, particularly image classification. Recently, Vision Transformers (ViTs) have emerged as a compelling alternative thanks to their strong accuracy and efficiency at scale. However, research from Google DeepMind reveals that CNNs and ViTs can achieve comparable results, with the amount of compute used during training being the decisive factor.
This insight suggests that organizations with computer vision requirements need not transition to the ViT architecture to achieve top-tier accuracy. Instead, given ample data and computational resources, CNN performance improves in a predictable manner, which means investing in larger models and robust training infrastructure can yield substantial returns.
In their study, "ConvNets Match Vision Transformers at Scale," the researchers showed that an advanced CNN architecture, NFNet, pre-trained on a dataset of roughly four billion labeled images, reached performance on par with comparable ViT systems. The largest training runs consumed up to 110,000 hours on Google's TPU chips and matched the accuracy reported for existing ViT models.
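For teams that want to experiment hands-on, ported NFNet checkpoints are publicly available. Below is a minimal sketch using the `timm` library's `dm_nfnet_f0` model, one of the publicly released NFNet checkpoints (not necessarily the exact model scaled up in the DeepMind study); the random input tensor is purely a smoke test.

```python
import timm
import torch

# Load a pretrained NFNet ported to timm. 'dm_nfnet_f0' is the smallest
# of the publicly released NFNet checkpoints; larger variants (f1-f6)
# follow the same naming scheme.
model = timm.create_model("dm_nfnet_f0", pretrained=True)
model.eval()

# Smoke test with a random image-sized tensor (batch of 1, RGB, 256x256).
dummy = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    logits = model(dummy)

print(logits.shape)  # torch.Size([1, 1000]) -- ImageNet-1k class logits
```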
Yann LeCun, Chief AI Scientist at Meta and a recipient of the Turing Award, highlighted in a post on social media that these findings underscore the importance of computational resources. He emphasized that both CNNs and ViTs have significant roles in the landscape of computer vision.
**Key Insights:**
1. **Choice of Architecture**: The research indicates that the selection between CNNs and ViTs for computer vision applications is nuanced. CNNs remain a viable and effective option, especially when supplemented with adequate resources.
2. **Computational Scaling**: As the compute budget for training NFNet models increased, performance on held-out validation data improved along a log-log scaling law: validation loss falls as a power law in training compute, so each multiplicative increase in compute buys a predictable multiplicative reduction in loss. This regularity lets model developers plan scaling strategies in advance (see the sketch after this list).
3. **Predictable Gains**: The study found that increasing the compute budget yields consistent, predictable accuracy gains for CNNs, with no sign of the scaling law breaking down across the compute range tested.
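To make the log-log scaling law concrete, here is a short sketch that fits a power law, loss ≈ a · compute^(−b), to hypothetical (compute, validation-loss) pairs. The numbers are illustrative placeholders, not figures from the paper:

```python
import numpy as np

# Hypothetical (compute, loss) measurements -- illustrative only,
# NOT values reported in the DeepMind paper.
compute = np.array([0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3])  # TPU core-hours
val_loss = np.array([2.80, 2.55, 2.32, 2.11, 1.92])

# A log-log scaling law means log(loss) is linear in log(compute),
# i.e. loss ~= a * compute**(-b). Fit the line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(val_loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"loss ~= {a:.2f} * compute^(-{b:.3f})")

# The fit lets you budget ahead: predict the loss at a compute level
# beyond the measured range (here, 220k core-hours).
print(f"predicted loss at 220k core-hours: {a * (220e3) ** (-b):.2f}")
```

Under such a law, every doubling of compute shrinks the loss by the same factor (2^(−b)), which is exactly what makes scaling budgets predictable.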
The researchers argued, “Although the advancements of ViTs in the field are remarkable, there is no substantial evidence that pre-trained ViTs surpass pre-trained ConvNets in a fair evaluation.” They concluded that the critical determinants of model performance are primarily the amount of compute and the quality of data available during training.
Ultimately, the research by Google DeepMind offers significant validation for organizations already leveraging CNNs, suggesting that with the right investment in computational resources, these models can continue to deliver exceptional results in computer vision tasks.