In the evolving landscape of multi-modal applications, a new player has emerged alongside Transformer models: large-kernel CNNs. A collaborative effort between Tencent AI Lab and the Chinese University of Hong Kong has led to the development of a groundbreaking CNN architecture that outperforms Transformer models in both image recognition accuracy and processing speed. This innovative architecture seamlessly adapts to other modalities, such as point clouds, audio, and video, requiring only simple preprocessing to achieve or exceed state-of-the-art (SOTA) performance.
The team outlined four key guidelines for designing large-kernel CNN architectures and created a powerful model known as UniRepLKNet. Pre-training this model on the ImageNet-22K dataset yielded impressive results: 88% top-1 accuracy on ImageNet, 56.4 average precision (AP) on COCO, and 55.6 mean intersection over union (mIoU) on ADE20K. Its effectiveness extends to time series forecasting, where it surpasses previous Transformer-based SOTA models in predicting global temperature and wind speed.
Before diving into the intricacies of UniRepLKNet, the authors tackled two essential questions. First, why continue to explore CNNs in a Transformer-focused era? They argue that both models represent valuable design philosophies, each with strengths in different areas. The emergence of models like ConvNeXt and RepLKNet has demonstrated that CNNs remain potent contenders in various tasks.
Second, how can a CNN designed for images be adapted for audio, video, point clouds, and time series data? UniRepLKNet retains its core architecture while transforming other modalities into embedding maps formatted as C×H×W (channels × height × width). For example, an audio spectrogram can be treated as a single-channel image, while video frames can be tiled into one larger image. This flexible approach has yielded remarkable results across all tested modalities.
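To make the idea concrete, here is a minimal sketch of this kind of preprocessing in PyTorch. The tensor shapes, the 4×4 frame grid, and the variable names are illustrative assumptions, not the paper's exact pipeline:

```python
import torch

# Audio: treat a spectrogram (frequency x time) as a single-channel image.
spectrogram = torch.randn(128, 1024)              # (freq_bins, time_steps); shapes are hypothetical
audio_map = spectrogram.view(1, 1, 128, 1024)     # (N, C=1, H, W)

# Video: tile T frames into one large image so a 2D CNN can process them.
frames = torch.randn(16, 3, 224, 224)             # (T, C, H, W)
grid = frames.reshape(4, 4, 3, 224, 224)          # arrange the 16 frames in a 4x4 grid
grid = grid.permute(2, 0, 3, 1, 4)                # (C, rows, H, cols, W)
video_map = grid.reshape(1, 3, 4 * 224, 4 * 224)  # (N, C=3, H=896, W=896)

print(audio_map.shape, video_map.shape)
```

Once the data is in C×H×W form, the network itself needs no modality-specific changes.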
In 2022, RepLKNet pioneered the use of large convolutional kernels (ranging from 13×13 to 31×31) and introduced several design principles. Its macro architecture, however, was borrowed from the Swin Transformer, and traditional convolutional networks still rely on stacks of small kernels, conflating receptive-field expansion with feature abstraction. This paper argues for decoupling these factors and proposes a distinct structure for each goal: large kernels to widen the receptive field, small depthwise convolutions for feature abstraction, and efficient designs, such as SE blocks or bottleneck structures, to add depth and representational capacity.
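One reason this decoupling is affordable: a large kernel applied depthwise costs fewer parameters than even a small dense convolution at the same width. A quick PyTorch check (the channel width of 256 is an arbitrary choice for illustration):

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

c = 256  # channel width, chosen arbitrarily for illustration

# Depthwise 31x31: each channel gets its own 31x31 filter, no channel mixing.
dw_large = nn.Conv2d(c, c, kernel_size=31, padding=15, groups=c)

# Dense 3x3: every output channel mixes all input channels.
dense_small = nn.Conv2d(c, c, kernel_size=3, padding=1)

print(n_params(dw_large))     # 246,272  (256 * 31 * 31 weights + 256 biases)
print(n_params(dense_small))  # 590,080  (256 * 256 * 3 * 3 weights + 256 biases)
```

The 31×31 depthwise layer buys a far wider receptive field at less than half the parameter cost of a dense 3×3 convolution, which is why feature mixing and abstraction can be delegated to other, cheaper structures.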
UniRepLKNet embodies these architectural principles, featuring blocks that combine large-kernel depthwise convolutions, SE blocks, and feed-forward networks (FFNs). This structure efficiently integrates large kernels with smaller depthwise convolutions, boosting performance without excessive depth.
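Below is a simplified sketch of such a block, assuming PyTorch. The class name, the normalization choice, and the expansion ratios are our assumptions, and it omits details of the authors' actual implementation (notably the reparameterized large-kernel convolution):

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Illustrative UniRepLKNet-style block (not the authors' code): a
    large-kernel depthwise conv for the receptive field, an SE block for
    channel re-weighting, and an FFN for feature abstraction."""

    def __init__(self, dim: int, kernel_size: int = 13,
                 se_ratio: int = 4, ffn_ratio: int = 4):
        super().__init__()
        # Depthwise large-kernel conv: wide receptive field, few parameters.
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        # Squeeze-and-Excitation: global pooling -> bottleneck MLP -> channel gates.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // se_ratio, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // se_ratio, dim, 1), nn.Sigmoid(),
        )
        # FFN: two 1x1 convs with expansion, as in Transformer-style blocks.
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * ffn_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * ffn_ratio, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(self.dw(x))
        x = x + y * self.se(y)  # residual branch, gated channel-wise by SE
        x = x + self.ffn(x)     # residual FFN
        return x

x = torch.randn(1, 64, 56, 56)
print(LargeKernelBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Stacking a few such blocks yields a wide receptive field early in the network, so depth can be spent on abstraction rather than on gradually growing the field of view.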
Performance assessments show that UniRepLKNet excels at traditional image classification, achieving notable improvements over contemporary models when pre-trained on ImageNet-22K. For instance, UniRepLKNet-XL attained 88% accuracy on ImageNet while processing images three times faster than DeiT III-L. In COCO object detection, UniRepLKNet-L lagged slightly behind InternImage-L, but UniRepLKNet-XL outperformed InternImage-XL. On ADE20K segmentation, it reached a best mIoU of 55.6, surpassing ConvNeXt-XL by 1.6 points.
The model's capabilities extend to time series data, as evidenced by its success in forecasting global temperatures and wind speeds. Despite being designed primarily for image tasks, UniRepLKNet outperformed the previous SOTA model, CorrFormer, underscoring the surprising versatility of CNNs in this domain.
These findings indicate that the promise of large-kernel CNNs remains largely untapped. Even in domains traditionally dominated by Transformers, such as unified multi-modal modeling, large-kernel CNNs exhibit remarkable strength. Notably, merely reducing the kernel size from 13 to 11 caused significant performance drops across all modalities, highlighting how much the kernel size matters. The authors encourage further exploration by making their code and experimental scripts publicly accessible.