AI Models Stolen? Tencent's New Research Reveals: CNN Multimodal Capabilities Rival Transformers

Updated on November 14 2024

In the evolving landscape of multi-modal applications, a new player has emerged alongside Transformer models: large-kernel CNNs. A collaborative effort between Tencent AI Lab and the Chinese University of Hong Kong has led to the development of a groundbreaking CNN architecture that outperforms Transformer models in both image recognition accuracy and processing speed. This innovative architecture seamlessly adapts to other modalities, such as point clouds, audio, and video, requiring only simple preprocessing to achieve or exceed state-of-the-art (SOTA) performance.

The team outlined four key guidelines for designing large-kernel CNN architectures and created a powerful model known as UniRepLKNet. Pre-training this model on the ImageNet-22K dataset yielded impressive results, including 88% accuracy on ImageNet, 56.4 average precision (AP) on COCO, and 55.6 mean intersection over union (mIoU) on ADE20K. Its efficiency extends to time series predictions, where it surpasses previous SOTA models based on Transformers in forecasting global temperature and wind speed.

Before diving into the intricacies of UniRepLKNet, the authors tackled two essential questions. First, why continue to explore CNNs in a Transformer-focused era? They argue that both models represent valuable design philosophies, each with strengths in different areas. The emergence of models like ConvNeXt and RepLKNet has demonstrated that CNNs remain potent contenders in various tasks.

Second, how can a CNN designed for images be adapted for audio, video, point clouds, and time series data? UniRepLKNet keeps its core architecture unchanged and instead transforms each modality into an embedding map of shape C×H×W, where C is the number of channels (3 for an RGB image), H the height, and W the width. For example, audio spectrograms can be treated as single-channel images, while video frames are combined into larger images. This flexible approach has yielded strong results across all tested modalities.
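As a rough illustration of this kind of preprocessing, the sketch below reshapes two modalities into C×H×W maps using plain nested lists. The function names (spectrogram_to_map, tile_frames, shape) and the row-major grid layout are assumptions made for this example; the article does not describe the exact preprocessing the authors use.

```python
from typing import List, Tuple

# A feature map indexed as [channel][row][col], i.e. C x H x W.
Image = List[List[List[float]]]


def spectrogram_to_map(spec: List[List[float]]) -> Image:
    """Treat a 2-D spectrogram (time x frequency) as a single-channel image."""
    # Adding a leading channel axis gives a 1 x T x F map.
    return [[list(row) for row in spec]]


def tile_frames(frames: List[Image], cols: int) -> Image:
    """Lay equally sized C x h x w frames on a row-major grid to form one larger image."""
    if not frames or len(frames) % cols != 0:
        raise ValueError("frame count must be a positive multiple of cols")
    channels, h = len(frames[0]), len(frames[0][0])
    grid_rows = len(frames) // cols
    tiled: Image = []
    for c in range(channels):
        plane = []
        for gr in range(grid_rows):
            for y in range(h):
                row: List[float] = []
                # Concatenate row y of every frame in this grid row.
                for gc in range(cols):
                    row.extend(frames[gr * cols + gc][c][y])
                plane.append(row)
        tiled.append(plane)
    return tiled


def shape(img: Image) -> Tuple[int, int, int]:
    """Return (C, H, W) for a nested-list feature map."""
    return (len(img), len(img[0]), len(img[0][0]))
```

With four 3×2×3 frames and cols=2, tile_frames returns a single 3×4×6 map, which an image backbone could then consume like any other picture.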

In 2022, RepLKNet pioneered the use of large convolutional kernels (ranging from 13×13 to 31×31) and introduced several design principles, although its overall architecture was simply borrowed from the Swin Transformer. Traditional convolutional networks stack small kernels to do two jobs at once: enlarging the receptive field and raising the level of feature abstraction. This paper argues for decoupling these factors and using a distinct structure for each desired outcome: large kernels for broader receptive fields, small depthwise convolutions for feature abstraction, and efficient designs, such as SE Blocks or Bottleneck structures, to increase model depth and representational capacity.
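The coupling between receptive field and depth can be made concrete with a little arithmetic. The sketch below uses the standard formula for the theoretical receptive field of stacked stride-1, undilated convolutions, where each k×k layer adds k - 1 pixels per axis. The helper names are illustrative, and the calculation says nothing about the effective receptive field observed in trained networks.

```python
def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Theoretical per-axis receptive field of `layers` stacked stride-1 kernel x kernel convs."""
    return 1 + layers * (kernel - 1)


def layers_to_match(large_kernel: int, small_kernel: int = 3) -> int:
    """Fewest stacked small-kernel layers whose receptive field covers one large kernel."""
    needed = large_kernel - 1
    step = small_kernel - 1
    return -(-needed // step)  # ceiling division


for k in (13, 31):
    n = layers_to_match(k)
    print(f"{k}x{k}: {n} stacked 3x3 layers (receptive field {stacked_receptive_field(3, n)})")
# prints:
# 13x13: 6 stacked 3x3 layers (receptive field 13)
# 31x31: 15 stacked 3x3 layers (receptive field 31)
```

A small-kernel network therefore has to grow deeper just to see more context, whereas a single large kernel reaches the same theoretical span in one layer and leaves depth free to be chosen for other reasons.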

UniRepLKNet embodies these architectural principles, featuring blocks with depthwise convolutions, SE Blocks, and Feedforward Networks (FFNs). This structure efficiently integrates both large kernels and smaller depthwise convolutions, boosting performance without excessive depth.
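For readers unfamiliar with SE (Squeeze-and-Excitation) Blocks, the following is a minimal pure-Python sketch of the general technique: pool each channel to one number, pass the result through a small bottleneck, and use the output to rescale the channels. It is not the authors' implementation. The weight matrices w_reduce and w_expand are hypothetical inputs supplied by the caller, and biases are left out for brevity.

```python
import math
from typing import List

# A feature map indexed as [channel][row][col].
FeatureMap = List[List[List[float]]]


def se_block(x: FeatureMap, w_reduce: List[List[float]], w_expand: List[List[float]]) -> FeatureMap:
    """Rescale each channel by a gate computed from global channel statistics.

    w_reduce has shape (C/r, C) and w_expand has shape (C, C/r), where r is
    the reduction ratio. Biases are omitted in this sketch.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, plane)) / (len(plane) * len(plane[0])) for plane in x]
    # Excitation: bottleneck linear layer -> ReLU -> linear layer -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in plane] for plane, g in zip(x, gates)]
```

With all-zero weights every gate is sigmoid(0) = 0.5, so the block simply halves its input. Trained weights instead let the network emphasize some channels and suppress others at very little extra cost.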

Performance assessments show that UniRepLKNet excels in traditional image classification, improving on contemporary models when pre-trained on ImageNet-22K. For instance, UniRepLKNet-XL attained 88% accuracy on ImageNet and processed images three times faster than DeiT III-L. In COCO object detection, UniRepLKNet-L lagged behind InternImage-L, but UniRepLKNet-XL outperformed InternImage-XL. On ADE20K segmentation, it reached a top mIoU of 55.6, surpassing ConvNeXt-XL by 1.6 points.

The model's capabilities extend to time series data, as shown by its success in predicting global temperatures and wind speeds. Despite being designed primarily for image tasks, UniRepLKNet outperformed Corrformer, the previous SOTA model, underscoring the surprising versatility of CNNs in this area.

These findings indicate that the promise of large-kernel CNNs remains largely untapped. Even in domains traditionally dominated by Transformers, such as unified modeling, large-kernel CNNs exhibit remarkable strength. Notably, reducing the kernel size from 13 to 11 led to significant performance declines across all modalities, which highlights how much kernel size matters. The authors encourage further exploration by making their code and experimental scripts publicly accessible.