In the field of vision-language models (VLMs), high computational cost has been a major barrier to widespread adoption. A collaboration between Harbin Institute of Technology and Du Xiaoman has produced an adaptive pruning algorithm called SmartTrim. The algorithm reduces redundant computation in multimodal large models, significantly improving efficiency. The work has been accepted at COLING 2024, a leading conference in natural language processing.
SmartTrim uses an adaptive pruning mechanism to exploit redundancy in the token representations and attention heads at each layer of the model. By identifying and skipping unnecessary computation, SmartTrim improves efficiency while preserving accuracy. Crucially, it scores the importance of each token both within its own modality and by its contribution to cross-modal interactions.
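The idea of combining intra-modal and cross-modal importance can be sketched as follows. This is a simplified, hypothetical formulation (the function names, the use of token norms as the intra-modal signal, and the fixed mixing weight `alpha` are illustrative assumptions, not the paper's exact scoring function):

```python
import numpy as np

def token_importance(x, cross_attn, alpha=0.5):
    """Toy importance score per token: mix an intra-modal signal
    (here, token norm) with a cross-modal signal (attention mass
    received from the other modality's queries)."""
    intra = np.linalg.norm(x, axis=-1)      # (n_tokens,)
    intra = intra / intra.sum()
    cross = cross_attn.mean(axis=0)         # (n_tokens,) avg over other-modality queries
    cross = cross / cross.sum()
    return alpha * intra + (1 - alpha) * cross

def prune_tokens(x, scores, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by score,
    preserving their original order."""
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return x[keep], keep

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))     # 8 text tokens, hidden dim 16
attn = rng.random((4, 8))        # attention from 4 image tokens onto the 8 text tokens
scores = token_importance(x, attn)
pruned, kept = prune_tokens(x, scores, keep_ratio=0.5)
```

Halving the token count at a layer roughly quarters the cost of that layer's self-attention, which is where the acceleration comes from.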
The SmartTrim framework has two main components: a cross-modal-aware Token Pruner and a modality-adaptive Attention Head Pruner. The Token Pruner uses a multi-layer perceptron (MLP) to identify and drop unimportant tokens at each layer, weighing both a token's standalone importance and its contribution to cross-modal interactions. Meanwhile, the Attention Head Pruner integrates directly into the model's self-attention mechanism, removing redundant attention heads.
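A minimal sketch of the head-pruning side: a binary mask zeroes out selected heads' outputs before the usual concatenation-and-projection step of multi-head attention. The function name and the post-hoc masking of precomputed head outputs are assumptions for illustration; in the actual model the mask would be produced by the learned pruner and applied inside the attention layer:

```python
import numpy as np

def masked_multihead_output(head_outputs, head_mask):
    """Zero out pruned heads, then concatenate heads as in
    standard multi-head attention (before the output projection)."""
    # head_outputs: (n_heads, seq, d_head); head_mask: (n_heads,) in {0, 1}
    masked = head_outputs * head_mask[:, None, None]
    n, s, d = masked.shape
    return masked.transpose(1, 0, 2).reshape(s, n * d)

rng = np.random.default_rng(1)
heads = rng.normal(size=(4, 6, 8))      # 4 heads, 6 tokens, 8 dims per head
mask = np.array([1.0, 0.0, 1.0, 1.0])   # prune head 1
out = masked_multihead_output(heads, mask)
```

In practice, a head whose mask is zero need not be computed at all, so the saving is realized at inference time rather than by multiplying by zero.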
To train SmartTrim, the researchers used a dual-objective optimization that balances task performance against computational cost. Because binary pruning masks are non-differentiable, they applied a re-parameterization technique to enable end-to-end training. Self-distillation and curriculum learning further improved the pruned model's performance and stabilized training.
Experimental results show that SmartTrim achieves a 2-3x acceleration on the METER and BLIP VLMs with minimal performance loss. Notably, at a 1.5x speed-up ratio SmartTrim even outperforms the original models, demonstrating its advantage over other acceleration methods. SmartTrim represents a meaningful step forward in multimodal large-model research, offering practical guidance for model optimization in deployment, and there are plans to integrate it into the XuanYuan large model to further advance the team's large-model technology.