Hugging Face launched its Idefics visual language model in 2023, building on the Flamingo architecture originally developed by DeepMind. The upgraded version, Idefics2, is now available on the Hugging Face Hub and features a much smaller parameter count, an open license, and improved Optical Character Recognition (OCR) capabilities.
Idefics, which stands for Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentions, is a versatile multimodal model that accepts both text and image prompts. While the original Idefics weighed in at 80 billion parameters, Idefics2 has been slimmed down to 8 billion, putting it in the same size class as models such as DeepSeek-VL and LLaVA-NeXT-Mistral-7B.
Key improvements in Idefics2 include better image handling: images are processed at their native resolution (up to 980 x 980 pixels) and native aspect ratio, removing the need to resize them to a fixed-size square, a long-standing convention in computer vision pipelines.
The model's OCR capabilities have also been strengthened by incorporating training data transcribed from text in images and documents, and the Hugging Face team has improved Idefics2's ability to answer questions about charts, figures, and documents.
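For readers who want to try these capabilities, the sketch below shows how one might ask Idefics2 a question about a document image through the transformers library. It is a minimal example assuming the HuggingFaceM4/idefics2-8b checkpoint and the library's standard vision-to-sequence classes; the image URL and the question are placeholders, not values from this article.

```python
# Minimal sketch: querying Idefics2 about a document image via transformers.
# Assumes the HuggingFaceM4/idefics2-8b checkpoint; the URL below is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.float16, device_map="auto"
)

# Load an image at its native resolution and aspect ratio (placeholder URL).
image = Image.open(requests.get("https://example.com/invoice.png", stream=True).raw)

# Build a chat-style prompt containing one image slot and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the total amount on this invoice?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```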
Moreover, the architecture of Idefics2 has been simplified by moving away from the gated cross-attention mechanisms used in its predecessor. According to Hugging Face, “The images are fed into the vision encoder, followed by learned Perceiver pooling and a Multilayer Perceptron modality projection. This pooled sequence is concatenated with the text embeddings to create an interleaved sequence of images and text.”
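The toy PyTorch sketch below illustrates that flow end to end. The module names, feature dimensions, and number of latent tokens are illustrative assumptions for clarity, not the actual Idefics2 implementation.

```python
# Toy sketch of the fusion flow quoted above: vision features -> learned Perceiver
# pooling -> MLP modality projection -> concatenation with text embeddings.
# Dimensions and hyperparameters are illustrative, not Idefics2's real values.
import torch
import torch.nn as nn


class PerceiverPooler(nn.Module):
    """Compress a variable-length sequence of image patch features into a
    fixed number of learned latent tokens via cross-attention."""

    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, dim)
        batch = image_features.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(latents, image_features, image_features)
        return pooled  # (batch, num_latents, dim)


class ToyMultimodalFusion(nn.Module):
    """Vision encoder output -> Perceiver pooling -> MLP projection ->
    concatenation with text embeddings."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 4096):
        super().__init__()
        self.pooler = PerceiverPooler(vision_dim)
        # MLP modality projection into the language model's embedding space.
        self.projection = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        pooled = self.pooler(image_features)   # (batch, 64, vision_dim)
        projected = self.projection(pooled)    # (batch, 64, text_dim)
        # Projected image tokens are concatenated with the text embeddings to
        # form one interleaved sequence for the language model.
        return torch.cat([projected, text_embeddings], dim=1)


# Dummy tensors: one image encoded as 1024 patch features, 16 text tokens.
fusion = ToyMultimodalFusion()
image_features = torch.randn(1, 1024, 768)
text_embeddings = torch.randn(1, 16, 4096)
print(fusion(image_features, text_embeddings).shape)  # torch.Size([1, 80, 4096])
```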
Idefics2 is built on two open models, the Mistral-7B-v0.1 language model and the siglip-so400m-patch14-384 vision encoder, and was trained on a mixture of publicly available data, including web documents, image-caption pairs, OCR data, and image-to-code resources.
The release of Idefics2 comes amid a surge of multimodal models in the AI landscape, including Reka’s Core model, xAI’s Grok-1.5V, and Google’s Imagen 2.