Is Apple's AI Revolution Around the Corner? Introducing the Ferret Multimodal Model for iPhone

In October, Apple and Columbia University released their research project, Ferret. Aimed initially at academic applications, it attracted limited attention at the time. The AI landscape has transformed significantly since then, however, with tech giants racing to unveil their innovations and growing discussion of local models that deliver intelligent experiences on smaller devices.

Since December, Apple has made steady advances in artificial intelligence. Recently, the company introduced MLX, an AI framework tailored for Apple Silicon, along with a method for running large language models directly on devices by keeping model parameters in flash storage and loading them into memory only as needed. This approach allows models up to twice the size of available DRAM to run on edge devices, conserving computational resources and enhancing privacy.
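The core idea of that on-device method is straightforward: weights live in flash, and only the slices needed for the current computation are pulled into a bounded DRAM cache. The sketch below illustrates that caching pattern in plain Python, with a memory-mapped file standing in for flash; the class, its parameters, and the eviction policy are hypothetical, not Apple's implementation.

```python
import numpy as np

class FlashWeightCache:
    """Illustrative sketch: the full weight matrix lives on "flash" (here, a
    memory-mapped file) and only the rows needed for the current step are
    copied into a bounded DRAM cache. Hypothetical, not Apple's code."""

    def __init__(self, weight_file, n_rows, n_cols, dram_budget_rows):
        # Memory-map the weights: nothing is read into RAM until accessed.
        self.flash = np.memmap(weight_file, dtype=np.float16,
                               mode="r", shape=(n_rows, n_cols))
        self.budget = dram_budget_rows   # max number of rows kept in DRAM
        self.cache = {}                  # row index -> DRAM copy of that row

    def get_rows(self, row_indices):
        """Return the requested rows, loading misses from flash and
        evicting the oldest cached row once the budget is exceeded."""
        out = []
        for i in row_indices:
            if i not in self.cache:
                if len(self.cache) >= self.budget:
                    self.cache.pop(next(iter(self.cache)))  # FIFO eviction
                self.cache[i] = np.array(self.flash[i])     # flash -> DRAM
            out.append(self.cache[i])
        return np.stack(out)

# Usage with a small dummy weight file: only the requested rows become
# resident, so the DRAM footprint tracks the budget, not the matrix size.
rows, cols = 1024, 64
w = np.memmap("weights.bin", dtype=np.float16, mode="w+", shape=(rows, cols))
w[:] = 1.0
w.flush()
cache = FlashWeightCache("weights.bin", rows, cols, dram_budget_rows=128)
active = cache.get_rows([3, 17, 42])   # shape (3, 64), loaded on demand
```

Because only the rows touched by the current step are resident, memory use is governed by the cache budget rather than the full model size, which is what makes models larger than DRAM feasible on a phone.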

Last week, Apple unveiled further work, releasing Ferret, a new multimodal large language model, along with benchmarking tools and datasets. Ferret jointly processes text and image inputs. The research paper released in October highlighted Ferret's ability to understand referred image regions of any shape and granularity and to accurately ground open-vocabulary descriptions.

To improve referring and grounding, Ferret uses a hybrid region representation that combines discrete coordinates with continuous features to delineate specific areas within an image. A spatial-aware visual sampler enables the model to accept diverse region inputs such as points, bounding boxes, and free-form shapes.
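To make the idea concrete, the sketch below pairs discrete coordinate tokens for a bounding box with a continuous feature vector pooled from points sampled inside the region. It is a rough stand-in for the paper's spatial-aware visual sampler; the function name, bin count, and random point sampling are all illustrative assumptions.

```python
import numpy as np

def hybrid_region_representation(feature_map, box, n_points=32, n_bins=100):
    """Sketch of a hybrid region representation: discrete coordinate tokens
    plus a continuous feature pooled from points sampled inside the region.
    An illustration of the idea, not Ferret's actual sampler."""
    h, w, _ = feature_map.shape
    x1, y1, x2, y2 = box  # normalized [0, 1] corner coordinates

    # Discrete part: quantize the corners into coordinate tokens, as one
    # might embed "[x1, y1, x2, y2]" directly in the text stream.
    coord_tokens = [int(v * (n_bins - 1)) for v in (x1, y1, x2, y2)]

    # Continuous part: sample points inside the region (a crude stand-in
    # for a spatial-aware sampler) and average their visual features.
    xs = (np.random.uniform(x1, x2, n_points) * (w - 1)).astype(int)
    ys = (np.random.uniform(y1, y2, n_points) * (h - 1)).astype(int)
    region_feature = feature_map[ys, xs].mean(axis=0)  # (C,) pooled vector

    return coord_tokens, region_feature

# Usage: a 16x16 feature map with 8 channels; the region is a box in the
# upper-left quadrant. Points and free-form shapes would sample similarly.
fmap = np.random.rand(16, 16, 8).astype(np.float32)
tokens, feat = hybrid_region_representation(fmap, (0.1, 0.1, 0.4, 0.5))
```

The two halves are complementary: the discrete tokens give the language model something it can read and emit in text, while the pooled feature carries fine-grained visual detail that coordinates alone would lose.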

Apple enhanced Ferret's performance using the GRIT dataset (Ground-and-Refer Instruction-Tuning), which contains over 1.1 million samples rich in hierarchical spatial knowledge, supplemented by 95,000 hard negative samples to improve model robustness. In evaluations against multimodal language models such as Kosmos-2 and GPT4-RoI, Apple's Ferret-13B excelled in conventional referring and grounding tasks, particularly region-based tasks involving localization, detailed description, and complex reasoning.
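For a sense of what ground-and-refer instruction tuning involves, a training record plausibly pairs an image with a conversation whose region references carry explicit coordinates. The snippet below is a hypothetical illustration of such a record; the field names and coordinate convention are assumptions, not GRIT's actual schema.

```python
# A hypothetical GRIT-style record: instruction-tuning text in which region
# references carry explicit box coordinates. Field names and the coordinate
# convention are illustrative assumptions, not the dataset's actual schema.
sample = {
    "image": "example.jpg",
    "conversation": [
        {"role": "user",
         "content": "What is the animal in the region [120, 80, 260, 310]?"},
        {"role": "assistant",
         "content": "A dog [120, 80, 260, 310] lying on the grass."},
    ],
}
```

On this reading, the hard negative samples would pair similar questions with regions or labels that do not match the image, teaching the model to reject incorrect groundings rather than confabulate them.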

In visual comparison tasks, Apple reports that Ferret exhibits strong spatial understanding and common-sense reasoning, and that it generates significantly fewer object hallucinations than popular models such as Shikra and InstructBLIP. Apple has released the code for the Ferret-7B and Ferret-13B models, the GRIT dataset, the Ferret-Bench benchmarking tools, and checkpoints for both model sizes.

Interestingly, many in the AI community have only recently become aware of Apple's large model. Bart de Witte expressed his surprise on X, noting that he had overlooked the release and that he looks forward to a future in which local large language models run as integrated services on a redesigned iOS, operating seamlessly on devices like the iPhone. As Apple continues to push forward, reports indicate that companies such as Anthropic and OpenAI are also seeking substantial new funding for their proprietary large language models.
