Following Microsoft Build and Google I/O, Apple faced significant expectations to showcase its on-device AI capabilities at the Worldwide Developers Conference 2024. It delivered, integrating generative AI into the user experience across its devices and demonstrating impressive advances in the process.
A standout feature of Apple's demonstrations was the extensive on-device processing. By leveraging its advanced processors and a wealth of open research, Apple delivered high-quality, low-latency AI functionalities on its phones and computers. Here’s what we learned about Apple’s on-device AI:
Apple’s Model Overview
In its State of the Union presentation and a blog post released on June 10, Apple revealed that its on-device foundation model has roughly 3 billion parameters. While Apple didn't disclose the specific base model, it has recently released several open models, including the OpenELM family of language models, which features a 3-billion-parameter version optimized for resource-constrained devices.
OpenELM incorporates modifications designed to improve model quality without increasing the parameter count, suggesting that Apple's foundation model may be a specialized variant of OpenELM-3B. OpenELM was trained on 1.8 trillion tokens of open datasets, while Apple says its foundation models are trained on licensed data along with publicly available data collected by Applebot, its web crawler.
Licensed Data Partnerships
Apple has established partnerships for licensed data, including a $25-$50 million deal with Shutterstock for images and a potential $50 million agreement with major news and publishing organizations.
Training and Optimization Techniques
The model has been fine-tuned to follow instructions through reinforcement learning from human feedback (RLHF) and a rejection-sampling fine-tuning algorithm with a teacher committee. RLHF, which gained popularity with the release of ChatGPT, uses human-annotated preference data to align language models with user preferences. Rejection sampling generates several candidate responses per prompt and keeps only the best one as a training example, a technique also employed by the Llama 2 team.
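Apple has not published the details of its teacher committee or scoring setup, but the core loop of rejection-sampling fine-tuning is simple to illustrate. The Python sketch below assumes hypothetical model.generate and teacher.score interfaces, which are placeholders rather than Apple's actual tooling: sample several candidate responses per prompt, score them with the committee, and keep only the best one as a supervised training example.

```python
def rejection_sampling_round(model, teachers, prompts, num_samples=8):
    """One round of rejection-sampling fine-tuning (sketch).

    For each prompt, sample several candidate responses, score each
    candidate with a committee of teacher models, and keep only the
    highest-scoring response as a new supervised training example.
    model.generate(prompt) and teacher.score(prompt, response) are
    hypothetical interfaces, not Apple's actual APIs.
    """
    training_pairs = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(num_samples)]
        # Average the committee's scores for each candidate response.
        scores = [
            sum(t.score(prompt, c) for t in teachers) / len(teachers)
            for c in candidates
        ]
        best = candidates[scores.index(max(scores))]
        training_pairs.append((prompt, best))
    # The selected (prompt, response) pairs are then used for ordinary
    # supervised fine-tuning, and the cycle can be repeated.
    return training_pairs
```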
Technical Optimizations
Apple implemented various techniques to enhance model performance while keeping resource usage low. The foundation model uses “grouped query attention” (GQA), developed by Google Research, to accelerate inference with minimal memory and compute overhead. The model also employs “palettization,” which compresses weights using shared look-up tables, alongside quantization, which reduces the number of bits used per parameter.
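Apple has not described its palettization pipeline, but the underlying idea, replacing individual weights with indices into a small shared look-up table, can be shown in a few lines. The NumPy sketch below builds the palette with a simple one-dimensional k-means; the function names and the 16-entry (4-bit) palette are illustrative assumptions, and a real implementation would work blockwise rather than over an entire 3-billion-parameter model at once.

```python
import numpy as np

def palettize(weights: np.ndarray, n_colors: int = 16, n_iters: int = 20):
    """Compress a weight tensor with a k-means look-up table (sketch).

    Each weight is replaced by the index of its nearest "palette" entry,
    so a float32 tensor can be stored as, e.g., 4-bit indices plus a
    16-entry table of representative values.
    """
    flat = weights.ravel()
    # Initialize the palette with evenly spaced quantiles of the weights.
    palette = np.quantile(flat, np.linspace(0.0, 1.0, n_colors))
    for _ in range(n_iters):
        # Assign each weight to its nearest palette entry.
        idx = np.abs(flat[:, None] - palette[None, :]).argmin(axis=1)
        # Move each palette entry to the mean of its assigned weights.
        for k in range(n_colors):
            if np.any(idx == k):
                palette[k] = flat[idx == k].mean()
    indices = np.abs(flat[:, None] - palette[None, :]).argmin(axis=1)
    return palette, indices.reshape(weights.shape).astype(np.uint8)

def depalettize(palette: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Reconstruct the approximate weights from the look-up table."""
    return palette[indices]
```

With a 16-entry palette, each weight is stored as a 4-bit index plus a small shared table, an eightfold reduction from 32-bit floats at the cost of some approximation error.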
The models are optimized for devices with M1 or later chips and for the iPhone 15 Pro and Pro Max, which feature the A17 Pro chip. This suggests the use of optimization techniques tailored to Apple silicon, such as the “LLM in a flash” inference technique Apple researchers introduced last year.
Performance Metrics
Reported results on an iPhone 15 Pro show a time-to-first-token latency of approximately 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second. For example, a 1,000-token prompt would begin producing a response after roughly 0.6 seconds and then stream output at 30 tokens per second, which is impressive performance for a phone.
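Those two figures combine into a simple back-of-the-envelope estimate of total response time. The Python snippet below is just that arithmetic; the 150-token reply length is an arbitrary example, not a figure from Apple.

```python
def estimated_response_time(prompt_tokens: int, output_tokens: int,
                            ms_per_prompt_token: float = 0.6,
                            gen_tokens_per_sec: float = 30.0) -> float:
    """Rough latency estimate from the figures Apple reported.

    Time to first token scales with prompt length; after that, tokens
    stream at a fixed generation rate.
    """
    time_to_first_token = prompt_tokens * ms_per_prompt_token / 1000.0
    generation_time = output_tokens / gen_tokens_per_sec
    return time_to_first_token + generation_time

# A 1,000-token prompt with a 150-token reply:
# 1000 * 0.6 ms = 0.6 s to first token, plus 150 / 30 = 5 s of generation.
print(estimated_response_time(1000, 150))  # ~5.6 seconds total
```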
Customization with Low-Rank Adaptation
To add task-specific capabilities without storing multiple copies of the model, Apple engineers created fine-tuned versions using low-rank adaptation (LoRA) adapters. LoRA trains a small set of additional low-rank weights for each task, and the adapters, each under 100 megabytes, let a device keep several on hand for functions like proofreading, summarization, and email replies.
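The memory savings come from the shape of the update: LoRA freezes the base weight matrix and learns only two small low-rank factors per layer, so each adapter stores a tiny fraction of the full weights. The NumPy sketch below illustrates the idea; the rank, scaling, and layer shapes are illustrative and not Apple's actual configuration.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a low-rank adapter (sketch).

    Instead of fine-tuning the full d_out x d_in weight matrix W, LoRA
    learns two small matrices A (rank x d_in) and B (d_out x rank), so
    the effective weight is W + scale * (B @ A). Only A and B need to be
    stored and swapped per task.
    """

    def __init__(self, W: np.ndarray, rank: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen base weight, shared by all adapters
        self.A = np.random.randn(rank, d_in) * 0.01   # trainable low-rank factor
        self.B = np.zeros((d_out, rank))              # trainable, starts at zero so the adapter is a no-op
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Base path plus the low-rank update; at inference time the adapter
        # for the active feature (proofreading, summarization, ...) is
        # swapped in without touching W.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

For a 4,096 x 4,096 layer, a rank-8 adapter stores about 65,000 extra parameters versus roughly 16.8 million in the full matrix, which is why a whole library of task adapters can stay small enough to keep on device.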
Evaluating Performance
According to Apple's evaluations, its model generally outperforms similarly sized and even some larger models, including Gemma-2B, Mistral-7B, and Phi-3-mini.
In summary, Apple’s on-device AI illustrates the potential of combining compact models with effective optimization techniques, quality data, and robust hardware. The company has made significant strides in balancing accuracy with user experience. It will be intriguing to see how this technology performs when rolled out to consumers this fall.