Apple Claims ‘Responsible’ Methodology in Training Its Apple Intelligence Models

Apple has released a technical paper outlining the models developed to enhance Apple Intelligence, a suite of generative AI features coming soon to iOS, macOS, and iPadOS. In the paper, Apple defends itself against claims of ethical misconduct in its model training, emphasizing that no private user data was used. Instead, the training drew from a mix of publicly available and licensed data.

In the document, Apple explains, “[The] pre-training data set consists of licensed data from publishers, curated public datasets, and information collected through our web crawler, Applebot.” The company reiterates its commitment to user privacy, clarifying that no private Apple user data is part of its training data.

In July, Proof News reported that Apple had used a dataset called The Pile, which includes subtitles from countless YouTube videos, to train models for on-device processing. Many YouTube creators were unaware their subtitles had been scraped, raising concerns about consent. Apple later clarified that those models would not power any AI features in its products.

The technical paper, which provides more detail on the Apple Foundation Models (AFM) first announced at WWDC 2024, stresses that the training data was gathered responsibly, at least by Apple's definition of the term. The AFM models were trained on publicly available web data as well as content licensed from undisclosed publishers. According to The New York Times, Apple approached several major publishers at the end of 2023, including NBC and Condé Nast, seeking multi-year agreements worth at least $50 million for access to their news archives. The AFM models were also trained on open-source code hosted on GitHub, including code written in Swift, Python, and Java.

Training models on code without explicit permission has sparked debate among developers. Some argue that many open-source codebases carry licenses that do not permit use as AI training data, creating legal ambiguity. In response, Apple says it “license-filtered” the code, keeping only repositories with minimal usage restrictions.
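Apple's paper doesn't spell out how that filtering worked. As a rough illustration, it might resemble an allowlist keyed on each repository's declared license. The sketch below is purely hypothetical; the SPDX-style license strings mirror what the GitHub API reports, and none of the names or thresholds come from Apple.

```python
# Hypothetical sketch of license filtering; Apple has not described its
# actual procedure. License identifiers follow the SPDX-style strings
# reported by the GitHub API (an assumption for illustration).
PERMISSIVE_LICENSES = {
    "mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause",
    "isc", "unlicense", "cc0-1.0",
}

def license_filter(repos: list[dict]) -> list[dict]:
    """Keep only repos whose declared license is on the permissive allowlist."""
    return [
        repo for repo in repos
        if (repo.get("license") or "").lower() in PERMISSIVE_LICENSES
    ]

# Toy input: repos with no license or a copyleft license are dropped.
repos = [
    {"name": "example/swift-utils", "license": "MIT"},
    {"name": "example/gpl-tool", "license": "GPL-3.0"},
    {"name": "example/unlicensed", "license": None},
]
print([r["name"] for r in license_filter(repos)])  # ['example/swift-utils']
```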

To enhance the AFM models’ mathematical capabilities, Apple included math questions and answers from a variety of sources, including forums and educational blogs. The company also drew on high-quality, publicly available datasets, which it declined to name, after filtering out sensitive and personally identifiable information.

The combined training dataset totals approximately 6.3 trillion tokens, where tokens are the small chunks of text (words, subwords, or individual characters) that generative AI models actually ingest and process. For context, that figure is less than half the 15 trillion tokens Meta used to train its flagship text-generating model, Llama 3.1 405B.
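To make “token” concrete: production models use learned subword tokenizers such as byte-pair encoding, not the arbitrary 4-character chunking below. This toy sketch is an assumption-laden stand-in that only shows the basic idea of splitting text into countable pieces.

```python
# Illustrative only: real models use learned subword tokenizers (e.g.
# byte-pair encoding), not this arbitrary 4-character chunking.
def toy_tokenize(text: str) -> list[str]:
    """Split text into crude 'tokens': whitespace words broken into 4-char pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return tokens

sample = "Apple trained its foundation models on roughly 6.3 trillion tokens."
tokens = toy_tokenize(sample)
print(len(tokens))   # number of crude tokens in the sample
print(tokens[:6])    # ['Appl', 'e', 'trai', 'ned', 'its', 'foun']
```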

Additionally, Apple used human feedback and synthetic data to fine-tune the AFM models and to mitigate undesirable behaviors, such as generating toxic output.

“Our models are designed to assist users in daily tasks across Apple products, founded on our core values and responsible AI principles at every stage,” Apple asserts.

The paper does not reveal any groundbreaking information, which is likely intentional given competitive pressures and legal considerations. While some companies argue that training on publicly available web data is protected under the fair-use doctrine, the issue remains contentious and is attracting growing legal scrutiny.

Apple notes that webmasters can block its crawler from scraping their sites, but that offers little recourse to individual creators whose work is hosted on platforms that don't block scraping. Ultimately, how courts rule on these questions will shape the future of generative AI models and how they are trained. For now, Apple is trying to position itself as an ethical leader while keeping potential legal challenges at bay.
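That webmaster opt-out works through robots.txt. The minimal sketch below uses Python's standard urllib.robotparser to show the per-agent decision; “Applebot” is Apple's documented crawler user agent, and “Applebot-Extended” is the agent Apple documents for opting content out of AI-model training, while the robots.txt contents and URL here are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt. "Applebot" is Apple's crawler user agent;
# "Applebot-Extended" is the agent Apple documents for opting content
# out of AI-model training while leaving search crawling alone.
robots_txt = """\
User-agent: Applebot-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URL; the point is the per-agent decision, not the site.
url = "https://example.com/article"
print(parser.can_fetch("Applebot-Extended", url))  # False: blocked from training use
print(parser.can_fetch("Applebot", url))           # True: still crawlable for search
```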
