Getty Images Launches ‘Cleanest’ Visual Dataset for Training AI Foundation Models

Getty Images is committed to becoming a trusted data partner in the AI space. Renowned for facilitating the discovery, sharing, and purchase of visual content from a global pool of photographers and videographers, the company has announced the release of a sample open dataset on Hugging Face.

While many visual datasets are available on the Hugging Face hub, Getty Images asserts that its offering is uniquely reliable and commercially safe. This assurance allows enterprise developers to integrate the dataset into their AI training pipelines with confidence, mitigating concerns over quality or legal complications.

As Andrea Gagliano, the head of data science and AI/ML at Getty Images, explained, “Imagine enhancing your AI/ML capabilities with data that is both diverse and high quality, sourced responsibly. That’s what we provide.”

Getty's long-term objective is to foster an ecosystem where AI developers prefer to use officially licensed content from its platform for training their models.

What Does the Getty Images Dataset Include?

Developers often struggle with poorly sourced, low-quality data when training AI/ML models. To compensate, they typically spend considerable effort cleaning and enriching their datasets: removing duplicates, damaged files, and unwanted content such as celebrity images, trademarks, low-resolution images, and materials lacking proper metadata.

This process is time-consuming, and it is not foolproof. When harmful or copyrighted material slips through, it can surface in model outputs and expose developers to legal disputes.
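As a rough illustration of the cleaning work described above, the following Python sketch filters a list of image records. It is a minimal example under stated assumptions, not Getty's or any developer's actual pipeline: the `ImageRecord` type, the `clean` function, the 512-pixel minimum side, and the required `caption` field are all hypothetical. It only covers byte-identical duplicates, empty files, low resolution, and missing metadata. Detecting celebrities or trademarks would need separate tooling.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical in-memory description of one image file."""
    name: str
    data: bytes                      # raw file contents
    width: int
    height: int
    metadata: dict = field(default_factory=dict)


MIN_SIDE = 512                       # assumed resolution threshold, in pixels
REQUIRED_KEYS = ("caption",)         # assumed mandatory metadata fields


def clean(records, min_side=MIN_SIDE, required_keys=REQUIRED_KEYS):
    """Return the records that pass all basic quality checks, in order."""
    seen_digests = set()
    kept = []
    for rec in records:
        # An empty file is treated here as damaged.
        if not rec.data:
            continue
        # Skip byte-identical duplicates of an image already kept.
        digest = hashlib.sha256(rec.data).hexdigest()
        if digest in seen_digests:
            continue
        # Skip low-resolution images.
        if min(rec.width, rec.height) < min_side:
            continue
        # Skip images whose required metadata is missing or empty.
        if any(not rec.metadata.get(key) for key in required_keys):
            continue
        seen_digests.add(digest)
        kept.append(rec)
    return kept
```

Even this simplified version needs thresholds and metadata rules to be chosen and maintained, which hints at why a pre-curated dataset can save effort.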

The open dataset from Getty Images seeks to overcome these hurdles by providing a curated collection of high-quality images across 15 categories.

“This sample dataset features 3,750 images from categories including abstracts, built environments, business, education, healthcare, industry, nature, illustrations, and travel,” Gagliano detailed.

Clean and Curated Content

The dataset comes exclusively from Getty’s own creative library, which the company says makes every image commercially safe to use. It is designed specifically for machine learning training, with high-resolution images, rich structured metadata, and no unwanted elements such as NSFW content, so developers can use it without first cleaning or enriching it. Gagliano describes it as the “cleanest, highest quality dataset” available for training ML models.

Usage Conditions

While the sample dataset is open for use, Getty attaches conditions intended to ensure that the licensed content is used responsibly in both commercial applications and academic research. Restrictions include:

- No redistribution of the dataset

- No development of models or software that recreate or generate reproductions of the dataset content

- No creation of products or services that compete directly with Getty Images

- No use of biometric identifiers derived from the dataset

- Compliance with all relevant laws and regulations

Through this initiative, Getty Images aims to engage the developer community, showcasing the extensive range of content it offers and positioning itself as a “trusted partner” for high-quality licensed data for responsible AI training.

Gagliano emphasizes, “Our goal is to demonstrate that it is possible to accommodate licensing for all the content needed to train functional AI models while respecting creator IP.” Developers seeking additional data can reach out to Getty Images for tailored licensing options.

Under this licensing approach, original content creators receive annual compensation, the same model Getty Images applied to its AI image generation tool, developed in partnership with Nvidia.
