Getty Images Launches ‘Cleanest’ Visual Dataset for Training AI Foundation Models


Updated on September 6, 2024

Getty Images is committed to becoming a trusted data partner in the AI space. Renowned for facilitating the discovery, sharing, and purchase of visual content from a global pool of photographers and videographers, the company has announced the release of a sample open dataset on Hugging Face.

While many visual datasets are available on the Hugging Face hub, Getty Images asserts that its offering is uniquely reliable and commercially safe. This assurance allows enterprise developers to integrate the dataset into their AI training pipelines with confidence, mitigating concerns over quality or legal complications.

As Andrea Gagliano, the head of data science and AI/ML at Getty Images, explained, “Imagine enhancing your AI/ML capabilities with data that is both diverse and high quality, sourced responsibly. That’s what we provide.”

Getty's long-term objective is to foster an ecosystem where AI developers prefer to use officially licensed content from its platform for training their models.

What Does the Getty Images Dataset Include?

Developers often face challenges when dealing with poorly sourced, low-quality data during AI/ML model training. To address this, they typically engage in extensive efforts to clean and enrich their datasets—removing duplicates, damaged files, and irrelevant content such as celebrity images, trademarks, low-resolution images, and materials lacking proper metadata.

This time-consuming process is inefficient, and when it misses harmful or copyrighted material, that material may inadvertently make its way into model outputs and lead to legal disputes.
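To make the cleaning steps described above concrete, here is a minimal Python sketch of such a filtering pass. It is not Getty Images' pipeline or any library's API: the `ImageRecord` layout, the `clean_records` helper, the 512-pixel threshold, and the required `caption` field are all hypothetical choices made for illustration. The damaged-file check only looks at JPEG and PNG file signatures, and duplicates are detected by exact byte match, so near-duplicates would pass through. Screening for celebrity images or trademarks is not covered.

```python
import hashlib
from dataclasses import dataclass, field

# File signatures for JPEG and PNG; a crude stand-in for a real decode check.
_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


@dataclass
class ImageRecord:
    """Hypothetical record layout, not the schema of any real dataset."""
    name: str
    data: bytes
    width: int
    height: int
    metadata: dict = field(default_factory=dict)


def clean_records(records, min_side=512, required_fields=("caption",)):
    """Return (kept, dropped), where dropped maps a record name to a reason."""
    seen_hashes = set()
    kept, dropped = [], {}
    for rec in records:
        # Damaged: bytes do not start with a known image signature.
        if not rec.data.startswith(_SIGNATURES):
            dropped[rec.name] = "damaged"
            continue
        # Low resolution: shorter side is below the assumed threshold.
        if min(rec.width, rec.height) < min_side:
            dropped[rec.name] = "low_resolution"
            continue
        # Missing metadata: a required field is absent or empty.
        if any(not rec.metadata.get(f) for f in required_fields):
            dropped[rec.name] = "missing_metadata"
            continue
        # Duplicate: identical bytes were already kept (exact match only).
        digest = hashlib.sha256(rec.data).hexdigest()
        if digest in seen_hashes:
            dropped[rec.name] = "duplicate"
            continue
        seen_hashes.add(digest)
        kept.append(rec)
    return kept, dropped


# Made-up sample data exercising each rule once.
jpeg = b"\xff\xd8\xff" + b"pixels"
sample = [
    ImageRecord("a.jpg", jpeg, 1024, 768, {"caption": "harbor at dawn"}),
    ImageRecord("a_copy.jpg", jpeg, 1024, 768, {"caption": "harbor at dawn"}),
    ImageRecord("tiny.jpg", jpeg + b"2", 200, 150, {"caption": "thumbnail"}),
    ImageRecord("broken.jpg", b"", 1024, 768, {"caption": "truncated"}),
    ImageRecord("bare.jpg", jpeg + b"3", 1024, 768, {}),
]
kept, dropped = clean_records(sample)
print([r.name for r in kept])
print(dropped)
```

Recording a reason for every dropped file, rather than silently discarding it, makes it easier to audit how much of a raw collection is lost at each step.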

The open dataset from Getty Images seeks to overcome these hurdles by providing a curated collection of high-quality images across 15 categories.

“This sample dataset features 3,750 images from categories including abstracts, built environments, business, education, healthcare, industry, nature, illustrations, and travel,” Gagliano detailed.

Clean and Curated Content

The dataset comes exclusively from Getty’s own creative library, so all images are commercially safe to use. It is designed specifically for machine learning training: it features high-resolution images and rich structured metadata, and it is free of unwanted elements such as NSFW content, so developers can use it without the burden of cleaning or enrichment. Gagliano describes it as the “cleanest, highest quality dataset” available for training ML models.

Usage Conditions

While the sample dataset is open for use, certain usage conditions ensure that the licensed content is employed responsibly for commercial applications and academic research. Restrictions include:

- No redistribution of the dataset

- No development of models or software that recreate or generate reproductions of the dataset content

- No creation of products or services that compete directly with Getty Images

- No use of biometric identifiers derived from the dataset

- Compliance with all relevant laws and regulations

Through this initiative, Getty Images aims to engage the developer community, showcasing the extensive range of content it offers and positioning itself as a “trusted partner” for high-quality licensed data for responsible AI training.

Gagliano emphasizes, “Our goal is to demonstrate that it is possible to accommodate licensing for all the content needed to train functional AI models while respecting creator IP.” Developers seeking additional data can reach out to Getty Images for tailored licensing options.

Under this licensing approach, original content creators receive annual compensation, the same model Getty Images applied to its AI image generation tool, developed in partnership with Nvidia.
