Text-generating AI is impressive, but AI models that can analyze both images and text offer even greater possibilities for innovation.
Take Twelve Labs, for instance. This San Francisco-based startup focuses on training AI systems to address the intricate challenges of “video-language alignment,” as articulated by co-founder and CEO Jae Lee.
“Twelve Labs was founded … to develop a framework for multimodal video understanding, with the initial focus on semantic search — think of it as ‘CTRL+F for videos,’” Lee explained in an email interview. “Our vision at Twelve Labs is to empower developers to create applications that can see, hear, and understand the world as we do.”
Twelve Labs’ AI models map natural language to what is happening in a video, including actions, objects, and background sounds. This capability lets developers build applications that can search videos, classify scenes, extract topics, automatically summarize content, and split videos into chapters.
Lee mentioned that the technology can improve ad insertion and content moderation, for example by distinguishing violent videos featuring knives from instructional ones. It can also support media analytics and automatically generate highlight reels, headlines, and tags from video content.
When I asked about the risk of bias in these models, given that models are well documented to reflect the biases present in their training data, Lee acknowledged the concern. For instance, training a model on news clips that sensationalize crime could result in biased interpretations.
He assured me that Twelve Labs is committed to meeting internal metrics for bias and “fairness” prior to model deployment, and the company intends to provide benchmarks and datasets related to model ethics in the future.
Lee emphasized that “our product differs from large language models like ChatGPT in that it is specifically designed to understand video. It integrates visual, auditory, and spoken elements within the content.” He noted, “We have really pushed the technical boundaries of video understanding.”
Google is also developing a multimodal model for video comprehension, known as MUM, which powers video recommendations on Google Search and YouTube. Beyond MUM, Google, Microsoft, and Amazon all offer AI-driven API services that identify objects, locations, and actions in videos and extract detailed metadata frame by frame.
However, Lee believes Twelve Labs stands out due to the quality of its models and its platform’s fine-tuning capabilities. These features enable customers to customize the models with their own data for specialized video analysis.
Today, Twelve Labs is introducing Pegasus-1, a new multimodal model that can handle a range of prompts for whole-video analysis. For example, Pegasus-1 can be asked to produce a detailed report on a video or to highlight specific segments with timestamps.
“Enterprise organizations are beginning to realize the potential of harnessing their extensive video data for new opportunities. Yet, many conventional video AI models provide limited capabilities that fall short of meeting the complex needs of businesses,” Lee noted. “By utilizing advanced multimodal video understanding models, enterprises can attain near-human comprehension of video content without labor-intensive analysis.”
Since launching in private beta in early May, Twelve Labs says its user base has grown to 17,000 developers. The company is also working with firms across the sports, media and entertainment, e-learning, and security sectors, including the NFL, though Lee did not disclose how many.
Twelve Labs also continues to raise money, an important part of any startup business. The company recently closed a $10 million strategic funding round from investors including Nvidia, Intel, and Samsung Next, bringing its total raised to $27 million.
“This investment focuses on strategic partners capable of accelerating our growth in research, product development, and distribution,” Lee said. “It fuels our ongoing innovation in video understanding, enabling us to deliver powerful models tailored to diverse business needs. We are advancing the industry, empowering companies to achieve remarkable results.”