London-based Synthesia, a startup specializing in AI video creation for enterprises, has enhanced its platform with the introduction of “expressive avatars.”
Starting today, these AI avatars advance traditional digital avatars by adjusting their tone, facial expressions, and body language to the content's context. The launch follows Microsoft's recent showcase of VASA, an AI framework that transforms human headshots into animated talking and singing videos complete with expressions and head movements. Unlike VASA, however, which remains a research project, Synthesia's expressive avatars are shipping as a commercial product, designed to help enterprises create more realistic AI videos for their audiences.
Synthesia’s Innovative Leap in AI Videos
Founded in 2017 by AI researchers and entrepreneurs from Stanford and Cambridge, Synthesia has developed an end-to-end platform that combines custom AI voices and avatars. Users can create studio-quality AI videos from pre-written scripts or AI-generated content, and the platform has seen significant enterprise adoption: more than 200,000 users have created over 18 million videos. Until now, however, its avatars could not convey sentiment effectively; they were unable to modify their tone or expressions based on the script in real time.
The launch of expressive avatars addresses this limitation.
According to Synthesia, the new AI avatars can comprehend the sentiment and context within text, adjusting their tone and expressions accordingly. They can convey a range of emotions through subtle changes in expressions, blinking, and eye movements. For example, an avatar might smile when discussing a joyful topic or slow down their speech with longer pauses for somber content.
“Our goal is not just to create digital renders but to introduce digital actors,” stated Jon Starck, Synthesia’s CTO, in a blog post. “This technology enhances the realism of digital avatars, blurring the line between the virtual and the real.”
Technical Foundation of Expressive Avatars
To achieve this nuanced sentiment understanding, Synthesia employs EXPRESS-1, a deep learning model trained on extensive text and video data reflecting real-world spoken communication.
“EXPRESS-1 predicts movements and facial expressions in real-time, perfectly aligning with speech nuances and emphasis, resulting in extraordinarily natural performances,” Starck explained. The new avatars also feature improved lip-sync and voice capabilities across multiple languages.
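Synthesia has not published technical details of EXPRESS-1, so the following is only a conceptual sketch of the general idea: classifying the sentiment of each script segment with an off-the-shelf model and mapping the result to expression cues. The Hugging Face `transformers` sentiment pipeline is a real, generic classifier standing in for the proprietary model; the cue mapping and script segments are invented for illustration.

```python
# Conceptual sketch only: EXPRESS-1 is proprietary, so this uses a generic
# off-the-shelf sentiment classifier to show how a script's emotional
# context might be detected before driving an avatar's delivery.
from transformers import pipeline

# Generic English sentiment classifier (a stand-in, not Synthesia's model).
classifier = pipeline("sentiment-analysis")

script_segments = [
    "We're thrilled to share our biggest product update of the year!",
    "Sadly, we must also announce the closure of our community forum.",
]

for segment in script_segments:
    result = classifier(segment)[0]  # e.g. {"label": "POSITIVE", "score": 0.99}
    # Hypothetical cue mapping: positive text gets a smile and upbeat pacing;
    # negative text gets a neutral face, slower speech, and longer pauses.
    if result["label"] == "POSITIVE":
        cue = "smile, brighter tone, normal pacing"
    else:
        cue = "neutral expression, slower speech, longer pauses"
    print(f"{result['label']} ({result['score']:.2f}) -> {cue}")
```

In a production system along the lines Starck describes, the per-segment signal would presumably condition facial animation and prosody directly rather than printing labels, but the sketch shows where text-level sentiment enters the pipeline.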
Implications of Expressive Avatars
While AI avatars with human-like emotions carry potential risks of misuse, Synthesia is committed to fostering positive enterprise applications, particularly in communication and knowledge sharing. Healthcare companies, for instance, could use expressive avatars to produce more empathetic patient videos, while marketing teams might use them to convey enthusiasm for a new product.
To promote responsible usage, Synthesia has revised its policies to restrict the types of content that can be created on its platform, and is investing in early misuse detection and in content provenance standards such as C2PA.
Synthesia, now a team of 300, works with more than 55,000 businesses, including half of the Fortune 100. Among its clients is Zoom, which reports a 90% efficiency gain in creating sales and training videos with Synthesia.