AI-as-a-service provider AssemblyAI has launched its latest speech recognition model, Universal-1. Trained on over 12.5 million hours of multilingual audio data, Universal-1 achieves high speech-to-text accuracy in English, Spanish, French, and German. The company asserts that Universal-1 reduces hallucinations by 30% on speech data and by 90% on ambient noise compared to OpenAI’s Whisper Large-v3 model.
In a recent blog post, AssemblyAI described Universal-1 as a significant step in its goal to deliver accurate, reliable, and robust speech-to-text capabilities across multiple languages. The model can also code-switch effectively, transcribing multiple languages within a single audio file.
Universal-1 also delivers improved timestamp estimation, which is critical for audio and video editing as well as conversation analytics; it outperforms its predecessor, Conformer-2, by 13% on this task. Speaker diarization improves as well, with a 14% gain in concatenated minimum-permutation word error rate (cpWER) and a 71% improvement in speaker count estimation accuracy.
The model also features optimized parallel inference, greatly reducing processing time for lengthy audio files. Universal-1 transcribes one hour of audio in just 21 seconds on Nvidia Tesla T4 machines, five times faster than Whisper Large-v3, which takes 107 seconds for the same task with a smaller batch size.
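The quoted throughput numbers can be sanity-checked with simple arithmetic (the 21-second and 107-second figures are the ones reported above; the real-time factor is a standard derived metric, not a number from the announcement):

```python
# Back-of-the-envelope check of the reported throughput figures.
AUDIO_SECONDS = 3600        # one hour of input audio
UNIVERSAL_1_SECONDS = 21    # reported Universal-1 transcription time
WHISPER_V3_SECONDS = 107    # reported Whisper Large-v3 time for the same task

# Speedup of Universal-1 over Whisper Large-v3
speedup = WHISPER_V3_SECONDS / UNIVERSAL_1_SECONDS

# Real-time factor: processing time divided by audio duration (lower is faster)
rtf = UNIVERSAL_1_SECONDS / AUDIO_SECONDS

print(f"speedup over Whisper Large-v3: {speedup:.1f}x")  # ~5.1x
print(f"real-time factor: {rtf:.4f}")                    # ~0.0058
```

The 107 s / 21 s ratio works out to roughly 5.1x, consistent with the "five times faster" claim.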
Enhanced speech-to-text AI models offer numerous benefits, including producing more accurate and reliable notes, identifying action items, and extracting metadata such as proper nouns, speaker identity, and timing. These improvements will aid a range of applications, from AI-powered video editing to telehealth platforms that require precise clinical note entry and claims submission.
The Universal-1 model is now accessible via AssemblyAI’s API.
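As a rough illustration of how the API is used, a transcription job is submitted by POSTing a JSON body with an audio URL to AssemblyAI's transcript endpoint. The sketch below only builds that request body; the endpoint path and the `audio_url`/`speaker_labels` fields follow AssemblyAI's public REST API, but the example audio URL is hypothetical, and an actual request would additionally need an API key in the `Authorization` header:

```python
import json

# AssemblyAI's public REST API base (per their documentation)
API_BASE = "https://api.assemblyai.com/v2"

def build_transcript_request(audio_url: str, speaker_labels: bool = True) -> dict:
    """Build the JSON body for a POST {API_BASE}/transcript request.

    `speaker_labels` enables speaker diarization, one of the capabilities
    Universal-1 improves on. Sending the request (with an Authorization
    header holding your API key) is omitted here.
    """
    return {"audio_url": audio_url, "speaker_labels": speaker_labels}

# Hypothetical audio file for illustration only
payload = build_transcript_request("https://example.com/meeting.mp3")
print(json.dumps(payload))
```

The returned job is polled (or delivered via webhook) until the transcript is ready; consult AssemblyAI's API reference for the full workflow.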