Speech Recognition: Enhancing Multimodal AI Systems with aiOla's Jargon-Understanding Approach
Speech recognition is a vital component of multimodal AI systems. While many enterprises are eager to adopt this technology, challenges remain, particularly in the accurate interpretation of industry-specific jargon. aiOla, an innovative Israeli startup, has made significant strides in addressing this issue. The company has introduced an approach designed to help speech recognition models better handle the specialized vocabulary of individual industries.
This development boosts the accuracy and responsiveness of speech recognition systems, making them more effective in complex enterprise environments, even under challenging acoustic conditions. In its initial case study, aiOla adapted OpenAI’s Whisper model, successfully reducing its word error rate and enhancing overall detection accuracy.
The Challenge of Jargon in Speech Recognition
In recent years, deep learning advancements have contributed to the emergence of high-performing automatic speech recognition (ASR) and transcription systems. OpenAI’s Whisper has garnered attention for its human-level robustness and accuracy in English speech recognition. However, since its launch in 2022, many have observed that Whisper's performance can suffer in real-world scenarios, with noisy environments complicating accurate audio interpretation. For instance, deciphering safety alerts amidst heavy machinery noise or understanding commands laden with specialized terminology in fields like medicine or law can prove challenging.
Organizations utilizing cutting-edge ASR models, like Whisper, often strive to tailor their systems to meet unique industry needs. Although this fine-tuning can enhance performance, it typically comes at a high cost in terms of both time and financial resources.
“Fine-tuning ASR models takes days and thousands of dollars — and that’s if you already have the data. If you don’t, collecting and labeling audio data could take months and cost tens of thousands of dollars,” says Gil Hetz, VP of Research at aiOla.
To address these challenges, aiOla has developed a two-step "contextual biasing" approach. First, the AdaKWS keyword spotting model identifies industry-specific jargon from speech samples. Next, these identified keywords guide the ASR decoder in incorporating the terms into the final transcribed text, enhancing the model’s ability to recognize specialized language effectively.
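To make the flow concrete, here is a minimal Python sketch of what such a two-step pipeline could look like. The spot_keywords function is a hypothetical placeholder for a keyword spotter like AdaKWS (aiOla has not released its implementation), and the decoder guidance is approximated with the open-source Whisper package's initial_prompt argument rather than aiOla's own biasing mechanism.

```python
# Minimal sketch of a two-step contextual-biasing pipeline.
# NOTE: `spot_keywords` is a hypothetical stand-in for a keyword-spotting
# model such as AdaKWS. Decoder guidance is approximated here with
# Whisper's `initial_prompt`, which biases decoding toward supplied terms.

import whisper  # pip install openai-whisper

JARGON = ["hydraulic coupler", "SKU", "pallet jack", "reefer unit"]

def spot_keywords(audio_path: str, vocabulary: list[str]) -> list[str]:
    """Hypothetical keyword spotter: returns the vocabulary terms detected
    in the audio. A real system would run a model like AdaKWS here."""
    ...  # placeholder detection logic
    return vocabulary  # assume every supplied term may appear

def transcribe_with_bias(audio_path: str, vocabulary: list[str]) -> str:
    model = whisper.load_model("base")
    # Step 1: spot the industry-specific terms in the speech sample.
    detected = spot_keywords(audio_path, vocabulary)
    # Step 2: pass the detected terms to the decoder as context so the
    # final transcript is biased toward the specialized vocabulary.
    result = model.transcribe(audio_path, initial_prompt=", ".join(detected))
    return result["text"]

print(transcribe_with_bias("inspection_report.wav", JARGON))
```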
In initial tests, aiOla employed Whisper and experimented with two techniques to enhance performance: KG-Whisper (keyword-guided Whisper) and KG-Whisper-PT (a prompt-tuned variant). Both adaptations outperformed the original Whisper model across various datasets, even in challenging acoustic environments.
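For readers curious what prompt tuning means in practice, the sketch below illustrates the general technique in PyTorch: the pretrained decoder is frozen and only a small block of learnable "soft prompt" vectors, prepended to the decoder inputs, is trained. The module structure and shapes are illustrative assumptions, not aiOla's KG-Whisper-PT code.

```python
# Generic sketch of prompt tuning: freeze the base decoder, train only a
# small set of soft-prompt vectors prepended to the decoder inputs.
# Illustrative only; not aiOla's implementation.

import torch
import torch.nn as nn

class SoftPromptDecoderWrapper(nn.Module):
    def __init__(self, decoder: nn.Module, d_model: int, n_prompt: int = 16):
        super().__init__()
        self.decoder = decoder
        for p in self.decoder.parameters():
            p.requires_grad = False          # keep the pretrained decoder frozen
        # Learnable prompt embeddings: the only parameters updated in training.
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor, encoder_out: torch.Tensor):
        # token_embeds: (batch, seq, d_model); prepend the soft prompts.
        batch = token_embeds.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        decoder_in = torch.cat([prompts, token_embeds], dim=1)
        return self.decoder(decoder_in, encoder_out)

# Smoke test with a stand-in transformer decoder.
dummy = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(512, 8, batch_first=True), num_layers=2)
wrapper = SoftPromptDecoderWrapper(dummy, d_model=512)
out = wrapper(torch.randn(2, 10, 512), torch.randn(2, 50, 512))
```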
“Our new model (KG-Whisper-PT) significantly reduces the Word Error Rate (WER) and enhances accuracy (F1 score). In tests on a medical dataset, it achieved an F1 score of 96.58, compared to Whisper’s 80.50, and a WER of 6.15 versus Whisper’s 7.33,” Hetz states.
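For context on the two metrics, the snippet below shows how they are typically computed: WER as word-level edit distance divided by the reference length, and F1 as the harmonic mean of precision and recall over detected keywords. The example strings and term sets are illustrative, not drawn from aiOla's test data.

```python
# How the two headline metrics are computed, in minimal form.
# WER = (substitutions + deletions + insertions) / reference word count.

def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def keyword_f1(reference_terms: set[str], detected_terms: set[str]) -> float:
    tp = len(reference_terms & detected_terms)
    if tp == 0:
        return 0.0
    precision = tp / len(detected_terms)
    recall = tp / len(reference_terms)
    return 2 * precision * recall / (precision + recall)

print(wer("administer five milligrams of epinephrine",
          "administer five milligrams of ephedrine"))       # 0.2
print(keyword_f1({"epinephrine", "milligrams"}, {"milligrams"}))  # ~0.67
```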
Crucially, this method is compatible with various ASR models. While aiOla utilized Whisper, the same approach can be applied to Meta’s MMS and other proprietary speech-to-text models, allowing enterprises to create a tailored recognition system without the need for retraining. Simply providing a list of industry-specific terms to the keyword spotter suffices.
“This model enables complete ASR capabilities that accurately identify jargon. It allows us to adapt quickly to different industries by merely altering the jargon vocabulary without retraining the entire system. Essentially, it’s a zero-shot model, able to predict without having seen specific examples during training,” Hetz explains.
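In practice, adapting such a system to a new domain would amount to swapping the vocabulary list, as in the short usage example below. It reuses the hypothetical transcribe_with_bias helper from the earlier sketch; the term lists and file names are invented for illustration.

```python
# Adapting to a new industry is just a change of vocabulary list; the
# underlying models stay untouched. Reuses the hypothetical
# `transcribe_with_bias` helper sketched above.

MEDICAL_TERMS = ["epinephrine", "tachycardia", "intubation"]
LOGISTICS_TERMS = ["reefer unit", "bill of lading", "cross-dock"]

print(transcribe_with_bias("er_handoff.wav", MEDICAL_TERMS))
print(transcribe_with_bias("dock_inspection.wav", LOGISTICS_TERMS))
```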
Time-Saving Benefits for Fortune 500 Companies
With its adaptability, aiOla's approach can benefit a wide array of industries with technical jargon, including aviation, transportation, manufacturing, and logistics. The company has begun deploying its adaptive model with Fortune 500 clients, significantly improving their efficiency in managing jargon-heavy processes.
For example, a Fortune 50 global shipping and logistics leader employed aiOla’s model to automate daily truck inspections, reducing each inspection from about 15 minutes to under 60 seconds. Similarly, one of Canada’s leading grocery chains utilized the model for monitoring product and meat temperatures, leading to projected annual time savings of 110,000 hours, over $2.5 million in anticipated savings, and a 5X ROI.
aiOla has shared its research in hopes of inspiring further advancements in AI by other research teams. However, the company is not offering API access to the adapted model or releasing its weights at this time. Enterprises can access this technology exclusively through aiOla’s subscription-based product suite.