There have been numerous initiatives aimed at creating open-source, AI-powered voice assistants—such as Rhasspy, Mycroft, and Jasper—all designed to offer privacy-focused, offline experiences without sacrificing functionality. However, progress in this area has been exceptionally slow. The unique challenges associated with open-source projects, combined with the complexity of building a robust voice assistant, have hindered advancements. Major players like Google Assistant, Siri, and Alexa benefit from years, if not decades, of research and development, as well as substantial infrastructure.
Despite these obstacles, the Large-scale Artificial Intelligence Open Network (LAION), a German nonprofit known for curating some of the world’s most popular AI training datasets, is undeterred. This month, LAION unveiled its new project, BUD-E, which aims to develop a “fully open” voice assistant capable of running on consumer-grade hardware.
But why initiate yet another voice assistant project in a landscape filled with abandoned attempts? According to Wieland Brendel, a fellow at the ELLIS Institute and contributor to BUD-E, existing open-source solutions lack the extensible architecture needed to harness emerging Generative AI (GenAI) technologies, particularly large language models (LLMs) similar to OpenAI’s ChatGPT.
“Most interactions with voice assistants rely on chat interfaces that can feel cumbersome and unnatural,” Brendel explained in an email interview. “While they effectively execute simple commands—like playing music or turning on lights—they fall short for more engaging conversations. The goal of BUD-E is to create a voice assistant that feels far more natural and replicates the flow of human dialogue, complete with the ability to remember past interactions.”
Brendel emphasized that LAION is committed to ensuring all components of BUD-E can be integrated with applications and services without licensing fees, even for commercial use—an advantage not always present with other open assistant projects.
In partnership with the ELLIS Institute in Tübingen, tech consultancy Collabora, and the Tübingen AI Center, BUD-E—short for “Buddy for Understanding and Digital Empathy”—has an ambitious roadmap. In a blog post, the LAION team outlined its objectives for the coming months, particularly focusing on imbuing BUD-E with “emotional intelligence” and enabling it to manage conversations with multiple participants.
“There’s a significant demand for an effective natural voice assistant,” Brendel stated. “LAION has demonstrated its capability in building engaged communities, while the ELLIS Institute and Tübingen AI Center are dedicated to providing the resources necessary to advance our development efforts.”
BUD-E is already operational: you can download and install it from GitHub on Ubuntu or Windows (macOS support is forthcoming), though it's important to note that the project is still in its early stages.
LAION has integrated various open models to create a minimum viable product (MVP), which includes Microsoft's Phi-2 LLM, Columbia's StyleTTS2 for text-to-speech, and Nvidia's FastConformer for speech-to-text. As a result, the initial user experience leaves room for improvement: achieving a response time of approximately 500 milliseconds, comparable to commercial voice assistants like Google Assistant and Alexa, currently requires robust hardware such as an Nvidia RTX 4090 GPU.
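Conceptually, the MVP chains three stages: speech-to-text, a language model, and text-to-speech, with end-to-end latency being the figure that matters. The sketch below illustrates that shape only; the class, interfaces, and stub components are illustrative assumptions, not BUD-E's actual API, with the real FastConformer, Phi-2, and StyleTTS2 integrations standing where the lambdas are.

```python
import time
from typing import Callable


class VoicePipeline:
    """Illustrative STT -> LLM -> TTS chain with latency measurement.

    The three callables are stand-ins for models like FastConformer
    (speech-to-text), Phi-2 (response generation), and StyleTTS2
    (text-to-speech); the real integrations differ.
    """

    def __init__(
        self,
        stt: Callable[[bytes], str],
        llm: Callable[[str], str],
        tts: Callable[[str], bytes],
    ):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def respond(self, audio_in: bytes) -> tuple[bytes, float]:
        start = time.perf_counter()
        text = self.stt(audio_in)      # transcribe the user's speech
        reply = self.llm(text)         # generate a text response
        audio_out = self.tts(reply)    # synthesize the reply as audio
        latency_ms = (time.perf_counter() - start) * 1000
        return audio_out, latency_ms


# Stub components, just to exercise the chain end to end.
pipeline = VoicePipeline(
    stt=lambda audio: "turn on the lights",
    llm=lambda text: f"Okay, handling: {text}",
    tts=lambda text: text.encode("utf-8"),
)
audio, latency_ms = pipeline.respond(b"\x00\x01")
print(audio)  # b'Okay, handling: turn on the lights'
```

With real models, each of the three stages contributes to the total, which is why the ~500 ms target is hard to hit without a high-end GPU.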
Collabora is assisting pro bono by adapting its open-source speech recognition and text-to-speech engines, WhisperLive and WhisperSpeech, specifically for BUD-E. “Building our own text-to-speech and speech recognition models allows us to customize them in ways that proprietary models available via APIs cannot,” said Jakub Piotr Cłapa, an AI researcher at Collabora and member of the BUD-E team.
“In joining forces with the broader open-source community, we aim to make our models more accessible and beneficial to a wider audience.”
In the short term, LAION plans to reduce BUD-E's hardware requirements and latency. Longer-term goals include creating a dialogue dataset for fine-tuning BUD-E, developing a memory system to store past conversations, and building a speech processing pipeline that can handle discussions involving multiple speakers.
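A conversation-memory system of the kind described could take a shape like the following minimal sketch. Everything here, the class names, the per-turn speaker field, and retrieval by simple word overlap, is an assumption made for illustration, not LAION's design:

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str  # e.g. "user" or "assistant"
    text: str


@dataclass
class ConversationMemory:
    """Hypothetical store for past exchanges; BUD-E's actual design may differ."""

    turns: list[Turn] = field(default_factory=list)

    def remember(self, speaker: str, text: str) -> None:
        self.turns.append(Turn(speaker, text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return the k past utterances sharing the most words with the query."""
        q = set(query.lower().split())
        scored = sorted(
            self.turns,
            key=lambda t: len(q & set(t.text.lower().split())),
            reverse=True,
        )
        return [t.text for t in scored[:k]]


memory = ConversationMemory()
memory.remember("user", "my dog is named Rex")
memory.remember("assistant", "Nice to meet Rex!")
memory.remember("user", "play some jazz")
print(memory.recall("what is my dog called", k=1))  # ['my dog is named Rex']
```

A production system would likely use embedding-based retrieval rather than word overlap, and tagging each turn with a speaker is also the hook a multi-speaker pipeline would need.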
I asked about the importance of accessibility, considering that speech recognition technologies have historically struggled with non-English languages and diverse accents. A Stanford study found that speech recognition systems from major companies were nearly twice as likely to incorrectly transcribe audio from Black speakers as from white speakers.
Brendel acknowledged that while LAION is not overlooking accessibility, it is currently not the immediate focus for BUD-E. “Our initial effort is on reimagining the interaction experience with voice assistants. Once we accomplish that, we can broaden the application to accommodate a diverse range of accents and languages,” he shared.
LAION also has intriguing ideas for BUD-E, including an animated avatar to personalize the assistant and using webcam analysis to gauge users’ emotional states. Naturally, the ethics surrounding facial recognition are complex. However, Robert Kaczmarczyk, a co-founder of LAION, asserted that the organization prioritizes safety.
“We strictly follow the ethical and safety guidelines set by the EU AI Act,” he stated, referring to the legislation governing AI’s development and use in the European Union. This act allows member states to enact stricter rules for AI deemed “high-risk,” including those involving emotion classification technology.
“This commitment to transparency helps identify and rectify potential biases from the outset, promoting scientific integrity,” Kaczmarczyk added, underscoring LAION’s intent to make its datasets widely available for research that meets high standards of reproducibility.
While LAION’s previous ventures have faced ethical scrutiny, its pursuit of BUD-E may signal a positive change; only time will tell.