Groq now lets users run lightning-fast queries and other tasks with leading large language models (LLMs) directly on its website.
The company rolled the capability out quietly last week, and the speeds on display surpass its previous demonstrations. Users can type their queries or speak them aloud.
In my tests, Groq responded at an astonishing speed of approximately 1,256.54 tokens per second, a pace that felt nearly instantaneous and well above the roughly 800 tokens per second the company demonstrated in April. That speed is a notable advance, particularly since GPUs from companies like Nvidia struggle to match it.
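Throughput figures like this are typically computed as the number of generated tokens divided by wall-clock generation time. Below is a minimal sketch of one way such a measurement could be taken against Groq's OpenAI-compatible API using the openai Python package; the endpoint URL, model name, environment variable, and prompt are illustrative assumptions, not details from the demo.

```python
# Minimal sketch: estimate tokens-per-second for a single chat completion.
# Assumes a GROQ_API_KEY environment variable and Groq's OpenAI-compatible
# endpoint; the URL and model name are assumptions for illustration.
import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",  # assumed OpenAI-compatible endpoint
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": "Summarize the history of the GPU in 200 words."}],
)
elapsed = time.perf_counter() - start

# Rough throughput: completion tokens divided by wall-clock time.
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} tokens/sec")
```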
By default, Groq’s engine uses Meta’s open-source Llama3-8b-8192 LLM. Users can also select the larger Llama3-70b, as well as models from Google and Mistral, with more options coming soon.
This experience illustrates the speed and flexibility of LLM chatbots for both developers and non-developers. Groq’s CEO, Jonathan Ross, believes that LLM usage will grow significantly as users recognize the ease of operating on Groq’s fast engine. The demo showcases potential tasks like generating and editing job postings or articles in real time.
For instance, I requested a critique of the agenda for our VB Transform event on generative AI. Groq provided instantaneous feedback, including suggestions for clearer categorization and enhanced speaker profiles. When I asked for diverse speaker recommendations, it quickly generated a list with affiliations in a table format, which I could modify on the spot.
In a second exercise, I asked Groq to organize my speaking sessions for next week into a table. It not only produced the tables I needed but also allowed for quick edits, including spelling corrections and additional columns for forgotten details. It can even translate content into different languages. While a few adjustments required multiple prompts, these issues typically stem from the LLM level rather than processing speed, underscoring the vast potential of LLM capabilities at such high speeds.
Groq has garnered attention for its promise to run AI tasks faster and more affordably than competitors, thanks to its language processing unit (LPU), which handles LLM workloads more efficiently than GPUs by processing them in a linear fashion. While GPUs excel at model training, LLM inference (the work a trained model does when it generates responses in deployment) demands greater efficiency and lower latency.
Currently, Groq offers its service for powering LLM workloads for free and has attracted over 282,000 developers since launching just 16 weeks ago.
Groq provides a console where developers can build applications, much like other inference providers. Notably, developers who already build on OpenAI can switch their applications over to Groq in just a few steps.
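In practice, this works because Groq exposes an OpenAI-compatible endpoint, so the switch largely comes down to pointing an existing client at a different URL with a different key and model name. A minimal sketch, assuming the openai Python package and a Groq API key in an environment variable; the endpoint URL and model names are assumptions for illustration:

```python
# Hypothetical sketch of switching an OpenAI-based app to Groq's
# OpenAI-compatible endpoint: mostly a new base_url, API key, and model name.
import os
from openai import OpenAI

# Before (OpenAI):
#   client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
#   model = "gpt-4o"

# After (Groq):
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",  # assumed endpoint
)
model = "llama3-8b-8192"

# The rest of the application code can stay largely unchanged.
reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Draft a short job posting for a data engineer."}],
)
print(reply.choices[0].message.content)
```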
In an interview ahead of my talk at VB Transform, where Ross is an opening speaker, he noted that the event centers on the deployment of enterprise generative AI: large companies are moving toward deploying AI applications, which requires more efficient processing for their workloads.
Users can not only type queries but also speak them by pressing a microphone icon. Groq integrates OpenAI's Whisper Large V3 model for automatic speech recognition, converting the voice input to text before passing it to the LLM.
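A rough sketch of what that two-step pipeline could look like programmatically, assuming Groq exposes Whisper Large V3 through the same OpenAI-compatible audio transcription endpoint; the file name and model identifiers are illustrative assumptions:

```python
# Minimal sketch of the voice pipeline: audio -> Whisper transcription -> LLM.
# Assumes Groq serves Whisper Large V3 via an OpenAI-compatible audio endpoint;
# model names, endpoint URL, and the audio file are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

# Step 1: automatic speech recognition with Whisper Large V3.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

# Step 2: pass the recognized text to the LLM as a normal chat prompt.
answer = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": transcript.text}],
)
print(answer.choices[0].message.content)
```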
Groq claims its technology consumes roughly one-third the power of a GPU at worst, with most workloads using as little as one-tenth of the energy. In a world of scaling LLM workloads and increasing energy demands, Groq’s efficiency poses a significant challenge to the GPU-centric computation landscape.
Ross asserts that by next year, more than half of the world's inference computing could be running on Groq's chips. Further insights will be revealed at the upcoming Transform 2024 event.