Today, Databricks announced its acquisition of Lilac, a Boston-based applied research startup specializing in data understanding and manipulation. The financial terms of the acquisition remain undisclosed.
Led by Ali Ghodsi, Databricks aims to integrate Lilac’s team and technology into its data intelligence platform, previously known as the data lakehouse. This integration will provide users across various domains with a streamlined approach to enhance dataset quality for developing high-performance large language model (LLM) applications.
This acquisition aligns with Databricks' vision of becoming a comprehensive platform for data and generative AI solutions. Recently, the company also invested an undisclosed sum in Mistral, a leading generative AI startup that has achieved substantial success in Europe.
Lilac: Simplifying Data Exploration
The acquisition of MosaicML last year marked Databricks' strategic shift towards an AI-driven future, enabling users to securely build generative AI applications using hosted data. Since then, Databricks has rolled out multiple open models, empowering clients to develop, deploy, and maintain high-quality LLM applications tailored to various business needs.
As the industry well knows, high-quality data is the foundation of effective AI initiatives, including LLM systems. Teams need reliable data both to train models well and to test real-world performance for issues like bias and hallucinations. Lilac's tooling is meant to address these data quality challenges within Databricks.
Traditionally, teams have employed labor-intensive manual methods to explore unstructured data and rectify its shortcomings. Founded in 2023 by former Google engineers Daniel Smilkov and Nikhil Thorat, Lilac provides a scalable, open-source solution. Its intuitive user interface and AI-enhanced features allow users to analyze, understand, and modify unstructured text data efficiently.
Features of Lilac
According to Lilac's website, data scientists and AI researchers can leverage its capabilities for tasks such as:
- Clustering and categorizing documents
- Performing semantic and keyword searches
- Detecting personal information or duplicates and making necessary adjustments with comparison views
- Tailoring datasets for specific needs
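To make these tasks concrete, here is a minimal sketch of keyword search, duplicate detection, and personal-information redaction over a handful of text records. It uses only the Python standard library and is not Lilac's API. The function names, the sample documents, and the deliberately simple email pattern are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Deliberately simple pattern for illustration only; real PII detection
# covers far more than email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def keyword_search(docs, keyword):
    """Return indices of documents containing the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [i for i, doc in enumerate(docs) if needle in doc.lower()]


def find_duplicates(docs):
    """Group indices of documents whose normalized text is identical."""
    groups = defaultdict(list)
    for i, doc in enumerate(docs):
        groups[normalize(doc)].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


# Hypothetical sample records, not taken from any real dataset.
docs = [
    "Contact alice@example.com for the quarterly report.",
    "contact  alice@example.com for the quarterly report.",
    "The model produced a toxic reply.",
]

print(keyword_search(docs, "report"))
print(find_duplicates(docs))
print(redact_emails(docs[0]))
```

Exact matching after normalization only catches trivial copies. Near-duplicate and semantic matching, which tools in this space typically offer, generally rely on embeddings or hashing schemes that are beyond this sketch.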
"The team behind Lilac specifically designed their product to analyze model outputs for bias or toxicity, and to prepare data for Retrieval-Augmented Generation (RAG) and fine-tuning or pre-training LLMs," noted Databricks executives Matei Zaharia, Naveen Rao, Jonathan Frankle, Hanlin Tang, and Akhil Gupta in a joint blog post.
They further emphasized that Lilac’s technology will be integrated into Databricks’ Mosaic AI tooling, enhancing developers' ability to curate datasets for customized generative AI systems. Although specific integration details are yet to be disclosed, the goal remains clear: to simplify data tailoring for evaluating and monitoring LLM outputs and preparing datasets for important processes like RAG and model fine-tuning.
Expanding Generative AI Capabilities
This acquisition is a significant step for Databricks towards offering end-to-end tools for developing robust generative AI applications. Users on the Databricks platform already have access to much of what is needed to create LLM-powered systems, including open models from providers like Meta, Stability AI, and Mistral, alongside specialized Mosaic tools for experimentation and optimization.
In response to similar market demands, competitors like Snowflake are also advancing in this space, having introduced Cortex, a fully managed service to aid customers in building apps powered by advanced open models.