When ChatGPT launched over a year ago, it provided internet users with an always-available AI assistant for various tasks, from generating natural language content like essays to analyzing complex information. This rapid rise highlighted the powerful technology behind it: the GPT series of large language models (LLMs).
Today, LLMs, including the GPT series, are not just enhancing individual tasks; they are revolutionizing entire business operations. Companies are using commercial model APIs and open-source models to automate repetitive work, improve efficiency, and streamline key functions. Picture a marketing team working with an AI to design ad campaigns, or a support agent resolving tickets faster because the model surfaces the right records from a database instantly.
The Transformation of the Data Stack
Data is crucial for the performance of large language models. When trained effectively, these models enable teams to manipulate and analyze their data efficiently. As ChatGPT and its competitors gained traction over the past year, many enterprises integrated generative AI into their data workflows, simplifying the user experience and allowing customers to save time and resources for their core tasks.
One of the most significant advancements was the introduction of conversational querying capabilities. This feature allows users to interact with structured data (data organized in rows and columns) using natural language, eliminating the need to write complex SQL queries. With this text-to-SQL functionality, even non-technical users can input queries in plain language and receive insights from their data.
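The text-to-SQL flow described above can be sketched as a simple pipeline: assemble a prompt from the table schema and the user's plain-language question, have a model emit SQL, then run that SQL against the database. The sketch below is illustrative only; `build_prompt` and `answer` are hypothetical helpers, and a stub callable stands in for the real model API call:

```python
import sqlite3

def build_prompt(schema: str, question: str) -> str:
    # Assemble a text-to-SQL prompt: schema context plus the user's question.
    return (
        "Given this SQLite schema:\n"
        f"{schema}\n"
        "Write a single SQL query answering the question below. Return only SQL.\n"
        f"Question: {question}"
    )

def answer(question: str, conn, schema: str, llm) -> list:
    # `llm` is any callable mapping a prompt to SQL text; in production
    # this would be a call to a hosted or self-hosted model.
    sql = llm(build_prompt(schema, question))
    return conn.execute(sql).fetchall()

# Demo with an in-memory database and a stubbed "model".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)])

schema = "CREATE TABLE orders (region TEXT, amount REAL)"
stub_llm = lambda prompt: (
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
)

rows = answer("Total sales per region?", conn, schema, stub_llm)
print(rows)  # [('APAC', 50.0), ('EMEA', 200.0)]
```

Production systems add guardrails this sketch omits, such as validating that the generated SQL is read-only before executing it.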
Several key vendors have pioneered this capability, including Databricks, Snowflake, Dremio, Kinetica, and ThoughtSpot. Kinetica, which initially built on ChatGPT, now employs its own proprietary LLM. Snowflake offers two main tools: a copilot for conversational data inquiries and SQL query generation, and a Document AI tool that extracts information from unstructured data such as images and PDFs. Databricks offers similar functionality through its LakehouseIQ solution.
Emerging startups are also focusing on AI-based analytics. For example, California-based DataGPT provides a dedicated AI analyst that executes thousands of queries in real time and delivers results in a conversational format.
Supporting Data Management and AI Initiatives
In addition to generating insights, LLMs are increasingly facilitating data management tasks critical for building robust AI products. In May, Informatica introduced Claire GPT, a multi-LLM conversational AI tool that helps users discover, manage, and interact with their Intelligent Data Management Cloud (IDMC) data assets using natural language inputs. Claire GPT performs various functions, including data discovery, pipeline creation, metadata exploration, and quality control.
To further assist teams in developing AI offerings, Refuel AI has introduced a tailored LLM for data labeling and enrichment tasks. Research published in October 2023 indicates that LLMs can also effectively reduce noise in datasets, an essential step in ensuring quality AI.
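One common pattern for LLM-assisted denoising is to have the model re-label each example and flag rows where its label disagrees with the stored one, routing those rows to human review. The sketch below illustrates that pattern only; `filter_noisy` is a hypothetical helper, and a keyword-matching stub stands in for the model call:

```python
def filter_noisy(rows, relabel):
    # `relabel` stands in for an LLM call that returns a label for a text.
    # Keep rows where the model agrees with the stored label; collect the
    # rest as suspected noise for human review.
    clean, suspect = [], []
    for text, label in rows:
        (clean if relabel(text) == label else suspect).append((text, label))
    return clean, suspect

data = [
    ("great product, loved it", "positive"),
    ("arrived broken, very disappointed", "negative"),
    ("arrived broken, very disappointed", "positive"),  # likely mislabeled
]

# Stub classifier in place of a real model call.
stub_relabel = lambda text: "negative" if "broken" in text else "positive"

clean, suspect = filter_noisy(data, stub_relabel)
print(len(clean), len(suspect))  # 2 1
```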
LLMs are also applicable in data engineering, particularly in data integration and orchestration. They can generate the necessary code to convert diverse data types, connect to different sources, or create YAML and Python templates for constructing Airflow DAGs.
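For instance, when asked to scaffold an Airflow DAG, a model typically emits a Python file along the lines of the template below. Plain string formatting stands in here for the generation step, and the extract-then-load structure is a generic illustration, not any specific vendor's output:

```python
# Template for the kind of Airflow DAG file an LLM might generate
# from a natural-language pipeline description.
DAG_TEMPLATE = '''\
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(dag_id="{dag_id}", start_date=datetime(2024, 1, 1),
         schedule="{schedule}") as dag:
    extract = BashOperator(task_id="extract", bash_command="{extract_cmd}")
    load = BashOperator(task_id="load", bash_command="{load_cmd}")
    extract >> load
'''

def render_dag(dag_id: str, schedule: str,
               extract_cmd: str, load_cmd: str) -> str:
    # Fill the template; in practice an LLM would produce this file directly.
    return DAG_TEMPLATE.format(dag_id=dag_id, schedule=schedule,
                               extract_cmd=extract_cmd, load_cmd=load_cmd)

code = render_dag("daily_sales", "@daily",
                  "python extract.py", "python load.py")
print(code)
```

Generated pipeline code like this still needs review before deployment, since a plausible-looking DAG can encode the wrong schedule or task ordering.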
Looking Ahead
In just a year, LLMs have significantly impacted the enterprise landscape, and as these models advance in 2024, we can expect even more applications across the data stack, including the emerging field of data observability. Monte Carlo has introduced Fix with AI, a tool that identifies issues in data pipelines and recommends corrective code. Similarly, Acceldata has acquired Bewgle to enhance LLM integration for data observability.
As new applications emerge, it is crucial for teams to ensure that their language models, whether built in-house or fine-tuned from existing ones, maintain high performance. Even minor errors can ripple downstream and disrupt the customer experience.