Today, Databricks opened its annual Data and AI Summit with a significant change: it has open-sourced Unity Catalog, the platform it has been developing for the past three years as a comprehensive solution for data governance.
Previously a proprietary offering, Unity Catalog is now available under the Apache 2.0 license. This shift allows companies to use the underlying architecture and code to create and customize their own catalogs without paying Databricks. Unity Catalog will also ship with an OpenAPI specification, a server, and client support.
This announcement follows closely on the heels of a similar initiative by Snowflake, Databricks’ major competitor, which recently introduced Polaris Catalog, its own open catalog system for enterprises. However, while Databricks open-sourced Unity Catalog immediately (with Databricks CTO Matei Zaharia demonstrating the code live), Snowflake plans to open-source Polaris over the next 90 days.
Unity Catalog OSS: Empowering Customer Control
Databricks originally launched Unity Catalog as a proprietary data governance tool designed to manage access to data and AI assets within its ecosystem. It included features such as centralized data access management, auditing, data discovery, lineage tracking, and secure data sharing.
However, its closed-source nature limited users' ability to integrate it with other technologies, particularly query engines compatible with Apache Iceberg or Hudi, two widely used open table formats. Recognizing this limitation, Databricks introduced the Delta Lake Universal Format (UniForm) last year. The feature automatically generates the metadata that Apache Iceberg and Hudi require, so the table formats are unified into a single copy of the data that any supported engine can access.
With the open-sourcing of Unity Catalog and the introduction of open APIs, Databricks aims to provide a universal interface that accommodates all three open table formats (Delta Lake, Apache Iceberg, and Hudi) through UniForm. This improves compatibility across query engines, tools, and cloud platforms.
Joel Minnick, Databricks' VP of Product Marketing, explained, “With open-sourced Unity Catalog, current Databricks customers can leverage a broad ecosystem of Delta Lake and Apache Iceberg compatible engines, providing them the flexibility to access their managed data and AI assets via their preferred tools. Existing deployments utilize the same open APIs, allowing external clients to read from all tables, volumes, and functions in Unity Catalog with existing access controls.”
Unity Catalog also ensures interoperability with major cloud platforms (Microsoft Azure, AWS, GCP, and Salesforce) and compute engines such as Apache Spark, Presto, Trino, and others. It supports various data and AI platforms, including dbt Labs, Confluent, Fivetran, Granica, and more.
In addition to supporting open formats and engines, the catalog complies with Iceberg REST Catalog and Hive Metastore (HMS) interface standards, promoting cohesive governance across both tabular and non-tabular data and AI assets. This capability simplifies large-scale management of diverse data types, including machine learning models and generative AI tools.
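As a rough illustration of how an external client might address the catalog over HTTP, the Python sketch below only builds the URL for a list-tables request; it does not send it. The base URL, the `api/2.1/unity-catalog/tables` path, the query parameter names, and the `unity` and `default` catalog and schema names are assumptions made for illustration, and should be checked against the OpenAPI specification published with the project.

```python
from urllib.parse import urlencode, urljoin

# Assumed address of a locally running Unity Catalog server (hypothetical).
BASE_URL = "http://localhost:8080/"
# Assumed REST path for listing tables; verify against the OpenAPI spec.
TABLES_PATH = "api/2.1/unity-catalog/tables"


def list_tables_url(catalog: str, schema: str, base_url: str = BASE_URL) -> str:
    """Build (but do not send) a URL that would list the tables in a schema."""
    # Parameter names are assumptions; urlencode handles escaping of values.
    query = urlencode({"catalog_name": catalog, "schema_name": schema})
    return f"{urljoin(base_url, TABLES_PATH)}?{query}"
```

Calling `list_tables_url("unity", "default")` returns a URL that any HTTP client could then request, subject to whatever access controls the server enforces.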
How Does Unity Catalog Compare to Snowflake’s Polaris Catalog?
Like Unity Catalog, Snowflake’s Polaris Catalog emphasizes open catalog implementation for interoperability. However, Polaris is limited to data formatted for Apache Iceberg, while Unity Catalog OSS supports data in any format, including Iceberg, Delta, Hudi, Parquet, CSV, and JSON.
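The format comparison above can be captured in a small lookup, sketched here in Python. The dictionary keys and helper names are hypothetical, and the format lists reflect only what this article states, not either vendor's documentation; the article describes Unity Catalog OSS as supporting "any format", so its entry is a non-exhaustive sample.

```python
# Format support as described in this article (not taken from vendor docs).
# The Unity Catalog OSS set is a sample: the article says "any format".
CATALOG_FORMATS = {
    "unity_catalog_oss": {"iceberg", "delta", "hudi", "parquet", "csv", "json"},
    "polaris": {"iceberg"},
}


def supports(catalog: str, data_format: str) -> bool:
    """Return True if the article lists `data_format` for `catalog`."""
    return data_format.lower() in CATALOG_FORMATS.get(catalog, set())


def shared_formats() -> set[str]:
    """Formats that both catalogs are described as handling."""
    return CATALOG_FORMATS["unity_catalog_oss"] & CATALOG_FORMATS["polaris"]
```

Under these assumptions, `supports("polaris", "hudi")` is false and the only shared format is Iceberg, which matches the article's point that Polaris is limited to Iceberg data.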
Furthermore, Databricks’ offering extends to unstructured datasets (volumes) and AI tools, enabling organizations to manage images, documents, and other files essential for generative AI applications—a capability not available with Polaris.
Minnick added, “Snowflake's proprietary storage format tables cannot be accessed via Polaris, whereas Unity Catalog OSS APIs allow external clients to read from all tables, volumes, and functions in Databricks Unity Catalog.”
Globally, more than 10,000 organizations, including NASDAQ, Rivian, and AT&T, rely on Unity Catalog within the Databricks Data Intelligence Platform. The move to open source is expected to significantly influence adoption.
The Databricks Data and AI Summit runs from June 10 to June 13, 2024.