Introducing Apache Airflow 2.10: A New Era for AI Data Orchestration

Getting data from its source to effective use in analytics and AI often isn’t straightforward. Data orchestration technology, such as the open-source Apache Airflow project, plays a vital role in facilitating data pipelines that deliver data where it's needed.

Today marks the release of Apache Airflow 2.10, the project's first significant update since Airflow 2.9 in April. This new version introduces hybrid execution, allowing organizations to optimize resource allocation for various workloads, from simple SQL queries to demanding machine learning (ML) tasks. Enhanced lineage capabilities provide greater visibility into data flows, which is essential for governance and compliance.

Astronomer, the leading commercial vendor behind Apache Airflow, is also updating its Astro platform to integrate the open-source dbt-core (Data Build Tool). This integration unifies data orchestration and transformation workflows on a single platform.
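The article doesn't detail the integration's internals, but the basic pattern of driving dbt-core from Airflow can be sketched with stock operators. In this minimal sketch, the DAG simply shells out to the dbt CLI; the project directory and dag_id are hypothetical placeholders.

```python
from datetime import datetime

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

# Hypothetical path: point this at a real dbt-core project.
DBT_PROJECT_DIR = "/opt/airflow/dbt/analytics"

with DAG(
    dag_id="dbt_transformations",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    # Running each dbt command as its own task lets Airflow handle
    # retries, alerting, and scheduling around every transformation step.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_run >> dbt_test
```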

Collectively, these updates aim to streamline data operations and bridge the gap between traditional data workflows and emerging AI applications, giving enterprises a more adaptable approach to orchestration across diverse data environments and AI processes.

Julian LaNeve, CTO of Astronomer, commented, “When you adopt orchestration, it’s about coordinating activities across the entire data supply chain and ensuring central visibility.”

How Airflow 2.10 Enhances Data Orchestration with Hybrid Execution

A significant enhancement in Airflow 2.10 is the introduction of hybrid execution. Previously, Airflow users had to choose a single executor for an entire deployment, typically either the Kubernetes executor or the Celery executor. Kubernetes excels at complex, compute-intensive jobs, while Celery is more efficient for lighter tasks.

Real-world data pipelines, however, often encompass a mix of workload types. LaNeve pointed out that an organization might need to perform a simple SQL query alongside a complex machine learning workflow in the same deployment. Hybrid execution now enables this flexibility, allowing each component of the data pipeline to be optimized for the appropriate level of compute resources.

LaNeve noted, “Choosing execution modes at the pipeline and task level, rather than uniformly across the entire deployment, provides a new level of flexibility and efficiency for Airflow users.”
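In code, this amounts to tagging individual tasks with the executor they should run on. The following is a minimal sketch, assuming a deployment whose configuration lists both executors (e.g. executor = CeleryExecutor,KubernetesExecutor in airflow.cfg); the task bodies are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def hybrid_execution_demo():
    # A lightweight query can stay on the deployment's default
    # executor (e.g. Celery), which is cheap for short tasks.
    @task
    def run_simple_query():
        return ["row1", "row2"]  # placeholder result

    # A compute-heavy ML step is routed to Kubernetes, where it
    # runs in an isolated pod with its own resource limits.
    @task(executor="KubernetesExecutor")
    def train_model(rows):
        print(f"training on {len(rows)} rows")  # placeholder work

    train_model(run_simple_query())

hybrid_execution_demo()
```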

The Importance of Data Lineage in AI-Oriented Data Orchestration

Data lineage—understanding the origin and journey of data—is critical for both traditional analytics and emerging AI workloads. Robust lineage tracking is vital in AI and machine learning, where the quality and provenance of data can significantly impact outcomes.

Before Airflow 2.10, lineage tracking had limitations, particularly around custom Python code. The new release improves Airflow's ability to capture dependencies and data flows within pipelines, custom code included. This improved tracking fosters trust in AI systems. As LaNeve put it, “A key component to any AI application today is trust.” Users need assurance that AI-generated outputs are reliable, and clear lineage provides an auditable trail documenting how data was sourced, transformed, and used to train models, strengthening data governance and security around sensitive information.
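To make this concrete, Airflow's dataset primitives show how producer/consumer dependencies are declared so that lineage tooling can pick them up; the 2.10 improvements build on this kind of plumbing. The sketch below is illustrative only: the dataset URI and task bodies are placeholder assumptions, and richer lineage capture (for example via the OpenLineage provider) is configured separately.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical URI for illustration.
CLEANED = Dataset("s3://lake/cleaned/customers.parquet")

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def producer():
    # Declaring the dataset as an outlet records the dependency,
    # making the data flow visible to lineage tooling.
    @task(outlets=[CLEANED])
    def clean_customers():
        ...  # placeholder transformation

    clean_customers()

# The consumer DAG is scheduled on the dataset itself, so the
# producer-to-consumer relationship is explicit and auditable.
@dag(schedule=[CLEANED], start_date=datetime(2024, 1, 1), catchup=False)
def trainer():
    @task
    def train_model():
        ...  # placeholder training step

    train_model()

producer()
trainer()
```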

Looking Ahead to Airflow 3.0

As data governance, security, and privacy continue to gain importance, LaNeve is already looking ahead to Airflow 3.0. The upcoming release aims to modernize Airflow for the generative AI era, with two headline priorities: making the platform more language-agnostic, so users can write tasks in any programming language, and making it more data-aware, shifting the emphasis from orchestrating processes to managing data flows.

LaNeve emphasized, “We want to ensure that Airflow remains the standard for orchestration over the next 10 to 15 years.”
