Exclusive: Voltron Data Enhances AI Capabilities with Theseus Distributed Query Engine

The fictional Voltron robot, from the animated series of the same name, represents the power of combining multiple robot lions into a single formidable entity capable of achieving great feats.

Voltron Data, which launched in 2022 with $110 million in funding, aims to harness various open-source technologies, including Apache Arrow, Apache Parquet, and Ibis, to enhance data access. Today, Voltron Data has announced the Theseus distributed query engine, designed to significantly accelerate data queries for demanding AI workloads.

Theseus is engineered to optimize large-scale data pipelines and queries by leveraging GPUs and other hardware accelerators.

“We built Theseus on the same principles that guided our open-source initiatives—modular, composable, accelerated libraries that enhance data systems,” said Josh Patterson, co-founder and CEO of Voltron Data, in an exclusive interview. “This is our next step in becoming a leader in designing and building advanced data systems.”

Theseus: Built for Massive Volumes of Data

Theseus is tailored for executing distributed queries on large datasets of 10 terabytes or more, targeting organizations with petabyte-scale data processing needs, including Fortune 500 companies, government agencies, hedge funds, telecommunications providers, and media and entertainment firms.

A primary objective of Theseus is to speed up ETL (extract, transform, load), feature engineering, and other data preparation tasks, enabling faster data integration for downstream AI and analytics systems. As AI systems evolve, the demand for real-time data transformation increases.

“Our users have shared that the biggest issue they face is not feeding their AI systems fast enough,” Patterson stated. “This need inspired the development of Theseus.”

Traditional data queries are often bottlenecked by CPU performance. Theseus moves beyond standard CPU-based processing by using accelerated computing, including GPUs. Patterson described Theseus as “accelerator native,” optimized to fully leverage technologies like Nvidia GPUs and advanced networking and storage solutions.

At scale, this accelerator-native approach allows Theseus to execute queries faster than conventional CPU-based engines such as Apache Spark.

AI Applications with Theseus

One significant application for Theseus is hyperparameter optimization, where organizations can efficiently process numerous parameters for feature engineering, allowing them to refine model inputs more effectively.
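To make the idea concrete, here is a minimal sketch of a parameter sweep over a feature-engineering step, written in plain Python. It is not Theseus code and does not use any Theseus API; the function names (`rolling_mean`, `score`, `best_window`), the rolling-window feature, and the toy series are all assumptions chosen for illustration. The point it shows is that every candidate setting requires rebuilding the feature from the raw data, which is why the speed of the data preparation step tends to bound how many settings can be tried.

```python
from statistics import mean


def rolling_mean(values, window):
    """Hypothetical feature: mean of the previous `window` values.

    Positions without enough history get None.
    """
    return [
        mean(values[i - window:i]) if i >= window else None
        for i in range(len(values))
    ]


def score(values, window):
    """Mean absolute error when the feature is used to predict the current value."""
    features = rolling_mean(values, window)
    errors = [abs(f - v) for f, v in zip(features, values) if f is not None]
    return mean(errors)


def best_window(values, candidates):
    """Grid search: recompute the feature for every candidate, keep the lowest error."""
    return min(candidates, key=lambda w: score(values, w))


# Toy data standing in for a much larger dataset (assumption for illustration).
series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(best_window(series, [1, 2, 3]))
```

In this sketch the feature is recomputed three times, once per candidate window. On multi-terabyte inputs the same loop would rerun the whole preparation pipeline for each setting.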

“The quicker you can execute feature engineering and ETL processes, the fresher your data and the better your models will be,” Patterson noted.

Interoperability at Its Core

Theseus embraces open standards such as Apache Arrow, Apache Parquet, and Ibis to ensure interoperability.

“It’s not a proprietary, siloed system; any Apache Arrow-compatible data lake can be queried using Theseus,” explained Patterson. The architecture allows data to be integrated seamlessly with various popular machine learning tools and frameworks, including PyTorch and TensorFlow.

“We have created a straightforward method for moving data in and out of our systems,” Patterson added.

Theseus is fundamentally a distributed query engine and does not include its own user interface. Instead, it utilizes SQL queries and Ibis, enabling easy integration with existing front-end systems and workflows.
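As a rough sketch of what a headless, SQL-driven engine means for a front end, the example below sends a feature-engineering query as a plain SQL string and gets rows back. The article does not describe how clients connect to Theseus, so Python's built-in `sqlite3` is used here purely as a stand-in engine; the `events` table, its columns, and the `run_query` helper are hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Send a SQL string to the engine and return the rows as dictionaries."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in engine: an in-memory SQLite database (assumption, not Theseus).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# A typical feature-engineering query: per-user aggregates for a downstream model.
FEATURE_SQL = """
SELECT user_id,
       COUNT(*)    AS n_events,
       SUM(amount) AS total_amount
FROM events
GROUP BY user_id
ORDER BY user_id
"""

features = run_query(conn, FEATURE_SQL)
for row in features:
    print(row)
```

Because the front end only depends on the SQL text and the shape of the result, the engine behind `run_query` could in principle be swapped without changing the calling code.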

Partnerships and Future Initiatives

Voltron Data is entering the market with Theseus through strategic partnerships, starting with Hewlett Packard Enterprise (HPE).

This collaboration will integrate Theseus into the HPE GreenLake hybrid cloud platform, which provides the necessary infrastructure while allowing customers to unify queries across different engines using Ibis.

Looking ahead, Patterson indicated that Voltron Data aims to expand Theseus partnerships and enhance its functionality, including support for user-defined functions. The focus for 2024 will be on streamlining integration into comprehensive data science pipelines.

“Our goal is to make it faster and easier to connect with various components of the data science pipeline, empowering users in the process,” Patterson concluded.
