DataPelago: Unlock Major Savings for Enterprises with Universal Data Processing Solutions

As the importance of data grows, businesses are striving to extract maximum value from their information. However, enterprise data is increasing rapidly—doubling every two years—while the computing power needed to process it efficiently is becoming limited.

DataPelago, a California-based startup, aims to tackle this challenge with its "universal data processing engine." This innovative platform enhances the performance of existing data query engines, including open-source options, by leveraging high-performance computing technologies like GPUs and FPGAs (Field Programmable Gate Arrays). As a result, it can handle exponentially growing volumes of complex data in various formats.

Emerging from stealth mode, DataPelago claims to achieve a five-fold reduction in query/job latency while significantly cutting costs. The company has secured $47 million in funding from several venture capital firms, including Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners, and Silicon Valley Bank.

Navigating the Data Challenge

Over a decade ago, analysis of structured and semi-structured data was essential for data-driven growth, offering enterprises insights into their performance. Since then, technological evolution has led to a surge in unstructured data (images, PDFs, audio, and video), which now represents 90% of all created information. This shift poses a challenge for enterprises aiming to utilize their extensive data assets for advanced applications like large language models.

As companies seek to mobilize both structured and unstructured data, they run into performance bottlenecks and struggle to process that data in a timely and cost-effective way. According to DataPelago CEO Rajan Goyal, the limitations stem from legacy platforms that were originally designed for structured data and general-purpose computing (CPUs).

“Today, companies face two choices for accelerated data processing,” Goyal explained. “Open-source systems offered by cloud providers have lower licensing fees but lead to higher cloud infrastructure costs. Conversely, proprietary services, while potentially more performant, come with steep licensing fees. Both options increase the total cost of ownership (TCO) for customers.”

To bridge this performance and cost gap, DataPelago offers a unified platform that dynamically enhances query engines using GPU and FPGA technology, allowing them to meet advanced processing requirements without substantially raising TCO.

“Our engine accelerates open-source query engines like Apache Spark and Trino by leveraging GPUs, resulting in a 10:1 reduction in server count and corresponding infrastructure and licensing cost reductions. Customers experience significant price-to-performance advantages, enabling them to fully utilize their data assets,” Goyal stated.

Core Components of DataPelago

DataPelago's offering comprises three core components: DataApp, DataVM, and DataOS. The DataApp is a pluggable layer that integrates with open data processing frameworks like Apache Spark and Trino, enhancing their capabilities at both the planning and execution levels.

Once deployed, queries or data pipelines run unmodified, requiring no changes to the user-facing application. On the backend, the framework's planner produces a plan, which DataPelago picks up and converts, using an open-source library such as Apache Gluten, into an open-standard Intermediate Representation (IR) called Substrait. DataOS then converts that IR into an executable Data Flow Graph (DFG).
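To make the plan-to-graph step concrete, here is a minimal Python sketch of how an operator tree might be flattened into a data flow graph and ordered for execution. The nested-dict plan, the operator names, and the `plan_to_dfg` helper are all hypothetical stand-ins chosen for illustration; they do not reflect the actual Substrait format or DataPelago's internals.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy operator tree standing in for a query plan. This is NOT the
# Substrait format; it is an assumed shape used only for illustration.
plan = {
    "op": "aggregate",
    "inputs": [
        {"op": "filter", "inputs": [{"op": "scan", "inputs": []}]},
    ],
}


def plan_to_dfg(node, dfg=None):
    """Flatten an operator tree into {operator: set of upstream operators}.

    Assumes operator names are unique within the plan.
    """
    if dfg is None:
        dfg = {}
    dfg.setdefault(node["op"], set())
    for child in node["inputs"]:
        dfg[node["op"]].add(child["op"])
        plan_to_dfg(child, dfg)
    return dfg


dfg = plan_to_dfg(plan)

# A valid execution order runs every operator after its upstream inputs.
order = list(TopologicalSorter(dfg).static_order())
```

For this three-operator chain, the only valid order is scan, then filter, then aggregate.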

DataVM evaluates the DFG nodes and dynamically allocates them to the most appropriate computing resources—be it CPU, FPGA, or Nvidia/AMD GPUs—based on availability and cost/performance metrics. This system directs workloads to the most suitable hardware, maximizing performance and cost efficiency.
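The placement idea described here can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Device` and `Node` classes, the per-device runtime estimates, and the "estimated seconds times cost per second" scoring rule are all hypothetical, and the article does not say how DataVM actually scores or schedules nodes.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str               # e.g. "cpu", "gpu", "fpga"
    available: bool         # whether the device can accept work right now
    cost_per_second: float  # assumed price of occupying the device


@dataclass
class Node:
    name: str
    # Assumed runtime estimate in seconds per device name.
    # A missing key means the node cannot run on that device.
    est_seconds: dict


def place(node, devices):
    """Pick the available device with the lowest estimated cost for a node."""
    candidates = [
        d for d in devices if d.available and d.name in node.est_seconds
    ]
    if not candidates:
        raise RuntimeError(f"no available device can run {node.name}")
    return min(
        candidates, key=lambda d: node.est_seconds[d.name] * d.cost_per_second
    )


def place_all(nodes, devices):
    """Map each node name to the name of its chosen device."""
    return {n.name: place(n, devices).name for n in nodes}


# Illustrative numbers only.
devices = [
    Device("cpu", True, 1.0),
    Device("gpu", True, 4.0),
    Device("fpga", False, 2.0),  # fastest for the scan, but currently busy
]
nodes = [
    Node("scan", {"cpu": 10.0, "gpu": 2.0, "fpga": 1.0}),
    Node("regex_filter", {"cpu": 3.0, "gpu": 2.0}),
]
placement = place_all(nodes, devices)
```

With these made-up numbers, the scan lands on the GPU (cost 8 versus 10 on the CPU) while the filter stays on the CPU (cost 3 versus 8), showing how availability and cost/performance together drive the choice.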

Impact on Early Adopters

Although its approach of dynamically accelerating query engines is new, DataPelago claims to deliver a five-fold reduction in query/job latency and a two-fold decrease in TCO compared to current data processing systems.

“One client was spending $140 million on a single workload, with 90% of that cost on compute. We reduced their total expense to under $50 million,” Goyal shared.
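The figures in this quote can be checked with simple arithmetic. The short Python sketch below uses only the numbers Goyal gives and treats "under $50 million" as an upper bound, so the savings it computes are minimums rather than exact values.

```python
# Figures from the quote, in millions of US dollars.
total_before = 140
compute_share_pct = 90
total_after_max = 50  # "under $50 million", so this is an upper bound

# Portion of the original spend that went to compute: 126
compute_before = total_before * compute_share_pct // 100

# Minimum absolute savings: 90
min_savings = total_before - total_after_max

# Minimum relative savings: about 64.3 percent
min_savings_pct = 100 * min_savings / total_before

# Minimum cost ratio, before versus after: 2.8
min_cost_ratio = total_before / total_after_max
```

On these numbers the client's spend fell by at least a factor of 2.8, which is consistent with the two-fold TCO reduction the company claims.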

While he did not disclose the total number of customers, he noted significant interest across various sectors, including security, manufacturing, finance, telecommunications, SaaS, and retail. DataPelago's customer base features prominent names such as Samsung SDS, McAfee, and insurance technology provider Akad Seguros.

“DataPelago’s engine allows us to unify our GenAI and data analytics pipelines, processing structured, semi-structured, and unstructured data simultaneously while cutting our costs by over 50%,” stated André Fichel, CTO of Akad Seguros.

Looking ahead, Goyal is focused on expanding DataPelago’s reach to more enterprises in search of efficient data workload acceleration.

“Our next growth phase involves building a go-to-market team to handle the increasing number of customer conversations and expand our global presence,” he concluded.
