Monday, 4 December 2023

Bringing order to data lakehouses, Onehouse is expanding its Apache Hudi technology with $25M raise


Managed data lakehouse vendor Onehouse today announced that it has raised $25 million in a series A round of funding to help further advance its go-to-market and technology efforts based on the open-source Apache Hudi project.

Onehouse emerged from stealth a year ago, in Feb. 2022, as the first commercial vendor providing support and service for Apache Hudi. Hudi, an acronym for Hadoop Upserts Deletes and Incrementals, traces its roots back to Uber in 2016, where it was first developed to help bring order to the massive volumes of data being stored in data lakes.

The Hudi technology provides a data lake table format as well as services to help with clustering, archiving and data replication. Hudi competes against multiple other open-source data lake table technologies including Apache Iceberg and Databricks Delta Lake.

The goal at Onehouse is to create a cloud-managed service that can help organizations benefit from a managed data lakehouse. Alongside the new funding, Onehouse announced its Onetable initiative that aims to enable users of Iceberg and Delta Lake to interoperate with Hudi. With Onetable, organizations can use Hudi for data ingestion into a data lake while still being able to benefit from query engine technologies that run on Iceberg — including Snowflake — as well as Databricks’ Delta Lake.


“We are really trying to build a new way of thinking about data architecture,” Onehouse founder and CEO Vinoth Chandar told VentureBeat. “We are very convinced that people should start with an interoperable lakehouse.”

Understanding the data lakehouse trend

The term “data lakehouse” was first coined by Databricks.

The goal of a data lakehouse is to combine the best aspects of a data lake, which provides large volumes of data storage, with those of a data warehouse, which provides structured data services for queries and data analytics. A 2022 report from Databricks identified a number of key benefits of the data lakehouse approach, including improved data quality, increased productivity and better data collaboration.

A key component of the data lakehouse model is the ability to apply structure to data lakes, which is where the open-source data lake table formats, including Hudi, Delta Lake and Iceberg, fit in. Multiple vendors are now building full platforms with those table formats as a foundation.

Among the many supporters of Apache Iceberg is Cloudera, which launched its data lakehouse service in August 2022. Dremio is another strong Iceberg supporter, using it as part of its data lakehouse platform. Even Snowflake, one of the pioneers of the cloud data warehouse concept, is now supporting Iceberg.

Onetable isn’t another data lake table format 

At the core of the major data lake formats today, including Hudi, Delta Lake and Iceberg, are files that organizations want to be able to use for analytics, business intelligence or operations.

A challenge that has emerged, though, is that vendor technologies have been increasingly vertically integrated — combining the data storage and query engines. Kyle Weller, head of product at Onehouse, explained he’s seen organizations confused about which vendor to choose based on which data lake table format approach is supported. The Onetable approach is intended to abstract away the differences across the data lake table formats, to create an interoperability layer.

“The goal and the mission of Onehouse is about decoupling data processing data query engines from how your core data infrastructure operates,” Weller told VentureBeat.

Weller added that at the foundation of many data lakes today are files stored in the Apache Parquet data storage format. What Onetable is essentially doing is providing a metadata layer on top of Parquet that enables easy translation from one table format to another.
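
To make the idea concrete, here is a minimal Python sketch of metadata-only translation. It is an illustration under stated assumptions, not Onetable's actual implementation: the metadata layouts are toy stand-ins rather than the real Hudi, Iceberg or Delta Lake specifications, and every name in it (TableSnapshot, read_toy_hudi, write_toy_iceberg, write_toy_delta, translate, the example file paths) is hypothetical. The point it shows is that the Parquet data files are referenced by path and never read or rewritten; only the description of the table changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSnapshot:
    """Format-neutral view of a table: column types plus Parquet file paths."""
    schema: dict
    data_files: tuple


def read_toy_hudi(meta):
    """Parse toy Hudi-style metadata (hypothetical layout) into a snapshot."""
    return TableSnapshot(dict(meta["schema"]), tuple(meta["commit"]["files"]))


def write_toy_iceberg(snap):
    """Emit toy Iceberg-style metadata (hypothetical layout) for the same files."""
    return {
        "format": "toy-iceberg",
        "schema": dict(snap.schema),
        "manifest": [{"path": path} for path in snap.data_files],
    }


def write_toy_delta(snap):
    """Emit toy Delta-style metadata (hypothetical layout) for the same files."""
    return {
        "format": "toy-delta",
        "schema": dict(snap.schema),
        "log": [{"add": path} for path in snap.data_files],
    }


WRITERS = {"iceberg": write_toy_iceberg, "delta": write_toy_delta}


def translate(hudi_meta, target):
    """Re-express the table metadata for `target` without touching data files."""
    return WRITERS[target](read_toy_hudi(hudi_meta))


# Example table: two Parquet files described by toy Hudi-style metadata.
hudi_meta = {
    "schema": {"trip_id": "string", "fare": "double"},
    "commit": {
        "files": [
            "s3://lake/trips/part-0001.parquet",
            "s3://lake/trips/part-0002.parquet",
        ]
    },
}

iceberg_meta = translate(hudi_meta, "iceberg")
delta_meta = translate(hudi_meta, "delta")
```

In this sketch, both translated descriptions point at exactly the same Parquet paths as the original, which is what would let a different query engine read data that was ingested through Hudi.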

Where Onetable fits into the data lakehouse use case

Chandar noted that Hudi provides advantages over other formats, such as transactional replication and fast data ingestion.

One potential use case he sees for Onetable is an organization that uses Hudi to ingest massive volumes of data but wants to use that data for analytics with another query engine or technology, such as a Snowflake Data Cloud deployment.

Chandar said a lot of companies have data sitting in data warehouses and are increasingly deciding to build a data lake, either because of costs or because they want to start a new data science team. The first thing those organizations will do is data ingestion, bringing all their transactional data to the lake, which is where Chandar said Hudi and the Onehouse service excel.

Now, with the benefit of the Onetable technology, the same organization that has ingested data into Onehouse can also use other technologies, such as Snowflake and Databricks, to run analytics queries on that data.

Looking forward for both Hudi and the Onehouse platform, Chandar emphasized that further optimizing the ability for organizations to utilize data quickly will remain a key theme.

“We have announced in the Hudi project that we want to add a caching layer at some point,” he said. “We are thinking about anything and everything around data and how we can optimize it really well.”
