Understanding COW and MOR in Apache Hudi: Choosing the Right Storage Strategy 

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a powerful framework for managing large datasets on cloud storage systems, enabling efficient data ingestion, storage, and retrieval. One of its key features is support for two distinct storage types: Copy-On-Write (COW) and Merge-On-Read (MOR). Each storage strategy has unique characteristics and serves different use cases. In this blog, we will evaluate COW and MOR hands-on and look at how to choose between them.

Prerequisites 

Before you begin, ensure you have the following installed on your local machine: 

  • Docker 
  • Docker Compose 

Local Setup 

To set up Apache Hudi locally, follow these steps:  

  1. Clone the Repository:
git clone https://github.com/dnisha/hudi-on-localhost.git
cd hudi-on-localhost
  2. Start Docker Compose:
docker-compose up -d
  3. Access the Notebooks:

Username: minioadmin
Password: minioadmin

What is Copy-On-Write (COW)? 

Copy-On-Write (COW) is a storage type in Apache Hudi in which every write is applied by rewriting data files. When data is updated or inserted:

  • Hudi writes a new version of the affected data file in its entirety, rather than modifying records in place.
  • The existing data file remains unchanged until the new file is successfully written.

This makes the operation atomic: it either completely succeeds or fails without partial updates. Because readers only ever see fully written Parquet files, queries require no merge work at read time.
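As a reference point, here is a minimal PySpark sketch of writing a COW table. This is not the notebook's exact code: it assumes a SparkSession with the Hudi Spark bundle on the classpath (as in the repo's Docker setup), and the schema, record key, precombine field, and s3a path are illustrative, based on the bucket layout used later in this post.

from pyspark.sql import SparkSession

# Hudi requires the Kryo serializer; the Hudi bundle is assumed to be
# available on the classpath.
spark = (SparkSession.builder
    .appName("hudi-cow-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# Illustrative records: transaction_id is the record key, document the
# partition column, ts the precombine (ordering) field.
df = spark.createDataFrame(
    [("txn-001", "34", 42.0, 1700000000)],
    ["transaction_id", "document", "amount", "ts"],
)

hudi_options = {
    "hoodie.table.name": "transactions",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "transaction_id",
    "hoodie.datasource.write.partitionpath.field": "document",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://warehouse/cow/transactions"))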

Steps to Evaluate COW 

  1. Open the Notebook: 

  • In your browser, navigate to hudi_cow_evaluation.ipynb.

  2. Run Configuration Code:

  • Execute all configuration-related code in the notebook.
  • Ensure you specify the COPY_ON_WRITE table type.

  3. Updating a Record:

  • Focus on updating a record in the document=34 partition of the COW bucket.
  • Since you are using the COPY_ON_WRITE table type, a new Parquet file will be created for this update. You can find this file in the bucket at warehouse/cow/transactions/document=34. A sketch of such an update follows this list.
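If you want to reproduce the update outside the notebook, a hedged sketch along the same lines (schema and values are assumed, matching the earlier options):

# Upsert one changed record (same key, newer ts). Because the table type is
# COPY_ON_WRITE, Hudi rewrites the affected file slice: after this write,
# warehouse/cow/transactions/document=34 contains an additional Parquet file
# rather than a modified one.
updated = spark.createDataFrame(
    [("txn-001", "34", 99.0, 1700000100)],
    ["transaction_id", "document", "amount", "ts"],
)

(updated.write.format("hudi")
    .options(**hudi_options)   # same options as in the earlier sketch
    .mode("append")
    .save("s3a://warehouse/cow/transactions"))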

What is Merge-On-Read (MOR)? 

Merge-On-Read (MOR) is an alternative storage type in Apache Hudi that employs a different approach to data management. Here’s how it works: 

  • Base Parquet Files and Log Files: In MOR, Hudi maintains a combination of base Parquet files alongside log files that capture incremental changes. 
  • On-the-Fly Merging: When a read operation is executed, Hudi merges the base files and log files in real-time, providing the most up-to-date view of the data. 

This approach makes updates and inserts efficient, since the system does not need to rewrite entire Parquet files for every change. The trade-off is at read time: snapshot queries must merge log files with base files, which adds overhead until compaction rewrites the logs into new base files.
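A minimal sketch of the same write path with MOR; the only required change from the COW example is the table type (written to the MOR bucket). The read-optimized query mode shown at the end uses the standard Hudi option hoodie.datasource.query.type.

# Same write path as before, but with the MERGE_ON_READ table type.
mor_options = {
    **hudi_options,
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(df.write.format("hudi")
    .options(**mor_options)
    .mode("append")
    .save("s3a://warehouse/mor/transactions"))

# Snapshot query: merges base Parquet files with Avro log files on the fly.
snapshot = spark.read.format("hudi").load("s3a://warehouse/mor/transactions")

# Read-optimized query: reads only compacted base files, so it is faster but
# may not reflect the most recent (not-yet-compacted) updates.
read_optimized = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("s3a://warehouse/mor/transactions"))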

Steps to Evaluate MOR 

  1. Open the Notebook:

  • In your browser, navigate to hudi_mor_evaluation.ipynb.

  2. Run Configuration Code:

  • Execute all configuration-related code in the notebook.
  • Ensure you specify the MERGE_ON_READ table type.

  3. Updating a Record:

  • Focus on updating a record in the document=34 partition of the MOR bucket.
  • Since you are using the MERGE_ON_READ table type, the update is written to a new row-based log file (Avro) rather than a rewritten Parquet file. You can find this file in the bucket at warehouse/mor/transactions/document=34. A sketch of such an update follows this list.
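A corresponding sketch of the MOR update, with the same assumed schema as the earlier examples:

# Upsert against the MOR table: the change lands in a row-based Avro log file
# next to the Parquet base file in warehouse/mor/transactions/document=34,
# rather than triggering a rewrite of the base file.
updated = spark.createDataFrame(
    [("txn-001", "34", 120.0, 1700000200)],
    ["transaction_id", "document", "amount", "ts"],
)

(updated.write.format("hudi")
    .options(**mor_options)    # same MOR options as in the earlier sketch
    .mode("append")
    .save("s3a://warehouse/mor/transactions"))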

Choosing Between COW and MOR 

The choice between COW and MOR in Apache Hudi largely depends on your specific requirements: 

  • Read vs. Write Frequency: If your workload is read-heavy, COW may be the better choice due to its optimized read performance. Conversely, for write-heavy applications where data is ingested frequently, MOR can absorb the load more efficiently because it appends changes to log files instead of rewriting Parquet files.
  • Data Freshness vs. Query Speed: With COW, every query reads fully written Parquet files. With MOR, snapshot queries see the latest data at some merge cost, while read-optimized queries are fast but can lag behind the latest writes until compaction runs (see the sketch after this list).
  • Use Case: For analytical workloads and batch processing, COW shines. MOR is often the way to go for near-real-time ingestion and streaming applications.
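If you settle on MOR, compaction frequency is the main knob for balancing write throughput against read freshness. A hedged sketch using standard Hudi inline-compaction options (the values are illustrative starting points, not recommendations):

# Inline compaction folds accumulated log files back into Parquet base files
# as part of the write itself, bounding how stale read-optimized queries get.
compaction_options = {
    **mor_options,
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",  # compact every 5 delta commits
}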


Author: Deepak Nishad

I’m Deepak Nishad, a DevOps Engineer at Opstree Solutions, where I specialize in AWS cloud cost optimization, infrastructure automation, and database performance tuning. With a strong foundation in DevOps practices, I focus on designing scalable and efficient systems that enhance software delivery processes. My expertise includes working with tools like Jenkins, Docker, Kubernetes, and Terraform to streamline deployment pipelines and ensure system reliability. I'm passionate about bridging the gap between development and operations teams, ensuring seamless integration and continuous delivery.
