Unlocking Debezium: Exploring the Fundamentals of Real-Time Change Data Capture with Debezium and Harnessing its Power in Docker Containers

Introduction

In today’s rapidly evolving data-driven landscape, businesses need to react quickly to changes in their data systems. Real-time data integration and analysis have become crucial for making informed decisions. This is where Debezium comes into play, offering a powerful open-source platform for change data capture (CDC). In this blog post, we will build a basic understanding of Debezium and how it enables real-time data streaming.

What is Change Data Capture (CDC)?

Change Data Capture is a technique used to capture and propagate data changes in real-time from one system to another. CDC enables applications to access and utilize the most up-to-date data without relying on full database snapshots or batch processing. By capturing individual data changes such as inserts, updates, and deletes, CDC provides a granular and event-driven approach to data integration.

Introducing Debezium

Debezium, developed by Red Hat, is an open-source CDC platform built on Apache Kafka. It simplifies the process of extracting change data from various databases and streaming it to Apache Kafka topics. As a mature and reliable CDC solution, Debezium offers support for a wide range of databases, including MySQL, PostgreSQL, Oracle, SQL Server, and MongoDB.

How Does Debezium Work?

Debezium operates on a principle known as “log-based CDC,” leveraging the transaction log or change data capture log present in most databases. Instead of directly querying the database for changes, Debezium connects to the database’s log and captures every data manipulation operation, producing a stream of change events.

The key components of Debezium are:

Connectors: Debezium provides connectors for different databases, allowing seamless integration with the source database. Each connector understands the database’s log format and translates it into a standard set of change events.

Apache Kafka: Debezium streams the captured change events to Apache Kafka, a distributed streaming platform. Kafka acts as a centralized, fault-tolerant, and scalable backbone for handling the high volume of change data.

Change Events: Debezium produces change events in a standardized format, including details such as the source database, table, primary key, old and new values, timestamp, and transaction metadata. These change events are stored in Kafka topics and can be consumed by downstream applications or services.
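
For illustration, the value of an insert event might look like the following simplified sketch. The field values here are hypothetical, and real events also carry a schema section and richer source metadata:

{
  "before": null,
  "after": { "id": 1, "name": "Alice", "age": 30 },
  "source": {
    "connector": "postgresql",
    "db": "inventory",
    "schema": "public",
    "table": "customers"
  },
  "op": "c",
  "ts_ms": 1688000000000
}

The op field encodes the operation: "c" for create (insert), "u" for update, "d" for delete, and "r" for rows read during an initial snapshot.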

Benefits of Debezium

Real-time Data Streaming: Debezium enables organizations to stream data changes in real-time, ensuring that downstream applications have access to the most up-to-date information. This is crucial for building responsive and data-driven systems.

Low Latency: By leveraging database logs, Debezium achieves low-latency data capture. The change events are propagated to Kafka almost instantly, allowing near real-time data processing and analysis.

Scalability and Fault Tolerance: Apache Kafka provides scalability and fault tolerance, allowing Debezium to handle large volumes of change data without data loss or downtime. Kafka’s distributed nature ensures high availability and reliability.

Database Independence: Debezium abstracts the underlying database details and presents a unified change event format. This enables organizations to decouple their systems from specific databases, making it easier to switch databases or adopt a polyglot architecture.

Use Cases for Debezium:

  1. Microservices Architecture: Debezium plays a crucial role in event-driven microservices architectures, where each microservice can react to specific changes in the data. By consuming the change events, services can update their local view of data or trigger further actions.
  2. Data Synchronization: Debezium can be used to keep multiple databases in sync by replicating changes from one database to another in real-time. This is especially useful in scenarios where data needs to be replicated across geographically distributed systems or in cases where different databases serve specific purposes within an organization.
  3. Stream Processing and Analytics: Debezium’s real-time change data capture capabilities make it an excellent choice for streaming data processing and analytics. By consuming the change events from Debezium, organizations can perform real-time analysis, monitoring, and aggregations on the data. This can be particularly beneficial for applications such as fraud detection, real-time dashboards, and personalized recommendations.
  4. Data Warehousing and ETL (Extract, Transform, Load): Debezium can play a vital role in populating data warehouses or data lakes by capturing and transforming the change events into the desired format. It eliminates the need for batch processing or periodic data extraction, enabling near real-time data updates in analytical systems.
  5. Data Integration and Replication: Debezium simplifies data integration by providing a reliable and efficient way to replicate data changes across different systems. It allows organizations to easily integrate and synchronize data between legacy systems, modern applications, and cloud-based services. This is particularly valuable in scenarios involving hybrid cloud architectures or when migrating from one database platform to another.
  6. Audit Trail and Compliance: Debezium’s ability to capture every data manipulation operation in a database’s log makes it an ideal solution for generating an audit trail. Organizations can use Debezium to track and record all changes made to critical data, ensuring compliance with regulations and providing a reliable historical record of data modifications.

Streaming Data with PostgreSQL + Kafka + Debezium:

To streamline the setup process, we will leverage Docker and Docker Compose for configuring Postgres, Kafka, and Debezium.

Prerequisites:

  • Docker
  • Docker Compose

Getting Started: Cloning the Repository

To begin, clone the repository containing the Docker Compose and connector configuration files that we will be using throughout this post:

git clone https://github.com/sunil9837/Debezium-Setup.git

After the clone completes, navigate into the repository directory with the command “cd Debezium-Setup”.

Setting up the Required Containers with Docker Compose:

Using the Docker Compose file, we will now bring up all the essential containers for our setup: ZooKeeper, Kafka, Postgres, and Kafka Connect (which hosts the Debezium connectors). This will enable us to proceed with our configuration seamlessly.
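
For reference, a Compose file for this stack typically looks like the sketch below. This is an illustrative sketch, not the repository’s exact file: image tags, credentials, and topic names are assumptions, and the debezium/example-postgres image is chosen here because it ships with wal_level=logical preconfigured, which Debezium requires. Note that the container names used in later commands (ubuntu_db_1, ubuntu_kafka_1) follow Compose’s <project>_<service>_<index> naming, so yours may differ depending on the directory you run from.

version: '2'
services:
  zookeeper:
    image: debezium/zookeeper:1.9
    ports:
      - "2181:2181"
  kafka:
    image: debezium/kafka:1.9
    ports:
      - "9092:9092"
    environment:
      - ZOOKEEPER_CONNECT=zookeeper:2181
  db:
    image: debezium/example-postgres:1.9
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
  connect:
    image: debezium/connect:1.9
    ports:
      - "8083:8083"
    environment:
      - BOOTSTRAP_SERVERS=kafka:9092
      - GROUP_ID=1
      - CONFIG_STORAGE_TOPIC=my_connect_configs
      - OFFSET_STORAGE_TOPIC=my_connect_offsets
      - STATUS_STORAGE_TOPIC=my_connect_statuses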

docker-compose up -d
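
The containers start in the background; you can confirm they are all up with:

docker-compose ps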

Streaming Events to PostgreSQL:

  • Let’s create a table to test event streaming.
  • Log in to your Postgres container using
docker exec -it ubuntu_db_1 bash
  • Then connect to the postgres database using
psql -U postgres -d postgres
  • Create a table using
CREATE TABLE transaction (name VARCHAR(100), age INTEGER);
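
One caveat about the table above: it has no primary key. Debezium can still capture INSERTs into a keyless table, but for UPDATE and DELETE events you would typically add a primary key or set the table’s replica identity to FULL. For example (illustrative alternatives, not part of the repository’s setup):

-- Option 1: define the table with a primary key instead
CREATE TABLE transaction (id SERIAL PRIMARY KEY, name VARCHAR(100), age INTEGER);
-- Option 2: keep the keyless table but give it a full replica identity
ALTER TABLE transaction REPLICA IDENTITY FULL;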

Activate Debezium:

We’re ready to activate Debezium!

  • We communicate with Debezium by making HTTP requests to the Kafka Connect REST API, which listens on port 8083.
  • We need to make a POST request whose body is a connector configuration in JSON format. This JSON defines the parameters of the connector we are creating.
  • We will use the debezium.json file from the repository we just cloned (a sketch of a typical configuration is shown after these steps).
  • Then we use cURL to send it to Debezium:
curl -i -X POST \
         -H "Accept:application/json" \
         -H "Content-Type:application/json" \
         127.0.0.1:8083/connectors/ \
         --data "@debezium.json"
  • A successful POST request returns an HTTP 201 Created response that echoes the new connector’s configuration.
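
The exact contents of debezium.json live in the repository; a typical Postgres connector configuration looks roughly like the sketch below. The connector name, credentials, and hostname here are assumptions (the hostname would be the Compose service name of the database), and the server name emp is inferred from the emp.public.transaction topic consumed later. Newer Debezium versions (2.x) use topic.prefix in place of database.server.name.

{
  "name": "emp-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "database.server.name": "emp"
  }
}

Once registered, you can list the connectors with curl 127.0.0.1:8083/connectors/ to confirm the new connector appears.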

Test Kafka Streaming Setup:

  • Now we are streaming! After inserting, updating, or deleting a record, we will see the change as a new message in the Kafka topic associated with the table.
  • Kafka Connect creates one topic per captured SQL table. To verify that this is working correctly, we’ll need to monitor the Kafka topic.
  • Kafka ships with some shell scripts that help you poke around your Kafka configuration. They are handy when you want to test your configuration and are conveniently included in the Docker image we are using.
  • The first one we’ll use lists all of the topics in your Kafka cluster.
  • Let’s run it and verify that we see a topic for our `transaction` table.
docker exec -it \
  $(docker ps | grep ubuntu_kafka_1 | awk '{ print $1 }') \
  /kafka/bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 --list
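
If everything is wired up correctly, the output includes a topic named after the server, schema, and table, alongside Kafka Connect’s internal topics (whose names depend on the Compose configuration). For example:

emp.public.transaction
my_connect_configs
my_connect_offsets
my_connect_statuses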

Real-Time Topic Monitoring with Console Consumer:

  • Now we can use another tool called the console consumer to watch the topic in real time. It’s called a “console consumer” because it is a type of Kafka consumer: a utility that consumes messages from a topic and does something with them.
  • A consumer can do anything with the data it ingests; the console consumer simply prints it to the console.
docker exec -it \
  $(docker ps | grep ubuntu_kafka_1 | awk '{ print $1 }') \
  /kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic emp.public.transaction

By default, the console consumer only shows messages it hasn’t already consumed. If you want to see every message in a topic from the beginning, add --from-beginning to the command, as shown below.
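
For example:

docker exec -it \
  $(docker ps | grep ubuntu_kafka_1 | awk '{ print $1 }') \
  /kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic emp.public.transaction \
    --from-beginning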

Testing:

  • Now that our consumer is watching the topic for new messages, let’s run an INSERT and watch for output.
  • Insert a row into the table we created in the “Streaming Events to PostgreSQL” step:
INSERT INTO transaction (name, age) VALUES ('Opstree', 30);

Back on our Kafka consumer, a new change event message appears for the inserted row. Along with metadata about the source database and transaction, the event’s payload carries the name and age fields of the record we inserted, in the change-event format shown earlier.
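
To see the other event types, you can follow up with an UPDATE and a DELETE; each produces its own message, with op set to “u” or “d” respectively (for the keyless table created earlier, note the replica-identity caveat above):

UPDATE transaction SET age = 31 WHERE name = 'Opstree';
DELETE FROM transaction WHERE name = 'Opstree';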

Congratulations! We have set up Postgres to stream its data changes to a Kafka cluster.

Conclusion:

Debezium is a powerful open-source platform that revolutionizes real-time change data capture. With its log-based CDC approach and seamless integration with Apache Kafka, Debezium provides organizations with the ability to capture and stream data changes in real time, enabling them to build responsive, event-driven systems. From microservices architectures to data synchronization, stream processing, and data integration, Debezium offers a wide range of use cases that empower businesses to leverage their data effectively and make informed decisions in a rapidly changing digital landscape.

Reference:

https://debezium.io/documentation/reference/stable/tutorial.html

https://debezium.io/documentation/reference/stable/architecture.html

https://www.infoq.com/presentations/data-streaming-kafka-debezium/

Blog Pundits: Deepak Gupta, Naveen Verma and Sandeep Rawat

OpsTree is an End-to-End DevOps Solution Provider.


Author: Sunil Kumar

A DevOps Engineer passionate about bridging the gap between development and operations. Join me on this exhilarating DevOps journey as we unlock the true potential of software development by embracing automation, collaboration, and continuous improvement. Together, we can shape a future where software delivery becomes an art of seamless integration and unparalleled efficiency.
