How to Stream Real-Time Playback Events to the Browser with Kafka and Flask

End-to-End Data Pipeline for Real-Time Stock Market Data!

Transform your data landscape with powerful and flexible data pipelines. Learn the data engineering strategies needed to effectively manage, process, and derive insights from comprehensive datasets. Creating robust, scalable, and fault-tolerant data pipelines is a complex task that requires multiple tools and techniques.

Unlock the skills to build real-time stock market data pipelines using Apache Kafka. Follow a detailed step-by-step guide, from setting up Kafka on AWS EC2 to connecting it to AWS Glue and Athena for intuitive data processing and insightful analytics.

Stream and Analyze PostgreSQL Data from S3 Using Kafka and ksqlDB: Part 2

Introduction

In Part 1, we set up a real-time data pipeline that streams PostgreSQL changes to Amazon S3 using Kafka Connect. Here’s what we accomplished:

  • Configured PostgreSQL for CDC (using logical decoding/WAL)
  • Deployed Kafka Connect with JDBC Source Connector (to capture PostgreSQL changes)
  • Set up an S3 Sink Connector (to persist data in S3 in Avro/Parquet format)

In Part 2 of our journey, we dive deeper into the process of streaming data from PostgreSQL to S3 via Kafka. This time, we explore how to set up connectors, create a sample PostgreSQL table with large datasets, and leverage ksqlDB for real-time data analysis. Additionally, we’ll cover the steps to configure AWS IAM policies for secure S3 access. Whether you’re building a data pipeline or experimenting with Kafka integrations, this guide will help you navigate the essentials with ease.

Kafka on Kubernetes Made Easy with Strimzi (Step-by-Step Deployment)

Deploying Kafka on Kubernetes with Strimzi has quickly become the industry-standard approach for building scalable and production-ready data streaming platforms. As modern applications generate massive volumes of real-time data, organizations need a reliable, flexible, and automated way to manage Kafka clusters. Kubernetes provides the perfect foundation with its built-in orchestration, scaling, and self-healing capabilities, making it ideal for running stateful distributed systems like Kafka.

What is Event Streaming?

Event streams act as the digital nervous system of a business, capturing real-time data from multiple sources for further processing and analysis.

Kafka within EFK Monitoring

Introduction

Today’s world is completely internet-driven. Whether it is shopping, banking, or entertainment, almost everything is available with a single click.

From a DevOps perspective, modern e-commerce and enterprise applications are usually built using a microservices architecture. Instead of running one large monolithic application, the system is divided into smaller, independent services. This approach improves scalability, manageability, and operational efficiency.

However, managing a distributed system also increases complexity. One of the most critical requirements for maintaining microservices is effective monitoring and log management.

A commonly used monitoring and logging stack is the EFK stack, which includes Elasticsearch, Fluentd, and Kibana. In many production environments, Kafka is also introduced into this stack to handle log ingestion more reliably.

Kafka is an open-source event streaming platform and is widely used across organizations for handling high-throughput data streams.

This naturally raises an important question.

Why should Kafka be used along with the EFK stack?

In this blog, we will explore why Kafka is introduced, what benefits it brings, and how it integrates with the EFK stack.

Let’s get started.

Why Kafka Is Needed in the EFK Stack

While traveling, we often see crossroads controlled by traffic lights or traffic police. At a junction where traffic flows from multiple directions, these controls ensure smooth movement by allowing traffic from one direction while holding others temporarily.

In technical terms, traffic is regulated by buffering and controlled flow.

Kafka plays a very similar role in log management.

Imagine hundreds of applications sending logs directly to Elasticsearch. During peak traffic, Elasticsearch may become overwhelmed. Scaling Elasticsearch during heavy ingestion is not always a good solution because frequent scaling and re-sharding can cause instability.

Kafka solves this problem by acting as a buffer layer. Instead of pushing logs directly to Elasticsearch, logs are first sent to Kafka. Kafka then delivers them in controlled, manageable batches to Elasticsearch.

High-Level Architecture Overview

The complete flow consists of the following blocks.

  • Application containers or instances

  • Kafka

  • Fluentd forwarder

  • Elasticsearch

  • Kibana

Each block is explained below with configurations.

Block 1: Application Logs and td-agent Configuration

This block represents application containers or EC2 instances where logs are generated. The td-agent service runs alongside the application to collect logs and forward them to Kafka.

td-agent is a stable distribution of Fluentd packaged and maintained by Treasure Data; Fluentd itself is a Cloud Native Computing Foundation project. It is a data collection daemon that gathers logs from various sources and forwards them to destinations such as Kafka or Elasticsearch.

td-agent Configuration

Use the following configuration inside the td-agent configuration file, typically located at /etc/td-agent/td-agent.conf.

Source Configuration

<source>
  @type tail
  read_from_head true
  path <path_of_log_file>
  tag <tag_name>
  format json
  keep_time_key true
  time_format <time_format_of_logs>
  pos_file <pos_file_location>
</source>

The source block defines how logs are collected.

  • path specifies the log file location

  • tag is a user-defined identifier for logs

  • format defines the log format such as json or text

  • keep_time_key preserves the original timestamp

  • time_format defines the timestamp pattern

  • pos_file tracks the read position of logs
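
For illustration, a filled-in version of the source block might look like the following. All values here are hypothetical examples rather than settings from the original setup; replace the paths, tag, and time format with your own.

<source>
  @type tail
  read_from_head true
  # hypothetical example values for an application writing JSON logs
  path /var/log/myapp/app.log
  tag myapp.access
  format json
  keep_time_key true
  time_format %Y-%m-%dT%H:%M:%S%z
  pos_file /var/log/td-agent/myapp.access.pos
</source>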

Match Configuration to Kafka

<match <tag_name>>
  @type kafka_buffered
  output_include_tag true
  brokers <kafka_hostname:port>
  default_topic <kafka_topic_name>
  output_data_type json
  buffer_type file
  buffer_path <buffer_path_location>
  buffer_chunk_limit 10m
  buffer_queue_limit 256
  buffer_queue_full_action drop_oldest_chunk
</match>

The match block defines where logs are sent.

  • kafka_buffered ensures reliable delivery

  • brokers defines Kafka host and port

  • default_topic is the Kafka topic for logs

  • buffer settings control local buffering and backpressure
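
After updating the configuration, restart td-agent and watch its own log to confirm that the tail input and the Kafka output start without errors. The commands below assume the standard td-agent package layout on a systemd-based Linux distribution.

# restart the agent so the new source and match blocks take effect
sudo systemctl restart td-agent

# td-agent reports configuration and connection errors in its own log file
tail -f /var/log/td-agent/td-agent.log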

Block 2: Kafka Setup

Kafka acts as the central buffering and streaming layer.

Kafka uses Zookeeper for cluster coordination, including broker metadata management and controller election. In production setups, Zookeeper is usually deployed separately from the Kafka brokers.

Download Kafka

wget http://mirror.fibergrid.in/apache/kafka/0.10.2.0/kafka_2.12-0.10.2.0.tgz

Extract the Package

tar -xzf kafka_2.12-0.10.2.0.tgz

Starting Zookeeper

Zookeeper must be started before Kafka.

Update JVM heap size in the shell profile.

vi .bashrc
export KAFKA_HEAP_OPTS="-Xmx500M -Xms500M"

The heap size should be approximately 50 percent of the available system memory.

Reload the configuration.

source .bashrc

Start Zookeeper in the background.

cd kafka_2.12-0.10.2.0
nohup bin/zookeeper-server-start.sh config/zookeeper.properties > ~/zookeeper-logs &

Starting Kafka

cd kafka_2.12-0.10.2.0
nohup bin/kafka-server-start.sh config/server.properties > ~/kafka-logs &
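
With Zookeeper and Kafka running, you can create the topic that td-agent writes to and use the console consumer to confirm that log messages are arriving. This is a sketch for the 0.10.x release downloaded above, where topic creation still goes through Zookeeper; the topic name is the placeholder used in the td-agent match block.

# create the log topic (single broker, single partition shown for simplicity)
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic <kafka_topic_name>

# read messages from the beginning to verify that td-agent is producing logs
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <kafka_topic_name> --from-beginning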

Stopping Services

bin/kafka-server-stop.sh
bin/zookeeper-server-stop.sh

For advanced configurations, always refer to the official Kafka documentation.

Block 3: td-agent as Kafka Consumer and Elasticsearch Forwarder

At this stage, logs are available in Kafka topics. The next step is to pull logs from Kafka and send them to Elasticsearch.

Here, td-agent is configured as a Kafka consumer and forwarder.

Kafka Source Configuration

<source>
  @type kafka_group
  brokers <kafka_dns:port>
  consumer_group <consumer_group_kafka>
  topics <kafka_topic_name>
</source>

Key points about this configuration.

  • consumer_group ensures that consumption is distributed across all forwarder instances in the same group

  • each log record is consumed by only one consumer within the group

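To confirm that the forwarder is actually consuming, you can describe the consumer group and check its lag. This sketch assumes the Kafka 0.10.x CLI from Block 2 and the placeholder group name from the source block above; on that release the new-consumer flag is required, while newer Kafka versions drop it.

# show partition assignments and lag for the td-agent consumer group
bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server <kafka_dns:port> --describe --group <consumer_group_kafka>
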
Match Configuration to Elasticsearch

<match <kafka_topic_name>>
  @type forest
  subtype elasticsearch
  <template>
    host <elasticsearch_ip>
    port <elasticsearch_port>
    user <es_username>
    password <es_password>
    logstash_prefix <index_prefix>
    logstash_format true
    include_tag_key true
    tag_key tag_name
  </template>
</match>

Key concepts used here.

  • forest dynamically creates output instances per tag

  • logstash_prefix defines index naming in Elasticsearch

  • logs become visible in Kibana using this index
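
A quick way to confirm that the forwarder is writing to Elasticsearch is to list the indices and check for names starting with the configured prefix. The host, port, and credentials below are the placeholders from the template above; with logstash_format enabled, index names carry a date suffix.

# list all indices; expect entries named <index_prefix>-YYYY.MM.DD
curl -u <es_username>:<es_password> http://<elasticsearch_ip>:<elasticsearch_port>/_cat/indices?v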

Block 4: Elasticsearch Setup

Elasticsearch acts as the storage and indexing layer.

Follow the official Elasticsearch documentation to install and configure Elasticsearch on Ubuntu or your preferred operating system.
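
Once Elasticsearch is installed, a basic health check confirms that the cluster is reachable before wiring in the forwarder. The example below assumes a local installation on the default port; add credentials if security is enabled.

# a green or yellow status means the cluster is up and able to index logs
curl http://localhost:9200/_cluster/health?pretty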

Block 5: Kibana Setup

Kibana provides visualization and search capabilities on top of Elasticsearch.

Install Kibana using the official documentation.

You can configure Nginx to expose Kibana on port 80 or 443 for easier access.
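
As a rough sketch of the Nginx approach mentioned above, a minimal reverse-proxy server block could look like the following. The server name is a hypothetical example, and Kibana is assumed to be listening on its default port 5601.

server {
    listen 80;
    # hypothetical hostname for the Kibana endpoint
    server_name kibana.example.com;

    location / {
        # forward all requests to the local Kibana instance
        proxy_pass http://127.0.0.1:5601;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}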

Final Architecture Summary

With this setup, the complete EFK stack is integrated with Kafka.

  • Applications send logs to td-agent

  • td-agent pushes logs to Kafka

  • Kafka buffers and streams logs

  • td-agent forwarder consumes logs

  • Elasticsearch stores logs

  • Kibana visualizes logs

The same architecture can be used in standalone environments for learning or across multiple servers in production.