Apache Kafka - Episode No 1


Introducing Apache Kafka

1. Introduction 

Apache Kafka is a message broker, but it goes way beyond that. It offers:
  • Persistent message storage (scalable, highly available & fault-tolerant storage makes it a better message broker)
  • Publish & subscribe to streams of records 
  • Processing streams of records as they arrive
So Kafka is a Message Broker + Store + Streams API + Stream Connectors (input & output).

It is a distributed system.


2. Components


Kafka provides four core APIs:
  • Consumer API: to consume streams of records
  • Producer API: to produce streams of records
  • Streams API: to consume streams, process them, and produce the processed streams
  • Connector API: to connect Kafka to existing systems in order to consume or produce streams
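The producer/consumer idea above can be illustrated with a toy in-memory sketch. This is a plain Python stand-in for intuition only, not the real Kafka client API: the broker keeps an append-only log per topic, producers append records, and consumers read from a chosen position.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory stand-in for a Kafka broker (illustration only)."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only list of records

    def produce(self, topic, record):
        self.topics[topic].append(record)   # append to the topic's log
        return len(self.topics[topic]) - 1  # position (offset) of the new record

    def consume(self, topic, offset=0):
        return self.topics[topic][offset:]  # read everything from a given position

broker = MiniBroker()
broker.produce("clicks", "user1:/home")
broker.produce("clicks", "user2:/cart")
print(broker.consume("clicks"))  # -> ['user1:/home', 'user2:/cart']
```

Note that consuming does not delete anything: the log stays put, which is the "Store" part of Kafka mentioned above.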


3. Use Cases of Apache Kafka


That brings up the question: where would you use such a system? Read the above few lines again and you will get the answer:
  • Real-time streaming applications that process incoming data 
  • Acting as a connector between two systems 
The point to understand here, especially for use case #2 above (connector between systems), is that it is not a fire-and-forget connector. It transfers the data, but at the same time it persists the data that was transferred, so you can fetch that data from Kafka in the future if required. This storage comes with built-in scalability, high availability, fault tolerance, and so on, which comes to the rescue in case of a disaster. That makes Apache Kafka a Connector++ system.

Going further, you can choose to process the data while it is being transferred. And as a cherry on top, Kafka provides ready-made connectors for many existing systems :) 


4. Kafka Clusters 


Now, coming to the deployment part: 
Apache Kafka runs as a cluster consisting of multiple brokers. A broker is nothing but a server/node that Kafka is running on. A cluster can span multiple datacenters.
Communication between clients and Kafka brokers happens over the TCP protocol.


5. Kafka Storage Structure 



Streams of records are stored against a topic, and each topic has multiple partitions. A partition, in turn, holds multiple records. Each record within a partition has a sequential id called an offset. So a producer appends records to a given partition, while a consumer reads records from a given partition by offset.
A consumer can fast-forward or reset to an older offset.
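The offset mechanics can be sketched as a toy model (not the real consumer API): the consumer tracks its own position in a partition and is free to seek backwards or forwards.

```python
class MiniConsumer:
    """Toy consumer that tracks its own offset into one partition (illustration only)."""
    def __init__(self, partition):
        self.partition = partition  # list of records; list index == offset
        self.offset = 0

    def poll(self):
        records = self.partition[self.offset:]
        self.offset = len(self.partition)  # advance past everything just read
        return records

    def seek(self, offset):
        self.offset = offset  # fast-forward or rewind to any offset

consumer = MiniConsumer(["r0", "r1", "r2"])
print(consumer.poll())  # -> ['r0', 'r1', 'r2']
consumer.seek(1)        # rewind to offset 1
print(consumer.poll())  # -> ['r1', 'r2']
```

Because the records stay in the log, rewinding simply means re-reading from an earlier offset; nothing needs to be "un-deleted".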

This offers parallelism: you can have multiple consumers. Of course, whether that works for you depends on other things, such as whether the order of consumption is important or exactly-once delivery is required. You can also group consumers together, which helps with parallel processing.
You can also dedicate a particular partition to a particular consumer group.

The producer decides which partition a record will be written to.
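A common strategy for picking the partition is hashing the record key, so that all records with the same key land in the same partition and stay ordered relative to each other. Here is a minimal sketch of the idea (Kafka's default Java producer actually uses murmur2 hashing; MD5 here is just a stand-in):

```python
import hashlib

def pick_partition(key: str, num_partitions: int) -> int:
    """Hash-based partitioning: identical keys always map to the same partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records keyed by the same user id always go to the same partition,
# which preserves per-key ordering.
print(pick_partition("user-42", 6) == pick_partition("user-42", 6))  # -> True
```

A producer may also specify a partition explicitly, or spread keyless records across partitions for balance.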

A topic can be subscribed to by multiple consumers. As far as the durability of data in Kafka goes, you can configure it as per your requirements.


6. Kafka: Lead or Follow! 


Each partition is replicated across multiple servers in the cluster, so that your data remains available even when one or more nodes go down. 

Now, the obvious question is: if a partition is available on multiple servers in the cluster, which server does a client's fetch request get served from?  

For this, Apache Kafka has the concept of leaders and followers. For each partition there is one leader and multiple followers. The data is available on both the leader and the followers, but requests are forwarded to the leader and the data is served from there.
If the leader fails, one of the followers becomes the new leader. This also means a server can be the leader for some partitions while acting as a follower for the rest.
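The failover idea can be sketched as a toy election. Real Kafka elects a new leader from the partition's in-sync replicas via its controller; this only captures the gist:

```python
def elect_leader(replicas, failed):
    """Promote the first surviving replica to leader (toy election, not Kafka's
    actual controller logic)."""
    survivors = [r for r in replicas if r not in failed]
    if not survivors:
        raise RuntimeError("partition unavailable: all replicas are down")
    return survivors[0]

# broker-1 is the current leader for this partition; broker-2 and broker-3 follow.
replicas = ["broker-1", "broker-2", "broker-3"]
print(elect_leader(replicas, failed={"broker-1"}))  # -> broker-2
```

Since leadership is assigned per partition, losing one broker only triggers elections for the partitions it was leading, and the surviving brokers pick up that load.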

7. What's Next 


That's it for now. Keep watching this space; in the next post, we will see how to install Apache Kafka and run some jobs.


