An introduction to Kafka

Megha Raizada
3 min read · Feb 24, 2021

Kafka is a distributed, partitioned and replicated streaming platform that provides a commit log service. It acts as a messaging system between message producers and message consumers.

Common terminologies:

  1. Topic: Kafka organizes feeds of messages into categories called topics.
  2. Producers: Processes that publish messages to the Kafka messaging system are called producers.
  3. Consumers: Processes that subscribe to topics and consume the feeds of messages are called consumers.

In the Kafka messaging system, producers send messages to the Kafka cluster and consumers receive these messages from it. Clients and servers communicate over TCP. A Java client is available for Kafka, and clients are also available in other programming languages such as Python and Ruby.

Kafka messaging system.

Topics

A topic is a category or feed name to which messages are published. For each topic, Kafka maintains a partitioned log. Each partition is an ordered sequence of messages that is continuously appended to, forming a commit log. Each message is given a sequential id called the offset, which uniquely identifies that message within its partition.
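To make the idea of an offset concrete, here is a minimal sketch of a single partition as an append-only list. The `PartitionLog` class and its methods are hypothetical names used only for illustration; this is a toy in-memory model, not Kafka's storage format or client API.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy in-memory stand-in for one partition's append-only log."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        # New messages always go to the end; the offset is the position in the log.
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int) -> str:
        # An offset identifies exactly one message within this partition.
        return self.messages[offset]


log = PartitionLog()
first = log.append("user-signed-up")
second = log.append("user-logged-in")
print(first, second)     # offsets are handed out sequentially
print(log.read(second))  # look a message up by its offset
```

Because messages are only ever appended, an offset handed out once keeps pointing at the same message for as long as that message is retained.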

Messages are retained for a configurable period of time, whether or not they have been consumed. After that period they are discarded to free up space. A partitioned log looks like the following:

Partitions in the log are very important and they serve the following purposes:

  1. They act as the unit of parallelism.
  2. They allow the log to grow beyond the size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can hold an arbitrary amount of data.

Distribution of partitions over servers:

The partitions are distributed over the servers in a Kafka cluster, and each partition is replicated across several servers for fault tolerance. Each server handles data and requests for its share of the partitions.

Every partition has one main server called the leader and zero or more alternate servers called followers. The leader handles all read and write requests for the partition, while the followers passively replicate the leader. A server can therefore be the leader for some partitions and a follower for others.
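The leader and follower roles can be sketched with a simple placement function. The `assign_replicas` helper and the broker names below are assumptions made for illustration; the placement rule (consecutive brokers, first replica is the leader) is a simplification and not Kafka's actual assignment algorithm.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers; the first one leads."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        replicas = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
        assignment[partition] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment


layout = assign_replicas(3, ["broker-0", "broker-1", "broker-2"], replication_factor=2)
for partition, roles in layout.items():
    print(partition, roles)
```

In this layout `broker-1` is the leader for partition 1 and a follower for partition 0, which is the mixed role described above.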

Producers:

Processes that publish messages to the Kafka messaging system are called producers. They publish data to the topics of their choice, and they also decide which message goes to which partition within a topic. This can be done in a round-robin fashion to balance load across the servers.
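A minimal sketch of how a producer might pick a partition is shown below. `ToyPartitioner` is a hypothetical class, and the key-hashing branch is an added assumption: it uses CRC32 purely as a stand-in hash, which is not necessarily what real Kafka clients use.

```python
import itertools
import zlib


class ToyPartitioner:
    """Round robin for keyless messages, hash of the key otherwise."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key=None):
        if key is None:
            # Spread keyless messages evenly over all partitions.
            return next(self._counter) % self.num_partitions
        # Same key always maps to the same partition (stand-in hash).
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
keyless = [partitioner.partition() for _ in range(4)]
print(keyless)  # cycles through 0, 1, 2 and wraps back to 0
print(partitioner.partition("user-42") == partitioner.partition("user-42"))
```

Round robin balances load, while hashing a key keeps all messages for that key in one partition and therefore in order relative to each other.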

Consumers:

Processes that subscribe to topics and consume the feeds of messages are called consumers. The only metadata that Kafka retains about a consumer is its position in the log, known as the offset. The offset is controlled by the consumer: normally it advances linearly as the consumer processes messages.
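The consumer-controlled offset can be illustrated with a toy consumer reading from a plain list. `ToyConsumer`, `poll` and `seek` here are hypothetical names for this sketch and do not represent a real Kafka client.

```python
class ToyConsumer:
    """Reads a list-backed log and tracks its own position."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # position of the next message to read

    def poll(self):
        if self.offset >= len(self.log):
            return None  # nothing new to read yet
        message = self.log[self.offset]
        self.offset += 1  # the offset advances as messages are processed
        return message

    def seek(self, offset):
        # The consumer owns its offset, so it may move it, e.g. to re-read.
        self.offset = offset


consumer = ToyConsumer(["m0", "m1", "m2"])
consumer.poll()
consumer.poll()
print(consumer.offset)  # position after processing two messages
consumer.seek(0)
print(consumer.poll())  # re-reads the first message
```

Since the position lives with the consumer, one consumer rewinding or falling behind has no effect on the log itself or on other consumers.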

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers reads from a server and each message goes to one of them. In publish-subscribe, each message is broadcast to all consumers. Kafka offers a single abstraction that generalizes both models: the consumer group. Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
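The delivery rule for consumer groups can be sketched as follows. The `deliver` function and the group and consumer names are hypothetical, and rotating through group members per message is a simplification chosen for this sketch; it is not how Kafka actually balances work inside a group.

```python
from collections import defaultdict


def deliver(messages, groups):
    """Give every message to exactly one member of each group.

    groups maps a group name to a list of consumer names.
    Returns a dict of consumer name -> messages received.
    """
    inbox = defaultdict(list)
    for i, message in enumerate(messages):
        for members in groups.values():
            # One member per group gets the message (simple rotation).
            inbox[members[i % len(members)]].append(message)
    return dict(inbox)


groups = {
    "billing": ["billing-1", "billing-2"],  # two members share the load (queue-like)
    "audit": ["audit-1"],                   # sole member sees everything (pub-sub-like)
}
received = deliver(["m0", "m1", "m2", "m3"], groups)
for consumer, msgs in received.items():
    print(consumer, msgs)
```

If all consumers share one group, this behaves like a queue; if every consumer has its own group, it behaves like publish-subscribe.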

Use cases of Kafka:

  1. Messaging: Kafka can serve as a replacement for a traditional message broker, offering higher throughput, fault tolerance, and built-in partitioning and replication.
  2. Log aggregation: Log aggregation means collecting log files from several servers and processing them in one place. Kafka abstracts away the files and treats log data as a stream of messages, which allows lower-latency processing and makes it easier to handle data from multiple sources.
  3. Metrics: Kafka is also used in data pipelines for operational monitoring.
