Kafka Basics Flashcards

1
Q

What is the problem with ingesting real-time data directly into a streaming engine?

A

If the engine goes down, or a load spike overwhelms it and it starts dropping messages, we can no longer process the data in real time. We either lose data or the data becomes stale.

2
Q

How can a messaging system like Kafka help with real-time data processing?

A

Placing a messaging system between the source and the real-time processing engine solves both problems. If the engine goes down, the messaging system preserves the messages until it recovers. If the load increases, the messages are queued and nothing is lost.
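
The buffering effect can be sketched with a toy in-memory model. This is not Kafka itself: the function `run` and its arguments are hypothetical names, and a `deque` stands in for the messaging system.

```python
from collections import deque

def run(events, engine_up):
    """Feed events through a buffer; the engine only drains it while it is up.

    events:    messages arriving one per tick
    engine_up: booleans, True when the engine is running at that tick
    """
    buffer = deque()      # stands in for the messaging system
    processed = []
    for message, up in zip(events, engine_up):
        buffer.append(message)          # the source can always publish
        if up:
            while buffer:               # the engine catches up on everything queued
                processed.append(buffer.popleft())
    return processed, list(buffer)

# The engine is down for ticks 2 and 3, then comes back.
processed, pending = run(["m1", "m2", "m3", "m4"], [True, False, False, True])
print(processed)  # every message, in arrival order
print(pending)    # nothing left waiting
```

Without the buffer, m2 and m3 would have arrived while the engine was down and been lost.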

3
Q

What is the solution when many point-to-point connections or data pipelines between systems have turned communication into a mess?

A

Place a messaging system like Kafka in the middle so that all communication goes through it. This simplifies the communication architecture.

Kafka DECOUPLES data pipelines.
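
A quick count shows why this helps. The sketch assumes the worst case, where every source system feeds every target system; the function names are hypothetical.

```python
def direct_links(num_sources, num_targets):
    """Point-to-point: every source connects to every target (worst case)."""
    return num_sources * num_targets

def brokered_links(num_sources, num_targets):
    """With a messaging system in the middle, each system connects only to it."""
    return num_sources + num_targets

print(direct_links(4, 6))    # 24 separate pipelines
print(brokered_links(4, 6))  # 10 connections to the messaging system
```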

4
Q

What is a big advantage of Kafka in terms of producers and consumers?

A

It decouples producers from consumers, and therefore decouples the data pipelines.

5
Q

What is Kafka in terms of messaging?

A

It is a distributed publish-subscribe messaging system, originally developed at LinkedIn. Kafka is fast, horizontally scalable, reliable, fault-tolerant, durable, and distributed by design.

6
Q

What is the difference between distributed queuing and publish-subscribe systems?

A

In a queuing system, a message is removed from the queue once it is read, so only one consumer can read it. In a publish-subscribe system, every consumer group that is interested in the message receives it, just as everyone subscribed to a mailing list gets each mail.
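
The two delivery models can be contrasted with a small simulation. Both functions are hypothetical toy models, and handing queue messages out in turn is an assumption made for illustration.

```python
from collections import deque

def deliver_queue(messages, consumers):
    """Queue semantics: a message is removed once read, so one consumer gets it."""
    queue = deque(messages)
    received = {c: [] for c in consumers}
    i = 0
    while queue:
        consumer = consumers[i % len(consumers)]   # hand messages out in turn
        received[consumer].append(queue.popleft())
        i += 1
    return received

def deliver_pubsub(messages, groups):
    """Publish-subscribe semantics: every subscribed group sees every message."""
    return {g: list(messages) for g in groups}

msgs = ["a", "b", "c", "d"]
queue_result = deliver_queue(msgs, ["c1", "c2"])
pubsub_result = deliver_pubsub(msgs, ["billing", "analytics"])
print(queue_result)   # the messages are split between c1 and c2
print(pubsub_result)  # both groups receive all four messages
```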

7
Q

Is Kafka push-based or pull-based?

A

Kafka is a pull-based messaging system. Consumers pull messages at whatever rate they can support. So if a message stream has two subscribers, one consumer can be faster and further ahead while the other is slower and behind.
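
A minimal sketch of the pull model, using hypothetical `Log` and `PullConsumer` classes rather than a real Kafka client. Each consumer asks for data and keeps its own position, so a slow consumer never holds back a fast one.

```python
class Log:
    """Toy stand-in for one partition: an append-only list of messages."""
    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)

    def fetch(self, offset, max_messages):
        """Return up to max_messages starting at offset; nothing is pushed."""
        return self.messages[offset:offset + max_messages]

class PullConsumer:
    def __init__(self, log, max_messages):
        self.log = log
        self.max_messages = max_messages  # how much this consumer handles per poll
        self.offset = 0                   # each consumer tracks its own position

    def poll(self):
        batch = self.log.fetch(self.offset, self.max_messages)
        self.offset += len(batch)
        return batch

log = Log()
for i in range(10):
    log.append(f"msg-{i}")

fast = PullConsumer(log, max_messages=5)
slow = PullConsumer(log, max_messages=1)
for _ in range(2):          # both consumers poll twice
    fast.poll()
    slow.poll()
print(fast.offset, slow.offset)  # the fast consumer is further ahead
```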

8
Q

Which is better in Kafka: reading a single message at a time or a batch of messages at a time?

A

Reading a batch of messages at a time is better. It reduces I/O and gives much better performance, often several times better than reading one message at a time.
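
The I/O saving can be illustrated by counting fetch requests. This is only a request count under the assumption that each fetch costs one round trip; it is not a benchmark, and `count_fetches` is a hypothetical helper.

```python
def count_fetches(total_messages, batch_size):
    """Number of fetch requests needed to read total_messages, batch_size at a time."""
    # Ceiling division: a final partial batch still costs one request.
    return -(-total_messages // batch_size)

single = count_fetches(1000, 1)     # one request per message
batched = count_fetches(1000, 100)  # 100 messages per request
print(single, batched)
```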

9
Q

What is the default retention duration for data stored on disk?

A

The default is 7 days (log.retention.hours=168). After that the data becomes eligible for deletion. The retention period is configurable, and in practice it is limited only by the disk space you have.
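
Time-based retention can be sketched as a filter on message timestamps. The sketch assumes a 7-day window (Kafka's usual default) and drops messages one by one; a real broker deletes whole log segments instead. `apply_retention` is a hypothetical name.

```python
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

def apply_retention(messages, now_ms, retention_ms=SEVEN_DAYS_MS):
    """Keep only messages whose timestamp is within retention_ms of now_ms.

    messages: list of (timestamp_ms, value) tuples, oldest first.
    """
    cutoff = now_ms - retention_ms
    return [(ts, value) for ts, value in messages if ts >= cutoff]

DAY_MS = 24 * 60 * 60 * 1000
messages = [(0, "old"), (2 * DAY_MS, "mid"), (8 * DAY_MS, "new")]
kept = apply_retention(messages, now_ms=9 * DAY_MS)
print([value for _, value in kept])  # the 9-day-old message is gone
```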

10
Q

What are the main Kafka terms?

A

1) Producer
2) Message
3) Consumer
4) Topic
5) Partition
6) ZooKeeper
7) Broker

11
Q

What is a Producer?

A

A producer is any application that publishes messages to a topic.

12
Q

What is a Consumer?

A

A consumer is any application that subscribes to a topic and consumes its messages.

13
Q

What is a Topic?

A

Logically, a topic is a feed name to which records are published. It's like labelling traffic in an MPLS system: different producers may publish messages under different labels, and consumers can subscribe only to the labels/topics they are interested in. Physically, there is no single thing called a topic; a topic is an amalgamation of partitions.

14
Q

What is a Broker?

A

A Kafka cluster is a set of servers, each of which is called a broker. Brokers are the real processing heart of Kafka.

15
Q

What is a partition?

A

Topics are broken into “ordered” commit logs called partitions.

16
Q

Why is a partition called an ordered commit log?

A

Because a partition is an append-only structure: you cannot change what has already been written, which makes it similar to a commit log. It is ordered because, in append mode, messages are read in the order in which they were written.
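
A toy append-only log makes the two properties concrete. `CommitLog` is a hypothetical class for illustration: it offers append and read, but no way to update or delete an entry.

```python
class CommitLog:
    """Toy append-only log: entries can be added and read, never changed."""
    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1    # position of the new entry

    def read_from(self, position=0):
        # Return a tuple so callers cannot mutate the log through it.
        return tuple(self._entries[position:])

commit_log = CommitLog()
for event in ["created", "paid", "shipped"]:
    commit_log.append(event)
print(commit_log.read_from())  # entries come back in the order they were written
```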

17
Q

What is ZooKeeper?

A

It is used for managing and coordinating Kafka brokers. It is an important piece of the system.

18
Q

Must each server run only one broker?

A

No, it is not mandatory, but it is recommended. If a server hosting multiple brokers goes down, all of those brokers go down with it.

19
Q

Does a producer connect to ZooKeeper or to a broker?

A

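The producer connects to brokers. It first contacts one or more bootstrap brokers (bootstrap.servers) to fetch cluster metadata, then sends messages directly to the broker that leads each target partition. Current producer clients do not connect to ZooKeeper.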

20
Q

What is used to determine which message will go to which partition?

A

A producer writes each message to Kafka as a key-value pair. By default, the key is hashed to decide which partition the message goes to.
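
Key-based partitioning can be sketched as hash-then-modulo. `choose_partition` is a hypothetical helper, and `zlib.crc32` is only a stand-in: Kafka's default partitioner uses a murmur2 hash, so real partition numbers will differ.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing it (crc32 used as a stand-in hash)."""
    return zlib.crc32(key) % num_partitions

p1 = choose_partition(b"user-42", 4)
p2 = choose_partition(b"user-42", 4)
print(p1 == p2)  # the same key always lands on the same partition
```

Because the result depends on the partition count, adding partitions to a topic can change which partition a given key maps to.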

21
Q

How are messages in a partition numbered?

A

Each message in a partition is assigned a unique number called its “offset”. Consumers can read messages based on their offset.

22
Q

What is offset?

A

Each message written to a partition of a topic is assigned a unique, incrementing number called an offset.

23
Q

Is offset globally unique?

A

No, an offset is unique only among the messages of a particular topic partition. Offsets are unique, immutable identifiers for a message within a topic-partition. Each partition has its own sequence of offsets, but a (topic, partition, offset) triple uniquely and persistently identifies a particular message.

24
Q

What uniquely identifies a message in Kafka?

A

A (topic, partition, offset) triple uniquely and persistently identifies a particular message.
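
A small model shows offsets repeating across partitions while the triple stays unique. `Topic` here is a hypothetical toy class, not a Kafka API.

```python
class Topic:
    """Toy topic: a name plus one append-only list per partition."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, value):
        """Append value and return its (topic, partition, offset) coordinates."""
        log = self.partitions[partition]
        log.append(value)
        return (self.name, partition, len(log) - 1)

orders = Topic("orders", 2)
a = orders.append(0, "x")
b = orders.append(1, "y")
c = orders.append(0, "z")
print(a, b, c)  # a and b share offset 0, but the triples are all different
```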

25
Q

Why is Kafka fast even if we write data to the disk?

A

Because Kafka writes are sequential, and sequential writes to disk are much faster than random writes.

26
Q

For how long are messages of a topic preserved?

A

It is configurable.

27
Q

Can a message be read only once and only in one direction?

A

No, messages can be read any number of times. A consumer can also rewind or skip to a particular offset.
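
Rewinding and skipping amount to moving a read position. The sketch below uses a hypothetical `SeekableConsumer` over a plain list; real consumer clients expose a comparable seek operation.

```python
class SeekableConsumer:
    """Toy consumer over a list that stands in for one partition."""
    def __init__(self, partition):
        self.partition = partition
        self.position = 0

    def seek(self, offset):
        self.position = offset       # jump backward (rewind) or forward (skip)

    def poll(self):
        if self.position >= len(self.partition):
            return None              # nothing new to read
        message = self.partition[self.position]
        self.position += 1
        return message

partition = ["m0", "m1", "m2", "m3"]
consumer = SeekableConsumer(partition)
first_pass = [consumer.poll(), consumer.poll()]   # reads m0, then m1
consumer.seek(0)                                  # rewind to the start
replayed = consumer.poll()                        # m0 is read a second time
consumer.seek(3)                                  # skip ahead
skipped_to = consumer.poll()                      # m3; m2 was never read
```

Reading does not remove anything from the partition, which is why the same message can be read again.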

28
Q

Does each partition have an ID?

A

Yes, each partition has an ID, numbered from 0 to (number of partitions - 1). For example, with 4 partitions the IDs are 0 to 3.

29
Q

What will be the replica ID for a partition?

A

The replica ID is the same as the ID of the broker that hosts the replica.

30
Q

At which level does replication happen: broker, topic, partition, or message?

A

Replication happens at partition level.

31
Q

What is the leader of a partition?

A

When a partition is created with a replication factor of n, one of its n replicas is assigned the role of leader. By default, writes and reads for that partition go to its leader.
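
Replica placement and leadership can be sketched as follows. `assign_replicas` is a hypothetical helper using simple round-robin placement, and treating the first replica in each list as the leader is a simplification of Kafka's actual assignment logic.

```python
def assign_replicas(num_partitions, broker_ids, replication_factor):
    """Spread each partition's replicas over distinct brokers, round-robin.

    Returns {partition_id: [broker_id, ...]}; the first broker in each list
    is treated as that partition's leader.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [broker_ids[(p + r) % len(broker_ids)]
                         for r in range(replication_factor)]
    return assignment

assignment = assign_replicas(num_partitions=3, broker_ids=[0, 1, 2], replication_factor=2)
leaders = {p: replicas[0] for p, replicas in assignment.items()}
print(assignment)  # each partition has 2 replicas on different brokers
print(leaders)     # leadership is spread across the brokers
```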

32
Q

What is a message in Kafka?

A

1) A message is the unit of data in Kafka.
2) Messages with the same key are written to the same partition.
3)

33
Q

What happens if the key is null?

A

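If the key is null, the producer spreads messages across the topic's partitions in a load-balanced way so that the partitions are evenly loaded (round-robin in older clients, sticky batching in newer ones). Each message still goes to exactly one partition.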
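
A sketch of the keyless case, assuming plain round-robin. `spread_keyless` is a hypothetical helper; actual client behaviour varies by version.

```python
from collections import Counter
from itertools import cycle, islice

def spread_keyless(num_messages, num_partitions):
    """Assign keyless messages to partitions in round-robin order."""
    return list(islice(cycle(range(num_partitions)), num_messages))

targets = spread_keyless(6, 3)
load = Counter(targets)
print(targets)  # partitions are visited in turn
print(load)     # every partition receives the same number of messages
```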

34
Q

What is the role of the key in Kafka?

A

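The main role of the key is to decide the partition. The key is also stored with the message and delivered to consumers, and log-compacted topics use it to keep only the latest value for each key.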

35
Q

What are the important configuration settings in ZooKeeper?

A

dataDir=path
Where ZooKeeper stores its data, including metadata about the cluster, topics, and so on.
clientPort=port
The port that brokers use to connect to ZooKeeper.

36
Q

The replication factor is restricted by which value?

A

The replication factor is limited by the broker count: you cannot have more replicas of a partition than there are brokers.

37
Q

How do you create a Kafka cluster?

A

All the brokers that connect to the same ZooKeeper become part of the same cluster.

38
Q

What is the controller in a Kafka cluster?

A

In a Kafka cluster, one broker is assigned the role of controller. The controller is responsible for administrative operations.

39
Q

What are the responsibilities of the controller node in a cluster?

A

Controller node does administrative jobs like:

1) Assigning partitions to brokers
2) Monitoring brokers for failure in cluster
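
One thing a controller can do when it detects a failed broker is move partition leadership to a surviving replica. The sketch below is a simplified assumption about that reaction: `elect_leaders` is a hypothetical function, and picking the first live replica is a stand-in for the real election rules.

```python
def elect_leaders(assignment, live_brokers):
    """For each partition, pick the first replica that sits on a live broker.

    assignment:   {partition_id: [broker_id, ...]} replica lists
    live_brokers: set of broker ids currently alive
    Returns {partition_id: leader_broker_id, or None if no replica is alive}.
    """
    leaders = {}
    for partition, replicas in assignment.items():
        alive = [b for b in replicas if b in live_brokers]
        leaders[partition] = alive[0] if alive else None
    return leaders

assignment = {0: [0, 1], 1: [1, 2], 2: [2, 0]}
before = elect_leaders(assignment, live_brokers={0, 1, 2})
after = elect_leaders(assignment, live_brokers={0, 2})   # broker 1 has failed
print(before)
print(after)   # partition 1 is now led by broker 2
```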

40
Q

What are the types of Kafka clusters?

A

1) Single Node - Single Broker
2) Single Node - Multiple Broker
3) Multiple Node - Multiple Broker

41
Q

In what kinds of systems can Kafka be used?

A

1) Stream processing system
2) Near-real-time system
3) Offline/batch processing system

42
Q

Can a consumer also be a producer?

A

Yes. The two roles can be different components of the same service.
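
A consume-transform-produce loop is the typical shape of such a service. In this sketch two plain lists stand in for the input and output topics, and `to_upper` is a hypothetical processing step.

```python
def to_upper(record):
    """The service's own processing step (hypothetical)."""
    return record.upper()

input_topic = ["click", "view", "purchase"]   # stands in for the topic being consumed
output_topic = []                             # stands in for the topic being produced to

for record in input_topic:                    # consumer role: read from one topic
    output_topic.append(to_upper(record))     # producer role: write to another

print(output_topic)
```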