Kafka Cluster Architectures and Administering Kafka Flashcards

1
Q

When do we need cross cluster mirroring?

A

When an organization runs multiple Kafka clusters that are interdependent, and administrators need to continuously copy data between them.
Copying data between clusters is called mirroring.
Apache Kafka’s built-in cross-cluster replicator is called MirrorMaker.

2
Q

What is mirroring?

A

Copying data from one cluster to another is called mirroring.

3
Q

Use cases of mirroring?

A

1) Regional and central clusters - Regional Kafka applications write data to their regional clusters, and a central cluster aggregates that data for use by a central team.
2) Disaster Recovery (DR) / Redundancy
3) Cloud migrations - A new application deployed in the cloud may need data that is updated by applications running in an on-premise datacenter.

4
Q

Drawbacks of Cross-datacenter communication?

A

1) High latencies
2) Limited bandwidth
3) High costs

5
Q

Which are different types of Multi-cluster architectures?

A

1) Hub and Spokes Architecture
2) Active-Active Architecture
3) Active-Standby Architecture
4) Stretch Clusters

6
Q

What is Hub and Spokes architecture?

A

An architecture with multiple local Kafka clusters and one central Kafka cluster that aggregates all of their data (for example, for centralized monitoring or analytics) is called a Hub-and-Spokes architecture.
When using this architecture, for each regional datacenter we need at least one mirroring process running on the central datacenter.

7
Q

Advantage and Disadvantage of Hub and Spokes architecture?

A

Disadvantage:
Data produced in one regional datacenter is not available in the other regional datacenters.

Advantages:
Data is always produced at the local cluster, and events from each datacenter are mirrored only once - to the central datacenter.
Applications that process data from a single datacenter can run in that datacenter.
Applications that need to process data from multiple datacenters run in the central datacenter.
The architecture is simple to deploy, configure, and monitor.

8
Q

Define Active-Active architecture?

A

Two or more clusters each serve local clients and mirror their data to every other cluster: data from cluster A is copied to cluster B, and data from cluster B is copied to cluster A. The same applies if a third cluster C is added.

9
Q

Define Active-Standby architecture?

A

All the data produced to cluster A is also copied to cluster B, which stands by, ready to take over. This architecture is used for disaster recovery.

10
Q

Define Stretch clusters?

A

A stretch cluster is a single Kafka cluster installed across multiple nearby datacenters, relying on Kafka’s own replication (rather than mirroring between separate clusters) to keep copies of the data in each datacenter.

11
Q

What is Apache MirrorMaker?

A

Kafka contains a simple tool for mirroring data between two datacenters called MirrorMaker.

12
Q

What does MirrorMaker consist of internally?

A

1) MirrorMaker has a single producer.
2) MirrorMaker has multiple consumers.
3) Each consumer runs in its own thread.
4) Each consumer consumes events from the topics and partitions it was assigned on the source cluster and uses the shared producer to send those events to the target cluster.
5) Every 60 seconds (by default), the consumers tell the producer to send all the events it has to Kafka and wait until acks are received for those events.
6) Only then do the consumers contact the source Kafka cluster to commit the offsets for those events. This guarantees no data loss.

13
Q

Which is the most important pre-requisite before starting mirroring in MirrorMaker?

A

The topics to be mirrored from the source cluster must already exist in the destination cluster (unless automatic topic creation is enabled there).

14
Q

Sample MirrorMaker command

A

sh kafka-mirror-maker.sh --consumer.config <consumer.properties> --producer.config <producer.properties> --new.consumer --num.streams 2 --whitelist ".*"

--whitelist takes a regular expression describing the topics that need to be mirrored.

15
Q

What configuration needs to be provided in MirrorMaker consumer.config properties?

A

1) group.id - All consumers in a MirrorMaker process share the same configuration, which means there can be only one source cluster and one group.id; all the consumers are part of the same consumer group.
2) bootstrap.servers - the brokers of the source cluster.

MirrorMaker automatically commits offsets, and by default it starts replicating from the latest events, i.e., those written after MirrorMaker starts.
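A minimal consumer.properties covering the two settings above might look like this (the broker host names and group name are hypothetical placeholders):

```properties
# MirrorMaker consumer.properties (hypothetical values)
# One source cluster, one group.id: all MirrorMaker consumers share this file.
bootstrap.servers=source-broker1:9092,source-broker2:9092
group.id=mirrormaker-group
```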

16
Q

What does num.streams parameter represent in MirrorMaker command?

A

It represents the number of consumers that will be used.

17
Q

What configuration needs to be provided in MirrorMaker command producer.config properties?

A

The only mandatory configuration is bootstrap.servers (the brokers of the target cluster).

18
Q

How many source clusters can be there for single process of MirrorMaker?

A

There can only be one source cluster per MirrorMaker process.

19
Q

What is new.consumer property in MirrorMaker command?

A

It indicates that the new version of the consumer should be used instead of the old one.

20
Q

What is num.streams property in MirrorMaker command?

A

Each stream is another consumer reading from the source cluster.
All consumers in the same MirrorMaker process share the same producer.
It takes multiple consumers to saturate that producer; if we need more throughput beyond that point, we have to run additional MirrorMaker processes.

21
Q

What is whitelist property in MirrorMaker command?

A

A regular expression of the topics that need to be mirrored. All the topic names that match the regular expression will be mirrored.

22
Q

Which are important things to monitor when deploying MirrorMaker in production?

A

1) Lag monitoring - The lag is the difference between the offset of the latest message in the source Kafka cluster and that of the latest message in the destination.
2) Metrics monitoring - Collect and monitor the metrics exposed by MirrorMaker's producer and consumer.

Consumer: fetch-size-avg, fetch-size-max, fetch-rate, fetch-throttle-time-avg and fetch-throttle-time-max.

Producer: batch-size-avg, batch-size-max, requests-in-flight and record-retry-rate.

Both: io-ratio and io-wait-ratio.

23
Q

Which parameters are useful for tuning MirrorMaker?

A

For the producer:

1) max.in.flight.requests.per.connection - MirrorMaker sets this to one by default, which means every request sent by the producer must be acknowledged before the next one is sent, limiting throughput.
2) linger.ms and batch.size

For the consumer:

1) partition.assignment.strategy - Round-robin assignment is suggested when many topics and partitions need to be mirrored.
2) fetch.max.bytes - If the fetch-size-avg and fetch-size-max metrics are close to fetch.max.bytes, increase it to allow the consumer to read more data in each request.
3) fetch.min.bytes and fetch.max.wait.ms - If the fetch-rate metric is high, the consumer is sending too many fetch requests to the brokers without receiving enough data in each; increase both to allow more data per fetch request.
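As a sketch, the consumer-side tuning above could be expressed in MirrorMaker's consumer.properties like this (all values are illustrative, not recommendations):

```properties
# Illustrative tuning overrides for MirrorMaker's consumer.properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
# Raise if the fetch-size-avg / fetch-size-max metrics sit near the current limit
fetch.max.bytes=52428800
# Raise these if fetch-rate is high but each fetch returns little data
fetch.min.bytes=1048576
fetch.max.wait.ms=500
```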

24
Q

Creating a topic

A

sh kafka-topics.sh --zookeeper <zk-connect> --create --topic <topic-name> --replication-factor <n> --partitions <n>

25
Q

Altering a topic

A

sh kafka-topics.sh --zookeeper <zk-connect> --alter --topic <topic-name> --partitions <count>

26
Q

Does rebalancing occur when the number of partitions are altered?

A

No, rebalancing does not occur when partitions are increased; it occurs when the number of consumers changes.

27
Q

What can be the reasons to increase the partition count for a topic?

A

The reasons can be to spread the topic out further across brokers, or to reduce the throughput that a single partition has to handle.

28
Q

Is it possible to reduce the number of partitions of a topic?

A

No, it is not possible to delete existing partitions or reduce the partition count, as that would mean data loss. To reduce the partition count, delete the topic and recreate it.

29
Q

Deleting a topic

A

sh kafka-topics.sh --zookeeper <zk-connect> --delete --topic <topic-name>

30
Q

Is deleting a topic reversible?

A

No; deleting a topic deletes all of its messages, and the operation is not reversible.

31
Q

Listing topics in cluster

A

sh kafka-topics.sh --zookeeper <zk-connect> --list

32
Q

Is topic deleted immediately when we delete it?

A

No, it is only marked for deletion; Kafka runs a periodic job that actually deletes it.

33
Q

Describe a topic

A

sh kafka-topics.sh --zookeeper <zk-connect> --describe --topic <topic-name>

34
Q

Which are the two filters available to troubleshoot topic issues?

A

sh kafka-topics.sh … --under-replicated-partitions -> shows all partitions where one or more of the replicas are not in sync with the leader.

sh kafka-topics.sh … --unavailable-partitions -> shows all partitions without a leader, meaning the partition is unavailable to producer and consumer clients.

35
Q

Where is consumer group information stored?

A

For the older consumers the information is stored in ZooKeeper; for the new consumer it is stored on the brokers themselves.

36
Q

Command to list down consumer groups for old consumers

A

sh kafka-consumer-groups.sh --zookeeper <zk-connect> --list

37
Q

Command to list down consumer groups for new consumers

A

sh kafka-consumer-groups.sh --bootstrap-server <broker-list> --new-consumer --list

38
Q

How to see details about consumer group like topics being consumed and lag etc?

A

sh kafka-consumer-groups.sh --zookeeper <zk-connect> --describe --group <group-name>

39
Q

What happens on deleting consumer groups

A

Deleting a consumer group is only supported for old consumer clients. It removes the entire group from ZooKeeper, including the stored offsets for all topics the group consumes.
Before performing this operation, all the consumers in the group must be shut down.

40
Q

Suppose a consumer group is consuming from multiple topics and we need to delete that group's offsets for one specific topic. How to do that from the command line?

A

sh kafka-consumer-groups.sh --zookeeper <zk-connect> --delete --group <group-name> --topic <topic-name>

41
Q

Exporting offsets

A
There is no script to export offsets directly, but we can use the kafka-run-class.sh script to execute the tool's Java class in the proper environment.
Exporting offsets generates a file listing each topic partition and its offset, in a format the import tool can understand.
42
Q

Exporting offsets command line

A

sh kafka-run-class.sh kafka.tools.ExportZkOffsets --zkconnect <zk-connect> --group <group-name> --output-file <file>

43
Q

How to Import offsets to desired values for consumer group?

A

1) First, export the consumer group's current offsets. This creates a file in the format the import tool understands.
2) Then, for the partitions whose offsets we want to change, edit the values in the file.
3) Before importing the new offsets, make sure all the consumers in that consumer group are stopped.
4) Import the offsets using the command line.

44
Q

Importing offsets command line

A

sh kafka-run-class.sh kafka.tools.ImportZkOffsets --zkconnect <zk-connect> --input-file <file>

Notice that the group name is not required when importing offsets; it is embedded in the export file.

45
Q

What is override configuration?

A

Some configurations are applied cluster-wide as defaults for all topics and clients, but sometimes specific topics or clients need different values. Settings that replace these cluster-wide defaults are called override configurations.

46
Q

How to override the retention.ms default for a topic using the command line?

A

sh kafka-configs.sh --zookeeper <zk-connect> --alter --entity-type topics --entity-name <topic-name> --add-config <key>=<value>[,<key>=<value>...]

sh kafka-configs.sh --zookeeper <zk-connect> --alter --entity-type topics --entity-name <topic-name> --add-config retention.ms=360000

47
Q

Which are some of valid configuration overrides for topics?

A

delete.retention.ms - How long, in ms, deleted tombstones are retained for the topic. Valid only for log-compacted topics.
flush.ms - How long before forcing a flush of this topic's messages to disk.
max.message.bytes - The maximum size of a single message for this topic.
min.insync.replicas - The minimum number of replicas that must be in sync for the partition to be available.
retention.bytes - The amount of message data, in bytes, to retain for this topic.
retention.ms - How long messages should be retained for this topic, in ms.
segment.ms - How frequently the log segment for each partition should be rotated.

48
Q

Which are some valid overrides for producer/consumer?

A

For producers and consumers, the only valid overrides are their quotas.

producer-bytes-rate - the rate at which the producer can produce, per broker
consumer-bytes-rate - the rate at which the consumer can consume, per broker

49
Q

What are producer/consumer quotas?

A

A quota is the rate, in bytes/sec, at which a specific client (identified by its client ID) is allowed to produce or consume, on a per-broker basis.

Suppose we have a five-broker cluster and specify a quota of 10 MB/sec for a client. That client is allowed to produce 10 MB/sec on each broker at the same time, for a total of 50 MB/sec.
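The per-broker arithmetic above can be checked with a quick shell computation (the numbers are the example's values):

```shell
brokers=5           # brokers in the cluster
per_broker_mb=10    # quota enforced on each broker, in MB/sec
total=$((brokers * per_broker_mb))
echo "aggregate throughput: ${total} MB/sec"   # prints: aggregate throughput: 50 MB/sec
```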

50
Q

Command to show or list the configuration overrides for a particular topic?

A

sh kafka-configs.sh --zookeeper <zk-connect> --describe --entity-type topics --entity-name <topic-name>

51
Q

Which scripts are available for partition management in Kafka?

A

1) kafka-preferred-replica-election.sh - for re-electing preferred leader replicas.

2) kafka-reassign-partitions.sh - a utility for assigning partitions to brokers.

52
Q

Use of partition management scripts in Kafka?

A

The preferred replica election script is used to rebalance partition leadership across brokers, and the partition reassignment utility is used to move partition replicas between brokers, for example when adding or decommissioning brokers.

53
Q

What is the use of preferred replica election script?

A

Each partition has a preferred replica: the first replica in its replica list. After broker failures and restarts, leadership can drift away from the preferred replicas and become unevenly distributed. The preferred replica election script asks the controller to re-elect the preferred replica as leader for each partition, rebalancing leadership across the brokers.

54
Q

Command to trigger preferred replica election?

A

sh kafka-preferred-replica-election.sh --zookeeper <zk-connect>

55
Q

Which are the options that we can provide to Console consumer?

A

1) --topic - name of a single topic to consume
2) --whitelist - regular expression for the topics to consume
3) --blacklist - consume all topics except those matching this regular expression

Only one of these three options can be provided.

4) --from-beginning - consume messages in the topic(s) from the oldest offset; otherwise the default is to consume from the latest.
5) --max-messages - consume at most NUM messages before exiting.
6) --partition - consume only from partition ID NUM.
7) --formatter - specifies the message formatter class used to decode the messages; defaults to kafka.tools.DefaultMessageFormatter.

56
Q

Which are inbuilt formatters available in Kafka for console consumer?

A

1) DefaultMessageFormatter
2) ChecksumMessageFormatter
3) LoggingMessageFormatter
4) NoOpMessageFormatter

57
Q

Which are the options that we can provide to Console producer?

A

1) --broker-list - comma-separated list of brokers in the cluster
2) --topic - the topic you are producing messages to
3) key.serializer - message encoder used to serialize the key; defaults to DefaultEncoder
4) value.serializer - message encoder used to serialize the value; defaults to DefaultEncoder
5) compression.codec - the codec to use when producing messages: gzip, snappy, or lz4
6) --sync - produce messages synchronously, waiting for each message to be acknowledged before sending the next one.