Q & A Flashcards

1
Q

What is a batch?

A

A collection of messages being produced to the same topic and partition.

2
Q

What is a controller?

A

The “lead” broker within a cluster that manages the state of partitions, replicas, and partition reassignment.

3
Q

How is a controller elected?

A

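The first broker to start becomes the controller by creating an ephemeral /controller node in Zookeeper. When the controller goes down, that node disappears and the remaining brokers race to create it; whichever succeeds becomes the new controller.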

4
Q

What is Zookeeper?

A

Zookeeper is a centralized service that keeps track of the brokers, configurations, topics, and partitions.

5
Q

What is a leader?

A

Each partition is owned by a single broker in the cluster, and that broker is the partition’s leader. By default, the leader is the only replica that serves produce and consume requests.

6
Q

What are the three types of replicas?

A

Leader, follower, and preferred.

7
Q

How are new leader replicas determined?

A

When the existing leader goes down (or becomes unresponsive), the controller chooses a new leader from the in-sync replicas. If auto.leader.rebalance.enable=true is set (default), a background check periodically moves leadership back to the preferred leader once it is in-sync again.

8
Q

What is retention? What are the two types of retention?

A

Retention determines how long messages are kept within a given topic; it can be configured by time or by size. The two types of retention (cleanup policies) are delete and compact.

9
Q

What is log compaction?

A

Compaction is a type of retention in which only the latest value of a given key is retained after the retention period has elapsed.
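
The idea of keeping only the latest value per key can be sketched in a few lines of Python. This is a simplified model, not Kafka's implementation: the log is assumed to be a plain list of (offset, key, value) tuples, and the function name compact is made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory record: (offset, key, value)
Record = Tuple[int, str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the most recent record for each key, preserving offset order."""
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [rec for rec in log if latest_offset[rec[1]] == rec[0]]


log = [(0, "user-1", "a"), (1, "user-2", "b"), (2, "user-1", "c")]
print(compact(log))  # [(1, 'user-2', 'b'), (2, 'user-1', 'c')]
```

Note that the surviving records keep their original offsets; compaction removes records but never renumbers them.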

10
Q

Where is broker information stored in Zookeeper?

A

Under the /brokers/ids directory.

11
Q

What is an ephemeral node?

A

When a broker starts up, an ephemeral node is created to represent it in Zookeeper. The node only exists while the broker’s session is alive: if the broker goes offline the node is removed automatically, and a broker that comes back online with the same ID re-creates it and rejoins the cluster.

12
Q

What is an ensemble?

A

A cluster of Zookeeper nodes.

13
Q

What is the preferred number of Zookeeper nodes?

A

An odd number, preferably something that adheres to 2N+1 as Zookeeper requires a quorum to make elections and respond to requests.
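
The quorum arithmetic can be checked with a short sketch, assuming a quorum is a strict majority of the ensemble. The function names are made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from three to four nodes does not increase the number of tolerated failures, which is why odd ensemble sizes are preferred.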

14
Q

What is the default port(s) for Zookeeper?

A

2181 is the primary (client) port, 2888 is used by followers to connect to the leader, and 3888 is used for leader elections.

15
Q

What is the default broker port?

A

9092

16
Q

What is the default KSQL port?

A

8088

17
Q

What is the default schema registry port?

A

8081

18
Q

What is the auto.create.topics.enable setting? What actions can cause a new topic to be created?

A

When enabled, auto.create.topics.enable allows the broker to dynamically create topics if they don’t already exist. Any attempts to produce, consume, or request metadata from a topic will cause it to be created (using the default replication and partition settings from the broker).

19
Q

What is the default number of partitions?

A

One partition is the default, although it is not preferred for scaling purposes.

20
Q

How is a request handled in Kafka?

A

The process goes:

  • Client Request
  • Broker
  • Partition Leader(s)
  • Response
  • Client
21
Q

Are there any guarantees within Kafka with regards to ordering of messages?

A

Messages are only guaranteed to be ordered within a single partition.

22
Q

What is a segment? What do they contain?

A

Partitions are divided into segments, which by default roll over after either 1GB of data or a week of messages. Each segment contains the messages (keys, values) along with two indices (one related to offsets and another related to timestamps).

23
Q

What is the unit of storage within Kafka?

A

A partition

24
Q

What is stored within a message on disk?

A

The key, the value, a checksum for corruption, the encoding, format, timestamp.

25
Q

When is compaction evaluated?

A

When a given segment of a partition is closed, any log compaction will be performed.

26
Q

Where does Kafka store any dynamic topic configurations?

A

In Zookeeper

27
Q

What producer setting determines when a given message is ready to be consumed? What are the different options for it?

A

The acks property determines when a message has been “received” and it can be set to the values of 0, 1, or all.

28
Q

What do the different ack configurations mean?

A

If acks=0, no acknowledgment is required from the broker that the message was received. If acks=1, the partition leader must confirm that it has received the message. If acks=all, the leader waits for all in-sync replicas to write the message before returning a response, and the topic-specific min.insync.replicas property governs how many in-sync replicas must exist for the write to be accepted.
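
As a rough mental model, the three settings can be written as a small decision function. This is a simplified sketch rather than broker logic: the function name and parameters are made up, and in_sync_replicas is assumed to count the leader itself.

```python
def write_is_acknowledged(acks: str, in_sync_replicas: int,
                          min_insync_replicas: int,
                          leader_alive: bool = True) -> bool:
    """Simplified model of when a producer treats a write as successful."""
    if acks == "0":
        # Fire and forget: the producer never waits for the broker.
        return True
    if acks == "1":
        # Only the partition leader has to confirm the write.
        return leader_alive
    if acks == "all":
        # The write is rejected when too few replicas are in sync.
        return leader_alive and in_sync_replicas >= min_insync_replicas
    raise ValueError(f"unsupported acks value: {acks!r}")


print(write_is_acknowledged("all", in_sync_replicas=1, min_insync_replicas=2))  # False
print(write_is_acknowledged("1", in_sync_replicas=1, min_insync_replicas=2))    # True
```

The second call shows why min.insync.replicas only matters with acks=all: with acks=1 the same under-replicated partition still accepts the write.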

29
Q

What are the three minimally required settings for a producer to have configured?

A

A producer at a minimum needs a broker (e.g. bootstrap server), a key serializer, and a value serializer.

30
Q

What is linger.ms? How can it help with batching?

A

linger.ms defines the time to wait before sending a batch, which can allow time for more messages to arrive and be sent, which can increase throughput.
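
The effect of lingering can be illustrated with a toy simulation. It is only a sketch under stated assumptions: arrival times are given in milliseconds, a batch is limited by a message count (the real batch.size is in bytes), and batch_messages is a made-up name.

```python
def batch_messages(arrivals, linger_ms, max_batch_size):
    """Group (arrival_time_ms, message) pairs into batches.

    A batch is sent when it is full or when linger_ms has passed
    since its first message arrived.
    """
    batches, current, deadline = [], [], None
    for t, msg in arrivals:
        if current and t >= deadline:
            batches.append(current)      # linger time expired: send the batch
            current = []
        if not current:
            deadline = t + linger_ms     # clock starts with the first message
        current.append(msg)
        if len(current) >= max_batch_size:
            batches.append(current)      # batch is full: send immediately
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [(0, "a"), (2, "b"), (4, "c"), (20, "d")]
print(batch_messages(arrivals, linger_ms=0, max_batch_size=10))
# [['a'], ['b'], ['c'], ['d']]
print(batch_messages(arrivals, linger_ms=5, max_batch_size=10))
# [['a', 'b', 'c'], ['d']]
```

With a 5 ms linger the first three messages travel in one request instead of three, at the cost of up to 5 ms of added latency.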

31
Q

What are some common mechanisms to increase producer throughput?

A

You want to ensure that batching is being performed effectively: set an adequate batch.size and linger.ms, and enable compression.

32
Q

What is the default compression used for Kafka messages? What are the other options?

A

By default, messages are not compressed. Supported compression types are snappy, gzip, lz4, and (from Kafka 2.1) zstd.

33
Q

What are the roles of the producer and consumer during compression?

A

The producer is responsible for compressing the messages and the consumer is responsible for decompressing them.

34
Q

What is max.in.flight.requests.per.connection? What are the potential dangers of adjusting it?

A

max.in.flight.requests.per.connection is a producer setting that indicates how many unacknowledged requests can be sent to the server on a single connection before the producer waits for a response.

Setting this too high can result in batching becoming less efficient and increased memory usage, and a retried request may result in loss of proper ordering.

Setting the value to 1 will ensure that only a single request is in flight at a time, guaranteeing message ordering.
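
How a retry can reorder messages is easier to see in a toy simulation. This is a sketch under simplifying assumptions: batches are sent in windows of max_in_flight, a failed batch is resent after the rest of its window, and idempotent producers are not modeled. All names are made up.

```python
def delivery_order(batches, fails_first_attempt, max_in_flight):
    """Return the order in which batches end up written to the log."""
    written = []
    pending = list(batches)
    failed_once = set()
    while pending:
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries = []
        for batch in window:
            if batch in fails_first_attempt and batch not in failed_once:
                failed_once.add(batch)
                retries.append(batch)    # resent after the rest of the window
            else:
                written.append(batch)
        pending = retries + pending
    return written


batches = ["b1", "b2", "b3"]
print(delivery_order(batches, {"b1"}, max_in_flight=1))  # ['b1', 'b2', 'b3']
print(delivery_order(batches, {"b1"}, max_in_flight=2))  # ['b2', 'b1', 'b3']
```

With two requests in flight, b2 is written while b1 is still being retried, so b1 lands behind it.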

35
Q

What are the two types of errors when producing a Kafka message? What are some examples of each?

A

Retriable and non-retriable.

Retriable errors include those where the broker or leader was not available; the request is automatically retried in the hope that a new leader has been elected and the retry succeeds.

Non-retriable errors include those that will consistently fail without some type of intervention, such as a message that is too large (RecordTooLargeException), which will throw an exception immediately.

36
Q

Why are the brokers referred to as bootstrap servers?

A

Since each broker holds metadata about the whole cluster, a client can connect to any of them to discover the other brokers and then send each request to the appropriate one.

37
Q

What is unclean.leader.election.enable?

A

If unclean.leader.election.enable is set to true, it allows for an out-of-sync replica to be elected as leader. This is only useful if you don’t necessarily care about data integrity and prefer availability over it.

38
Q

What is a consumer group?

A

A group of related consumers that perform a specific task. Each consumer within a group will have mutually exclusive partitions and offsets.

39
Q

What is a consumer?

A

Similar to a producer that produces messages to a broker and a specific topic, a consumer consumes messages from a given topic.

40
Q

What is a rebalance? What can cause them?

A

Rebalancing is the act of changing responsibility for a given partition from one consumer to another. This can occur when the number of partitions for a topic changes or when a consumer is added to or removed from the group.
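
A rebalance can be pictured as recomputing the partition assignment whenever group membership changes. The sketch below uses a plain round-robin rule as a stand-in for Kafka's real assignors; the function name and consumer IDs are made up.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: partition i goes to consumer i % len(consumers)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]
print(assign_partitions(partitions, ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# c3 leaves the group: its partitions are handed to the remaining consumers.
print(assign_partitions(partitions, ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

Every partition still has exactly one owner within the group after the change, which is the invariant a rebalance restores.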

41
Q

What are the three supported syntaxes to subscribe to a topic?

A

A topic can be subscribed to via a string, an array of topics, or a pattern (e.g. regular expression).

42
Q

What is the polling loop responsible for? What happens on the initial poll?

A

The polling loop will handle retrieving records, metadata, partition rebalancing info, heartbeats, and more during each poll. Additionally the initial poll is responsible for registering with the group coordinator and finding the appropriate partition to use.

43
Q

What is the session.timeout.ms property?

A

It’s the number of milliseconds that a consumer can go without sending a heartbeat to the group coordinator before it is considered dead and a rebalance is triggered.

44
Q

What is auto.offset.reset? What are the supported values of it?

A

auto.offset.reset tells a consumer where to begin reading when its group has no committed offset for a partition. It can be set to earliest, latest, or none (which will throw an exception if no existing offset is found).
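
The decision can be sketched as a small function. This is a simplified model: it only covers the case of a missing committed offset (the real setting also applies when the committed offset is out of range), and the function and parameter names are made up.

```python
def starting_offset(committed, log_start, log_end, auto_offset_reset):
    """Pick the offset a consumer starts reading from.

    committed is None when the group has no committed offset yet.
    """
    if committed is not None:
        return committed                 # the policy is not consulted at all
    if auto_offset_reset == "earliest":
        return log_start                 # read everything still retained
    if auto_offset_reset == "latest":
        return log_end                   # read only new messages
    if auto_offset_reset == "none":
        raise LookupError("no committed offset and auto.offset.reset=none")
    raise ValueError(f"unsupported value: {auto_offset_reset!r}")


print(starting_offset(None, log_start=10, log_end=50, auto_offset_reset="earliest"))  # 10
print(starting_offset(None, log_start=10, log_end=50, auto_offset_reset="latest"))    # 50
print(starting_offset(42, log_start=10, log_end=50, auto_offset_reset="latest"))      # 42
```

Once a group has committed offsets, changing auto.offset.reset has no effect on where it resumes.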

45
Q

What is a worker?

A

A single Java process, generally used with regards to Kafka Connect.

46
Q

What is the tasks.max property? What are the preferred configurations for sink connectors? What about source connectors?

A

This designates the maximum number of tasks that will be created to move data into/out of Kafka topics.

For source connectors, it should be set to one. For sink connectors, it can be higher.

47
Q

Where does Kafka store schemas from the schema registry?

A

In the _schemas topic.

48
Q

What is a SpecificRecord? What about a GenericRecord?

A

A SpecificRecord is a strongly typed Java class generated via a Maven or Gradle plugin from existing Avro schema files.

A GenericRecord is a record built against an explicitly declared schema, whose fields must be accessed by index or name.

49
Q

What happens if a schema isn’t found locally for a message?

A

If it isn’t found in the cache, it will be requested from the Schema Registry.

50
Q

What is a magic byte?

A

It’s the first byte of a message serialized through the Schema Registry. It identifies the serialization format version (currently 0) and is followed by the ID of the schema stored in the registry.
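
A minimal sketch of reading that header, assuming the Confluent wire format of one magic byte (0), a 4-byte big-endian schema ID, and then the serialized payload. The function name and the sample bytes are made up for illustration.

```python
import struct


def parse_header(message: bytes):
    """Split a serialized message into (schema_id, payload)."""
    # ">bI" = big-endian, one signed byte followed by one unsigned 4-byte int
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


message = b"\x00" + struct.pack(">I", 42) + b"avro-bytes"
print(parse_header(message))  # (42, b'avro-bytes')
```

The schema ID is what a deserializer looks up in its local cache, or requests from the Schema Registry on a miss.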

51
Q

What are the common types of compatibility types the schema registry supports?

A

There are multiple: backward (consumers using the new schema can read data produced with the previous version of the schema), forward (data produced with the new schema can be read by consumers using the previous version), full (it’s both forward and backward compatible), and none (compatibility isn’t checked).

52
Q

What is a topology?

A

It’s the directed acyclic graph that represents how a message flows through a streaming environment.

53
Q

What’s the difference between a stream and a table?

A

A stream represents a full changelog whereas a table only contains the latest value of every key.

54
Q

What language was Kafka written in? By what company?

A

It’s written in Scala and Java and was originally developed by an internal team at LinkedIn.

55
Q

What is Kafka?

A

It’s a distributed, resilient, real-time pub-sub system.

56
Q

What 5 Kafka Streams operators are stateful?

A

The following are all stateful: aggregate, reduce, count, join, windowing.

57
Q

What 11 operations in Kafka Streams are stateless?

A

The following operations are stateless: map, flatMap, mapValues, flatMapValues, branch, peek, groupBy, groupByKey, filter, filterNot, and foreach.

58
Q

What are the four types of windows? Explain each.

A

Tumbling - time-based, fixed size, non-overlapping (e.g. 0-5, 5-10, 10-15)

Hopping - time-based, fixed size, overlapping (e.g. 0-5, 3-8, 6-11)

Sliding - fixed size window based on timestamp differences (only used in joins)

Session - non-time based, dynamically sized based on records (e.g. used to aggregate key based events into sessions)
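
The difference between tumbling and hopping windows comes down to how many windows a single timestamp falls into. The sketch below assumes windows are half-open intervals [start, end) aligned to zero; windows_for is a made-up name, and a tumbling window is treated as a hopping window whose advance equals its size.

```python
def windows_for(timestamp, size, advance):
    """Return every [start, end) window of `size` that contains timestamp,
    where a new window starts every `advance` time units."""
    windows = []
    start = (timestamp // advance) * advance   # latest start <= timestamp
    while start >= 0 and start + size > timestamp:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)


print(windows_for(7, size=5, advance=5))  # tumbling: [(5, 10)]
print(windows_for(7, size=5, advance=3))  # hopping:  [(3, 8), (6, 11)]
```

A record at time 7 belongs to exactly one tumbling window but to two of the overlapping hopping windows from the example above.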

59
Q

What are the eight primitive types in Avro?

A

string, int, float, long, null, double, bytes, boolean

60
Q

What are the six complex types in Avro?

A

record, array, fixed, enum, union, map

61
Q

What are the optional fields for an Avro record?

A

doc, aliases, order, and default are all optional.

62
Q

What are the two supported protocols for accessing the Schema Registry?

A

HTTP and HTTPS are the only supported protocols.

63
Q

What occurs when you attempt to send a non-supported type to the Schema Registry?

A

You’ll receive a SerializationException from the registry itself.

64
Q

What type of schema change occurs when you remove a field with a default value?

A

This would be a Full schema revision as it would be forwards and backwards compatible.

65
Q

A field was added to a schema without a default. What kind of revision is this? Why?

A

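This is a forward revision as it’s only forwards compatible: readers on the previous schema simply ignore the new field, but readers on the new schema cannot read older data, which lacks the field and has no default to fall back on.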
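
The rules for adding and removing a single field can be summarized in a tiny lookup. This is a sketch of the usual Avro rules only, not of the Schema Registry's checker, and compatibility_of is a made-up name.

```python
def compatibility_of(change: str, has_default: bool) -> str:
    """Classify a single-field schema change as 'full', 'forward' or 'backward'."""
    if change not in ("add", "remove"):
        raise ValueError(f"unsupported change: {change!r}")
    if has_default:
        # The default fills the gap for whichever side is missing the field.
        return "full"
    # No default: only one direction can cope with the missing field.
    return "forward" if change == "add" else "backward"


print(compatibility_of("remove", has_default=True))   # full
print(compatibility_of("add", has_default=False))     # forward
print(compatibility_of("remove", has_default=False))  # backward
```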

66
Q

What are two primary reasons Kafka is so efficient at messaging?

A

It leverages Zero-Copy, so bytes move from the page cache to the network socket without being copied through the application, and it never transforms the data.

67
Q

What common requirement can cause an application to lose the Zero-Copy optimization?

A

The use of SSL. The broker has to encrypt the data itself before sending it, so the bytes can no longer go straight from the page cache to the network.

68
Q

Can a message be modified after sending?

A

No. Messages in Kafka are immutable.

69
Q

At what level is ordering guaranteed in Kafka?

A

At the partition level all messages are guaranteed to be in-order.

70
Q

Why might you define a key for a message in Kafka? What change could be made that could affect that reasoning?

A

You can define a key in order to influence partitioning as a partitioner will always place a given key in a given partition. This doesn’t hold true if the number of partitions changes.
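
The reason a partition-count change breaks the guarantee is that the partition is derived from a hash of the key modulo the number of partitions. The sketch below uses zlib.crc32 purely as a stand-in hash (Kafka's default partitioner uses murmur2), and partition_for is a made-up name.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition: hash(key) mod number of partitions."""
    # crc32 is only a stand-in here; it is not the hash Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer-{i}" for i in range(100)]

# The same key always lands in the same partition...
assert all(partition_for(k, 6) == partition_for(k, 6) for k in keys)

# ...until the partition count changes, at which point many keys move.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```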

71
Q

If you have multiple definitions within a configuration for retention period (e.g. hours, minutes, milliseconds), which wins?

A

The smallest existing unit of measure will take precedence (i.e., log.retention.ms over log.retention.minutes, and log.retention.minutes over log.retention.hours).
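
The precedence can be written out as a short function. This is a sketch that assumes the broker settings log.retention.ms, log.retention.minutes and log.retention.hours (default 168 hours); the function name and the dictionary-based config are made up for illustration.

```python
def effective_retention_ms(broker_config: dict) -> int:
    """Resolve the retention period, preferring the finest-grained setting."""
    if "log.retention.ms" in broker_config:
        return int(broker_config["log.retention.ms"])
    if "log.retention.minutes" in broker_config:
        return int(broker_config["log.retention.minutes"]) * 60_000
    # Fall back to hours; 168 hours (one week) is the default.
    return int(broker_config.get("log.retention.hours", 168)) * 3_600_000


print(effective_retention_ms({"log.retention.hours": 24, "log.retention.ms": 120}))  # 120
print(effective_retention_ms({"log.retention.hours": 1, "log.retention.minutes": 2}))  # 120000
print(effective_retention_ms({}))  # 604800000
```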

72
Q

What is the limit for the number of replicas that can exist for a topic?

A

The replication factor of a topic cannot exceed the number of brokers in the cluster.

73
Q

When might min.insync.replicas be ignored?

A

If acks=all is not set, this property isn’t recognized at all.

74
Q

What producer setting can you use to avoid duplicates that may be introduced by retries (e.g. after network errors)?

A

You can use enable.idempotence=true to help remove duplicates from being produced.

75
Q

If a message is produced with a null key, what partition is it sent to? Conversely, what if a message is sent with a key and a null value?

A

If no key is provided, the partitioner picks the partition itself (round-robin in older clients, a sticky per-batch choice in newer ones). If a message has a key but a null value, the message is considered a tombstone: it signals that all earlier messages with that key should be removed on the next cleanup.

76
Q

What’s the default hashing algorithm used by the Kafka partitioner?

A

The murmur2 algorithm.

77
Q

Can producers be multithreaded? What about consumers?

A

Producers in Kafka are thread-safe; consumers, however, are not, and a consumer instance should never be shared across multiple threads.

78
Q

What is one of the most important consumer metrics to monitor? What can it tell you?

A

records-lag-max reports the maximum lag, in number of records, over a window of time, showing how far the consumer(s) are falling behind. If this increases over time it can be a sign that the consumers are ill-equipped to keep up with the producers and need to be scaled/tuned.

79
Q

Where are consumer offsets stored?

A

In the __consumer_offsets topic.

80
Q

What are the three delivery semantics used by Kafka?

A

At-Most-Once (commits offsets after receiving but prior to processing), At-Least-Once (default, commits after processing, can allow potential dupes), Exactly-Once (can only be achieved via Kafka-Kafka)
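
The difference between the first two semantics comes down to whether the offset is committed before or after processing. The toy simulation below crashes the consumer once while it handles one record and then restarts it from the last committed offset; all names are made up and no real consumer API is involved.

```python
def run_consumer(records, crash_at, commit_before_processing):
    """Return every record that gets processed across a crash and a restart."""
    processed = []
    committed = 0                       # next offset to read after a restart
    for offset, record in enumerate(records):
        if commit_before_processing:
            committed = offset + 1
            if offset == crash_at:
                break                   # crash after commit: record is lost
            processed.append(record)
        else:
            processed.append(record)
            if offset == crash_at:
                break                   # crash before commit: record is redone
            committed = offset + 1
    processed.extend(records[committed:])   # restart from the committed offset
    return processed


records = ["a", "b", "c"]
print(run_consumer(records, crash_at=1, commit_before_processing=True))
# at-most-once:  ['a', 'c']
print(run_consumer(records, crash_at=1, commit_before_processing=False))
# at-least-once: ['a', 'b', 'b', 'c']
```

Committing first loses "b", while committing afterwards processes it twice, which is why at-least-once consumers should be idempotent.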

81
Q

What are the differences between KTables and GlobalKTables?

A

Both store all of the latest versions of a message (by key) however GlobalKTables load every message across all partitions, removing the need to repartition tables. They also allow you to join on non-key fields.

82
Q

What’s a Sink Connector? Likewise, what’s a Source Connector?

A

A Sink connector handles exporting data from Kafka into an external system (e.g. a database). A Source connector does the opposite and imports data from external sources into Kafka.

83
Q

What are the two important timing configurations for Zookeeper? What do they do?

A

initLimit (how long followers have to connect to and sync with the current leader at startup) and syncLimit (how far out of sync a follower can be with the leader); both of these are expressed as multiples of tickTime.

84
Q

What’s the command to find all partitions without a leader?

A

kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

85
Q

What’s the CLI tool for consuming from Kafka topics?

A

kafka-console-consumer.sh

86
Q

If you want to read from a specific partition in Kafka Streams, which function would you use?

A

The consumer’s assign() function allows you to pass in a collection of partitions to read from. It cannot be used in conjunction with a normal subscribe call.

87
Q

What does replication factor mean with regards to topics?

A

A replication factor of five would mean that each partition of the topic would live on five different brokers.

88
Q

When running Kafka inside Kubernetes, you noticed terribly slow startup times related to restoring the state, what can you do to improve this?

A

Consider mounting the RocksDB state directory as a persistent volume.

89
Q

What’s the output of a KStream-KTable join?

A

A KStream.

90
Q

In KSQL, if you want to read all of the messages for a topic, what should you do?

A

Set the auto.offset.reset to earliest.

91
Q

What component in a Kafka cluster stores ACL information? Where does it store it?

A

It’s stored in Zookeeper, under the /kafka-acl/ directory.

92
Q

What are the three internal Kafka Connect topics?

A

connect-offsets (stores source connector offsets), connect-configs (stores connector and task configurations), and connect-status (stores the status of connectors and tasks)

93
Q

A request just returned a NotLeaderForPartitionException, what should happen next?

A

NotLeaderForPartitionException is a retriable exception, so the client should refresh its metadata and send the request again; by then a new leader has hopefully been elected and can fulfill the request.

94
Q

If a Zookeeper ensemble contains five nodes, what happens if two go down?

A

Nothing. With five nodes (2N+1 where N=2), the three remaining nodes still form a quorum, so Zookeeper will continue to respond.

95
Q

What two things are needed to consume from a topic?

A

Any of the brokers from the cluster the topic resides in and the name of the topic.

96
Q

How are offsets committed for a topic?

A

A request is made to the broker responsible for the consumer group, called the Group Coordinator, which writes the offsets to the __consumer_offsets topic. Consumers do not write them directly.

97
Q

How is a Controller elected? What is it responsible for?

A

It’s elected by a Zookeeper ensemble and responsible for managing partition leader elections and all other traditional broker responsibilities.

98
Q

What type of joins are always windowed within Kafka Streams applications?

A

A KStream-KStream join always requires a window.

99
Q

When is log compaction evaluated? What determines if it’s performed?

A

Every time a segment (component of a partition) is closed, compaction is evaluated. If there is enough data that is considered “dirty” (via configuration), then it’s performed.

100
Q

Where does the Schema Registry live inside of a Kafka cluster?

A

It’s contained in a separate JVM component and interacted with via a REST interface.

101
Q

If you have a long running process that is continually interrupted by rebalances, what can you adjust to help with this?

A

Increase the max.poll.interval.ms property by a significant margin / percentage so that the consumer isn’t considered dead between polls and no rebalance is triggered.

102
Q

What’s the response from a send() operation in Kafka?

A

A Future object that contains information related to the record: its offset, partition, etc.

103
Q

What configuration property needs to be set to ensure all of the consumers share offsets?

A

You need to use the group.id to make sure that consumers are part of the same consumer group and delegate partitions accordingly.

104
Q

What are the three supported authentication mechanisms in Kafka?

A

SASL/SCRAM, SSL, and SASL/GSSAPI

105
Q

If you have a producer that can produce at 10MB/s, how would you determine your maximum throughput?

A

Just multiply your production rate by the number of brokers you have (or add more brokers to increase it)

106
Q

In terms of modeling, how are streams and tables related to static and non-static data?

A

Tables are generally comprised of static data, whereas streams will consist of more dynamic transactional data.

107
Q

A customer would like all of their data sent to a specific partition, what might you do to handle this?

A

Consider writing a custom partitioner to route data for their key to a specific partition.

108
Q

After failures, your consumer is taking too long to rebalance, what two settings can you adjust? How do those help the situation?

A

You can decrease session.timeout.ms so that a failed consumer is detected sooner (which triggers the rebalance earlier), and decrease heartbeat.interval.ms so that the remaining consumers send heartbeats more often and learn about the rebalance more quickly.

109
Q

Is heap size important to Kafka? How much is an adequate amount?

A

Not really, it uses a minimal amount and the rest is used as a page cache. 4-8GB is usually more than enough.

110
Q

What type of joins do not require each side to be co-partitioned?

A

Any joins involving GlobalKTables, e.g. KStream-GlobalKTable, since they already contain all the data across all partitions.

111
Q

What is MirrorMaker? What is it used for?

A

MirrorMaker copies data across clusters, acting as a middleman that consumes from one cluster and produces to the other.

112
Q

What can calling a get() after a send() request get you? What’s the downside here?

A

The response of the get() call will allow you to access the metadata for the record. The downside is that this will decrease throughput since it is waiting on the response from the broker before moving on.

113
Q

What is Active-Passive mirroring?

A

This is the concept that one cluster is for reading and writing (Active) while another copy is mirroring it but only used for reading purposes (Passive).

114
Q

What do leading underscores denote with regards to topics?

A

This indicates they are internal topics to Kafka.

115
Q

Define a stream?

A

An abstraction representing an unbounded data set.

116
Q

What are the three types of time in Kafka?

A

Event Time (when the event actually occurred), Append Time (when the message arrived in Kafka), and Processing Time (when the processor received it).

117
Q

If you have a consumer that doesn’t automatically commit offsets and you call close(), what immediately happens?

A

Since the consumer doesn’t wait to commit offsets (they are committed manually) it will have nothing to wait for and will immediately trigger a rebalance.

118
Q

What’s a common pattern for generating unique identifiers in Kafka?

A

topic + partition + offset

119
Q

What is the broker.rack configuration? What does it do?

A

When this is set on each broker (assuming a proper configuration), Kafka will spread the replicas of each partition across multiple racks.

120
Q

What are two common exceptions that can occur on a client prior to sending a message?

A

A SerializationException or a BufferExhaustedException.