December 28, 2018

Learn Apache Kafka

Topics, Partitions and Offsets

  • Topics: a particular stream of data
    • Similar to a table in a database (without all the constraints)
    • You can have as many topics as you want
    • A topic is identified by its name
  • Topics are split into partitions
    • Each partition is ordered
    • Each message within a partition gets an incremental id, called an offset
    • Offsets only have meaning within a specific partition
      • E.g. offset 3 in partition 0 doesn’t represent the same data as offset 3 in partition 1
    • Order is guaranteed only within a partition (not across partitions)
    • Data is kept only for a limited time (default is one week)
    • Once the data is written to a partition, it can’t be changed (immutability)
    • Data is assigned randomly to a partition unless a key is provided
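
The partition/offset mechanics above can be sketched in a few lines of Python. This is an illustrative simulation, not the Kafka client API: `Topic`, `produce`, and the key routing via Python's built-in `hash` are all stand-ins.

```python
# Illustrative sketch (not the Kafka client API): a topic as a set of
# append-only partitions, each handing out incremental offsets.
import random

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, value, key=None):
        # No key: pick a partition at random; with a key: hash it so the
        # same key always lands on the same partition.
        if key is None:
            p = random.randrange(len(self.partitions))
        else:
            p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)       # append-only (immutable) log
        offset = len(self.partitions[p]) - 1   # incremental id per partition
        return p, offset

t = Topic("first_topic", 3)
p1, o1 = t.produce("position 1", key="truck_42")
p2, o2 = t.produce("position 2", key="truck_42")
assert p1 == p2       # same key -> same partition
assert o2 == o1 + 1   # offsets grow within that partition only
```

Note that offset 0 exists independently in every partition, which is why an offset is only meaningful together with its partition number.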

Brokers

  • A Kafka cluster is composed of multiple brokers (servers)
  • Each broker is identified with its ID (integer)
  • Each broker contains certain topic partitions
  • After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster
  • A good number to get started is 3 brokers, but some big clusters have over 100 brokers
  • In these examples we choose to number brokers starting at 100 (arbitrary)

Brokers and Topics

  • Example of Topic-A with 3 partitions
  • Example of Topic-B with 2 partitions
    • Note: Data is distributed and Broker 103 doesn’t have any Topic B data

Topic Replication Factor

  • Topics should have a replication factor > 1 (usually between 2 and 3)
  • This way if a broker is down, another broker can serve the data
  • Example: Topic-A with 2 partitions and replication factor of 2

  • Example: we lost Broker 102
  • Result: Broker 101 and 103 can still serve the data

Concept of Leader for a Partition

  • At any time only ONE broker can be a leader for a given partition
  • Only that leader can receive and serve data for a partition
  • The other brokers will synchronize the data
  • Therefore each partition has one leader and multiple ISR (in-sync replica)

Producers

  • Producers write data to topics (which are made of partitions)
  • Producers automatically know which broker and partition to write to
  • In case of Broker failures, Producers will automatically recover

  • Producers can choose to receive acknowledgement of data writes
    • acks=0: Producer won’t wait for acknowledgement (possible data loss)
    • acks=1: Producer will wait for leader acknowledgement (limited data loss)
    • acks=all: Leader + replicas acknowledgement (no data loss)

Producers: Message keys

  • Producers can choose to send a key with the message (string, number, etc..)
  • If key=null, data is sent round robin across partitions (broker 101, then 102, then 103…)
  • If a key is sent, then all messages for that key will always go to the same partition
  • A key is typically sent if you need message ordering for a specific field
    • We get this guarantee thanks to key hashing, which depends on the number of partitions
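
A minimal sketch of that guarantee: Kafka's default partitioner actually uses murmur2 on the key bytes; Python's built-in `hash` stands in for it here.

```python
# Hedged sketch: Kafka's default partitioner is roughly
# murmur2(key_bytes) % num_partitions; hash() stands in for murmur2.
def partition_for(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

# Same key and same partition count -> always the same partition:
assert partition_for("truck_42", 3) == partition_for("truck_42", 3)

# Changing the number of partitions changes the modulus, so the same key
# may map to a different partition -- which is why adding partitions
# breaks the ordering guarantee for existing keys.
```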

Consumers

  • Consumers read data from a topic (identified by name)
  • Consumers know which broker to read from
  • In case of broker failures, consumers know how to recover
  • Data is read in order within each partition

Consumer Groups

  • Consumers read data in consumer groups
  • Each consumer within a group reads from exclusive partitions
  • If you have more consumers than partitions, some consumers will be inactive
  • Note: Consumers will automatically use a GroupCoordinator and a Consumer Coordinator to assign a consumer to a partition.
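
The assignment rule can be illustrated with a tiny simulation. This is a simplified round-robin assignor, not Kafka's actual range/round-robin/sticky assignors:

```python
# Simplified sketch of a group coordinator spreading partitions across
# the members of a consumer group (round-robin for illustration).
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: partitions are exclusive, one consumer gets two.
two = assign([0, 1, 2], ["c1", "c2"])
assert two == {"c1": [0, 2], "c2": [1]}

# 3 partitions, 4 consumers: c4 gets nothing and sits inactive.
four = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
assert four["c4"] == []
```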

Consumer Offsets

  • Kafka stores the offsets at which a consumer group has been reading
  • The committed offsets live in a Kafka topic named __consumer_offsets
  • When a consumer in a group has processed data received from Kafka, it should be committing the offsets
  • If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets.
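
Conceptually, the commit/resume cycle looks like this (a pure-Python simulation; the `committed` dict plays the role of the `__consumer_offsets` topic):

```python
# Simulation: committing offsets after processing lets a consumer resume
# where it left off after a restart or crash.
committed = {}   # (group, topic, partition) -> next offset to read
processed = []

def process(msg):
    processed.append(msg)

def poll_and_commit(group, topic, partition, log):
    start = committed.get((group, topic, partition), 0)
    for offset in range(start, len(log)):
        process(log[offset])
        # Commit after processing (at-least-once behaviour):
        committed[(group, topic, partition)] = offset + 1

log = ["m0", "m1", "m2"]
poll_and_commit("g1", "t", 0, log)
log.append("m3")
poll_and_commit("g1", "t", 0, log)   # resumes at m3; nothing is re-read
assert processed == ["m0", "m1", "m2", "m3"]
```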

Delivery Semantics for Consumers

  • Consumers choose when to commit offsets
  • There are 3 delivery semantics:
  • At most once:
    • offsets are committed as soon as the message is received.
    • If the processing goes wrong, the message will be lost (it won’t be read again).
  • At least once (usually preferred):
    • offsets are committed after the message is processed.
    • If the processing goes wrong, the message will be read again.
    • This can result in duplicate processing of messages. Make sure your processing is idempotent. (i.e. processing again the messages won’t impact your systems)
  • Exactly once:
    • Can be achieved for Kafka => Kafka workflows using the Kafka Streams API
    • For Kafka => External System workflows, use an idempotent consumer.
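
For the at-least-once case, idempotent processing can be as simple as a keyed upsert. This is an illustrative sketch; the message shape and the in-memory `state` dict are made up for the example:

```python
# Sketch of an idempotent consumer for at-least-once delivery: upserting
# by a stable message id means reprocessing a duplicate is a no-op.
state = {}

def process(message):
    # Upsert by id: applying the same message twice leaves state unchanged.
    state[message["id"]] = message["value"]

msg = {"id": "order-1", "value": 100}
process(msg)
process(msg)   # duplicate delivery (at-least-once)
assert state == {"order-1": 100}
```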

Kafka Broker Discovery

  • Every Kafka broker is also called a “bootstrap server”
  • That means that you only need to connect to one broker, and you will be connected to the entire cluster.
  • Each broker knows about all brokers, topics and partitions (metadata)

Zookeeper

  • Zookeeper manages brokers (keeps a list of them)
  • Zookeeper helps in performing leader election for partitions
  • Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc…)
  • Kafka can’t work without Zookeeper
  • Zookeeper by design operates with an odd number of servers (3, 5, 7)
  • Zookeeper has a leader (handles writes); the rest of the servers are followers (handle reads)
  • (Zookeeper does NOT store consumer offsets with Kafka > v0.10)

Kafka Guarantees

  • Messages are appended to a topic-partition in the order they are sent
  • Consumers read messages in the order stored in a topic-partition
  • With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down
  • This is why a replication factor of 3 is a good idea:
    • Allows for one broker to be taken down for maintenance
    • Allows for another broker to be taken down unexpectedly
  • As long as the number of partitions remains constant for a topic (no new partitions), the same key will always go to the same partition

Theory Roundup

CLI Commands

Start Kafka

  • Start Zookeeper:

zookeeper-server-start config/zookeeper.properties

  • Start Kafka:

kafka-server-start config/server.properties

Kafka Topics CLI (kafka-topics)

  • Create a topic:

kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --create --partitions 3 --replication-factor 1

  • List topics:

kafka-topics --zookeeper 127.0.0.1:2181 --list

  • Describe a topic:

kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --describe

  • Delete a topic:

kafka-topics --zookeeper 127.0.0.1:2181 --topic second_topic --delete

  • Change config:

Go to config/server.properties

Kafka Console Producer CLI (kafka-console-producer)

kafka-console-producer --broker-list 127.0.0.1:9092 --topic first_topic

kafka-console-producer --broker-list 127.0.0.1:9092 --topic first_topic --producer-property acks=all

(Producer with keys; each message is then typed as key,value:)

kafka-console-producer --broker-list 127.0.0.1:9092 --topic first_topic --property parse.key=true --property key.separator=,

Kafka Console Consumer CLI (kafka-console-consumer)

kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic first_topic

kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic first_topic --from-beginning

kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic first_topic --group my-first-application

kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic first_topic --group my-first-application --from-beginning

(Consumer with keys:)

kafka-console-consumer --bootstrap-server 127.0.0.1:9092 --topic first_topic --from-beginning --property print.key=true --property key.separator=,

Kafka Consumer Groups CLI (kafka-consumer-groups)

kafka-consumer-groups --bootstrap-server localhost:9092 --list

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group my-second-application

Resetting Offsets

kafka-consumer-groups --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --to-earliest --execute --topic first_topic

kafka-consumer-groups --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --shift-by 2 --execute --topic first_topic


All the notes above are from a Udemy course: Apache Kafka Series - Learn Apache Kafka for Beginners v2.

There’s also a very good Kafka introduction Youtube video I watched: https://www.youtube.com/watch?v=UEg40Te8pnE

Here’s my Github repository link where I learn Kafka: https://github.com/lisading/kafkabeginner
