Kafka: Can consumers read messages before all replicas are in sync?

I'm designing an event driven distributed system.
One of the events we need to distribute needs:
1- Low Latency
2- High availability
Durability of the message and consistency between replicas is not that important for this event type.
From the Kafka documentation, it seems that consumers must wait until all in-sync replicas for a partition have applied a message to their log before they can read it from any replica.
Is my understanding correct? If so, is there a way around it?

Depending on configuration, consumers can read data that has not yet been written to any follower replica.
As per the book,
Data is only available to consumers after it has been committed to Kafka—meaning it was written to all in-sync replicas.
If you have configured min.insync.replicas=1, the in-sync set is allowed to shrink to the leader alone. When that happens, a message counts as committed as soon as the leader has written it, so Kafka serves it to consumers without waiting for the followers to catch up.
The recommended value for min.insync.replicas depends on the type of application. If you can tolerate losing data, it can be 1; if the data is critical, configure it to be greater than 1.
There are 2 things you should think about:
Is it all right if a message from the producer never reaches Kafka? (fire-and-forget strategy with acks=0)
Is it all right if a consumer never reads a message? (with min.insync.replicas=1, you may lose some data if a broker goes down)
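To make the "committed" rule above concrete, here is a minimal sketch in plain Python. It is a toy model, not Kafka code: the class name PartitionState and its fields are made up for illustration. It assumes the usual rule that consumers can read up to the high watermark, which is the smallest log end offset among the in-sync replicas, and that min.insync.replicas only gates acks=all writes.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class PartitionState:
    """Toy model of one partition's replication state (hypothetical, not a Kafka class)."""
    log_end_offsets: Dict[int, int]  # replica id -> next offset that replica will write
    isr: Set[int]                    # ids of in-sync replicas, always including the leader

    def high_watermark(self) -> int:
        # Consumers may only fetch offsets below this value.
        return min(self.log_end_offsets[r] for r in self.isr)

    def accepts_acks_all(self, min_insync_replicas: int) -> bool:
        # An acks=all write is rejected when the ISR is smaller than min.insync.replicas.
        return len(self.isr) >= min_insync_replicas


# Leader (replica 1) is at offset 10, followers lag behind but are still in sync.
state = PartitionState(log_end_offsets={1: 10, 2: 7, 3: 9}, isr={1, 2, 3})
print(state.high_watermark())  # 7

# Followers drop out of the ISR: only the leader counts now.
state.isr = {1}
print(state.high_watermark())  # 10
print(state.accepts_acks_all(1), state.accepts_acks_all(2))  # True False
```

So a slow follower that is still in the ISR holds back what consumers can see, while a follower that has been dropped from the ISR does not.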

Related

Message queue (like RabbitMQ) or Kafka for Microservices?

We are starting a new project and evaluating the tech stack for asynchronous communication between microservices. We are considering RabbitMQ and Kafka for this.
Can anyone shed some light on the key considerations for deciding between the two?
Thanks
The choice depends on what exactly your microservices need. Each offers something the other does not.
RabbitMQ in a nutshell
Who are the players:
Consumer
Publisher
Exchange
Route
The flow starts with the publisher, which sends a message to an exchange. The exchange is a middleware layer that knows how to route the message to a queue. Consumers define which queue they consume from (by defining a binding). RabbitMQ pushes the message to the consumer, and once the message has been consumed and the acknowledgment has arrived, it is removed from the queue.
Any piece of this system can be scaled out: the producer, the consumer, and RabbitMQ itself, which can be clustered and made highly available.
Kafka
Who are the players
Consumer / Consumer groups
Producer
Kafka Connect source
Kafka Connect sink
Topic and topic partition
Kafka Streams
Broker
Zookeeper
Kafka is a robust system with several players, but once you understand the flow well, it becomes easy to manage and work with.
A producer sends a message record to a topic. A topic is a category or feed name to which records are published, and it can be partitioned for better performance. Consumers subscribe to a topic and pull messages from it. When a topic is partitioned, each partition is read by exactly one consumer instance of a given consumer; all instances of the same consumer together are called a consumer group.
In Kafka, messages remain in the topic even after they have been consumed (how long is defined by the retention policy).
Kafka also uses sequential disk I/O, which boosts its performance and makes it a leading option among queue implementations and a safe choice for big data use cases.
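As a rough illustration of the topic and partition flow described above, here is a small sketch of how keyed records end up in partitions. Everything here is hypothetical: build_topic and partition_for are made-up helpers, and zlib.crc32 is only a stand-in for the hash the real client uses (the Java client's default partitioner uses murmur2 on the key bytes).

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_topic(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Append each record to the partition chosen by its key."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [(f"user-{i % 2}", f"event-{i}") for i in range(6)]
for number, partition in enumerate(build_topic(records, num_partitions=3)):
    print(number, partition)
```

Because all records for one key land in one partition, and one partition is read by one consumer instance per group, the events for a given key are seen in the order they were produced.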
Use Kafka if you need
Time travel/durable/commit log
Many consumers for the same message
High throughput
Stream processing
Replicability
High availability
Message order
Use RabbitMQ if you need:
flexible routing
Priority Queue
A standard protocol message queue
In order to select a message broker, I think this list could be really helpful.
Supported programming languages: You probably should pick one that supports a variety of programming languages.
Supported messaging standards: Does the message broker support any standards, such as AMQP and STOMP, or is it proprietary?
Messaging order: Does the message broker preserve the ordering of messages?
Delivery guarantees: What kind of delivery guarantees does the broker make?
Persistence: Are messages persisted to disk and able to survive broker crashes?
Durability: If a consumer reconnects to the message broker, will it receive the messages that were sent while it was disconnected?
Scalability: How scalable is the message broker?
Latency: What is the end-to-end latency?
Competing consumers: Does the message broker support competing consumers?
Kafka | RabbitMQ
A distributed streaming platform that works on the pub-sub model. | A message broker that works on pub-sub and queue-based models.
No out-of-the-box support for retries and DLQ. | Supports retries and DLQ out of the box (via DLX).
Consumers can't filter messages specifically. | Topic exchanges and headers exchanges allow consumer-based message filtering.
Messages are retained until their retention period expires. | Messages are removed once they are consumed and acknowledged.
Does not support scheduled or delayed message routing. | Supports scheduled and delayed routing of messages.
Scales horizontally. | Scales vertically.
Pull-based approach. | Push-based approach.
Supports event replay with consumer groups. | No event replay.

How to maintain ordering of message in Kafka active active site

I have a business requirement to maintain messages across an active-active site, and I am planning to use Kafka for this.
The producer puts messages into JMS/MQ, from which they are consumed by Kafka.
So when a producer places a batch of 1 million messages in MQ/JMS, is it possible to maintain the sequence of messages in a geographically distributed active-active Kafka cluster?
(assuming we are having one partition and one consumer per topic)
Thanks in advance
Yes, the order of messages within a partition of a topic is preserved. Between different partitions or topics there are no guarantees. So if your entire batch is sent to the same single-partition topic by one producer, the order will be preserved. There are some configuration nuances you should be aware of: for instance, the ordering guarantee will not hold if max.in.flight.requests.per.connection is greater than 1, retries are enabled, and idempotence is off. The defaults, however, are safe. For more details look for "max.in.flight.requests.per.connection" in https://kafka.apache.org/documentation/#configuration
If your setup has redundant producers with failover, then you may want to consider enabling idempotence.
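To see why retries with more than one request in flight can reorder messages, here is a toy simulation. It is only a sketch under simplifying assumptions: the deliver function is made up, batches are processed in fixed windows, a failed batch is retried exactly once, and idempotence is assumed to be off.

```python
from typing import Iterable, List, Set


def deliver(batches: Iterable[int], max_in_flight: int, fail_first_attempt: Set[int]) -> List[int]:
    """Return the order in which batches land in the partition log.

    Toy model (assumption): the producer keeps up to max_in_flight batches
    outstanding; a batch listed in fail_first_attempt fails once and is
    re-sent after the batches that were in flight alongside it.
    """
    log: List[int] = []
    pending = list(batches)
    failed_once: Set[int] = set()
    while pending:
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries = []
        for batch in window:
            if batch in fail_first_attempt and batch not in failed_once:
                failed_once.add(batch)
                retries.append(batch)  # will be re-sent in the next round
            else:
                log.append(batch)      # written to the log
        pending = retries + pending
    return log


print(deliver([1, 2, 3], max_in_flight=2, fail_first_attempt={1}))  # [2, 1, 3]
print(deliver([1, 2, 3], max_in_flight=1, fail_first_attempt={1}))  # [1, 2, 3]
```

With two requests in flight, batch 2 is written while batch 1 is still waiting for its retry, so the log ends up out of order. With one request in flight, nothing can overtake the retried batch.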

Kafka: Why broker isn't pull based like consumers

I was reading the Kafka docs, where it was mentioned that:
Consumers pull data from the broker by requesting it from an offset.
Producers push messages to the broker.
Making Kafka consumers pull-based makes sense: the consumers can drive the pace, and the broker can store the data for a really long time.
However, with producers being push-based, how does Kafka make sure that a speed mismatch between the producer and Kafka won't happen? Also, producers don't have persistence by design. This seems to be a bigger problem when producers and brokers are separated by a high-latency network (the internet).
As a distributed commit log, Kafka solves exactly this (impedance mismatch). You produce your events at the rate at which they occur into Kafka, and then you consume them at the rate at which your application can. The data is persisted in Kafka regardless. If your application needs to consume at a greater rate, you scale it out and partition your topic and consume in parallel. Because the data is persisted the only factor is how fast you want to consume the data.

How does Kafka message processing scale in publish-subscribe mode?

Forgive me, I am a beginner with Kafka. I was reading the Kafka documentation about the difference between traditional messaging systems like ActiveMQ and Kafka.
As the documentation puts it,
traditional messaging systems cannot scale message processing.
Since
Publish-subscribe allows you broadcast data to multiple processes, but
has no way of scaling processing since every message goes to every
subscriber.
This makes sense to me.
But for Kafka, the documentation says it can scale message processing even in publish-subscribe mode. (Please correct me if I am wrong. Thanks.)
The consumer group concept in Kafka generalizes these two concepts. As
with a queue the consumer group allows you to divide up processing
over a collection of processes (the members of the consumer group). As
with publish-subscribe, Kafka allows you to broadcast messages to
multiple consumer groups.
The advantage of Kafka's model is that every topic has both these
properties—it can scale processing and is also multi-subscriber—there
is no need to choose one or the other.
So my question is: how does Kafka do it? I mean scaling the processing in publish-subscribe mode. Thanks.
The main unique features in Kafka that enable scalable pub/sub are:
1- Partitioning individual topics and spreading the active partitions across multiple brokers in the cluster to take advantage of more machines, disks, and cache memory. Producers and consumers often connect to many or all nodes in the cluster, not just a single master node for a given topic/queue.
2- Storing all messages in a sequential commit log and not deleting them when consumed. This leads to more sequential reads and writes, and relieves the broker of keeping track of different copies of messages, deleting individual messages, handling fragmentation, and tracking which consumer has acknowledged which messages.
3- Enabling smart parallel processing by individual consumers and consumer groups, so that each parallel message stream can come from the distributed partitions mentioned in #1, while offloading the offset management and partition assignment logic onto the clients themselves. Kafka scales with more consumers because the consumers do some of the work (unlike most other pub/sub brokers, where the bulk of the work is done in the broker).
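The second and third points can be sketched in a few lines of Python. This is a toy model with made-up class names (PartitionLog, GroupConsumer), not the Kafka client API: the log never deletes on read, and each consumer group keeps its own position.

```python
from typing import List


class PartitionLog:
    """Append-only log: reading a record never removes it."""

    def __init__(self) -> None:
        self.records: List[str] = []

    def append(self, value: str) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset of the new record

    def read(self, offset: int, max_records: int) -> List[str]:
        return self.records[offset:offset + max_records]


class GroupConsumer:
    """One consumer group's view of the log; the position lives on the client side."""

    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int) -> List[str]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for value in ["a", "b", "c"]:
    log.append(value)

fast, slow = GroupConsumer(log), GroupConsumer(log)
print(fast.poll(3))      # ['a', 'b', 'c']
print(slow.poll(1))      # ['a']
print(len(log.records))  # 3
```

The broker side only appends and serves ranges; each group reads at its own pace, and adding another group costs the log nothing beyond the extra reads.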

Why do Kafka consumers connect to zookeeper, and producers get metadata from brokers?

Why is it that consumers connect to zookeeper to retrieve the partition locations? And kafka producers have to connect to one of the brokers to retrieve metadata.
My point is, what exactly is the use of zookeeper when every broker already has all the necessary metadata to tell producers the location to send their messages? Couldn't the brokers send this same information to the consumers?
I can understand why brokers have the metadata, to not have to make a connection to zookeeper each time a new message is sent to them. Is there a function that zookeeper has that I'm missing? I'm finding it hard to think of a reason why zookeeper is really needed within a kafka cluster.
First of all, ZooKeeper is needed only for the high-level consumer. SimpleConsumer does not require ZooKeeper to work.
The main reason ZooKeeper is needed for a high-level consumer is to track consumed offsets and handle load balancing.
Now in more detail.
Regarding offset tracking, imagine the following scenario: you start a consumer, consume 100 messages, and shut the consumer down. The next time you start your consumer you'll probably want to resume from your last consumed offset (which is 100), and that means you have to store the maximum consumed offset somewhere. This is where ZooKeeper comes in: it stores offsets for every group/topic/partition. So the next time you start your consumer, it can ask "hey ZooKeeper, what's the offset I should start consuming from?". Kafka is actually moving towards being able to store offsets not only in ZooKeeper but in other storages as well (for now only ZooKeeper and Kafka offset storages are available, and I'm not sure the Kafka storage is fully implemented).
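The offset-tracking scenario can be sketched like this. OffsetStore and run_consumer are hypothetical names; the store is just an in-memory dict standing in for whatever holds the offsets (ZooKeeper for the old high-level consumer, Kafka's own offsets storage otherwise).

```python
from typing import Dict, List, Tuple


class OffsetStore:
    """Stand-in for the external offset storage, keyed by group/topic/partition."""

    def __init__(self) -> None:
        self._committed: Dict[Tuple[str, str, int], int] = {}

    def commit(self, group: str, topic: str, partition: int, next_offset: int) -> None:
        self._committed[(group, topic, partition)] = next_offset

    def fetch(self, group: str, topic: str, partition: int) -> int:
        # A group with no committed offset starts from the beginning in this sketch.
        return self._committed.get((group, topic, partition), 0)


def run_consumer(store: OffsetStore, log: List[str], group: str, topic: str,
                 partition: int, max_messages: int) -> List[str]:
    """Consume up to max_messages from where the group left off, then commit."""
    start = store.fetch(group, topic, partition)
    batch = log[start:start + max_messages]
    store.commit(group, topic, partition, start + len(batch))
    return batch


log = [f"m{i}" for i in range(250)]
store = OffsetStore()

first = run_consumer(store, log, "billing", "orders", 0, max_messages=100)
second = run_consumer(store, log, "billing", "orders", 0, max_messages=100)  # "restart"
print(first[0], first[-1])  # m0 m99
print(second[0])            # m100
```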
Regarding load balancing, the volume of messages produced can be too large for one machine to handle, and you'll probably want to add computing power at some point. Let's say you have a topic with 100 partitions, and to handle this volume of messages you have 10 machines. Several questions arise here:
how should these 10 machines divide the partitions among themselves?
what happens if one of the machines dies?
what happens if you want to add another machine?
And again, this is where ZooKeeper comes in: it tracks all consumers in the group, and each high-level consumer is subscribed to changes in this group. When a consumer appears or disappears, ZooKeeper notifies all consumers and triggers a rebalance so that they split the partitions near-equally (i.e. to balance the load). This guarantees that if one consumer dies, the others will continue processing the partitions that were owned by it.
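The three questions above boil down to recomputing the assignment whenever group membership changes. Here is a minimal sketch using a simple round-robin split; the assign function and machine names are made up, and this is not the exact algorithm of any real Kafka assignor.

```python
from typing import Dict, Iterable, List


def assign(partitions: Iterable[int], consumers: Iterable[str]) -> Dict[str, List[int]]:
    """Split partitions near-equally across the current group members (round-robin)."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


def sizes(assignment: Dict[str, List[int]]) -> List[int]:
    # Distinct numbers of partitions owned per member.
    return sorted({len(owned) for owned in assignment.values()})


partitions = range(100)
machines = [f"machine-{i}" for i in range(10)]

print(sizes(assign(partitions, machines)))                   # [10]
print(sizes(assign(partitions, machines[1:])))               # one machine died: [11, 12]
print(sizes(assign(partitions, machines + ["machine-10"])))  # one machine added: [9, 10]
```

Every rebalance just calls the same function with the new member list, so the partitions of a dead consumer are picked up by the survivors.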
With Kafka 0.9+ the new Consumer API was introduced. New consumers do not need a connection to ZooKeeper, since group balancing is provided by Kafka itself.
You are right, consumers have not needed to connect to ZooKeeper since the Kafka 0.9 release. The API was redesigned and a new consumer client was introduced:
the 0.9 release introduces beta support for the newly redesigned
consumer client. At a high level, the primary difference in the new
consumer is that it removes the distinction between the “high-level”
ZooKeeper-based consumer and the “low-level” SimpleConsumer APIs, and
instead offers a unified consumer API.
and
Finally this completes a series of projects done in the last few years
to fully decouple Kafka clients from Zookeeper, thus entirely removing
the consumer client’s dependency on ZooKeeper.