How to maintain ordering of message in Kafka active active site - apache-kafka

I have a business requirement of maintaining messages in active active site, i am planning to use kafka for the same.
The producer puts messages into JMS/MQ, which will be consumed by KAFKA.
So when a batch message of 1 million messages are placed in MQ/JMS by producer, Is it possible to maintain the sequence of message in geographically distributed active-active kafka cluster?
(assuming we are having one partition and one consumer per topic)
Thanks in advance

Yes, the order of messages per partition of a topic is preserved. Between different topics there are no guarantees. So if your entire batch is sent to the same single-partition topic by one producer, yes the order will be preserved. There are some nuances of the configuration that you should be aware of, for instance the ordering guarantee will not hold if max inflight requests per connection > 1 and retries are enabled. The defaults, however, are safe. For more details look for "max.in.flight.requests.per.connection" in https://kafka.apache.org/documentation/#configuration
If your setup has redundant producers with failover, then you may want to consider enabling idempotence.

Related

Does kafka ensures that every record is processed exactly once if there are multiple servers accessing single topic

I will be having 10 servers which would be accessing single kafka topic. how should i ensure that every server(Consumer) will have distinct records to be processed by the server.
I will be having 10 or more instances of my code running. So every instances will act as a consumer.
Per the Kafka Consumer Group protocol, you're guaranteed that no two partitions will be shared amongst the same consumer group.
This means if there are 10+ partitions on this single topic, and each of the 10 server shares the same group.id consumer setting, then every server is getting distinct events, though not necessarily unique events
If you have less than 10 partitions in the topic, then you have idle servers reading nothing until one of the other servers crashes.
Regarding exactly once, you must disable auto commits, manage offset management yourself, and look at the documentation around transactional producers and look at the documentation around the isolation.level consumer setting... Otherwise, you're going to get at least once delivery

Kafka message partitioning by key

We have a business process/workflow that is being started when initial event message is received and closed when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the order of the messages that come to specific process has to be processed by the same order messages were received. If one of the messages fails, the process has to freeze until the problem is fixed, despite that all other processes has to continue. For this kind of situation i am thinking of using Kafka. first solution that came to my mind was to use Topic partitioning by message key. The key of the message would be the ProcessId. This way i could be sure that all process messages would be partitioned and kafka would guarantee the order. As i am new to Kafka what i managed to figure out that partitions has to be created in advance and that makes everything to difficult. so my questions are:
1) when i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
2) there can be more than 100,000 active partitions on the topic, is that a problem?
3) can partition be deleted after all messages from that topic were read?
4) maybe you can suggest other approaches to my problem?
When i produce message to kafka's topic that does not exist, the topic is created on runtime. Is it possible to have same behavior for topic partitions?
You need to specify number of partitions while creating topic. New Partitions won't be create automatically(as is the case with topic creation), you have to change number of partitions using topic tool.
More Info: https://kafka.apache.org/documentation/#basic_ops_modify_topi
As soon as you increase number of partitions, producer and consumer will be notified of new paritions, thereby leading them to rebalance. Once rebalanced, producer and consumer will start producing and consuming from new partition.
there can be more than 100,000 active partitions on the topic, is that a problem?
Yes, having this much partitions will increase overall latency.
Go through how-choose-number-topics-partitions-kafka-cluster on how to decide number of partitions.
can partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss and also the remaining data's keys would not be distributed correctly so new messages would not get directed to the same partitions as old existing messages with the same key. That's why Kafka does not support decreasing partition count on topic.
Also, Kafka doc states that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you choose wrong feature to solve you task.
In general, partitioning is used for load balancing.
Incoming messages will be distributed on given number of partition according to the partitioning strategy which defined at broker start. In short, default strategy just calculate i=key_hash mod number_of_partitions and put message to ith partition. More about strategies you could read here
Message ordering is guaranteed only within partition. With two messages from different partitions you have no guarantees which come first to the consumer.
Probably you would use group instead. It's option for consumer
Each group consumes all messages from topic independently.
Group could consist of one consumer or more if you need it.
You could assign many groups and add new group (in fact, add new consumer with new groupId) dynamically.
As you could stop/pause any consumer, you could manually stop all consumers related to specified group. I suppose there is no single command to do that but I'm not sure. Anyway, if you have single consumer in each group you could stop it easily.
If you want to remove the group you just shutdown and drop out related consumers. No actions on broker side is needed.
As a drawback you'll get 100,000 consumers which read (single) topic. It's heavy network load at least.

Implementation of queues using kafka server

I want to implement a queue mechanism using kafka. But could not find anywhere that if it's possible to just peek data from the queue created for any topic without moving forward into it.
I want to read data from the queue and on the basis of different conditions want to remove the existing message or add another message into this queue. Also, is it possible to use a single kafka server from different machines.
I referred to tutorialspoint for learning more about it.
Thanks in advance. Any leads would be appreciated.
Keep in mind that Kakfa scales with multiple partitions per topic, and it doesn't give any ordering guarantee between partitions. So don't use kafka if you want strict ordering. Within a consumer group, if you want n consumers per topic, you need to have atleast n partitions.
Consumers don't remove messages, they commit the offset of a message. Default configuration in most clients is to auto commit offset on read. You can re-insert messages into the topic anytime. But you cannot skip a message and expect to process it later.
You can connect as many machines as you want to a kafka server. Typically, you have multiple servers as a kafka cluster, with replication for fault tolerance.

Maximum subscription limit of Kafka Topics Per Consumer

What is maximum limit of topics can a consumer subscribe to in Kafka. Am not able to find this value documented anywhere.
If consumer subscribes 500000 or more topics, will there be downgrade in performance.
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
To be technical the "maximum" number of topics you could be subscribed to would be constrained by the available memory space for your consumer process (if your topics are listed explicitly then a very large portion of the Java String pool will be your topics). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive).
Another consideration is how the Topic assignment data structures are setup at Group Coordinator Brokers. They could run out of space to record the topic assignment depending on how they do it.
Lastly, which is the most plausible, is the available memory on your Apache Zookeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, which is constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
Consumer is fairly independent entity than Kafka cluster, unless you are talking about build in command line consumer that is shipped with Kafka
That said logic of subscribing to a kafka topic, how many to subscribe to and how to handle that data is upto the consumer. So scalability issue here lies with consumer logic
Last but not the least, I am not sure it is a good idea to consumer too many topics within a single consumer. The vary purpose of pub sub mechanism that Kafka provides through the segregation of messages into various topics is to facilitate the handling of specific category of messages using separate consumers. So I think if you want to consume many topics like few 1000s of them using a single consumer, why divide the data into separate topics first using Kafka.

kafka log deletion and load balancing across consumers

Say a consumer does a time intensive processing. In order to scale consumer side processing, i would like to spawn multiple consumers and consumer messages from kafka topic in a round robin fashion. Based on the documentation, it seems like if i create multiple consumers and add them in one consumer group, only one consumer will get the messages. If i add consumers to different consumer groups, each consumer will get the same message. So, in order to achieve the above objective, is the only solution to partition the topic ? This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design. Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer. Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
Second if i choose "cleanup.policy" to compact, does it mean that kafka log will keep increasing as it will maintain the latest value for each key? If not, how can i get log deletion and compaction?
UPDATE:
It seems like i have two options to achieve scalability on consumer side, which are independent of topic scaling.
Create consumer groups and have them consume odd and even offsets. This logic would have to be built into the consumers to discard un-needed messages. Also doubles the network requirements
Create a hierarchy of topics, where the root topic gets all the messages. Then some job classifies the logs and publish them again to more fine grained topics. In this case, the strong ordering can be achieved at root and more fine grained topics for consumer scaling can be constructed.
In 0.8, kafka maintains the consumer offset, so publishing messages in a round robin across various consumers is not a too far fetched requirement from their design.
Partitions are the unit of parallelism in Kafka by design. Not just for consumtion but kafka distributes the partiotions accross cluster which has different other benifits like sharing load among different servers, replication management for ensuring no Data loss, managing log to scale beyond a size that will fit on a single server etc.
Ordering of messages is a key factor as if you do not need a storng ordering then diving topics with multiple partitions will allow you to evenly distribute the load while producing (this will be handled by the producer itself). And while using consumer group you just need to add more consumer instances in the same group in order to consume them parallely.
Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
True,from the doc
However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.
Maintaining ordering whiile consuming in distributed manner requires the messaging system to maintain per-message state to keep track of message acknowledgement. But this will involve a lot of expensive random I/O in the system. So clearly there is a trade-off.
Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer
Distributing messages across partitions is typically handled by the producer it self without any intervention from the programmers end (assuming you don't want to categories messages using key). And for the consumers as you just mentioned here the better choice would be to use Simple/Low level consumers which will allow you to consume only a subset of the partitions in a topic.
This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design
I believe for a system like Kafka which focuses on high throughput ( handle hundreds of megabytes of reads and writes per second from thousands of clients ), ensuring scalability and strong durability and fault-tolerance guarantees might not be a good fit for someone having totally a different business requirements.
Topic partitioning is primarily a way to scale out consumers and brokers so if you need many consumers to keep up then you need to partition the topic and add multiple consumer instances in the same consumer group. The producer API will manage partitions transparently. If you need to have certain consumers subscribing only to some partitions, then you need to use the simple consumer API instead of the high level API and in this case you don't have the consumer group concept and have to coordinate consumption yourself.
Message ordering is guaranteed within partitions but not between partitions so if this is a requirement it needs to be dealt with on consumer side.
Setting cleanup.policy=compact means that the Kafka brokers will keep the latest version of a message key indefinitely and use cases like that should be more for recording of data updates for things you intend to keep around rather than the log stream buffering use case.
You need to factor out the reading of Kafka messages from the subsequent processing of those messages. You can use partitions and consumer groups to make reading messages as fast as possible, but if you process the messages as part of your consumer logic then you'll just slow down your consumers. By streaming the messages from consumers to other classes that will perform your processing you can adjust the parallelism of the consumers and of the processors independently. You'll see this approach in technologies like Spark and Storm.
This approach does add one complication and that is that the consumer has to commit the message offset before the message has been processed. You may have to track the messages in flight to insure execute-exactly-once.