The question:
How can I randomly fetch an old chunk of messages given a range definition of [partition, start offset, end offset]? Ideally I'd like to fetch ranges from multiple partitions at once (one range per partition). This needs to be supported in a concurrent environment too.
My ideas for a solution so far
I guess I can use a pool of consumers for the concurrency and, for each fetch, use Consumer.seek and Consumer.poll with max.poll.records. But this seems wrong: there is no promise that I will get the exact same chunk, for example when a message gets deleted (by log compaction). As a whole, this seek + poll method doesn't seem like the right fit for a one-time random fetch.
My use case:
Like the typical consumer, mine reads a 10MB chunk of messages and processes it.
In order to process that chunk I am pushing 3-20 jobs to different topics, in some kind of workflow.
Now, my goal is to avoid pushing the same chunk into the other topics again and again. It seems to me that it is better to push a reference to that chunk, e.g. [Topic X / partition Y, start offset, end offset]. Then, when the jobs are processed, each one will fetch the exact same chunk again.
Your idea seems fine, and is practically the only solution with the Consumer API. There's nothing you can do once messages are removed between offsets.
If you really needed every single message between each and every possible offset range, then you should consider consuming that data as it's actively produced into some externally indexable destination where offset scans are also a common operation. Plenty of Kafka connectors exist for lots of databases and filesystems. But the takeaway here is that I think you might have to reconsider your options for these "reprocessing" jobs.
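That said, if the occasional gap from compaction or retention is acceptable, the seek + poll approach from the question does work. Below is a minimal sketch of a single range fetch using assign() (so no consumer group coordination is involved, which makes it easy to run one such consumer per pooled worker). It assumes a recent Java client; the class name and properties are illustrative, and the returned list may be shorter than the requested range if offsets were removed.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RangeFetcher {

    // Fetches records in [startOffset, endOffset) from one partition.
    // With compaction or retention some offsets in the range may no longer
    // exist, so the returned list can be smaller than the requested range.
    public static List<ConsumerRecord<String, String>> fetchRange(
            Properties consumerProps, String topic, int partition,
            long startOffset, long endOffset) {

        TopicPartition tp = new TopicPartition(topic, partition);
        List<ConsumerRecord<String, String>> result = new ArrayList<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment: no group rebalancing
            consumer.seek(tp, startOffset);

            while (consumer.position(tp) < endOffset) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    break; // reached the end of the log before reaching endOffset
                }
                for (ConsumerRecord<String, String> record : records.records(tp)) {
                    if (record.offset() >= endOffset) {
                        return result;
                    }
                    result.add(record);
                }
            }
        }
        return result;
    }
}
```

For fetching ranges from multiple partitions in one call, the same idea applies with one assign()/seek() per TopicPartition and a per-partition end-offset check.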
Related
I am trying to design Kafka consumers, and I have hit a roadblock on how to design the process. I am thinking of two options:
1. Process records directly from Kafka.
2. Staging table write from Kafka and process records.
Approach 1: Process Key messages on the go from Kafka:
• Read messages one at a time from Kafka, and break the loop if there are no records to process (the number of messages to process is configurable).
• Execute business rules.
• Apply changes to consumer database.
• Update Kafka offset to read after processing message.
• Insert into staging table (used for PD guide later on)
Questions with above approach:
• Is it OK to subscribe to a partition and keep the lock open on the Kafka partition until the configurable number of messages is processed, and then apply business rules and apply changes to the database? All of this happens in the same process; are there any performance issues with doing it this way?
• Is it OK to manually commit the offset to Kafka? (Are there performance issues with manual offset commits?)
Approach 2: Staging table write from Kafka and process records
Process 1: Consuming events from Kafka and put in staging table.
Process 2: Reading staging table (configurable rows), execute business rules, apply consumer database changes
& update the status of processed records in the staging table. (We may have multiple processes doing this step.)
I see a lot of downside on this approach:
• We are missing the advantage of offset handling provided by Kafka, and we are doing manual updates of processed records in the staging table.
• Locking & blocking on the staging table with multiple instances, as we are trying to insert & then update rows after processing in the same staging table.
(Note: I could design separate tables and move this data there and process it, but that would introduce multiple processes again.)
How can I design Kafka consumers with multiple instances and a huge amount of data to process? Which design is appropriate: is it better to read data on the go from Kafka and process the messages, or to stage them in a table and have another job process those messages?
This is how I think we can get the best throughput without worrying about the loss of messages:
Maximize the number of partitions.
Deploy the consumers (at most the number of partitions, or even fewer if your consumers can operate multi-threaded without any problem).
Read single-threaded from within each consumer (with auto offset commit) and put the messages in a BlockingQueue whose size you can control based upon the number of actual processing threads in each consumer (see the sketch after this list).
If the processing fails, you can retry for success or else put the messages in a dead-letter queue. Don't forget to implement shutdown hooks for processing already-consumed messages.
If you want to ensure ordering like processing events with the same key one after the another or on any other factor from a single partition, you can use a deterministic executor. I have written a basic ExecutorService in Java that can execute multiple messages in a deterministic way without compromising on the multi-threading of logically separate events. Link- https://github.com/mukulbansal93/deterministic-threading
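A minimal sketch of the reader-plus-queue hand-off described above, assuming a recent Java client; the broker address, topic, group id, queue capacity, and pool size are all illustrative. The bounded queue is what provides back-pressure: when the workers fall behind, put() blocks the single reader thread so the consumer naturally slows down.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QueueingConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "my-consumer-group");        // hypothetical group id
        props.put("enable.auto.commit", "true");            // auto offset commit, as described above
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Bounded queue: its capacity throttles the single reader thread
        // so it cannot outrun the processing threads.
        BlockingQueue<ConsumerRecord<String, String>> queue = new ArrayBlockingQueue<>(1000);

        // Worker pool that does the actual (possibly slow) processing.
        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        ConsumerRecord<String, String> record = queue.take();
                        process(record); // business logic; retry or send to a dead-letter topic on failure
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single-threaded read loop feeding the queue.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    queue.put(record); // blocks when the queue is full
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing %s-%d@%d%n", record.topic(), record.partition(), record.offset());
    }
}
```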
To answer your questions:
Is it OK to subscribe to a partition and keep the lock open on the Kafka partition until configurable messages are processed and then apply business rules, apply changes to the database? All happens in the same process, any performance issues doing this way? I don't see many performance issues here, as you are processing in bulk. However, it is possible that one of your consumed messages takes a long time while the others get processed. In that case, you will not read other messages from Kafka, leading to a performance bottleneck.
Is it OK to manually commit the offset to Kafka? (Performance issues with manual offset commit.) This is definitely going to be the lowest-throughput approach, as offset committing is an expensive operation.
The first approach where you consume the data and update a table accordingly sounds like the right way.
Kafka guarantees
At least once: you may get the same message twice.
That means that you need the message handling to be idempotent -> set the amount to x and not add an amount to the previous value (see the sketch below).
Order (per partition): Kafka promises that you consume messages in the same order the messages were produced - per partition. Like a queue per partition.
If, when you say "Execute business rules", you also need to read previous writes, that means you need to process them one by one.
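A tiny sketch of that idempotency distinction, using an in-memory map as a stand-in for your database (the class and names are purely illustrative): replaying the same message must leave the same final state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotencyExample {
    private final Map<String, Long> balances = new ConcurrentHashMap<>();

    // Idempotent: "set amount to x" -- applying the same message twice
    // leaves the same state as applying it once.
    public void applySet(String userId, long amount) {
        balances.put(userId, amount);
    }

    // NOT idempotent: "add amount to the previous value" -- a redelivered
    // message would be double-counted.
    public void applyAdd(String userId, long delta) {
        balances.merge(userId, delta, Long::sum);
    }
}
```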
How to define the partitions
If you define one partition you won't have a problem with conflicts but you will only have one consumer and that doesn't scale.
If you arbitrarily define multiple partitions then you may lose the order.
Why is that a problem? Because concurrent updates to the same entity can conflict, as in the example below.
You need to define the partitions according to your business model:
For example, let's say that every message updates some user's row in the DB. When you process a message you want to read the user row, check some fields, and then update (or not) according to those fields.
That means that if you define the partition by user-id -> (user-id % number of partitions),
you guarantee that you won't have a race condition between two updates on the same user, and you can scale to multiple machines/processes/threads. Each consumer is in charge of some set of users, but it's always the same users.
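On the producer side this usually just means using the user id as the record key; a minimal sketch follows, with an assumed local broker and a hypothetical topic name. (The default partitioner hashes the key rather than computing a literal user-id % partitions, but the effect is the same: every message for a given user lands in the same partition.)

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42"; // hypothetical key
            // Keying by user id pins all of this user's updates to one partition,
            // so they are consumed in order by a single consumer instance.
            producer.send(new ProducerRecord<>("user-updates", userId, "{\"field\":\"value\"}"));
        }
    }
}
```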
The design of your consumer depends on your use case.
If there are other downstream processes that expect the same data but cannot connect to your Kafka cluster, then having a staging table is a good idea.
I think in your case approach 1 with a little alteration is a good way to go.
However, you don't need to break the loop if there are no new messages in the topic.
Also, there's a consumer property (max.poll.records) that configures the number of records you want to poll from Kafka in a single request (default 500); you might want to change it to a lower number if each message takes a long time to process (to avoid timeouts or unwanted rebalancing issues).
Since you mentioned the amount of data is huge, I would recommend having more partitions for concurrency if processing order does not matter for you. Concurrency can be achieved by creating a consumer group with an instance count no greater than the number of partitions for the topic. (If the consumer instance count is more than the number of partitions, the extra instances will be idle.)
If order does matter, the producer should ideally send logically grouped messages with the same message key so that all messages with the same key land in the same partition.
About offset committing: if you sync-commit each message to Kafka you will definitely see a performance impact. Usually the offset is committed for each consumed batch of records, e.g. poll 500 records -> process -> commit the batch of records.
However, if you need to send out a commit for each message, you might want to opt for an async commit.
Additionally, when partitions are assigned to a consumer group instance, the partitions are not locked. Other consumer groups can subscribe to the same topic and consume messages concurrently.
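A minimal sketch of the poll -> process -> commit-per-batch pattern described above, assuming a recent Java client; the broker address, group id, topic, and max.poll.records value are illustrative.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "batch-processors");         // hypothetical group id
        props.put("enable.auto.commit", "false");            // commit manually, once per batch
        props.put("max.poll.records", "100");                 // lower than the default 500 for slow processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record); // business rules + database changes
                }
                if (!batch.isEmpty()) {
                    consumer.commitSync(); // one commit per batch; use commitAsync() for lower latency
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processed %s-%d@%d%n", record.topic(), record.partition(), record.offset());
    }
}
```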
While working to adapt Java's KafkaIOIT to work with a large dataset I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness and at the same time check the performance of KafkaIO.Write and KafkaIO.Read.
To perform the tests I'm using a Kafka cluster on Kubernetes from the Beam repo (here).
The expected result would be that first the records are generated in a deterministic way, next they are written to Kafka - this concludes the write pipeline.
As for reading and correctness checking - first, the data is read from the topic and after being decoded into String representations, a hashcode of the whole PCollection is calculated (For details, check KafkaIOIT.java).
During the testing I ran into several problems:
When the predetermined number of records is read from the Kafka topic, the hash is different each time.
Sometimes not all the records are read and the Dataflow task waits for the input indefinitely, occasionally throwing exceptions.
I believe there are two possible causes of this behavior:
either there is something wrong with the Kafka cluster configuration
or KafkaIO behaves erratically on high data volumes, duplicating and/or dropping records.
I found a Stack answer that I believe might explain the first behavior:
link - if messages are delivered more than once, it's obvious that the hash of the whole collection would change.
In this case, I don't really know how to configure KafkaIO.Write in Beam to produce exactly once.
This leaves the issue of messages being dropped unsolved. Can you help?
As mentioned in the comments, a practical approach would be to start small and see if this is a problem of scaling up.
E.g. starting with 10 messages, and multiplying the number till you see something strange.
Furthermore, one thing that stands out is that you send data to a topic and check the hash after reading from the topic. However, you do not mention partitions; is it possible that you are in fact seeing different results because there are multiple partitions?
Kafka guarantees order within a partition.
I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic created.
I found that iteration is possible in the consumer. Is there any way we can do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
• Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded.
• Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
• Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
• The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon.
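For what it's worth, the first improvement the FAQ mentions has since shipped: Kafka 0.11+ clients and brokers support an idempotent producer, enabled with a single setting, and the "primary key" approach can be as simple as attaching a UUID for downstream deduplication. A minimal sketch, with an assumed local broker and a hypothetical topic name:

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DedupFriendlyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Broker-side deduplication of producer retries (Kafka 0.11+).
        props.put("enable.idempotence", "true");
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "Primary key" approach from the FAQ: attach a UUID so consumers
            // (or a downstream store) can deduplicate replays of the same event.
            String eventId = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>("events", eventId, "payload"));
        }
    }
}
```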
Imagine a scenario where we have 3 partitions belonging to 3 different topics on a machine which runs a kafka process/broker. This broker will receive messages for all three partitions. It will store them on different log subdirectories. My question is how does the kafka broker schedule these writes? How does it decide which partition/topic will be written next?
For ordering over requests, roughly, the broker internally handles produce requests as follows:
There is a number of network threads that pull bytes off the network layer and convert them into internal requests. These requests are then placed in a FIFO request queue, from where the I/O threads pull them and append the contained messages to the relevant partitions. So, in short, messages are processed in the order in which they are received.
Looking through the code, I am unsure whether there may be potential for a race condition here, where a smaller request could "overtake" a larger request that was sent immediately before it. However, even if this were possible, it is an extremely unlikely fringe case that I can't see ever occurring for a single producer. Maybe someone with a better understanding of the code can weigh in here?
As for the ordering of batched messages in one request: the request stores messages internally in a HashMap keyed by TopicPartition. Since, as far as I am aware, a Scala HashMap does not preserve the insertion order of its elements, I don't think there are any guarantees around the order in which multiple partitions in one request get processed - which is fine, as ordering is only guaranteed to be preserved within a partition.
Within each partition, messages are processed in the order they were given to the producer before sending.
I'm working on Kafka 0.9. I'm wondering if there is any approach to retrieve a message, which has been processed, from its topic by knowing the partition and offset. For example, the consumer is currently consuming the message at partition 1 and offset 10. And I want to get the message at the same partition and offset 5.
One way that I can think of is to reset the offset to 5 and consume one single message. But the poll() method can only return a batch of messages. So I have to take the first message and disregard the others. After processing the message, the offset is reset back.
I think this will work. But still want to know if there is any other elegant way of doing it.
Kafka is designed to read long stripes of data off of the disk without moving the disk heads around -- in other words, it is optimized for linear reads. It seems inefficient to disregard a whole chunk of data you had to read off of disk (and possibly serve over the network), but it is actually a lot more inefficient to make the disk head jump around a lot. Check out Kafka's design philosophy, and its use of disks, here.
In other words, your approach probably works. But you are thinking more like the way someone uses a relational database, not a messaging system.
You should be able to use the "seek" method to read the message from the offset you require.
Take a look at the "Controlling the Consumer's Position" section:
https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
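A minimal sketch of that seek-based fetch, using a separate short-lived consumer with assign() so the main consumer's position is never touched (which avoids the "reset the offset back" step from the question). It is written against a newer Java client; on the 0.9 client in the question you would use the poll(long) overload instead of poll(Duration). Names and timeouts are illustrative.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SingleMessageFetcher {

    // Returns the record at the given partition/offset, or null if it is no
    // longer available (e.g. removed by retention or compaction).
    public static ConsumerRecord<String, String> fetchOne(
            Properties props, String topic, int partition, long offset) {

        TopicPartition tp = new TopicPartition(topic, partition);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment: no consumer group involved
            consumer.seek(tp, offset);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records.records(tp)) {
                if (record.offset() == offset) {
                    return record;
                }
            }
        }
        return null;
    }
}
```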