I'm working with Kafka 0.9. Is there a way to retrieve a message that has already been processed from its topic, given its partition and offset? For example, the consumer is currently consuming the message at partition 1, offset 10, and I want to get the message at the same partition at offset 5.
One way that I can think of is to reset the offset to 5 and consume one single message. But the poll() method can only return a batch of messages. So I have to take the first message and disregard the others. After processing the message, the offset is reset back.
I think this will work, but I would still like to know if there is a more elegant way of doing it.
Kafka is designed to read long stripes of data off the disk without moving the disk heads around; in other words, it is optimized for linear reads. It seems inefficient to disregard a whole chunk of data you had to read off disk (and possibly serve over the network), but it is actually a lot more inefficient to make the disk head jump around. Kafka's design documentation covers its design philosophy and its use of disks.
In other words, your approach probably works. But you are thinking more like the way someone uses a relational database, not a messaging system.
You should be able to use the "seek" method to read the message at the offset you require.
Take a look at the "Controlling the Consumer's Position" section of the KafkaConsumer Javadoc:
https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
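The seek-then-poll idea from the question can be sketched as follows. This is only an illustration under stated assumptions: `read_one` is a hypothetical helper, and it expects a consumer object shaped like kafka-python's `KafkaConsumer` (`position`, `seek`, and `poll(timeout_ms, max_records)`). `InMemoryConsumer` is a made-up stand-in for a single partition so the sketch runs without a broker. The 0.9 Java client has no per-poll record limit, so there you would take the first record of the returned batch instead.

```python
from collections import namedtuple

Record = namedtuple("Record", "offset value")


class InMemoryConsumer:
    """Hypothetical stand-in for one assigned partition; not a real Kafka client."""

    def __init__(self, tp, values):
        self._tp = tp
        self._log = [Record(i, v) for i, v in enumerate(values)]
        self._position = 0

    def position(self, tp):
        return self._position

    def seek(self, tp, offset):
        self._position = offset

    def poll(self, timeout_ms=0, max_records=None):
        batch = self._log[self._position:]
        if max_records is not None:
            batch = batch[:max_records]
        self._position += len(batch)
        return {self._tp: batch} if batch else {}


def read_one(consumer, tp, offset, timeout_ms=1000):
    """Return the record at `offset` on `tp`, then restore the previous position."""
    saved = consumer.position(tp)
    consumer.seek(tp, offset)
    try:
        batch = consumer.poll(timeout_ms=timeout_ms, max_records=1).get(tp, [])
        # If the offset no longer exists (retention, compaction), the first
        # record returned has a different offset, so report "not found".
        if batch and batch[0].offset == offset:
            return batch[0]
        return None
    finally:
        consumer.seek(tp, saved)
```

Restoring the saved position in `finally` means the normal consume loop continues from where it was, whether or not the lookup succeeded.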
I'm looking to try out using Kafka for an existing system, to replace an older message protocol. Currently we have a number of types of messages (hundreds) used to communicate among ~40 applications. Some are asynchronous at high rates and some are based upon request from user/events.
Now looking at Kafka, it breaks things out into topics, partitions, etc. But I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, giving hundreds of topics, or do I group related message types together? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I also have a case where there will be up to 10 copies of a single application (a display), all of which receive a very large amount of data (in the form of a lightweight video stream of sorts) and send out user commands on each particular node. Would Kafka be a sufficient form of communication for this? Assume at most 10 instances, and that at times these applications may not want to receive the video stream at all.
A third and final question: I read a bit about replay-ability of messages. Is this only within a single topic, or can the replay-ability go over a slew of different topics?
Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your data. Note, however, that the default max message size is just 1MB, so "streaming video/images/media" is arguably the wrong use case for Kafka alone. A protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally, not in response to load. Consumers poll data at the rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, each one can independently seek back.
I am trying to better understand what happens at the level of resources when you create a KStream and a KTable. Below, I will mention some conclusions that I have come to, as I understand them (feel free to correct me).
Firstly, every topic has a number of partitions, and all the messages in those partitions are stored on disk in sequential order.
A KStream does not need to store the messages it reads from a topic in another location, because the offset is sufficient to retrieve those messages from the topic it is connected to.
(Is this correct?)
The question concerns the KTable. As I understand it, a KTable, in contrast with a KStream, updates every message that has the same key. In order to do that, you have to either store the messages that arrive from the topic externally in a static table, or read the whole message queue each time a new message arrives. The latter does not seem very efficient in terms of time. Is the first approach I presented correct?
"read all the message queue, each time a new message arrives."
All messages are only read at a fresh start of the application. Once the app has read up to the latest offset, it's just updating the table like any other consumer.
How disk usage is determined ultimately depends on the state store you've configured for the application, along with its own settings. For example, in-memory vs RocksDB vs an external state store interface that you've written on your own.
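As a conceptual model only (this is not the Kafka Streams API), the table behaviour described above can be pictured as folding a changelog of key/value records into a store: everything is replayed once at startup, and afterwards each new record is a single update. The function name `materialize` and the plain `dict` standing in for the state store are assumptions made for illustration; a record with a `None` value is treated as a tombstone that deletes the key.

```python
def materialize(changelog, table=None):
    """Apply (key, value) records in order; the latest value per key wins."""
    table = {} if table is None else table
    for key, value in changelog:
        if value is None:
            # Tombstone: the key is removed from the table.
            table.pop(key, None)
        else:
            table[key] = value
    return table


# Fresh start: replay everything currently in the topic.
table = materialize([("a", 1), ("b", 2), ("a", 3)])

# Afterwards: each newly arrived record only touches its own key.
materialize([("b", None), ("c", 4)], table)
```

The store holds one entry per key rather than one per message, which is why the topic does not have to be re-read when a new message arrives.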
I have a requirement to set the poll batch size to 500 and do a batch commit once 500 messages are processed. For the last set, where there are fewer than 500 messages, I need to commit once the last message in the batch is processed. Is there a way I can find out how many messages were fetched in the poll, if the number of messages left to be processed in the topic happens to be less than the poll size?
Streams isn't really set up to support a use case like this, although it's often discussed under the heading of "async processing", and we'd like to design it in the future.
Right now, if you really want to use Streams, your best bet would be to wrap your DB persistence logic inside a custom Processor or Transformer, which would also buffer the records and send the batches when it has enough.
However, if you really just need to "copy" data from a topic into the DB, you might get more mileage out of using a Connector or even just the Kafka Consumer directly.
Hope this helps!
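The buffering suggested above can be sketched roughly like this. `BatchBuffer` and its `flush` callback are hypothetical names, not part of Kafka Streams or the consumer API; in a real application the callback would write to the DB and then commit. The final, smaller batch is handled by calling `flush()` explicitly, for example when a poll comes back empty or the processor is closed. With the plain Java consumer, the size of a poll result is available from `ConsumerRecords.count()`, and `max.poll.records` is only an upper bound on it.

```python
class BatchBuffer:
    """Collect records and hand them to `flush` in groups of `batch_size`."""

    def __init__(self, batch_size, flush):
        self.batch_size = batch_size
        self._flush_fn = flush
        self._pending = []

    def add(self, record):
        self._pending.append(record)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Safe to call at any time; does nothing when the buffer is empty.
        if self._pending:
            self._flush_fn(list(self._pending))
            self._pending.clear()


batches = []
buffer = BatchBuffer(500, batches.append)
for record in range(1200):
    buffer.add(record)
buffer.flush()  # flush the remainder that never reached 500
```

Keeping the "commit" inside the flush callback means the full batches and the short last batch go through the same code path.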
The question:
How can I randomly fetch an old chunk of messages with a given range definition of [partition, start offset, end offset]. Hopefully ranges from multiple partitions at once (one range for each partition). This needs to be supported in a concurrent environment too.
My ideas for solution so far
I guess I can use a pool of consumers for the concurrency and, for each fetch, use Consumer.seek and Consumer.poll with max.poll.records. But this seems wrong. There is no guarantee that I will get exactly the same chunk, for example when a message gets deleted (by log compaction). On the whole, this seek + poll method does not seem like the right fit for a one-time random fetch.
My use case:
Like the typical consumer, mine reads 10MB chunks of messages and processes them.
In order to process that chunk I am pushing 3-20 jobs to different topics, in some kind of workflow.
Now, my goal is to avoid pushing the same chunk into the other topics again and again. It seems better to push a reference to that chunk, e.g. [Topic X / partition Y, start offset, end offset]. Then, when a job is processed, it will fetch the exact chunk again.
Your idea seems fine, and is practically the only solution with the Consumer API. There's nothing you can do once messages are removed between offsets.
If you really needed every single message between each and every possible offset range, then you should consider consuming that data, as it's actively produced, into some externally indexable destination where offset scans are also a common operation. Plenty of Kafka Connectors exist, for lots of databases and filesystems. But the takeaway here is that you might have to reconsider your options for these "reprocessing" jobs.
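To make the chunk-reference idea concrete, here is a rough sketch of the filtering step a job could run after seeking to the start offset and polling past the end offset. `ChunkRef`, `Record`, and `select_chunk` are hypothetical names, and the end offset is assumed to be inclusive. Besides the records in range, it reports offsets that did not come back (for example because compaction or retention removed them), so the job can decide how to react.

```python
from collections import namedtuple

Record = namedtuple("Record", "offset value")
# Reference to a chunk; `end` is treated as inclusive in this sketch.
ChunkRef = namedtuple("ChunkRef", "topic partition start end")


def select_chunk(records, ref):
    """Return (records within the referenced range, offsets that are absent)."""
    found = [r for r in records if ref.start <= r.offset <= ref.end]
    present = {r.offset for r in found}
    missing = [o for o in range(ref.start, ref.end + 1) if o not in present]
    return found, missing


# Records as they might come back from a poll after seeking near the range.
polled = [Record(o, "v%d" % o) for o in (3, 4, 6, 7, 9)]
found, missing = select_chunk(polled, ChunkRef("X", 0, 4, 7))
```

A non-empty `missing` list is the signal that the referenced chunk can no longer be reproduced exactly from the topic.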
I am planning to develop a reliable streaming application based on the direct Kafka API. I will have one producer and one consumer. I wanted to know the best approach to achieve reliability in my consumer. I can employ two solutions:
Increasing the retention time of messages in Kafka
Using writeahead logs
I am a bit confused regarding the usage of write ahead logs in the direct Kafka API, as there is no receiver, but the documentation says:
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
So I wanted to know the best approach: does it suffice to increase the TTL of messages in Kafka, or do I also have to enable write ahead logs?
I guess it would be good practice to avoid relying on just one of the above, since the backup data (retained messages, checkpoint files) can be lost and recovery could then fail.
The Direct approach eliminates the data duplication problem, as there is no receiver and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, the Direct approach supports exactly-once message delivery semantics by default. It does not use Zookeeper; offsets are tracked by Spark Streaming within its checkpoints.