I have many servers that publish into one RabbitMQ queue, and three readers that consume from it. The readers are supposed to read a set of messages and treat it as one "unit of work". For example, if READER-1 reads msg-1 and msg-2, it needs to mark these two messages as one group, so that if it dies before committing them, one of the two remaining readers gets both msg-1 and msg-2. The default behavior would instead be that msg-1 is read by READER-2 and msg-2 by READER-3.
Once a set of messages is marked as one group, it should be sent to only one reader.
So, is there a way for a READER to mark a set of read msgs as one group?
Is it okay to batch 100 messages into a single object and send that object to Kafka, or should I split those 100 messages into individual messages and then put them in Kafka?
Say, for example, I have an object that contains a List. I can put 100 strings in that list and send the object to Kafka. Is it better to do it that way, or should I split the list and send the individual strings to Kafka instead?
What are some pros and cons of the above approaches?
Batching is generally good for asynchronous processing, until you need to partially process a batch because of errors.
If you are processing an order and the list of 100 entries is its items, send them together, as they will be processed together. If you are sending 100 orders and will process them independently, process them one by one, as an error in one order should not block the others.
As for message sizes, Kafka has message size limits, but these are configurable.
You definitely need to improve your question.
Suppose you want to send a huge message that is larger than the maximum message size configured for your Kafka broker (message.max.bytes on the broker, max.message.bytes at the topic level), and let's assume you can't change it. You break it down and put it back together on the consumer side.
This requires working around the limitations of your Kafka deployment. For example:
Should your consumer process all these 100 strings as if they were one batch? When should your consumer commit the offsets for these messages? Is your consumer's processing idempotent? Do you have one consumer or multiple consumer instances? What if the 100 strings were split across 5 partitions? Which consumer gets which subset of these 100 strings?
One approach is to create 100 messages, all with the same batch id, like so:
(batch1:message1, batch1:message2, batch1:message3)
On the consumer side, collect all the messages with the same key:
(batch1: (message1, message2, message3))
But how would you know when the batch ends? Does the order message1, message2, message3 matter?
So you do something like this:
(batch1:message1of3, batch1:message2of3, batch1:message3of3)
Now what if you received message1of3 and message2of3 but not message3of3? How long do you wait for it?
As you can see, at each step there are multiple ways to go about this, and you will have to make the choices that are right for your problem. Perhaps you will use timeouts; perhaps in your case batches are interleaved like this:
(batch1:message1of3, batch2:message2of5, batch1:message2of3...)
Expect to make some compromises. With Kafka, your consumer group is guaranteed to receive all messages, and while it is running, any consumer is assigned one or more partitions (meaning a single partition is never assigned to more than one consumer of the group at the same time). Kafka will also send messages with the same key to the same partition. With these two properties in mind, you can design a system that consumes messages in batches, with some obvious trade-offs and limitations.
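To make the "N of M" idea concrete, here is a minimal sketch of the consumer-side bookkeeping in plain Python. It does not use any Kafka client; the class and method names (BatchAssembler, add, expire) and the timeout policy are assumptions made for illustration, and a real consumer would still have to decide when to commit offsets.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingBatch:
    total: int                 # the M in "N of M"
    first_seen: float          # when the first part of this batch arrived
    parts: Dict[int, str] = field(default_factory=dict)


class BatchAssembler:
    """Collects 'N of M' parts per batch id (hypothetical helper, not a Kafka API)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.pending: Dict[str, PendingBatch] = {}

    def add(self, batch_id: str, index: int, total: int, payload: str,
            now: Optional[float] = None) -> Optional[List[str]]:
        """Store one part; return the ordered batch once every part has arrived."""
        now = time.monotonic() if now is None else now
        batch = self.pending.setdefault(batch_id, PendingBatch(total, now))
        batch.parts[index] = payload  # a re-delivered part just overwrites itself
        if len(batch.parts) < batch.total:
            return None
        del self.pending[batch_id]
        return [batch.parts[i] for i in sorted(batch.parts)]

    def expire(self, now: Optional[float] = None) -> List[str]:
        """Drop batches that have waited longer than the timeout; return their ids."""
        now = time.monotonic() if now is None else now
        stale = [batch_id for batch_id, batch in self.pending.items()
                 if now - batch.first_seen > self.timeout_seconds]
        for batch_id in stale:
            del self.pending[batch_id]
        return stale
```

Interleaved batches are handled because parts are stored per batch id, and out-of-order parts are sorted by index before the batch is returned. What to do with an expired batch (drop it, retry, send it to a dead-letter topic) is exactly the kind of choice you have to make for your own problem.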
How can I produce/consume delayed messages with Apache Kafka? It seems that standard Kafka (and the Java kafka-client) doesn't have this feature. I know that I could implement it myself with a standard wait/notify mechanism, but that doesn't seem very reliable, so any advice and good practices are appreciated.
I found a related question, but it didn't help.
As I understand it, Kafka is based on sequential reads from the file system and can only be used to read topics straight through, preserving message order. Am I right?
Indeed, Kafka's lowest-level structure is a partition, which is a sequence of events in a queue with incrementing offsets - you can't insert a record anywhere other than at the end, at the moment you produce it. There is no concept of delayed messages.
What do you want to achieve exactly?
Some possibilities in your case:
You want to push a message at a specific time (for example, a "start job" event). In this case, use a scheduled task (not from Kafka; use some standard mechanism of your OS / language / custom app / whatever) to send the message at the given time - consumers will receive it at the proper time.
You want to send an event now, but consumers should not act on it yet. In this case, you can use a custom structure that includes a "time" in its payload. Consumers will have to understand this field and have custom processing to deal with it. For example: "start job at 2017-12-27T20:00:00Z". You could also use headers for this, but headers are not supported by all clients for now.
You can change the timestamp of the message sent. Internally, it would still be read in order, but some time-based functions would behave differently, and the consumer could use the message timestamp to decide when to act - this is much like the previous option, except that the timestamp is metadata of the event rather than part of the event payload. I would not use this personally - I only deal with timestamps when I proxy some events.
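For the second and third options, the consumer needs somewhere to hold events whose time has not come yet. A minimal sketch of such a holding buffer, in plain Python with no Kafka client involved, could look like this. DelayedBuffer and its methods are hypothetical names, and times are plain numbers (for example epoch seconds parsed from the payload or the record timestamp).

```python
import heapq
from typing import List, Tuple


class DelayedBuffer:
    """Holds events until their due time (hypothetical helper, not a Kafka API)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker: keeps arrival order for equal due times

    def add(self, due_at: float, event: str) -> None:
        """Buffer an event that must not be processed before due_at."""
        heapq.heappush(self._heap, (due_at, self._counter, event))
        self._counter += 1

    def pop_due(self, now: float) -> List[str]:
        """Return every buffered event whose due time has passed, earliest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

The buffer lives in memory, so a real consumer would also have to avoid committing offsets for events it is still holding, or persist them somewhere; that part is left out here.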
For your last question: basically, yes, but with some notes:
Topics are actually split into partitions, and order is only preserved within a partition. All messages with the same key are sent to the same partition.
Most of the time you only read from memory, except when you read old events - in that case, as they are read sequentially from disk, this is still very fast.
You can choose where to begin reading - a given offset or a given time - and even change it at runtime.
You can parallelize reads across processes - multiple consumers can read the same topics without ever reading the same message twice (each reads different partitions, see consumer groups).
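The first note (same key, same partition, order kept only within a partition) can be illustrated with a small simulation. This is a sketch, not Kafka's partitioner: crc32 is used here as a stand-in hash, and partition_for and route are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. crc32 is a stand-in; Kafka's default partitioner uses a different hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: List[Tuple[str, str]], num_partitions: int) -> Dict[int, List[str]]:
    """Append each (key, value) to its partition, preserving send order within a partition."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append(value)
    return dict(partitions)
```

Because the partition depends only on the key, all values sent with the key "order-1" end up in one partition in the order they were sent, while values with other keys may land elsewhere and interleave freely.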
I am trying to wrap my head around Kafka consumers, and I'd like to know if the following use case can be solved using Kafka.
My use case is basically this one:
I have a stream that I'd like to be consumed in sync by several consumers. In other words, a first consumer starts to consume the stream, then another consumer arrives later. I'd like this second consumer to start consuming the stream at the offset where the first consumer currently is.
I know that I need to have the consumers in two different groups. But it is not clear to me:
how, or whether, it is possible to coordinate the groups' offsets
whether I should expect latency from such a coordination task
You do not need two different groups; all consumers can check one topic, or as many as they like, for that matter.
offset
Messages are typically identified by their arrival time, so all a client needs to tell the broker is "my last visit was at 10:00, give me all new messages". Each client therefore only needs to keep track of when it last checked each individual topic.
latency
This is kind of out of scope at this point. Of course there will be latency, but it depends on the environment: how many consumers, how many topics, the message format, etc.
So, can your use case be solved using Kafka?
In short: yes. As for "can one consumer continue where another has left off": the consumers could exchange the latest index with each other, though of course that would require some internal synchronization. Kafka does not synchronize one consumer's position with another's, so it will not do this for you. You need to do the work. Another possibility would be to actually consume the messages (that is, delete them from the queue once consumed), so each time another consumer hits the queue it is guaranteed to receive the messages from where the other consumer left off. Of course that depends on your use case: can you actually delete your messages from the queue?
This is not a problem that Kafka handles directly (a consumer group exists to distribute partitions among its members, not to give them the same offset), but you can do something about it. You could simply create another topic, where consumer1 would post either the offset or a copy of each message it has read (so it would need to be both a consumer and a producer), and your other synchronized consumers would react to this - of course there would be some latency.
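The "post the offset to another topic" variant can be sketched as follows. This is a plain-Python simulation under stated assumptions: a list stands in for the topic partition, on_leader_offset stands in for reading the extra control topic, and FollowerConsumer is a hypothetical name. It only shows the rule that the follower never reads past what consumer1 has announced.

```python
from typing import List


class FollowerConsumer:
    """Reads a log but never passes the offset the leader last announced (hypothetical sketch)."""

    def __init__(self, log: List[str]) -> None:
        self.log = log          # stand-in for a topic partition
        self.position = 0       # next offset this follower will read
        self.allowed_up_to = 0  # exclusive bound announced by the leader

    def on_leader_offset(self, next_offset: int) -> None:
        """Called for each message on the control topic; the bound only moves forward."""
        self.allowed_up_to = max(self.allowed_up_to, next_offset)

    def poll(self) -> List[str]:
        """Return the records the follower is now allowed to consume."""
        end = min(self.allowed_up_to, len(self.log))
        records = self.log[self.position:end]
        self.position = max(self.position, end)
        return records
```

The latency mentioned above shows up here as the gap between consumer1 reading a record and the follower seeing the corresponding announcement.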
What is your use case behind this? Why can't you consume at different offsets? Couldn't you rather have one consumer, which would then dispatch each message it reads to different processes, so that they are indeed synchronized (with no latency)?
What do you mean by synchronized: should consumer2 (and 3 and more) only consume the same messages as consumer1 (i.e. they can't consume faster, which is what I assumed in both previous solutions)? While this is possible, it would really be better to know the reason behind this; maybe there is a better way for you to process the data.
Let's say there is a batch API for performing tasks, List[T]. In order to do the job, all the tasks need to be pushed to Kafka. There are two ways to do that:
1) Pushing the List as a single message to Kafka
2) Pushing each individual task T to Kafka
I believe approach 1 would be better, since I don't have to push messages to Kafka multiple times for a single batch call. Can someone please tell me if there is any harm in such an approach?
A Kafka producer can batch together individual messages sent within a short time window (the particular config is linger.ms), so the cost of sending individual messages is probably a lot lower than you think.
Probably a more important factor to consider is how the consumer is going to consume messages. What should happen if the consumer cannot process one of the tasks, for example? If the consumer is just going to call some other batch-based API which succeeds or fails as a batch, then a single message containing a list of tasks would be a perfectly good fit. On the other hand, if the consumer ultimately has to process tasks individually, then sending individual messages is probably a better fit, and will probably save you from having to implement some sort of retry logic in your consumer, because you can probably configure Kafka to behave with the semantics you need.
Starting from Kafka v0.11 you can also use transactions in the producer to publish your entire batch atomically, i.e. you begin the transaction, publish your tasks message by message, and finally commit the transaction. Even though the messages may be sent to Kafka in multiple batches, they only become visible to consumers once you commit the transaction, as long as your consumers are running in read-committed mode (isolation.level=read_committed).
Option 1 is the preferred method in Kafka so long as the entire batch should always stay together. If you publish a List of records as a batch then they will be stored as a batch, they will be (optionally) compressed as a batch yielding better compression, and they will be fetched by consumers as a batch yielding fewer fetch requests.
If you send individual messages then you will have to give them a common key or they will get spread out over different partitions and possibly be sent out of order, or to different consumers of a consumer group.
I'm new to messaging and a little unclear as to whether it is possible for MSMQ to deliver out-of-order messages for transactional queues. I suppose it must be because if a message is not processed correctly (and since we will be using multiple "competing consumers"), then other consumers could continue to process messages while the failed message is placed back on queue. Just can't seem to find a black-and-white answer anywhere on this.
Black-and-white negative answers are hard to find (they often don't exist).
I think you are confusing two terms here. Delivery is from the sender to the queue; consuming is from the queue to the consumer. Those two actions can't be put in the same transaction. They are totally separate actions (this is one of the points of queuing).
More to the point: from "Microsoft Message Queuing Services (MSMQ) Tips"
That these messages will either be sent together, in the order they were sent, or not at all. In addition, consecutive transactions initiated from the same machine to the same queue will arrive in the order they were committed relative to each other.
This is the only ordering guarantee in MSMQ.
Sadly, you won't find anything about ordered consuming because it's not relevant. You can consume messages from MSMQ any way you want.
Update: If you must have ordered processing, then I don't see the reason to use many consumers. You will have to implement the ordering in your code.
Do your messages need to be processed in order because:
1) They are different steps of a workflow? If so, you should create different queues to handle the different steps. Process 1 reads Queue 1, does its thing, then writes to Queue 2, and so forth.
2) They have different priorities? If the priority levels are fairly coarse (and the order of messages within priorities doesn't matter), you should create high-priority and low-priority queues. Consumers read from the higher priority queues first.
3) A business rule specifies it. For example, "customer orders must be processed in the order they are received." Message queues are not appropriate for this kind of sequencing since they only convey the order in which messages are received. A process that periodically polls a database for an ordered list of tasks would be more suitable.
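For case 2, the "read from the higher priority queue first" rule is simple enough to sketch. The example below uses in-memory deques as stand-ins for the two queues; next_message is a hypothetical helper, not an MSMQ API.

```python
from collections import deque
from typing import Deque, Optional


def next_message(high: Deque[str], low: Deque[str]) -> Optional[str]:
    """Take from the high-priority queue first; fall back to low priority only when it is empty."""
    if high:
        return high.popleft()
    if low:
        return low.popleft()
    return None
```

Note that with this rule a steady stream of high-priority messages can starve the low-priority queue, which is acceptable only if that matches what the priorities are meant to express.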