I'm new to messaging and a little unclear as to whether it is possible for MSMQ to deliver out-of-order messages for transactional queues. I suppose it must be because if a message is not processed correctly (and since we will be using multiple "competing consumers"), then other consumers could continue to process messages while the failed message is placed back on queue. Just can't seem to find a black-and-white answer anywhere on this.
Black-and-white negative answers are hard to find (they often don't exist).
You are confusing two terms here (I think). Delivery is from the sender to the queue. Consuming is from the queue to the consumer. Those two actions can't be put in the same transaction. They are totally separate actions (this is one of the points of queuing).
More to the point: from "Microsoft Message Queuing Services (MSMQ) Tips"
That these messages will either be sent together, in the order they were sent, or not at all. In addition, consecutive transactions initiated from the same machine to the same queue will arrive in the order they were committed relative to each other.
This is the only ordering guarantee in MSMQ.
Sadly, you won't find anything about ordered consuming because it's not relevant. You can consume messages from MSMQ any way you want.
Update: If you must have ordered processing, then I don't see the reason to use many consumers. You will have to implement the ordering in your code.
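One way to "implement the ordering in your code" is a small resequencer in front of the processing step. This is only a sketch: it assumes the sender stamps every message with a consecutive sequence number (MSMQ does not do that for you), and the name resequence is made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def resequence(messages: Iterable[Tuple[int, str]],
               first_seq: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (seq, body) pairs in sequence order, buffering early arrivals."""
    pending: Dict[int, str] = {}   # messages that arrived ahead of their turn
    next_seq = first_seq           # the only sequence number we may release next
    for seq, body in messages:
        pending[seq] = body
        # Release every consecutive message that is now available.
        while next_seq in pending:
            yield next_seq, pending.pop(next_seq)
            next_seq += 1


if __name__ == "__main__":
    arrived = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
    for seq, body in resequence(arrived):
        print(seq, body)
```

Note that a message that never arrives blocks everything behind it, so a real version would need some timeout or gap policy.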
Do your messages need to be processed in order because:
1) They are different steps of a workflow? If so, you should create different queues to handle the different steps. Process 1 reads Queue 1, does its thing, then writes to Queue 2, and so forth.
2) They have different priorities? If the priority levels are fairly coarse (and the order of messages within priorities doesn't matter), you should create high-priority and low-priority queues. Consumers read from the higher priority queues first.
3) A business rule specifies it. For example, "customer orders must be processed in the order they are received." Message queues are not appropriate for this kind of sequencing since they only convey the order in which messages are received. A process that periodically polls a database for an ordered list of tasks would be more suitable.
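For option 2, the "read from the higher priority queues first" rule could look roughly like this. The deques stand in for real queues and receive_next is a hypothetical helper, not an MSMQ API.

```python
from collections import deque
from typing import Deque, Optional, Sequence


def receive_next(queues_by_priority: Sequence[Deque[str]]) -> Optional[str]:
    """Return the next message, always draining higher-priority queues first."""
    for q in queues_by_priority:   # ordered from highest to lowest priority
        if q:
            return q.popleft()
    return None                    # every queue is empty


if __name__ == "__main__":
    high = deque(["urgent-1"])
    low = deque(["bulk-1", "bulk-2"])
    print(receive_next([high, low]))   # taken from the high-priority queue
    print(receive_next([high, low]))   # high is empty now, so low is used
```

The trade-off is that a steady stream of high-priority messages can starve the low-priority queue.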
As of now I have a single Kafka topic with 10 partitions. We have 10000 clients who keep dumping uncontrolled data into streams. The problem currently is that:
A tenant floods the topic with little or no notice.
Messages from other tenants then suffer, because their (handful of) messages are queued behind the flood and take several hours to get their turn for processing.
Question:
Can I somehow read maybe 1k messages per tenant and round-robin between tenants, essentially like the fair scheduling of Hadoop YARN?
Can Apache Pulsar help me with this? If yes, is there any example you can point me to?
I went through: https://www.confluent.io/blog/prioritize-messages-in-kafka/ already; but given the volume of clients it may not be practical to have 100k partitions etc.
I'm not aware of any way to get what you want out of the box. You could probably have the consumer pause some partitions to prioritize consumption from the ones with more messages (for example, by checking the lag per partition after every few poll iterations).
I'm not familiar enough with Apache Pulsar to have a clear answer.
I have a similar problem: a single customer can monopolize the resources and delay execution from all other customers, just because their events arrived first.
On a different application with a low volume of messages, we just load all the events in memory, creating an in-memory queue for every customer, then dequeue up to N events from each customer queue and re-queue them into a different queue; let's call it the re-ordered queue. The re-ordered queue has a capacity limit (let's say 100*N), so no additional elements are queued until there is space. This guarantees equal treatment for all customers.
I am facing the same problem now with an application that processes billions of messages. The solution above is impossible there; there is just not enough RAM to keep all the data in memory. Creating a topic for each customer also sounds like overkill, especially if you have a variable set of active customers at any given point in time. Nevertheless, Pulsar seems to handle thousands, even millions, of topics well.
So the technique above may work well for you (and for me).
Just read from thousands of topics, write a limited number of messages to another topic, and then wait for it to have "space" before enqueuing more.
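The in-memory version of this technique can be sketched as below. It is a simplification: plain deques stand in for the per-customer queues or topics, and instead of blocking until there is space, fair_drain (a made-up name) returns once the re-ordered buffer is full so the caller can invoke it again later.

```python
from collections import deque
from typing import Deque, Dict, List


def fair_drain(per_customer: Dict[str, Deque[str]], n: int, capacity: int) -> List[str]:
    """Move up to n events per customer per round into a bounded buffer."""
    reordered: List[str] = []
    # Keep doing rounds while there is space and at least one non-empty queue.
    while len(reordered) < capacity and any(per_customer.values()):
        for events in per_customer.values():
            for _ in range(min(n, len(events))):
                if len(reordered) >= capacity:
                    return reordered       # buffer full: stop, caller retries later
                reordered.append(events.popleft())
    return reordered


if __name__ == "__main__":
    queues = {
        "noisy": deque(["n1", "n2", "n3", "n4", "n5"]),
        "quiet": deque(["q1"]),
    }
    print(fair_drain(queues, n=2, capacity=4))
```

With n=2 the quiet customer's single event is emitted after at most two of the noisy customer's events, regardless of how large the noisy backlog is.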
Let's imagine a simple message processing pipeline, like in the image below:
A group of consumers listens to a topic, picks messages one by one, does some sort of processing and sends them over to the next topic.
Some messages crash the consumer or make it hang forever (so a liveness probe kills the consumer after a timeout).
In this case the consumer is not able to commit the offset, so the malicious message gets picked up by another consumer, which then crashes as well.
Ideally we want to move the message to a dead letter topic after N such attempts.
This can be achieved by introducing a shared storage:
But this creates coupling between the services and introduces a Single Point of Failure (SPOF) which is the shared database.
I'm looking for ideas on how to work around this with stateless services.
If your context is correct with this approach (that's something you should judge, as I'm only trying to give a suggestion), please consider decoupling the consumption and the processing.
In your case, the consumer is stopped not because it was unable to read from Kafka, or because the Kafka broker wasn't able to provide messages, but because the processing of the message was too slow and/or unsuccessful.
The consumer, in fact, was correctly receiving the messages. It was the processing of them that caused it to be declared dead.
First of all, see the KafkaConsumer javadoc block regarding this (just above the constructor summary). The second option is the one quoted here:
2. Decouple Consumption and Processing
Another alternative is to have one or more consumer threads that do
all data consumption and hands off ConsumerRecords instances to a
blocking queue consumed by a pool of processor threads that actually
handle the record processing. This option likewise has pros and cons:
PRO: This option allows independently scaling the number of consumers
and processors. This makes it possible to have a single consumer that
feeds many processor threads, avoiding any limitation on partitions.
CON: Guaranteeing order across the processors requires particular care
as the threads will execute independently an earlier chunk of data may
actually be processed after a later chunk of data just due to the luck
of thread execution timing. For processing that has no ordering
requirements this is not a problem.
CON: Manually committing the position becomes harder as it requires
that all threads co-ordinate to ensure that processing is complete for
that partition.
Essentially, it works like this: the consumer keeps reading and hands the responsibility for processing and for process-timeout management to the processor threads.
Error handling for message processing would be the responsibility of the processor threads as well. For example, if a timeout is hit or an exception occurs, the processor sends the message to your defined "dead" queue, or does whatever else you want, without involving the consumer. Regardless of whether the processor threads succeed or fail, the consumer will continue its job and never be considered dead for not calling poll() within the specified timeout.
You should control the number of messages the consumer retrieves in each poll call so as not to saturate the processors. It's a balance between how fast the processors finish their job, how many messages the consumer retrieves at each iteration (max.poll.records), and the timeout configured for the consumer.
Decoupled workflow
The first element to mention is the queue (with a limited size, which you should also manage so that it doesn't fill up and cause an OOM).
This queue would be the link between the consumer and the processor threads: essentially a buffer that grows or shrinks depending on the workload at any given time. It absorbs overloads, much like a dam or a barrier.
                                         ----->WORKERTHREAD1
KAFKA <------> CONSUMER ----> QUEUE -----|
                                         ----->WORKERTHREAD2
What you get is a second queue-lag mechanism:
1. Kafka Consumer LAG (the messages still to be read from the partition/topic)
2. Queue LAG (received messages still need to be processed)
                                                --->WORKERTHREAD1
KAFKA <--(LAG)--> CONSUMER ----> QUEUE --(LAG)--|
                                                --->WORKERTHREAD2
The queue could be some kind of thread-safe queue, such as a ConcurrentLinkedQueue, for example. Note that ConcurrentLinkedQueue is unbounded, so a bounded blocking queue such as LinkedBlockingQueue or ArrayBlockingQueue enforces the size limit for you. Or you could manage the synchronization yourself with a custom queue.
Essentially, the duties are divided, and the consumer is given the easiest one (as it's the most crucial).
Responsibilities:
Consumer
consume-->send to queue
Workers
read from queue|-->[manage timeout]
|==>PROCESS MESSAGE ==> send to topic
|-->[handle failed messages]
You should also handle the case where processor threads die or deadlock, but those mechanisms are usually already implemented in most thread pool variants.
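The division of duties above can be sketched without a broker. In this toy version an in-memory list stands in for what the Kafka consumer would get from poll(), a bounded queue.Queue is the buffer, and a failed record goes to a dead-letter list instead of a dead-letter topic. run_pipeline and its parameters are made up for the example, and per-record timeouts are left out to keep it short.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple

_STOP = object()  # sentinel telling a worker thread to exit


def run_pipeline(records: Iterable[str], process: Callable[[str], str],
                 workers: int = 2, max_queue: int = 10) -> Tuple[List[str], List[str]]:
    """Feed records through a bounded queue to worker threads.

    Returns (results, dead_letters).
    """
    buffer = queue.Queue(maxsize=max_queue)   # bounded: put() blocks when full
    results: List[str] = []
    dead_letters: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = buffer.get()
            if item is _STOP:
                return
            try:
                out = process(item)
                with lock:
                    results.append(out)
            except Exception:
                # The failure is handled here; the consumer loop never sees it.
                with lock:
                    dead_letters.append(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    # "Consumer" duty: only read and hand off.
    for record in records:
        buffer.put(record)
    for _ in threads:
        buffer.put(_STOP)                     # one sentinel per worker
    for t in threads:
        t.join()
    return results, dead_letters


if __name__ == "__main__":
    def handler(record: str) -> str:
        if record == "bad":
            raise ValueError(record)
        return record.upper()

    print(run_pipeline(["a", "bad", "b"], handler))
```

Because put() blocks when the buffer is full, a slow set of workers eventually slows the consumer loop down too, which is the back-pressure the bounded queue is there to provide.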
I suggest that the workers share a single KafkaProducer. The producer is thread safe, and since the output topic is the same for the whole group of workers, sharing it should also improve performance. From the KafkaProducer javadoc:
The producer is thread safe and sharing a single producer instance
across threads will generally be faster than having multiple
instances.
In summary, each consumer thread feeds n processor threads. Some variants could be:
- 1 consumer - 1 worker (no processing parallelization, just division of duties)
- 1 consumer - 2 workers
- 1 consumer - 4 workers
- 2 consumers - 4 workers (2 for each)
- 2 consumers - 8 workers (4 for each)
...
Read the pros and cons of this mechanism in the javadoc carefully, and judge whether it could be a solution for your specific case.
In my opinion, there's a PRO that isn't reflected in the docs, and it is the root of this answer/suggestion:
Consumption shouldn't be affected by processing. This approach avoids a consumer thread being considered dead because of slow message processing, and offers an extra "safety window" thanks to the queue. I'm not saying the consumer would carry on happily if all processors fail for every message or the queue hits its maximum size; it will in fact be stopped by processing, but much, much later and for bigger reasons that couldn't be avoided. This approach buys some extra time, or an extra shield, before that happens. Just as a dam can still fail if it can't hold any more water.
Well, I hope you take this as a suggestion and that it helps somehow. It may avoid most of the dead-consumer issues you're having. If well managed, it's a good approach for a 24/7 real-time data workflow.
Let's say there is a batch API for performing tasks List[T]. In order to do the job, all the tasks need to be pushed to Kafka. There are two ways to do that:
1) Pushing the List as a single message to Kafka
2) Pushing each individual task T to Kafka
I believe approach 1 would be better since I don't have to push messages to Kafka multiple times for a single batch call. Can someone please tell me if there is any harm in such an approach?
A Kafka producer can batch together individual messages sent within a short time window (the particular config is linger.ms), so the cost of sending individual messages is probably a lot lower than you think.
Probably a more important factor to consider is how the consumer is going to consume messages. What should happen if the consumer cannot process one of the tasks, for example? If the consumer is just going to call some other batch-based API which succeeds or fails as a batch, then a single message containing a list of tasks would be a perfectly good fit. On the other hand, if the consumer ultimately has to process tasks individually, then sending individual messages is probably a better fit, and will probably save you from having to implement some sort of retry logic in your consumer, because you can probably configure Kafka to behave with the semantics you need.
Starting from Kafka v0.11 you can also use transactions in the producer to publish your entire batch atomically, i.e. you begin the transaction, publish your tasks message by message, and finally commit the transaction. Even though the messages can be sent to Kafka in multiple batches, they will only become visible to consumers once you commit the transaction, as long as your consumers are running in read-committed mode (isolation.level=read_committed).
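The begin / publish / commit sequence might look like this. The sketch assumes a producer object exposing the transactional methods of the confluent-kafka Python client (begin_transaction, produce, commit_transaction, abort_transaction), already configured with a transactional.id and with init_transactions() called once at startup. publish_batch_atomically is a made-up helper, and the error handling is deliberately simplistic.

```python
from typing import Iterable


def publish_batch_atomically(producer, topic: str, tasks: Iterable[bytes]) -> None:
    """Publish every task inside one transaction, or none of them."""
    producer.begin_transaction()
    try:
        for task in tasks:
            # Each task is still an individual message on the topic.
            producer.produce(topic, value=task)
        # Only after this call do read-committed consumers see the messages.
        producer.commit_transaction()
    except Exception:
        # Roll back so that no partial batch ever becomes visible.
        producer.abort_transaction()
        raise
```

The consumer still receives individual messages, so it can process and retry tasks one by one while the batch is published all-or-nothing.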
Option 1 is the preferred method in Kafka so long as the entire batch should always stay together. If you publish a List of records as a batch then they will be stored as a batch, they will be (optionally) compressed as a batch yielding better compression, and they will be fetched by consumers as a batch yielding fewer fetch requests.
If you send individual messages then you will have to give them a common key or they will get spread out over different partitions and possibly be sent out of order, or to different consumers of a consumer group.
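To see why a common key keeps the messages together: a keyed message is assigned to a partition by hashing its key modulo the number of partitions. The sketch below uses crc32 purely as a stand-in (the Java client's default partitioner uses murmur2), and partition_for is a made-up name.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index. crc32 is a stand-in hash, not Kafka's."""
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    # Every message carrying the same key lands on the same partition,
    # so the whole batch is read in order by one consumer of the group.
    for task_id in range(3):
        print(task_id, partition_for(b"batch-42", 10))
```

This only holds while the partition count stays the same; adding partitions changes where a key maps to.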
Imagine a scenario where we have 3 partitions belonging to 3 different topics on a machine which runs a kafka process/broker. This broker will receive messages for all three partitions. It will store them on different log subdirectories. My question is how does the kafka broker schedule these writes? How does it decide which partition/topic will be written next?
For ordering over requests, the image below shows roughly, how the broker internally handles produce requests:
There are a number of network threads that pull bytes off the network layer and convert them into internal requests. These requests are then put in a FIFO request queue, from which the I/O threads pull them and append the contained messages to the relevant partitions. So, in short, messages are processed in the order in which they are received.
Looking through the code I am unsure, whether there may be potential for a race condition here, where a smaller request could "overtake" a large request that was sent immediately before it. However even if this were possible it is an extremely unlikely fringe case that I can't see ever occurring for a single producer. Maybe someone with a better understanding of the code can weigh in here?
As for the ordering of batched messages within one request: the request stores messages internally in a HashMap that uses TopicPartition as the key. As far as I am aware, a Scala HashMap does not preserve the insertion order of its elements, so I don't think there are any guarantees about the order in which multiple partitions in one request get processed. That is fine, as ordering is only guaranteed to be preserved within a partition.
Within each partition, messages are processed in the order they were given to the producer before sending.
Below is the desired design of the queues, with:
P: producer. The application that inserts data.
X: exchange.
C1-C3: consumers. The applications that read from the queues.
Queue details:
A. Works like a log queue; if there is no client bound, the message is discarded.
B. This is a work queue. It does something if a criterion matches.
C. Also a work queue. It transforms the data.
A is optional, but B and C must always keep messages queued until some client process connects.
The problem is determining which type of exchange I should use.
Is it fanout, direct or topic? I want queue A to discard the message if there is no client connected, but B and C should always keep the message.
And should the producer write once to the exchange, or write multiple times with different routing keys or topics?
Answer the question: Do I want all queues to receive all messages?
If the answer is yes then you should use fanout. If the answer is no then you should use direct or topic. The whole point of direct or topic is that the queues themselves will only receive messages based on matching the routing key to the binding key.
Queue A should be instantiated by the consumer C1, and set to autodelete and non-durable. This way, when C1 disconnects, the queue will be deleted and the messages will be discarded.
Conversely, queues B and C should be instantiated when the exchange is, either separately or by the producer. They should be set to non-autodelete and probably durable. If you are using durable queues you might want to have persistent messages (don't worry: if queue A doesn't exist, even persistent messages won't be a problem here). This way, as soon as the producer starts sending messages, the queues will start queuing them up and no message will be missed, even if the consumers are not yet running.
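As a rough sketch of those declarations, here is how they could be written against a channel object with pika-style methods (exchange_declare, queue_declare, queue_bind). It assumes a fanout exchange named "X" and queue names "A", "B" and "C" taken from the question; the two helper functions are made up for the example.

```python
def declare_durable_queues(channel, exchange="X", queues=("B", "C")):
    """Producer side: the exchange plus queues that must survive without consumers."""
    channel.exchange_declare(exchange=exchange, exchange_type="fanout", durable=True)
    for name in queues:
        # durable + non-autodelete: messages pile up until a consumer connects.
        channel.queue_declare(queue=name, durable=True, auto_delete=False)
        channel.queue_bind(queue=name, exchange=exchange)


def declare_transient_queue(channel, exchange="X", queue="A"):
    """Consumer C1 side: a queue that disappears when C1 goes away."""
    # non-durable + autodelete: the queue is removed once its last consumer leaves.
    channel.queue_declare(queue=queue, durable=False, auto_delete=True)
    channel.queue_bind(queue=queue, exchange=exchange)
```

While queue A does not exist, the exchange simply has nothing to route to for it, so those messages are dropped for A but still land in B and C.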
Whether to use direct or topic exchanges is personal preference. I understand that direct exchanges should be faster while topic exchanges allow a lot of flexibility with routing/binding keys.
I am not 100% sure what you mean by your last question. Each message should only be written once to an exchange. If you are using fanout, the exchange will take care of routing the messages to the queues correctly and that is it. If you are using direct or topic exchanges, then it's down to the binding keys to make sure that each queue receives the correct messages. You should not need to send a message with more than one routing key; if you want to do something like that, then you have got something backwards in your understanding. But you can have multiple binding keys to the exchange from a single queue.
Simple example. X is a direct exchange. B has the binding key black, C has one binding key of black and one binding key of white. P sends messages with either the routing key black or white. If it is black then both B and C will receive the message, if it is white then only C will receive it.
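The same example as a toy model of direct-exchange matching (routing key must equal a binding key). This is plain Python, not a RabbitMQ client; route and the bindings dict are made up for illustration.

```python
from typing import Dict, List, Set

# Queue name -> binding keys on the direct exchange X.
BINDINGS: Dict[str, Set[str]] = {
    "B": {"black"},
    "C": {"black", "white"},
}


def route(routing_key: str, bindings: Dict[str, Set[str]] = BINDINGS) -> List[str]:
    """Return the queues that receive a message sent with routing_key."""
    return sorted(q for q, keys in bindings.items() if routing_key in keys)


if __name__ == "__main__":
    print(route("black"))   # ['B', 'C']
    print(route("white"))   # ['C']
```

The producer publishes each message once with a single routing key; the fan-out to several queues comes entirely from the bindings.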