Kafka to Kafka mirroring with sampling - apache-kafka

Any idea how to set up Kafka-to-Kafka mirroring with sampling (for example, forwarding only 10% of the messages)?

You could use MirrorMakerMessageHandler (which is configured by message.handler parameter):
https://github.com/apache/kafka/blob/1.0/core/src/main/scala/kafka/tools/MirrorMaker.scala#L430
The handler itself would need to decide whether to forward a message. A simple implementation would just count the messages received and forward one whenever 0 == counter % 10.
However, this handler is invoked for every message received, which means you would still be receiving all of the messages and throwing away 90% of them.
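A minimal sketch of the counter-based decision described above. EveryNthSampler is a hypothetical helper, not part of Kafka, and wiring it into a MirrorMakerMessageHandler is not shown; it only illustrates the "forward if 0 == counter % 10" logic.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not a Kafka class): forwards every n-th message it sees. */
final class EveryNthSampler {
    private final int n;
    private final AtomicLong counter = new AtomicLong();

    EveryNthSampler(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Returns true for messages 0, n, 2n, ... counted in arrival order. */
    boolean shouldForward() {
        return counter.getAndIncrement() % n == 0;
    }

    public static void main(String[] args) {
        EveryNthSampler sampler = new EveryNthSampler(10);
        int forwarded = 0;
        for (int i = 0; i < 100; i++) {
            if (sampler.shouldForward()) {
                forwarded++;
            }
        }
        System.out.println(forwarded + " of 100 messages forwarded");
    }
}
```

The counter is per sampler instance, so with several mirror maker instances each one would sample its own share of the stream independently.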
The alternative is to modify the main loop, where the mirror maker consumer receives each message and forwards it to the producers (which send it to the mirror cluster). That loop is here:
https://github.com/apache/kafka/blob/1.0/core/src/main/scala/kafka/tools/MirrorMaker.scala#L428
You would need to modify the consumer part to do one of the following:
forward only N-th (10th) message/offset
seek to only N-th message in log
I prefer the former idea, because with multiple MM instances in the same consumer group you would still get reasonable behaviour. The second choice would demand more work from you to handle reassignments.
Also, deciding which messages make up the 10% is non-trivial; I just assumed that it's every 10th message received.

Related

How to test a verticle that does not wait for acks to its messages?

I want to test a worker verticle that receives requests over EventBus and sends the results also over EventBus. A single request may result in 0,1,2,... responses - in general cases we don't know how many responses we'll get.
The business logic is that requests are acked once the processing is complete, however the responses are sent in "fire and forget" manner - therefore we only know the responses were sent, not necessarily that they were delivered already.
I am writing a test for this verticle.
The test code is planned to be like this:
1. set up consumer for responses
2. send a request
3. wait until request is acked by the worker verticle
4. wait until consumer finishes validating the responses
The problem here is step 4 - in general case we don't know if there are still some responses in flight or not.
A brute-force solution is obviously to wait some reasonable time; a few milliseconds is usually enough. However, I'd prefer something more principled.
A solution that comes to my mind is this:
send some request for which we know for sure that there would be a single response;
wait until the consumer receives the corresponding response.
That should work, but I dislike the fact that I pump two messages through the SUT instead of just a single one.
A different solution would be to send one extra response from test code, once we have a confirmation that the request was processed - but would it be considered to be the same sender? The EventBus only guarantees delivery order from the same sender, not from different ones. The test doesn't run in cluster mode, all operations are performed on the same machine, though not necessarily in the same thread.
Yet another solution would be to somehow check that EventBus is now empty, but as I understand, this is not possible.
Is there any other (better) solution?
The solution I would choose now (after half a year more experience with vertx/EventBus) is to send two messages.
The second message would get acked only after the processing of the first one is complete.
This would only work if you have a single consumer so that your two messages can't be processed in parallel.
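The two-message idea can be sketched with plain java.util.concurrent instead of Vert.x. This is only an analogy under stated assumptions: a single-threaded executor stands in for the single consumer, and BarrierMessageDemo and its method are hypothetical names. Because the worker handles tasks one at a time in submission order, the second (barrier) task can only complete after the first one has finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: a second "barrier" message proves the first one was fully processed. */
final class BarrierMessageDemo {
    static List<String> runWithBarrier() {
        List<String> received = new CopyOnWriteArrayList<>();
        // A single worker thread plays the role of the single consumer.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            // First message: the real request, emitting some number of responses.
            worker.submit(() -> {
                received.add("response-1");
                received.add("response-2");
            });
            // Second message: the barrier. It is handled only after the first task is done.
            Future<?> barrierAck = worker.submit(() -> { });
            barrierAck.get(5, TimeUnit.SECONDS);
            // Everything the first request produced is visible at this point.
            return new ArrayList<>(received);
        } catch (Exception e) {
            throw new IllegalStateException("barrier was not acked", e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithBarrier());
    }
}
```

With more than one worker thread the barrier could overtake the request, which mirrors the single-consumer caveat above.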

Kafka producer future metadata in callback

In my application, when I send messages I use the metadata in the callback to save the offset of the record for future use. However, sometimes metadata.offset() returns -1, which makes things hard later.
Why does this happen, and is there a way to get the offset without consuming the topic to find it?
Edit: I am currently on acks=0. When I switch to acks=1 I don't have these errors anymore, but my performance drops drastically: sending 100k messages goes from 10 seconds to 1 minute.
acks=0 If set to zero then the producer will not wait for any
acknowledgment from the server at all. The record will be immediately
added to the socket buffer and considered sent. No guarantee can be
made that the server has received the record in this case, and the
retries configuration will not take effect (as the client won't
generally know of any failures). The offset given back for each
record will always be set to -1.
This is not exactly true, as out of 100k messages I got 95k with offsets, but I guess that's normal.
I will still need to find another solution to get the offset with acks=0.

How can a kafka consumer doing infinite retries recover from a bad incoming message?

I am a kafka newbie, and as I was reading the docs I came up with this design question about the kafka consumer.
A kafka consumer reads messages from the kafka stream which is made up
of one or more partitions from one or more servers.
Let's say one of the incoming messages is corrupt and, as a result, the consumer fails to process it. When processing event logs you don't want to drop any events, so you retry indefinitely to ride out transient errors during processing. With infinite retries, how can the consumer move forward? Is there a way to blacklist this message for the next retry?
I'd think it needs manual intervention: we log some message metadata (I don't know what exactly yet) to see which message is failing, and have logic in place where each consumer checks redis (or someplace else?) after n retries to see if this message needs to be skipped. The blacklist doesn't have to be stored in redis forever either, only until the consumer can skip the message. Here's pseudocode for what I just described:
// processMessage() returns true when processing failed
boolean errorState = true;
while (errorState) {
    if (blacklist.contains(msg)) {
        // skip the bad message: commit its offset and leave the retry loop
        commitOffset();
        errorState = false;
    } else {
        errorState = processMessage(msg);
        if (!errorState) {
            commitOffset();
        } else {
            // log this msg so that we can add it to the blacklist
            logger.info(msg);
        }
    }
}
I'd like to hear from more experienced folks to see if there are better ways to do this.
We had a requirement in our project where processing an incoming message that updates a record depended on the record being present. Due to a race condition, the update sometimes arrived before the insert. For such cases, we implemented a couple of approaches.
A. Manual retry with a predefined delay. The code checks if the insert has arrived. If so, processing goes as normal. Otherwise, it would sleep for 500ms, then try again. This would repeat 10 times. At the end, if the message is still not processed, the code logs the message, commits the offset and moves forward. The processing of message is always done in a thread from a pool, so it doesn't block the main thread either. However, in the worst case each message would take 5 seconds of application time.
B. Recently, we refined the above solution to use a message scheduler based on kafka. So now if insert has not arrived before the update, system sends it to a separate scheduler which operates on kafka. This scheduler would replay the message after some time. After 3 retries, we again log the message and stop scheduling or retrying. This gives us the benefit of not blocking the application threads and manage when we would like to replay the message again.
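Approach A can be sketched roughly as follows. BoundedRetry and processWithRetry are hypothetical names, and the processor is modelled as a predicate that returns true once the message could be handled; the 10 attempts and 500ms delay from the description would be passed in as arguments.

```java
import java.util.function.Predicate;

/** Hypothetical helper: bounded retry with a fixed delay between attempts. */
final class BoundedRetry {
    /**
     * Tries the processor up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns true on success; false means the caller should log the message,
     * commit the offset and move on.
     */
    static <T> boolean processWithRetry(T message, Predicate<T> processor,
                                        int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (processor.test(message)) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt and give up early.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated processor that succeeds on the third attempt.
        boolean ok = processWithRetry("update-42", m -> ++calls[0] >= 3, 10, 5);
        System.out.println("processed=" + ok + " after " + calls[0] + " attempts");
    }
}
```

Because the worst case blocks for roughly maxAttempts times the delay, this belongs on a pool thread rather than the polling thread, as noted in A.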

What's the best way to subsample a ZMQ PUB SUB connection?

I have a ZMQ_PUB socket sending messages out at ~50Hz. One destination needs to react to each message, so it has a standard ZMQ_SUB socket with a while(true) loop checking for new messages. A second destination should only react once a second to the "most recent" message. That is, my second destination needs to subsample.
For the second destination, I believe I'd want to have a time-based loop that is called at my desired rate (1Hz) and recv() the latest message, dropping the rest. I believe this is done via a ZMQ_HWM on the subscriber. Is there another option that needs to be set somewhere?
Do I need to worry about the different subscribers having different HWMs? Will the publisher become angry? It's a shame ZMQ_RATE only applies to multicast sockets.
Is there a best way to accomplish what I'm attempting?
zmq v3.2.4
The high-water mark will not be a great solution for your problem. Setting it on the subscriber to, let's say, 10 and reading 1 message per second will just give you the old messages first, slowly, and throw away all the new ones, because its limit has been reached.
You could use a topic on your publisher that lets you filter out every 50th message, for example by making the topic messageCount % 50 and subscribing to 0.
Otherwise, maybe you shouldn't use zmq's pub/sub, but instead build your own look-alike with router/dealer that allows you to subscribe to sampled messages.
Lastly, you could also just send them all and only use every 50th message. 50 messages per second is hardly anything in zmq (unless they are heavy on data, like megabytes each).
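The topic-based suggestion can be sketched without a ZeroMQ binding. SampledTopic is a hypothetical helper: topicFor builds the "messageCount % 50" topic, and delivered imitates the prefix matching that a SUB socket applies to subscriptions. Zero-padding the topic is an assumption added here so that a subscription such as "01" cannot also match "10" through "19", as an unpadded "1" would.

```java
/** Hypothetical helper: topic naming for subsampling a PUB/SUB stream. */
final class SampledTopic {
    /** Zero-padded topic for a message, e.g. "00".."49" when period is 50. */
    static String topicFor(long messageCount, int period) {
        int width = String.valueOf(period - 1).length();
        return String.format("%0" + width + "d", messageCount % period);
    }

    /** Imitates a SUB socket filter: deliver if the topic starts with the subscription. */
    static boolean delivered(String subscription, String topic) {
        return topic.startsWith(subscription);
    }

    public static void main(String[] args) {
        int seen = 0;
        for (long count = 0; count < 100; count++) {
            if (delivered("00", topicFor(count, 50))) {
                seen++;
            }
        }
        System.out.println("subscriber to \"00\" saw " + seen + " of 100 messages");
    }
}
```

The full-rate subscriber would subscribe with an empty prefix and receive every message regardless of topic.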

RabbitMQ - Message order of delivery

I need to choose a new Queue broker for my new project.
This time I need a scalable queue that supports pub/sub, and keeping message ordering is a must.
I read Alexis's comment, in which he writes:
"Indeed, we think RabbitMQ provides stronger ordering than Kafka"
I read the message ordering section in rabbitmq docs:
"Messages can be returned to the queue using AMQP methods that feature
a requeue
parameter (basic.recover, basic.reject and basic.nack), or due to a channel
closing while holding unacknowledged messages...With release 2.7.0 and later
it is still possible for individual consumers to observe messages out of
order if the queue has multiple subscribers. This is due to the actions of
other subscribers who may requeue messages. From the perspective of the queue
the messages are always held in the publication order."
If I need to handle messages by their order, I can only use rabbitMQ with an exclusive queue to each consumer?
Is RabbitMQ still considered a good solution for ordered message queuing?
Well, let's take a closer look at the scenario you are describing above. I think it's important to paste the documentation immediately prior to the snippet in your question to provide context:
Section 4.7 of the AMQP 0-9-1 core specification explains the
conditions under which ordering is guaranteed: messages published in
one channel, passing through one exchange and one queue and one
outgoing channel will be received in the same order that they were
sent. RabbitMQ offers stronger guarantees since release 2.7.0.
Messages can be returned to the queue using AMQP methods that feature
a requeue parameter (basic.recover, basic.reject and basic.nack), or
due to a channel closing while holding unacknowledged messages. Any of
these scenarios caused messages to be requeued at the back of the
queue for RabbitMQ releases earlier than 2.7.0. From RabbitMQ release
2.7.0, messages are always held in the queue in publication order, even in the presence of requeueing or channel closure. (emphasis added)
So, it is clear that RabbitMQ, from 2.7.0 onward, is making a rather drastic improvement over the original AMQP specification with regard to message ordering.
With multiple (parallel) consumers, order of processing cannot be guaranteed.
The third paragraph (pasted in the question) goes on to give a disclaimer, which I will paraphrase: "if you have multiple processors in the queue, there is no longer a guarantee that messages will be processed in order." All they are saying here is that RabbitMQ cannot defy the laws of mathematics.
Consider a line of customers at a bank. This particular bank prides itself on helping customers in the order they came into the bank. Customers line up in a queue, and are served by the next of 3 available tellers.
This morning, it so happened that all three tellers became available at the same time, and the next 3 customers approached. Suddenly, the first of the three tellers became violently ill, and could not finish serving the first customer in the line. By the time this happened, teller 2 had finished with customer 2 and teller 3 had already begun to serve customer 3.
Now, one of two things can happen. (1) The first customer in line can go back to the head of the line or (2) the first customer can pre-empt the third customer, causing that teller to stop working on the third customer and start working on the first. This type of pre-emption logic is not supported by RabbitMQ, nor any other message broker that I'm aware of. In either case, the first customer actually does not end up getting helped first - the second customer does, being lucky enough to get a good, fast teller off the bat. The only way to guarantee customers are helped in order is to have one teller helping customers one at a time, which will cause major customer service issues for the bank.
It is not possible to ensure that messages get handled in order in every possible case, given that you have multiple consumers. It doesn't matter if you have multiple queues, multiple exclusive consumers, different brokers, etc. - there is no way to guarantee a priori that messages are answered in order with multiple consumers. But RabbitMQ will make a best-effort.
Message ordering is preserved in Kafka, but only within partitions rather than globally. If your data need both global ordering and partitions, this does make things difficult. However, if you just need to make sure that all of the same events for the same user, etc... end up in the same partition so that they are properly ordered, you may do so. The producer is in charge of the partition that they write to, so if you are able to logically partition your data this may be preferable.
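A minimal sketch of the "same user, same partition" idea. KeyPartitioner is a hypothetical helper and uses String.hashCode purely for illustration; it is not the hash that Kafka's own partitioner uses. The point is only that a deterministic function of the key sends all events for one user to one partition, where their order is preserved.

```java
/** Hypothetical helper: deterministic key-to-partition mapping (not Kafka's own hashing). */
final class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int first = partitionFor("user-17", 6);
        int second = partitionFor("user-17", 6);
        System.out.println("same partition for same key: " + (first == second));
    }
}
```

Events for different users may land in different partitions, so there is still no ordering between users.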
I think there are two things in this question which are not similar, consumption order and processing order.
Message Queues can -to a degree- give you a guarantee that messages will get consumed in order, they can't, however, give you any guarantees on the order of their processing.
The main difference here is that there are some aspects of message processing which cannot be determined at consumption time, for example:
As mentioned a consumer can fail while processing, here the message's consumption order was correct, however, the consumer failed to process it correctly, which will make it go back to the queue. At this point the consumption order is intact, but the processing order is not.
If by "processing" we mean that the message is now discarded and finished processing completely, then consider the case when your processing time is not linear, in other words processing one message takes longer than the other. For example, if message 3 takes longer to process than usual, then messages 4 and 5 might get consumed and finish processing before message 3 does.
So even if you managed to get the message back to the front of the queue (which by the way violates the consumption order) you still cannot guarantee they will also be processed in order.
If you want to process the messages in order:
Have only 1 consumer instance at all times, or a main consumer and several stand-by consumers.
Or don't use a messaging queue and do the processing in a synchronous blocking method, which might sound bad but in many cases and business requirements it is completely valid and sometimes even mission critical.
There are proper ways to guarantee the order of messages within RabbitMQ subscriptions.
If you use multiple consumers, they will process the messages using a shared ExecutorService. See also ConnectionFactory.setSharedExecutor(...). You could set it to an Executors.newSingleThreadExecutor().
If you use one Consumer with a single queue, you can bind this queue using multiple bindingKeys (they may have wildcards). The messages will be placed into the queue in the same order that they were received by the message broker.
For example you have a single publisher that publishes messages where the order is important:
try (Connection connection2 = factory.newConnection();
     Channel channel2 = connection2.createChannel()) {
    // publish messages, alternating between two different routing keys
    for (int i = 0; i < messageCount; i++) {
        final String routingKey = i % 2 == 0 ? routingEven : routingOdd;
        channel2.basicPublish(exchange, routingKey, null, ("Hello" + i).getBytes(UTF_8));
    }
}
You now might want to receive messages from both topics in a queue in the same order that they were published:
// declare a queue for the consumer
final String queueName = channel.queueDeclare().getQueue();
// we bind to queue with the two different routingKeys
final String routingEven = "even";
final String routingOdd = "odd";
channel.queueBind(queueName, exchange, routingEven);
channel.queueBind(queueName, exchange, routingOdd);
channel.basicConsume(queueName, true, new DefaultConsumer(channel) { ... });
The Consumer will now receive the messages in the order that they were published, regardless of the fact that you used different topics.
There are some good 5-Minute Tutorials in the RabbitMQ documentation that might be helpful:
https://www.rabbitmq.com/tutorials/tutorial-five-java.html