How to discard some number of messages in a RabbitMQ queue

I have a use case where I need to get data from a queue on an exchange that I don't have control over.
I receive messages from this queue constantly. I wonder whether, in RabbitMQ itself or by using/writing a plugin, I can discard 90% of the messages before saving them to my local datastore. The reason for this is that I'm not able to store all the messages, only about 10% of them.
Obviously one way is to do this in my application, but I wonder if there is a way to do it at the RabbitMQ level.
I'd appreciate any thoughts/solutions on this.

If you don't have control of the exchange, you're pretty much limited to doing it in your app.
You can bulk-reject messages using a nack - here's the help page:
http://www.rabbitmq.com/nack.html
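For illustration, here is a minimal sketch of the application-side approach using the RabbitMQ Java client. The queue name, broker host, and the keep-every-tenth sampling rule are assumptions, not something from the question:

import com.rabbitmq.client.*;

public class SamplingConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                         // assumed broker host
        Channel channel = factory.newConnection().createChannel();

        final int[] counter = {0};
        channel.basicConsume("incoming", false, (consumerTag, delivery) -> {
            // Keep roughly 1 message in 10; everything is acked so the queue drains either way.
            if (counter[0]++ % 10 == 0) {
                saveToDatastore(delivery.getBody());          // placeholder for your storage logic
            }
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        }, consumerTag -> { });
    }

    private static void saveToDatastore(byte[] body) {
        // persist the ~10% you keep
    }
}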

Per the AMQP spec, a RabbitMQ queue delivers its messages to the connected consumers using a round-robin algorithm. So if your code is the sole consumer of the RabbitMQ queue and you want your application to ignore about 90% of the received messages and process only the remaining 10%, then...
connect to the same queue with 10 different consumers simultaneously (whether they are written in the same language or different ones does not matter), write your message-processing logic in just one or two of them, and leave the remaining 8 or 9 as no-ops (RabbitMQ, and conceptually you, will use these to drain off about 90% of the messages); a sketch of such a drain consumer follows.
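For what it's worth, one of those "drain" consumers could be as simple as the following sketch (RabbitMQ Java client, queue name and host assumed); with auto-ack enabled it silently discards everything it receives:

import com.rabbitmq.client.*;

public class DrainConsumer {
    public static void main(String[] args) throws Exception {
        Channel channel = new ConnectionFactory().newConnection().createChannel();  // assumes a local broker
        // autoAck = true: each delivery is considered handled immediately, so the
        // message is removed from the queue and simply thrown away here.
        channel.basicConsume("incoming", true,
                (consumerTag, delivery) -> { /* intentionally do nothing */ },
                consumerTag -> { });
    }
}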

You can simply consume the messages and do nothing with them. Using rabbitmqadmin is the easiest way to do this, as below:
rabbitmqadmin get queue=queuename requeue=false count=1

Related

How to replace Kafka with Aeron

At present, the trading system in our production environment uses Kafka. Because Kafka's latency is too high, we hope to replace it with Aeron. How can I use Aeron correctly?
Aeron isn't an out-of-the-box replacement for Kafka, although it does provide primitives that would allow you to replicate much of the functionality.
Kafka latencies are in the order of milliseconds whereas Aeron latencies are typically measured in microseconds.
What exactly you would need to build in Aeron very much depends on your use case.
One of the primary uses of Kafka is as a persistent queue.
To build a simple persistent queue for a single-publisher use case, you would need:
Publisher
ArchivingMediaDriver - this component runs an Aeron MediaDriver, which handles sending/receiving messages over the network, and an Archive, which allows you to record and replay streams.
A Publication to send messages to be recorded by the Archive. See AeronArchive.addRecordedPublication.
Subscriber(s)
MediaDriver - this component handles sending/receiving messages over the network.
A Subscription that replays data from a specific position in the recorded stream of messages. See AeronArchive.replay.
There are examples of this in the aeron-samples.
RecordedBasicPublisher.java
ReplayedBasicSubscriber.java
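For orientation only, here is a very rough sketch of those pieces using the Aeron archive client. It assumes an ArchivingMediaDriver is already running, and the channels, stream IDs, and recording ID are placeholders; the samples above show the complete, working flow (recording lookup, polling, error handling):

import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.archive.client.AeronArchive;

public class PersistentQueueSketch {
    public static void main(String[] args) {
        try (AeronArchive archive = AeronArchive.connect()) {
            // Publisher side: messages offered on this publication are recorded by the Archive.
            Publication publication = archive.addRecordedPublication("aeron:ipc", 10);   // channel/streamId assumed

            // Subscriber side: replay the recorded stream from a chosen position.
            long recordingId = 0L;   // in practice you would look this up, e.g. via listRecordings
            Subscription replay = archive.replay(
                    recordingId, 0L, Long.MAX_VALUE, "aeron:ipc", 11);                   // replay channel/streamId assumed
        }
    }
}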
Latency could be reduced further by having the publisher send messages over multicast/MDC and having the subscriber use ReplayMerge to seamlessly transition from the recorded stream to the live stream.
Worth noting that real-logic do provide commercial support.

Avoid Data Loss While Processing Messages from Kafka

I'm looking for the best approach to designing my Kafka consumer. Basically, I would like to know the best way to avoid data loss in case there are any exceptions/errors while processing the messages.
My use case is as below.
a) The reason why I am using a SERVICE to process the message is that in the future I am planning to write an ERROR PROCESSOR application which would run at the end of the day and try to reprocess the failed messages (not all messages, only those that fail because of dependencies, like a missing parent).
b) I want to make sure there is zero message loss, so I will save the message to a file in case there are any issues while saving the message to the DB.
c) In the production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications will try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to and read from the same file at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.
ERROR PROCESSOR - Our source follows an event-driven mechanism, and there is a high chance that sometimes the dependent event (for example, the parent entity of something) gets delayed by a couple of days. In that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily; you could perhaps send those messages back to Kafka in a new topic (let's say, error-topic). Then, when your error processor is ready, it could just listen to this error-topic and consume those messages as they come in.
I think this question has been addressed in the response to the first one. So, instead of writing to and reading from a file and opening multiple file handles to do this concurrently, Kafka might be a better choice as it is designed for such problems. A hedged sketch of publishing failures to such an error topic is below.
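As an illustration of that idea, a consumer/service could publish failed messages to an error topic with the standard Kafka Java producer; the topic name, broker address, and String payloads are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ErrorTopicPublisher {
    private final KafkaProducer<String, String> producer;

    public ErrorTopicPublisher() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Called by the consumer/service whenever processing a message fails.
    public void publishFailure(String key, String originalPayload) {
        // "error-topic" is the hypothetical topic the ERROR PROCESSOR will read later.
        producer.send(new ProducerRecord<>("error-topic", key, originalPayload));
    }
}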
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering in your design of the service component: you might as well consider merging points 4 and 5 by sending all the error messages back to Kafka. That will enable you to process all error messages in a consistent way, as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message before writing to the database, then nothing will be lost as long as Kafka retains the message. The tradeoff is that if the consumer did commit to the database, but a Kafka offset commit fails or times out, you'd end up consuming records again and potentially have duplicates being processed in your service.
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition and ensured all consumers only ran on a single machine (because you're preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer to a database, you could look into the Kafka Connect framework. For validating messages, you could similarly deploy a Kafka Streams application that filters bad messages out of an input topic and writes the valid ones to a topic that feeds the DB; a hedged sketch follows.
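A rough illustration of that Kafka Streams suggestion (topic names, application id, and the validity check are all placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ValidatingFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-validator");   // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .filter((key, value) -> isValid(value))   // keep only messages the DB writer can handle
               .to("db-topic");                          // hypothetical topic consumed by the DB sink

        new KafkaStreams(builder.build(), props).start();
    }

    private static boolean isValid(String value) {
        return value != null && !value.isEmpty();         // placeholder validation rule
    }
}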

Anyevent::RabbitMQ Perl QoS prefetch_count not working

I've been trying to use the RabbitMQ Perl library Net::RabbitFoot, which uses AnyEvent::RabbitMQ underneath. According to the RabbitMQ tutorial, setting prefetch_count to 1 should ensure fair dispatch, i.e. a message should not be dispatched to a worker that is already busy with another message. However, the Perl implementation, Net::RabbitFoot, does not seem to work that way even after setting the qos as described here, line 54. It seems to just do vanilla round-robin dispatch and ends up dispatching to a machine that is already executing a job. This is the qos implementation. Could you help me figure out why this is happening? Is it a bug in the library?
Thanks in advance.
Edit:
This is my setup: 2 consumers attached to the same named queue. When I dispatch a lot of messages, I see this pattern: Consumer 1: Msg1, Msg3, Msg5 ... Consumer 2: Msg2, Msg4, ... All messages are from the same queue. What happens now is that if Msg3 hogs Consumer 1, Msg5 is still sent to Consumer 1 while Consumer 2 is sitting free.
vanilla round-robin? uh?
prefetch_count=1 is useful when there are many consumers attached to the same common queue. By default the client libraries will prefetch many messages in one shot.
So the odd default effect, which you want to avoid by setting it to one, is that one client gets most (or all) of the messages while the other consumers get few or none, leaving the load unbalanced.
However, you speak of "vanilla round-robin": that happens when you have different (probably unnamed/temporary) queues attached to a direct exchange, one per consumer. But that way you have no means to balance the load dynamically.
If I'm guessing right, you need to change your configuration and let all the consumers attach to the same named queue.
EDIT: from the comment of the OP, this is not the case.
Alternatively, it's possible that your consumers are configured with auto-ack, or that they send the ack before completing their job. In that case too, the RabbitMQ client API thinks the consumer is free to get another message: you need to send the ack back only after the local task for that message has been completed; a sketch of that pattern follows.
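To show where the qos and the ack sit relative to the work, here is that pattern in the RabbitMQ Java client rather than Perl (queue name and host assumed); the Perl client should follow the same shape:

import com.rabbitmq.client.*;

public class FairDispatchWorker {
    public static void main(String[] args) throws Exception {
        Channel channel = new ConnectionFactory().newConnection().createChannel();  // assumes a local broker
        channel.basicQos(1);                       // prefetch_count = 1: at most one unacked message per consumer

        boolean autoAck = false;                   // manual acks are essential for fair dispatch
        channel.basicConsume("task_queue", autoAck, (consumerTag, delivery) -> {
            doWork(delivery.getBody());            // long-running job
            // Ack only AFTER the work is done, so the broker won't push the
            // next message to this consumer while it is still busy.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        }, consumerTag -> { });
    }

    private static void doWork(byte[] body) {
        // placeholder for the actual job
    }
}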

How to get Acknowledgement from Kafka

How do I get an acknowledgement from Kafka once a message is consumed or processed? It might sound stupid, but is there any way to know the start and end offset of the message for which the ack has been received?
What I found so far is that in 0.8 they introduced the following way to choose the offset to start reading from:
kafka.api.OffsetRequest.EarliestTime() finds the beginning of the data in the logs and starts streaming from there, kafka.api.OffsetRequest.LatestTime() will only stream new messages.
example code
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
Still not sure about the acknowledgement part
Kafka isn't really structured to do this. To understand why, review the design documentation here.
In order to provide exactly-once acknowledgement, you would need to create some external tracking system for your application, where you explicitly write acknowledgements and implement locks over the transaction IDs in order to ensure things are only ever processed once. The computational cost of implementing such a system is extraordinarily high, and it is one of the main reasons that large transactional systems require comparatively exotic hardware and have arguably lower scalability than systems such as Kafka.
If you do not require strong durability semantics, you can use the groups API to keep rough track of when the last message was read. This ensures that every message is read at least once. Note that since the groups API does not give you the ability to explicitly track your application's own processing logic, your actual processing guarantees are fairly weak in this scenario. Schemes that rely on idempotent processing are common in this environment.
Alternatively, you may use the poorly-named SimpleConsumer API (it is quite complex to use), which enables you to explicitly track offsets within your application. This is the highest level of processing guarantee that can be achieved through the native Kafka APIs, since it enables you to track your application's own processing of the data read from the queue.
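For what it's worth, with the consumer API that came after 0.8's SimpleConsumer, each record exposes its offset directly and you can commit only after your own processing succeeds; the broker address, group id, and topic below are assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "my-group");                   // assumed consumer group
        props.put("enable.auto.commit", "false");            // commit only after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));   // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The partition and offset of each consumed message are available directly.
                    process(record.value(), record.partition(), record.offset());
                }
                consumer.commitSync();   // the "acknowledgement": marks everything polled so far as processed
            }
        }
    }

    private static void process(String value, int partition, long offset) {
        // placeholder processing logic
    }
}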

MSMQ as a job queue

I am trying to implement a job queue with MSMQ to save some time over implementing it in SQL. After reading around, I realized MSMQ might not offer what I am after. Could you please advise whether my plan is realistic using MSMQ, or recommend an alternative?
I have a number of processes picking up jobs from a queue (I might need to scale out in the future). Once a job is picked up, processing follows; during this time the job is locked to other processes by its status. If needed, the job is chucked back to the queue (its status changes again) for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me keep the message in the queue while working on it; I can only peek or read. Read takes the message out of the queue, and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably a bad idea, as it's not designed for storage at all. Unless the queues are transactional, the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full-blown relational DB, you could use an in-memory cache of some kind, like memcached, or a cheap object DB like Raven.
Take a look at RabbitMQ, or any of the many other message queues. Most offer this functionality out of the box.
For example, RabbitMQ calls what you are describing work queues. Multiple consumers can pull from the same queue without pulling the same item. Furthermore, if you use acknowledgements and processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have its own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
MSMQ, and queues in general, require a different set of patterns than most programmers are used to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.