use of Mongo/Redis on a full firehose stream - mongodb

I have been reading up on how DataSift uses different technologies to consume the twitter firehose and since I need to follow the same concept, wanted to get some understanding on the differences between mongo/redis and its use in storage of realtime data. My understanding is this:
The stream volume is way too high to simply consume and place the data (tweets etc) in for example a rabbitmq bunch of queues. My concern is the issue of data loss. My current architecture involves connecting to an open stream and consume the data and push each post or message into a couple of queues in rabbitmq. The queues hold a copy of each message with one being the processing queue and one being the storage queue. I then consume each queue by doing processing on the processing queue which is time intensive but my workers keep up fine and write all the storage queue contents to files and that works fine.
If my volume was to increase 100x, I am told this current setup will not be able to handle the volume and using the mongo/redis approach would be better. So not sure how this would be implemented: would I then consume the stream into mongo and then from there into queues and why would this be a better approach.

Related

How to replace Kafka with Aeron

At present, the trading system of our production environment is using Kafka. Because Kafka latency is too high, we hope to replace Kafka with Aeron. How can I use Aeron correctly?
Aeron isn't an out of the box replacement for Kafka although it does provide primitives that would allow you to replicate much of the functionality.
Kafka latencies are in the order of milliseconds whereas Aeron latencies are typically measured in microseconds.
What exactly you would need to build in Aeron very much depends on your use case.
One of the primary uses of Kafka is as a persistent queue.
To build a simple persistent queue for a single publisher use case. You would need:
Publisher
ArchivingMediaDriver - this component runs and Aeron MediaDriver which handles send/receiving messages over the network and and Archive which allows you to record and replay streams.
A Publication to send messages to be recorded by the Archive. See AeronArchive.addRecordedPublication.
Subsciber(s)
MediaDriver - this component handles send/receiving messages over the network.
A Susbcription that replays data from a specific position in the recorded stream of messages. See AeronArchive.replay.
There are examples of this in the aeron-samples.
RecordedBasicPublisher.java
ReplayedBasicSubscriber.java
Latency could be reduced further by having the publisher send messages over multicast/MDC and having the subscriber use ReplayMerge to seamlessly transition from the recorded stream to the live stream.
Worth noting that real-logic do provide commercial support.

Kafka Streams and RPC: is calling REST service in map() operator considered an anti-pattern?

The naive approach for implementing the use case of enriching an incoming stream of events stored in Kafka with reference data - is by calling in map() operator an external service REST API that provides this reference data, for each incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have second events stream with reference data and store it in KTable that will be a lightweight embedded "database" then join main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");
KTable<String, Object> enrichedEventStream = eventStream
.leftJoin(referenceDataTable , (event, referenceData) -> /* return the enriched event */)
.map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent)
.to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. Service that is called from the map() operator should be capable of handling high load too and also highly-available. These are extra requirements for the service implementation. But if the service satisfies these criteria can the "naive" approach be used?
Yes, it is ok to do RPC inside Kafka Streams operations such as map() operation. You just need to be aware of the pros and cons of doing so, see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into details here why; if needed, I'd suggest to create a new question).
Pros of doing RPC calls from within Kafka Streams operations:
Your application will fit more easily into an existing architecture, e.g. one where the use of REST APIs and request/response paradigms is common place. This means that you can make more progress quickly for a first proof-of-concept or MVP.
The approach is, in my experience, easier to understand for many developers (particularly those who are just starting out with Kafka) because they are familiar with doing RPC calls in this manner from their past projects. Think: it helps to move gradually from request-response architectures to event-driven architectures (powered by Kafka).
Nothing prevents you from starting with RPC calls and request-response, and then later migrating to a more Kafka-idiomatic approach.
Cons:
You are coupling the availability, scalability, and latency/throughput of your Kafka Streams powered application to the availability, scalability, and latency/throughput of the RPC service(s) you are calling. This is relevant also for thinking about SLAs.
Related to the previous point, Kafka and Kafka Streams scale very well. If you are running at large scale, your Kafka Streams application might end up DDoS'ing your RPC service(s) because the latter probably can't scale as much as Kafka. You should be able to judge pretty easily whether or not this is a problem for you in practice.
An RPC call (like from within map()) is a side-effect and thus a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
Example: Kafka Streams (by default) processes data based on event-time (= based on when an event happened in the real world), so you can easily re-process old data and still get back the same results as when the old data was still new. But the RPC service you are calling during such reprocessing might return a different response than "back then". Ensuring the latter is your responsibility.
Example: In the case of failures, Kafka Streams will retry operations, and it will guarantee exactly-once processing (if enabled) even in such situations. But it can't guarantee, by itself, that an RPC call you are doing from within map() will be idempotent. Ensuring the latter is your responsibility.
Alternatives
In case you are wondering what other alternatives you have: If, for example, you are doing RPC calls for looking up data (e.g. for enriching an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data is in MySQL, you can setup a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and perform the enrichment of your input stream via a stream-table join.
I suspect most of the advice you hear from the internet is along the lines of, "OMG, if this REST call takes 200ms, how wil I ever process 100,000 Kafka messages per second to keep up with my demand?"
Which is technically true: even if you scale your servers up for your REST service, if responses from this app routinely take 200ms - because it talks to a server 70ms away (speed of light is kinda slow, if that server is across the continent from you...) and the calling microservice takes 130ms even if you measure right at the source....
With kstreams the problem may be worse than it appears. Maybe you get 100,000 messages a second coming into your stream pipeline, but some kstream operator flatMaps and that operation in your app creates 2 messages for every one object... so now you really have 200,000 messages a second crashing through your REST server.
BUT maybe you're using Kstreams in an app that has 100 messages a second, or you can partition your data so that you get a message per partition maybe even just once a second. In that case, you might be fine.
Maybe your Kafka data just needs to go somewhere else: ie the end of the stream is back into a Good Ol' RDMS. In which case yes, there's some careful balancing there on the best way to deal with potentially "slow" systems, while making sure you don't DDOS yourself, while making sure you can work your way out of a backlog.
So is it an anti-pattern? Eh, probably, if your Kafka cluster is LinkedIn size. Does it matter for you? Depends on how many messages/second you need to drive, how fast your REST service really is, how efficiently it can scale (ie your new kstreams pipeline suddenly delivers 5x the normal traffic to it...)

How to discard some number of messages in rabbitmQ

I have a use case where I need to get data from a queue on an exchange that I dont have control on.
the usecase is that from this queue I get messages constantly. Just wonder if in rabbitmq or by using/writing a plugin I can discard 90% of the messages at a time before saving them to my local datastore. The reason for this is that I'm not capable of storing all the messages but 10% of it.
Obviously one way is in my application to do so. but I wonder if there is a way to do it on rabbitmq level.
Just wonder if you have any thoughts/solutions on this.
If you don't have control of the exchange, you're pretty much limited to doing it in your app.
You can bulk-reject messages using a nack - here's the help page:
http://www.rabbitmq.com/nack.html
Due to the AMQP specs, a rabbitmq queue passes its messages to the connected consumers in a round robin algorithm. So if your code is the sole consumer of the rabbitmq queue & you want your application to neglect about 90% of recieved messages and process only remaining 10% then,....
connect to the same queue using 10 different consumers simultaneously (all may be written in same language or diff. dose not matter) and write your message processing logic in any one or two of them....abandon the rest 8/9 consumers(these will be used by rabbitmq [and conceptually by us] to drain off about 90% of messages)
You can simply consume the messages and do nothing about it. Using rabbitmqadmin is the easiest way to do this as below:
rabbitmqadmin get queue=queuename requeue=false count=1

MSMQ as a job queue

I am trying to implement job queue with MSMQ to save up some time on me implementing it in SQL. After reading around I realized MSMQ might not offer what I am after. Could you please advice me if my plan is realistic using MSMQ or recommend an alternative ?
I have number of processes picking up jobs from a queue (I might need to scale out in the future), once job is picked up processing follows, during this time job is locked to other processes by status, if needed job is chucked back (status changes again) to the queue for further processing, but physically the job still sits in the queue until completed.
MSMQ doesn't let me to keep the message in the queue while working on it, eg I can peek or read. Read takes message out of queue and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object db like raven.
Take a look at RabbitMQ, or many of the other messages queues. Most offer this functionality out of the box.
For example. RabbitMQ calls what you are describing, Work Queues. Multiple consumers can pull from the same queue and not pull the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have it's own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
MSMQ and queues in general, require a different set of patterns than what most programmers are use to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.

Flink kafka to S3 with retry

I am a new starter in Flink, I have a requirement to read data from Kafka, enrich those data conditionally (if a record belongs to category X) by using some API and write to S3.
I made a hello world Flink application with the above logic which works like a charm.
But, the API which I am using to enrich doesn't have 100% uptime SLA, so I need to design something with retry logic.
Following are the options that I found,
Option 1) Make an exponential retry until I get a response from API, but this will block the queue, so I don't like this
Option 2) Use one more topic (called topic-failure) and publish it to topic-failure if the API is down. In this way it won't block the actual main queue. I will need one more worker to process the data from the queue topic-failure. Again, this queue has to be used as a circular queue if the API is down for a long time. For example, read a message from queue topic-failure try to enrich if it fails to push to the same queue called topic-failure and consume the next message from the queue topic-failure.
I prefer option 2, but it looks like not an easy task to accomplish this. Is there is any standard Flink approach available to implement option 2?
This is a rather common problem that occurs when migrating away from microservices. The proper solution would be to have the lookup data also in Kafka or some DB that could be integrated in the same Flink application as an additional source.
If you cannot do it (for example, API is external or data cannot be mapped easily to a data storage), both approaches are viable and they have different advantages.
1) Will allow you to retain the order of input events. If your downstream application expects orderness, then you need to retry.
2) The common term is dead letter queue (although more often used on invalid records). There are two easy ways to integrate that in Flink, either have a separate source or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source -\ Async IO /-> Filter good -> S3 sink
+-> Union -> with timeout -+
Kafka Source dead -/ (for API call!) \-> Filter bad -> Kafka sink dead