Asynchronous or synchronous pull for counting stream data in Pub/Sub?

I would like to count the number of messages in the last hour (last hour referring to a timestamp field in the message data).
I currently have code that counts the messages synchronously (I am using Google Cloud Pub/Sub synchronous pull), but I noticed it takes quite a long time.
My code repeatedly polls the subscription a predefined number of times (I set it to 100+) so that I can be sure there are no more messages from the last hour still arriving out of order.
This is not an acceptable design because it means the user has to wait for 5-10 mins for the service to count the messages when they want the metric!
Are there best practices in Pub Sub design for solving this kind of problem?
This seems like a simple problem to solve (count the number of events in the last X timeframe) so I thought there might be.
Will an asynchronous design help? How would an async design work? I am not too sure about asynchronous pull and the Python futures concept (I am using GCP Pub/Sub's Python client library).

I would catch the messages differently. My solution is based on logging and BigQuery. The idea is to write a log entry, for example "message received with timestamp xxxxx", to filter on this log pattern, and to sink the result into BigQuery.
Then, when a user asks, you simply query BigQuery and count the messages in the desired window of time. You also have the advantage of being able to change the time frame, to have a history, and so on.
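For illustration, a minimal sketch of that counting query using the BigQuery Java client; the table name logs.pubsub_messages and the timestamp column are hypothetical placeholders for whatever your log sink produces:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class CountLastHour {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // `logs.pubsub_messages` is a hypothetical sink table.
    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
        "SELECT COUNT(*) AS n FROM `logs.pubsub_messages` "
            + "WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)")
        .build();
    TableResult result = bigquery.query(query);
    // One row, one column: the message count for the last hour.
    result.iterateAll().forEach(row -> System.out.println(row.get("n").getLongValue()));
  }
}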
For writing this log, there are 2 solutions:
Cheaper but not really recommended: the process that consumes the message logs it as it processes it. However, you then depend on an external service, and that service has 2 responsibilities: its own work, and this logging (for metrics). Not SOLID. It could instead be the role of the publisher, with a log entry like "message published at XXXX"; however, this implies that all the publishers or all the subscribers are on GCP.
Better: plug in a Cloud Function, the cheapest one (128MB of memory), to simply handle the message and write the log (a minimal sketch follows).
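The question uses Python, but to keep one language across these snippets, here is a minimal sketch of such a function in Java, assuming the Functions Framework for Java; the "timestamp" attribute name is hypothetical, so adapt it to where your timestamp actually lives:

import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.util.Map;
import java.util.logging.Logger;

public class MessageLogger implements BackgroundFunction<MessageLogger.PubsubMessage> {
  private static final Logger logger = Logger.getLogger(MessageLogger.class.getName());

  // Minimal POJO mirroring the Pub/Sub event payload.
  public static class PubsubMessage {
    String data;
    Map<String, String> attributes;
  }

  @Override
  public void accept(PubsubMessage message, Context context) {
    // One line per message; a log sink filtering on "message received with timestamp"
    // exports these entries to BigQuery.
    logger.info("message received with timestamp " + message.attributes.get("timestamp"));
  }
}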

Related

How do I make sure that I process one message at a time at most?

I am wondering how to process one message at a time using Google's Pub/Sub functionality in Go. I am using the official library for this, https://pkg.go.dev/cloud.google.com/go/pubsub#section-readme. The event is consumed by a service that runs with multiple instances, so any in-memory locking mechanism will not work.
I realise that it's an anti-pattern to do this, so let me explain my use case. Using MongoDB, I store an array of objects as an embedded document for each entity. The event being published modifies parts of this array and saves it. If I receive more than one event at a time and they start processing at exactly the same time, one of the saves will override the other. So I was thinking a solution for this is to make sure that only one message is processed at a time, and it would be nice to use any built-in functionality in Cloud Pub/Sub to do so. Otherwise I was thinking of implementing some locking mechanism in the DB, but I'd like to avoid that.
Any help would be appreciated.
You can imagine 2 things:
You can use an ordering key in Pub/Sub. That way, all the messages related to the same object will be delivered in order, one by one (see the publisher sketch after this list).
You can use a PUSH subscription to Pub/Sub, to push to Cloud Run or Cloud Functions. With Cloud Run, set the concurrency to 1 (it's the default with Cloud Functions gen1), and set the max instances to 1 as well. That way you can process only one message at a time; all the other messages will be rejected (429 HTTP error code) and will be requeued to Pub/Sub. The problem is that you can no longer parallelize the processing as you can with ordering keys.
A similar thing, and simpler to implement, is to use Cloud Tasks instead of Pub/Sub. With Cloud Tasks you can set a rate limit on a queue and set maxConcurrentDispatches to 1 (and you don't have to tune Cloud Functions max instances or Cloud Run max instances and concurrency).
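For the first option, a minimal publisher sketch with the Java client (the Go client exposes the same settings); the project, topic, and ordering key are placeholders, and ordering must also be enabled on the subscription:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class OrderedPublish {
  public static void main(String[] args) throws Exception {
    TopicName topic = TopicName.of("my-project", "my-topic"); // hypothetical names
    Publisher publisher = Publisher.newBuilder(topic)
        .setEnableMessageOrdering(true)
        .setEndpoint("us-east1-pubsub.googleapis.com:443") // ordering needs a regional endpoint
        .build();
    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("update array element"))
        .setOrderingKey("entity-42") // all messages for the same entity share one key
        .build();
    publisher.publish(message).get(); // wait for the publish to complete
    publisher.shutdown();
  }
}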

Avoid Data Loss While Processing Messages from Kafka

I'm looking for the best approach for designing my Kafka consumer. Basically, I would like to see what the best way is to avoid data loss in case there are any exceptions/errors while processing the messages.
My use case is as below.
a) The reason why I am using a SERVICE to process the message is: in the future I am planning to write an ERROR PROCESSOR application which would run at the end of the day and try to process the failed messages again (not all messages, but messages which fail because of any dependencies, like a missing parent).
b) I want to make sure there is zero message loss, so I will save the message to a file in case there are any issues while saving the message to the DB.
c) In the production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to and read from the same file at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.
ERROR PROCESSOR: our source follows an event-driven mechanism, and there is a high chance that sometimes the dependent event (for example, the parent entity for something) might get delayed by a couple of days. In that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily; you could perhaps send those messages back to Kafka in a new topic (let's say error-topic). So, when your error processor is ready, it can just listen to this error-topic and consume those messages as they come in.
I think this question has been addressed in the response to the first one. So, instead of writing to and reading from a file, and opening multiple file handles to do this concurrently, Kafka might be a better choice, as it is designed for such problems. A sketch of forwarding failures to error-topic follows.
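A minimal sketch of that first point, assuming string-serialized messages; error-topic and the bootstrap address are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ErrorTopicForwarder {
  private final KafkaProducer<String, String> producer;

  public ErrorTopicForwarder(String bootstrapServers) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    producer = new KafkaProducer<>(props);
  }

  // Called from the consumer's catch block: park the failed message
  // in error-topic so the ERROR PROCESSOR can replay it later.
  public void forward(String key, String value) {
    producer.send(new ProducerRecord<>("error-topic", key, value));
  }
}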
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering on your design for the service component - You might as well consider merging points 4 and 5 by sending all the error messages back to Kafka. That will enable you to process all error messages in a consistent way as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message's offset before writing to the database, then nothing is lost as long as Kafka retains the message. The tradeoff is that if the consumer did write to the database but the Kafka offset commit fails or times out, you'd end up consuming records again and potentially have duplicates being processed in your service.
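A minimal sketch of that commit-after-write pattern with the plain Java consumer; the topic name and saveToDatabase stub are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "db-writer");
    props.put("enable.auto.commit", "false"); // commit manually, after the DB write
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("input-topic"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          saveToDatabase(record.value()); // must be idempotent: replays can happen
        }
        consumer.commitSync(); // only after the batch is safely in the DB
      }
    }
  }

  private static void saveToDatabase(String value) { /* JDBC upsert, etc. */ }
}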
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition and ensured all consumers ran on a single machine (because you'd be preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer to a database, you could look into the Kafka Connect framework. For validating messages, you can similarly deploy a Kafka Streams application that filters bad messages from an input topic out into a separate topic, and sends the rest on to a topic destined for the DB (sketched below).
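A hedged sketch of such a filtering Streams topology; the topic names and the isValid check are placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ValidationStream {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-validator");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> input = builder.stream("input-topic");
    // Valid messages go on to the topic Kafka Connect sinks to the DB;
    // everything else is parked on error-topic for the ERROR PROCESSOR.
    input.filter((key, value) -> isValid(value)).to("db-topic");
    input.filterNot((key, value) -> isValid(value)).to("error-topic");

    new KafkaStreams(builder.build(), props).start();
  }

  private static boolean isValid(String value) {
    return value != null && !value.isEmpty(); // placeholder validation
  }
}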

Distributed timer service

I am looking for a distributed timer service. Multiple remote client services should be able to register for callbacks (via REST apis) after specified intervals. The length of an interval can be 1 minute. I can live with an error margin of around 1 minute. The number of such callbacks can go up to 100,000 for now but I would need to scale up later. I have been looking at schedulers like Quartz but I am not sure if they are a fit for the problem. With Quartz, I will probably have to save the callback requests in a DB and poll every minute for overdue requests on 100,000 rows. I am not sure that will scale. Are there any out of the box solutions around? Else, how do I go about building one?
Posting as an answer since I can't comment.
One more option to consider is a message queue where you publish a message with a scheduled delay, so that consumers can consume it after that delay.
Amazon SQS Delay Queues
Delay queues let you postpone the delivery of new messages in a queue for the specified number of seconds. If you create a delay queue, any message that you send to that queue is invisible to consumers for the duration of the delay period. You can use the CreateQueue action to create a delay queue by setting the DelaySeconds attribute to any value between 0 and 900 (15 minutes). You can also change an existing queue into a delay queue using the SetQueueAttributes action to set the queue's DelaySeconds attribute.
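A minimal sketch of creating such a queue with the AWS SDK for Java v2 (you can also set DelaySeconds per message when sending); the queue name is a placeholder:

import java.util.Map;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.CreateQueueRequest;
import software.amazon.awssdk.services.sqs.model.QueueAttributeName;

public class DelayQueue {
  public static void main(String[] args) {
    try (SqsClient sqs = SqsClient.create()) {
      sqs.createQueue(CreateQueueRequest.builder()
          .queueName("callback-queue") // hypothetical name
          .attributes(Map.of(QueueAttributeName.DELAY_SECONDS, "900")) // max: 15 minutes
          .build());
    }
  }
}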
Scheduling Messages with RabbitMQ
https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/
A user can declare an exchange with the type x-delayed-message and then publish messages with the custom header x-delay expressing in milliseconds a delay time for the message. The message will be delivered to the respective queues after x-delay milliseconds.
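A sketch of publishing through that plugin with the Java client, assuming the rabbitmq-delayed-message-exchange plugin is installed on the broker; the exchange and routing key names are placeholders:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Map;

public class DelayedPublish {
  public static void main(String[] args) throws Exception {
    try (Connection conn = new ConnectionFactory().newConnection();
         Channel channel = conn.createChannel()) {
      // The plugin-provided exchange type; x-delayed-type sets the routing behaviour.
      channel.exchangeDeclare("delayed-exchange", "x-delayed-message", true, false,
          Map.<String, Object>of("x-delayed-type", "direct"));
      AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
          .headers(Map.<String, Object>of("x-delay", 60_000)) // deliver in one minute
          .build();
      channel.basicPublish("delayed-exchange", "callbacks", props, "payload".getBytes());
    }
  }
}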
Out of the box solution
RocketMQ meets your requirements since it supports the Scheduled messages:
Scheduled messages differ from normal messages in that they won't be delivered until a provided time later.
You can register your callbacks by sending such messages:
Message message = new Message("TestTopic", "callback payload".getBytes()); // topic + body bytes (payload is a placeholder)
message.setDelayTimeLevel(3); // with the default messageDelayLevel config, level 3 = 10s
producer.send(message);
And then, listen to this topic to deal with your callbacks:
consumer.subscribe("TestTopic", "*");
consumer.registerMessageListener(new MessageListenerConcurrently() {...})
It does well in almost every way, except that the DelayTimeLevel options can only be defined before the RocketMQ server starts, which means that if your MQ server has the configuration messageDelayLevel=1s 5s 10s, then you simply cannot register your callback with delayIntervalTime=3s.
DIY
Quartz plus storage can build such a callback service, as you mentioned, but I don't recommend storing callback data in a relational DB, since you want high TPS, and building a distributed service on one makes it hard to avoid locks and transactions, which add complexity to the DB code.
I suggest storing callback data in Redis instead, because it has better performance than a relational DB and its ZSET data structure suits this scenario well (see the sketch below).
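A minimal sketch of that ZSET approach using Jedis; the key name and polling cadence are assumptions:

import redis.clients.jedis.Jedis;

public class RedisTimer {
  // Register a callback: the score is the epoch-millis timestamp when it is due.
  public static void register(Jedis jedis, String callbackUrl, long dueAtMillis) {
    jedis.zadd("callbacks", dueAtMillis, callbackUrl);
  }

  // Poll loop body (run roughly once per second): fetch everything now due.
  public static void pollOnce(Jedis jedis) {
    long now = System.currentTimeMillis();
    for (String url : jedis.zrangeByScore("callbacks", 0, now)) {
      if (jedis.zrem("callbacks", url) == 1) { // only one worker wins the removal
        // fire the REST callback to `url` here
      }
    }
  }
}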
I once developed a timed callback service based on Redis and Dubbo. It provides some more useful features. Maybe you can get some ideas from it: https://github.com/joooohnli/delay-callback

Using many consumers in SQS Queue

I know that it is possible to consume an SQS queue using multiple threads. I would like to guarantee that each message will be consumed once. I know that it is possible to change the visibility timeout of a message, e.g., to make it equal to my processing time. If my process spends more time than the visibility timeout (e.g. on a slow connection), another thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Similar to the first example: there are a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost, so messages are stored on multiple servers, and this can result in duplication.
For the most part, by taking advantage of the SQS message visibility timeout, the chances of duplication from these sources are already pretty small: like a fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough; reducing the chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window
The simplest is probably a Guava cache, but would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
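A minimal sketch of those steps, using a Guava cache as the atomic store (good for a single processing app, per the note above; swap in a database or Redis for distributed consumption); the timeout values are placeholders:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class Deduplicator {
  // "InProgress": short timeout so a crashed worker frees the message quickly.
  private final Cache<String, Boolean> inProgress =
      CacheBuilder.newBuilder().expireAfterWrite(5, TimeUnit.MINUTES).build();
  // "Completed": the deduplication window, at least 2x the visibility timeout.
  private final Cache<String, Boolean> completed =
      CacheBuilder.newBuilder().expireAfterWrite(24, TimeUnit.HOURS).build();

  public void handle(String messageId, Runnable processor) {
    if (completed.getIfPresent(messageId) != null) {
      return; // already processed: duplicate
    }
    if (inProgress.asMap().putIfAbsent(messageId, Boolean.TRUE) != null) {
      return; // another thread holds the lock: duplicate
    }
    try {
      processor.run(); // exclusive lock on this messageId: safe to process
      completed.put(messageId, Boolean.TRUE);
    } finally {
      inProgress.invalidate(messageId);
    }
  }
}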
Some notes
Keep in mind that the chance of a duplicate without all of that is already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps.
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least as long as 2x your SQS message visibility timeout; the chance of duplication after that is reduced (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, "Processing" will eventually expire, and another thread could process this message (either after SQS visibility timeout also expires or because SQS had a duplicate in it).
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
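A minimal sketch of that check over JDBC; the processed_messages table is hypothetical, and note that some drivers surface the violation as a generic SQLException with SQLState 23xxx rather than the subclass caught here:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class DbDedup {
  // `processed_messages(message_id)` is a hypothetical table with a
  // UNIQUE constraint on message_id.
  public static boolean firstTimeSeen(Connection db, String messageId) throws Exception {
    try (PreparedStatement stmt =
        db.prepareStatement("INSERT INTO processed_messages (message_id) VALUES (?)")) {
      stmt.setString(1, messageId);
      stmt.executeUpdate();
      return true; // insert succeeded: first delivery
    } catch (SQLIntegrityConstraintViolationException e) {
      return false; // unique constraint hit: duplicate
    }
  }
}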
The AWS SQS API doesn't automatically "consume" the message when you read it with the API; the developer needs to make the call to delete the message themselves.
SQS does have a feature called a "redrive policy" as part of the dead-letter queue settings. You just set the maximum receive count to 1: if the consuming process crashes, a subsequent read of the same message will put the message into the dead-letter queue.
The SQS queue visibility timeout can be set up to 12 hours. Unless you have a special need, you then need to implement a process to store the message receipt handle in a database to allow it to be inspected.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This can be done by using a ScheduledExecutorService and scheduling a runnable event after half the initial visibility time. The code snippet below creates and executes the VisibilityTimeExtender every half of the visibility time, with a period of half the visibility time. (The initial time should be enough to guarantee the message gets processed, and it is then extended by visibilityTime/2 at each run.)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

// re-extend the message's visibility every visibilityTime/2 seconds while processing runs
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime / 2, visibilityTime / 2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
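A minimal sketch of such an extender, assuming the AWS SDK for Java v1; queueUrl and receiptHandle come from the received message:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.ChangeMessageVisibilityRequest;

public class VisibilityTimeExtender implements Runnable {
  private final AmazonSQS sqs;
  private final String queueUrl;
  private final String receiptHandle;
  private final int visibilityTime; // seconds

  public VisibilityTimeExtender(AmazonSQS sqs, String queueUrl,
                                String receiptHandle, int visibilityTime) {
    this.sqs = sqs;
    this.queueUrl = queueUrl;
    this.receiptHandle = receiptHandle;
    this.visibilityTime = visibilityTime;
  }

  @Override
  public void run() {
    // Push the invisibility window out again while processing continues.
    sqs.changeMessageVisibility(new ChangeMessageVisibilityRequest(
        queueUrl, receiptHandle, visibilityTime));
  }
}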
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.

Bloomberg Java API - bond yield in real time subscription

Goal:
I use the Bloomberg Java API's subscription service to monitor bond prices in real time (subscribing to ASK/BID real time fields). However, in the RESPONSE messages, Bloomberg does not provide the associated yield for the given price. I need a way to calculate the yields.
Attempt:
Here's what I've tried:
Within the code that processes events coming back from a real time subscription, when I get a BID or ASK response, I extract the price from the message element and then initiate a new synchronous reference data request, using overrides to get YAS_BOND_YLD by providing YAS_BOND_PX and setting the override flag.
Problem:
This seems very slow and cumbersome. Is there a better way other than having to calculate yields myself?
In my code, I seem to be able to process real time prices if they are being sent to me slowly. If a few bonds' prices were updated at the same time (say, in MSG1 pricing), I seem to only capture one of these updates; it feels like I'm missing the other events. Is this because I cannot use a synchronous reference data request while the subscription is still alive?
Thanks.
bloomberg does not provide the associated yield for the given price
Have you tried retrieving the ASK_YIELD and BID_YIELD fields? They may be what you are looking for.
Problem: This seems very slow and cumbersome.
Synchronous one-off requests are slower than a real time subscription. Unless you need real time data on the yield, you could queue the requests and send them all at once every x seconds, for example. The time to get 100 yields or 1 yield is probably not that different, and certainly not 100 times slower.
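A hedged sketch of one such override request with the Bloomberg Java API (blpapi), assuming //blp/refdata has already been opened on the session; the security and price are placeholders, and in practice you would append many queued securities to one request:

import com.bloomberglp.blpapi.CorrelationID;
import com.bloomberglp.blpapi.Element;
import com.bloomberglp.blpapi.Request;
import com.bloomberglp.blpapi.Service;
import com.bloomberglp.blpapi.Session;

public class YieldBatcher {
  // Send one ReferenceDataRequest for the queued (security, price) pairs.
  static void requestYield(Session session, String security, double price) throws Exception {
    Service refData = session.getService("//blp/refdata"); // service opened elsewhere
    Request request = refData.createRequest("ReferenceDataRequest");
    request.getElement("securities").appendValue(security); // e.g. "XS1234567890 Corp"
    request.getElement("fields").appendValue("YAS_BOND_YLD");
    // Override the price so the yield matches the subscription tick.
    Element override = request.getElement("overrides").appendElement();
    override.setElement("fieldId", "YAS_BOND_PX");
    override.setElement("value", price);
    session.sendRequest(request, new CorrelationID(security));
  }
}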
In my code, I seem to be able to process real time prices if they are being sent to me slowly. If a few bonds' prices were updated at the same time (say, in MSG1 pricing), I seem to only capture one of these updates; it feels like I'm missing the other events. Is this because I cannot use a synchronous reference data request while the subscription is still alive?
You should not miss items just because you are sending a synchronous request. You may get a "Slow consumer warning" but that's about it. It's difficult to say more without seeing your code. However, if you want to make sure your real time data is not delayed by your synchronous requests, you should use two separate Sessions.