Multi-level queue design

I am designing a system to deal with an external web service. The service limits the number of requests that can be made over a certain period of time (T). The system allows for the batching of a certain number of requests (R). There are a certain number of operations that the service supports (O).
My code will process an unknown number of requests from users (I really have no idea at this point, could be one request a day, could be thousands a second, I need to build it with the assumption of thousands a second though). These results will be cached in a database for a period of time. When the database records are out of date the system will need to request the data from the web service again.
I can only access the web service via one IP address with one account (no cheating and getting an account per operation type, or one machine per operation type). The system will (hopefully) all run on a single server.
What I am trying to do (been thinking about it on and off for a couple of weeks without any results that I like) is come up with a system where:
duplicate requests are merged (duplicate means they have the same request data)
user requests have priority over system requests
a system request can be changed to a user request (database update is in the queue and a user is requesting the same data)
if there are not R user requests for the particular operation then the remainder are taken from the system requests
user requests are handled in the same order that they come in (except that once a user request is being handled, R requests of the same type are handled)
So, for example, T is 1 second, R is 3, and O is 2. The following requests come into the system:
Request 1, user, operation A, data 1
Request 2, user, operation A, data 2
Request 3, user, operation A, data 1 <- duplicate of request 1
Request 4, system, operation B, data 3
Request 5, system, operation A, data 1 <- duplicate of request 3
Request 6, user, operation B, data 3 <- duplicate of Request 4
Request 7, system, operation A, data 4
Request 8, user, operation A, data 5
Request 9, user, operation A, data 6
Request 10, user, operation A, data 7
Request 11, user, operation B, data 8
Once you deal with the duplicates you would get this:
Request 1, user, operation A, data 1
Request 2, user, operation A, data 2
Request 4, user, operation B, data 3 <- promoted to user from system (msg 6)
Request 7, system, operation A, data 4
Request 8, user, operation A, data 5
Request 9, user, operation A, data 6
Request 10, user, operation A, data 7
Request 11, user, operation B, data 8
The requests should be handled in the following order:
T1 Request 1, Request 2, Request 8
T2 Request 4, Request 11
T3 Request 9, Request 10, Request 7
I think there will likely be 3-7 operation types. Some operation types will have more requests than others. System requests will likely be larger in number than user requests.
Is there a common way of dealing with this sort of problem? A pattern or technology? Am I overthinking it (unfortunately I cannot get the usage statistics until after it is up and running; I cannot even reasonably guess at what they will be)?
The primary things I am trying to avoid are:
having system requests handled over user requests (a system request can wait for weeks, a user request must be processed as soon as it can be)
making the same request twice in the period that the data is cached in the database

I'd solve that by having two queues: one for user requests and one for system requests. Design each queue as a lexicographically ordered set containing tuples of (operation type, data, arrival time); this assumes that you can define an ordering over your data pieces. Ordered sets allow searching by partial keys, which lets you check for duplicate requests in both queues and also promote a system request to a user request. Though, I do not quite understand the role of the T variable.
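To make that concrete, here is a minimal Scala sketch of the two-queue idea, assuming the request data can be represented as an ordered value (a String here) and ignoring the rate-limit timer T. Duplicates collapse onto the same (operation, data) key, a user arrival removes any queued system copy (promotion), and a batch of up to R requests for one operation is filled from the user queue first, then topped up from the system queue. The Request and RequestQueues names are invented for the example.

import scala.collection.mutable

// Hypothetical request shape; "data" is assumed to be orderable (a String here).
final case class Request(op: String, data: String, arrivalNanos: Long)

class RequestQueues {
  // One ordered map per priority level, keyed by (operation, data) so duplicates
  // merge onto the same key and partial-key scans by operation are possible.
  private val userQ   = mutable.TreeMap.empty[(String, String), Request]
  private val systemQ = mutable.TreeMap.empty[(String, String), Request]

  def enqueue(req: Request, fromUser: Boolean): Unit = {
    val key = (req.op, req.data)
    if (userQ.contains(key)) ()                  // duplicate of a queued user request: merge
    else if (fromUser) {
      systemQ.remove(key)                        // promotion: drop the queued system copy, if any
      userQ.put(key, req)
    } else if (!systemQ.contains(key)) {
      systemQ.put(key, req)                      // new system request
    }
  }

  // Up to batchSize requests for one operation: users first (oldest first),
  // the remainder taken from the system queue.
  def nextBatch(op: String, batchSize: Int): Seq[Request] = {
    def drain(q: mutable.TreeMap[(String, String), Request], n: Int): Seq[Request] = {
      val matching = q.rangeFrom((op, "")).iterator.takeWhile(_._1._1 == op).map(_._2).toSeq
      val picked   = matching.sortBy(_.arrivalNanos).take(n)
      picked.foreach(r => q.remove((r.op, r.data)))
      picked
    }
    val users = drain(userQ, batchSize)
    users ++ drain(systemQ, batchSize - users.size)
  }
}

A real implementation would still need the scheduler around this: something that decides which operation's batch to send next and enforces the one-batch-per-T rate limit.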

Related

Handling multiple requests with same body - REST API

Let's say I have a microservice which just registers a user in the database, and we expose it to our client. I want to understand the best way of handling the following scenario:
What if the user sends multiple requests in parallel (say 10 requests within 1 second) with the same request body? Should I keep the requests in a queue, register the user for the very first request, and deny the other 9? Or should I compare the request bodies, pick one request for each distinct body, and reject the rest? Or what is the best thing I can do to handle this scenario?
One more thing I would like to understand: is it recommended to have rate-limiting (say n requests per minute) at a global API level or at the microservice level?
Thanks in advance!
The best way is to use an idempotent call. Instead of exposing an endpoint like this:
POST /users + payload
Expose an endpoint like this:
PUT /user/ID + payload
You let the caller generate the id, and you require it to be a UUID. With a UUID, it does not matter who generates it. This way, if the caller invokes your endpoint multiple times, the first call will create the user and the following calls will just update the user with the same payload, which means they do nothing. At least you won't generate duplicates.
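For illustration, here is a small sketch of what an idempotent PUT handler could look like. The User type, the in-memory TrieMap standing in for the database, and the putUser name are all made up for the example.

import java.util.UUID
import scala.collection.concurrent.TrieMap

object UserRegistration {
  final case class User(id: UUID, name: String, email: String)

  // Stand-in for the users table; a real service would upsert into the database.
  private val users = TrieMap.empty[UUID, User]

  // Handler for PUT /user/:id — calling it N times with the same id and payload
  // leaves exactly one user in the same state, so duplicate requests are harmless.
  def putUser(id: UUID, payload: User): User = {
    val stored = payload.copy(id = id)
    users.put(id, stored)   // insert-or-replace: the second and later calls change nothing
    stored
  }
}

The key point is that the caller supplies the UUID, so retries of the same logical request always target the same resource.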
It's always a good practice to protect your services with rate-limiting. You have to set it at the API level. If you define it at the microservice level, you will authorize N times the rate if you have N instances, because the requests will be distributed across them.

Architecture for ML jobs platform

I'm building a platform to run ML jobs.
Jobs will be started from an interface.
I'm making a service for each type of job. Sometimes, a service S1 might need to first make a request to another service S2 and get its output before running its own job.
Each service is split into 2 Kubernetes deployments:
one that will pull the message from a topic, check it and persist it to a database (D1)
one that will read requests from the database, run the actual job, update the request state in the database and then answer the client (D2)
Here is the flow:
interface generates a PubSub message to a topic T1
D1 pulls the message from T1 and persists a request to the database
D2 sees the new request in the database, runs it, then updates its state in the database and answers the client
To answer the client, D2 has 2 options:
push a message to a PubSub topic T2 that will be continuously checked by the client. An id is passed in both request and response so that only the client can pull it from the topic.
use a callback provided by the client to make a POST request
What do you think about this architecture? Does the usage of PubSub make sense? Also, does it make sense to split each service into 2 deployments (1 that deals with the request, 1 that runs the actual job)?
interface generates a PubSub message to a topic T1; D1 pulls the message from T1 and persists a request to the database
If there's only one database, I'm not sure I see much advantage in using a topic (implying pub/sub). Another approach would be to use a queue: the interface creates jobs into the queue, then you can have any number of workers processing it. Depending on the situation you may not even need the database at all - if all the data needed can be in the message in the queue.
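As an illustration of that queue-plus-workers shape, here is an in-process Scala sketch only; in practice the queue would be SQS, RabbitMQ, a Pub/Sub pull subscription, or similar, and Job/runJob are made-up names.

import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

// Hypothetical job payload; in practice this would be the queue message body.
final case class Job(id: String, params: Map[String, String])

object JobQueueSketch {
  // Stand-in for a managed queue.
  private val queue   = new LinkedBlockingQueue[Job]()
  private val workers = Executors.newFixedThreadPool(4)

  def submit(job: Job): Unit = queue.put(job)       // what the interface would do

  def startWorkers(): Unit =
    (1 to 4).foreach { _ =>
      workers.submit(new Runnable {
        def run(): Unit = while (!Thread.currentThread().isInterrupted) {
          val job = queue.poll(1, TimeUnit.SECONDS)  // blocks briefly waiting for work
          if (job != null) runJob(job)               // run the ML job, then report the result
        }
      })
    }

  private def runJob(job: Job): Unit = println(s"running ${job.id}")
}

Scaling out is then just a matter of running more workers against the same queue.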
use a callback provided by the client to make a POST request
That's better if you can do it, on the assumption that there's only one consumer for the event; pub/sub is more for broadcasting out to multiple consumers. Polling works but is really inefficient and has limits on how much it can scale.
Also, does it make sense to split each service into 2 deployments (1 that deals with the request, 1 that runs the actual job)?
Having separate deployables makes sense if they are built by different teams, have a different release cadence, or need to scale out independently; otherwise it may not be necessary.

Scala and playframework shared cache between nodes

I have a complex problem and I can't figure out the best way to solve it.
This is the scenario:
I have N servers under a single load balancer and a Database.
All the servers connect to the database
All the servers run the same identical application
I want to implement a cache in order to decrease the response time and reduce the Server -> Database HTTP calls to a minimum.
I implemented it and it works like a charm on a single server... but I need to find a mechanism to update all the other caches on the other servers when the data is no longer valid.
example:
I have server A and server B, both have their own cache.
The first request from the outside (for example, get user information) is answered by server A.
Its cache is empty, so it needs to get the information from the database.
The second request goes to server B; its cache is also empty, so it also needs to get the information from the database.
The third request, again on server A: now the data is in the cache, so it replies immediately without a database request.
The fourth request, on server B, is a write request (for example, change user name); server B can make the change in the database and update its own cache, invalidating the old user.
But server A still has the old, invalid user.
So I need a mechanism for server B to communicate to server A (or N other servers) to invalidate/update the data in the cache.
What is the best way to do this in the Scala Play framework?
Also, consider that in the future servers can be in geo-redundancy, so in different geographical locations, in a different network, served by a different ISP.
It would also be great to update all the other caches when one user is loaded (one server's request to the database updates all the servers' caches); this way all the servers are ready for future requests.
Hope I have been clear.
Thanks
Since you're using Play, which already uses Akka under the hood, I suggest using Akka Cluster Sharding. With this, the instances of your Play service would form a cluster at startup (including failure detection, etc.) and organize between themselves which instance owns a particular user's information.
So proceeding through your requests, the first request to GET /userinfo/:uid hits server A. The request handler hashes uid (e.g. with murmur3: consistent hashing is important) and resolves it to, e.g., shard 27. Since the instances started, this is the first time we've had a request involving a user in shard 27, so shard 27 is created and let's say it gets owned by server A. We send a message (e.g. GetUserInfoFor(uid)) to a new UserInfoActor which loads the required data from the DB, stores it in its state, and replies. The Play API handler receives the reply and generates a response to the HTTP request.
For the second request, it's for the same uid, but hits server B. The handler resolves it to shard 27 and its cluster sharding knows that A owns that shard, so it sends a message to the UserInfoActor on A for that uid which has the data in memory. It replies with the info and the Play API handler generates a response to the HTTP request from the reply.
In this way, all subsequent requests (e.g. the third, the same GET hitting server A) for the user info will not touch the DB, no matter which server they hit.
For the fourth request, which let's say is POST /userinfo/:uid and hits server B, the request handler again hashes the uid to shard 27 but this time, we send, e.g., an UpdateUserInfoFor(uid, newInfo) message to that UserInfoActor on server A. The actor receives the message, updates the DB, updates its in-memory user info and replies (either something simple like Done or the new info). The request handler generates a response from that reply.
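For reference, a rough Akka Typed sketch of that entity and its wiring might look like the following; UserInfo, loadFromDb and writeToDb are placeholders, and passivation, timeouts and error handling are omitted.

import akka.Done
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

// Hypothetical user-info payload; replace with whatever the DB row looks like.
final case class UserInfo(uid: String, name: String)

object UserInfoActor {
  sealed trait Command
  final case class GetUserInfoFor(uid: String, replyTo: ActorRef[UserInfo]) extends Command
  final case class UpdateUserInfoFor(uid: String, newInfo: UserInfo, replyTo: ActorRef[Done]) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("UserInfo")

  // One entity instance per shard-resolved uid; cluster sharding decides which node hosts it.
  def apply(entityId: String): Behavior[Command] = Behaviors.setup { _ =>
    var cached: Option[UserInfo] = None              // in-memory copy of the DB row

    Behaviors.receiveMessage {
      case GetUserInfoFor(uid, replyTo) =>
        val info = cached.getOrElse(loadFromDb(uid)) // only the first request touches the DB
        cached = Some(info)
        replyTo ! info
        Behaviors.same
      case UpdateUserInfoFor(_, newInfo, replyTo) =>
        writeToDb(newInfo)                           // update the DB first, then the in-memory copy
        cached = Some(newInfo)
        replyTo ! Done
        Behaviors.same
    }
  }

  private def loadFromDb(uid: String): UserInfo = UserInfo(uid, "loaded-from-db") // stub
  private def writeToDb(info: UserInfo): Unit = ()                                // stub
}

// At startup (e.g. from a Play module), with a typed ActorSystem:
//   val sharding = ClusterSharding(actorSystem)
//   sharding.init(Entity(UserInfoActor.TypeKey)(ctx => UserInfoActor(ctx.entityId)))
// In a request handler, resolve the entity by uid and ask it (implicit Timeout required):
//   sharding.entityRefFor(UserInfoActor.TypeKey, uid).ask(UserInfoActor.GetUserInfoFor(uid, _))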
This works really well: I've personally seen systems using cluster sharding keep terabytes in memory and operate with consistent single-digit millisecond latency for streaming analytics with interactive queries. Servers crash, and the actors running on the servers get rebalanced to surviving instances.
It's important to note that anything matching your requirements is a distributed system and you're requiring strong consistency, i.e. you're requiring that it be unavailable under a network partition (if B is unable to communicate an update to A, it has no choice but to fail the request). Once you start talking about geo-redundancy and multiple ISPs, you're going to see partitions pretty regularly. The only way to get availability under a network partition is to relax the consistency demand and accept that sometimes the GET will not incorporate the latest PUT/POST/DELETE.
This is probably not something that you want to build yourself. But there are plenty of distributed caches out there that you can use, such as Ehcache or Infinispan. I suggest you look into one of those two.

What is the purpose of Chubby Sequencers

While reading the Google article about Chubby, I didn't really understand the purpose of sequencers.
Assume we have 4 entities :
Chubby cell
Client 1
Client 2
Service we want to use and where we will send the requests (for which we need the lock)
As far as I understood the steps are:
Client 1 sends lock_request() to the Chubby cell; Chubby responds with a Sequencer (assume SequenceNumber = 1)
Client 1 sends the request modify_data() with the Sequencer (SequenceNumber = 1) to the Service
The Service asks the Chubby cell if the SequenceNumber is valid (= 1)
Chubby acknowledges it and sets the LeasePeriod (lock expiration period, assume 60 seconds)
! during this time no one else is able to acquire the lock
After the acknowledgment, the Service caches the data about Client 1 (SequenceNumber = 1) for (assume) 40 seconds
Now:
if Client 2 tries to acquire the lock during these 60 seconds, it will be rejected by the Chubby cell
that means it is impossible for Client 2 to acquire the lock with the next SequenceNumber = 2 and send anything to the Service
As far as I understand, the whole purpose of the SequenceNumber is the situation where 2 requests come to the Service and the Service can just compare the 2 SequenceNumbers and reject the lower one, without needing to ask the Chubby cell
But how would this situation ever happen if we have caches and Client 2 cannot acquire the lock while Client 1 is holding it?
It would be a mistake to think about timing in distributed systems in terms of actual times (like seconds), but I'll try to answer using the same semantics.
As you said, say Client1 acquires a write lock named foo1, foo being the lock name and 1 being the generation number. Now say the lease period is 60 seconds. At the 58th second, Client1 sends a write, say R1. And soon enough, Client1 is dead.
Now, here's the catch. You assumed in your analysis that R1 would reach the server within those 2 seconds, before another client, say Client2, becomes master. THAT'S JUST NOT CERTAIN.
In a distributed system, with fractions-of-a-millisecond network latencies on one hand and network partitions on the other, you just cannot ascertain what reaches the master first: R1 or Client2's request to become master.
This is where sequence numbers help. The master, now knowing that there is foo2, can reject R1, which came with foo1 in its metadata.
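To illustrate that check (this is not the Chubby API; Sequencer, GuardedService and modifyData are invented names), the receiving service only has to remember the highest generation it has seen per lock name and refuse anything older:

// A minimal sketch of rejecting requests issued under a stale lock generation.
final case class Sequencer(lockName: String, generation: Long)

class GuardedService {
  // Highest generation observed so far for each lock name (hypothetical bookkeeping).
  private var latest = Map.empty[String, Long]

  def modifyData(seq: Sequencer, payload: String): Either[String, String] = synchronized {
    val known = latest.getOrElse(seq.lockName, 0L)
    if (seq.generation < known)
      Left(s"rejected: ${seq.lockName}${seq.generation} is older than ${seq.lockName}$known")
    else {
      latest += seq.lockName -> seq.generation
      Right(s"applied '$payload' under ${seq.lockName}${seq.generation}") // do the actual write here
    }
  }
}

// Client2's write under foo2 arrives first, then Client1's delayed R1 under foo1 is rejected:
//   val svc = new GuardedService
//   svc.modifyData(Sequencer("foo", 2), "newer write")   // Right(...)
//   svc.modifyData(Sequencer("foo", 1), "R1")            // Left("rejected: ...")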
Read more about generational clocks/logical clocks here.
A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. Often, distributed systems may have no physically synchronous global clock. Fortunately, in many applications (such as distributed GNU make), if two processes never interact, the lack of synchronization is unobservable. Moreover, in these applications, it suffices for the processes to agree on the event ordering (i.e., logical clock) rather than the wall-clock time.[1]

Reactive vs Non Reactive Response Time

For a system powerful enough to serve a number of requests (not running out of threads), would there be a difference, from the user's perspective, in terms of response time / speed?
Also, is the database the only thing that usually blocks the thread, and hence the reason we need a reactive DB driver?
I mean, if a REST endpoint does not make calls to the DB, would there be no difference whether the endpoint is reactive or not?
First of all, you need to know what happens when you use the Project Reactor / WebFlux client.
Let's assume your endpoint (call it /demo) is responsible for making 5 async calls to other systems in order to build its own response.
Example response times:
Service A: 5 ms
Service B: 50 ms
Service C: 100 ms
Service D: 250 ms
Service E: 400 ms
Typical non-WebFlux client way:
5 threads are consumed; the last one (calling service E) is blocked for 400 ms.
WebFlux client way:
Each call to services A, B, C, D, E consumes a thread only to make the call, then returns it; when the response comes back, another thread is consumed to process the response.
The final conclusion:
If your system is overloaded by a big amount of requests (say n) at the same time, in the blocking approach you will lock n threads for 400 ms.
Try to imagine the scale of the problem.
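As a purely conceptual sketch of that difference (plain Scala Futures and a timer, not WebFlux/Reactor itself; callBlocking and callAsync are made-up stand-ins for remote calls): the blocking variant parks a thread for the whole response time, while the async variant holds no thread while waiting and only borrows one when the response arrives.

import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}

object BlockingVsNonBlocking {
  implicit val ec: ExecutionContext = ExecutionContext.global
  private val timer = Executors.newSingleThreadScheduledExecutor()

  // Blocking call: the thread running this is unavailable for the full delay (e.g. 400 ms).
  def callBlocking(delayMs: Long): String = {
    Thread.sleep(delayMs)
    "ok"
  }

  // Non-blocking call: returns immediately; a timer completes the promise later,
  // so no request-handling thread is held while "waiting" for the remote service.
  def callAsync(delayMs: Long): Future[String] = {
    val p = Promise[String]()
    timer.schedule(new Runnable { def run(): Unit = p.success("ok") }, delayMs, TimeUnit.MILLISECONDS)
    p.future
  }

  def main(args: Array[String]): Unit = {
    // n concurrent blocking calls park n threads for 400 ms each;
    // n async calls park none while waiting and only borrow a thread per completion callback.
    callAsync(400).foreach(r => println(s"async result: $r"))
    println(s"blocking result: ${callBlocking(400)}")
    Thread.sleep(100) // give the async callback (running on the global pool) time to print
    timer.shutdown()
  }
}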