How to configure channels and AMQ for spring-batch-integration where all steps are run as slaves on another cluster member - spring-batch

Followup to Configuration of MessageChannelPartitionHandler for assortment of remote steps
Even though the first question was answered (well, I think), I'm confused enough that I'm not able to ask the right questions. Please direct me if you can.
Here is a sketch of the architecture I am attempting to build. Today, we have a working job that runs a step across the cluster. We want to extend the architecture to run n (unbounded and different) jobs with n (unbounded and different) remote steps across the cluster.
I am not confusing job executions and job instances with jobs. We already run multiple job instances across the cluster. We need to be able to run other processes that are scalable in the same way as the one we have defined already.
The source data all comes from databases which are known to the steps. The partitioner defines the range of data for the "where" clause in the source database and puts that in the stepContext. All of the actual work happens in the stepContext. The jobContext simply serves to spawn steps, wait for completion, and provide the control API.
There will be 0 to n jobs running concurrently, with 0 to n steps from however many jobs running on the slave VMs concurrently.
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the MessageChannelPartitionHandler requires a separate reply channel per step.
What I think is unclear (but I can't tell, since it's unclear) is how the single reply channel is picked up by the aggregatedReplyChannel and then returned to the correct step. Of course I could be so lost that I'm asking the wrong questions.
Thank you in advance

Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
No, there is no need for that. StepExecutionRequests are identified with a correlation ID that makes it possible to distinguish them.
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
That should not be the case, as requests are uniquely identified with a correlation ID (similar to the previous point).
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the MessageChannelPartitionHandler requires a separate reply channel per step.
The MessageChannelPartitionHandler should be step- or job-scoped, as mentioned in its Javadoc (see the recommendation in the last note). As a side note, there was an issue with messages crossing in a previous version due to the reply channel being instance based, but it was fixed here.
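For illustration, here is a minimal sketch of a step-scoped handler, assuming the JobExplorer-polling mode of MessageChannelPartitionHandler (which sidesteps reply-channel aggregation entirely); the bean wiring, step name, grid size, and poll interval are assumptions, not your actual configuration:

@Bean
@StepScope // per the Javadoc recommendation: one handler instance per step execution
public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                                       JobExplorer jobExplorer) throws Exception {
    MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
    handler.setMessagingOperations(messagingTemplate); // sends StepExecutionRequests on the shared request channel
    handler.setStepName("workerStep");                 // the remote step the slaves execute
    handler.setGridSize(4);
    handler.setJobExplorer(jobExplorer);               // poll the job repository for completed partitions
    handler.setPollInterval(5000);                     // instead of aggregating replies on a reply channel
    handler.afterPropertiesSet();
    return handler;
}

Because each step execution gets its own scoped handler while the request channel stays shared, the correlation-ID mechanism described above can still route everything correctly.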

Related

How to distribute tasks between servers where each task must be done by only one server?

Goal: There are X backend servers and Y tasks. Each task must be done by only one server; the same task being run by two different servers should not happen.
Some tasks involve continuous work for an indefinite amount of time, such as polling for data. The same server can keep doing such a task as long as it stays alive.
Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?
Well, the way you define your problem makes it sloppy to reason about. What you are actually looking for is called a "distributed lock".
Let's start with a simpler problem: assume you have only two concurrent servers, S1 and S2, and a single task T. The safety property you stated remains as is: at no point in time may both S1 and S2 process task T. How could that be achieved? The following strategies come to mind:
Implement an algorithm that deterministically maps a task to a responsible server. For example, it could be as stupid as: if task.name.contains('foo') then server1.process(task) else server2.process(task). That works and indeed might fit some real-world requirements out there, yet such an approach is a dead end: a) you have to know upfront, statically, how many servers you will have, and b) most dangerously, you cannot tolerate either server being down: if, say, S1 is taken offline, there is nothing you can do with T right now except wait for S1 to come back online. These drawbacks can be softened and optimized, yet there is no way to get rid of them; escaping these deficiencies requires a more dynamic approach.
Implement an algorithm that allows S1 and S2 to agree upon who is responsible for T. Basically, you want both S1 and S2 to come to a consensus about the (assumed, not necessarily needed) T.is_processed_by = "S1" or T.is_processed_by = "S2" property's value. Then your requirement translates to "at any point in time is_processed_by is seen by both servers in the same way". Hence "consensus": an agreement (between the servers) about the is_processed_by value. Having that eliminates all the "too static" issues of the previous strategy: you are no longer bound to 2 servers, you could have n servers, n > 1 (provided that your distributed consensus works for the chosen n). However, it is not prepared for accidents like an unexpected power outage. It could be that S1 won the competition, is_processed_by became equal to "S1", S2 agreed with that and... S1 went down and did nothing useful...
...so you're missing the last bit: the "liveness" property. In simple words, you'd like your system to continuously make progress whenever possible. To achieve that property (among many other things I am not mentioning) you have to make sure that a server's spontaneous death is monitored and, once it happens, no task T stays stuck indefinitely. How do you achieve that? That's another story, but a typical practical solution is to copy the good old TCP way of doing essentially the same thing: meet the keepalive approach.
OK, let's sum up what we have by now:
Take any implementation of "distributed locking", which is equivalent to "distributed consensus". It could be ZooKeeper used correctly, PostgreSQL running a serializable transaction, or anything alike.
For each unprocessed or stuck task T in your system, make all the free servers S race for that lock. Only one of them is guaranteed to win; all the rest will surely lose.
Push TCP-keepalive-style notifications frequently enough, per processing task or at least per live server. Missing, let's say, three notifications in a row should be taken as the server's death, and all of its tasks should be re-marked as "stuck" and (eventually) reprocessed as in the previous step.
And that's it.
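As an illustration of steps 1 and 2, here is a hypothetical sketch of racing for a lease, with PostgreSQL playing the part of the lock service; the tasks table and its columns are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

class TaskLease {
    // Servers race to claim task T with a compare-and-set UPDATE; the row-level
    // lock serializes the racers, so at most one server's UPDATE matches and wins.
    static boolean claim(DataSource db, long taskId, String serverId) throws SQLException {
        String sql = "UPDATE tasks SET owner = ?, lease_expires_at = now() + interval '30 seconds' "
                   + "WHERE id = ? AND (owner IS NULL OR lease_expires_at < now())";
        try (Connection conn = db.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, serverId);
            ps.setLong(2, taskId);
            return ps.executeUpdate() == 1;
        }
    }
}

Periodically renewing lease_expires_at plays the role of the keepalive in step 3: a server that dies simply stops renewing, and its tasks become claimable again.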
P.S. Safety & liveness properties are something you'd definitely want to be aware of once it comes to distributed computing.
Try RabbitMQ worker queues:
https://www.rabbitmq.com/tutorials/tutorial-two-python.html
It has an acknowledgement feature, so if a task fails or a server crashes it will automatically replay your task. Based on your specific use case you can set up retries, etc.
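For a flavor of that, here is a minimal worker sketch with manual acknowledgements using the RabbitMQ Java client; the queue name, host, and process method are assumptions. If the worker dies before acking, the broker redelivers the message to another worker.

import com.rabbitmq.client.*;
import java.nio.charset.StandardCharsets;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Channel channel = factory.newConnection().createChannel();
        channel.queueDeclare("task_queue", true, false, false, null); // durable queue
        channel.basicQos(1); // hand out at most one unacked task per worker

        DeliverCallback callback = (consumerTag, delivery) -> {
            String task = new String(delivery.getBody(), StandardCharsets.UTF_8);
            try {
                process(task);
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } catch (Exception e) {
                // requeue=true puts the task back so another worker can retry it
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
            }
        };
        // autoAck=false: unacked messages are redelivered if this worker dies
        channel.basicConsume("task_queue", false, callback, consumerTag -> {});
    }

    static void process(String task) { /* do the actual work */ }
}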
"Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?"
You are getting into a known problem in distributed systems: how does a system make decisions when the system is partitioned? Let me elaborate on this.
A simple statement like "server dies" requires quite a deep dive into what it actually means. Did the server lose power? Is the network between your control plane and the server down (while the task keeps running)? Or maybe the task completed successfully, but the failure happened just before the task server was about to report it? If you want to be 100% correct in deciding the current state of the system, that is the same as saying the system has to be 100% consistent.
This is where the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem) comes into play. Since your system may be partitioned at any time (a worker server may get disconnected or die, which is the same state) and you want to be 100% correct/consistent, this means that the system won't be 100% available.
To reiterate the previous paragraph: if the system suspects a task server is down, the system as a whole will have to come to a stop until it is able to determine what happened with that particular task server.
The trade-off between consistency and availability is at the core of distributed systems. Since you want to be 100% correct, you won't have 100% availability.
While availability won't be 100%, you can still improve the system to make it as available as possible. Several approaches may help with that.
The simplest one is to alert a human when the system suspects a task server is down. The human gets a notification (24/7), wakes up, logs in, and manually checks what is going on. Whether this approach works for your case depends on how much availability you need, but it is completely legitimate and widely used in the industry (those engineers carrying pagers).
A more complicated approach is to let the system fail over to another task server automatically, if that is possible. A few options are available here, depending on the type of task.
The first type of task is re-runnable, but it has to exist as a single instance. In this case, the system uses the "STONITH" (shoot the other node in the head) technique to make sure the previous node is dead for good. For example, in a cloud the system would actually kill the whole container of the task server and then start a new container as a failover.
The second type of task is not re-runnable. For example, a task transferring money from account A to account B is not (automatically) re-runnable: the system does not know whether the task failed before or after the money was moved. Hence, the failover needs to do extra steps to calculate the outcome, which may also be impossible if the network is not working correctly. In these cases the system usually comes to a halt until it can make a 100% correct decision.
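One common way to make such a task safely re-runnable anyway is to record an idempotency key in the same transaction as the money movement; here is a hypothetical sketch (the table, its columns, and the debit/credit helpers are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

class Transfer {
    // Runs the transfer at most once, even if the task is retried after a crash.
    static void run(DataSource db, String transferId, long from, long to, long cents)
            throws SQLException {
        try (Connection conn = db.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO completed_transfers (transfer_id) VALUES (?)")) {
                ps.setString(1, transferId); // unique constraint: a retry fails right here
                ps.executeUpdate();
                debit(conn, from, cents);    // the key and the money movement commit
                credit(conn, to, cents);     // atomically, so a retry runs exactly once
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                if ("23505".equals(e.getSQLState())) return; // unique violation: already done
                throw e;
            }
        }
    }

    static void debit(Connection c, long account, long cents) throws SQLException { /* ... */ }
    static void credit(Connection c, long account, long cents) throws SQLException { /* ... */ }
}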
None of these options will give 100% availability, but they can do as well as possible given the nature of distributed systems.

Uber Cadence task list management

We are using Uber Cadence and periodically we run into issues in the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains; the single BE server is registered as a worker for all of them.
What is visible in the logs is that sometimes during a query Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the backend, meanwhile, nothing is visible. However, during this time, if I check the pollers in the Cadence web client I see that the task list is there, but it is no longer considered a decision handler (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, because there is nothing that can make progress with the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to be a decision handler? Is it managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when stickiness is lost Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow is not possible (although I think that would generate something in the backend log from the Cadence client), or that at that point the worker has already been removed from the list and that causes the timeout?
Any help would be more than welcome! Thanks!
Does anyone know how a worker or task list can lose its ability to be a decision handler?
This will happen when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So apparently, for some reason, your worker has stopped polling for decision tasks.
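For reference, here is a hedged sketch of the worker setup with the Cadence Java client (2.x era; the exact factory methods vary by version, and the workflow/activity classes are placeholders). A worker only polls for decision tasks, and can therefore answer queries, if workflow implementations are registered on it:

// workflowClient: a WorkflowClient already configured for your domain (setup omitted)
WorkerFactory factory = WorkerFactory.newInstance(workflowClient);
Worker worker = factory.newWorker("mytasklist");
worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class); // enables decision (and query) polling
worker.registerActivitiesImplementations(new MyActivitiesImpl()); // activity polling only
factory.start();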
As I understand it, when stickiness is lost Cadence will look for another worker to replay the workflow and continue.
Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types (this is a design flaw; we are working on separating them).
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the query task to a worker, and the worker accepted it but didn't respond back within the timeout.
It is quite possible that there is a bug in your query handler logic. The bug caused the decision worker to crash (which means the Cadence Java client also has a bug: user code crashing shouldn't crash the worker). A query task then looped over all instances of your worker pool and finally crashed all your decision workers.

Is there a way of assigning an int number to different instances of stateless services?

I'm building a solution where we'll have a (Service Fabric) stateless service deployed to K instances. This service is tasked with some workload (like querying) and I want to split the workload between the instances as evenly as I can. I also want this to be dynamic: if I decide to go from K instances to N instances tomorrow, the load should automatically be distributed across the N instances. I don't have any partitions specified for this service.
As an example -
Let's say I'd like to query a database to retrieve a particular chunk of the records. I have 5 nodes, and I want each of these 5 nodes to retrieve a different 1/5th of the set of records. This can be achieved through some query logic like (row_id % N == K), where N is the total number of instances and K is the unique instance number.
I was hoping to leverage FabricRuntime.GetNodeContext().NodeId, but this returns a GUID, which is not overly useful.
I'm looking for a way to deterministically say "this is instance number M out of N" (I need to be able to number the instances 1..N) so I can set my querying logic accordingly. One of the requirements is that if an instance goes down, crashes, etc., and SF automatically restarts it, it should still identify as the same instance ID, so that two or more nodes don't query the same set of results.
What is the best way of solving this problem? Is there a solution which involves pure configuration through ApplicationManifest.xml or ServiceManifest.xml?
There is no out-of-the-box solution for your problem, but it can easily be done in many different ways.
The simplest way is to use the Queue-Based Load Leveling pattern in conjunction with the Competing Consumers pattern.
It consists of creating a queue, adding the work to the queue, and having each instance take one message to process; if an instance goes down and its message is not processed, the message goes back to the queue and another instance picks it up.
This way you don't have to worry about the number of instances running, failures and so on.
Regarding the work being put in the queue, it depends on whether you want to do batch processing or process item by item.
Item by item: you put one message in the queue for each item to be processed. This is a simple way to handle the work, and each instance processes one message at a time, or multiple messages in parallel.
In batch: you put a message that represents a list of items to be processed, and each instance processes that batch until it is completed. This is a bit trickier, because you might have to track the progress of the work being done so that, in case of failure, the next attempt can continue from where it stopped.
The queue approach is a reactive design; work needs to be put in the queue to trigger the processing. If you want a proactive approach and need to keep track of which work goes to whom, you would probably be better off with some other approach, like a leasing mechanism, where each instance acquires a lease that belongs to it until it releases the lease. This is more suitable when you work with partitioned data or another mechanism where you can easily split the load.
Regarding the issue with the ID, an option would be the InstanceId of the replica you are on, which you can reach via StatelessService.Context.InstanceId. It is not a sequential ID, but a random number. It is still better than using the node ID, because you might have multiple partitions on the same node and the IDs would conflict with each other.
If you decide to use named partitions, you could use the order in the partition name instead, so that each partition has a sequential name.
It is worth mentioning that Service Fabric has a limitation that doesn't allow services to have multiple replicas on the same node; because of this limitation you might have to design your services with it in mind, otherwise you won't be able to scale out once the limit is reached. Also, the same thread has some discussion about approaches to processing multiple distributed items that might give you some ideas.

Communication protocol

I'm developing a distributed system that consists of master and worker servers. There should be two kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with the appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All ok, but throw away c - we don't need it anymore"
The participants exchange these messages at interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communication for both message types (in the case of the heartbeat, we would include another message from the master to start the exchange). But let's assume that it is expensive to do all the heartbeat housekeeping on the master side for N workers, and that we don't want to waste resources keeping several TCP connections to the worker servers, so we have just one.
Is there any solution for these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that its accountabilibuddy is not responding anymore, it can alert the master.
The same thing applies to the list of jobs currently running: who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what their accountabilibuddy is supposed to be doing) only works so far; if a and b both crash, and a is the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the Raft consensus algorithm. It's not too hard to understand, but it still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never properly work with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
As Filip Haglund said, read the Raft paper; you should also implement it yourself. In a nutshell, here is what you need to extract from it in regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft does its heartbeat sending on the master's end; it is not very expensive network-wise, and you don't need to keep connections open. Every 200 ms to a second you send the heartbeat; if members don't reply back, the master tells the slaves to remove member x from the list.
As for what to do if the master dies: basically, you need preset candidate nodes. If a candidate hasn't received a heartbeat within the timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
Joining an existing cluster is basically the same as above: if the node you contact is not the leader, it responds "not leader" along with the leader's address.
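To make the heartbeat bookkeeping concrete, here is a minimal master-side sketch in plain Java of the evict-after-three-missed-heartbeats rule described above; all names and timings are illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HeartbeatMonitor {
    private static final long PERIOD_MS = 500;
    private static final long TIMEOUT_MS = 3 * PERIOD_MS; // three missed beats = dead
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Workers call this (over your single connection) on every heartbeat.
    void onHeartbeat(String workerId) {
        lastSeen.put(workerId, System.currentTimeMillis());
    }

    // Periodically evict members that stopped responding.
    void start(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastSeen.entrySet().removeIf(e -> {
                boolean dead = now - e.getValue() > TIMEOUT_MS;
                if (dead) broadcastRemoval(e.getKey()); // "remove member x from list"
                return dead;
            });
        }, PERIOD_MS, PERIOD_MS, TimeUnit.MILLISECONDS);
    }

    void broadcastRemoval(String workerId) { /* notify the remaining members */ }
}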

RESTful Job Assignment

I have a collection of jobs that need processing, http://example.com/jobs. Each job has a status of "new", "assigned" or "finished".
I want slave processes to pick off one "new" job, set its status to "assigned", and then process it. I want to ensure each job is only processed by a single slave.
I considered having each slave do the following:
GET http://example.com/jobs
Pick one that's "new" and do an HTTP PUT to http://example.com/jobs/123 {"status=assigned"}.
Repeat
The problem is that another slave may have assigned the job to itself between the GET and PUT. I could have the second PUT return a 409 (conflict), which would signal the second slave to try a different job.
Am I on the right track, or should I do this differently?
I would have one process that picks "new" jobs and assigns them. Other processes would independently go in and look to see if they've been assigned a job. You'd have to have some way to identify which process a job is assigned to, so some kind of slave process id would be called for.
(You could use POST too, as what you're trying to do shouldn't be idempotent anyway).
You could give each of your clients a unique ID (possibly a UUID) and have an "assignee/worker" field in your job resource.
GET http://example.com/jobs/
POST { "worker"=$myID } to http://example.com/jobs/123
GET http://example.com/jobs/123 and check that the worker ID is that of the client
You could combine this with conditional requests too.
On top of this, you could have a timeout feature: if the job queue doesn't hear back from a given client, it puts the job back in the queue.
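To make the conditional-request idea concrete, a hypothetical exchange could look like this (the ETag value is illustrative):

GET http://example.com/jobs/123 -> 200 OK with an ETag header, say "abc123"
PUT { "worker"=$myID } to http://example.com/jobs/123 with If-Match: "abc123"
-> 200 OK if this client won the race
-> 412 Precondition Failed if another client changed the job first (pick another job)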
It looks like the statuses are an essential part of your job domain model, so I would expose them as dedicated sub-resources:
# 'idle' is what you called 'new'
GET /jobs/idle
GET /jobs/assigned
# start job
PUT /jobs/assigned/123
A slave is only allowed to gather jobs via GET /jobs/idle. This never includes jobs which are running. There could still be race conditions (two slaves get the set before one of them has started a job); I think 400 Bad Request or your mentioned 409 Conflict are alright for that.
I prefer the above resource structure to working with payloads (which often looks more "procedural" to me).
I was a little too specific: I don't actually care that the slave gets to pick the job, just that it gets a unique one.
With that in mind, I think #manuel aldana was on the right track, but I've made a few modifications.
I'll keep the /jobs resource, but also expose a /jobs/assigned resource. A single job may exist in both collections.
The slave can POST to /jobs/assigned with no parameters. The server will choose one "new" job, move it to "assigned", and return its URL (/jobs/assigned/{jobid} or /jobs/{jobid}) in the Location header with a 201 status.
When the slave finishes the job, it will PUT to /jobs/{jobid} (status=finished).
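Putting it together, here is a minimal sketch of the slave side using java.net.http (Java 11+); the URLs mirror the resources above, and error handling is elided:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Slave {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ask the server to assign us one "new" job; the server picks which one.
        HttpRequest claim = HttpRequest.newBuilder(URI.create("http://example.com/jobs/assigned"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> resp = client.send(claim, HttpResponse.BodyHandlers.discarding());

        if (resp.statusCode() == 201) {
            String jobUrl = resp.headers().firstValue("Location").orElseThrow();
            // ... process the job, then report completion:
            HttpRequest finish = HttpRequest.newBuilder(URI.create(jobUrl))
                    .PUT(HttpRequest.BodyPublishers.ofString("status=finished"))
                    .build();
            client.send(finish, HttpResponse.BodyHandlers.discarding());
        }
    }
}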