Checking later the state of nodes during a past job execution - azure-batch

Is there a way to check in Azure Batch whether a node went to the unusable state while running a specific job on a specific pool? The context is that, while a job was running, some nodes in the pool it was running on went to the unusable state, but we would have had no indication that this happened if we hadn't been watching the pool's heatmap at the time. So how can I check whether nodes went to the unusable state during some job run?
Also, I see that there are metrics collected about the state of nodes in the Azure portal, but I am not sure why these metrics are always zero for me, even though I am running jobs and some of their tasks fail.

I had a quick look for you: (I hope this helps :))
For node state monitoring you can use the approach described here:
https://learn.microsoft.com/en-us/azure/batch/batch-efficient-list-queries
PoolOperations: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.pooloperations?view=azurebatch-7.0.1
ListComputeNodes: Enumerates the ComputeNodes of the specified pool.
I think if you query at the detail level with the correct filter clause you will get the ComputeNode information back; you can then loop through it and check each node's state.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.common.computenodestate?view=azurebatch-7.0.1
Possible sample implementation (please note this specific code is probably for pool health): https://github.com/Azure/azure-batch-samples/blob/master/CSharp/Common/GettingStartedCommon.cs#L31
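The links above are for the .NET SDK; the Azure Batch Java SDK exposes the same operations, so here is a rough, hedged sketch of the filtered query (account URL, key and pool id are placeholders, and the method names should be double-checked against the SDK version you use). Note that it only reports nodes that are unusable at the moment you run it, so you would have to poll it while the job is running.

```java
import com.microsoft.azure.batch.BatchClient;
import com.microsoft.azure.batch.DetailLevel;
import com.microsoft.azure.batch.auth.BatchSharedKeyCredentials;
import com.microsoft.azure.batch.protocol.models.ComputeNode;

// Sketch: list only the nodes of a pool that are currently in the "unusable" state,
// using an OData filter so the service does the filtering (an "efficient list query").
public class UnusableNodeCheck {

    public static void main(String[] args) throws Exception {
        BatchSharedKeyCredentials credentials = new BatchSharedKeyCredentials(
                "https://<account>.<region>.batch.azure.com", "<account>", "<key>"); // placeholders
        BatchClient client = BatchClient.open(credentials);

        // Only ask for unusable nodes and only the fields we need.
        DetailLevel detailLevel = new DetailLevel.Builder()
                .withFilterClause("state eq 'unusable'")
                .withSelectClause("id,state,stateTransitionTime")
                .build();

        for (ComputeNode node : client.poolOperations().listComputeNodes("<poolId>", detailLevel)) {
            System.out.printf("Node %s is %s since %s%n",
                    node.id(), node.state(), node.stateTransitionTime());
        }
    }
}
```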
With regards to the metrics, how are you getting them back? I am sure I will get corrected if I have said anything doubtful or incorrect. Thanks!

Implementing a downtime on a resource that waits for the resource to finish its task before delaying it

I'm trying to represent a machine that works for an x amount of time before warning the operator that the oil tank needs to be refilled. Keep in mind that the machine doesn't stop as soon as it sends the warning message out. Instead, the operator waits until the machine finishes any activity it had already started, and once it's done, he stops the machine and fills the tank.
In order to represent this process I'm using a Station block from the Material Handling library, that seizes a resource from a resource pool block, to which a downtime block is applied.
Is there a way to make the downtime block wait until the machine stops before performing the maintenance?
I also want to associate a resource pool representing the operator with the downtime block, so that the operator is busy during the downtime, since he's the one responsible for filling the tank. Can I do that?
Thank you in advance!
Is there a way to make the downtime block wait until the machine stops before performing the maintenance?
Yes, explore how Priorities work. Give your machine task a higher priority than the downtime task and ensure that the downtime block does not preempt other tasks.
I also want to associate a resource pool representing the operator with the downtime block, so that the operator is busy during the downtime, since he's the one responsible for filling the tank. Can I do that?
Yes, set the task type to "go to flowchart" and use a custom flowchart to seize from a resource pool (again, check the help on how to set this up in detail).
PS: Please only ask one question per post. See https://stackoverflow.com/help/how-to-ask and, for AnyLogic, https://www.benjamin-schumann.com/blog/2021/4/1/how-to-win-at-anylogic-on-stackoverflow

Uber Cadence task list management

We are using Uber Cadence and periodically we run into issues in the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains, for all domains the single BE server is registered as a worker.
What is visible in the logs is that sometimes, during a query, Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the backend, meanwhile, nothing is visible. However, during this time, if I check the pollers in the Cadence web client, I see that the task list is there, but it is not considered a decision handler any more (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, since nothing can make progress on the decisions. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to be a decision handler? Is it managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when the stickiness is lost, Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow is not possible (although I think that would generate something in the backend log from the Cadence client), or that at that point the worker is already removed from the list and that causes the timeout?
Any help would be more than welcome! Thanks!
Does anyone know about how a worker or task list can lose its ability to be a decision handler
This will happen when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So apparently it will also happen if, for some reason, the worker stops polling for decision tasks.
As I understand when the stickiness is lost, cadence will check for another worker to replay the workflow and continue
Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types (this is a design flaw; we are working on separating them).
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the query task to a worker, and a worker accepted it, but didn't respond back within the timeout.
It's highly possible that there is a bug in your query handler logic. The bug caused the decision worker to crash (which means the Cadence Java client also has a bug, because crashing user code shouldn't crash the worker). A query task then looped over all the instances in your worker pool and eventually crashed all your decision workers.
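To make the two points above concrete, here is a hedged sketch with the Cadence Java client (the domain, task list and workflow are made up, and the Worker.Factory setup follows the older samples; newer client versions wire up a WorkflowClient/WorkerFactory instead): registering a workflow implementation is what makes the worker poll for decision tasks, and a query handler should only return already-computed state instead of doing blocking work.

```java
import com.uber.cadence.worker.Worker;
import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

import java.time.Duration;

public class QueryWorkerSketch {

    public interface MyWorkflow {
        @WorkflowMethod
        void run();

        // Query handlers are served on the decision task path, so they must be
        // fast and non-blocking, otherwise the query times out as in the logs above.
        @QueryMethod
        String currentStatus();
    }

    public static class MyWorkflowImpl implements MyWorkflow {
        private String status = "STARTED";

        @Override
        public void run() {
            status = "WAITING";
            Workflow.sleep(Duration.ofHours(1)); // placeholder workflow logic
            status = "DONE";
        }

        @Override
        public String currentStatus() {
            // Only return state that is already computed: no I/O, no sleeping,
            // nothing that can throw and take the decision worker down.
            return status;
        }
    }

    public static void main(String[] args) {
        // Registering a workflow implementation is what makes this worker poll the
        // task list for decision tasks; a worker that only registers activities
        // will not show up as a decision handler for the task list.
        Worker.Factory factory = new Worker.Factory("mydomain"); // placeholder domain
        Worker worker = factory.newWorker("mytasklist");         // placeholder task list
        worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
        factory.start();
    }
}
```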

Unique access to Kubernetes resources

For some integration tests we would like to have a way of ensuring that only one test at a time has access to certain resources (e.g. 3 DeploymentConfigurations).
For that to work we have the following workflow:
Before test is started - wait until all DCs are "undeployed".
When test is started - set DC replicas to 1.
When test is stopped - set DC replicas to 0.
This works to some degree, but it obviously has the problem that once a test terminates unexpectedly, the DCs might still be in flight.
Now, one way to "solve" this would be to introduce a CR with a controller that handles the lifetime of the lock (the CR).
Is there any more elegant and straight forward way of allowing unique access to Kubernetes resources?
EDIT:
Sadly we are stuck with Kubernetes 1.9 for now.
Look at the 'kubectl wait' API to set different conditions between the test flow steps and, depending on the result, proceed to the next test step.
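The answer above is about kubectl; if the tests are driven from Java, the same scale-up / wait / scale-down flow can be sketched with the fabric8 client (hedged: the namespace and deployment name are placeholders, plain Deployments are assumed instead of DeploymentConfigs, and you would have to pick a client version that still talks to a 1.9 cluster).

```java
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Sketch: claim a deployment before a test by scaling it up and waiting for it,
// then release it afterwards by scaling back down to zero.
public class ExclusiveDeploymentAccess {

    private static final String NAMESPACE = "integration-tests"; // placeholder
    private static final String NAME = "system-under-test";      // placeholder

    public static void main(String[] args) throws Exception {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Before the test: wait until the deployment is "undeployed" (0 ready replicas).
            client.apps().deployments().inNamespace(NAMESPACE).withName(NAME)
                  .waitUntilCondition(d -> d == null
                                  || d.getStatus() == null
                                  || d.getStatus().getReadyReplicas() == null
                                  || d.getStatus().getReadyReplicas() == 0,
                          5, TimeUnit.MINUTES);

            // Test start: bring it up and wait until it is ready.
            client.apps().deployments().inNamespace(NAMESPACE).withName(NAME).scale(1);
            client.apps().deployments().inNamespace(NAMESPACE).withName(NAME)
                  .waitUntilReady(5, TimeUnit.MINUTES);

            try {
                // ... run the actual test here ...
            } finally {
                // Test end: scale back down even if the test failed.
                client.apps().deployments().inNamespace(NAMESPACE).withName(NAME).scale(0);
            }
        }
    }
}
```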

Is there any method to get mutual exclusion in a chef node?

For example, if a process updates a node while a chef-client run is in progress, the chef-client will overwrite the node data:
chef-client gets the node data (state 1)
Process A gets the node data (state 1)
Process A updates the node data locally (state 2)
Process A saves the node data (state 2)
chef-client updates the node data locally (state 2*)
chef-client saves the node data, and this node data does not contain the changes from process A (state 2); the chef-client overwrites the node data (state 2*)
The same problem occurs if we have two processes saving node data at the same moment.
EDIT
We need the external modification because we have a nice UI for the Chef server to remotely manage a lot of computers, shown as a tree (similar to LDAP). An administrator can update the values of the recipes from there. The project is open source: https://github.com/gecos-team/
Although we had a semaphore system, we have detected that if we have two or more simultaneous requests, we can have a concurrency problem:
The regular case is that the system works
But sometimes the system does not work
EDIT 2
I have added a document with a lot of information about our problem.
Throwing what I would do for this case as an answer:
1. Have a distributed lock mechanism like this (I'm not using it myself, it is just for the idea).
2. Build a start/report/error handler which will:
   at start, acquire a lock on the node name in the DLM from 1;
   if it can't, abort the run or wait until the lock is free;
   at the end (report or error), release the lock.
3. Modify the external system to do the same as the handler above: acquire a lock before modifying and release the lock when done.
Pay attention to the lock lifetime! It should be longer than your Chef run plus a margin, and the UI should ensure its lock is still there before writing, and abort if it is not.
A way to get rid of the handler (but you still need a lock for the UI) is to take advantage of the reporting API (a premium feature of Chef 12, free under 25 nodes, license needed above that).
This gets a bit convoluted and needs the node to do reporting (so the chef-server URL should end with organizations/ and the client version should be above 11.16, or use the backport).
Then you can query the runs for a node, check if there's one in started status for this node, and wait until it has ended.
Chef doesn't implement a transaction feature, and it also does not re-converge nodes on updates automatically by default. It's open to race conditions, which you can try to reduce by updating node attributes from within a chef-client run (right before you do something critical), but you will never end up with a reliable, working setup.
The longer the converge runs, the higher the gap and risk of corruption.
Chef's node attributes are only useful for debugging or modification by the chef-client running on the node itself and pretty much useless in highly concurrent/dynamic environments.
I would use Consul.io to coordinate semaphores and key/value configuration data in real time. Access it from Chef recipes or LWRPs using one of the various interfaces Consul provides (HTTP, DNS, …).
You can implement a very simple push-job task to run chef-client (IMHO easier and more powerful than Chef's "push jobs" feature, though not integrated into Chef's ACL/user management) which is also guarded by a distributed semaphore or uses the "Leader Election" feature. Of course you'll have to add this logic to your node update script, too.
chef-client will then acquire a lock on start and block you from manipulating data while it converges, and vice versa.
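Not Chef code, but to make the Consul suggestion concrete, here is a hedged sketch of the acquire/release dance over Consul's HTTP API (a session with a TTL plus a KV acquire); the key name, TTL and server address are made up, and a real handler would live in Ruby, parse the JSON properly and renew the session.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a Consul-backed lock: create a session with a TTL, then try to acquire
// a KV key with that session. Only one holder can have the key acquired at a time.
public class ConsulLockSketch {

    private static final String CONSUL = "http://localhost:8500"; // placeholder address
    private static final String LOCK_KEY = "locks/chef/node-42";  // hypothetical key

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. Create a session; its TTL should outlive a full chef-client run.
        HttpResponse<String> session = http.send(
                HttpRequest.newBuilder(URI.create(CONSUL + "/v1/session/create"))
                        .PUT(HttpRequest.BodyPublishers.ofString("{\"TTL\": \"1800s\", \"Behavior\": \"delete\"}"))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        // Crude extraction of {"ID":"..."}; use a JSON library in real code.
        String sessionId = session.body().replaceAll(".*\"ID\"\\s*:\\s*\"([^\"]+)\".*", "$1");

        // 2. Try to acquire the lock; Consul answers with "true" or "false".
        HttpResponse<String> acquire = http.send(
                HttpRequest.newBuilder(URI.create(CONSUL + "/v1/kv/" + LOCK_KEY + "?acquire=" + sessionId))
                        .PUT(HttpRequest.BodyPublishers.ofString("locked-by-ui"))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        if (!Boolean.parseBoolean(acquire.body().trim())) {
            System.out.println("Somebody else (probably a chef-client run) holds the lock, retry later.");
            return;
        }

        try {
            System.out.println("Lock held, safe to update node data now.");
            // ... update the node data here ...
        } finally {
            // 3. Release the lock so chef-client runs can converge again.
            http.send(
                    HttpRequest.newBuilder(URI.create(CONSUL + "/v1/kv/" + LOCK_KEY + "?release=" + sessionId))
                            .PUT(HttpRequest.BodyPublishers.ofString(""))
                            .build(),
                    HttpResponse.BodyHandlers.ofString());
        }
    }
}
```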
Discovered this one in production and came to the conclusion that there is no safe way to edit the node attributes directly. Leave it to the chef-client :-)
The good news is that there are other, more reliable ways to set node attributes. Chef roles and environments can both be edited safely while a client is running and only take effect during the next Chef run. Additionally, node attribute precedence rules ensure that any settings you make override those that might be made by a recipe.
I suggest avoiding Chef node data updates from your external app and moving that desired node configuration to a Chef data bag.
So nodes will read both the Chef node data and the configuration data bag but write only to the node data, and your external app will read both but only write to the data bag.
If you want to avoid a dependency on another external service, perhaps you could use some kind of time slicing.
Roughly: nodes only start a chef-client run on odd minutes, and the API only updates Chef data on even minutes (distribute these even minutes if you have more than one queue).

Select node in Quartz cluster to execute a job

I have some questions about Quartz clustering, specifically about how triggers fire / jobs execute within the cluster.
Does Quartz give any preference to nodes when executing jobs, such as always (or never) picking the node that executed the same job last time, or is it simply whichever node gets to the job first?
Is it possible to specify the node which should execute the job?
The answer to this will be something of an "it depends".
For Quartz 1.x, the answer is that the execution of the job is always (only) on a more-or-less random node, where "randomness" is really based on whichever node gets to it first. For "busy" schedulers (where there are always a lot of jobs to run) this ends up giving a pretty balanced load across the cluster nodes. For a non-busy scheduler (only an occasional job to fire) it may sometimes look like a single node is firing all the jobs, because the scheduler looks for the next job to fire whenever a job execution completes, so the node that just finished an execution tends to find the next job to execute.
With quartz 2.0 (which is in beta) the answer is the same as above, for standard quartz. But the Terracotta folks have built an Enterprise Edition of their TerracottaJobStore which offers more complex clustering control - as you schedule jobs you can specify which nodes of the cluster are valid for the execution of the job, or you can specify node characteristics/requisites, such as "a node with at least 100 MB RAM available". This also works along with ehcache, such that you can specify the job to run "on the node where the data keyed by X is local".
I solved this problem for my web application using Spring + AOP + memcached. My jobs know, from the data they traverse, whether the job has already been executed, so the only thing I need to avoid is two or more nodes running the same job at the same time.
You can read it here:
http://blog.oio.de/2013/07/03/cluster-job-synchronization-with-spring-aop-and-memcached/
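For completeness, here is a minimal sketch of that idea (not the code from the linked post): a Quartz job that uses memcached's add() as a best-effort lock so that only one cluster node executes the work for a given firing. The key name, TTL, server address and doTheActualWork() are placeholders.

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Sketch: a Quartz job that only does its work on the node that wins a memcached
// "add" for a lock key. add() only succeeds if the key does not exist yet, so the
// first node to get there runs the body and the others skip this firing.
public class SingleNodeJob implements Job {

    private static final String LOCK_KEY = "quartz:lock:my-report-job"; // hypothetical key
    private static final int LOCK_TTL_SECONDS = 300;                    // must outlive the job

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        MemcachedClient memcached = null;
        try {
            memcached = new MemcachedClient(new InetSocketAddress("localhost", 11211)); // placeholder
            boolean lockAcquired = memcached.add(LOCK_KEY, LOCK_TTL_SECONDS, "locked").get();
            if (!lockAcquired) {
                return; // another node is already running this job
            }
            try {
                doTheActualWork();
            } finally {
                memcached.delete(LOCK_KEY); // the TTL is only a safety net
            }
        } catch (Exception e) {
            throw new JobExecutionException(e);
        } finally {
            if (memcached != null) {
                memcached.shutdown();
            }
        }
    }

    private void doTheActualWork() {
        // the job's real logic goes here
    }
}
```

Note that the delete in the finally block releases the lock as soon as the work is done; the TTL only covers the case where the node dies mid-run.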