A Workload Scheduler's step doesn't proceed and remains queued status - scheduler

I created a simple process by Applicaiton Lab interface in Bluemix Workload Scheduler. I ran my process, but the step didn't proceed and remained in queued status.
How can I proceed the step?
I executed the process by "Run now". The process doesn't have triggers
The step remains "Queued status".
The Process information
There is only one step. The step is "ping www.ibm.com"
Process doesn't have trigger. It is an on-demand process.

There might be a problem with the agent as I can successfully run a simple workload process without any issues. If you are using the Workload Automation Agent that is created for you then you will need to open a support ticket to have the Workload team look at that agent.

reviewing your question I think that a process submitted to the Workload Scheduler service should be a process which will complete: a ping command like the one you are trying to submit, will never complete if not 'killed' using CTRL+C (or called with [-c count] option)

Related

Pending decisison tasks never picked for execution and eventually times out in uber cadence workflow

What could be the reason for the decision tasks to not get picked for execution in cadence cluster. They remains at pending state and finally times out. I dont see any error logs. How do I debug this ?
It’s very likely that there is no worker available and actively polling tasks for the tasklist.
Best way to confirm is to click on the tasklist naming in the webUI and see what are the workers behind the tasklist. Since it’s decision task, you should check the decision handler for the tasklist.
You can also use CLI to describe the tasklist to give the same information:
cadence tasklist desc —-tl <tasklist name>
In some extremely rare cases(I personally never seen but heard that happened in Uber with large scale cluster) that cadence server lost the task. In that case you can use CLI to either regenerate the task, or reset the workflow to unblock the workflow:
To regenerate task:
cadence adm wf refresh-tasks -w <wf id>
To reset:
cadence wf reset —-reset_type LastDecisionCompleted -w <wf id>

Azure DevOps Agent - Custom Setup/Teardown Operations

We have a cloud full of self-hosted Azure Agents running on custom AMIs. In some cases, I have some cleanup operations which I'd really like to do either before or after a job runs on the machine, but I don't want the developer waiting for the job to wait either at the beginning or the end of the job (which holds up other stages).
What I'd really like is to have the Azure Agent itself say "after this job finishes, I will run a set of custom scripts that will prepare for the next job, and I won't accept more work until that set of scripts is done".
In a pinch, maybe just a "cooldown" setting would work -- wait 30 seconds before accepting another job. (Then at least a job could trigger some background work before finishing.)
Has anyone had experience with this, or knows of a workable solution?
I suggest three solutions
Create another pipeline to run the clean up tasks on agents - you can also add demand for specific agents (See here https://learn.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml) by Agent.Name -equals [Your Agent Name]. You can set frequency to minutes, hours, as you like using cron pattern. As while this pipeline will be running and taking up the agent, the agent being cleaned will not be available for other jobs. Do note that you can trigger this pipeline from another pipeline, but if both are using the same agents - they can just get deadlocked.
Create a template containing scripts tasks having all clean up logic and use it at the end of every job (which you have discounted).
Rather than using static VM's for agent hosting, use Azure scaleset for Self hosted agents - everytime agents are scaled down they are gone and when scaled up they start fresh. This saves a lot of money in terms of sitting agents not doing anything when no one is working. We use this option and went away from static agents. We have also used packer to create the VM image/vhd overnight to update it with patches, softwares required, and docker images cached.
ref: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/scale-set-agents?view=azure-devops
For those discovering this question, there's a much better way: run your self-hosted agent with the --once flag, documented here.
You'll need to wrap it in your own bash script, but something like this works:
while :
do
echo "Performing pre-job setup..."
echo "Waiting for job..."
./run.sh --once
echo "Cleaning up..."
sleep 2
done
Another option would be to use a ScaleSet VM setup which preps a new agent for each run and discards the VM when the run is done. It can prepare new VMs in the background while the job is running.
And I suspect you could implement your own IMaintenanceProvirer..
https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Agent.Worker/Maintenance/MaintenanceJobExtension.cs#L53

spring-batch job monitoring and restart

I am new to spring-batch, got few questions:-
I have got a question about the restart. As per documentation, the restart feature is enabled by default. What I am not clear is do I need to do any extra code for a restart? If so, I am thinking of adding a scheduled job that looks at failed processes and restarts them?
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart that you mention only means if a job is restartable or not .It doesn't mean Spring Batch will help you to restart the failed job automatically.
Instead, it provides the following building blocks for developers for achieving this task on their own :
JobExplorer to find out the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also , a restartable job can only be restarted if its status is FAILED. So if you want to restart a running job that was stop running because of the server breakdown , you have to first find out this running job and update its job execution status and all of its task execution status to FAILED first in order to restart it. (See this for more information). One of the solution is to implement a SmartLifecycle which use the above building blocks to achieve this goal.

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?
We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...

Jenkins trigger job by another which are running on offline node

Is there any way to do the following:
I have 2 jobs. One job on offline node has to trigger the second one. Are there any plugins in Jenkins that can do this. I know that TeamCity has a way of achieving this, but I think that Jenkins is more constrictive
When you configure your node, you can set Availability to Take this slave on-line when in demand and off-line when idle.
Set Usage as Leave this machine for tied jobs only
Finally, configure the job to be executed only on that node.
This way, when the job goes to queue and cannot execute (because the node is offline), Jenkins will try to bring this node online. After the job is finished, the node will go back to offline.
This of course relies on the fact that Jenkins is configured to be able to start this node.
One instance will always be turn on, on which the main job can be run. And have created the job which will look in DB and if in the DB no running instances, it will prepare one node. And the third job after running tests will clean up my environment.