matlab does not save variables on parallel batch job - matlab

I was running (in a cluster) a batch job and in the end I was trying to save results using save(), but I had the following error:
ErrorMessage: The parallel job was cancelled because the task with ID 1
terminated abnormally for the following reason:
Cannot create 'results.mat' because '/home/myusername/experiments' does not exist.
why is that happening? What is the correct way to save variables in a parallel job?

You can use SAVE in the normal way during execution of a parallel job, but you also need to be aware of where you are running. If you are running using the MathWorks jobmanager on the cluster, then depending on the security level set on the jobmanager, you might not have access to the same set of directories as you normally would. More about that stuff here: http://www.mathworks.co.uk/help/mdce/setting-job-manager-security.html

Related

spring-batch job monitoring and restart

I am new to spring-batch, got few questions:-
I have got a question about the restart. As per documentation, the restart feature is enabled by default. What I am not clear is do I need to do any extra code for a restart? If so, I am thinking of adding a scheduled job that looks at failed processes and restarts them?
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart that you mention only means if a job is restartable or not .It doesn't mean Spring Batch will help you to restart the failed job automatically.
Instead, it provides the following building blocks for developers for achieving this task on their own :
JobExplorer to find out the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also , a restartable job can only be restarted if its status is FAILED. So if you want to restart a running job that was stop running because of the server breakdown , you have to first find out this running job and update its job execution status and all of its task execution status to FAILED first in order to restart it. (See this for more information). One of the solution is to implement a SmartLifecycle which use the above building blocks to achieve this goal.

Datastage: How to keep continuous mode job running after a unexpected termination

I have a job that uses the Kafka Connector Stage in order to read a Kafka queue and then load into the database. That job runs in Continuous Mode, which it has no time to conclude, since it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues etc) that job may terminate with failure. In general, that happens after 300 running hours of that job. So, in order to keep the job alive I have to manually look to the job status and then to do a Reset and Run, in order to keep the job running.
The problem is that between the job termination and my manual Reset and Run can pass several hours, which is critical. So I'm looking for a way to eliminate the manual interaction and to reduce that gap by automating the job invocation.
I tried to use Control-M to daily run the job, but with no success: The first day the Control-M called the job, it ran it fine. But in the next day, when the Control-M did an attempt to instantiate the job again it failed (since it was already running). Besides, the Datastage will never tell back Control-M that a job was successfully concluded, since the job's nature won't allow that.
Said that, I would like to hear ideas from you that can light me up.
The first thing that came in mind is to create a intermediate Sequence and then schedule it in Control-M. Then, this new Sequence would call the continuous job asynchronously by using command line stage.
For the case where just this one job terminates unexpectedly and you want it to be restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be setup to loop running this job.
Thus sequence starts job and waits for it to finish. When job finishes, the sequence will then loop and start the job again. You could have added conditions on job exit (for example, if the job aborted, then based on that job end status, you could reset the job before re-running it.
This would not handle the condition where the DataStage engine itself was shut down (such as for maintenance or possibly an error) in which case all jobs end including your new sequence. The same also applies for a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as DataStage engine stop) your team would need to have process in place for jobs/sequences that need to be started up following a DataStage or System outage.
For the outage scenario, you could create a monitor script (regardless of whether running the job solo or from sequence) that sleeps/loops on 5-10 minute intervals and then checks the status of your job using dsjob command, and if not running can start that job/sequence (also via dsjob command). You can decide whether that script startup would occur at DataSTage startup, machine startup, or run it from Control M or other scheduler.

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi look this link https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it but it seems possible to launch several pods and configure the end of the job when x pods have finished. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
specify a non-zero positive value for .spec.completions.
the Job represents the overall task, and is complete when there is
one successful Pod for each value in the range 1 to
.spec.completions
not implemented yet: Each Pod is passed a different index in the
range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
do not specify .spec.completions, default to .spec.parallelism
the Pods must coordinate amongst themselves or an external service to
determine what each should work on.
For example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
when any Pod from the Job terminates with success, no new Pods are
created
once at least one Pod has terminated with success and all Pods are
terminated, then the Job is completed with success
once any Pod has exited with success, no other Pod should still be
doing any work for this task or writing any output. They should all
be in the process of exiting
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
You can also take a look on single job which starts controller pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but complete control over what Pods are created and how work is assigned to them.
At the same time take under consideration that completition status of Job set by dafault - when specified number of successful completions is reached it ensure that all tasks are processed properly. Applying this status before all tasks are finished is not secure solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here is official documentations: jobs-parallel-processing , parallel-jobs.
Useful blog: article-parallel job.
EDIT:
Another option is that you can create special script which will continuously check values you look for. Using job then will not be necessary, you can simply use deployment.

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?
We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...

Matlab process termination in slurm

I have two questions that to me seem related:
First, is it necessary to explicitly terminate Matlab in my sbatch command? I have looked through several online slurm tutorials, and in some cases the authors include an exit command:
http://www.umbc.edu/hpcf/resources-tara-2013/how-to-run-matlab.html
And in some they don't:
http://www.buffalo.edu/ccr/support/software-resources/compilers-programming-languages/matlab/PCT.html
Second, when creating a parallel pool in a job, I almost always get the following warning:
Warning: Found 4 pre-existing communicating job(s) created by pool that are
running, and 2 communicating job(s) that are pending or queued. You can use
'delete(myCluster.Jobs)' to remove all jobs created with profile local. To
create 'myCluster' use 'myCluster = parcluster('local')'
Why is this happening, and is there any way to avoid it happening to myself and to others because of me?
It depends on how you launch Matlab. Note that your two examples use distinct methods for running a matlab script; the first one uses the -r option
matlab -nodisplay -r "matrixmultiply, exit"
while the second one uses stdin redirection from a file
matlab < runjob.m
In the first solution, the Matlab process will be left running after the script is finished, that is why the exit command is needed there. In the second solution, the Matlab process is terminated as stdin closes when the end of the file is reached.
If you do not end the matlab process, Slurm will kill it when the maximum allocation time is reached, as defined by the --time option in you submission script or by the default cluster (or partition) value.
To avoid the warning you mention, make sure to systematically use matlabpool close at the end of your job. If you have several instances of Matlab running on the same node, and you have a shared home directory, you will probably get the warning anyhow, as I believe the information about open matlab pools is stored in a hidden folder in your home. Rebooting will probably not help, but finding those files and removing them will (be careful though and ask the system administrator).
to avoid your warning, you have to delete
.matlab/local_cluster_jobs/
directory