Why do my results become unreproducible when run from qsub? (MATLAB)

I'm running MATLAB on a cluster. When I run my .m script from an interactive MATLAB session on the cluster, my results are reproducible. But when I run the same script via a qsub command, as part of an array job away from my watchful eye, I get believable but unreproducible results. The .m files are doing exactly the same thing, including saving the results as .mat files.
Does anyone know why the scripts give reproducible results when run one way, but unreproducible results when run the other?
Is this only a problem with reproducibility, or is it indicative of inaccurate results?
%%%%%
Thanks to spuder for a helpful answer. Just in case anyone stumbles upon this and is interested, here is some further information.
If you use more than one thread in MATLAB jobs, you may steal resources from other jobs, which plays havoc with the results. So you have two options:
1. Request exclusive access to a node. The cluster I am using does not currently allow parallel array jobs, so doing this was very wasteful for me: I took a whole node but used it serially.
2. Ask MATLAB to run with a single computational thread (-singleCompThread). Your script may take longer to complete, but jobs get through the queue faster.
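In case it helps anyone, here is a minimal sketch of option 2 as a qsub submission script. It assumes a Torque/PBS scheduler and a script called myscript.m; the names and resource values are placeholders for your own setup:

#!/bin/bash
#PBS -l nodes=1:ppn=1   # request a single core, since MATLAB will only use one thread
#PBS -t 1-10            # array job with tasks 1..10 (each task can pick its input via $PBS_ARRAYID)
cd $PBS_O_WORKDIR       # run from the directory the job was submitted from
# -singleCompThread restricts MATLAB to one computational thread, so the job
# stays within the single core it was allocated instead of grabbing every core.
matlab -nodisplay -singleCompThread -r "myscript; exit"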

There are a lot of variables at play. Ruling out transient issues such as network performance and load, here are a couple of possible explanations:
You may be assigned a different batch of nodes when you run an interactive job than when you use qsub.
I've seen some sites that assign a policy of 'exclusive' to the nodes that run interactive jobs and 'shared' to nodes that run queued 'qsub' jobs. If that is the case, then you will almost always see better performance on the exclusive nodes.
Another possibility is that the interactive jobs are assigned to nodes that have less network congestion.
Additionally, if you are requesting multiple nodes and you happen to land on nodes that are multiple network hops apart, you could be seeing significant network slowdowns. The solution would be for the cluster administrator to set up nodesets.
Are you using multiple nodes for the job? How are you requesting the resources?

Related

Dynamic number of replicas in a Kubernetes cron-job

I've been looking for days for a way to set up a cron-job with a dynamic number of jobs.
I've read all these solutions, and it seems that, in order to initialise a dynamic number of jobs, I would need to do it manually with a script and a job template, but I need it to be automatic.
A bit of context:
I have a database / message queue / whatever that can store "items"
I would like to start a job (so a single replica of a container) every 5 minutes to process each item
So, let's say there is a Kafka topic / a db table / a folder containing 5 records / rows / files, I would like Kubernetes to start 5 replicas of the job (with the cron-job) automatically. After 5 minutes, there will be 2 items, so Kubernetes will just start 2 replicas.
The most feasible solution seems to be using a static number of pods and making them process multiple items, but I feel like there is a better way to accomplish this while keeping it inside Kubernetes, one that I can't figure out due to my lack of experience. 🤔
What would you do to solve this problem?
P.S. Sorry for my English.
There are two ways I can think of:
Using a CronJob that is parallelised (1 work-item/pod or 1+ work-items/pod). This is what you're trying to achieve. Somewhat.
Using a data processing application. This I believe is the recommended approach.
Why and Why Not CronJobs
For (1), there are a few things I would like to mention. There is no upside to having multiple Job/CronJob items when you are trying to perform the same operation from all of them. You think you are getting parallelism, but not really; you are only increasing management overhead. If your workload grows too large (which it will), there will be too many Job objects in the cluster and the API server will slow down drastically.
Job and CronJob items are only for stand-alone work items that need to be performed regularly; they are house-keeping tasks. So selecting CronJobs for data processing is a very bad idea. Even if you run a parallelized set of pods (as described in the docs you mentioned), it is still best to have a single Job manage all the pods working on the same work-item. So you should not be thinking of "scaling Jobs" in those terms; instead, think of scaling Pods. If you really want to move ahead with the Job and CronJob mechanisms, go ahead: the message-queue-based design is your best bet. But you will have to reinvent a lot of wheels to get it to work (read below for why that is the case).
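As a rough illustration of that message-queue design (a sketch only: job-template.yaml and the queue_length helper are hypothetical placeholders, not real tooling), the CronJob could run a small script that stamps out one Job per pending item, much like the "parallel processing using expansions" pattern in the Kubernetes docs:

#!/bin/bash
# Hypothetical helper that returns the number of items waiting in the queue
COUNT=$(queue_length)
# Create one Job per item by substituting the item index into a template
# (job-template.yaml would contain the literal placeholder $ITEM)
for i in $(seq 1 "$COUNT"); do
  sed "s/\$ITEM/$i/g" job-template.yaml | kubectl apply -f -
done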
Recommended Solution
For (2), I only say this because I see you are trying to perform data processing, and doing this with a one-off mechanism like a Job is not a good idea (Jobs are basically stateless, in that they perform an operation that can simply be repeated without any repercussions). Say you start a pod and it fails processing: how will other pods know that this item was not processed successfully? What if the pod dies? The Job cannot keep track of the items in your data store, since a Job is not aware of the nature of the work you're performing. Therefore, it is natural to pursue a solution where the system components are specifically designed for data processing.
You will want to look into a system that can understand the nature of your data, keep track of which processing queues have finished successfully, and restart a new Pod with the same item as input when a Pod crashes, etc. This is a lot of application/use-case-specific functionality that is best served by means of an operator, or a CustomResource and a controller. And obviously, since this is not a new problem, there are a ton of solutions out there that can do this the best way for you.
The best course of action would be to have such a system in place, deployed using the Deployment pattern with auto-scaling enabled; that way you achieve real parallelism, which is also best suited for batch data processing.
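For illustration, with such a system deployed as a Deployment (my-processor is a hypothetical name standing in for your application), CPU-based autoscaling can be enabled with a single command:

# Keep between 2 and 10 replicas, scaling on 80% average CPU utilisation
kubectl autoscale deployment my-processor --min=2 --max=10 --cpu-percent=80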
And remember, when we talk about scaling in Kubernetes, it is always the Pods that scale: not containers, not Deployments, not Services. Always Pods. That is because at the bottom of the chain there is always a Pod working on something, whether it is owned by a Job, a Deployment, a Service, a DaemonSet, or whatever. And it is obviously a bad idea to have multiple application containers in a Pod, for many reasons (side-car and adapter patterns are just helpers; they don't run the application).
Perhaps this blog post discussing data processing in Kubernetes can help.

Preferred approach for running one script over multiple directories in SLURM

My most typical use case is running a single script over multiple directories (usually in R or MATLAB). I have access to a high-performance computing environment (SLURM-based). From my research so far, it is unclear to me which of the following approaches makes the most efficient use of the available CPUs/cores. I also want to make sure I'm not unnecessarily taking up system resources, so I'd like to double-check which of the two approaches is most suitable.
Approach 1:
Parallelize code within the script (MPI).
Wrap this in a loop that applies the script to all directories.
Submit this as a single MPI job as a SLURM script.
Approach 2:
Parallelize code within the script (MPI).
Create an MPI job array, one job per directory, each running the script on the directory.
I'm still new to this so if I've mixed up something here or you need more details to answer the question please let me know.
If you do not explicitly use MPI inside your original R or MATLAB script, I suggest you avoid using MPI at all and use job arrays instead.
Assuming you have a script myscript.R and a set of subdirectories data01, data02, ..., data10, and the script takes the name of the directory as an input parameter, you can do the following.
Create a submission script in the parent directory of the data directories:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem-per-cpu=2G
#SBATCH --time 1-0
#SBATCH --array=0-9   # Bash arrays are zero-indexed, so 10 directories map to indices 0-9
DIRS=(data*/)   # Create a Bash array with all data directories
module load R
Rscript myscript.R ${DIRS[$SLURM_ARRAY_TASK_ID]}   # Feed the script the data directory
                                                   # corresponding to the task ID in the array
This script will create a job array in which each job runs myscript.R with one of the data directories as its argument.
Of course, you will need to adapt the memory and time values, and investigate whether using more than one CPU per job is beneficial in your case. Also adapt the --array parameter to the actual number of directories.
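If the number of directories changes between runs, one option (a sketch, assuming the submission script above is saved as submit.sh) is to compute the array range at submission time; options passed on the sbatch command line override the #SBATCH directives in the script:

N=$(ls -d data*/ | wc -l)                 # count the data directories
sbatch --array=0-$(( N - 1 )) submit.sh   # one array task per directory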
The answer is quite obvious to me, given that getting good parallelism is usually difficult.
In the first approach, you ask SLURM for a set of resources, but if you ask for many CPUs you will probably waste quite a lot of them (if you ask for 32 CPUs and your speedup is only 4x, you are wasting 28 CPUs). So you will end up using a small portion of the cluster, processing one folder after the other.
In the second approach, you ask SLURM to run a job for every folder. There will be many jobs running simultaneously, and each can ask for fewer resources. Say you ask for 4 CPUs per job (and the speedup is 3x, which means you waste 1 CPU per job): running 8 jobs simultaneously takes the same 32 CPUs as the first approach, but only 8 CPUs are wasted and 8 folders are processed simultaneously.
In the end, the decision has to be made after measuring the speedup with different numbers of CPUs, but my feeling is that the second approach will generally be preferred, unless you get a very good speedup, in which case both approaches are equivalent.

What is happening when Matlab is starting a "parallel pool"?

Running parallel CPU processes in Matlab starts with the command
parpool()
According to the documentation, that function:
[creates] a special job on a pool of workers, and [connects] the MATLAB client to the parallel pool.
This function usually takes a bit of time to execute, on the order of 30 seconds. But in other multi-CPU paradigms, like OpenMP, parallel execution seems totally transparent: I've never noticed any behavior analogous to what MATLAB does (granted, I'm not very experienced with parallel programming).
So, what is actually happening between the time that parpool() is called and when it finishes executing? What takes so long?
Parallel Computing Toolbox enables you to run MATLAB code in parallel using several different paradigms (e.g. jobs and tasks, parfor, spmd, parfeval, batch processing), and to run it either locally (parallelised across cores in your local machine) or remotely (parallelised across machines in a cluster - either one that you own, or one in the cloud).
In any of these cases, the code is run on MATLAB workers, which are basically copies of MATLAB without an interactive desktop.
If you're intending to run on a remote cluster, it's likely that these workers will already be started up and ready to run code. If you're intending to run locally, it's possible that you might already have started workers, but maybe you haven't.
Some of the constructs above (e.g. jobs and tasks, batch processing) just run the thing you asked for, and the workers then go back to being available for other things (possibly from a different user).
But some of the constructs (e.g. parfor, spmd) require that the workers on which you intend to run are reserved for you for a period of time - partly because they might lie idle for some time and you don't want them taken over by another user, and partly because (unlike with jobs and tasks, or batch processing) they might need to communicate with each other. This is called creating a worker pool.
When you run parpool, you're telling MATLAB that you want to reserve a pool of workers for yourself, because you're intending to run a construct that requires a worker pool. You can specify as an input argument a cluster profile, which would tell it whether you want to run on a remote cluster or locally.
If you're running on a cluster, parpool will send a message to the cluster to reserve some of its (already running) workers for your use.
If you're running locally, parpool will ensure that there are enough workers running locally, and then connect them into a pool for you.
The thing that takes 30 seconds is the part where it needs to start up workers, if they're not already running. On Windows, if you watch Task Manager while running parpool, you'll see additional copies of MATLAB popping up over those 30 seconds as the workers start (they're actually not MATLAB itself, they're MATLAB workers - you can distinguish them as they'll be using less memory with no desktop).
To compare what MATLAB is doing to OpenMP, note that these MATLAB workers are separate processes, whereas OpenMP creates multiple threads within an existing process.
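If you want to see this startup cost for yourself, you can time pool creation from the shell. A minimal sketch, assuming Parallel Computing Toolbox is installed and a profile named 'local' exists (the default local profile; recent releases name it 'Processes'):

# Time how long it takes to create (and then shut down) a 4-worker pool
matlab -nodisplay -r "tic; parpool('local', 4); toc; delete(gcp('nocreate')); exit"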
To be honest, I do not think that we will ever get to know exactly what MATLAB does.
However, to give you some answer: MATLAB basically opens additional instances of itself to execute the code on. In order to do this, it first needs to check where the instances should be opened (you can change the cluster from local to whatever else you have access to, e.g. Amazon's EC2). Once the new instances have been opened, MATLAB sets up the connection from your main window to those instances.
Notes:
1) It is not recommended to use parpool inside a function or script, as it will throw an error if it runs while a parallel pool is already open. Using a parallel command such as parfor will automatically open a pool.
2) parpool only has to be executed "once" (before it is shut down), i.e. if you run the code again, the instances are already open.
3) If you want to avoid the overhead in your code, you can create a file called startup.m on the MATLAB search path containing the command parpool; this will automatically start a parallel pool on startup.
4) Vectorizing your code lets MATLAB parallelise it intrinsically, without this overhead.
A few further details to follow up on @Nicky's answer. Creating a parallel pool involves:
Submitting a communicating job to the appropriate cluster
This recruits MATLAB worker processes. These processes might be already running (in the case of MJS), or they might need to be started (in the case of 'local', and all other cluster types).
MPI communication is set up among the workers to support spmd (unless you specify 'SpmdEnabled', false when starting the pool); this stage isn't usually a performance bottleneck.
The MATLAB workers are then connected up to the client so they can do its bidding.
The difference in overhead between parpool and something like OpenMP is because parpool generally launches additional MATLAB processes, a relatively heavyweight operation, whereas OpenMP simply creates additional threads within a single process, which is comparatively lightweight. Also, as @Nicky points out, MATLAB can intrinsically multi-thread some/most vectorised operations; parpool is useful for cases where this doesn't happen, or where you have a real multi-node cluster available to run on.

Run-time memory of a Perl script

I have a Perl script that gets killed by an automated job whenever a high-priority process comes along, as my script runs ~300 parallel jobs for downloading data and consumes a lot of memory. I want to figure out how much memory it takes at run time so that I can ask for more memory before scheduling the script, or, if some tool can show me which portions of my code take up the most memory, I can optimize the code.
Regarding the OP's comment on the question: if you want to minimize memory use, definitely collect and append the data one row/line at a time. If you collect all of it into a variable at once, you need to hold all of it in memory at once.
Regarding the question itself, you may want to look into whether it's possible to have the Perl code run just once (rather than running 300 separate instances) and then fork to create your individual worker processes. When you fork, the child processes share memory with the parent much more efficiently than unrelated processes can, so you will, e.g., only need one copy of the Perl binary in memory rather than 300 copies.
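As a footnote on the measurement part of the question: GNU time (the /usr/bin/time binary, not the shell built-in) can report a process's peak memory use. A sketch, with myscript.pl standing in for your script:

# -v prints verbose statistics to stderr, including peak RSS in kilobytes
/usr/bin/time -v perl myscript.pl 2>&1 | grep "Maximum resident set size"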

Time taken for completion of PBS jobs

On a PBS system I have access to, I'm running some jobs with the -W x=NACCESSPOLICY:SINGLEJOB flag and, anecdotally, it seems that the same jobs take about 10% longer with this flag than without. Is this correct behaviour? If so, it surprises me, as I'd have thought that having sole access to a whole node would, if anything, slightly decrease the time taken for a job to run, due to having access to more memory.