How can one view the partial output of a job in PBS that has exceeded its walltime?

I'm new to using cluster computers to run experiments. I have a script running in python that should be regularly printing out information, but I find that when my job exceeds its walltime, I get no output at all except the notification that the job has been killed.
I've tried regularly flushing the buffer to no avail, and was wondering if there was something more basic that I'm missing.
Thanks!

I'm guessing you are having issues with a job cleanup script in the epilogue. You may want to ask the admins about it. You may also want to try a different approach.
If you were to redirect your output to a file on a shared filesystem, you should be able to avoid data loss. This assumes you have a shared filesystem to work with and you aren't required to stage in and stage out all of your data.
If you reuse your submission script, you can avoid clobbering the output of other jobs by including the $PBS_JOBID environment variable in the output filename:
script.py > $PBS_JOBID.out
I'm on mobile, so check the qsub man page for the full list of job environment variables.
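For example, a minimal submission script along these lines (the job name, walltime and script name are placeholders) keeps partial output on the shared filesystem even if the job is killed at the walltime limit; Python's -u flag disables output buffering so lines land in the file as they are printed:
#!/bin/bash
#PBS -N myjob
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
# -u turns off Python's stdout buffering; capture stdout and stderr separately
python -u script.py > $PBS_JOBID.out 2> $PBS_JOBID.err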

Related

How can I identify processed files in a Dataflow job?

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs it re-reads all the files.
This is a batch job, and the following is a sample of the TextIO read that I am using:
PCollection<String> filePColection = pipeline.apply("Read files from Cloud Storage ", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command-line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid re-analyzing files from previous runs you could do either of the following:
Define a streaming job and use TextIO's watchForNewFiles functionality. You would have to leave the job running for as long as you want to keep processing files.
Find a way to keep track of which files have already been analyzed. Every time you run your pipeline you could generate the list of files to analyze, put it into a PCollection, read each one with TextIO.readAll(), and store the list of analyzed files somewhere. On the next run you can use that stored list as a blacklist of files you don't need to process again (see the sketch after this answer).
Let me know in the comments if you want to work out a solution around one of these two options.
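As a rough sketch of the second option, the list of new files could be built outside the pipeline with gsutil and a stored record of already-processed files (the state object path and local file names here are just placeholders):
# list everything matching the wildcard, sorted so it can be diffed with comm
gsutil ls "gs://bucketName/TrafficData*.txt" | sort > all_files.txt
# fetch the record of files processed by earlier runs (empty on the first run)
gsutil cp gs://bucketName/state/processed_files.txt processed_files.txt || touch processed_files.txt
# keep only files not seen before, then hand files_to_process.txt to the pipeline (e.g. via TextIO.readAll())
comm -23 all_files.txt <(sort processed_files.txt) > files_to_process.txt
# after a successful run, update and store the record
cat files_to_process.txt >> processed_files.txt && sort -u processed_files.txt -o processed_files.txt
gsutil cp processed_files.txt gs://bucketName/state/processed_files.txt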

DASK with local files on WORKER systems

I am working with multiple systems as workers.
Each worker system has a part of the data stored locally, and I want the computation on each file to be done only by its respective worker.
I have tried using:
distributed.scheduler.decide_worker()
send_task_to_worker(worker, key)
but I could not automate assigning the task for each file.
Also, is there any way I can access the local files of a worker? Using the TCP address, I only have access to a temporary folder that the worker created for dask.
You can target computations to run on certain workers using the workers= keyword to the various methods on the client. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
You might run a function on each of your workers that tells you which files are present:
>>> client.run(os.listdir, my_directory)
{'192.168.0.1:52523': ['myfile1.dat', 'myfile2.dat'],
'192.168.0.2:4244': ['myfile3.dat'],
'192.168.0.3:5515': ['myfile4.dat', 'myfile5.dat']}
You might then submit computations to run on those workers specifically.
future = client.submit(load, 'myfile1.dat', workers='192.168.0.1:52523')
If you are using dask.delayed you can also pass workers= to the persist method. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.

Intermittent file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs fail trying to read the Avro written by the previous job; it appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. At first I was getting failures trying to read files that appeared to be from the previous day's run, and partway through the job those files would disappear and be replaced by the new ones. I have changed the script that runs the jobs to clear the work area before starting, but I still have problems where a job sometimes starts reading before all the parts have been fully written.
I could change the code to simply store the intermediate files on the cluster, though I like having them available outside the cluster for diagnosing other problems. I could also just write to both locations, with the cluster copy for working data and the GCS copy for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, with the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even though my immediate problem is (I believe) fixed, I would still like to know what's happening and why GCS seems out of sync for a bit.
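One check I could add to my driver script (assuming the intermediate Avro is written by a Hadoop/Spark output committer, which leaves a _SUCCESS marker object; the path below is a placeholder) is to poll for that marker before submitting the next job:
# wait until the previous job's success marker is visible in GCS
MARKER="gs://my-bucket/intermediate/step1/_SUCCESS"
until gsutil -q stat "$MARKER"; do
    echo "waiting for $MARKER to appear..."
    sleep 10
done
# now submit the next job in the chain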

Matlab process termination in slurm

I have two questions that to me seem related:
First, is it necessary to explicitly terminate Matlab in my sbatch command? I have looked through several online slurm tutorials, and in some cases the authors include an exit command:
http://www.umbc.edu/hpcf/resources-tara-2013/how-to-run-matlab.html
And in some they don't:
http://www.buffalo.edu/ccr/support/software-resources/compilers-programming-languages/matlab/PCT.html
Second, when creating a parallel pool in a job, I almost always get the following warning:
Warning: Found 4 pre-existing communicating job(s) created by pool that are
running, and 2 communicating job(s) that are pending or queued. You can use
'delete(myCluster.Jobs)' to remove all jobs created with profile local. To
create 'myCluster' use 'myCluster = parcluster('local')'
Why is this happening, and is there any way to avoid it happening to myself and to others because of me?
It depends on how you launch Matlab. Note that your two examples use distinct methods of running a Matlab script; the first one uses the -r option
matlab -nodisplay -r "matrixmultiply, exit"
while the second one uses stdin redirection from a file
matlab < runjob.m
In the first solution, the Matlab process is left running after the script finishes; that is why the exit command is needed there. In the second solution, the Matlab process terminates because stdin closes when the end of the file is reached.
If you do not end the Matlab process, Slurm will kill it when the maximum allocation time is reached, as defined by the --time option in your submission script or by the default cluster (or partition) value.
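As an illustration of the first style, a minimal sbatch script might look like the following (the resource values and the script name myscript are placeholders); wrapping the call in try/catch ensures that exit still runs if the script errors out, so the job does not sit idle until the walltime limit:
#!/bin/bash
#SBATCH --job-name=matlab_job
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
module load matlab   # assumes a module-based Matlab installation
matlab -nodisplay -nosplash -r "try, myscript, catch err, disp(getReport(err)), end; exit"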
To avoid the warning you mention, make sure to systematically use matlabpool close at the end of your job. If you have several instances of Matlab running on the same node, and you have a shared home directory, you will probably get the warning anyhow, as I believe the information about open matlab pools is stored in a hidden folder in your home. Rebooting will probably not help, but finding those files and removing them will (be careful though and ask the system administrator).
To avoid the warning, you have to delete the .matlab/local_cluster_jobs/ directory.

Good practices for WebSphere MQ production deployment

I'm about to prepare a deployment specification for the WebSphere MQ production environment. As always, I hate reinventing the wheel, hence the question:
Is there an article or specification of best practices for deploying and maintaining a WebSphere MQ production environment?
Here are my more specific doubts:
Configuration versioning (MQSC, dmpmqcfg, etc).
Deploying new objects (MQSC or manual instructions?)
Deployment automation (maybe based on a diff of dmpmqcfg output?).
Deploying and versioning configuration alterations.
Currently I am simply creating MQ objects manually and versioning the output of dmpmqcfg. However, in a while there are going to be too many deployments to handle it like this.
That's an extremely broad question so I'll try to respond before a moderator deletes it. :-)
The answer depends on many things such as whether MQ clusters are in use, the approaches to high availability and disaster recovery, the security requirements, whether the QMgrs are configured as dedicated or shared infrastructure, etc. However, there are a few patterns that I follow in almost all cases, including non-Production. This is because things like monitoring and security tend to get dropped at deployment time if not tested in Dev and don't work as expected in Prod.
I use a script to create my QMgrs in Production to ensure that basics like generating the X.509 certificate (or CSR) are always done according to standards, that any exits or exit parm files are present, that certain SupportPac executables (like q) are present in /opt/mqm/bin, circular queues, etc. It also checks for negative factors such as GSKit not being installed.
I have a baseline script that is run against all QMgrs. This script sets up the DLQ, any queues for monitoring agents, enables events as required, sets up system services, trigger monitors, listeners, etc. The exception is B2B gateway QMgrs which are handled in a class all their own and have very specific configurations not used on the internal network.
I have several classes of QMgr with specific configuration requirements. These include cluster repositories (where primary and secondary are distinct sub-types), service-provider QMgrs, and service consumer QMgrs. These all have secondary scripts run against them.
I have scripts per-cluster to join or suspend a QMgr in cases where clustering is used (which for me is almost 100% since v7.1).
These set up a QMgr's infrastructure. Then I maintain scripts for each application. So for example, if there's a Payroll app, I'll have queues and possibly topics with names containing a PAY node such as PAY.EMPLOYEE.UPDT.REQ.V032.PRD. Corresponding to that will be a single script for all PAY.** queues. Used to be one for setmqaut commands too, but these are now in the same script as the objects. I only ever have one version of the script and keep a history of changes in the script. This way when I need to recreate a QMgr, I just run all the scripts for it. Similarly, if I need to deploy the PAY objects on another QMgr, I just copy the script to that server.
When defining objects for clusters, I always do a DEFINE NOREPLACE that contains all the run-time attributes, such as whether the queue is enabled in the cluster. The queue is always defined as disabled in the cluster and for triggering, but because I use NOREPLACE, re-running the script doesn't change whatever state it happens to be in, say, a month later. Things that are configuration rather than run-time state, such as the description, are handled in an ALTER immediately after the DEFINE, and these are updated each time the script is run. There's an article on this here.
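A rough sketch of that DEFINE/ALTER split, with made-up object and queue manager names and the cluster attributes omitted, might look like this when fed to runmqsc:
runmqsc PAYQM1 <<'EOF'
* Run-time state: only takes effect the first time the object is created
DEFINE QLOCAL('PAY.EMPLOYEE.UPDT.REQ.V032.PRD') PUT(DISABLED) NOTRIGGER NOREPLACE
* Configuration: re-applied on every run of the script
ALTER QLOCAL('PAY.EMPLOYEE.UPDT.REQ.V032.PRD') DESCR('Payroll employee update requests')
EOF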
Finally, the scripts I use are of the self-executing, self-documenting variety. For example, many people put all the MQSC commands into a script then do something like:
runmqsc < payroll.mqsc > payroll.out
TONS of problems here. The main one is that it relies on the operator to know a lot and execute the script right all the time. For example, suppose (s)he forgets to capture the output? Or overwrites a previous output? Or doesn't get STDERR because (s)he needs to do the 2>&1 at the end and doesn't know redirection that well?
So my scripts are all written in ksh, handle all the capturing of output (complete with time and date stamping and STDERR), can freely mix MQSC with OS commands, and so on. All you do is go to the scripts directory for that QMgr and run . ./*ksh to build/rebuild a QMgr.
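A stripped-down sketch of such a wrapper (the queue manager and file names are made up) might look like this; the point is that output capture, STDERR, and timestamps are baked in rather than left to the operator:
#!/bin/ksh
# self-logging wrapper: apply the PAY definitions to this QMgr
QMGR=PAYQM1
STAMP=$(date +%Y%m%d.%H%M%S)
LOG=PAY.$QMGR.$STAMP.out
{
  echo "== $(date) : applying PAY definitions to $QMGR =="
  runmqsc $QMGR < PAY.mqsc      # the MQSC could also be embedded here in a heredoc
  echo "== $(date) : finished, rc=$? =="
} > $LOG 2>&1
echo "Output captured in $LOG"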
I do of course also take regular configuration dumps, but these are more for running queries and reports like "how many QMgrs have this channel defined and where are they?" kind of thing.
Also, when taking backups there is almost NEVER a good reason to back up a QMgr at a point in time. However, if it is required, be sure to stop the QMgr first. Also, think long and hard about capturing certificates in a backup. Many people are good about locking the certificate directory so only mqm can read it, but often the backups are unprotected. As long as you aren't trying to restore on top of Production, many shops let you restore the Production /var/mqm/* files to your own sandbox. If the QMgr's KDB files are included, you just lost them. An alternative is to put the certificates in /etc or some other directory that is protected but not backed up with the QMgr's directories.