How to combine outputs of multiple tasks performed using Matlab SGE?

I have the following batch file launching some m-files (main.m and f.m, which are scripts) 4 times (4 tasks).
#$ -S /bin/bash
#$ -l h_vmem=2G
#$ -l tmem=2G
#$ -cwd
#$ -j y
#Run 4 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 4
#$ -t 1-4
#$ -N example
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
cat main.m f.m | matlab -nodisplay -nodesktop -nojvm -nosplash
At the end I obtain 4 outputs: example.o[...].1, example.o[...].2, example.o[...].3, and example.o[...].4. Each of them looks like:
...
Task ID is ...
< M A T L A B (R) >
...
>> >> >> >> >> >> >> >> >> >> >>
output =
4.0234 -3.4763
How can I combine these 4 outputs into a 4x2 matrix and save it?

You should save the relevant output from within f.m using MATLAB's save or something similar.
If you use the -r flag to call main and f from the command line, you can add a variable that will contain the task ID, and you can then access it from within f.m:
matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; f; exit"
Then within f.m
% You theoretically generate some numeric result
result = rand(1, 2);
filename = sprintf('Result.%d.mat', ID);
save(filename, 'result')
This will save Result.1.mat, Result.2.mat, etc.
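Once all four tasks have finished, a short collection script can stitch these per-task files back into the 4x2 matrix you asked about (a sketch, assuming the Result.N.mat naming above; the combined file name is just an example):
nTasks = 4;
output = zeros(nTasks, 2);
for ID = 1:nTasks
    % each Result.N.mat holds a 1x2 variable called 'result'
    tmp = load(sprintf('Result.%d.mat', ID));
    output(ID, :) = tmp.result;
end
save('CombinedResults.mat', 'output')
Collecting the results afterwards also sidesteps any write races between array tasks, since SGE may run all four tasks at the same time; the append-in-place alternative below assumes they run one after another.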
Alternatively, you could modify f.m so that it loads the data from the file, appends to it, and re-saves it every time:
result = rand(1,2);
filename = 'Results.mat';
% If this is the first task, then create a new file, otherwise append to the old
if ID == 1
    data = result;
else
    tmp = load(filename, '-mat');
    data = tmp.data;
    data(ID,:) = result;
end
save(filename, 'data')

Related

How to break a MATLAB system call after a specified number of seconds

I am using Windows MATLAB to run SSH commands, but every once in a while the SSH command hangs indefinitely, and then so does my MATLAB script (sometimes when running overnight). How can I have a command time-out after a certain amount of waiting time? For example, suppose I don't want to wait more than 3 seconds for a SSH command to finish execution before breaking it and moving on:
% placeholder for a command that sometimes hangs
[status,result] = system('ssh some-user@0.1.2.3 sleep 10')
% placeholder for something I still want to run if the above command hangs
[status,result] = system('ssh some-user@0.1.2.3 ls')
I'd like to make a function sys_with_timeout that can be used like this:
timeoutDuration = 3;
[status,result] = sys_with_timeout('ssh some-user@0.1.2.3 sleep 10', timeoutDuration)
[status,result] = sys_with_timeout('ssh some-user@0.1.2.3 ls', timeoutDuration)
I've tried the timeout function from FEX but it doesn't seem to work for a system/SSH command.
I don't think the system command is very flexible, and I don't see a way to do what you want using it.
But we can use the Java functionality built into MATLAB for this, according to this answer by André Caron.
This is how you'd wait for the process to finish with a timeout:
runtime = java.lang.Runtime.getRuntime();
process = runtime.exec('sleep 20');
process.waitFor(10, java.util.concurrent.TimeUnit.SECONDS);
if process.isAlive()
    disp("Done waiting for this...")
    process.destroyForcibly();
end
process.exitValue()
In the example, we run the sleep 20 shell command, then waitFor() waits until the program finishes, but for a maximum of 10 seconds. We poll to see if the process is still running, and kill it if it is. exitValue() returns the status, if you need it.
Running sleep 5 I see:
ans =
0
Running sleep 20 I see:
Done waiting for this...
ans =
137
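As an aside, waitFor(timeout, unit) also returns a logical that is true if the process exited before the timeout, so the isAlive() check can presumably be folded into that return value; a minimal sketch:
runtime = java.lang.Runtime.getRuntime();
process = runtime.exec('sleep 20');
% true if the process finished within 10 seconds, false if we timed out
finished = process.waitFor(10, java.util.concurrent.TimeUnit.SECONDS);
if ~finished
    disp("Done waiting for this...")
    process.destroyForcibly();
end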
Building on @Cris Luengo's answer, here is the sys_with_timeout() function. I didn't use process.waitFor() because I'd rather wait in a while loop and display output as it comes in. The while loop breaks once the command finishes or times out, whichever comes first.
function [status,cmdout] = sys_with_timeout(command,timeoutSeconds,streamOutput,errorOnTimeout)
arguments
    command char
    timeoutSeconds {mustBeNonnegative} = Inf
    streamOutput logical = true    % display output as it comes in
    errorOnTimeout logical = false % if false, display warning only
end
% launch command as java process (does not wait for output)
process = java.lang.Runtime.getRuntime().exec(command);
% start the timeout timer!
timeoutTimer = tic();
% Output reader (from https://www.mathworks.com/matlabcentral/answers/257278)
outputReader = java.io.BufferedReader(java.io.InputStreamReader(process.getInputStream()));
errorReader = java.io.BufferedReader(java.io.InputStreamReader(process.getErrorStream()));
% initialize output char array
cmdout = '';
while true
    % If any lines are ready to read, append them to output
    % and display them if streamOutput is true
    if outputReader.ready() || errorReader.ready()
        if outputReader.ready()
            nextLine = char(outputReader.readLine());
        elseif errorReader.ready()
            nextLine = char(errorReader.readLine());
        end
        cmdout = [cmdout,nextLine,newline()];
        if streamOutput == true
            disp(nextLine);
        end
        continue
    else
        % if there are no lines ready in the reader, and the
        % process is no longer running, then we are done
        if ~process.isAlive()
            break
        end
        % Check for timeout. If timeout is reached, destroy
        % the process and break
        if toc(timeoutTimer) > timeoutSeconds
            timeoutMessage = ['sys_with_timeout(''',command, ''',', num2str(timeoutSeconds), ')',...
                ' failed after timeout of ',num2str(toc(timeoutTimer)),' seconds'];
            if errorOnTimeout == true
                error(timeoutMessage)
            else
                warning(timeoutMessage)
            end
            process.destroyForcibly();
            break
        end
    end
end
if ~isempty(cmdout)
    cmdout(end) = []; % remove trailing newline of command output
end
status = process.exitValue(); % return
end
Replacing ssh some-user@0.1.2.3 with wsl (requires WSL, of course) for simplicity, here is an example of a call that times out:
>> [status,cmdout] = sys_with_timeout('wsl echo start! && sleep 2 && echo finished!',1)
start!
Warning: sys_with_timeout('wsl echo start! && sleep 2 && echo finished!',1) failed after timeout of 1.0002 seconds
> In sys_with_timeout (line 41)
status =
1
cmdout =
'start!'
and here is an example of a call that doesn't time out:
>> [status,cmdout] = sys_with_timeout('wsl echo start! && sleep 2 && echo finished!',3)
start!
finished!
status =
0
cmdout =
'start!
finished!'

Snakemake workflow, ChildIOException or MissingInputException

I am trying to add a file-renaming step to my current workflow to make it easier on some of the other users. What I want to do is take the contigs.fasta file from a SPAdes assembly directory and rename it to include the sample name (e.g. foo_de_novo/contigs.fasta to foo_de_novo/foo.fasta).
Here is my code as it currently stands:
configfile: "config.yaml"

import os

def is_file_empty(file_path):
    """ Check if file is empty by confirming if its size is 0 bytes"""
    # Check if singleton file exist and it is empty from bbrepair output
    return os.path.exists(file_path) and os.stat(file_path).st_size == 0

rule all:
    input:
        expand("{sample}_de_novo/{sample}.fasta", sample = config["names"]),

rule fastp:
    input:
        r1 = lambda wildcards: config["sample_reads_r1"][wildcards.sample],
        r2 = lambda wildcards: config["sample_reads_r2"][wildcards.sample]
    output:
        r1 = temp("clean/{sample}_r1.trim.fastq.gz"),
        r2 = temp("clean/{sample}_r2.trim.fastq.gz")
    shell:
        "fastp --in1 {input.r1} --in2 {input.r2} --out1 {output.r1} --out2 {output.r2} --trim_front1 20 --trim_front2 20"

rule bbrepair:
    input:
        r1 = "clean/{sample}_r1.trim.fastq.gz",
        r2 = "clean/{sample}_r2.trim.fastq.gz"
    output:
        r1 = temp("clean/{sample}_r1.fixed.fastq"),
        r2 = temp("clean/{sample}_r2.fixed.fastq"),
        singles = temp("clean/{sample}.singletons.fastq")
    shell:
        "repair.sh -Xmx10g in1={input.r1} in2={input.r2} out1={output.r1} out2={output.r2} outs={output.singles}"

rule spades:
    input:
        r1 = "clean/{sample}_r1.fixed.fastq",
        r2 = "clean/{sample}_r2.fixed.fastq",
        s = "clean/{sample}.singletons.fastq"
    output:
        directory("{sample}_de_novo")
    run:
        isempty = is_file_empty("clean/{sample}.singletons.fastq")
        if isempty == "False":
            shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {output}")
        else:
            shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {output}")

rule rename_spades:
    input:
        "{sample}_de_novo/contigs.fasta"
    output:
        "{sample}_de_novo/{sample}.fasta"
    shell:
        "cp {input} {output}"
When I have it written like this I get the MissingInputException, and when I change it to this:
rule rename_spades:
    input:
        "{sample}_de_novo"
    output:
        "{sample}_de_novo/{sample}.fasta"
    shell:
        "cp {input} {output}"
I get the ChildIOException.
I think I understand why Snakemake is unhappy with both versions. The first fails because I don't explicitly declare "{sample}_de_novo/contigs.fasta" as an output; it is just one of several files SPAdes writes. And the other error is because it doesn't like how I am asking it to look into the directory. I am, however, at a loss on how to fix this.
Is there a way to ask Snakemake to look into a directory for a file and then perform the requested task?
Thank you,
Sean
EDIT: File structure of the SPAdes output:
Sample_de_novo
|-corrected/
|-K21/
|-K33/
|-K55/
|-K77/
|-misc/
|-mismatch_corrector/
|-tmp/
|-assembly_graph.fastg
|-assembly_graph_with_scaffolds.gfa
|-before_rr.fasta
|-contigs.fasta
|-contigs.paths
|-dataset.info
|-input_dataset.yaml
|-params.txt
|-scaffolds.fasta
|-scaffolds.paths
|-spades.log
Make {sample}_de_novo/contigs.fasta the output of the spades rule and parse its path to get the directory that will be the argument to spades -o. Snakemake won't mind if other files are created in addition to contigs.fasta. This should pass a --dry-run:
rule all:
    input:
        expand('{sample}_de_novo/{sample}.fasta', sample=['A', 'B']),

rule spades:
    output:
        fasta='{sample}_de_novo/contigs.fasta',
    run:
        outdir = os.path.dirname(output.fasta)
        shell(f'spades ... -o {outdir}')

rule rename:
    input:
        fasta='{sample}_de_novo/contigs.fasta',
    output:
        fasta='{sample}_de_novo/{sample}.fasta',
    shell:
        r"""
        mv {input.fasta} {output.fasta}
        """
Nope, spoke too soon. It didn't name the output directory correctly, so I moved it to params and now it is finally working the way I wanted:
rule spades:
    input:
        r1 = "clean/{sample}_r1.fixed.fastq",
        r2 = "clean/{sample}_r2.fixed.fastq",
        s = "clean/{sample}.singletons.fastq"
    output:
        "{sample}_de_novo/contigs.fasta"
    params:
        outdir = directory("{sample}_de_novo/")
    run:
        # check the actual singletons file for this sample
        isempty = is_file_empty(input.s)
        if not isempty:
            shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.s} -o {params.outdir}")
        else:
            shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {params.outdir}")

rule rename_spades:
    input:
        "{sample}_de_novo/contigs.fasta"
    output:
        "{sample}_de_novo/{sample}.fasta"
    shell:
        "cp {input} {output}"

Gnuplot prints differently when piping

I have a file like this:
662.0,,9624
663.0,,9625
771.0,9624,
772.0,,9626
912.0,9625,
913.0,,9627
1083.0,9626,
1083.0,,9628
1174.0,9627,
1175.0,,9629
And when I use the gnuplot console, it plots as expected:
set datafile separator ',';
plot 'data-exchange.log' using 1:2 pt 7 ps 2, '' using 1:3 pt 7 ps 2;
However, when I do the same while piping, it doesn't plot as expected:
cat data-exchange.log | gnuplot -p -e 'set datafile separator ","; plot "<cat" using 1:2 pt 7 ps 2, "" using 1:3 pt 7 ps 2;'
Not sure why that is...

LSF serial jobs on HPC performance worse than local sequential executions

I'm learning how to use HPC on our lab's clusters, which use LSF. I tried a simple set of serial jobs, each of which counts the frequency of the words in a text file. I wrote a Python script for counting word frequencies named count_word_freq.py and a job script named myjob.job as follows:
python count_word_freq.py --in ~/books/1.txt --out ~/freqs/freq1.txt
python count_word_freq.py --in ~/books/2.txt --out ~/freqs/freq2.txt
python count_word_freq.py --in ~/books/3.txt --out ~/freqs/freq3.txt
python count_word_freq.py --in ~/books/4.txt --out ~/freqs/freq4.txt
python count_word_freq.py --in ~/books/5.txt --out ~/freqs/freq5.txt
and an LSF script to submit the serial jobs to the selfscheduler:
#!/bin/bash
#BSUB -J test01
#BSUB -P acc_pandeg01a
#BSUB -q alloc
#BSUB -W 20
#BSUB -n 20
#BSUB -m manda
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash
module load python
module load py_packages
module load selfsched
# And run the program; output will be on stdout
mpirun selfsched < myjobs001.jobs
The Python code is as follows:
def readBookAsFreqDict(infile):
    dic = {}
    with open(infile,"r") as file:
        for line in file:
            contents = line.split(" ")
            for cont in contents:
                if str.isalpha(cont):
                    if cont not in dic.keys():
                        dic[cont] = 1
                    else:
                        dic[cont] = dic[cont] + 1
    return dic

import sys
import argparse
import time as T

if __name__ == "__main__":
    start = T.time()
    parser = argparse.ArgumentParser()
    parser.add_argument('--i', type=str, help = 'input file')
    parser.add_argument('--o', type=str, help = 'output file')
    args = parser.parse_args()
    dic = readBookAsFreqDict(args.i)
    outfile = open(args.o,"w")
    for key,freq in dic.iteritems():
        outfile.write(key + ":" + str(freq) + "\n")
    end = T.time()
    print (end - start)
The 5 input texts are almost the same size, around 3.5 MB each. My problem is that the CPU time for running this batch of jobs is 980 s, which is worse than running them sequentially on my local machine.
To my understanding, the selfscheduler can automatically assign the 5 jobs to empty nodes and thus save time compared to running them sequentially. Is this because the execution time of each job is too short compared to the time needed to find an empty node? Are there any other approaches that could be used to make it faster?
Thank you!

Shell variable name queried from Matlab has additional character

I'm working with the following script, run_test:
#!/bin/sh
temp=$1;
cat <<EOF | matlab
[status name] = unix('echo $temp');
disp(name);
% some Matlab code
test_complete = 1;
save(name)
exit
EOF
I want to pass a name to the script, run some code, then save a .mat file with the name that was passed. However, there is a curious piece of behavior:
[energon2] ~ $ ./run_test 'run1'
Warning: No display specified. You will not be able to display graphics on the screen.
< M A T L A B (R) >
Copyright 1984-2010 The MathWorks, Inc.
Version 7.12.0.635 (R2011a) 64-bit (glnxa64)
March 18, 2011
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> >> >> >> run1
>> >> >> >> >>
[energon2] ~ $ ls *.mat
run1?.mat
There is a "?" at the end of the file name when it's saved, but not when displayed on command line. This is acceptable for my needs, but a bit irritating to not know why it's occurring. Any explanation would be appreciated.
Edit, solution:
Yuk was correct below about the underlying cause and the use of save('$temp'). I'm now using the following script:
#!/bin/sh
temp=$1;
cat <<EOF | matlab
% some Matlab code
test_complete = 1;
save('$temp')
exit
EOF
Thanks for the help.
Your name variable has an end-of-line character as its last character. When you run echo run1 in unix, the command displays run1 and then "hits enter". In your script, all of echo's output is saved to the name variable.
You can confirm it with the following:
>> format compact
>> [status, name] = unix('echo run1')
status =
0
name =
run1
>> numel(name)
ans =
5
>> int8(name(end))
ans =
10
>> int8(sprintf('\n'))
ans =
10
Apparently this character can be part of a file name in unix, but the shell displays it as ?.
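If you do want to keep the unix('echo ...') round trip, one option (a small sketch, not from the original answer) is to strip that trailing newline before saving:
[status, name] = unix('echo $temp');
name = strtrim(name);    % drop the trailing newline (char 10) that echo appends
save(name)               % now saves run1.mat, without the stray '?'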
Can't you do save($temp) instead?
EDIT: See my comments below for correction and more explanation.