Snakemake: how to realize a mechanism to copy input/output files to/from tmp folder and apply rule there - hpc

We use Slurm workload manager to submit jobs to our high performance cluster. During runtime of a job, we need to copy the input files from a network filesystem to the node's local filesystem, run our analysis there and then copy the output files back to the project directory on the network filesystem.
While the workflow management system Snakemake integrates with Slurm (by defining profiles) and allows each rule/step in the workflow to run as a Slurm job, I haven't found a simple way to specify, for each rule, whether a tmp folder should be used (with all the implications stated above) or not.
I would be very grateful for simple solutions to realise this behaviour.

I am not entirely sure if I understand correctly. I am guessing you do not want to copy the input of each rule to a certain directory, do the rule, then copy the output back to another filesystem, since that would be a lot of unnecessary files moving around. So for the first half of the answer I assume before execution you move your files to /scratch/mydir.
I believe you could use the --directory option (https://snakemake.readthedocs.io/en/stable/executing/cli.html). However, I find this works poorly, since Snakemake then has difficulty finding config.yaml and samples.tsv.
The way I solve this is just by adding a working dir in front of my paths in each rule...
    rule example:
        input:
            config["cwd"] + "{sample}.txt"
        output:
            config["cwd"] + "processed/{sample}.txt"
        shell:
            """
            touch {output}
            """
So all you then have to do is change cwd in your config.yaml. Note the trailing slash: the rule concatenates cwd and the file name directly.
Local:
    cwd: ./
Slurm:
    cwd: /scratch/mydir/
You would then have to manually copy them back to your long-term filesystem or make a rule that would do that for you.
If, however, you do want to copy your files from filesystem A -> B, run your rule, and then move the result from B -> A, then I think you want to make use of shadow rules. I think the docs explain how to use them well, so I just give a link :).
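If neither approach fits, the copy-in, run, copy-out pattern from the question can also be written by hand. The following is a minimal sketch in plain Python, not a Snakemake feature: run_in_scratch, work and scratch_root are hypothetical names chosen for illustration, and the node-local scratch location is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_scratch(inputs, outputs, work, scratch_root=None):
    """Copy inputs to a temporary directory, run work() there, copy outputs back.

    inputs, outputs: paths on the network filesystem.
    work: callable taking (local_inputs, local_outputs), both lists of Path.
    scratch_root: parent of the temporary directory (assumed node-local disk);
                  None falls back to the system default temp location.
    """
    inputs = [Path(p) for p in inputs]
    outputs = [Path(p) for p in outputs]
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        tmp = Path(tmp)
        # Stage the inputs into the scratch directory.
        local_in = []
        for i, src in enumerate(inputs):
            dst = tmp / f"in{i}_{src.name}"
            shutil.copy2(src, dst)
            local_in.append(dst)
        # The work function writes its results to these scratch paths.
        local_out = [tmp / f"out{i}_{p.name}" for i, p in enumerate(outputs)]
        work(local_in, local_out)
        # Copy the results back to the project directory.
        for src, dst in zip(local_out, outputs):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
```

A helper like this could in principle be called from a rule's run: block with the rule's input and output lists, but that integration is an assumption and is not tested here.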

Related

Why does Yocto use absolute paths in TMPDIR?

Changing the path of a Yocto environment is not a good idea, as I found out. This also explains why e.g. bitbake can be run regardless of the current working directory. Absolute paths are stored in many places during the build process, and even subdirectory structures are created in the tmp directory tree. I ended up rebuilding from scratch, which takes a long time.
A documentation of how I tried to modify all paths:
find . -name '*.conf' -exec sed -i 's/media\/rob\/3210bcd4-49ef-473e-97a6-e4b7a2c1973e/home/g' {} +
This step replaces absolute paths within many dynamic conf files (from xx/xx/linux to /home/linux, where linux was chosen for historical reasons; I could also mount the partition as /home/yocto or any other name).
Next was deletion of subdirectory structures with the old path in the hope that the build process would recognize these deletions, and still rebuild quickly:
find . -name '*3210bcd4-49ef-473e-97a6-e4b7a2c1973e*' -exec fakeroot rm -r {} +
It was not recognized. Then I gave up.
From a user new to Yocto, familiar with former/classic crossbuild environments based on make menuconfig etc.
My question is:
Why are absolute paths generated & used throughout tmp instead of treating everything as relative?
Or, asked differently:
Why not use something like ${TOPDIR}/tmp throughout the build configuration, instead of hardcoding the absolute path to tmp?

Zip files with encryption in a remote share, keeping original names and location

My team faces the need to encrypt all files in a repository with AES256. For this purpose, we decided we are going to zip all files with such encryption, using the same key for all of them.
The problem we have is that these files sit on a NAS, so from Windows boxes they are accessible only through a network (UNC) path.
The directory structure is something like this:
Original Structure:
Root
-1
|--folder1
|---file1.ext
|---file2.ext
|--folder2
|---filea.ext
|---fileb.ext
|--folder2.a
|---filec.ext
and so on...
Essentially, what we need is to have all the original files contained in a zip file, keeping their original names, which would be something like this:
Desired Outcome:
|-Root
|-1
|--folder1
|---file1.zip
|---file2.zip
|--folder2
|---filea.zip
|---fileb.zip
|--folder2.a
|---filec.zip
and so on...
To accomplish this, we tried a batch script that calls 7zip, but it only works if it's run from the root directory, which is something we cannot do as the files are not on a server.
Here is the syntax of the batch script we came up with:
FOR /R %%i IN ("*.wmv") DO "C:\Program Files\7-Zip\7z.exe" a -mx0 -tzip -pPasswordHere "%%~dpni.zip" "%%i"
But, as I wrote previously, it only works when run from the root folder, which is something we cannot do as the files sit on a network location.
Mapping the drive or making a symbolic link to it doesn't do the trick either.
I've also checked on 7zip to do this, namely, making use of its "-r" operator, but I couldn't find a way to get the desired outcome (namely, recurse through all folders in the remote tree structure -there are a lot of them...- and keep the original file name).
I'm open to any suggestions, as any kind of script, trick or gizmo that gets the job done will be more than welcome. =)
Thanks a million in advance!,
Sebas.
----SOLUTION----
I actually found a solution here, mapping the drive in a different way (it's so simple it just made me feel stupid(er), but it's altogether beautiful).
From within the batch script, the remote share can be mapped and entered like so:
You can map a drive using
net use X: \\server\directory
and then you can change to that directory using
pushd X:
(Post from which the answer was taken: Batch File Iterating through files on a local network server)
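As an alternative to mapping a drive, the same recursion can be sketched in Python, which can walk a directory tree from any starting path. This is only a sketch under assumptions: build_7z_commands is a hypothetical helper, the 7-Zip flags are copied unchanged from the batch script above, and whether os.walk can read your particular share directly (for example a root such as \\server\directory) is something to verify on your own setup. The function only builds the command lines; it does not run them.

```python
import os
from pathlib import Path


def build_7z_commands(root, password, suffix=".wmv",
                      seven_zip=r"C:\Program Files\7-Zip\7z.exe"):
    """Return one 7z command per matching file under root.

    Each archive is placed next to its source file and keeps the
    source's base name, e.g. file1.wmv -> file1.zip.
    """
    commands = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(suffix):
                continue
            src = Path(dirpath) / name
            dst = src.with_suffix(".zip")
            # Flags taken as-is from the batch script in the question.
            commands.append([seven_zip, "a", "-mx0", "-tzip",
                             f"-p{password}", str(dst), str(src)])
    return commands
```

Each returned list can then be passed to subprocess.run(cmd, check=True) once the output looks right.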

flexible merge command for unison to pick newer or older file?

I've been using unison as my file synchronizer of choice and life has been great.
Essentially I could modify any files on any side at any time without ever worrying who's master and slave, etc. It's bidirectional.
However, with four roots failing over to each other when each one's primary partner cannot be reached, I'm starting to push the limits of this tool. Conflicts arise that halt automatic syncing for the files involved. Aspects of my business logic are distributed across the different hosts, which sometimes modify the same files when run.
The merge option in the configuration file comes into play. It lets you specify different merge commands for different file types.
For example for log files only I like to interpolate their lines with:
merge = Name *.log -> diff3 -m CURRENT1 CURRENTARCH CURRENT2 > NEW || echo "differences detected"
Question: for *.last files only, what merge command would always favor the older copy?
For *.rb *.sh and other source files, I'm not looking to merge but always pick the newer version in case of conflicts. I can do that by default with the prefer = newer global option though.
For *.png files I typically prefer to keep the smaller(optimized) size.
Regarding the .rb and .sh files, you could use preferpartial = Name *.rb -> newer, and the same for .sh files. For .last files, you can use older instead.
Regarding .png files, you could write your own merge command that checks the size of both files. I would then set merge = Name *.png -> mycmp CURRENT1 CURRENT2 NEW, and have the mycmp command take three file paths, compare the sizes of the first two, and copy the smaller file to the third path.
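A minimal sketch of what such a mycmp command could look like, written in Python. The script name and the keep_smaller function are hypothetical; the only behaviour assumed from unison is that it substitutes three file paths for CURRENT1, CURRENT2 and NEW. Ties go to the first file.

```python
import os
import shutil
import sys


def keep_smaller(current1, current2, new):
    """Copy whichever of current1/current2 is smaller to new.

    If both files have the same size, current1 is used.
    Returns the path that was chosen.
    """
    if os.path.getsize(current1) <= os.path.getsize(current2):
        smaller = current1
    else:
        smaller = current2
    shutil.copyfile(smaller, new)
    return smaller


# When run as a command with exactly three arguments: mycmp CURRENT1 CURRENT2 NEW
if __name__ == "__main__" and len(sys.argv) == 4:
    keep_smaller(sys.argv[1], sys.argv[2], sys.argv[3])
```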

how to check for activity or lack thereof on a unix file directory using perl or unix commands

Scenario:
I have a process where many files are being copied (scp'd) to a DestinationServer by, for example, Host1, Host2, Host3 and Host4. All files go to the same common directory: DestinationServer:/home/target. All the files are unique, so no files will be overwritten. Host1-Host4 will each have a cronjob that launches their scp script to DestinationServer. The caveat is that the Hosts are in different time zones and locations, so they will finish at different times.
Need:
Since the files are being scp'd to Destination:/home/target, what is the best way to programmatically check when those scp's from the other Hosts are done??
Options:
My options are to programmatically do this either in perl or shell if possible.
What do I look for, what unix commands or perl modules could I use to help determine when the processes would finish? Any ideas, examples would be great! Thanks.
Use a Maildir kind of approach: copy all files to a temporary directory, then after the transfer is complete have the originating host perform a rename into the target directory via ssh. That way when a file appears in the target directory, you know that it is complete.
I suggest this because if you just scp files into the target directory and monitor the directory in whatever way, you cannot distinguish a complete transfer from an interrupted scp command or a network failure.
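The idea can be sketched locally in Python. This is an illustration only: in the real setup the copy is an scp and the rename is issued over ssh, whereas here both steps are plain file operations, and deliver, incoming_dir and target_dir are hypothetical names.

```python
import os
import shutil
from pathlib import Path


def deliver(src, incoming_dir, target_dir):
    """Copy src into incoming_dir, then rename it into target_dir.

    A partially written file is only ever visible in incoming_dir.
    The rename is atomic when both directories are on the same
    filesystem, so anything seen in target_dir is complete.
    """
    name = Path(src).name
    incoming = Path(incoming_dir) / name
    final = Path(target_dir) / name
    shutil.copyfile(src, incoming)  # the slow, interruptible step
    os.replace(incoming, final)     # the quick "publish" step
    return final
```

A watcher on the destination then only needs to look at target_dir and can ignore incoming_dir entirely.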
Perl modules for monitoring file changes that may help: SGI::FAM, Sys::Gamin
A similar alternative to Jouni's approach is to use semaphore files. Before scp-ing its files, the originating host creates a semaphore file, and it removes the file when finished. Once the semaphore file is gone, you know that host is done.

How do I copy from numerous release directories to a single folder

Okay this is and isn't programming related I guess...
I've got a whole bunch of little useful console utilities scattered across a suite of projects that I wrote and I want to dump them all to a single directory to make using them simpler. The only issue is that I have them all compiled in both Debug and Release mode.
Given that I only want the release mode versions in my utilities directory, what switch would allow me to specify that I want all executables from my tree structure but only from within Release folders:
Example:
Projects\
    Project1\
        Bin\
            Debug\
                Project1.exe
            Release\
                Project1.exe
    Project2\
    etc etc...
To
Utilities\
    Project1.exe
    Project2.exe
    Project3.exe
    Project4.exe
    ...
    etc etc...
I figured this would be a cinch with XCopy - but it doesn't seem to allow me to exclude the Debug directories - or rather - only include items in my Release directories.
Any ideas?
You can restrict it to only release executables with the following. However, I do not believe the other requirement of flattening is possible using xcopy alone. To do the restriction:
First create a file such as exclude.txt and put this inside:
\Debug\
Then use the following command:
xcopy /e /EXCLUDE:exclude.txt *.exe C:\target
You can, however, accomplish what you want using xxcopy (free for non-commercial use). Read technical bulletin #16 for an explanation of the flattening features.
If the claim in that technical bulletin is correct, then it confirms that flattening cannot be accomplished with xcopy alone.
The following command will do exactly what you want using xxcopy:
xxcopy /sgfo /X:*\Debug\* .\Projects\*.exe .\Utilities
I recommend reading the technical bulletin, however, as it gives more sophisticated options for the flattening. I chose one of the most basic above.
Sorry, I haven't tried it yet, but shouldn't you be using:
xcopy release*.exe d:\destination /s
I am currently on my Mac, so I can't check to be sure.
This might not help you with assembling them all in one place now, but going forward, have you considered adding a post-build event to the projects in Visual Studio (I'm assuming you are using it, based on the directory names)?
xcopy /Y /I /E "$(TargetDir)\$(TargetFileName)" "c:\somedirectory\$(TargetFileName)"
Ok, this is probably not going to work for you since you seem to be on a Windows machine.
Here goes anyway, for the logic.
# From the base directory
mkdir Utilities
find . -type f | grep -w Release > utils.txt
while IFS= read -r f; do cp "$f" Utilities/; done < utils.txt
You can combine the find and cp lines into one, I split them for readability.
To do this on a Windows machine you'll need Cygwin or some such Unix utilities handy.
Maybe there are tools in the Windows shell to do this...
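The same logic can also be sketched in Python, which runs on Windows without Cygwin. This is a sketch under assumptions, not a tested replacement for the commands above: collect_release_exes is a hypothetical name, and it assumes every wanted executable sits somewhere below a folder literally named Release. If two projects produce an executable with the same name, the later copy overwrites the earlier one.

```python
import shutil
from pathlib import Path


def collect_release_exes(projects_dir, utilities_dir):
    """Copy every *.exe found under a 'Release' folder into utilities_dir.

    The directory structure is flattened: only the file name is kept.
    Returns the list of copied file names.
    """
    projects = Path(projects_dir)
    utilities = Path(utilities_dir)
    utilities.mkdir(parents=True, exist_ok=True)
    copied = []
    for exe in sorted(projects.rglob("*.exe")):
        # Folders between projects_dir and the file itself.
        folders = exe.relative_to(projects).parent.parts
        if any(part.lower() == "release" for part in folders):
            shutil.copy2(exe, utilities / exe.name)
            copied.append(exe.name)
    return copied
```

For the layout in the question, this would be called as collect_release_exes("Projects", "Utilities") from the directory that contains Projects.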
This may help get you started:
C:\>for %i in (*) do dir "%~dpi\*.exe"
Used in the dir command as a modifier to i, ~dp uses the drive and path of everything found in (*). If I run the above in a folder that has several subfolders containing executables, I get a dir list of all of the executables in each folder.
You should be able to modify that to add '\bin\release\' following the ~dpi portion and change dir to xcopy. A little experimentation should make it pretty easy.
To use the for statement above in a batch file, change '%' to '%%' in both places.