How to make Snakemake recognize Globus remote files using Globus CLI? - hpc

I am working in a high performance computing grid environment, where large-scale data transfers are done via Globus. I would like to use Snakemake to pull data from a Globus path, process the data, and then push the processed data to a different Globus path. Globus has a command-line interface.
Pulling the data is no problem, for I'd just create a rule that would run globus transfer to create the requisite local file. But for pushing the data back to Globus, I think I'll need a rule that can "see" that the file is missing at the remote location, and then work backwards to determine what needs to happen to create the file.
I could create local "proxy" files that represent the remote files. For example I could make a rule for creating 'processed_data_1234.tar.gz' output files in a directory. These files would just be created using touch (thus empty), and the same rule will run globus transfer to push the files remotely. But then there's the overhead of making sure that the proxy files don't get out of sync with the real Globus-hosted files.
Is there a more elegant way to do this akin to the Remote File capability? Is it difficult to add a Globus CLI support for Snakemake? Thanks in advance for any advice!

Would it help to create a utility function that would generate a list of all desired files and compare it against the list of files available on globus? Something like this (pseudocode):
def return_needed_files(wildcards):
    list_needed_files = []  # either hard-coded or specified with some logic
    list_available = []  # as appropriate, e.g. using globus ls
    return [i for i in list_needed_files if i not in list_available]

# include all the needed files in the all rule
rule all:
    input: return_needed_files
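The comparison step can be sketched in plain Python. This is a minimal sketch, assuming that `globus ls ENDPOINT_ID:PATH` prints one entry name per line in its default text output. The helper names `list_remote` and `missing_files` and the example file names are hypothetical, and you would have to supply the endpoint ID and path yourself.

```python
import subprocess


def list_remote(endpoint_and_path):
    """Return the entry names printed by `globus ls ENDPOINT_ID:PATH`.

    Assumption: the default text output has one name per line.
    """
    result = subprocess.run(
        ["globus", "ls", endpoint_and_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_files(needed, available):
    """Return the needed names that are absent from the listing, in order."""
    available = set(available)
    return [name for name in needed if name not in available]


# Hypothetical example using a hard-coded listing instead of a live `globus ls` call.
needed = ["processed_data_1234.tar.gz", "processed_data_5678.tar.gz"]
still_missing = missing_files(needed, {"processed_data_1234.tar.gz"})
print(still_missing)
```

In a Snakefile, the input function could call `missing_files(needed, list_remote(...))`. Note that the remote listing would then be queried each time Snakemake builds its job graph.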

Related

Where is a file created via Terraform code stored in Terraform Cloud?

I've been using Terraform for some time, but I'm new to Terraform Cloud. I have a piece of code that, when run locally, creates a .tf file in a folder that I specify, but when I run it with the Terraform CLI against Terraform Cloud, this doesn't happen. Here is the code, to make this clearer:
resource "genesyscloud_tf_export" "export" {
  directory             = "../Folder/"
  resource_types        = []
  include_state_file    = false
  export_as_hcl         = true
  log_permission_errors = true
}
So when I run this code locally with terraform apply, it creates a .tf file with everything I need. Where? It goes up one folder and stores the file under the folder "Folder".
But when I execute the same code on Terraform Cloud, this obviously doesn't happen. Does anyone have a workaround for this kind of problem? How can I store this file, for example, in a GitHub repo when running GitHub Actions? Thanks in advance.
The Terraform Cloud remote execution environment has an ephemeral filesystem that is discarded after a run is complete. Any files you instruct Terraform to create there during the run will therefore be lost after the run is complete.
If you want to make use of this information after the run is complete then you will need to arrange to either store it somewhere else (using additional resources that will write the data to somewhere like Amazon S3) or export the relevant information as root module output values so you can access it via Terraform Cloud's API or UI.
I'm not familiar with genesyscloud_tf_export, but from its documentation it sounds like it will create either one or two files in the given directory:
genesyscloud.tf or genesyscloud.tf.json, depending on whether you set export_as_hcl. (You did, so I assume it'll generate genesyscloud.tf.)
terraform.tfstate if you set include_state_file. (You didn't, so I assume that file isn't important in your case.)
Based on that, I think you could use the hashicorp/local provider's local_file data source to read the generated file into memory once the MyPureCloud/genesyscloud provider has created it, like this:
resource "genesyscloud_tf_export" "export" {
  directory             = "../Folder"
  resource_types        = []
  include_state_file    = false
  export_as_hcl         = true
  log_permission_errors = true
}
data "local_file" "export_config" {
  filename = "${genesyscloud_tf_export.export.directory}/genesyscloud.tf"
}
You can then refer to data.local_file.export_config.content to obtain the content of the file elsewhere in your module and declare that it should be written into some other location that will persist after your run is complete.
This genesyscloud_tf_export resource type seems unusual in that it modifies data on local disk and so its result presumably can't survive from one run to the next in Terraform Cloud. There might therefore be some problems on the next run if Terraform thinks that genesyscloud_tf_export.export.directory still exists but the files on disk don't, but hopefully the developers of this provider have accounted for that somehow in the provider logic.

How to configure ClamAV's freshclam.conf to point to a local nexus repository?

My company has tasked me with installing ClamAV on a large number of machines running RHEL6, none of which have internet access. I know freshclam.conf can be edited to point to a local mirror of the virus database, in this section of the file:
# This option allows you to easily point freshclam to private mirrors.
# If PrivateMirror is set, freshclam does not attempt to use DNS
# to determine whether its databases are out-of-date, instead it will
# use the If-Modified-Since request or directly check the headers of the
# remote database files. For each database, freshclam first attempts
# to download the CLD file. If that fails, it tries to download the
# CVD file. This option overrides DatabaseMirror, DNSDatabaseInfo
# and ScriptedUpdates. It can be used multiple times to provide
# fall-back mirrors.
# Default: disabled
#PrivateMirror mirror1.mynetwork.com
#PrivateMirror mirror2.mynetwork.com
The company has Sonatype Nexus repositories available, to which we can push the database files at an interval of our choosing once I have access. I know I can get a link to the repository once it has been created. Do I just paste that link, in its entirety, where mirror1.mynetwork.com currently is, or are there additions I have to make? I have not been able to find any examples answering this simple question, and I have no experience with any of this.

How to copy a local directory with files to a remote server in Talend

In Talend (Data Integration) I am trying to copy a local directory to a remote directory, but when I run the job only the files are copied, not the folders inside the directory. Please help me with this job.
In my Talend job I am using local connection and remote connection components:
tFileList -> tFileProperties (to store the path and name in a table) -> tMSSqlInput (extracting the path from that table) -> iterate -> tSSH (create the directory if it is not available) -> finally tFTPPut to connect and copy to the remote directory.
When I store the tFileProperties output in the table, a size is recorded for files, but for folders the size is zero. I am using this condition to create the directory with the tSSH component, but the folders are not created. Please help.
Do you get an error message?
I believe the output of tMSSqlInput should be row-based rather than an iteration. That might be the source of the problem.
From the tMSSqlInput docs:
tMSSqlInput executes a DB query with a strictly defined order which
must correspond to the schema definition. Then it passes on the field
list to the next component via a Main row link.
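Independent of Talend, the intended logic (create every remote directory first, then upload the files) can be sketched in Python. This is only an illustration of the ordering, not a Talend job: `plan_upload` is a hypothetical helper, and the actual directory creation and upload would still be done by tSSH and tFTPPut. It identifies directories by type rather than by a size of zero, because an empty file also has size zero.

```python
import os
import tempfile


def plan_upload(local_root):
    """Return (dirs, files) relative to local_root, parents before children."""
    dirs, files = [], []
    # os.walk is top-down by default, so a directory is seen before its contents.
    for current, subdirs, names in os.walk(local_root):
        subdirs.sort()  # make the traversal order deterministic
        rel = os.path.relpath(current, local_root)
        if rel != ".":
            dirs.append(rel.replace(os.sep, "/"))
        for name in sorted(names):
            path = name if rel == "." else os.path.join(rel, name)
            files.append(path.replace(os.sep, "/"))
    return dirs, files


# Hypothetical local tree: top.txt and a/b/file.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
for relative in ("top.txt", os.path.join("a", "b", "file.txt")):
    with open(os.path.join(root, relative), "w") as handle:
        handle.write("data\n")

dirs_to_create, files_to_upload = plan_upload(root)
print(dirs_to_create)
print(files_to_upload)
```

Each entry in the first list would map to one "create directory if missing" step, and each entry in the second to one upload.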

How to compress a list of files into a single gzip file using elasticluster, grid-engine-tools, and google cloud

I want to start by thanking you all in advance for your help, as this will clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip file; however, the guide only shows how to compress a list of files into individual gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If there is extra information, please include links to sources.)
After I had set up the grid engine, I ran through the samples in the guide.
Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?
Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?
What changes can be made to the grid-engine-tools to make it work?
EDIT
The reason we are considering a cluster is that we do expect multiple operations occurring simultaneously, zipped up files per order, which will occur systematically so that a vendor can download a single compressed file per order.
May I state the definition of the problem and you can let me know if I understood it correctly, as both Matt and I provided the exact same solution and somehow it doesn't seem sufficient.
Problem Definition
You have an Order defining the start of a task to process some data.
The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
The goal is:
Collect the files from GS bucket (that were produced by each of the nodes),
Archive the collection of files as one file,
Then compress that archive, and
Push it back to a different GS location.
Let me know if I summarized it properly,
Thanks,
Paul
Are the files in question in Cloud Storage?
Are the files in question on a local or network drive?
In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.
The tar utility will create an archive file, and it can compress it as well. For example:
$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt
$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt
$ # Lis(t) the archive contents to verify it
$ tar tvfz archive.tgz
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file2.txt
To extract the contents use:
$ # E(x)tract the archive contents
$ tar xvfz archive.tgz
x myfiles/file1.txt
x myfiles/file2.txt
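If you would rather script this than call tar directly, the same steps can be sketched with Python's standard tarfile module. This is a minimal sketch that recreates the two example files in a temporary directory; the temporary location is an assumption made only so the example is self-contained.

```python
import os
import tarfile
import tempfile

# Create a directory with a few input files (in a temporary location).
workdir = tempfile.mkdtemp()
source_dir = os.path.join(workdir, "myfiles")
os.mkdir(source_dir)
for stem in ("file1", "file2"):
    with open(os.path.join(source_dir, stem + ".txt"), "w") as handle:
        handle.write("This is " + stem + "\n")

# Create a gzip-compressed tar archive ("w:gz" = write with gzip compression).
archive_path = os.path.join(workdir, "archive.tgz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(source_dir, arcname="myfiles")  # directories are added recursively

# List the archive contents to verify it.
with tarfile.open(archive_path, "r:gz") as archive:
    members = sorted(archive.getnames())
print(members)
```

Unlike the shell example, which passes `myfiles/*` to tar, this adds the directory itself, so the listing also contains a `myfiles` entry.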
UPDATE:
In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If results need to be tar-ed only infrequently, and providing the tar-ed results is not extremely time-sensitive, then you could likely do this with a single node.
However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.
Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.
A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
There is an example that does something similar here:
https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress
This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.
-Matt
So there are many ways to do it, but the thing is that you cannot directly compress on Google Storage a collection of files - or a directory - into one file, and would need to perform the tar/gzip combination locally before transferring it.
If you want you can have the data compressed automatically via:
gsutil cp -Z
Which is detailed at the following link:
https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories
And the nice thing is that you can retrieve uncompressed results from compressed data on Google Storage, because it has the ability to perform Decompressive Transcoding:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
You will notice on the last line in the following script:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
The following line will basically copy the current compressed file to Google Cloud Storage:
gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"
What you will need is to first perform the tar/zip on the files in the local scratch directory, and then gsutil copy the compressed file over to Google Storage, but make sure that all the files that need to be compressed are in the scratch directory before starting to compress them. Most likely you would need to SSH copy (scp) them to one of the nodes (i.e. master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but the "gsutil cp" transfer is very fast when working on GCE.
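The gather, archive, and upload sequence described above can be sketched as three commands driven from Python. This is a sketch under stated assumptions: the bucket URIs, the scratch directory, and the helper names `build_commands` and `run_all` are hypothetical, and it uses `gsutil cp` directly rather than the repository's own upload function.

```python
import subprocess


def build_commands(source_uri, scratch_dir, archive_path, dest_uri):
    """Return the argv lists for: download, tar+gzip, upload.

    archive_path should live outside scratch_dir so that tar does not
    try to include the archive in itself.
    """
    return [
        # Copy all result files into the local scratch directory.
        ["gsutil", "-m", "cp", "-r", source_uri, scratch_dir],
        # Archive and gzip everything that is in the scratch directory.
        ["tar", "czf", archive_path, "-C", scratch_dir, "."],
        # Push the single compressed archive to the destination bucket.
        ["gsutil", "cp", archive_path, dest_uri],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical locations for one order.
commands = build_commands(
    "gs://example-results/order-1234/*",
    "/scratch/order-1234",
    "/scratch/order-1234.tgz",
    "gs://example-deliveries/order-1234.tgz",
)
for argv in commands:
    print(" ".join(argv))
```

Calling `run_all(commands)` on a node with gsutil configured would execute the three steps. It is left out of the example so that the sketch can be inspected without touching any bucket.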
Since Google Storage is fast at data transfers with Google Compute instances, the easiest second option to pursue is to comment out lines 66-69 in the do_compress.sh file:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This way no compression happens, but the copy still happens on the last line via gcs_util::upload, so all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally, compress them via tar/gzip, and copy the compressed archive back to the bucket using "gsutil cp".
Hope it helps but it's tricky,
Paul

Version Control a file with per machine dependent stuff

I have a config file in my project that includes some machine-dependent information (db username, password, path). I understand that in this particular case I could require everybody to use the same username, db path, and password to keep this simple, but there must be another way to deal with this problem.
I use mercurial, if you care, but I am ok with just a theoretical answer if you are unfamiliar with hg specifics.
A common way to handle this is to put a config.example or similar under version control and force the user to copy it and make any necessary changes. That way the user can pull down the overall structure of the file from your repository without overwriting local changes.
Alternatively, you could make your config file provide only defaults, with the option to source a subset of variables from a higher-priority custom config file (in the same format) which the user may or may not provide.
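The defaults-plus-override approach can be sketched with Python's standard configparser module, whose `read()` accepts a list of files, skips any that do not exist, and lets later files override earlier ones. This is a minimal sketch: the file names, the `[db]` section, and the values are hypothetical, and the files are written to a temporary directory only to keep the example self-contained.

```python
import configparser
import os
import tempfile

workdir = tempfile.mkdtemp()
defaults_path = os.path.join(workdir, "config.defaults.ini")  # version controlled
local_path = os.path.join(workdir, "config.local.ini")  # ignored, per machine

with open(defaults_path, "w") as handle:
    handle.write("[db]\nuser = app\npath = /var/lib/app/db\n")
with open(local_path, "w") as handle:
    handle.write("[db]\nuser = alice\n")

config = configparser.ConfigParser()
# Files listed later take priority; a missing local file is silently skipped.
loaded = config.read([defaults_path, local_path])

print(config["db"]["user"])  # overridden by the local file
print(config["db"]["path"])  # falls back to the default
```

Only the defaults file would be committed. The local file would be listed in the ignore file, so each machine keeps its own overrides.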
You'll want to use the .hgignore file to not include the config file in the repository.
This will allow everyone to have their own version of the config file.
Basically, you just want to add the relative path to the config file and Mercurial commands will ignore it. So the file would look like this:
config/dbconfig.ext
Edit
I just realized you still want to be able to version control the config file (I misunderstood the question). So I suggest moving the machine-dependent parts of the config file into their own config file and then applying the fix above. That way, you can still have the regular config information under version control and keep the rest separate for each person's machine.
I have per machine databases for my PHP projects. What I do is check the hostname at runtime. If it is one host, I feed it certain credentials. If another, feed it different credentials.
On some systems I create a list of credentials and then just go down the line trying them until one of the connections works. If the list is exhausted, the connection cannot be made.
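The answer above describes this for PHP; the same two ideas can be sketched in Python. This is a hedged sketch: the host names, credentials, and the `connect` callable are hypothetical placeholders for whatever database driver is actually in use.

```python
import socket

# Hypothetical per-host credentials table.
CREDENTIALS_BY_HOST = {
    "dev-laptop": {"user": "dev", "password": "dev-secret"},
    "prod-server": {"user": "app", "password": "prod-secret"},
}


def credentials_for(hostname, table):
    """Look up credentials for a host name; return None if the host is unknown."""
    return table.get(hostname)


def first_working(candidates, connect):
    """Try each credential set in order and return the first successful connection."""
    for credentials in candidates:
        try:
            return connect(credentials)
        except ConnectionError:
            continue  # this set failed, try the next one
    raise ConnectionError("no credentials in the list worked")


def fake_connect(credentials):
    """Stand-in for a real driver call: only the 'app' user is accepted."""
    if credentials["user"] != "app":
        raise ConnectionError("rejected")
    return "connected as " + credentials["user"]


# At runtime the lookup key would be socket.gethostname().
current = credentials_for(socket.gethostname(), CREDENTIALS_BY_HOST)
connection = first_working(list(CREDENTIALS_BY_HOST.values()), fake_connect)
print(connection)
```

Either variant keeps credentials for every machine in the repository, which may or may not be acceptable for your project.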
I've never found a solid method for handling this type of configuration file. My final solution was to just maintain a version of each file and use symbolic links. That way each server has the same file path, but a different underlying file.
Without knowing exactly what is in your config file, I'm going to assume your file has some stuff that is machine-dependent (e.g., db password, paths) and other stuff that is not (db hostname, maybe some paths relative to a path that is configured on a per-machine basis, etc.)
If that's the case, what you want to do is re-factor your config file so that you have two config files---one for the common stuff, one for the machine-specific stuff. Check the common one in, and add the machine-specific configuration to the ignore file.