How to compress a list of files into a single gzip file using Elasticluster, grid-computing-tools, and Google Cloud Storage

I want to start by thanking you all ahead of time for your help, as this will clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip file; however, the guide only shows how to compress a list of files into individually gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If you have extra information, please include links to sources.)
After I had set up the grid engine, I ran through the samples in the guide.
Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?
Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?
What changes can be made to grid-computing-tools to make it work?
EDIT
The reason we are considering a cluster is that we expect multiple operations to occur simultaneously: files are zipped up per order, on a regular basis, so that a vendor can download a single compressed file per order.

May I state the definition of the problem and you can let me know if I understood it correctly, as both Matt and I provided the exact same solution and somehow it doesn't seem sufficient.
Problem Definition
You have an Order defining the start of a task to process some data.
The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
The goal is:
Collect the files from GS bucket (that were produced by each of the nodes),
Archive the collection of files as one file,
Then compress that archive, and
Push it back to a different GS location.
Let me know if I summarized it properly,
Thanks,
Paul

Are the files in question in Cloud Storage?
Are the files in question on a local or network drive?
In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.
The tar utility will create an archive file, and it can compress it as well. For example:
$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt
$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt
$ # Lis(t) the archive contents to verify it
$ tar tvfz archive.tgz
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file2.txt
To extract the contents use:
$ # E(x)tract the archive contents
$ tar xvfz archive.tgz
x myfiles/file1.txt
x myfiles/file2.txt
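For scripted use, the same kind of archive can be built from an explicit list of files with Python's standard tarfile module. This is only a minimal sketch: the function names make_tgz and list_tgz and the file names are illustrative and are not part of grid-computing-tools.

```python
import os
import tarfile
import tempfile


def make_tgz(paths, archive_path):
    """Pack the given files into one gzip-compressed tar archive."""
    # Mode "w:gz" writes a tar stream and gzips it, like `tar cfz`.
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            # Store only the file name so the archive holds no absolute paths.
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


def list_tgz(archive_path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    # Build two small input files in a temporary directory and archive them.
    workdir = tempfile.mkdtemp()
    inputs = []
    for name in ("file1.txt", "file2.txt"):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write("This is " + name + "\n")
        inputs.append(path)
    archive = make_tgz(inputs, os.path.join(workdir, "archive.tgz"))
    print(list_tgz(archive))
```

Because only the base name is stored, two inputs with the same file name in different directories would collide; keep a relative path in arcname if that can happen.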
UPDATE:
In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If the frequency with which results need to be tar-ed is low, and providing the tar-ed results is not extremely time-sensitive, then you could likely do this with a single node.
However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.
Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.
A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
There is an example that does something similar here:
https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress
This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.
-Matt

There are many ways to do it, but you cannot directly compress a collection of files (or a directory) into one file on Google Storage itself; you need to perform the tar/gzip combination locally before transferring the result.
If you want, you can have each file compressed individually and automatically during upload via:
gsutil cp -Z
Which is detailed at the following link:
https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories
And the nice thing is that you can retrieve uncompressed results from compressed data on Google Storage, because it has the ability to perform Decompressive Transcoding:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
You will notice on the last line in the following script:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
The following line will basically copy the current compressed file to Google Cloud Storage:
gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"
What you will need to do is first perform the tar/gzip on the files in the local scratch directory, and then copy the compressed file over to Google Storage with gsutil; make sure that all the files that need to be compressed are in the scratch directory before you start compressing them. Most likely you would need to copy them over SSH (scp) to one of the nodes (i.e. the master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but the "gsutil cp" transfer is very fast when working on GCE.
Since Google Storage is fast at data transfers with Google Compute instances, the easiest second option to pursue is to comment out lines 66-69 in the do_compress.sh file:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This way no compression happens, but the copy still happens on the last line via gcs_util::upload, so that all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally, compress them locally via tar/gzip, and copy the resulting compressed archive back to the bucket using "gsutil cp".
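The download, tar/gzip, and upload steps described above can be sketched as a short Python driver around gsutil and tar. This is a rough sketch under assumptions: the function names, the bucket URLs, the "files" subdirectory, and the default archive name order.tgz are all hypothetical, and it assumes gsutil and tar are on the PATH of the node that runs it.

```python
import os
import subprocess


def bundle_commands(src_url, dest_url, workdir, archive_name="order.tgz"):
    """Return the three commands to run: download, tar+gzip, upload."""
    download_dir = os.path.join(workdir, "files")
    archive = os.path.join(workdir, archive_name)
    return [
        # -m copies in parallel; gsutil expands the trailing /* itself.
        ["gsutil", "-m", "cp", src_url.rstrip("/") + "/*", download_dir + "/"],
        # -C makes the archive paths relative to the download directory.
        ["tar", "czf", archive, "-C", download_dir, "."],
        # Upload the single compressed archive to its destination.
        ["gsutil", "cp", archive, dest_url],
    ]


def run_bundle(src_url, dest_url, workdir):
    """Run the three steps in order, stopping at the first failure."""
    os.makedirs(os.path.join(workdir, "files"), exist_ok=True)
    for command in bundle_commands(src_url, dest_url, workdir):
        subprocess.check_call(command)
```

For one order this might be called as run_bundle("gs://my-results/order-1234", "gs://my-archives/order-1234.tgz", "/scratch/order-1234"), where both bucket names are made up for the example.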
Hope it helps but it's tricky,
Paul

How to make Snakemake recognize Globus remote files using Globus CLI?

I am working in a high performance computing grid environment, where large-scale data transfers are done via Globus. I would like to use Snakemake to pull data from a Globus path, process the data, and then push the processed data to a different Globus path. Globus has a command-line interface.
Pulling the data is no problem, for I'd just create a rule that would run globus transfer to create the requisite local file. But for pushing the data back to Globus, I think I'll need a rule that can "see" that the file is missing at the remote location, and then work backwards to determine what needs to happen to create the file.
I could create local "proxy" files that represent the remote files. For example I could make a rule for creating 'processed_data_1234.tar.gz' output files in a directory. These files would just be created using touch (thus empty), and the same rule will run globus transfer to push the files remotely. But then there's the overhead of making sure that the proxy files don't get out of sync with the real Globus-hosted files.
Is there a more elegant way to do this, akin to the Remote File capability? Would it be difficult to add Globus CLI support to Snakemake? Thanks in advance for any advice!
Would it help to create a utility function that would generate a list of all desired files and compare it against the list of files available on globus? Something like this (pseudocode):
def return_needed_files():
    # Files that should exist on Globus: either hard-coded or built with some logic
    list_needed_files = []
    # Files already present on Globus, e.g. obtained using globus ls
    list_available = []
    return [i for i in list_needed_files if i not in list_available]

# include all the needed files in the all rule
rule all:
    input: return_needed_files()
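The comparison at the heart of that pseudocode can be sketched in plain Python. The remote listing is passed in as a list so that the sketch does not assume any particular Globus CLI output format; how that list is obtained (for example by parsing globus ls) is left open, and the function name and file names are hypothetical.

```python
def missing_remote_files(needed, available):
    """Return the needed files that are absent from the remote listing."""
    # A set makes each membership test cheap for large listings.
    available_set = set(available)
    # Keep the order of `needed` so the result is deterministic.
    return [name for name in needed if name not in available_set]


if __name__ == "__main__":
    needed = ["processed_data_1234.tar.gz", "processed_data_5678.tar.gz"]
    available = ["processed_data_1234.tar.gz"]
    print(missing_remote_files(needed, available))
```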

Downloading public data directory from google cloud storage with command line utilities like wget

I would like to download publicly available data from google cloud storage. However, because I need to be in a Python3.x environment, it is not possible to use gsutil. I can download individual files with wget as
wget http://storage.googleapis.com/path-to-file/output_filename -O output_filename
However, commands like
wget -r --no-parent https://console.cloud.google.com/path_to_directory/output_directoryname -O output_directoryname
do not seem to work, as they just download an index file for the directory. My initial attempts with rsync and curl did not work either. Any idea how to download publicly available data on Google Cloud Storage as a directory?
The approach you mentioned above does not work because Google Cloud Storage doesn't have real "directories". As an example, "path/to/some/files/file.txt" is the entire name of that object. A similarly named object, "path/to/some/files/file2.txt", just happens to share the same naming prefix.
As for how you could fetch these files: The GCS APIs (both XML and JSON) allow you to do an object listing against the parent bucket, specifying a prefix; in this case, you'd want all objects starting with the prefix "path/to/some/files/". You could then make individual HTTP requests for each of the objects specified in the response body. That being said, you'd probably find this much easier to do via one of the GCS client libraries, such as the Python library.
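A minimal sketch of that listing-then-download approach, using only the Python 3 standard library and the JSON API's object-listing endpoint, might look as follows. It assumes the bucket allows anonymous listing and reading; the function names and the destination layout are illustrative, and nothing here has been run against a real bucket.

```python
import json
import os
import urllib.parse
import urllib.request

API = "https://storage.googleapis.com/storage/v1/b/{bucket}/o"


def listing_url(bucket, prefix, page_token=None):
    """Build the JSON API URL that lists objects under a prefix."""
    params = {"prefix": prefix}
    if page_token:
        params["pageToken"] = page_token
    return API.format(bucket=bucket) + "?" + urllib.parse.urlencode(params)


def object_names(listing):
    """Extract object names from one parsed listing response."""
    return [item["name"] for item in listing.get("items", [])]


def download_url(bucket, name):
    """Public download URL for one object (slashes in the name are kept)."""
    return "https://storage.googleapis.com/{}/{}".format(
        bucket, urllib.parse.quote(name))


def download_prefix(bucket, prefix, dest):
    """Download every object under the prefix into dest, following pages."""
    token = None
    while True:
        with urllib.request.urlopen(listing_url(bucket, prefix, token)) as resp:
            listing = json.load(resp)
        for name in object_names(listing):
            if name.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            target = os.path.join(dest, name)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            urllib.request.urlretrieve(download_url(bucket, name), target)
        token = listing.get("nextPageToken")
        if not token:
            break
```

The object names are reused as relative paths under dest, so the local tree mirrors the shared prefix even though the bucket itself has no real directories.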
Also, gsutil currently has a GitHub issue open to track adding support for Python 3.

How to download multiple objects from IBM Cloud Object Storage?

I am trying to use IBM Cloud Object Storage to store images uploaded to my site by users. I have this functionality working just fine.
However, based on the documentation here (link) it appears as though only one object can be downloaded from a bucket at a time.
Is there any way a list of objects could all be downloaded from the bucket? Is there a different approach to requesting multiple objects from a COS bucket?
Via the REST API, no, you can only download a single object at a time. But most tools (like the AWS CLI or the Minio Client) allow downloading all objects that share a prefix (e.g. foo/bar and foo/bas). The IBM forks of the S3 libraries are also now integrated with Aspera, and can transfer large directories all at once. What are you trying to do?
According to S3 spec (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html), you can only download one object at a time.
There are various tools which may help to download multiple objects at a time from COS. I used AWS CLI tool to download and upload the objects from/to COS.
So install aws-cli tool and configure it by supplying access_key_id and secret_access_key here.
Recursively copying S3 objects to a local directory: When passed with the parameter --recursive, the following cp command recursively copies all objects under a specified prefix and bucket to a specified directory.
C:\Users\Shashank>aws s3 cp s3://yourBucketName . --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp s3://yourBucketName D:\s3\ --recursive
In my case the endpoint is in the us-east region, and I am copying objects into the D:\s3 directory.
Recursively copying local files to S3: When passed with the parameter --recursive, the following cp command recursively copies all files under a specified directory to a specified bucket.
C:\Users\Shashank>aws s3 cp myDir s3://yourBucketName/ --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp D:\s3 s3://yourBucketName/ --recursive
Here I am copying objects from the D:\s3 directory to COS.
For more reference, you can see the link here.
I hope it works for you.

~/.felix folder contains massive number of files

On one of the accounts we use on a cluster there is a hidden folder in the home directory:
/home/user/.felix/
This contains a huge number of directories:
[user@gateway .felix]$ ls | head -10
osgi-cache1050e0f4_15774cb91f4_-7ffe
osgi-cache-1063880a_15289337854_-7ffe
osgi-cache-10716929_155ac249b99_-7ffe
osgi-cache-1076af32_1567b76f77c_-7ffe
osgi-cache10fdd858_15288297a76_-7ffe
osgi-cache1145761a_1567b157a97_-7ffe
osgi-cache-1158de5c_15775794758_-7ffe
osgi-cache-117b5c79_1577655ca87_-7ffe
osgi-cache-1188faa3_154532959fc_-7fff
osgi-cache11bf2822_1528906f443_-7ffe
In each of these folders:
osgi-cache-37166e7_1545cb3b7e0_-7ffe/bundle10
[user@gateway bundle4]$ cat bundle.location
reference:file:/gpfs22/local/centos6/matlab/2013a/java/jar/toolbox/bioinfo.jar
So I'm thinking these files are created by matlab somehow.
This .felix folder contains about 150k files, which is causing us to go over our quota of 300k files. Is there a way to:
disable the creation of these files
clean them up in a safe way (maybe a cron job)
possibly move the location where these files are created?
Technically it's the Apache Felix bundle cache (http://felix.apache.org/documentation/subprojects/apache-felix-framework/apache-felix-framework-usage-documentation.html), and I'm afraid there's no safe way to remove any of these without contacting the user (even when migrating the path).
I noticed that Matlab is creating about 7k files in /tmp/.felix. The space usage is pretty minimal (184k). I was able to delete them by:
find /tmp/.felix -user <my username> -exec rm -r {} \;
But when I run my Matlab code it recreates many (all?) of the files. So at least in the Matlab usage case it seems relatively safe to delete them, but I could imagine there being problems if this info is actively being updated.
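If a periodic cleanup (for example from cron) is wanted, one cautious variant of the find command above is to remove only cache directories that have not been modified for several days. The sketch below rests on an assumption the thread does not confirm, namely that an old modification time means no running session still uses the directory; the function names and the age threshold are hypothetical.

```python
import os
import shutil
import time


def stale_cache_dirs(root, max_age_days, now=None):
    """List osgi-cache* directories under root older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue  # ignore plain files and symlinks
        if not entry.name.startswith("osgi-cache"):
            continue  # only touch Felix bundle cache directories
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)


def remove_stale_cache_dirs(root, max_age_days, dry_run=True):
    """Delete stale cache directories; with dry_run only report them."""
    stale = stale_cache_dirs(root, max_age_days)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Note that a directory's modification time only changes when entries are added or removed directly inside it, so this is a coarse heuristic; running with dry_run=True first shows what would be deleted.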
Digging into the Felix docs a bit (mentioned in the other answer), I googled "Felix bundle cache" and found that it is used to store pointers to Java jar files, and perhaps state as well. There are indeed parameters that you can configure to control the location and flushing of this cache: configuring Felix bundle cache
Mathworks also has Matlab specific suggestions. In the case mentioned there, this seemed to be triggered by plotting. Names in the stack trace there suggest it may have to do with implementation of key bindings (keyboard shortcuts).
Rob

gsutil make bucket command [gsutil mb] is not working

I am trying to create a bucket using gsutil mb command:
gsutil mb -c DRA -l US-CENTRAL1 gs://some-bucket-to-my-gs
But I am getting this error message:
Creating gs://some-bucket-to-my-gs/...
BadRequestException: 400 Invalid argument.
I am following the documentation from here
What is the reason for this type of error?
I got the same error. It was because I used the wrong location.
The location parameter expects a region, without a zone suffix.
E.g.
gsutil mb -p ${TF_ADMIN} -l europe-west1-b gs://${TF_ADMIN}
should have been
gsutil mb -p ${TF_ADMIN} -l europe-west1 gs://${TF_ADMIN}
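To guard against this mistake in scripts, a zone name can be reduced to its region before it is passed to -l. The following is a small sketch that assumes zone names have the form of a region followed by a one-letter suffix (as in europe-west1-b); the function name is hypothetical and the pattern is not an official validation rule.

```python
import re

# Assumed shape of a zone: a region name plus a one-letter suffix,
# e.g. region "europe-west1" and zone "europe-west1-b".
_ZONE_RE = re.compile(r"^(?P<region>[a-z]+-[a-z]+\d+)-[a-z]$")


def bucket_location(location):
    """Return the region for a zone name; pass other locations through."""
    match = _ZONE_RE.match(location.lower())
    # Regions and multi-regions (e.g. "EU") do not match and are unchanged.
    return match.group("region") if match else location
```

With this helper, bucket_location("europe-west1-b") gives "europe-west1", while values such as "europe-west1" or "EU" are returned as they were.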
One reason this error can occur (confirmed in chat with the question author) is that you have an invalid default_project_id configured in your .boto file. Ensure that the ID matches your project ID in the Google Developers Console.
If you can make a bucket successfully using the Google Developers Console, but not using "gsutil mb", this is a good thing to check.
I was receiving the same error for the same command while using gsutil as well as the web console. Interestingly enough, changing my bucket name from "google-gatk-test" to "gatk" allowed the request to go through. The original name does not appear to violate bucket naming conventions.
Playing with the bucket name is worth trying if anyone else is running into this issue.
Got this error, and adding the default_project_id to the .boto file didn't work.
It took me some time, but in the end I deleted the credentials file from the "Global Config" directory and recreated the account.
Using it on Windows, by the way.
This can happen if you are logged into the management console (storage browser), possibly a locking/contention issue.
May be an issue if you add and remove buckets in batch scripts.
In particular this was happening to me when creating regionally diverse (non DRA) buckets :
gsutil mb -l EU gs://somebucket
Also watch underscores, the abstraction scheme seems to use them to map folders. All objects in the same project are stored at the same level (possibly as blobs in an abstracted database structure).
You can see this when downloading from the browser interface (at the moment anyway).
An object copied to gs://somebucket/home/crap.txt might be downloaded via a browser (or curl) as home_crap.txt. As an aside (red herring), somefile.tar.gz can come down as somefile.tar.gz.tar, so a little bit of renaming may be required due to the vagaries of the headers returned from the browser interface anyway. Min real support level is still $150/mth.
I had this same issue when I created my bucket using the following commands
MY_BUCKET_NAME_1=quiceicklabs928322j22df
MY_BUCKET_NAME_2=MY_BUCKET_NAME_1
MY_REGION=us-central1
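But when I added the dollar sign $ to reference the variable, as in MY_BUCKET_NAME_2=$MY_BUCKET_NAME_1, the error cleared and I was able to create the bucket. Without the $, MY_BUCKET_NAME_2 holds the literal text MY_BUCKET_NAME_1, which is not a valid bucket name because it contains capital letters.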
I got this error when I had a capital letter in the bucket name
$gsutil mb gs://CLIbucket-anu-100000
Creating gs://CLIbucket-anu-100000/...
BadRequestException: 400 Invalid bucket name: 'CLIbucket-anu-100000'
$gsutil mb -l ASIA-SOUTH1 -p single-archive-352211 gs://clibucket-anu-100
Creating gs://clibucket-anu-100/..
$
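Several of the answers above come down to the bucket name itself, so a quick local check before calling gsutil mb can save a round trip. The sketch below covers only a subset of the documented naming rules (lowercase letters, digits, dashes, underscores and dots; a letter or digit at each end; 3 to 63 characters; no "goog" prefix and no "google" in the name). The function name is hypothetical, and longer dotted names and IP-address-like names are not handled. The last rule may also explain why "google-gatk-test" was rejected in an earlier answer.

```python
import re

# Allowed characters, with a letter or digit at both ends; 3 to 63 characters.
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def bucket_name_problems(name):
    """Return a list of reasons the name is likely to be rejected."""
    problems = []
    lowered = name.lower()
    if name != lowered:
        problems.append("contains uppercase letters")
    # Check length and characters on the lowercased form so that an
    # uppercase name is reported once, not twice.
    if not _NAME_RE.match(lowered):
        problems.append("bad length or characters")
    if lowered.startswith("goog") or "google" in lowered:
        problems.append('starts with "goog" or contains "google"')
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the name is available or accepted.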