Downloading public data directory from google cloud storage with command line utilities like wget - google-cloud-storage

I would like to download publicly available data from google cloud storage. However, because I need to be in a Python3.x environment, it is not possible to use gsutil. I can download individual files with wget as
wget http://storage.googleapis.com/path-to-file/output_filename -O output_filename
However, commands like
wget -r --no-parent https://console.cloud.google.com/path_to_directory/output_directoryname -O output_directoryname
do not seem to work, as they just download an index file for the directory. My initial attempts with rsync and curl did not work either. Any idea how to download publicly available data on Google Cloud Storage as a directory?

The approach you mentioned above does not work because Google Cloud Storage doesn't have real "directories". As an example, "path/to/some/files/file.txt" is the entire name of that object. A similarly named object, "path/to/some/files/file2.txt", just happens to share the same naming prefix.
As for how you could fetch these files: The GCS APIs (both XML and JSON) allow you to do an object listing against the parent bucket, specifying a prefix; in this case, you'd want all objects starting with the prefix "path/to/some/files/". You could then make individual HTTP requests for each of the objects specified in the response body. That being said, you'd probably find this much easier to do via one of the GCS client libraries, such as the Python library.
Also, gsutil currently has a GitHub issue open to track adding support for Python 3.
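To make the "list by prefix, then fetch each object" idea concrete, here is a minimal sketch using only the Python 3 standard library and the JSON API. It assumes the bucket allows anonymous listing as well as anonymous reads; the helper names (listing_url, media_url, list_object_names, download_prefix) and the bucket name in the usage note are made up for illustration.

```python
import json
import os
import urllib.parse
import urllib.request

API_ROOT = "https://storage.googleapis.com/storage/v1"


def listing_url(bucket, prefix, page_token=None):
    """Build the JSON API URL that lists objects whose names start with prefix."""
    params = {"prefix": prefix}
    if page_token:
        params["pageToken"] = page_token
    return f"{API_ROOT}/b/{bucket}/o?{urllib.parse.urlencode(params)}"


def media_url(bucket, name):
    """Build the download URL for one object; slashes in the name are encoded."""
    return f"{API_ROOT}/b/{bucket}/o/{urllib.parse.quote(name, safe='')}?alt=media"


def list_object_names(bucket, prefix):
    """Yield every object name under the prefix, following result pages."""
    token = None
    while True:
        with urllib.request.urlopen(listing_url(bucket, prefix, token)) as resp:
            page = json.load(resp)
        for item in page.get("items", []):
            yield item["name"]
        token = page.get("nextPageToken")
        if not token:
            break


def download_prefix(bucket, prefix, dest_dir):
    """Download all objects under the prefix, recreating the path locally."""
    for name in list_object_names(bucket, prefix):
        if name.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        target = os.path.join(dest_dir, *name.split("/"))
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        urllib.request.urlretrieve(media_url(bucket, name), target)
```

Calling download_prefix("some-public-bucket", "path/to/some/files/", "output_directoryname") would then mirror that prefix into a local directory. If the bucket does not permit anonymous listing, the listing request fails even though individual objects can still be fetched.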

Related

Deserialize JSON metadata to .qvf using the Qlik Sense API

I am aware of the Qlik Sense "serialize app" operation, where we generate a JSON object containing the metadata of a .qvf file using the Qlik Sense API.
I want to do the reverse of this, i.e. generate a .qvf file back from the JSON metadata.
After a lot of research I only found this link on GitHub, and it does not have complete information.
Any solution would be helpful.
Technically, you can't create a qvf directly from JSON. You'll have to create an empty qvf and then use various APIs to import the JSON.
Qlik has a very nice tool for unbuilding/building apps (and more): qlik-cli has dedicated commands for unbuild/build.
If you are looking for something more "programmable", I've created an enigma.js mixin for the same purpose: enigma-mixin. I still need to perform more detailed testing there, but it was working OK in simpler tests.
Update 08/10/2021
Using qlik-cli
Set up a context, then unbuild an app:
qlik app unbuild --app 11111111-2222-3333-4444-555555555555
This will create a new folder in the current folder named <app_name>-unbuild. The folder will contain all the info about the app in JSON and/or YAML files.
Once these files are available, you can use them to build another app. Note that the target app must exist before the build is run:
qlik.exe app build --config ./config.yml --app 55555555-4444-3333-2222-111111111111
The above command will use all available files (specified in config.yml) and update the target app.
If you don't want all files to be used and only want to update the data connections, for example, the build command can be run with different arguments:
qlik.exe app build --connections ./connections.yml --app 55555555-4444-3333-2222-111111111111
This command will only update the data connections in the target app and will not update anything else.

Can I search Google Cloud Storage buckets recursively in the console?

We have uploaded files to Google Cloud Storage buckets and are planning to set up permissions so that a number of people can access them. So far we have only been able to filter/search files and folders in the directory we are currently in. Is it possible to search files recursively, though?
It seems what you are looking for is the following command for searching within a bucket recursively:
gsutil ls -r gs://bucket/**
Note: "bucket" is the name of the bucket you have set.
In the case you would like to search within a specific directory you can run the following:
gsutil ls -r gs://bucket/dir/**
Note: "dir" would be the directory in which you would like to search
You can find more information on this under "Directory By Directory, Flat, And Recursive" in the gsutil ls documentation.
Update
If this is not what you meant, then I would like to mention another option. You can also retrieve information about the contents of a bucket through an API. The objects list method retrieves a list of objects matching the criteria specified.
Note: In order for this API to work, the user must have "READER" permission or above.
Please let me know if this is what you were looking for.
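Because object names are flat strings, a "recursive search" amounts to filtering one listing of names. The sketch below illustrates this with the standard fnmatch module; it assumes you already have the object names (for example from gsutil ls or the list API), and the function name and sample data are hypothetical. Note that fnmatch's * also matches "/", so it behaves like the ** wildcard in the gsutil commands above.

```python
import fnmatch


def search_objects(names, pattern, prefix=""):
    """Return the names under prefix whose full name matches the glob pattern."""
    return [
        name
        for name in names
        if name.startswith(prefix) and fnmatch.fnmatchcase(name, pattern)
    ]


# Hypothetical listing of a bucket's object names.
SAMPLE_NAMES = [
    "dir/a.csv",
    "dir/sub/b.csv",
    "dir/sub/c.txt",
    "other/d.csv",
]
```

For example, search_objects(SAMPLE_NAMES, "*.csv", prefix="dir/") picks up both dir/a.csv and dir/sub/b.csv, regardless of how deeply nested they appear to be.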

How to download multiple objects from IBM Cloud Object Storage?

I am trying to use IBM Cloud Object Storage to store images uploaded to my site by users. I have this functionality working just fine.
However, based on the documentation here (link) it appears as though only one object can be downloaded from a bucket at a time.
Is there any way a list of objects could all be downloaded from the bucket? Is there a different approach to requesting multiple objects from a COS bucket?
Via the REST API, no, you can only download a single object at a time. But most tools (like the AWS CLI, or Minio Client) allow downloading all objects that share a prefix (eg foo/bar and foo/bas). The IBM forks of the S3 libraries also are now integrated with Aspera, and can transfer large directories all at once. What are you trying to do?
According to S3 spec (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html), you can only download one object at a time.
There are various tools that can help download multiple objects at a time from COS. I used the AWS CLI to download objects from, and upload objects to, COS.
So install the aws-cli tool and configure it by supplying your access_key_id and secret_access_key.
Recursively copying S3 objects to a local directory: When passed with the parameter --recursive, the following cp command recursively copies all objects under a specified prefix and bucket to a specified directory.
C:\Users\Shashank>aws s3 cp s3://yourBucketName . --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp s3://yourBucketName D:\s3\ --recursive
In my case the endpoint is in the us-east region, and I am copying objects into the D:\s3 directory.
Recursively copying local files to S3: When passed with the parameter --recursive, the following cp command recursively copies all files under a specified directory to a specified bucket.
C:\Users\Shashank>aws s3 cp myDir s3://yourBucketName/ --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp D:\s3 s3://yourBucketName/ --recursive
Here I am copying objects from the D:\s3 directory to COS.
For more reference, you can see the link here.
I hope it works for you.
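If you would rather do the same thing from code, the following is a rough sketch of what a recursive copy does: list the keys under a prefix, then download them one by one. It assumes a boto3-style S3 client (boto3 or IBM's fork of it) configured with HMAC credentials and the COS endpoint; the function names and the "downloads" directory are made up for illustration.

```python
import os


def iter_keys(client, bucket, prefix=""):
    """Yield every object key under prefix, page by page."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):  # skip "folder" placeholders
                yield obj["Key"]


def local_path(dest_dir, key):
    """Map an object key such as foo/bar to a path below dest_dir."""
    return os.path.join(dest_dir, *key.split("/"))


def download_prefix(client, bucket, prefix, dest_dir):
    """Download all objects under prefix; return the local paths written."""
    saved = []
    for key in iter_keys(client, bucket, prefix):
        target = local_path(dest_dir, key)
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        client.download_file(bucket, key, target)
        saved.append(target)
    return saved


# Usage sketch (assumes boto3 is installed and the credentials are valid):
#   import boto3
#   cos = boto3.client(
#       "s3",
#       endpoint_url="http://s3.us-east.cloud-object-storage.appdomain.cloud",
#       aws_access_key_id="<access_key_id>",
#       aws_secret_access_key="<secret_access_key>",
#   )
#   download_prefix(cos, "yourBucketName", "", "downloads")
```

Each object is still fetched with its own request, which is consistent with the one-object-per-GET limitation described above; the tools simply automate the loop.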

How to compress a list of files into a single gzip file using elasticluster, grid-engine-tools, and google cloud

I want to start by thanking you all for your help ahead of time, as this will help clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip; however, the guide only shows how to compress a list of files into individual gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If there is some extra info, please include links to sources.)
After I had set up the grid engine, I ran through the samples in the guide.
Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?
Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?
What changes can be made to the grid-engine-tools to make it work?
EDIT
The reason we are considering a cluster is that we expect multiple operations to occur simultaneously: the files for each order are zipped up systematically, so that a vendor can download a single compressed file per order.
May I state the definition of the problem and you can let me know if I understood it correctly, as both Matt and I provided the exact same solution and somehow it doesn't seem sufficient.
Problem Definition
You have an Order defining the start of a task to process some data.
The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
The goal is:
Collect the files from GS bucket (that were produced by each of the nodes),
Archive the collection of files as one file,
Then compress that archive, and
Push it back to a different GS location.
Let me know if I summarized it properly,
Thanks,
Paul
Are the files in question in Cloud Storage?
Are the files in question on a local or network drive?
In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.
The tar utility creates an archive file, and it can compress it as well. For example:
$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt
$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt
$ # Lis(t) the archive contents to verify it
$ tar tvfz archive.tgz
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file2.txt
To extract the contents use:
$ # E(x)tract the archive contents
$ tar xvfz archive.tgz
x myfiles/file1.txt
x myfiles/file2.txt
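The same archive can be produced from Python with the standard tarfile module, which may be handy if the step has to run inside a script. This is only a sketch mirroring the shell session above; the function names and the demo helper are made up for illustration.

```python
import os
import tarfile
import tempfile


def make_tgz(archive_path, paths, base_dir=None):
    """Pack paths into one gzip-compressed tar archive, like tar cvfz."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            # Store names relative to base_dir so the archive has no temp paths.
            arcname = os.path.relpath(path, base_dir) if base_dir else path
            tar.add(path, arcname=arcname)
    return archive_path


def list_tgz(archive_path):
    """Return the member names of the archive, like tar tvfz."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def demo():
    """Create two small files, archive them together, and list the result."""
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "myfiles")
        os.mkdir(src)
        for stem in ("file1", "file2"):
            with open(os.path.join(src, stem + ".txt"), "w") as fh:
                fh.write(f"This is {stem}\n")
        files = sorted(os.path.join(src, n) for n in os.listdir(src))
        archive = make_tgz(os.path.join(work, "archive.tgz"), files, base_dir=work)
        return list_tgz(archive)
```

Calling demo() returns the two member names, myfiles/file1.txt and myfiles/file2.txt, stored in a single compressed file.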
UPDATE:
In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If the frequency with which results need to be tar-ed is low, and providing the tar-ed results is not extremely time-sensitive, then you could likely do this with a single node.
However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.
Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.
A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
There is an example that does something similar here:
https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress
This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.
-Matt
So there are many ways to do it, but the thing is that you cannot compress a collection of files (or a directory) into one file directly on Google Storage; you would need to perform the tar/gzip combination locally before transferring it.
If you want, you can have each file gzip-compressed automatically on upload via:
gsutil cp -Z
Which is detailed at the following link:
https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories
And the nice thing is that you retrieve uncompressed results from compressed data on Google Storage, because it has the ability to perform Decompressive Transcoding:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
You will notice on the last line in the following script:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
The following line will basically copy the current compressed file to Google Cloud Storage:
gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"
What you will need to do is first tar/gzip the files in the local scratch directory and then copy the compressed file over to Google Storage with gsutil, making sure that all the files that need to be compressed are in the scratch directory before you start compressing. Most likely you would need to copy them over SSH (scp) to one of the nodes (i.e. the master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but the "gsutil cp" transfer is very fast when working on GCE.
Since data transfers between Google Storage and Google Compute instances are fast, the easiest alternative is to comment out lines 66-69 in the do_compress.sh file:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This way no compression happens, but the copy on the last line via gcs_util::upload still runs, so all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally, compress them via tar/gzip, and copy the compressed archive back to the bucket using "gsutil cp".
Hope it helps but it's tricky,
Paul

gsutil acl set command AccessDeniedException: 403 Forbidden

I am following the steps of setting up Django on Google App Engine, and since Gunicorn does not serve static files, I have to store my static files to Google Cloud Storage.
I am at the line with "Create a Cloud Storage bucket and make it publically readable." on https://cloud.google.com/python/django/flexible-environment#run_the_app_on_your_local_computer. I ran the following commands as suggested:
$ gsutil mb gs://your-gcs-bucket
$ gsutil defacl set public-read gs://your-gcs-bucket
The first command is supposed to create a new storage bucket, and the second line sets its default ACL. When I type in the command, the second line returns an error.
Setting default object ACL on gs://your-gcs-bucket/...
AccessDeniedException: 403 Forbidden
I also tried other commands for setting or getting the ACL, but they all return the same error, with no additional information.
I am a newbie with Google Cloud services. Could anyone point out what the problem is?
I figured it out myself, and it is kind of silly. I didn't notice whether the first command had succeeded, and apparently it had not.
For a newbie like me, it is important to note that things like bucket names and project names are global across their namespace, i.e. shared by all users. What happened was that the name I used to create the new bucket was already taken by someone else, so it is no wonder that I did not have permission to access that bucket.
A better way to work with this is to choose the bucket name carefully, for example by prefixing it with the project name and application name.
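As an illustration of that naming advice, here is a small sketch that builds a bucket name from a project name and an application name and checks it against a subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes and underscores; starting and ending with a letter or digit; not starting with "goog"). The function name and the "static" suffix are made up, and a well-formed name can of course still be taken by someone else.

```python
import re

# Subset of the bucket naming rules; names containing dots are not handled.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(project, app, suffix="static"):
    """Join project, app and suffix into a candidate bucket name."""
    parts = []
    for part in (project, app, suffix):
        # Lowercase and replace runs of disallowed characters with a dash.
        cleaned = re.sub(r"[^a-z0-9_-]+", "-", part.lower()).strip("-_")
        if cleaned:
            parts.append(cleaned)
    name = "-".join(parts)
    if not _BUCKET_RE.match(name) or name.startswith("goog"):
        raise ValueError(f"not a usable bucket name: {name!r}")
    return name
```

For example, make_bucket_name("my-project", "Django App") gives my-project-django-app-static, which is far less likely to collide with an existing bucket than a generic name. It is still worth checking that the gsutil mb command actually succeeded before running anything else.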