Efficient way upload multiple files to separate locations on google cloud storage - google-cloud-storage

Here is my case, I'd like to copy multiple files to separate locations on google cloud storage, e.g:
gsutil -m cp /local/path/to/d1/x.csv.gz gs://bucketx/d1/x.csv.gz
gsutil -m cp /local/path/to/d2/x.csv.gz gs://bucketx/d2/x.csv.gz
gsutil -m cp /local/path/to/d3/x.csv.gz gs://bucketx/d3/x.csv.gz
...
I have over 10k of such files, and executing them by separate calls of gsutil seems to be really slow, and lots of time is wasted on setting up network connection. What's the most efficient way to do that, please?

If your paths are consistently of the nature in your example, you could do this with one gsutil command:
gsutil -m cp -r /local/path/to/* gs://bucketx
However, this only works if you want the destination naming to mirror the source. If your paths are arbitrary mappings of source name to destination name, you'll need to run individual commands (which as you note can be sped up with parallel).

Related

Gsutil rm does not remove everything

I have a problem with one of my automated jobs.
Before launching a cloud dataflow job, I perform a gsutil rm on previous files but it appears that it does not remove everything because when I launch another dataflow job some older shards remain.
I tried :
gsutil -m rm gs://mybucket/blahblah/*
and
gsutil rm -r gs://mybucket/blablah
But same result...
Strange thing is that not removed files are nor the first nor the last.
I tought it was my second job fault but the fact is that I saw in logs that indeed files were not removed bu gsutil.
Is there possibility that there is too many files to delete ?
Is there known problems of gsutil rm reliability ?
I use version 0.9.80 of google cloud sdk
Thanks
The gsutil rm commands you're using depend on listing the objects in a bucket, which is an eventually consistent operation in Google Cloud Storage. Thus, it's possible that attempting these commands in a bucket soon after objects were written will not remove all the objects. If you try again later it should succeed.
One way to avoid this problem would be to keep track of the names of the objects you uploaded, and explicitly list those objects in the gsutil rm command. For example, if you kept the object list in the file objects.manifest you could run a command like this on Linux or MacOS:
xargs gsutil -m rm < objects.manifest

How to download multiple files in Google Cloud Storage

Scenario: there are multiple folders and many files stored in storage bucket that is accessible by project team members. Instead of downloading individual files one at a time (which is very slow and time consuming), is there a way to download entire folders? Or at least multiple files at once? Is this possible without having to use one of the command consoles? Some of the team members are not tech savvy and need to access these files as simple as possible. Thank you for any help!
I would suggest downloading the files with gsutil. However if you have a large number of files to transfer you might want to use the gsutil -m option, to perform a parallel (multi-threaded/multi-processing) copy:
gsutil -m cp -R gs://your-bucket .
The time reduction for downloading the files can be quite significant. See this Cloud Storage documentation for complete information on the GCS cp command.
If you want to copy into a particular directory, note that the directory must exist first, as gsutils won't create it automatically. (e.g: mkdir my-bucket-local-copy && gsutil -m cp -r gs://your-bucket my-bucket-local-copy)
I recommend they use gsutil. GCS's API deals with only one object at a time. However, its command-line utility, gsutil, is more than happy to download a bunch of objects in parallel, though. Downloading an entire GCS "folder" with gsutil is pretty simple:
$> gsutil cp -r gs://my-bucket/remoteDirectory localDirectory
To download files to local machine need to:
install gsutil to local machine
run Google Cloud SDK Shell
run the command like this (example, for Windows-platform):
gsutil -m cp -r gs://source_folder_path "%userprofile%/Downloads"
gsutil rsync -d -r gs://bucketName .
works for me

gsutil rsync with gzip compression

I'm hosting publicly available static resources in a google storage bucket, and I want to use the gsutil rsync command to sync our local version to the bucket, saving bandwidth and time. Part of our build process is to pre-gzip these resources, but gsutil rsync has no way to set the Content-Encoding header. This means we must run gsutil rsync, then immediately run gsutil setmeta to set headers on all the of gzipped file types. This leaves the bucket in a BAD state until that header is set. Another option is to use gsutil cp, passing the -z option, but this requires us to re-upload the entire directory structure every time, and this includes a LOT of image files and other non-gzipped resources that wastes time and bandwidth.
Is there an atomic way to accomplish the rsync and set proper Content-Encoding headers?
Assuming you're starting with gzipped source files in source-dir you can do:
gsutil -h content-encoding:gzip rsync -r source-dir gs://your-bucket
Note: If you do this and then run rsync in the reverse direction it will decompress and copy all the objects back down:
gsutil rsync -r gs://your-bucket source-dir
which may not be what you want to happen. Basically, the safest way to use rsync is to simply synchronize objects as-is between source and destination, and not try to set content encodings on the objects.
I'm not completely answering the question but I came here as I was wondering the same thing trying to achieve the following:
how to deploy efficiently a static website to google cloud storage
I was able to find an optimized way for deploying my static web site from a local folder to a gs bucket
Split my local folder into 2 folders with the same hierarchy, one containing the content to be gzip (html,css,js...), the other the other files
Gzip each file in my gzip folder (in place)
Call gsutil rsync in for each folder to the same gs destination
Of course, it is only a one way synchronization and deleted local files are not deleted remotely
For the gzip folder the command is
gsutil -m -h Content-Encoding:gzip rsync -c -r src/gzip gs://dst
forcing the content encoding to be gzippped
For the other folder the command is
gsutil -m rsync -c -r src/none gs://dst
the -m option is used for parallel optimization. The -c option is needed to force using checksum validation (Why is gsutil rsync re-downloading all our files?) as I was touching each local file in my build process. the -r option is used for recursivity.
I even wrote a script for it (in dart): http://tekhoow.blogspot.fr/2016/10/deploying-static-website-efficiently-on.html

Fast way of deleting non-empty Google bucket?

Is this my only option or is there a faster way?
# Delete contents in bucket (takes a long time on large bucket)
gsutil -m rm -r gs://my-bucket/*
# Remove bucket
gsutil rb gs://my-bucket/
Buckets are required to be empty before they're deleted. So before you can delete a bucket, you have to delete all of the objects it contains.
You can do this with gsutil rm -r (documentation). Just don't pass the * wildcard and it will delete the bucket itself after it has deleted all of the objects.
gsutil -m rm -r gs://my-bucket
Google Cloud Storage bucket deletes can't succeed until the bucket listing returns 0 objects. If objects remain, you can get a Bucket Not Empty error (or in the UI's case 'Bucket Not Ready') when trying to delete the bucket.
gsutil has built-in retry logic to delete both buckets and objects.
Another option is to enable Lifecycle Management on the bucket. You could specify an Age of 0 days and then wait a couple days. All of your objects should be deleted.
Using Python client, you can force a delete within your script by using:
bucket.delete(force=True)
Try out a similar thing in your current language.
Github thread that discusses this
This deserves to be summarized and pointed out.
Deleting with gsutil rm is slow if you have LOTS (terabytes) of data
gsutil -m rm -r gs://my-bucket
However, you can specify the expiration for the bucket and let the GCS do the work for you. Create a fast-delete.json policy:
{
"rule":[
{
"action":{
"type":"Delete"
},
"condition":{
"age":0
}
}
]
}
then apply
gsutil lifecycle set fast-delete.json gs://MY-BUCKET
Thanks, #jterrace and #Janosch
Use this to set an appropriate lifecycle rule. e.g. wait for a day.
https://cloud.google.com/storage/docs/gsutil/commands/lifecycle
Example (Read carefully before copy paste)
gsutil lifecycle set [LIFECYCLE_CONFIG_FILE] gs://[BUCKET_NAME]
Example (Read carefully before copy paste)
{
"rule":
[
{
"action": {"type": "Delete"},
"condition": {"age": 1}
}
]
}
Then delete the bucket.
This will delete the data asynchronously, so you don't have to keep
some background job running on your end.
Shorter one liner for the lifecycle change:
gsutil lifecycle set <(echo '{"rule":[{"action":{"type":"Delete"},"condition":{"age":0}}]}') gs://MY-BUCKET
I've also had good luck creating an empty bucket then starting a transfer to the bucket I want to empty out. Our largest bucket took about an hour to empty this way; the lifecycle method seems to take at least a day.
I benchmarked deletes using three techniques:
Storage Transfer Service: 1200 - 1500 / sec
gcloud alpha storage rm: 520 / sec
gsutil -m rm: 240 / sec
The big winner is the Storage Transfer Service. To delete files with it you need a source bucket (or folder in a bucket) that is empty, and then you copy that to a destination bucket (or folder in that bucket) that you want to be empty.
If using the GUI select this bullet in the advanced transfer options dialog:
You can also create and run the job from the CLI. This example assumes you have access to gs://bucket1/empty/ (which has no objects in it) and you want to delete all objects from gs://bucket2/:
gcloud transfer jobs create \
gs://bucket1/empty/ gs://bucket2/ \
--delete-from=destination-if-unique \
--project my-project
If you want your deletes to happen even faster you'll need to create multiple transfer jobs and have them target different sections of the bucket. Because it has to do a bucket listing to find the files to delete you'd want to make the destination paths non-overlapping (e.g. gs://bucket2/folder1/ and gs://bucket2/folder2/, etc). Each job will process in parallel at speed getting the job done in less total time.
Usually I like this better than using Object Lifecycle Management (OLM) because it starts right away (no waiting up to 24 hours for policy evaluation) but there may be times when OLM is the way to go.
Remove the bucket from Developers Console. It will ask for confirmation before deleting a non empty bucket. It works like a charm ;)
I've tried both ways (expiration time and gsutil command direct to bucket root), but I could not wait to the expiration time to propagate.
The gsutil rm was deleting 200 files per second, so I did this:
Open several terminal and executed the gsutil rm using different "folder" names with *
ie:
gsutil -m rm -r gs://my-bucket/a*
gsutil -m rm -r gs://my-bucket/b*
gsutil -m rm -r gs://my-bucket/c*
In this example, the command is able to delete 600 files per second.
So you just need to open more terminals and find the patterns to delete more files.
If one wildcard is huge, you can detail, like this
gsutil -m rm -r gs://my-bucket/b1*
gsutil -m rm -r gs://my-bucket/b2*
gsutil -m rm -r gs://my-bucket/b3*

gsutil copy, how to specify max number of objects to copy?

I need to move files from one bucket to another in GCS for processing, each time I just need to copy and process certain amount of files, say 100, how can I configure it in the command line?
gsutil -m mv gs://bucket1/file001.json gs://bucket2/
Thanks for your help.
You could write a script that builds a list of all objects to be copied and then walks through them 100 at a time, calling gsutil mv one at a time. That would lose parallelism compared to running gsutil -m mv, so you could run each of those gsutil mv commands in the background and then wait for all 100 to complete before moving on to the next batch.