"gsutil -m mv" not running parallel transfers - google-cloud-storage

I am moving a large number of files from a local drive into a bucket using "gsutil -m mv". However, during the transfer it appears to be running only one transfer at a time: checking top shows just a single python process running the command. I have modified both "parallel_process_count" and
"parallel_thread_count" in the boto config file and do not observe any change in the transfer behavior. Even when running gsutil with the -m option I still receive the message below:
"==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m -m ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous."
Has anyone else run into this issue before?
OS: Centos 6.6
gsutil version: 4.15
python version: 2.6.6

This is a bug in gsutil 4.14-4.15, where the -m flag was not getting propagated correctly for the mv command. It is fixed in gsutil 4.16.
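If you are unsure which version you are running, the commands below (a minimal sketch; gcloud components update applies to Cloud SDK installs, while standalone installs would use gsutil update instead) confirm the version and pull in the fixed release:
$ gsutil version        # should report 4.16 or later after updating
$ gcloud components update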

Related

"cat urls.txt | gsutil -m cp -I gs://target-bucket-name/" consistently hangs after transferring ~10,000 files

I am trying to copy ~80,000 images from one google cloud storage bucket to another.
I am initiating this operation from a mac with google cloud sdk 180.0.1 which contains gsutil 4.28.
The URL of each image to be transferred is in a text file, which I feed to gsutil cp like so ...
$ cat urls.txt | gsutil -m cp -I gs://target-bucket-name/
wherein urls.txt looks like ...
$ head -3 urls.txt
gs://source-bucket-name/1506567870546.jpg
gs://source-bucket-name/1506567930548.jpg
gs://source-bucket-name/1507853339446.jpg
The process consistently hangs after ~10,000 of the images have been transferred.
I have edited $HOME/.boto to uncomment:
parallel_composite_upload_threshold = 0
This has not prevented the operation from hanging.
I am uncertain what causes the hanging.
The underlying need is for a general utility to copy N items from one bucket to another. I need a workaround that will enable me to accomplish that mission.
UPDATE
Removing the -m option seems to work around the hanging problem, but the file transfer is now significantly slower. I would like to be able to avoid the hanging problem while still gaining the speed that comes with concurrency, if possible.
gsutil should not be hanging. This is a bug. Could you record the output with gsutil -D and, when it hangs, create an issue in the gsutil GitHub repo with the output attached, then comment here with a link to it? You can use the following command to log the output:
$ cat urls.txt | gsutil -D -m cp -I gs://target-bucket-name/ 2>&1 | tee output
In the meantime, you could try experimenting with reducing the number of threads and processes that the parallel mode (-m) uses by changing these defaults in your boto file.
parallel_process_count = 1 # Default - 12
parallel_thread_count = 10 # Default - 10
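For reference, these settings live under the [GSUtil] section of the boto config file, so the relevant portion of $HOME/.boto would look roughly like this (values are only illustrative):
[GSUtil]
parallel_process_count = 1
parallel_thread_count = 10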
Note that gsutil can copy all files in a bucket or subdirectory to another bucket, or copy only the files that have changed or do not yet exist in the target, with the following commands:
gsutil -m cp 'gs://source-bucket/*' gs://target-bucket
gsutil -m cp 'gs://source-bucket/dir/**' gs://target-bucket
gsutil -m rsync -r gs://source-bucket gs://target-bucket
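If the hang persists even with reduced parallelism, one possible workaround (a sketch only, using the filenames from the question) is to split urls.txt into smaller chunks so that each gsutil invocation finishes well before the ~10,000-file mark:
$ split -l 5000 urls.txt chunk_
$ for f in chunk_*; do cat "$f" | gsutil -m cp -I gs://target-bucket-name/; done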

Efficient way upload multiple files to separate locations on google cloud storage

Here is my case: I'd like to copy multiple files to separate locations on Google Cloud Storage, e.g.:
gsutil -m cp /local/path/to/d1/x.csv.gz gs://bucketx/d1/x.csv.gz
gsutil -m cp /local/path/to/d2/x.csv.gz gs://bucketx/d2/x.csv.gz
gsutil -m cp /local/path/to/d3/x.csv.gz gs://bucketx/d3/x.csv.gz
...
I have over 10k such files, and copying them with separate gsutil invocations is really slow, since a lot of time is wasted on setting up each network connection. What's the most efficient way to do this?
If your paths are consistently of the nature in your example, you could do this with one gsutil command:
gsutil -m cp -r /local/path/to/* gs://bucketx
However, this only works if you want the destination naming to mirror the source. If your paths are arbitrary mappings of source name to destination name, you'll need to run individual commands, which can be sped up by running several of them in parallel (see the sketch below).
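If the mapping really is arbitrary, one possible middle ground (a sketch only; mapping.txt is a hypothetical file containing one "local-path gs://destination" pair per line) is to drive a limited number of concurrent gsutil processes from that manifest:
$ xargs -n 2 -P 8 gsutil cp < mapping.txt
Here xargs takes two tokens at a time (source and destination) and keeps up to 8 gsutil cp processes running concurrently, which avoids paying the startup and connection cost for each file strictly in sequence.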

Gsutil rm does not remove everything

I have a problem with one of my automated jobs.
Before launching a Cloud Dataflow job, I perform a gsutil rm on the previous output files, but it appears that it does not remove everything, because when I launch another Dataflow job some older shards remain.
I tried:
gsutil -m rm gs://mybucket/blahblah/*
and
gsutil rm -r gs://mybucket/blablah
But I get the same result...
Strangely, the files that are not removed are neither the first nor the last ones.
I thought it was the fault of my second job, but the logs show that the files were indeed not removed by gsutil.
Is it possible that there are too many files to delete?
Are there any known reliability problems with gsutil rm?
I use version 0.9.80 of google cloud sdk
Thanks
The gsutil rm commands you're using depend on listing the objects in a bucket, which is an eventually consistent operation in Google Cloud Storage. Thus, it's possible that attempting these commands in a bucket soon after objects were written will not remove all the objects. If you try again later it should succeed.
One way to avoid this problem would be to keep track of the names of the objects you uploaded, and explicitly list those objects in the gsutil rm command. For example, if you kept the object list in the file objects.manifest you could run a command like this on Linux or MacOS:
xargs gsutil -m rm < objects.manifest
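For illustration, objects.manifest would simply contain one object URL per line, e.g. (these object names are hypothetical):
gs://mybucket/blahblah/shard-00000-of-00010
gs://mybucket/blahblah/shard-00001-of-00010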

How to download multiple files in Google Cloud Storage

Scenario: there are multiple folders and many files stored in storage bucket that is accessible by project team members. Instead of downloading individual files one at a time (which is very slow and time consuming), is there a way to download entire folders? Or at least multiple files at once? Is this possible without having to use one of the command consoles? Some of the team members are not tech savvy and need to access these files as simple as possible. Thank you for any help!
I would suggest downloading the files with gsutil. However, if you have a large number of files to transfer, you might want to use the -m option to perform a parallel (multi-threaded/multi-process) copy:
gsutil -m cp -R gs://your-bucket .
The time reduction for downloading the files can be quite significant. See the Cloud Storage documentation for complete information on the cp command.
If you want to copy into a particular directory, note that the directory must exist first, as gsutil won't create it automatically (e.g.: mkdir my-bucket-local-copy && gsutil -m cp -r gs://your-bucket my-bucket-local-copy).
I recommend they use gsutil. GCS's API deals with only one object at a time, but the gsutil command-line utility is more than happy to download a bunch of objects in parallel. Downloading an entire GCS "folder" with gsutil is pretty simple:
$ gsutil cp -r gs://my-bucket/remoteDirectory localDirectory
To download files to a local machine you need to:
install gsutil on the local machine
run the Google Cloud SDK Shell
run a command like this (example for the Windows platform):
gsutil -m cp -r gs://source_folder_path "%userprofile%/Downloads"
gsutil rsync -d -r gs://bucketName .
works for me

gsutil rsync not preserving uid/gid ownership

When using gsutil -m rsync -p -d -r, the ownership of the synced files becomes root.
Any idea how to run gsutil rsync just like rsync -a?
thanks
Peter
gsutil rsync doesn't currently support preserving POSIX file attributes in the cloud.
It's not guaranteed that the uid/gid on the system that uploaded a file is even valid on the system that downloaded the file. So (at least for now), you'll need to manage your file permissions manually.
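As a workaround (a sketch only; the local path and the user/group names are placeholders), you can restore the desired ownership yourself after the sync completes:
$ gsutil -m rsync -d -r gs://my-bucket /local/dir
$ chown -R peter:peter /local/dir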