"cat urls.txt | gsutil -m cp -I gs://target-bucket-name/" consistently hangs after transferring ~10,000 files - google-cloud-storage

I am trying to copy ~80,000 images from one Google Cloud Storage bucket to another.
I am initiating this operation from a Mac with Google Cloud SDK 180.0.1, which contains gsutil 4.28.
The URL of each image to be transferred is in a text file, which I feed to gsutil cp like so:
$ cat urls.txt | gsutil -m cp -I gs://target-bucket-name/
where urls.txt looks like this:
$ head -3 urls.txt
gs://source-bucket-name/1506567870546.jpg
gs://source-bucket-name/1506567930548.jpg
gs://source-bucket-name/1507853339446.jpg
The process consistently hangs after ~10,000 of the images have been transferred.
I have edited $HOME/.boto to uncomment:
parallel_composite_upload_threshold = 0
This has not prevented the operation from hanging.
I am uncertain what causes the hanging.
The underlying need is for a general utility to copy N items from one bucket to another. I need a workaround that will enable me to accomplish that mission.
UPDATE
Removing the -m option seems to work around the hanging problem, but the file transfer is now significantly slower. I would like to be able to avoid the hanging problem whilst still gaining the speed that comes with using concurrency, if possible.
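For example, a chunked variant I am considering (an untested sketch; the 5,000-line chunk size is an arbitrary guess, chosen to stay below the ~10,000-file point where the hang appears):
split -l 5000 urls.txt chunk_        # break the URL list into 5,000-line pieces
for f in chunk_*; do
  # a fresh gsutil process per chunk, so a hang stalls at most one chunk
  cat "$f" | gsutil -m cp -I gs://target-bucket-name/
done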

gsutil should not be hanging. This is a bug. Could you record the output of gsutil -D and, when it hangs, create an issue in the gsutil GitHub repo with the output attached, and comment here with a link to it? You can use the following command to log the output:
$ cat urls.txt | gsutil -D -m cp -I gs://target-bucket-name/ 2>&1 | tee output
In the meantime, you could try experimenting with reducing the number of processes and threads that the parallel mode (-m) uses by changing these defaults in your boto file:
parallel_process_count = 1 # Default - 12
parallel_thread_count = 10 # Default - 10
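These settings can also be overridden per-invocation with gsutil's -o flag instead of editing the boto file; for example (the thread count of 5 here is just an illustration, not a recommended value):
cat urls.txt | gsutil -o "GSUtil:parallel_process_count=1" -o "GSUtil:parallel_thread_count=5" -m cp -I gs://target-bucket-name/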
Note that gsutil has options to copy all files in a bucket or subdirectory to a new bucket, as well as to copy only files that have changed or do not exist in the target, with the following commands:
gsutil -m cp gs://source-bucket/* gs://target-bucket
gsutil -m cp 'gs://source-bucket/dir/**' gs://target-bucket
gsutil -m rsync -r gs://source-bucket gs://target-bucket

Related

Cloud Storage: Remove all files under a folder from a bucket gs://<bucket>/path/to/directory/2017-{01..07}*

I have 800 million files under gs://<bucket>/path/to/directory/2017-{01..07}* to delete, so I use this multi-threaded recursive delete:
$ gsutil -m rm -r gs://<bucket>/path/to/directory/2017-{01..07}*
Running it under bash expands to:
gsutil -m rm -r gs://<bucket>/path/to/directory/2017-01* gs://<bucket>/path/to/directory/2017-02* gs://<bucket>/path/to/directory/2017-03* ... gs://<bucket>/path/to/directory/2017-07*
But it shows a deletion speed of ~550 files/s, so deleting all 800 million files would take ~16 days, which is too slow. Is there a faster way?
You could distribute the processing across multiple machines. For example, have scripts so that
machine1 does gsutil -m rm -r gs://<bucket>/path/to/directory/2017-01*
machine2 does gsutil -m rm -r gs://<bucket>/path/to/directory/2017-02*
etc.
That would give you a ~7x speedup (one machine per monthly shard).
You could make it faster still if you shard the data to be deleted more finely.
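A rough sketch of that fan-out, assuming seven hosts named host1..host7 with gsutil configured on each (the hostnames are placeholders):
for i in 1 2 3 4 5 6 7; do
  # each host deletes one monthly shard; the single quotes keep the wildcard
  # and the gs://<bucket> angle brackets from being interpreted by the remote shell
  ssh "host$i" "gsutil -m rm -r 'gs://<bucket>/path/to/directory/2017-0${i}*'" &
done
wait   # block until all remote deletes have finished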

How do I copy/move all files and subfolders from the current directory to a Google Cloud Storage bucket with gsutil

I'm using gsutil and I need to copy a large number of files/subdirectories from a directory on a Windows server to a Google Cloud Storage bucket.
I have checked the documentation, but somehow I can't seem to get the syntax right. I'm trying something along these lines:
c:\test>gsutil -m cp -r . gs://mytestbucket
But I keep getting the message:
CommandException: No URLs matched: .
What am I doing wrong here?
Try gsutil -m cp -r * gs://mytestbucket
Or gsutil -m cp -r *.* gs://mytestbucket
Or if your local directory is called test go one dir up and type: gsutil -m cp -r test gs://mytestbucket
Not sure which syntax you need on Windows, but probably the first.
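If the current-directory syntax keeps failing, an absolute source path may sidestep the expansion issue entirely (an untested sketch; like the "one dir up" variant, this copies the test directory itself into the bucket):
gsutil -m cp -r c:\test gs://mytestbucket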

How can I make gsutil cp skip false symlinks?

I am using gsutil to upload a folder which contains symlinks. The problem is that some of these files are false (broken) symlinks; unfortunately, that's just how the data is.
Here is an example of the command I am using:
gsutil -m cp -c -n -e -L output-upload.log -r output gs://my-storage
and I get the following:
[Errno 2] No such file or directory: 'output/1231/file.mp4'
CommandException: 1 file/object could not be transferred.
Is there a way to make gsutil skip this file or fail safely without stopping the upload?
This was a bug in gsutil (which it looks like you reported here) and it will be fixed in gsutil 4.23.
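Until you can upgrade, one workaround is to prune the dangling symlinks before uploading. A sketch using GNU find, whose -xtype l test matches symlinks whose targets do not resolve:
find output -xtype l -print     # list the broken symlinks first to verify
find output -xtype l -delete    # then remove them
gsutil -m cp -c -n -e -L output-upload.log -r output gs://my-storage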

Efficient way to upload multiple files to separate locations on Google Cloud Storage

Here is my case: I'd like to copy multiple files to separate locations on Google Cloud Storage, e.g.:
gsutil -m cp /local/path/to/d1/x.csv.gz gs://bucketx/d1/x.csv.gz
gsutil -m cp /local/path/to/d2/x.csv.gz gs://bucketx/d2/x.csv.gz
gsutil -m cp /local/path/to/d3/x.csv.gz gs://bucketx/d3/x.csv.gz
...
I have over 10k such files, and copying them with separate gsutil invocations seems really slow, with lots of time wasted on setting up network connections each time. What's the most efficient way to do this?
If your paths are consistently of the nature in your example, you could do this with one gsutil command:
gsutil -m cp -r /local/path/to/* gs://bucketx
However, this only works if you want the destination naming to mirror the source. If your paths are arbitrary mappings of source name to destination name, you'll need to run individual commands (which, as you note, can be sped up with a tool like GNU parallel).
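For arbitrary source-to-destination mappings, one option is to drive a single pool of workers from a mapping file. A sketch assuming GNU parallel is installed and a hypothetical mappings.txt with one "<local-path> <gs://dest-path>" pair per line:
# run up to 8 copies concurrently; {1} and {2} are the two columns of each line
parallel -j 8 --colsep ' ' gsutil cp {1} {2} < mappings.txt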

"gsutil -m mv" not running parallel transfers

I am moving a large number of files from a local drive into a bucket using "gsutil -m mv". However, during the transfer it appears that only one transfer runs at a time: I have checked top and see only one Python process running the command. I have modified both "parallel_process_count" and "parallel_thread_count" in the boto config file and do not observe any change in the transfer behavior. Even when running gsutil with the -m option, I still receive the message below:
"==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m -m ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous."
Has anyone else run into this issue before?
OS: Centos 6.6
gsutil version: 4.15
python version: 2.6.6
This is a bug in gsutil 4.14-4.15, where the -m flag was not getting propagated correctly for the mv command. It is fixed in gsutil 4.16.
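To confirm which version you have and pick up the fix (the update command depends on how gsutil was installed):
gsutil version             # print the installed gsutil version
gcloud components update   # update if gsutil came with the Cloud SDK
gsutil update              # update a standalone gsutil install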