How can I make gsutil cp skip false symlinks?

I am using gsutil to upload a folder which contains symlinks; the problem is that some of these files are broken symlinks (unfortunately, that's the case).
Here is an example of the command I am using:
gsutil -m cp -c -n -e -L output-upload.log -r output gs://my-storage
and I get the following:
[Errno 2] No such file or directory: 'output/1231/file.mp4'
CommandException: 1 file/object could not be transferred.
Is there a way to make gsutil skip this file, or fail safely without stopping the upload?

This was a bug in gsutil (which it looks like you reported here) and it will be fixed in gsutil 4.23.
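Until you can upgrade, one possible workaround (a sketch, assuming a Linux shell with GNU find; paths match the example above) is to locate the broken symlinks and remove or move them aside before running the upload:
$ find output -xtype l            # list broken symlinks under output/
$ find output -xtype l -delete    # remove them (or move them elsewhere first)
$ gsutil -m cp -c -n -e -L output-upload.log -r output gs://my-storage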

Related

Copy files from bucket to local dir

I want to copy files from my bucket, but only those files/directories that do not exist on my local drive.
Is it possible?
I tried something like
gsutil -m rsync -n -r "MyBUCKET" "my_local_dir"
but something is wrong.
You should try using the recently-added -i flag, which ignores existing files on the destination. That flag was added in this commit and should be available as of gsutil v4.59.
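For example (a sketch; substitute your actual bucket name and local path, and note that -n in gsutil rsync means "dry run", unlike -n in gsutil cp, which means no-clobber):
gsutil -m rsync -i -r gs://my-bucket my_local_dir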

"cat urls.txt | gsutil -m cp -I gs://target-bucket-name/" consistently hangs after transferring ~10,000 files

I am trying to copy ~80,000 images from one Google Cloud Storage bucket to another.
I am initiating this operation from a Mac with Google Cloud SDK 180.0.1, which contains gsutil 4.28.
The URL of each image to be transferred is in a text file, which I feed to gsutil cp like so ...
$ cat urls.txt | gsutil -m cp -I gs://target-bucket-name/
wherein urls.txt looks like ...
$ head -3 urls.txt
gs://source-bucket-name/1506567870546.jpg
gs://source-bucket-name/1506567930548.jpg
gs://source-bucket-name/1507853339446.jpg
The process consistently hangs after ~10,000 of the images have been transferred.
I have edited $HOME/.boto to uncomment:
parallel_composite_upload_threshold = 0
This has not prevented the operation from hanging.
I am uncertain what causes the hanging.
The underlying need is for a general utility to copy N items from one bucket to another. I need a workaround that will enable me to accomplish that mission.
UPDATE
Removing the -m option seems to work around the hanging problem, but the file transfer is now significantly slower. I would like to avoid the hanging problem while still gaining the speed that comes with concurrency, if possible.
gsutil should not be hanging; this is a bug. Could you record the output of gsutil -D, and when it hangs, create an issue in the gsutil GitHub repo with the output attached and comment here with a link to it? You can use the following command to log the output:
$ cat urls.txt | gsutil -D -m cp -I gs://target-bucket-name/ 2>&1 | tee output
In the meantime, you could try experimenting with reducing the number of threads and processes that parallel mode (-m) uses by changing these defaults in your boto file:
parallel_process_count = 1 # Default - 12
parallel_thread_count = 10 # Default - 10
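As a stopgap (a sketch, not an official gsutil feature; the batch size is arbitrary), you could also split urls.txt into smaller batches so that a hang only stalls one batch, which can then simply be re-run. Adding -n (no-clobber) makes a re-run skip objects that were already copied:
split -l 1000 urls.txt chunk_
for f in chunk_*; do
  cat "$f" | gsutil -m cp -n -I gs://target-bucket-name/
done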
Note that gsutil can also copy all files in a bucket or subdirectory to a new bucket, or copy only files that have changed or do not exist in the target, with the following commands:
gsutil -m cp 'gs://source-bucket/*' gs://target-bucket
gsutil -m cp 'gs://source-bucket/dir/**' gs://target-bucket
gsutil -m rsync -r gs://source-bucket gs://target-bucket

How do I copy/move all files and subfolders from the current directory to a Google Cloud Storage bucket with gsutil

I'm using gsutil and I need to copy a large number of files/subdirectories from a directory on a Windows server to a Google Cloud Storage bucket.
I have checked the documentation but somehow I can't seem to get the syntax right - I'm trying something along these lines:
c:\test>gsutil -m cp -r . gs://mytestbucket
But I keep getting the message:
CommandException: No URLs matched: .
What am I doing wrong here?
Regards
Morten Hjorth Nielsen
Try gsutil -m cp -r * gs://mytestbucket
Or gsutil -m cp -r *.* gs://mytestbucket
Or if your local directory is called test go one dir up and type: gsutil -m cp -r test gs://mytestbucket
Not sure which syntax you need on Windows, but probably the first.
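For example, if the files live in c:\test:
c:\test>cd ..
c:\>gsutil -m cp -r test gs://mytestbucket
Note that this last form copies the files under gs://mytestbucket/test/..., whereas copying * from inside c:\test places the contents directly at the top level of the bucket.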

wget --warc-file --recursive, prevent writing individual files

I run wget to create a warc archive as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/
$ l -h /tmp/epfl.warc.gz
-rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz
$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]
I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?
I tried as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.
tl;dr Add the options --delete-after and --no-directories.
Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.
Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.
The following demonstrates the result, using your example (slightly altered).
$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
--warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
For individual files (without --recursive), the option -O /dev/null makes wget not create a file for the output. For recursive fetches, /dev/null is not accepted (I don't know why). But why not just write all the output, concatenated, into a single file via -O tmpfile and delete that file afterwards?
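That suggestion would look something like this (a sketch of that approach; the scratch file name is arbitrary, and the WARC file still records every response individually):
$ wget --warc-file=/tmp/epfl --recursive --level=1 \
       --output-document=/tmp/wget-scratch http://www.epfl.ch/
$ rm /tmp/wget-scratch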

gsutil rsync not preserving uid/gid ownership

When using gsutil -m rsync -p -d -r, the ownership of the files became root.
Any idea how to run gsutil rsync just like rsync -a?
thanks
Peter
gsutil rsync doesn't currently support preserving POSIX file attributes in the cloud.
It's not guaranteed that the uid/gid on the system that uploaded a file is even valid on the system that downloaded the file. So (at least for now), you'll need to manage your file permissions manually.
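If you do need ownership restored after a download, one manual approach (a sketch, assuming GNU find and bash; the directory, bucket, and manifest names are illustrative) is to record numeric uid/gid in a manifest before uploading and reapply it afterwards:
# before uploading: record uid/gid for every path
find data -printf '%p\t%U\t%G\n' > ownership.tsv
gsutil -m rsync -r data gs://my-bucket/data
gsutil cp ownership.tsv gs://my-bucket/ownership.tsv

# after downloading: restore ownership (run as root)
gsutil -m rsync -r gs://my-bucket/data data
gsutil cp gs://my-bucket/ownership.tsv .
while IFS=$'\t' read -r path uid gid; do
  chown "$uid:$gid" "$path"
done < ownership.tsv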