gsutil cp command copies entire directory path (on Windows) - bug?

I am having a problem where gsutil does not seem to follow the behavior described in the documentation (at least in Windows). The documentation states:
When performing recursive directory copies, object names are constructed that mirror the source directory structure starting at the point of recursive processing. For example, the command:
gsutil cp -R dir1/dir2 gs://my_bucket
will create objects named like gs://my_bucket/dir2/a/b/c, assuming dir1/dir2 contains the file a/b/c.
However, in practice I have found that it will create objects named:
gs://my_bucket/dir1/dir2/a/b/c
i.e., it copies the entire directory path given in the gsutil command, rather than "starting at the point of recursive processing" (dir2) as the documentation states.
Am I missing/misunderstanding something here?

I noticed the same behavior when using gsutil cp -R with a similar directory structure. To copy only the contents starting at the 'dir2' level, I used the command: gsutil rsync -r dir1/dir2 gs://mybucket
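Another workaround (a sketch using the bucket and directory names from the question; whether it sidesteps the Windows behavior described above would need to be verified) is to change into the parent directory first, so the recursive copy starts at dir2:
cd dir1
gsutil cp -R dir2 gs://my_bucket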

Related

Efficient way to upload multiple files to separate locations on google cloud storage

Here is my case, I'd like to copy multiple files to separate locations on google cloud storage, e.g:
gsutil -m cp /local/path/to/d1/x.csv.gz gs://bucketx/d1/x.csv.gz
gsutil -m cp /local/path/to/d2/x.csv.gz gs://bucketx/d2/x.csv.gz
gsutil -m cp /local/path/to/d3/x.csv.gz gs://bucketx/d3/x.csv.gz
...
I have over 10k such files, and running a separate gsutil call for each one is really slow; lots of time is wasted setting up the network connection. What's the most efficient way to do this?
If your paths are consistently of the nature in your example, you could do this with one gsutil command:
gsutil -m cp -r /local/path/to/* gs://bucketx
However, this only works if you want the destination naming to mirror the source. If your paths are arbitrary mappings of source name to destination name, you'll need to run individual commands (which can be sped up by running them in parallel).
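A rough sketch of the parallel approach, assuming a hypothetical mapping.txt that lists one "local-path gs-destination" pair per line (the file name and layout are placeholders, not from the original question):
xargs -P 8 -n 2 gsutil cp < mapping.txt
Here -P 8 runs up to eight copies at once and -n 2 feeds each gsutil cp invocation one source/destination pair from the file.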

Gsutil rm does not remove everything

I have a problem with one of my automated jobs.
Before launching a Cloud Dataflow job, I perform a gsutil rm on the previous output files, but it appears that it does not remove everything: when I launch another Dataflow job, some older shards remain.
I tried :
gsutil -m rm gs://mybucket/blahblah/*
and
gsutil rm -r gs://mybucket/blablah
But I get the same result...
Strangely, the files that are left behind are neither the first nor the last ones.
I thought it was the fault of my second job, but the logs show that the files were indeed not removed by gsutil.
Is it possible that there are too many files to delete?
Are there known reliability problems with gsutil rm?
I use version 0.9.80 of the Google Cloud SDK.
Thanks
The gsutil rm commands you're using depend on listing the objects in a bucket, which is an eventually consistent operation in Google Cloud Storage. Thus, it's possible that attempting these commands in a bucket soon after objects were written will not remove all the objects. If you try again later it should succeed.
One way to avoid this problem would be to keep track of the names of the objects you uploaded, and explicitly list those objects in the gsutil rm command. For example, if you kept the object list in the file objects.manifest you could run a command like this on Linux or MacOS:
xargs gsutil -m rm < objects.manifest
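As a sketch of the manifest approach (the local paths and loop below are illustrative assumptions, not taken from the question), you could append each object name to the manifest as you upload, then delete exactly those objects later:
for f in output/shard-*; do
  gsutil cp "$f" "gs://mybucket/blahblah/$(basename "$f")"
  echo "gs://mybucket/blahblah/$(basename "$f")" >> objects.manifest
done
xargs gsutil -m rm < objects.manifest
Because the delete step names each object explicitly, it does not depend on an eventually consistent bucket listing.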

How to download multiple files in Google Cloud Storage

Scenario: there are multiple folders and many files stored in a storage bucket that is accessible by project team members. Instead of downloading individual files one at a time (which is very slow and time consuming), is there a way to download entire folders, or at least multiple files at once? Is this possible without having to use one of the command consoles? Some of the team members are not tech savvy and need to access these files as simply as possible. Thank you for any help!
I would suggest downloading the files with gsutil. If you have a large number of files to transfer, use the gsutil -m option to perform a parallel (multi-threaded/multi-processing) copy:
gsutil -m cp -R gs://your-bucket .
The time reduction for downloading the files can be quite significant. See this Cloud Storage documentation for complete information on the GCS cp command.
If you want to copy into a particular directory, note that the directory must exist first, as gsutil won't create it automatically (e.g., mkdir my-bucket-local-copy && gsutil -m cp -r gs://your-bucket my-bucket-local-copy).
I recommend using gsutil. GCS's API deals with only one object at a time, but its command-line utility, gsutil, is more than happy to download a bunch of objects in parallel. Downloading an entire GCS "folder" with gsutil is pretty simple:
$> gsutil cp -r gs://my-bucket/remoteDirectory localDirectory
To download files to a local machine you need to:
install gsutil on the local machine
run the Google Cloud SDK Shell
run a command like this (example for Windows):
gsutil -m cp -r gs://source_folder_path "%userprofile%/Downloads"
gsutil rsync -d -r gs://bucketName .
works for me (note that -d deletes files under the destination that are not present in the bucket, so use it with care)

gsutil returning "no matches found"

I'm trying to use gsutil to remove the contents of a Cloud Storage bucket (but not the bucket itself). According to the documentation, the command should be:
gsutil rm gs://bucket/**
However, whenever I run that (with my bucket name substituted of course), I get the following response:
zsh: no matches found: gs://my-bucket/**
I've checked permissions, and I have owner permissions. Additionally, if I directly specify a file that is in the bucket, it is successfully deleted.
Other information which may matter:
My bucket name has a "-" in it (similar to "my-bucket")
It is the bucket that Cloud Storage saves my usage logs to
How do I go about deleting the contents of a bucket?
zsh is attempting to expand the wildcard before gsutil sees it (and is complaining that you have no local files matching that wildcard). Please try this, to prevent zsh from doing so:
gsutil rm 'gs://bucket/**'
Note that you need to use single (not double) quotes to prevent zsh wildcard handling.
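If you run this kind of command often in zsh, the noglob precommand modifier is another option; it disables zsh filename generation for a single command (this is a zsh feature, not a gsutil one):
noglob gsutil rm gs://my-bucket/**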
If you need shell variables to expand (so you can't use single quotes), you can also just escape the wildcard character.
Examples with cp (with some useful flags) and rm:
GCP_PROJECT_NAME=your-project-name
gsutil -m cp -r gs://${GCP_PROJECT_NAME}.appspot.com/assets/\* src/local-assets/
gsutil rm gs://${GCP_PROJECT_NAME}.appspot.com/\*\*
gsutil rm gs://bucketName/doc.txt
And to remove an entire bucket, including all of its objects:
gsutil rm -r gs://bucketname
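To come back to the original question (delete the contents but keep the bucket), the quoted wildcard form shown above combines well with -m for parallel deletes:
gsutil -m rm 'gs://my-bucket/**'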

gsutil rsync not preserving uid/gid ownership

When using gsutil -m rsync -p -d -r,
the ownership of the synced files became root.
Any idea how to run gsutil rsync just like rsync -a?
thanks
Peter
gsutil rsync doesn't currently support preserving POSIX file attributes in the cloud.
It's not guaranteed that the uid/gid on the system that uploaded a file is even valid on the system that downloads it. So (at least for now), you'll need to manage your file ownership and permissions manually.
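For example, a minimal post-sync fix-up (a sketch; the local path and the appuser/appgroup names are placeholders, not from the question) might look like:
gsutil -m rsync -r gs://my-bucket/data /local/data
sudo chown -R appuser:appgroup /local/data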