Is this my only option or is there a faster way?
# Delete contents in bucket (takes a long time on large bucket)
gsutil -m rm -r gs://my-bucket/*
# Remove bucket
gsutil rb gs://my-bucket/
Buckets are required to be empty before they're deleted. So before you can delete a bucket, you have to delete all of the objects it contains.
You can do this with gsutil rm -r (documentation). Just don't pass the * wildcard and it will delete the bucket itself after it has deleted all of the objects.
gsutil -m rm -r gs://my-bucket
Google Cloud Storage bucket deletes can't succeed until the bucket listing returns 0 objects. If objects remain, you can get a Bucket Not Empty error (or in the UI's case 'Bucket Not Ready') when trying to delete the bucket.
gsutil has built-in retry logic to delete both buckets and objects.
Another option is to enable Lifecycle Management on the bucket. You could specify an Age of 0 days and then wait a couple of days. All of your objects should be deleted.
Using the Python client, you can force a delete within your script with:
bucket.delete(force=True)  # bucket is a google.cloud.storage Bucket instance
Look for a similar option in the client library for your language.
GitHub thread that discusses this
This deserves to be summarized and pointed out: deleting with gsutil rm is slow if you have LOTS (terabytes) of data.
gsutil -m rm -r gs://my-bucket
However, you can specify an expiration for the bucket's objects and let GCS do the work for you. Create a fast-delete.json policy:
{
  "rule": [
    {
      "action": {
        "type": "Delete"
      },
      "condition": {
        "age": 0
      }
    }
  ]
}
then apply
gsutil lifecycle set fast-delete.json gs://MY-BUCKET
Thanks, @jterrace and @Janosch
Use this to set an appropriate lifecycle rule, e.g. wait for a day:
https://cloud.google.com/storage/docs/gsutil/commands/lifecycle
Example command (read carefully before copy-pasting):
gsutil lifecycle set [LIFECYCLE_CONFIG_FILE] gs://[BUCKET_NAME]
Example lifecycle config file (read carefully before copy-pasting):
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 1}
    }
  ]
}
Then delete the bucket.
This will delete the data asynchronously, so you don't have to keep some background job running on your end.
Shorter one liner for the lifecycle change:
gsutil lifecycle set <(echo '{"rule":[{"action":{"type":"Delete"},"condition":{"age":0}}]}') gs://MY-BUCKET
I've also had good luck creating an empty bucket and then starting a transfer to the bucket I want to empty out. Our largest bucket took about an hour to empty this way; the lifecycle method seems to take at least a day.
I benchmarked deletes using three techniques:
Storage Transfer Service: 1200 - 1500 / sec
gcloud alpha storage rm: 520 / sec
gsutil -m rm: 240 / sec
The big winner is the Storage Transfer Service. To delete files with it, you need an empty source bucket (or folder in a bucket), and then you copy that to the destination bucket (or folder in that bucket) that you want to be empty.
If using the GUI, select the advanced transfer option that deletes objects from the destination if they're not also in the source.
You can also create and run the job from the CLI. This example assumes you have access to gs://bucket1/empty/ (which has no objects in it) and you want to delete all objects from gs://bucket2/:
gcloud transfer jobs create \
gs://bucket1/empty/ gs://bucket2/ \
--delete-from=destination-if-unique \
--project my-project
If you want your deletes to happen even faster, you'll need to create multiple transfer jobs and have them target different sections of the bucket. Because each job has to do a bucket listing to find the files to delete, you'd want to make the destination paths non-overlapping (e.g. gs://bucket2/folder1/, gs://bucket2/folder2/, etc.). Each job will process in parallel at full speed, getting the work done in less total time.
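As a rough sketch of that fan-out, assuming the bucket happens to be organized under top-level prefixes like folder1/, folder2/, folder3/ (placeholders you'd replace with your real layout), you could create one job per prefix:
# The prefixes folder1..folder3 are placeholders for your bucket's real top-level "folders".
for prefix in folder1 folder2 folder3; do
  gcloud transfer jobs create \
    gs://bucket1/empty/ "gs://bucket2/${prefix}/" \
    --delete-from=destination-if-unique \
    --project my-project
done
# Each job lists and deletes only its own prefix, so the jobs run in parallel on the service side.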
Usually I like this better than using Object Lifecycle Management (OLM) because it starts right away (no waiting up to 24 hours for policy evaluation) but there may be times when OLM is the way to go.
Remove the bucket from the Developers Console. It will ask for confirmation before deleting a non-empty bucket. It works like a charm ;)
I've tried both ways (setting an expiration time and running the gsutil command directly against the bucket root), but I could not wait for the expiration policy to take effect.
gsutil rm was deleting about 200 files per second, so I did this:
Open several terminals and run gsutil rm with different "folder" name prefixes and a * wildcard, i.e.:
gsutil -m rm -r gs://my-bucket/a*
gsutil -m rm -r gs://my-bucket/b*
gsutil -m rm -r gs://my-bucket/c*
In this example, the combined commands delete about 600 files per second.
So you just need to open more terminals and find prefixes that cover more of your files.
If one prefix matches a huge number of files, you can split it further, like this:
gsutil -m rm -r gs://my-bucket/b1*
gsutil -m rm -r gs://my-bucket/b2*
gsutil -m rm -r gs://my-bucket/b3*
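If you'd rather not juggle terminals by hand, the same fan-out can be scripted with shell background jobs; the prefix list here is just an example you'd adapt to your actual object names:
#!/usr/bin/env bash
# Launch one parallel delete per prefix in the background, then wait for all of them.
for prefix in a b c d e f; do
  gsutil -m rm -r "gs://my-bucket/${prefix}*" &
done
wait  # returns once every background delete has finished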
Here is my case: I'd like to copy multiple files to separate locations on Google Cloud Storage, e.g.:
gsutil -m cp /local/path/to/d1/x.csv.gz gs://bucketx/d1/x.csv.gz
gsutil -m cp /local/path/to/d2/x.csv.gz gs://bucketx/d2/x.csv.gz
gsutil -m cp /local/path/to/d3/x.csv.gz gs://bucketx/d3/x.csv.gz
...
I have over 10k such files, and copying them with separate gsutil calls is really slow; a lot of time is wasted setting up the network connection for each call. What's the most efficient way to do this?
If your paths are consistently of the nature in your example, you could do this with one gsutil command:
gsutil -m cp -r /local/path/to/* gs://bucketx
However, this only works if you want the destination naming to mirror the source. If your paths are arbitrary mappings of source name to destination name, you'll need to run individual commands, which can be sped up by running them in parallel (see the sketch below).
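For the arbitrary-mapping case, one common workaround (not gsutil-specific, and just a sketch) is to write the individual commands to a file and feed them to GNU parallel; the file name and job count below are placeholders:
# cp_commands.txt holds one complete gsutil cp command per line, e.g.
#   gsutil cp /local/path/to/d1/x.csv.gz gs://bucketx/d1/x.csv.gz
# Run up to 16 of them concurrently:
parallel -j 16 < cp_commands.txt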
I have a problem with one of my automated jobs.
Before launching a Cloud Dataflow job, I run gsutil rm on the previous files, but it appears that it does not remove everything, because when I launch another Dataflow job some older shards remain.
I tried:
gsutil -m rm gs://mybucket/blahblah/*
and
gsutil rm -r gs://mybucket/blablah
But I got the same result...
The strange thing is that the files that were not removed are neither the first nor the last ones.
I thought it was my second job's fault, but I saw in the logs that the files were indeed not removed by gsutil.
Is it possible that there are too many files to delete?
Are there known problems with gsutil rm reliability?
I use version 0.9.80 of the Google Cloud SDK.
Thanks
The gsutil rm commands you're using depend on listing the objects in a bucket, which is an eventually consistent operation in Google Cloud Storage. Thus, it's possible that attempting these commands in a bucket soon after objects were written will not remove all the objects. If you try again later it should succeed.
One way to avoid this problem would be to keep track of the names of the objects you uploaded, and explicitly list those objects in the gsutil rm command. For example, if you kept the object list in the file objects.manifest you could run a command like this on Linux or MacOS:
xargs gsutil -m rm < objects.manifest
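If the objects were originally uploaded from a local directory, one way (a sketch, not the only one) to build that manifest is from the local file list; the ./output path and gs://mybucket/blahblah/ prefix are assumptions standing in for your real layout:
# Assumed layout: everything under ./output/ was uploaded to gs://mybucket/blahblah/
find ./output -type f | sed 's|^\./output/|gs://mybucket/blahblah/|' > objects.manifest
# Delete exactly those objects, without depending on a bucket listing:
xargs gsutil -m rm < objects.manifest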
Scenario: there are multiple folders and many files stored in a storage bucket that is accessible to project team members. Instead of downloading individual files one at a time (which is very slow and time-consuming), is there a way to download entire folders? Or at least multiple files at once? Is this possible without having to use one of the command consoles? Some of the team members are not tech savvy and need to access these files as simply as possible. Thank you for any help!
I would suggest downloading the files with gsutil. If you have a large number of files to transfer, you might want to use the -m option to perform a parallel (multi-threaded/multi-processing) copy:
gsutil -m cp -R gs://your-bucket .
The time reduction for downloading the files can be quite significant. See this Cloud Storage documentation for complete information on the GCS cp command.
If you want to copy into a particular directory, note that the directory must exist first, as gsutil won't create it automatically (e.g. mkdir my-bucket-local-copy && gsutil -m cp -r gs://your-bucket my-bucket-local-copy).
I recommend gsutil. GCS's API deals with only one object at a time, but the gsutil command-line utility is more than happy to download a bunch of objects in parallel. Downloading an entire GCS "folder" with gsutil is pretty simple:
gsutil cp -r gs://my-bucket/remoteDirectory localDirectory
To download files to a local machine you need to:
install gsutil on the local machine
run the Google Cloud SDK Shell
run a command like this (example for the Windows platform):
gsutil -m cp -r gs://source_folder_path "%userprofile%/Downloads"
gsutil rsync -d -r gs://bucketName .
works for me
I'm trying to use gsutil to remove the contents of a Cloud Storage bucket (but not the bucket itself). According to the documentation, the command should be:
gsutil rm gs://bucket/**
However, whenever I run that (with my bucket name substituted of course), I get the following response:
zsh: no matches found: gs://my-bucket/**
I've checked permissions, and I have owner permissions. Additionally, if I directly specify a file that is in the bucket, it is successfully deleted.
Other information which may matter:
My bucket name has a "-" in it (similar to "my-bucket")
It is the bucket that Cloud Storage saves my usage logs to
How do I go about deleting the contents of a bucket?
zsh is attempting to expand the wildcard before gsutil sees it (and is complaining that you have no local files matching that wildcard). Please try this, to prevent zsh from doing so:
gsutil rm 'gs://bucket/**'
Note the single quotes, which prevent zsh from handling the wildcard.
If you need shell variables to expand inside the URL, you can also just escape the wildcard characters.
Examples with cp (with some useful flags) and rm:
GCP_PROJECT_NAME=your-project-name
gsutil -m cp -r gs://${GCP_PROJECT_NAME}.appspot.com/assets/\* src/local-assets/
gsutil rm gs://${GCP_PROJECT_NAME}.appspot.com/\*\*
To remove a single object:
gsutil rm gs://bucketName/doc.txt
And to remove the entire bucket, including all objects:
gsutil rm -r gs://bucketname
I need to move files from one bucket to another in GCS for processing. Each time I just need to copy and process a certain number of files, say 100. How can I do that on the command line?
gsutil -m mv gs://bucket1/file001.json gs://bucket2/
Thanks for your help.
You could write a script that builds a list of all objects to be copied and then walks through them 100 at a time, calling gsutil mv one at a time. That would lose parallelism compared to running gsutil -m mv, so you could run each of those gsutil mv commands in the background and then wait for all 100 to complete before moving on to the next batch.
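A rough sketch of that script, assuming bash and using the bucket names from your example (the batch size of 100 and the all_objects.txt file name are arbitrary):
#!/usr/bin/env bash
# Build the full object list once, then move it in batches of 100,
# running the moves inside each batch in parallel.
gsutil ls 'gs://bucket1/**' > all_objects.txt
mapfile -t objects < all_objects.txt

batch_size=100
for ((i = 0; i < ${#objects[@]}; i += batch_size)); do
  for obj in "${objects[@]:i:batch_size}"; do
    gsutil mv "$obj" gs://bucket2/ &
  done
  wait  # finish this batch before starting the next
  # ...process the files that just landed in gs://bucket2/ here...
done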