What is the quickest way to delete many blobs from GCS? - google-cloud-storage

I have a bucket containing many millions of blobs that I want to delete; however, I can't simply delete the bucket itself. This is the best method I have come up with to delete millions of blobs in the quickest time possible:
gsutil ls gs://bucket/path/to/dir/ | xargs gsutil -m rm -r
For what I want to do (which involves removing about 30 million blobs) it still takes many hours to run, partly, I guess, because it's at the mercy of the speed of my broadband connection.
Does anyone know of a quicker way of achieving this? I had rather hoped it would be an instantaneous operation, since in the backend the objects could simply be marked as deleted - clearly not.

Google recommends using the console to do this:
The Cloud Console can bulk delete up to several million objects and does so in the background. The Cloud Console can also be used to bulk delete only those objects that share a common prefix, which appear as part of a folder when using the Cloud Console.
https://cloud.google.com/storage/docs/best-practices#deleting
That said (personal opinion here), using the console might be quicker, but you haven't a clue how far it's got. At least with the CLI option you do know.

Another alternative is using lifecycle management to delete based on rules:
Delete objects in bulk
If you want to bulk delete a hundred thousand or more objects, avoid using gsutil, as the process takes a long time to complete. Instead, use the Google Cloud console, which can delete up to several million objects, or Object Lifecycle Management, which can delete any number of objects.
To bulk delete objects in your bucket using Object Lifecycle Management, set a lifecycle configuration rule on your bucket where the condition has Age set to 0 days, and the action is set to delete.
From: https://cloud.google.com/storage/docs/deleting-objects#delete-objects-in-bulk
However this won't work if you're in a rush:
After you have added or edited a rule, it may take up to 24 hours to take effect.
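For reference, here is a minimal sketch of creating such a rule with the google-cloud-storage Python client rather than the console; the bucket name is a placeholder:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")  # placeholder bucket name

# Add a delete rule that matches every object (Age >= 0 days), then persist it on the bucket.
bucket.add_lifecycle_delete_rule(age=0)
bucket.patch()

The 24-hour caveat applies however the rule is created - console, gsutil or client library.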

Related

gcloud run that requires a large database

I hope this is the right place to ask this. What I want to do is perform a sequence search against a large database and create an API for it. I expect that this service will be accessed VERY rarely, so I thought about Cloud Run, because it only bills me per use (and I don't use it a lot). I already have a Docker container configured that does what I expect it to, however I have an issue with the data that's required: I need a database that's roughly 100 GB in size. Is there a way to access this in Cloud Run?
What would be the optimal way for me to get there? I think downloading 100 GB of data every time a request is made is a waste. Maybe I could fetch a zip file from a storage bucket and inflate it in the Run instance? But I am not sure there is even that much space available.
Thank you
I believe the simpler way to do this is to take the weight off Cloud Run's shoulders.
I'm assuming it is some sort of structured data (JSON, CSV, etc.) - if it really is, it is simpler to import this data into BigQuery and have your Cloud Run service query against BQ.
This way your API will answer much faster, you will save the cost of running Cloud Run with very large instances just to load part of those 100 GB into memory, and you will also separate your architecture into layers (i.e. an application layer and a data layer).
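To make that split concrete, the Cloud Run service could do little more than forward the search to BigQuery. This is only a sketch under assumptions about your data: the Flask framework, the project/dataset/table name, the sequence column and the substring match are all placeholders, not part of your setup:

from flask import Flask, jsonify, request
from google.cloud import bigquery

app = Flask(__name__)
bq = bigquery.Client()

@app.route("/search")
def search():
    # Table and column names are placeholders for your own schema.
    query = """
        SELECT id, sequence
        FROM `my-project.sequences.records`
        WHERE STRPOS(sequence, @fragment) > 0
        LIMIT 100
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("fragment", "STRING", request.args.get("q", ""))
        ]
    )
    rows = bq.query(query, job_config=job_config).result()
    return jsonify([dict(row.items()) for row in rows])

The container then stays small and stateless while BigQuery holds the 100 GB, which is exactly the application/data layer split described above.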

Google Storage - Backup Bucket with Different Key For Each File

I need to backup a bucket in which every file is encrypted with a different key to a different bucket on Google Storage.
I want to create a daily snapshot of the data so in a case where the data has been deleted I could easily recover it.
My Research:
1. Using gsutil cp -r - because every file has a different key, it does not work.
2. Using Google Transfer - does not work on such buckets, for the same reason.
3. Listing all the files in the bucket, fetching each key from the database and copying each file - this will probably be very expensive because I have a lot of files and I want to do it daily.
4. Object versioning - does not cover the case where the bucket has been completely deleted.
Are there any other solutions for that problem?
Unfortunately, as you mentioned, the only option would indeed be your option 3. As clarified in the official documentation, download of encrypted data is a restricted feature, so you won't be able to copy/snapshot the data without fetching the keys and then copying each file.
Indeed, this will probably have a big impact on your quota and pricing, since you will be performing multiple operations every day, for a lot of files, which affects several aspects of the pricing. However, this seems to be the only available way right now. In addition, I would recommend raising a Feature Request in Google's Issue Tracker, so they can look into implementing this in the future.
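If you do go with option 3, a sketch of that loop with the google-cloud-storage Python client could look like the following. The bucket names and the key lookup are placeholders, and it assumes your database can return the raw 32-byte AES-256 key for each object:

from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("source-bucket")   # placeholder bucket names
dst_bucket = client.bucket("backup-bucket")

def lookup_key(object_name):
    # Hypothetical helper: fetch the 32-byte customer-supplied key for this object
    # from your own database.
    raise NotImplementedError

for listed in client.list_blobs("source-bucket"):
    key = lookup_key(listed.name)
    # Attach the customer-supplied key to both handles so the rewrite call sends the
    # copy-source and destination encryption headers; the bytes are copied server-side.
    src = src_bucket.blob(listed.name, encryption_key=key)
    dst = dst_bucket.blob(listed.name, encryption_key=key)
    token, _, _ = dst.rewrite(src)
    while token is not None:  # large objects may need several rewrite calls
        token, _, _ = dst.rewrite(src, token=token)

Each object still costs at least one rewrite operation per day, so the quota and pricing concern above still applies.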
Let me know if this clarified your doubts!

Can someone give me a rough guideline for how long it will take to delete a Nearline storage bucket?

4 million JPG files, approximately 30TB in size. I deleted it via their web interface, and it currently states "Deleting 1 bucket", and has done for an hour.
Just after someone's experience for a rough estimation as to how long this operation will take - another hour? A day? A week?!
Region: europe-west1, if that makes a difference.
Thank you!
According to this documentation on the deletion request timeline, step 2 says that:
Once the deletion request is made, data is typically marked for deletion immediately and our goal is to perform this step within a maximum period of 24 hours.
A couple of points to be also considered are that:
This timeline will vary depending on the number of files, so your case might take longer than that.
If your files are organized in different folders, it will take longer to delete them, since the system has to walk each directory to delete the objects inside.
One thing that you could do to speed up the deletion process is to use this command for parallel deletion:
gsutil -m rm -r gs://bucket
NOTE: I don't think the fact that your bucket uses Nearline storage has any effect on the deletion timeline, but I could not find any confirmation of that in the documentation.

Google Cloud Storage Multi-Regional bucket slow deletion

I am experiencing slow deletion for a GCS Multi-Regional bucket.
I was wondering if this is normal performance to be expected since the bucket is Multi-Regional.
My bucket is being deleted programmatically by Terraform, and the delete step has taken 16 minutes:
google_storage_bucket.<REDACTED>: Still destroying... (ID: <REDACTED>, 16m30s elapsed)
When I go into the GCS console, select the bucket and click Delete, it takes a long time while a tooltip appears saying "Checking the bucket".
After that it asks me if I want to delete the X number of items.
After I choose yes, it tells me it failed.
When I first did the delete step in GCS console, it said 146 items.
I repeated it again later and it says 102 items, which probably means 40 items were deleted.
How can I delete this bucket properly?
Is this performance expected, since it is multi-regional?
UPDATE:
33 minutes and it is still deleting (as per Terraform)
UPDATE:
Deletion is complete
google_storage_bucket.vault: Destruction complete after 52m48s
So I don't need a fix, but it'll be nice to know if this is normal expected performance.
This is expected behavior: when a bucket is about to be deleted, all objects are recursively listed to check that they have all been deleted. Usually this is pretty quick, but it can take a very long time if there are a lot of objects.
As a workaround you can delete the bucket faster by running gsutil -m rm -r gs://bucket to perform parallel (multi-threaded/multi-processing) removes.
There is also an already-filed feature request for this issue; you can click the "Me too!" button to indicate that you are affected by it.

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails in Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it was having a hard time scaling over that many users' data: hitting various resource limits, etc.)
Is it better to store 1 email/file in Cloud Storage or all of a user's emails in one large file? In many examples about cloud storage, I see folks operating on large files, but it seems more logical to keep 1 file/email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.
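As a sketch of the manifest idea (the bucket name, object naming scheme and email source below are all made-up placeholders), the uploader records every object name it writes and stores that list as one extra object, which the computation then reads instead of listing the bucket:

from google.cloud import storage

def load_emails():
    # Hypothetical stand-in for wherever the raw email bodies come from.
    yield from ["From: a@example.com\r\n\r\nhello", "From: b@example.com\r\n\r\nworld"]

client = storage.Client()
bucket = client.bucket("email-archive")  # placeholder bucket name

manifest = []
for i, raw_email in enumerate(load_emails()):
    name = "user-123/email-%07d.eml" % i
    bucket.blob(name).upload_from_string(raw_email)
    manifest.append(name)

# The computation reads this object for its exact list of inputs rather than relying
# on a bucket listing that may not yet reflect every upload.
bucket.blob("user-123/manifest.txt").upload_from_string("\n".join(manifest))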