Google Cloud Storage versioned bucket not obeying lifecycle rule

I would like to keep a maximum of only 2 versions of each object in my Google Cloud Storage bucket. I have enabled Object Versioning and added a lifecycle rule to delete any objects with more than 2 versions. I then add objects to the bucket multiple times and run
gsutil ls -R -a gs://bucketname
I end up seeing 3 or 4 different generations of each object, and even after several minutes of waiting they are not deleted.
E.g.:
gs://bucketname/b331108b.csv.gz#1562856078193350
gs://bucketname/b331108b.csv.gz#1564856078195342
gs://bucketname/b331108b.csv.gz#1565856078143350
gs://bucketname/b331108b.csv.gz#1567856078193551
Is this the expected behaviour?
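For reference, the kind of rule described (keep at most 2 versions of each object) is usually expressed with the numNewerVersions lifecycle condition. A minimal sketch, assuming the configuration is saved as lifecycle.json (a placeholder file name) and applied to the bucket from the question:
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 2}
    }
  ]
}
gsutil lifecycle set lifecycle.json gs://bucketname
Note that lifecycle actions are applied asynchronously, so generations that are already eligible for deletion can remain visible for some time after the rule is set.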

Related

Without retention policy or lifecycle rules, would Google Cloud Storage automatically delete files?

My app uses Google Cloud Storage through Firebase with Java, Angular & Flutter. It stores pictures and such there. Now, a lot of older files recently disappeared from Google Cloud Storage. A test version of my app is probably the culprit. But I want to make sure that I got the storage bucket configured correctly.
Please note that I don't have object versioning enabled. From what I know, it would keep a copy of deleted files around. That's why I plan to enable it in the future. But it doesn't help me with files deleted in the past.
Right now, my storage bucket is configured as follows:
Default storage class: Standard
Object versioning: Off
Retention policy: None
Lifecycle rules: None
So with that configuration, would Google Cloud Storage automatically delete files? Like, say, after a year or so?
No. If you don't ask Cloud Storage to delete your files, your files will stay around forever. There's no expiration of any sort by default. Cloud Storage is a popular tool for long term storage/backup/retention.
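If you ever did want automatic expiration, you would have to configure it explicitly, for example with an age-based lifecycle rule. A minimal sketch (the 365-day threshold and the lifecycle.json file name are only illustrative):
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
gsutil lifecycle set lifecycle.json gs://my_bucket_name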
If you want to be especially careful not to delete certain objects, retention policies and object holds can be used to make it harder to delete objects by accident. For example, if you wanted to temporarily ensure that your scripts would not delete your most important object, you could run:
gsutil retention temp set gs://my_bucket_name/my_important_file.txt
This would set a "temporary object hold" on the object, which prevents my_important_file.txt from being deleted with the delete command until you release the hold.
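When you want to allow deletion again, the hold can be released with the matching command (same object path as above):
gsutil retention temp release gs://my_bucket_name/my_important_file.txt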

Is there a way to figure out in which region a Google Cloud Storage bucket is hosted?

NCBI (the National Center for Biotech Info) generously provided their data for 3rd parties to consume. The data is located in cloud buckets such as gs://sra-pub-run-1/. I would like to read this data without incurring additional costs, which I believe can be achieved by reading it from the same region as where the bucket is hosted. Unfortunately, I can't figure out in which region the bucket is hosted (NCBI mentions in their docs that's in the US, but not where in the US). So my questions are:
Is there a way to figure out in which region a bucket that I don't own, like gs://sra-pub-run-1/ is hosted?
Is my understanding correct that reading the data from instances in the same region is free of charge? What if the GCS bucket is multi-region?
Running a simple gsutil ls -b -L either provides no information (when listing a specific directory within sra-pub-run-1), or results in a permission-denied error if I try to list info on gs://sra-pub-run-1/ directly using:
gsutil -u metagraph ls -b gs://sra-pub-run-1/
You cannot specify a specific Compute Engine zone as a bucket location, but all Compute Engine VM instances in zones within a given region have similar performance when accessing buckets in that region.
Billing-wise, egressing data from Cloud Storage into a Compute Engine instance in the same location/region (for example, US-EAST1 to US-EAST1) is free, regardless of zone.
So, check the "Location constraint" of the GCS bucket (gsutil ls -Lb gs://bucketname); if it says "US-EAST1" and your GCE instance is also in US-EAST1, downloading data from that GCS bucket will not incur an egress fee.
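As a concrete sketch of that check (gs://bucketname and the instance names are placeholders; for a requester-pays bucket you would add -u <billing-project>, as in the question):
gsutil ls -Lb gs://bucketname | grep "Location constraint"
gcloud compute instances list --format="value(name,zone)"
If the bucket's location constraint matches the region portion of your instance's zone, the download should be free of egress charges.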

Bulk file restore from Google Cloud Storage

I accidentally ran a delete command on the wrong bucket. Object versioning is turned on, but I don't really understand what steps I should take to restore the files, or, more importantly, how to do it in bulk, as I've deleted a few hundred of them.
I will appreciate any help.
To restore hundreds of objects you could do something as simple as:
gsutil cp -AR gs://my-bucket gs://my-bucket
This will copy all objects (including deleted ones) to the live generation, using metadata-only copying, i.e., it does not require copying the actual bytes. Caveats:
It will leave the deleted generations in place, so they will keep costing you extra storage.
If your bucket isn't empty, this command will re-copy any live objects on top of themselves (ending up with an extra archived version of each of those as well, also costing you extra storage).
If you want to restore a large number of objects, this simplistic approach would run too slowly - you'd want to parallelize the individual gsutil cp operations. You can't use the gsutil -m option in this case, because gsutil prevents that in order to preserve generation ordering (e.g., if there were several generations of objects with the same name, copying them in parallel could end up with the live generation coming from an unpredictable generation). If you only have 1 generation of each, you could parallelize the copying by doing something like:
gsutil ls -a gs://my-bucket/** | sed 's/\(.*\)\(#[0-9]*\)/gsutil cp \1\2 \1 \&/' > gsutil_script.sh
This generates a listing of all objects (including deleted ones), and transforms it into a sequence of gsutil cp commands that copy those objects (by generation-specific name) back to the live generation in parallel. If the list is long you'll want to break it into parts so you don't (for example) try to fork 100k processes to do the parallel copying (which would overload your machine); one way to batch it is sketched below.
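A rough sketch of that batching, assuming the gsutil_script.sh generated above (the batch size of 100 lines and the chunk_ prefix are arbitrary choices):
split -l 100 gsutil_script.sh chunk_
for f in chunk_*; do
  # Run one batch of backgrounded gsutil cp commands, then wait for that batch to finish
  ( . "./$f"; wait )
done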

Performance of gsutil cp command has declined

We have observed that the speed of the gsutil cp command for copying a single file to Google Cloud Storage was better when only a few such processes were running, each copying a different single file to a different location in Google Cloud Storage. The normal speed at that time was ~50 Mbps. But as the number of "gsutil cp" processes copying single files to Google Cloud Storage has increased, the average speed these days has dropped to ~10 Mbps.
I suppose the "gsutil -m cp" command will not improve performance, as there is only 1 file to be copied.
What can this slowdown be attributed to as the number of gsutil cp processes increases? What can we do to increase the speed of these processes?
gsutil can upload a single large file in parallel. It does so by uploading parts of the file as separate objects in GCS, asking GCS to compose them together afterwards, and then deleting the individual sub-objects.
N.B. Because this involves uploading objects and then almost immediately deleting them, you shouldn't do this on Nearline buckets, since there's an extra charge for deleting objects that have been recently uploaded.
You can set a file size above which gsutil will use this behavior. Try this:
gsutil -o GSUtil:parallel_composite_upload_threshold=100M cp bigfile gs://your-bucket
More documentation on the feature is available here: https://cloud.google.com/storage/docs/gsutil/commands/cp#parallel-composite-uploads
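If you want this threshold applied by default rather than passed on each invocation, the same setting can also live in your .boto configuration file (the 100M value simply mirrors the command above):
[GSUtil]
parallel_composite_upload_threshold = 100M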

Composing objects with > 1024 parts without download/upload

Is there a way to either clear the compose count or copy an object inside cloud storage so as to remove the compose count without downloading and uploading again?
With a 5TB object size limit, I'd need 5GB pieces composed together with a 1024 compose limit -- are 5GB uploads even possible? They are certainly not easy to work with.
The compose count should be higher (1MM) or I should be able to copy an object within cloud storage to get rid of the existing compose count.
There is no longer a restriction on the component count. Composing > 1024 parts is allowed.
https://cloud.google.com/storage/docs/composite-objects
5GB uploads are definitely possible. You can use a tool such as gsutil to perform them easily.
There's not an easy way to reduce the existing component count, but it is possible using the Rewrite API. Per the documentation: "When you rewrite a composite object where the source and destination are different locations and/or storage classes, the result will be a composite object containing a single component."
So you can create a bucket with a different storage class, copy the composite object into it, copy it back to your original bucket, and delete the temporary copy. gsutil uses the Rewrite API under the hood, so you could do all of this with gsutil cp:
$ gsutil mb -c DRA gs://dra-bucket
$ gsutil cp gs://original-bucket/composite-obj gs://dra-bucket/composite-obj
$ gsutil cp gs://dra-bucket/composite-obj gs://original-bucket/composite-obj
$ gsutil rm gs://dra-bucket/composite-obj
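Assuming an in-place storage class change also counts as a "different storage class" for the Rewrite API (an assumption based on the documentation quoted above, worth verifying), a sketch that avoids the temporary bucket would be to flip the object's storage class and back with gsutil's rewrite command:
$ gsutil rewrite -s nearline gs://original-bucket/composite-obj
$ gsutil rewrite -s standard gs://original-bucket/composite-obj
Keep in mind that moving data out of Nearline before its 30-day minimum storage duration incurs an early deletion charge, similar to the Nearline caveat noted earlier on this page.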