Fastest way to get Google Storage bucket size? - google-cloud-storage

I'm currently doing this, but it's VERY slow since I have several terabytes of data in the bucket:
gsutil du -sh gs://my-bucket-1/
And the same for a sub-folder:
gsutil du -sh gs://my-bucket-1/folder
Is it possible to somehow obtain the total size of a complete bucket (or a sub-folder) elsewhere or in some other fashion which is much faster?

The visibility for Google Cloud Storage here is pretty bad.
The fastest way is actually to pull the Stackdriver (Cloud Monitoring) metrics and look at the total size in bytes, i.e. the storage.googleapis.com/storage/total_bytes metric.
Unfortunately there is practically no filtering you can do in Stackdriver. You can't wildcard the bucket name, and the almost useless bucket resource labels are NOT aggregatable in Stackdriver metrics.
Also, this is bucket level only, not prefixes.
The Stackdriver metrics are only updated daily, so unless you can wait a day you can't use this to get the current size right now.
UPDATE: Stackdriver metrics now support user metadata labels, so you can label your GCS buckets and aggregate those metrics by the custom labels you apply.
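If you want to pull that metric programmatically rather than eyeball it in the console, below is a minimal sketch using the Cloud Monitoring Python client (google-cloud-monitoring); the project ID and bucket name are placeholders:
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 2 * 86400}, "end_time": {"seconds": now}}
)
series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "storage.googleapis.com/storage/total_bytes" '
            'AND resource.labels.bucket_name = "my-bucket-1"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    # Most recent point first; the value is the bucket size in bytes.
    print(ts.resource.labels["bucket_name"], ts.points[0].value.double_value)
Keep in mind the once-a-day update caveat above still applies to whatever this returns.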
Edit
I want to add a word of warning if you are creating monitors off of this metric. There is a really nasty bug with this metric right now.
GCP occasionally has platform issues that cause this metric to stop being written. I think it's tenant-specific (maybe?), so you also won't see it on their public health status pages. It also seems poorly documented for their internal support staff, because every time we open a ticket to complain they seem to think we are lying, and it takes some back and forth before they even acknowledge it's broken.
I think this happens if you have many buckets and something crashes on their end and stops writing metrics to your projects. It does not happen all the time, but we see it several times a year.
For example, it just happened to us again: the metric simply stopped being written in Stackdriver across all our projects.
Response from GCP support
Just adding the last response we got from GCP support during this most recent metric outage. I'll add that all our buckets were accessible; it was just that this metric was not being written:
The product team concluded their investigation stating that this was indeed a widespread issue, not tied to your projects only. This internal issue caused unavailability for some GCS buckets, which was affecting the metering systems directly, thus the reason why the "GCS Bucket Total Bytes" metric was not available.

Unfortunately, no. If you need to know what size the bucket is right now, there's no faster way than what you're doing.
If you need to check on this regularly, you can enable bucket logging. Google Cloud Storage will generate a daily storage log that you can use to check the size of the bucket. If that would be useful, you can read more about it here: https://cloud.google.com/storage/docs/accesslogs#delivery
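If you do enable the storage logs, parsing the latest one is straightforward. Here is a rough sketch with the Python client; the log bucket name and prefix are placeholders, and the storage_byte_hours field comes from the storage log format described in that documentation:
import csv
from google.cloud import storage

client = storage.Client()
# Storage logs are written as <bucket>_storage_<timestamp>_... objects.
logs = list(client.list_blobs("my-log-bucket", prefix="my-bucket-1_storage"))
latest = max(logs, key=lambda b: b.name)

row = next(csv.DictReader(latest.download_as_text().splitlines()))
avg_bytes = float(row["storage_byte_hours"]) / 24  # byte-hours over one day
print(f"~{avg_bytes / 1024**3:.1f} GiB average for that day")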

If the daily storage log you get from enabling bucket logging (per Brandon's suggestion) won't work for you, one thing you could do to speed things up is to shard the du request. For example, you could do something like:
gsutil du -s gs://my-bucket-1/a* > a.size &
gsutil du -s gs://my-bucket-1/b* > b.size &
...
gsutil du -s gs://my-bucket-1/z* > z.size &
wait
awk '{sum+=$1} END {print sum}' *.size
(assuming your subfolders are named starting with letters of the English alphabet; if not, you'd need to adjust how you run the above commands).

Use the built in dashboard
Operations -> Monitoring -> Dashboards -> Cloud Storage
The graph at the bottom shows the bucket size for all buckets, or you can select an individual bucket to drill down.
Note that the metric is only updated once per day.

With Python you can get the size of your bucket as follows:
from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs(bucket_or_name='name_of_your_bucket')

blobs_total_size = 0
for blob in blobs:
    blobs_total_size += blob.size  # size in bytes

print(blobs_total_size / (1024 ** 3))  # size in GB

Google Console
Platform -> Monitoring -> Dashboard -> Select the bucket
Scroll down to see the object size for that bucket.

I found that the CLI was frequently timing out for me, though that may be because I was working with Coldline storage.
For a GUI solution, look at CloudBerry Explorer, which gives you a GUI view of your storage.

For me, the following command helped:
gsutil ls -l gs://{bucket_name}
It then gives output like this after listing all files:
TOTAL: 6442 objects, 143992287936 bytes (134.1 GiB)

Related

Google Cloud cloud storage operation costs

I am looking into using Google Cloud Storage buckets as a cheaper alternative to Compute Engine snapshots to store backups.
However, I am a bit confused about the costs per operation, specifically the insert operation. If I understand the documentation correctly, it doesn't seem to matter how large the file you want to insert is; it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a standard storage class bucket, wait 14 days, then retrieve it again, and all this within the same region, I practically only pay for storing it for 14 days?
Doesn't that mean that even the standard storage class bucket is a more cost effective option for storing backups compared to snapshots, as long as you can get your whole thing into a single file?
That's not fully accurate, and it all depends on what the costs are for you.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store a single 20 TB file; you'd need at least 4 objects, but in the end it's the same principle.
The persistent disk snapshot is a very powerful feature:
A snapshot doesn't consume CPU, compared to your solution.
A snapshot doesn't consume network bandwidth, compared to your solution.
A snapshot can be taken anytime, on the fly.
A snapshot can be restored to the current VM, or you can create a new VM from a snapshot to investigate it, for example.
You can perform incremental snapshots, saving money (cheaper than a full image snapshot).
You don't need additional space on your persistent disk (compared to your solution, where you need to create an archive before sending it to Cloud Storage).
In your scenario, snapshots seem like the best solution in terms of time efficiency. Now, is Cloud Storage a cheaper solution? Probably, as it is listed as the most affordable storage option, but in the end you will have to calculate the cost-benefit on your own.
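To make that concrete, here is a quick back-of-the-envelope sketch; the per-GB price is purely illustrative (check the current pricing page for your region and storage class):
size_gb = 20 * 1024          # ~20 TB, split across several <=5 TiB objects
price_per_gb_month = 0.020   # assumed regional Standard rate, illustration only
days = 14

storage_cost = size_gb * price_per_gb_month * (days / 30)
print(f"~${storage_cost:.0f} for two weeks of storage")
# The handful of insert/get operations on a few large objects costs a fraction
# of a cent, so storage (plus any egress) dominates the bill.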

Google Storage - Backup Bucket with Different Key For Each File

I need to backup a bucket in which every file is encrypted with a different key to a different bucket on Google Storage.
I want to create a daily snapshot of the data so in a case where the data has been deleted I could easily recover it.
My Research:
Using gsutil cp -r: because every file has a different key, it does not work.
Using the Google Cloud Storage Transfer Service: does not work on such buckets for the same reason.
Listing all the files in the bucket, fetching all the keys from the database, and copying each file: this will probably be very expensive to do because I have a lot of files and I want to do it daily.
Object versioning: does not cover the case where the bucket has been completely deleted.
Are there any other solutions for that problem?
Unfortunately, as you mentioned, the only option would indeed be your option 3. As you said, and as clarified in the official documentation, downloading encrypted data is a restricted feature, so you won't be able to download/snapshot the data without fetching the keys and then copying the files.
Indeed, this will probably have a big impact on your quota and pricing, since you will be performing multiple operations every day for many files, which affects several aspects of the pricing. However, this seems to be the only available way right now. In addition, I would recommend raising a Feature Request in Google's Issue Tracker, so they can look into the possibility of implementing this in the future.
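If you do end up scripting option 3, a rough sketch with the Python client could look like the following; get_key_for() stands in for whatever lookup you do against your key database, and the bucket names are placeholders:
from google.cloud import storage

def get_key_for(object_name):
    # Placeholder: return the 32-byte customer-supplied key for this object.
    raise NotImplementedError

client = storage.Client()
src_bucket = client.bucket("my-encrypted-bucket")
dst_bucket = client.bucket("my-encrypted-bucket-backup")

for item in client.list_blobs(src_bucket):
    key = get_key_for(item.name)
    source = src_bucket.blob(item.name, encryption_key=key)
    dest = dst_bucket.blob(item.name, encryption_key=key)

    # rewrite() copies server-side; large objects may need several calls.
    token, _, _ = dest.rewrite(source)
    while token is not None:
        token, _, _ = dest.rewrite(source, token=token)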
Let me know if this clarified your doubts!

DynamoDB vs ElasticSearch vs S3 - which service to use for superfast get/put 10-20MB files?

I have a backend that receives, stores, and serves 10-20 MB JSON files. Which service should I use for super-fast put and get (I cannot break the files into smaller chunks)? I don't have to run queries on these files, just get them, store them, and serve them instantly. The service should scale to tens of thousands of files easily. Ideally, I should be able to put a file in 1-2 seconds and retrieve it in the same time.
I feel S3 is the best option and Elasticsearch the second best. DynamoDB doesn't allow such object sizes. What should I use? Also, is there any other service? MongoDB is a possible solution, but I don't see it offered on AWS, so something quick to set up would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, what you want is to store and serve the files, not dig into their content, which is where both DynamoDB and ES would waste time.
My suggestion is to use AWS Lambda + S3 to optimize for cost.
S3 can have a short delay after a PUT before the object is available, though (it gets bigger, minutes even, when you have millions of objects in a bucket).
If that delay matters for your operation and the total throughput at any given moment is not too huge, you can create a server (preferably EC2) that serves as a temporary file stash; see the sketch after this list. It will:
Receive your file
Try to upload it to S3
If the file is requested before it's available on S3, serve the file from disk
If the file is successfully uploaded to S3, serve the S3 URL and delete the file on disk
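A rough sketch of that stash idea with boto3; the bucket name, stash directory, and flat key names are assumptions:
import os
import shutil
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-json-store"   # placeholder
STASH = "/var/stash"       # placeholder local directory

s3 = boto3.client("s3")

def put_doc(key, local_path):
    os.makedirs(STASH, exist_ok=True)
    shutil.copy2(local_path, os.path.join(STASH, key))  # keep serving from disk
    s3.upload_file(local_path, BUCKET, key)             # 10-20 MB: a second or two

def get_doc(key):
    try:
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except ClientError:
        # Not visible in S3 yet (or a transient error): fall back to the stash.
        with open(os.path.join(STASH, key), "rb") as f:
            return f.read()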

Is there a less costly option than gsutil rsync for backup to cloud storage?

We are currently utilizing the new coldline storage to backup files off site, the storage part is super cost effective. We are using gsutil rsync once a day to make sure our coldline storage is up to date.
The problem is that using gsutil rsync creates a massive number of class A requests, which are quite expensive. In this case it would be at least 5x the amount of the coldline storage making it no longer a good deal.
Are we going to have to write a custom solution to avoid these charges, is there a better option for this type of backup, or is there some way to get rsync to not generate so many requests?
I think there is one pricing trick you can use if the expense is from the storage.objects.list operation. From the GCS operations pricing page: "When an operation applies to a bucket, such as listing the objects in a bucket, the default storage class set for that bucket determines the operation cost." So the trick is:
Set the bucket's default storage class to STANDARD
On upload, set the object's storage class to ARCHIVE (using gsutil cp -s ARCHIVE, or by setting the appropriate option on the upload request when using the API).
As far as I understand: this now means you will be charged at the STANDARD rate for listing the bucket ($0.05/10k operations instead of $0.50/10k operations).
The one challenge is that gsutil rsync does not support the -s ARCHIVE flag that is supported by gsutil cp, so you won't be able to use it. You might want to look at other tools, like possibly rclone.
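If you script the uploads yourself instead of using gsutil rsync, the same trick in the Python client would look roughly like this; bucket and file names are placeholders, and I'm assuming current client versions let you set the per-object storage class before upload (which is what gsutil cp -s ARCHIVE does):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-coldline-backups")

# Bucket-level operations such as listing are billed at the default class rate.
bucket.storage_class = "STANDARD"
bucket.patch()

# The objects themselves are still stored (and billed) as ARCHIVE.
blob = bucket.blob("daily/2021-06-01.tar.gz")
blob.storage_class = "ARCHIVE"  # assumption: set before upload, like gsutil cp -s
blob.upload_from_filename("/backups/daily-2021-06-01.tar.gz")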

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails in Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it had a hard time scaling over that many users' data, hitting various resource limits, etc.)
Is it better to store 1 email per file in Cloud Storage, or all of a user's emails in one large file? In many examples about Cloud Storage, I see folks operating on large files, but it seems more logical to keep 1 file per email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.
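If you go the many-objects route, the manifest idea from the first paragraph can be as simple as the sketch below; the bucket name and local paths are placeholders:
import glob
import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("email-analytics-input")

uploaded = []
for path in glob.glob("/data/emails/user123/*.eml"):
    name = "emails/user123/" + os.path.basename(path)
    bucket.blob(name).upload_from_filename(path)
    uploaded.append("gs://" + bucket.name + "/" + name)

# Hand the manifest to the computation instead of having it list the bucket,
# so an eventually consistent listing can't silently miss fresh uploads.
bucket.blob("manifests/user123.txt").upload_from_string("\n".join(uploaded))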