Google Cloud Platform has a minimum storage duration for certain bucket storage classes such as Nearline and Coldline. During that minimum duration, can objects in Nearline or Coldline buckets be modified, should they only be read, or should they be neither read nor written until the minimum storage duration has passed?
You can modify (overwrite) or delete objects of the Nearline or Coldline storage class; it just comes at a price. See the details on early deletion here:
https://cloud.google.com/storage/pricing#early-deletion
Early deletion charges apply when you overwrite existing objects, since the original object is replaced by a new one.
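As a rough illustration with the google-cloud-storage Python client (bucket and file names are placeholders): overwriting uploads a new object in place of the old one, and the replaced original is what triggers the early deletion charge.

    from google.cloud import storage

    # Bucket and file names are placeholders; credentials come from the environment.
    client = storage.Client()
    bucket = client.bucket("my-nearline-bucket")     # hypothetical Nearline bucket
    blob = bucket.blob("backups/db.dump")

    blob.upload_from_filename("db-monday.dump")      # original object starts its 30-day clock
    blob.upload_from_filename("db-tuesday.dump")     # overwrite: the original is replaced early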
Related
I am looking into using Google Cloud cloud storage buckets as a cheaper alternative to compute engine snapshots to store backups.
However, I am a bit confused about the cost per operation, specifically the insert operation. If I understand the documentation correctly, it doesn't matter how large the file you insert is; it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a standard storage class bucket, wait 14 days, then retrieve it again, and all this within the same region, I practically only pay for storing it for 14 days?
Doesn't that mean that even the standard storage class bucket is a more cost effective option for storing backups compared to snapshots, as long as you can get your whole thing into a single file?
That's not fully accurate, and it all depends on what matters to you in terms of cost.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store one 20 TB file; you'd need at least 4 objects, but in the end the principle is the same.
The persistent disk snapshot is a very powerful feature:
A snapshot doesn't consume CPU on your VM, unlike your solution.
A snapshot doesn't consume your VM's network bandwidth, unlike your solution.
A snapshot can be taken at any time, on the fly.
A snapshot can be restored to the current VM, or you can create a new VM from a snapshot to investigate it, for example.
You can take incremental snapshots, which saves money (cheaper than a full image snapshot).
You don't need additional space on your persistent disk (compared to your solution, where you need to create an archive before sending it to Cloud Storage).
In your scenario, snapshots seem like the best solution in terms of time efficiency. Now, is using Cloud Storage a cheaper solution? Probably, as it is listed as the most affordable storage option, but in the end you will have to work out the cost-benefit balance on your own.
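That said, if you do end up scripting snapshots rather than archiving to Cloud Storage, a minimal sketch using the Compute Engine API through google-api-python-client might look like this (project, zone, disk and snapshot names are placeholders):

    from googleapiclient import discovery

    # Uses Application Default Credentials; all names below are placeholders.
    compute = discovery.build("compute", "v1")

    operation = compute.disks().createSnapshot(
        project="my-project",
        zone="us-central1-a",
        disk="my-data-disk",
        body={"name": "my-data-disk-backup-20240101"},
    ).execute()

    print("Snapshot operation started:", operation["name"])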
I want to use the free TPU in Google Colab with a custom dataset, which is why I need to upload it to GCS. I created a bucket in GCS and uploaded the dataset.
Also I read that there are two classes of operations with data in GCS: operation class A and operation class B [reference].
My questions are: does accessing the dataset from GCS in Google Colab fall into one of these operation classes? What is the average price you pay for using GCS with a Colab TPU?
Yes, accessing the objects (files) in your GCS bucket can result in charges to your billing account, but there are some other factors that you might need to consider. Let me explain (sorry in advance for the very long answer):
Google Cloud Platform services use APIs behind the scenes to perform actions such as showing, creating, deleting or editing resources.
Cloud Storage is no exception. As mentioned in the Cloud Storage docs, operations fall into two groups: the ones performed through the JSON API and the ones performed through the XML API.
Operations performed through the Cloud Console or the client libraries (the ones used to interact via code in languages like Python, Java, PHP, etc.) are charged as JSON API operations by default. Let's focus on this one.
I want you to pay attention to the names of the methods under each Operations column:
The structure can be read as follows:
service.resource.action
Since all these methods are related to the Cloud Storage service, it is normal to see the storage service in all of them.
In the Class B operations column, the first method is storage.*.get. There is no other get method in the other columns, which means that retrieving information from a bucket (reading its metadata) or from the objects inside it (reading a file via code, downloading files, etc.) is counted under this method.
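For example, a read like the following (a minimal sketch with the google-cloud-storage client, where the bucket and object names are placeholders) would be counted as a Class B operation:

    from google.cloud import storage

    # Bucket and object names are placeholders; credentials come from the environment.
    client = storage.Client()
    bucket = client.bucket("my-dataset-bucket")
    blob = bucket.blob("datasets/train.tfrecord")

    data = blob.download_as_bytes()    # storage.objects.get -> billed as a Class B operation
    print(f"Downloaded {len(data)} bytes")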
Before talking about how to calculate costs, let me add that Google Cloud Storage not only charges you for the action itself but also for the size of the data traveling across the network. Here are the 2 most common scenarios:
You are interacting with the files from another GCP service. Since this uses the internal GCP network, the charges are not that big. If you decide to go this way, I would recommend using resources (App Engine, Compute Engine, Kubernetes Engine, etc.) in the same location to avoid additional charges. Please check the Network egress charges within GCP.
You are interacting from an environment outside GCP. This is the scenario when you interact from other services like Google Colab (even though it is a Google service, it is outside the Cloud Platform). Please see the General network usage pricing for Cloud Storage.
Now, let's talk about the Storage classes, which can also affect the object's availability and pricing. Depending on where the bucket is created, you will be charged for the amount of stored Data as mentioned in the docs.
Even though the Nearline, Coldline and Archive classes are the cheapest ones in terms of storage, they charge you extra for retrieving data. This is because these classes are meant to store data that is accessed infrequently.
I think we have covered everything, so we can now move on to the important question: how much will all of this cost? It depends on the size of your files, how often you interact with them, and the storage class of your bucket.
Let's say you have 1 Standard bucket in North America with your 20 GB dataset, and you read it from Google Colab 10 times a day. We can calculate the following:
Standard Storage: $0.020 per GB per month
$0.020 * 20 = $0.40 USD per month
Class B operations (per 10,000 operations) for Standard storage: $0.004
Given that you are charged $0.004 per 10,000 operations, each operation costs $0.0000004 USD, so 10 reads per day cost $0.000004 USD.
Egress to Worldwide Destinations (excluding Asia & Australia): $0.12 per GB
$0.12 * 20 (the size of our file) = $2.40 USD per read
Reading this dataset 10 times per day: 2.40 * 10 = $24 USD per day
Given this example, you would pay roughly 0.000004 + 24 = $24.000004 USD per day in operations and egress, plus $0.40 USD per month for storage. Another example can be found in the Pricing overview section.
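If it helps, here is that arithmetic as a small sketch over a full month, using the example rates assumed above rather than authoritative prices:

    # Rough monthly estimate under the example rates assumed above.
    # Prices are illustrative; check the Cloud Storage pricing page for your region.
    DATASET_GB = 20
    READS_PER_DAY = 10
    DAYS = 30

    STORAGE_PER_GB_MONTH = 0.020       # Standard storage, example rate
    CLASS_B_PER_OP = 0.004 / 10_000    # Class B operations
    EGRESS_PER_GB = 0.12               # egress to worldwide destinations (excl. Asia/Australia)

    storage = DATASET_GB * STORAGE_PER_GB_MONTH
    operations = READS_PER_DAY * DAYS * CLASS_B_PER_OP
    egress = DATASET_GB * EGRESS_PER_GB * READS_PER_DAY * DAYS

    print(f"Storage:    ${storage:.2f} / month")
    print(f"Operations: ${operations:.6f} / month")
    print(f"Egress:     ${egress:.2f} / month")
    print(f"Total:     ~${storage + operations + egress:.2f} / month")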
And finally, the good news: Google Cloud Storage offers Always Free usage limits that reset every month.
This means that if, during a whole month, you store less than 5 GB in a Standard class bucket, perform fewer than 50,000 Class B operations and fewer than 5,000 Class A operations, and send less than 1 GB over the network, you won't pay a thing.
Once you pass those limits, the charges start, i.e. if you store a 15 GB dataset, you will only be charged for 10 GB of it.
I know that Google Cloud Storage has 4 storage options and each option has a different "Minimum storage duration":
https://cloud.google.com/storage/docs/lifecycle?hl=vi
Standard Storage: None
Nearline Storage: 30 days
Coldline Storage: 90 days
Archive Storage: 365 days
What is the meaning of "Minimum storage duration"?
I guess that "Minimum storage duration" is the length of time your data has been kept in Google Cloud Storage.
Is it the period after which your data will automatically be deleted if not used?
Such as:
I use the Nearline Storage option (30 days) to store my data.
If I don't use this data within 30 days, it will be deleted.
If I use this data frequently, it will be stored until I delete my bucket.
Is my guess right?
If not, please tell me how it actually works.
In order to understand the Minimum Storage Duration, it is necessary to know the concept of Storage classes first.
What is a storage class?
The storage class of an object or bucket affects the object's/bucket's availability and pricing.
Depending on your use case and how frequently you access the data in a bucket, you may choose one of the available Storage Classes:
Standard Storage is used for data that is frequently accessed and/or stored only for short periods of time.
Nearline Storage is a low-cost option for infrequently accessed data. It offers lower at-rest costs in exchange for lower availability, a 30-day minimum storage duration, and a cost for data access. It is suggested for use cases where you access your data about once per month on average.
Coldline Storage is similar to Nearline, but offers even lower at-rest costs, again in exchange for lower availability, a 90-day minimum storage duration and a higher cost for data access.
Archive Storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. It has no availability SLA, though its typical availability is comparable to that of Nearline Storage and Coldline Storage. Archive Storage also has higher costs for data access and operations, as well as a 365-day minimum storage duration.
You may find detailed information in the Storage Classes documentation.
So what is the minimum storage duration?
A minimum storage duration applies to data stored using one of the above storage classes. You can delete the file before it has been stored for this duration, but at the time of deletion you are charged as if the file was stored for the minimum duration.
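A quick worked example of that charging rule, using an assumed Nearline rate of $0.010 per GB per month purely for illustration:

    # Assumed Nearline rate, purely for illustration; check pricing for your region.
    RATE_PER_GB_MONTH = 0.010
    SIZE_GB = 100
    MIN_DAYS = 30                 # Nearline minimum storage duration

    days_kept = 10                # object deleted after 10 days
    billed_days = max(days_kept, MIN_DAYS)

    cost = SIZE_GB * RATE_PER_GB_MONTH * billed_days / 30
    print(f"Billed as if stored for {billed_days} days: ~${cost:.2f}")
    # You pay for the full 30 days (~$1.00), not just the 10 days the object existed.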
Please note that minimum storage duration doesn't have to do with automatic deletion of objects.
If you would like to delete objects based on conditions such as the age of an object, you can set an Object Lifecycle policy on the bucket that holds them. You may find an example of how to delete live versions of objects older than 30 days here.
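As a rough sketch of how such a rule might be added with the google-cloud-storage Python client (the bucket name is a placeholder, and this is not the exact example linked above):

    from google.cloud import storage

    # Bucket name is a placeholder; credentials come from the environment.
    client = storage.Client()
    bucket = client.get_bucket("my-backup-bucket")

    bucket.add_lifecycle_delete_rule(age=30, is_live=True)  # delete live objects older than 30 days
    bucket.patch()                                          # persist the updated lifecycle config

    print(list(bucket.lifecycle_rules))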
It simply means that even if you delete your objects before that minimum duration, you are billed as if you had stored them for the full minimum duration.
If I don't use this data within 30 days, it will be deleted:
It will not be deleted unless you specify object lifecycle rules or delete the objects yourself.
If I use this data frequently, it will be stored until I delete my bucket.
If you use this data frequently, you will simply be charged at the rates specified for the storage class (Nearline Storage in your case).
Question
When Google Cloud Storage says that all GCS Buckets share a common namespace (paragraph 2 here)
When you create a bucket, you specify a globally-unique name, a geographic location where the bucket and its contents are stored, and a default storage class. The default storage class you choose applies to objects added to the bucket that don't have a storage class specified explicitly.
and (bullet 1 here)
Bucket names reside in a single Cloud Storage namespace.
This means that:
Every bucket name must be unique.
Bucket names are publicly visible.
If you try to create a bucket with a name that already belongs to an existing bucket, Cloud Storage responds with an error message. However, once you delete a bucket, you or another user can reuse its name for a new bucket. The time it takes a deleted bucket's name to become available again is typically on the order of seconds; however, if you delete the project that contains the bucket, which effectively deletes the bucket as well, the bucket name may not be released for weeks or longer.
does "single namespace" and "globally" literally mean that across the entire Google Cloud regardless of your organization, project, or region you cannot create any bucket that shares a name with another bucket anywhere else on the entire planet for all existing buckets at any given time?
I have only ever worked on GCP within one organization, and we prefix our buckets with the organization name lots of times but also sometimes we don't. I am not concerned with running out of names or anything like that, I am more just curious what is meant by those things "globally" and "single namespace" and if it means what I think it does.
Given that most times buckets are only referenced by their name with the gs:// prefix I can see how having literal global uniqueness is important for ensuring consistent access experiences without needing to know things like project/organization IDs. Can anybody find a source that confirms this?
Odd Implication/thought experiment
If this is the case, something I do wonder, given that...
There appears to be no cost associated with creating empty buckets you don't use, up to 5k buckets a month; beyond that it is $0.05 per 10k buckets (source)
There is no limit to the number of buckets you can create in a project (source)
... what is to stop me from creating free-tier project(s) and iterating over ALL possible GCS bucket names (obviously it would take forever and be quite impractical), in theory occupying every bucket name (or at least every human-readable one) and selling those names to individuals or organizations who wish to use them, if the bucket name does not already exist? I suppose the number of possibilities is astronomical, which means that to make this profitable one would need to know the ratio of names that would actually be bought to the list of available ones in order to price them rationally, and even at $0.05 per 10k (with the first 5k a month being free) there is not enough money in the world to create all of the buckets at once. Still, I think about these things.
Yes, "single namespace" and "globally" mean what you said: All GCS buckets must have unique names, regardless of organization, project, and region.
I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails in Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried the App Engine Datastore, but it had a hard time scaling over that much user data, hitting various resource limits, etc.)
Is it better to store 1 email per file in Cloud Storage, or all of a user's emails in one large file? In many examples about Cloud Storage I see folks operating on large files, but it seems more logical to keep 1 file per email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.
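Here is a minimal sketch of the manifest idea, assuming the google-cloud-storage client with placeholder bucket/object names: upload each email as its own object, record the uploaded names, and write them to a manifest object that the computation reads instead of listing the bucket.

    from google.cloud import storage

    # Bucket/object names are placeholders; `emails` stands in for your real data source.
    client = storage.Client()
    bucket = client.bucket("email-analytics-bucket")

    emails = {"user1/msg-0001.eml": b"...", "user1/msg-0002.eml": b"..."}

    uploaded = []
    for name, body in emails.items():
        bucket.blob(name).upload_from_string(body)    # one object per email
        uploaded.append(name)

    # The manifest is just a newline-separated list of object names the job will read.
    bucket.blob("manifests/run-001.txt").upload_from_string("\n".join(uploaded))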