I know that Google Cloud Storage has 4 storage options, and each option has a different "Minimum storage duration":
https://cloud.google.com/storage/docs/lifecycle?hl=vi
Standard Storage: None
Nearline Storage: 30 days
Coldline Storage: 90 days
Archive Storage: 365 days
What is the meaning of "Minimum storage duration"?
My guess is that "Minimum storage duration" is the time that your data is kept in Google Cloud Storage.
Is it the period after which your data will automatically be deleted if not used?
For example:
I use the Nearline Storage option (30 days) to store my data.
If I don't use this data within 30 days, it will be deleted.
If I use this data frequently, it will be stored until I delete my bucket.
Is my guess right?
If not, please tell me the correct meaning.
In order to understand the Minimum Storage Duration, it is necessary to know the concept of Storage classes first.
What is a storage class?
The storage class of an object or bucket affects the object's/bucket's
availability and pricing.
Depending on your use case and how frequently you access the data in a bucket, you may choose one of the available Storage Classes:
Standard Storage is used for data that is frequently accessed and/or
stored only for short periods of time.
Nearline Storage is a low-cost option for infrequently accessed data. It offers lower at-rest costs in exchange for lower availability, a 30-day minimum storage duration, and a cost for data access. It is suggested for use cases where you access your data once per month on average.
Coldline Storage is similar to Nearline, but offers even lower at-rest costs, again in exchange for lower availability, a 90-day minimum storage duration, and a higher cost for data access.
Archive Storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. It has no availability SLA, though its typical availability is comparable to Nearline Storage and Coldline Storage. Archive Storage also has higher costs for data access and operations, as well as a 365-day minimum storage duration.
You may find detailed information in the Storage Classes documentation.
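For example, the storage class is chosen when you create a bucket (it can also be set per object). A minimal sketch with the Python client library, using a made-up bucket name:

    from google.cloud import storage

    # Sketch only: bucket name and location are made up for illustration.
    client = storage.Client()
    bucket = client.bucket("my-nearline-backups")
    bucket.storage_class = "NEARLINE"   # or "STANDARD", "COLDLINE", "ARCHIVE"
    client.create_bucket(bucket, location="US")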
So what is the minimum storage duration?
A minimum storage duration applies to data stored using one of the above storage classes. You can delete the file before it has been stored for this duration, but at the time of deletion you are charged as if the file was stored for the minimum duration.
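As an illustration of that billing logic (the 30-day figure is Nearline's minimum storage duration; the per-GB price is an assumed example rate, not an official one):

    # Illustrative sketch only; prices are assumptions, not official rates.
    NEARLINE_MIN_DAYS = 30
    price_per_gb_month = 0.010   # assumed Nearline at-rest price, USD per GB per month

    def billed_days(days_actually_stored: int, minimum_days: int = NEARLINE_MIN_DAYS) -> int:
        """You are billed for at least the minimum duration, even if you delete earlier."""
        return max(days_actually_stored, minimum_days)

    # Deleting a 100 GB Nearline object after only 10 days is still billed as 30 days:
    days = billed_days(10)                        # -> 30
    cost = 100 * price_per_gb_month * days / 30   # ~$1.00 instead of ~$0.33
    print(f"Billed for {days} days: ~${cost:.2f}")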
Please note that minimum storage duration doesn't have to do with automatic deletion of objects.
If you would like to delete objects based on conditions such as the Age of an object, then you may set an Object Lifecycle policy for the target bucket. You may find an example of how to delete live versions of objects older than 30 days, here.
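For example, with the Python client library a delete-after-30-days rule could be added roughly like this (the bucket name is made up):

    from google.cloud import storage

    # A minimal sketch: add a lifecycle rule that deletes objects older than 30 days.
    client = storage.Client()
    bucket = client.get_bucket("my-example-bucket")
    bucket.add_lifecycle_delete_rule(age=30)   # delete live objects older than 30 days
    bucket.patch()                             # persist the updated lifecycle config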
It simply means that, even if you delete your objects before that minimum duration, it is assumed that you stored them for the minimum duration, and you are charged for that time.
If I don't use this data within 30 days, it will be deleted:
The objects will not be deleted unless you specify object lifecycle rules or delete them yourself.
If I use this data frequently, it will be stored until I delete my bucket:
If you use this data frequently, you will simply be charged at the rates specified for the storage class (Nearline Storage in your case).
I am looking into using Google Cloud Storage buckets as a cheaper alternative to Compute Engine snapshots to store backups.
However, I am a bit confused about the cost per operation, specifically the insert operation. If I understand the documentation correctly, it doesn't seem to matter how large the file you want to insert is; it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a standard storage class bucket, wait 14 days, then retrieve it again, and all this within the same region, I practically only pay for storing it for 14 days?
Doesn't that mean that even the standard storage class bucket is a more cost effective option for storing backups compared to snapshots, as long as you can get your whole thing into a single file?
It's not fully accurate; it all depends on what matters to you in terms of cost.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store one 20 TB file, but rather 4 objects; in the end, it's the same principle.
The persistent disk snapshot is a very powerful feature:
A snapshot doesn't need CPU time to be taken, unlike your solution.
A snapshot doesn't need network bandwidth to be taken, unlike your solution.
The snapshot can be done anytime, on the fly.
The snapshot can be restored in the current VM, or you can create a new VM with a snapshot to investigate on it, for example.
You can perform incremental snapshots saving money (cheaper than full image snapshot).
Snapshots don't need additional space on your persistent disk (unlike your solution, where you need to create an archive before sending it to Cloud Storage).
In your scenario, using snapshots seems like the best solution in terms of time efficiency. Now, is using Cloud Storage a cheaper solution? Probably, as it is listed as the most affordable storage option, but in the end you will have to calculate the cost-benefit on your own.
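As a starting point for that calculation, here is a back-of-the-envelope sketch; every price in it is an assumed example rate per GB per month, and retrieval, egress, operation and early-deletion charges are ignored:

    # Rough comparison of at-rest cost only; prices are assumed examples
    # (check the current GCP pricing pages for your region before deciding).
    data_gb = 20 * 1024   # ~20 TB of backups (stored as several <=5 TiB objects)
    days_kept = 14

    assumed_prices = {
        "persistent disk snapshot": 0.026,
        "Standard bucket": 0.020,
        "Nearline bucket": 0.010,   # note: billed for at least 30 days
    }

    for name, per_gb_month in assumed_prices.items():
        cost = data_gb * per_gb_month * days_kept / 30
        print(f"{name}: ~${cost:,.2f} for {days_kept} days at rest")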
I want to use the Google Colab free TPU with a custom dataset, which is why I need to upload it to GCS. I created a bucket in GCS and uploaded the dataset.
Also I read that there are two classes of operations with data in GCS: operation class A and operation class B [reference].
My questions are: does accessing dataset from GCS in Google Colab fall in one of these operation classes? What is average price you pay for using GCS for Colab TPU?
Yes, accessing the objects (files) in your GCS bucket can result in charges to your Billing Account, but there are some other factors that you might need to consider. Let me explain (sorry in advance for the very long answer):
Google Cloud Platform services use APIs behind the scene to perform multiple actions such as show, create, delete or edit certain resources.
Cloud Storage is no exception. As mentioned in the Cloud Storage docs, operations fall into two categories: the ones performed through the JSON API and the ones performed through the XML API.
All operations performed through the Cloud Console or the client libraries (the ones used to interact via code, with languages like Python, Java, PHP, etc.) are charged through the JSON API by default. Let's focus on this one.
Pay attention to the names of the methods under each operations column in the Cloud Storage pricing table:
The structure can be read as follows:
service.resource.action
Since all these methods are related to the Cloud Storage service, it is normal to see the storage service in all of them.
In the Class B operations column, the first method is storage.*.get. There is no other get method in the other columns, which means that retrieving information from a bucket (reading metadata) or from objects inside it (reading a file via code, downloading files, etc.) is counted under this method.
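For example, reading an object with the Python client library (bucket and object names below are made up) issues a storage.objects.get request, i.e. a Class B operation:

    from google.cloud import storage

    # Sketch with made-up names: downloading the object's content is a
    # storage.objects.get call (Class B), and it also generates network
    # egress when the data leaves Google Cloud (e.g. when read from Colab).
    client = storage.Client()
    blob = client.bucket("my-dataset-bucket").blob("train/dataset.tfrecord")
    data = blob.download_as_bytes()
    print(f"Read {len(data)} bytes")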
Before talking about how to calculate costs, let me add: Google Cloud Storage not only charges you for the action itself but also for the size of the data traveling over the network. Here are the 2 most common scenarios:
You are interacting with the files from another GCP service. Since it uses the internal GCP network, charges are not that big. If you decide to go with this, I would recommend to use resources (App Engine, Compute Engine, Kubernetes Engine, etc.) in the same location to avoid additional charges. Please check the Network egress charges within GCP.
You are interacting from an environment outside GCP. This is the scenario where you are interacting with other services like Google Colab (even when it is a Google service, it is outside the Cloud Platform). Please see the General network usage pricing for Cloud Storage.
Now, let's talk about the Storage classes, which can also affect the object's availability and pricing. Depending on where the bucket is created, you will be charged for the amount of stored Data as mentioned in the docs.
Even though the Nearline, Coldline and Archive classes are the cheapest ones in terms of storage, they charge you extra for retrieving data. This is because these classes are meant for data that is accessed infrequently.
I think we have covered everything, so we can move on to the important question: how much will all of this cost? It depends on your files' size, how often you interact with them, and the Storage class of your bucket.
Let's say you have 1 Standard bucket in North America with your dataset of 20 GB, and you read it from Google Colab 10 times a day. We can calculate the following:
Standard Storage: $0.020 per GB per month
$0.020 * 20 = $0.40 USD per month (roughly $0.013 per day)
Class B operations (per 10,000 operations) for Standard storage: $0.004
Given that you are only charged $0.004 per 10,000 operations, each operation costs $0.0000004 USD, so 10 operations per day cost $0.000004 USD.
Egress to Worldwide Destinations (excluding Asia & Australia): $0.12 per GB
$0.12 * 20 (the size of our file) = $2.40 USD per read
Reading this dataset 10 times per day: 2.40 * 10 = $24 USD per day
Given this example, you would pay roughly 0.013 + 0.000004 + 24 ≈ $24.01 USD per day, dominated by the egress. Another example can be found in the Pricing overview section.
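For reference, the same estimate expressed as a small script (using the example list prices above, which may have changed since, so treat the output as illustrative only):

    dataset_gb = 20
    reads_per_day = 10

    storage_per_gb_month = 0.020   # Standard storage, North America
    class_b_per_10k_ops = 0.004    # Class B operations
    egress_per_gb = 0.12           # worldwide egress (excl. Asia & Australia)

    storage_per_day = dataset_gb * storage_per_gb_month / 30          # ~$0.013
    ops_per_day = reads_per_day * class_b_per_10k_ops / 10_000        # ~$0.000004
    egress_per_day = dataset_gb * reads_per_day * egress_per_gb       # $24.00

    print(f"~${storage_per_day + ops_per_day + egress_per_day:.2f} USD per day")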
And finally, the good news: Google Cloud Storage offers Always Free usage limits that reset every month.
This means that if, during a whole month, you store less than 5 GB in a Standard class bucket, perform less than 50,000 Class B operations and less than 5,000 Class A operations, and send less than 1 GB over the network, you won't pay a thing.
Once you pass those limits, the charges start, i.e. if you have a dataset of 15 GB, you will only be charged for 10 GB.
I have files that accumulate in Blob Storage on Azure and are moved each hour to ADLS with Data Factory... there are around 1000 files per hour, and they are 10 to 60 KB each...
what is the best combination of:
"parallelCopies": ?
"cloudDataMovementUnits": ?
and also,
"concurrency": ?
to use?
Currently I have all of these set to 10, and each hourly slice takes around 5 minutes, which seems slow.
Could ADLS or Blob Storage be getting throttled, and how can I tell?
There won't be one solution that fits all scenarios when it comes to optimizing a copy activity. However, there are a few things you can check to find a balance. A lot of it depends on the pricing tiers, the type of data being copied, and the type of source and sink.
I am pretty sure that you would have come across this article.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
This is a reference performance sheet; the actual values will differ depending on the pricing tiers of your source and destination stores.
Parallel Copy:
This happens at the file level, so it is beneficial if your source files are big, as it chunks the data. From the article:
Copy data between file-based stores: between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
The default value is 4.
The copy behavior is also important: if it is set to mergeFiles, then parallel copy is not used.
Concurrency:
This is simply how many instances of the same activity you can run in parallel.
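For orientation, here is a rough sketch, written as a Python dict mirroring the activity JSON rather than a real pipeline definition, of where the three settings from the question sit; the values are arbitrary starting points to tune, not recommendations:

    # Not a runnable pipeline, just an illustration of where the knobs live.
    copy_activity = {
        "name": "HourlyBlobToAdls",          # hypothetical activity name
        "type": "Copy",
        "typeProperties": {
            "parallelCopies": 8,             # parallelism within one copy run
            "cloudDataMovementUnits": 8,     # compute units backing the copy
        },
        "policy": {
            "concurrency": 4,                # how many slices run at the same time
        },
    }
    print(copy_activity["typeProperties"])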
Other considerations:
Compression:
Codec
Level
The bottom line is that you can pick and choose the compression codec and level; faster compression will increase network traffic, while slower compression will increase the time consumed.
Region:
The location or region of the data factory, source and destination might affect performance, and especially the cost of the operation. Having them in the same region might not always be feasible depending on your business requirements, but it is definitely something you can explore.
Specific to Blobs
https://learn.microsoft.com/en-us/azure/storage/common/storage-performance-checklist#blobs
This article gives you a good number of metrics to improve performance; however, when using Data Factory I don't think there is much you can do at this level. You can use application monitoring to check throughput while your copy is running.
Google Cloud Platform has a minimum storage duration for various bucket types like Nearline and Coldline. During the minimum duration period for Nearline and Coldline buckets, can objects be modified, should they only be read from the bucket, or should they not be read or written at all?
You can modify (overwrite) or delete objects of Nearline or Coldline storage class, it just comes with a price, see details on early deletion here:
https://cloud.google.com/storage/pricing#early-deletion
Early deletion charges apply when you overwrite existing objects, since the original object is replaced by a new one.
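For example, here is a minimal sketch (with made-up bucket and object names) using the Python client library; the overwrite replaces the previous object, and that replacement is what can trigger the early-deletion charge:

    from google.cloud import storage

    # Overwriting uploads a new object under the same name, so the previous
    # object is replaced (effectively deleted). If it had not yet met the
    # 30-day Nearline minimum, an early-deletion charge applies for the
    # remaining days.
    client = storage.Client()
    blob = client.bucket("my-nearline-bucket").blob("reports/latest.csv")
    blob.upload_from_filename("reports/latest.csv")   # replaces the existing object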
I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails on Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it had a hard time scaling over that many users' data: hitting various resource limits, etc.)
Is it better to store one email per file in Cloud Storage, or all of a user's emails in one large file? In many examples about Cloud Storage, I see folks operating on large files, but it seems more logical to keep one file per email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
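For illustration, a minimal sketch of the manifest approach with the Python client library, using made-up bucket and object names:

    from google.cloud import storage

    # Record every object you upload and hand that list to the analytics job,
    # instead of having the job list the bucket.
    client = storage.Client()
    bucket = client.bucket("email-archive")

    uploaded = []
    for path in ["emails/0001.eml", "emails/0002.eml"]:   # one object per email
        bucket.blob(path).upload_from_filename(path)
        uploaded.append(path)

    # The manifest itself is just another object the computation can read.
    bucket.blob("manifests/run-001.txt").upload_from_string("\n".join(uploaded))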
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.