Google Cloud Storage for Google Colab TPU pricing [closed] - google-cloud-storage

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
I want to use the free TPU in Google Colab with a custom dataset, which is why I need to upload it to GCS. I created a bucket in GCS and uploaded the dataset.
I also read that there are two classes of operations on data in GCS: class A operations and class B operations [reference].
My questions are: does accessing the dataset from GCS in Google Colab fall into one of these operation classes? And what is the average price you pay for using GCS with a Colab TPU?

Yes, accessing the objects (files) in your GCS bucket can result in charges to your billing account, but there are some other factors you need to consider. Let me explain (sorry in advance for the very long answer):
Google Cloud Platform services use APIs behind the scenes to perform actions such as showing, creating, deleting, or editing resources.
Cloud Storage is no exception. As mentioned in the Cloud Storage docs, operations fall into two groups: the ones performed through the JSON API and the ones performed through the XML API.
Operations performed through the Cloud Console or the client libraries (the ones used to interact via code in languages like Python, Java, PHP, etc.) are charged as JSON API operations by default. Let's focus on those.
Pay attention to the names of the methods under each operations column:
The structure can be read as follows:
service.resource.action
Since all of these methods belong to the Cloud Storage service, it is normal to see storage as the service in all of them.
In the Class B operations column, the first method is storage.*.get. There is no other get method in the other columns, which means that retrieving information from a bucket (reading its metadata) or from the objects inside it (reading a file via code, downloading files, etc.) counts as this method.
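As a concrete illustration, here is a minimal Python sketch (the bucket and object names are hypothetical) where the download issues a storage.objects.get request, i.e. one Class B operation, plus network egress for the object's size:

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()                      # uses your default credentials
bucket = client.bucket("my-dataset-bucket")    # hypothetical bucket name
blob = bucket.blob("train/dataset.tfrecord")   # hypothetical object name

# The call below is one storage.objects.get request (a Class B operation);
# the bytes transferred are billed separately as network egress.
blob.download_to_filename("/tmp/dataset.tfrecord")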
Before talking about how to calculate costs, let me add that Google Cloud Storage charges you not only for the operation itself but also for the size of the data traveling over the network. Here are the 2 most common scenarios:
You are interacting with the files from another GCP service. Since this uses the internal GCP network, the charges are not that big. If you go this route, I recommend using resources (App Engine, Compute Engine, Kubernetes Engine, etc.) in the same location to avoid additional charges. Please check the network egress charges within GCP.
You are interacting from an environment outside GCP. This is your scenario: you are interacting from a service like Google Colab (even though it is a Google service, it runs outside the Cloud Platform). Please see the general network usage pricing for Cloud Storage; a small sketch of this case follows below.
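For reference, this is roughly how a Colab notebook typically streams a dataset from GCS for TPU training (the bucket path is hypothetical); every read issues Class B operations and transfers the bytes as egress from GCS to Colab:

from google.colab import auth   # available inside Colab only
import tensorflow as tf

auth.authenticate_user()  # grants the notebook access to your GCS buckets

# Stream TFRecords straight from the bucket; each file read issues object
# get requests (Class B) and is billed for the bytes moved out of GCS.
dataset = tf.data.TFRecordDataset("gs://my-dataset-bucket/train/dataset.tfrecord")
for record in dataset.take(1):
    print(len(record.numpy()), "bytes in the first record")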
Now let's talk about storage classes, which also affect an object's availability and pricing. Depending on where the bucket is created and its class, you will be charged for the amount of stored data, as mentioned in the docs.
Even though the Nearline, Coldline, and Archive classes are the cheapest ones in terms of storage, they charge an extra fee for retrieving data. This is because these classes are meant for data that is accessed infrequently.
I think we have covered everything, so we can move on to the important question: how much will all of this cost? It depends on the size of your files, how often you interact with them, and the storage class of your bucket.
Let's say you have one Standard bucket in North America with a 20 GB dataset and you read it from Google Colab 10 times a day. We can calculate the following:
Standard Storage: $0.020 per GB per month
$0.020 * 20 GB = $0.40 USD per month
Class B operations (Standard class): $0.004 per 10,000 operations
Given that you are charged only $0.004 per 10,000 operations, each operation costs $0.0000004 USD, so 10 operations cost $0.000004 USD.
Egress to worldwide destinations (excluding Asia & Australia): $0.12 per GB
$0.12 * 20 GB (the size of the file) = $2.40 USD per read
Reading the dataset 10 times per day: $2.40 * 10 = $24 USD
Given this example, you would pay roughly 0.000004 + 24 = $24.000004 USD per day in operations and egress, plus $0.40 USD per month in storage. Another example can be found in the Pricing overview section.
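Here is a small, self-contained Python sketch of that arithmetic; the unit prices are the ones quoted above and may have changed since, so adjust them against the current pricing page:

# Rough cost estimate for the example above.
# Unit prices are taken from the text and are assumptions; verify them.
STORAGE_USD_PER_GB_MONTH = 0.020   # Standard storage, North America
CLASS_B_USD_PER_10K_OPS = 0.004    # Class B operations, Standard class
EGRESS_USD_PER_GB = 0.12           # egress, worldwide excl. Asia/Australia

dataset_gb = 20
reads_per_day = 10

storage_per_month = dataset_gb * STORAGE_USD_PER_GB_MONTH
ops_per_day = reads_per_day * CLASS_B_USD_PER_10K_OPS / 10_000
egress_per_day = reads_per_day * dataset_gb * EGRESS_USD_PER_GB

print(f"storage:     ${storage_per_month:.2f} per month")
print(f"class B ops: ${ops_per_day:.6f} per day")
print(f"egress:      ${egress_per_day:.2f} per day")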
And finally, the good news: Google Cloud Storage offers Always Free usage limits that reset every month.
This means that if, during a whole month, you store less than 5 GB in a Standard class bucket, perform fewer than 50,000 Class B operations and fewer than 5,000 Class A operations, and send less than 1 GB over the network, you won't pay a thing.
Once you pass those limits the charges start, i.e. if you store a 15 GB dataset, you will only be charged for 10 GB.

Related

What is better for Reports, Data Analytics and BI: AWS Redshift or AWS ElasticSearch?

We have 4 applications: A, B, C, and D. The applications scrape different social-network data from different sources. Each application has its own database.
Application A scrapes e.g. Instagram accounts, Instagram posts, and Instagram stories from external source X.
Application B scrapes e.g. Instagram account follower and following count history from external source Y.
Application C scrapes e.g. Instagram account audience data (e.g. gender statistics: male vs. female, age statistics, country statistics, etc.) from external source Z.
Application D scrapes TikTok data from external source W.
Our data analytics team has to create different kinds of analyses:
e.g. data (a table) with Instagram post engagement (likes + post / total number of followers for that month) for specific Instagram accounts.
e.g. Instagram account development: total number of followers per month, total number of posts per month, average post engagement per month, etc.
e.g. account follower insights: we analyze just a sample of an Instagram account's followers, e.g. 5,000 out of 1,000,000. We analyze who our followers follow besides us, and produce a top 10 of followings.
and a lot of other similar reports.
Right now we have 3 TB of data in our OLTP Postgres DB, and it is not a workable solution for us anymore. We are running really heavy queries for reporting and BI, and we want to move the social-network data to a data warehouse or OpenSearch.
We are on AWS and we want to use Redshift or OpenSearch for our data analysis.
We don't need real-time processing. What is the better solution for us, Redshift or OpenSearch?
Any ideas are welcome.
I expect to have infrastructure that is able to run heavy queries for the data analytics team for reporting and BI.
Based on what you've described, it sounds like AWS Redshift would be a better fit for your needs. Redshift is designed for data warehousing and can handle large-scale data processing, analysis, and reporting, which aligns with your goal of analyzing large amounts of data from multiple sources. Redshift also offers advanced query optimization capabilities, which can help your team run complex queries more efficiently.
OpenSearch, on the other hand, is a search and analytics engine designed for full-text search and near-real-time log and event analytics. While it excels at those workloads, it may not be the best fit for your use case, which involves aggregating and reporting over structured data from different sources.
When it comes to infrastructure, it's important to consider the size of your data, the complexity of your queries, and the number of users accessing the system. Redshift can scale to handle large amounts of data, and you can choose the appropriate node type and cluster size based on your needs. You can also use features such as Amazon Redshift Spectrum to analyze data in external data sources like Amazon S3.
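If you do go with Redshift, a rough sketch of running an analytics query from Python via the Redshift Data API might look like this (the cluster, database, user, schema, and table names are all hypothetical, and it assumes an external schema over S3 has already been set up for Spectrum):

import boto3

# The Redshift Data API avoids managing JDBC/ODBC connections.
client = boto3.client("redshift-data", region_name="us-east-1")

response = client.execute_statement(
    ClusterIdentifier="analytics-cluster",   # hypothetical cluster name
    Database="analytics",                    # hypothetical database
    DbUser="reporting_user",                 # hypothetical database user
    Sql="""
        SELECT account_id,
               DATE_TRUNC('month', posted_at) AS month,
               AVG(likes)                     AS avg_likes
        FROM spectrum_schema.instagram_posts  -- external table over S3 files
        GROUP BY 1, 2
        ORDER BY 1, 2;
    """,
)
print("statement id:", response["Id"])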
It's worth noting that moving data to a data warehouse like Redshift may involve some initial setup and data migration costs. However, in the long run, having a dedicated data warehouse can improve the efficiency and scalability of your data analytics processes.

Google Cloud Storage operation costs

I am looking into using Google Cloud Storage buckets as a cheaper alternative to Compute Engine snapshots for storing backups.
However, I am a bit confused about the cost per operation, specifically the insert operation. If I understand the documentation correctly, it doesn't matter how large the file you want to insert is; it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a Standard storage class bucket, wait 14 days, then retrieve it again, all within the same region, do I practically only pay for storing it for 14 days?
Doesn't that mean that even a Standard storage class bucket is a more cost-effective option for storing backups than snapshots, as long as you can get the whole thing into a single file?
That's not fully accurate, and it all depends on what counts as a cost for you.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store one 20 TB file; you would need at least 4 objects. In the end, though, the principle is the same.
Persistent disk snapshots are a very powerful feature:
A snapshot doesn't consume your VM's CPU, unlike your solution.
A snapshot doesn't consume network bandwidth, unlike your solution.
A snapshot can be taken at any time, on the fly.
A snapshot can be restored onto the current VM, or you can create a new VM from it to investigate, for example.
You can take incremental snapshots and save money (cheaper than full image snapshots).
You don't need additional space on your persistent disk (compared to your solution, where you need to create an archive before sending it to Cloud Storage).
In your scenario, snapshots seem like the best solution in terms of time efficiency. Now, is using Cloud Storage cheaper? Probably, as it is listed as the most affordable storage option, but in the end you will have to calculate the cost-benefit on your own, for example along the lines of the sketch below.
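A very rough Python comparison could look like the following; both per-GB prices are placeholder assumptions that you should replace with the current values from the Cloud Storage and persistent disk snapshot pricing pages:

# Rough storage-cost comparison: GCS Standard bucket vs. PD snapshots.
# Both prices below are placeholder assumptions; substitute current rates.
GCS_STANDARD_USD_PER_GB_MONTH = 0.020   # assumed Standard storage price
SNAPSHOT_USD_PER_GB_MONTH = 0.026       # assumed snapshot storage price

backup_gb = 20 * 1024      # ~20 TB backup
retention_days = 14
fraction_of_month = retention_days / 30

gcs_cost = backup_gb * GCS_STANDARD_USD_PER_GB_MONTH * fraction_of_month
snapshot_cost = backup_gb * SNAPSHOT_USD_PER_GB_MONTH * fraction_of_month

print(f"GCS Standard for {retention_days} days: ~${gcs_cost:.2f}")
print(f"Snapshots for {retention_days} days:    ~${snapshot_cost:.2f}")
# Note: snapshots are incremental, so the billable snapshot size is often
# much smaller than the full disk size, which can flip this comparison.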

AWS RDS Billing, split data across multiple databases? or use just one database?

Is AWS RDS billing purely based on RAM/IO and storage, or are there any additional per-database charges?
For my RDS deployment: if I have 1 PostgreSQL DB that holds all my data but only receives 2,000 queries per day, versus 4 PostgreSQL DBs that contain the same relations as the 1 DB but with those relations split across the 4 DBs, and the 4 DBs collectively receive the same 2,000 queries per day, will the bill for the two setups be essentially the same? The assumption is that the "size" of the data in 1 DB vs. 4 DBs is exactly the same.
I want to split the data across multiple databases to make reporting for different modules in my system easier.
You are billed based on instance size and some additional criteria (disk size, outbound traffic, etc.). If these are the same, the number of databases doesn't matter, so you can split your application across multiple databases within one instance without affecting the bill.
In the future, this kind of question is better suited to Server Fault than to Stack Overflow.
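To make the one-instance, many-databases setup concrete, here is a minimal Python sketch (the endpoint, credentials, and database names are hypothetical) showing that the split is purely logical: all four databases live on the same billed instance and are reached through the same endpoint:

import psycopg2  # pip install psycopg2-binary

# One RDS instance (one endpoint, one bill) hosting several logical databases.
RDS_ENDPOINT = "reporting.abc123xyz.us-east-1.rds.amazonaws.com"  # hypothetical

for dbname in ("module_a", "module_b", "module_c", "module_d"):   # hypothetical
    conn = psycopg2.connect(
        host=RDS_ENDPOINT,
        port=5432,
        dbname=dbname,
        user="report_user",      # hypothetical user
        password="change-me",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT current_database(), pg_database_size(current_database())")
        print(cur.fetchone())
    conn.close()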
AWS RDS charges based on instance size, data transfer, backups, storage, etc.
In your case, if you are going to keep the instance size the same, it is better to have only one instance, since the instance cost is higher than the data transfer and storage costs.
It makes no sense to have 4 instances of the same size, as the base bill would be 4 times higher. If you use smaller instance sizes, it may make some difference.
Please refer to the links below:
https://aws.amazon.com/rds/postgresql/pricing/
https://calculator.aws/#/
With these you can estimate how much you will be billed based on your usage and instance size.
You can also choose options such as Reserved Instances to reduce the bill.
Since there will be only one instance, I think the charges will be the same, as long as the parameters you are charged on stay the same.

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails in Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it had a hard time scaling over that many users' data: hitting various resource limits, etc.)
Is it better to store 1 email per file in Cloud Storage, or all of a user's emails in one large file? In many examples about Cloud Storage, I see people operating on large files, but it seems more logical to keep 1 file per email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.
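A minimal sketch of the manifest approach mentioned above (bucket and object names are hypothetical): upload each email as its own object, record its name in a manifest, and have the computation read the manifest instead of listing the bucket:

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("email-analytics-bucket")   # hypothetical bucket

# Upload one object per email and keep track of what we wrote.
manifest = []
for user_id, email_id, body in [("u1", "m1", b"hello"), ("u1", "m2", b"world")]:
    name = f"emails/{user_id}/{email_id}.eml"
    bucket.blob(name).upload_from_string(body)
    manifest.append(name)

# Store the manifest itself as an object; the analytics job reads this file
# instead of listing the bucket, sidestepping listing-consistency concerns.
bucket.blob("manifests/run-001.txt").upload_from_string("\n".join(manifest))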

Google Cloud SQL Pricing

I am an avid user of Amazon AWS, but I am not sure how RDS compares to Google's Cloud SQL. On this site it is mentioned that a Per Use billing plan exists.
How is that calculated? It says 'charged for periods of continuous use, rounded up to the nearest hour'.
How does that work? If there are no visitors to my site, are there no charges? And if I have 100 continuous users for 30 days, will I still be billed $0.025 per hour (excluding the network usage charges)?
How do I upload my existing SQL database to Google Cloud? Is it the same way as with Amazon, using Oracle Workbench?
Thank you
With per-use billing, if your database isn't accessed for 15 minutes it is taken offline, and you are only charged for data storage ($0.24 per GB per month). It is brought back online the next time it is accessed, which typically takes around a second for a D1 instance. The number of users doesn't affect the charge: you are charged for the database instance, not per user.
More details here
https://developers.google.com/cloud-sql/faq#how_usage_calculated
More information on importing data here:
https://developers.google.com/cloud-sql/docs/import-export
For Google Cloud SQL, we need to differentiate between MySQL First Generation and Second Generation instances. The FAQ link above (answered by Joe Faith), https://developers.google.com/cloud-sql/faq#how_usage_calculated, is about First Generation with an activation policy of ON_DEMAND, meaning you are charged per minute of usage.
However, with MySQL Second Generation (as answered by Se Song), you are charged for every minute (24 hours per day) regardless of whether you have active connections or not. The reason is that the instance uses activation policy = ALWAYS. You can read more about the pricing details here: https://cloud.google.com/sql/pricing/#2nd-gen-pricing
You can manually stop and restart your database instance, so it would be possible to write a script that activates it only under particular circumstances, even though this is not provided out of the box as a GCP feature; a sketch of such a script is shown below.
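For example, a script using the Cloud SQL Admin API could toggle the instance's activation policy (the project and instance names are hypothetical, and the field names follow the sqladmin v1beta4 API, so double-check them against the current API reference):

from googleapiclient import discovery  # pip install google-api-python-client
import google.auth

# Uses Application Default Credentials with the Cloud SQL Admin scope.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/sqlservice.admin"]
)
sqladmin = discovery.build("sqladmin", "v1beta4", credentials=credentials)

def set_activation_policy(project_id, instance, policy):
    """policy is 'ALWAYS' to bring the instance up or 'NEVER' to stop it."""
    body = {"settings": {"activationPolicy": policy}}
    return sqladmin.instances().patch(
        project=project_id, instance=instance, body=body
    ).execute()

# Hypothetical project and instance names:
set_activation_policy("my-project", "my-sql-instance", "NEVER")    # stop
# set_activation_policy("my-project", "my-sql-instance", "ALWAYS") # start again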
Watch the default settings carefully or you risk $350/month fees. Here's my experience: https://medium.com/#the-bumbling-developer/can-you-use-google-cloud-platform-gcp-cheaply-and-safely-86284e04b332