Google Cloud Storage API latency - google-cloud-storage

We have own MariaDB server on one server in the Google Compute Engine. One DB table contains blobs about 10kB - 1.5MB, these blobs are INSERT, SELECT only, no UPDATE, DELETE rarely. The table takes about 100GB and is growing. We would like to move it out of the DB, so have just tried to store a few blobs in the Google Storage.
But a simple benchmark showed that reading the blobs using Python google-api-python-client (and httplib2shim) by objects().get_media() is much slower than connection to our MariaDB server: 400ms vs. 10ms. The OAuth2 requires an extra HTTP request, so I would expect the total time about 20ms, not 400ms. The new Python library google-cloud-storage blob.download_as_string is not faster.
Update: when I make the files available for the public download (http://storage.googleapis.com/...), then the speed is comparable or even faster than MariaDB.
Are these latencies of the Google Storage API OAuth normal or do I miss something else? Of course, we prefer the secure access to our data, if possible.

Related

Google Cloud cloud storage operation costs

I am looking into using Google Cloud cloud storage buckets as a cheaper alternative to compute engine snapshots to store backups.
However, I am a bit confused about the costs per operation. Specifically the insert operation. If I understand the documentation correctly, it doesn't seem that it matters how large the file is that you want to insert is, it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a standard storage class bucket, wait 14 days, then retrieve it again, and all this within the same region, I practically only pay for storing it for 14 days?
Doesn't that mean that even the standard storage class bucket is a more cost effective option for storing backups compared to snapshots, as long as you can get your whole thing into a single file?
It's not fully accurate, and all depends on what cost for you.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store 1 file of 20Tb, but 4, at the end, it's the same principle.
The persistent disk snapshot is a very powerful feature:
The snapshot doesn't need CPUs to be done, compared to your solution.
The snapshot doesn't need network bandwidth to be done, compared to your solution.
The snapshot can be done anytime, on the fly.
The snapshot can be restored in the current VM, or you can create a new VM with a snapshot to investigate on it, for example.
You can perform incremental snapshots saving money (cheaper than full image snapshot).
You don't need additional space on your persistent disk to be done (compared to your solution where you need to create an archive before sending it to Cloud Storage).
In your scenario seems like using snapshots seems like the best solution in terms of time efficiency. Now, is using Cloud Storage a cheaper solution? Probably, as it is listed as the most affordable storage option, but in the end, you will have to calculate the cost-benefits on your own.

Exposing Data from BigQuery to Mobile/Web Apps Via Firestore

I am looking for an easy way (of course with good performance) to expose data in my BigQuery table to web applications.
The current solution which is running is using a Cloud Function and Firestore (in native mode) to expose the data in BigQuery. The implementation is like - as soon as the data is written to the final big query table, we are triggering cloud functions (500 records per commit) to update the data in our final firestore table. The data in firestore table is finally exposed to the App/Web client.
And, to avoid timeout issues associated with Cloud Functions, we are dividing the entire dataset into batches and each cloud function instance will handle a single batch of records only.
But soon after going live, we were hit with scalability issues for the writes as we were triggering the Cloud Function instances sequentially.
A simple way to improve the performance could be to do parallel writes from inside the cloud function, but again as per the firestore documentation doing more than 1000 writes/sec against a collection can reduce performance. So eventually the performance gains we are getting with this approach could be minimum. In our case, we have only one collection.
Anyone here has experience of dealing with high volume writes and reads against Firestore ? Firestore in datastore mode can be used for high volume writes, but what about the read latency?
Also, I am thinking of using BigTable for this purpose (eventual consistency could be fine for us), but using bigtable might add additional layers to expose the data, maybe through a web service.
We are expecting data size to be around GBs only.
PS : I don't need the offline capabilities offered by the Firestore, the reason for choosing Firestore was for the ease of development only. 
Based on the information you shared Firestore does not seem like an appropriate choice of product for the amount of data you will be adding at once, plus the costs of this might be heavier than the alternative if we talking about TBs of data, which I assume is the case.
Generally speaking Firestore is not recommended for very data intensive apps nor apps with too many writes, for pricing reasons, as reads are considerably cheaper than writes.
Personally I would choose Big Table for this case for the following reasons:
Supports apps with high throughput.
Easily scalable without lost in performance or instance downtime while doing so.
If kept in the same zone or region as Big Query will have no additional costs to migrating the data to Big Table.

Migrate an on-prem database to AWS Aurora

I have a postgres database running locally that I'd like to migrate to AWS Aurora (or AWS postgres).
I've pg_dump'd the database that I want, and it's ~30gb compressed.
How should I upload this file and get the AWS RDS instance to pg_restore from it?
Requirements:
There's no one else using the DB so we're ok with a lot of downtime and an exclusive lock on the db. We want it to be as cheap as possible to migrate
What I've tried/looked at so far:
Running pg_restore on the local file with the remote target - unknown pricing total
I'd also like to do this as cheaply as possible, and I'm not sure I understand their pricing strategy.
Their pricing says:
Storage Rate $0.10 per GB-month
I/O Rate $0.20 per 1 million requests
Replicated Write I/Os $0.20 per million replicated write I/Os
Would pg_restore count as one request? The database has about 2.2 billion entries, and if each one is 1 request does that come out to $440 to just recreate the database?
AWS Database Migration Service - it looks like this would be the cheapest (as it's free?) but it only works by connecting to the local database. Uncompressed the data is about 200gb, and I'm not sure it makes sense to do a one for one copy using DMS
I've read this article but I'm still not clear on the best way of doing the migration.
We're ok with this taking a while, we'd just like to do it as cheap as possible.
Thanks in advance!
There are some points you should note when migrating
AWS Database Migration Service - it looks like this would be the cheapest (as it's free?)
The service they provide for free is a Virtual machine ( with softwares included ) that provide the computing power and functionality to move Databases to some of their RDS service.
Even when that service is free, you would be charged normal fee for any RDS usage
The number they provided is roughly related to EBS (the underlying disks ) they use to serve your data. A very big and complex query can take some I/O, the two are not equal to eachother.
The estimation for EBS usage can be seen here
As an example, a medium sized website database might be 100 GB in size and expect to average 100 I/Os per second over the course of a month. This would translate to $10 per month in storage costs (100 GB x $0.10/month), and approximately $26 per month in request costs (~2.6 million seconds/month x 100 I/O per second * $0.10 per million I/O).
My personal advice: Make a clone of your DB with only part of the set ( 5% maybe). Use DMS on that piece. You can see how the bills work out for you in a few minutes. Then you can estimate the price on a full DB migration

DynamoDB vs ElasticSearch vs S3 - which service to use for superfast get/put 10-20MB files?

I have backend that recieves, stores and serves 10-20 MB json files. Which service should I use for superfast put and get (I cannot break the file in smaller chunks)? I dont have to run queries on these files just get them, store them and supply them instantly. The service should scale to tens of thousands of files easily. Ideally I should be able to put the file in 1-2 seconds and retrieve it in the same time.
I feel s3 is the best option and elastic search the second best option. Dyanmodb doesnt allow such object size. What should I use? Also, is there any other service? Mongodb is a possible solution but i dont see that on AWS, so something quick to setup would be great.
Thanks
I don't think you should go for Dynamo or ES for this kind of operation.
After all, what you want is to store and serve it, not going into the file's content which both Dynamo and ES would waste time to do.
My suggestion is to use AWS Lambda + S3 to optimize for cost
S3 does have some small downtime after putting till the file is available though ( It get bigger, minutes even, when you have millions of object in a bucket )
If downtime is important for your operation and total throughput at any given moment is not too huge, You can create a server ( preferably EC2) that serves as a temporary file stash. It will
Receive your file
Try to upload it to S3
If the file is requested before it's available on S3, serve the file on disk
If the file is successfully uploaded to S3, serve the S3 url, delete the file on disk

AWS database solution for storing non-relational data

Whats the best AWS database for the below requirement
I need to store around 50,000 - 1,00,000 entries in the database.
Each of the entry would have a String as a key and a Json array as the value.
I should be able to retrieve the JSON array using the key.
The size of JSON data is around 20-30KB
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 1,00,000 writes/week
I have to consider the cost as well.
Ease of integration/development
I am bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case that you have described in OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support
peaks of more than 20 million requests per second.
DynamoDB has good AWS SDK for all operations. The read and write capacity units can be configured for the table.
DynamoDB tables using on-demand capacity mode automatically adapt to
your application’s traffic volume. On-demand capacity mode instantly
accommodates up to double the previous peak traffic on a table. For
example, if your application’s traffic pattern varies between 25,000
and 50,000 strongly consistent reads per second where 50,000 reads per
second is the previous traffic peak, on-demand capacity mode instantly
accommodates sustained traffic of up to 100,000 reads per second. If
your application sustains traffic of 100,000 reads per second, that
peak becomes your new previous peak, enabling subsequent traffic to
reach up to 200,000 reads per second.
One point to note is that it doesn't allow to query the table based on non-key attributes. This means if you don't know the hash key of the table, you may need to do full table scan to get the data. However, there is a Secondary Index option which you can explore to get around the problem. You may need to have all the Query Access Patterns of your use case before you design and make informed decision.
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can setup the database using AWS service such as EC2, VPC, IAM, EBS etc. This requires some AWS cloud experience to setup the database. The other option is to use MongoDB Atlas service.
MongoDB is more flexible in terms of querying. Also, it has a powerful aggregate functions. There are lots of tools available to query the database directly to explore the data like SQL.
In terms of Java API, the Spring MongoDB can be used to perform typical database operation. There are lots of open source frameworks available on various languages for MongoDB (example Mongoose Nodejs) as well.
The MongoDB has support for many programming languages and the APIs are mature as well.
PostgreSQL:-
PostgreSQL is a fully managed database on AWS.
PostgreSQL has become the preferred open source relational database
for many enterprise developers and start-ups, powering leading
geospatial and mobile applications. Amazon RDS makes it easy to set
up, operate, and scale PostgreSQL deployments in the cloud.
I think I don't need to write much about this database and its API. It is very mature database and has good APIs.
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support