I have a postgres database running locally that I'd like to migrate to AWS Aurora (or AWS postgres).
I've pg_dump'd the database that I want, and it's ~30gb compressed.
How should I upload this file and get the AWS RDS instance to pg_restore from it?
Requirements:
There's no one else using the DB, so we're OK with a lot of downtime and an exclusive lock on the db. We want the migration to be as cheap as possible.
What I've tried/looked at so far:
Running pg_restore on the local file against the remote target (see the command sketch below) - unknown total cost
I'd also like to do this as cheaply as possible, and I'm not sure I understand their pricing strategy.
Their pricing says:
Storage Rate: $0.10 per GB-month
I/O Rate: $0.20 per 1 million requests
Replicated Write I/Os: $0.20 per million replicated write I/Os
Would pg_restore count as one request? The database has about 2.2 billion entries, and if each one is 1 request does that come out to $440 to just recreate the database?
AWS Database Migration Service - it looks like this would be the cheapest option (as it's free?), but it only works by connecting to the live local database. Uncompressed the data is about 200gb, and I'm not sure it makes sense to do a one-for-one copy using DMS.
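For the pg_restore route, the command I have in mind is roughly the following (the endpoint, credentials, and dump file name are placeholders, and it assumes the dump is in pg_dump's custom format):

# Restore the local dump directly into the remote RDS/Aurora instance.
# --no-owner/--no-privileges avoid errors from roles that don't exist on RDS;
# --jobs parallelises the restore (custom/directory format only).
pg_restore \
  --host=mydb.xxxxxxxx.us-east-1.rds.amazonaws.com \
  --port=5432 \
  --username=postgres \
  --dbname=mydb \
  --no-owner \
  --no-privileges \
  --jobs=4 \
  mydb.dump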
I've read this article but I'm still not clear on the best way of doing the migration.
We're ok with this taking a while, we'd just like to do it as cheap as possible.
Thanks in advance!
There are some points you should note when migrating:
AWS Database Migration Service - it looks like this would be the cheapest (as it's free?)
The service they provide for free is a virtual machine (with software included) that provides the computing power and functionality to move databases to some of their RDS services.
Even though that service is free, you would be charged the normal fees for any RDS usage.
The numbers they provide relate to EBS (the underlying disks) used to serve your data. A single big, complex query can take many I/Os, so database requests and I/O requests are not equal to each other.
An estimate of EBS usage can be seen here:
As an example, a medium sized website database might be 100 GB in size and expect to average 100 I/Os per second over the course of a month. This would translate to $10 per month in storage costs (100 GB x $0.10/month), and approximately $26 per month in request costs (~2.6 million seconds/month x 100 I/O per second * $0.10 per million I/O).
My personal advice: make a clone of your DB with only part of the data set (maybe 5%). Use DMS on that piece; you can see how the bills work out for you within a few minutes. Then you can extrapolate the price of a full DB migration.
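For example, a rough way to build that 5% clone locally would be something like this (database and table names are placeholders; TABLESAMPLE needs PostgreSQL 9.5 or newer):

# Recreate the schema only, in a scratch database
createdb mydb_sample
pg_dump --schema-only mydb | psql mydb_sample

# Copy roughly 5% of each large table into the sample (repeat per table)
psql mydb -c "COPY (SELECT * FROM big_table TABLESAMPLE SYSTEM (5)) TO STDOUT" \
  | psql mydb_sample -c "COPY big_table FROM STDIN"

Foreign keys on the sampled tables may reject a random 5%, so you might have to drop those constraints in the scratch database first. Then point DMS at the sample and watch how the charges accumulate.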
Related
Is AWS RDS billing purely based on RAM/IO and storage, or are there any additional per-database charges?
For my RDS deployment: if I have 1 PostgreSQL DB that has all my data but only receives 2,000 queries per day, versus 4 PostgreSQL DBs that hold the same relations split across them and collectively receive the same 2,000 queries per day... will the bill for the two setups be essentially the same? The assumption is that the "size" of the data in 1 DB vs 4 DBs is exactly the same.
I want to split the data across multiple databases to make reporting for different modules in my system easier.
You are billed based on instance size and some additional criteria (disk size, outbound traffic, etc.). If these are the same, the number of databases doesn't matter, so you can split your application across multiple databases within an instance without any impact on the billing.
In the future - this is a question better suited to Server Fault than to Stack Overflow.
AWS RDS charges based on instance size, data transfer, backups, storage, etc.
In your case, if you are going to keep the instance size the same, then it is better to have only one instance, as the cost of the instance is higher than the cost of data transfer and storage.
It makes no sense to have 4 instances of the same size, as the base billing will be 4 times higher. If you use a smaller instance size, it may make some difference.
Please refer to the links below:
https://aws.amazon.com/rds/postgresql/pricing/
https://calculator.aws/#/
With these you can understand how much you will be billed for instances based on your usage and instance size.
You can also choose options such as Reserved Instances to reduce the bill.
Since there will be only one instance, I think the charges will be the same, as long as the parameters it charges on are the same.
I created a test Postgres database in AWS RDS, created a 100-million-row, 2-column table, and ran a select * on that table. Postgres reports "Buffers: shared hit=24722 read=521226" but AWS reports IOPS in the hundreds. Why such a huge discrepancy? Broadly, I'm trying to figure out how to estimate the number of AWS I/O operations a query might cost.
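For reference, the buffer numbers above came from running something like this (endpoint and table name are placeholders):

# EXPLAIN (ANALYZE, BUFFERS) is what prints the "Buffers: shared hit=... read=..." line
psql -h mydb.xxxxxxxx.us-east-1.rds.amazonaws.com -U postgres -d test \
  -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM big_table;"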
PostgreSQL does not have insight into what the kernel/FS get up to. If PostgreSQL issues a system call to read the data, then it reports that buffer as "read". If it was actually served out of the kernel's filesystem cache, rather than truly from disk, PostgreSQL has no way of knowing that (although you can make some reasonable statistical guesses if track_io_timing is on), while AWS's IO monitoring tool would know.
If you set shared_buffers to a large fraction of memory, then there would be little room left for a filesystem cache, so most buffers reported as read should truly have been read from disk. This might not be a good way to run the system, but it might provide some clarity to your EXPLAIN plans. I've also heard rumors that Amazon Aurora reimplemented the storage system so that it uses direct I/O, or something similar, and so doesn't use the filesystem cache at all.
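If you want to see how much of a "read" actually waited on storage, a quick sketch (assuming a self-managed superuser session; on RDS, track_io_timing is normally set through a parameter group, and the table name is a placeholder):

# With track_io_timing on, the plan also prints "I/O Timings: read=...";
# buffers served from the OS page cache show up as reads with near-zero I/O time
psql -d test \
  -c "SET track_io_timing = on;" \
  -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM big_table;"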
Cloud SQL reports that I've used ~4TB of SSD storage, but my database is only ~225 GB. What explains this discrepancy? Is there something I can delete to free up space? If I moved it to a different instance, would the required storage go down?
There are a couple of possible reasons why your Cloud SQL storage has increased:
- Did you enable point-in-time recovery? PITR uses write-ahead logs, and if you enabled this feature, that could be the reason for the increase.
- Have you used temporary tables that you have not deleted?
If none of the above applies to you, I highly recommend opening a case with the GCP support team so that they can take a look at your Cloud SQL instance.
On the other hand, you can open a case to decrease the disk to a smaller size, so it won't be necessary to create a new instance and copy all the data to it; shrinking the disk is done at Google's end, which keeps the effort on your side as low as possible.
A maintenance window can be scheduled where Google proceeds with this task, and you may want to schedule it to minimize the impact of the downtime. For this, it is necessary to know the new disk size and when you would like to perform the operation.
Finally, if you prefer the migration method, you would export the DB, create the new instance, import the DB, and synchronize the old instance with the new one so that both have all the data; those four steps can take several hours to complete.
You do not specify what kind of database. In my case, for a MySQL database, there were several hundred GB of binary logs (controlled by a MySQL flag).
You could check with:
SHOW BINARY LOGS;
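If you want a total rather than the raw list, something along these lines sums the reported sizes (connection details are placeholders):

# File_size is the second column of SHOW BINARY LOGS output, in bytes
mysql -h 10.1.2.3 -u root -p -e "SHOW BINARY LOGS;" \
  | awk 'NR > 1 { total += $2 } END { printf "%.1f GB of binary logs\n", total / 1024 / 1024 / 1024 }'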
What's the best AWS database for the below requirements?
I need to store around 50,000 - 100,000 entries in the database.
Each of the entries would have a String as a key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of JSON data is around 20-30KB
I expect around 10,000 - 40,000 reads per hour.
Around 50,000 - 100,000 writes/week
I have to consider the cost as well.
Ease of integration/development
I am a bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case that you have described in the OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
DynamoDB has a good AWS SDK for all operations, and the read and write capacity units can be configured per table.
DynamoDB tables using on-demand capacity mode automatically adapt to your application’s traffic volume. On-demand capacity mode instantly accommodates up to double the previous peak traffic on a table. For example, if your application’s traffic pattern varies between 25,000 and 50,000 strongly consistent reads per second where 50,000 reads per second is the previous traffic peak, on-demand capacity mode instantly accommodates sustained traffic of up to 100,000 reads per second. If your application sustains traffic of 100,000 reads per second, that peak becomes your new previous peak, enabling subsequent traffic to reach up to 200,000 reads per second.
One point to note is that it doesn't allow querying the table based on non-key attributes. This means that if you don't know the hash key of an item, you may need to do a full table scan to get the data. However, there is a Secondary Index option which you can explore to work around the problem. You should know all the query access patterns of your use case before you design the table and make an informed decision.
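For the access pattern described in the question (fetch a JSON array by a string key), a minimal sketch with the AWS CLI could look like this (table and attribute names are made up, and on-demand billing is used so you don't have to size capacity units up front):

# String partition key only; the JSON array is stored as the item's payload attribute
aws dynamodb create-table \
  --table-name entries \
  --attribute-definitions AttributeName=entry_key,AttributeType=S \
  --key-schema AttributeName=entry_key,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

# Write one entry (the 20-30KB JSON array goes into "payload")
aws dynamodb put-item --table-name entries \
  --item '{"entry_key": {"S": "user-123"}, "payload": {"S": "[{\"a\": 1}, {\"b\": 2}]"}}'

# Retrieve it by key
aws dynamodb get-item --table-name entries \
  --key '{"entry_key": {"S": "user-123"}}'

DynamoDB's 400 KB item limit comfortably covers the 20-30 KB values described here.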
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS, etc. This requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying. Also, it has a powerful aggregation framework. There are lots of tools available to query the database directly and explore the data, much like SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform typical database operations. There are lots of open source frameworks available in various languages for MongoDB as well (for example, Mongoose for Node.js).
MongoDB has support for many programming languages, and the APIs are mature as well.
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS (Amazon RDS for PostgreSQL).
PostgreSQL has become the preferred open source relational database for many enterprise developers and start-ups, powering leading geospatial and mobile applications. Amazon RDS makes it easy to set up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its API. It is a very mature database and has good APIs.
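For completeness, provisioning a small managed PostgreSQL instance for this kind of workload is a single CLI call (the identifier, instance class, storage size, and password below are placeholders, not recommendations):

# Smallest practical RDS PostgreSQL instance; adjust class/storage to the workload
aws rds create-db-instance \
  --db-instance-identifier entries-db \
  --engine postgres \
  --db-instance-class db.t3.micro \
  --allocated-storage 20 \
  --master-username postgres \
  --master-user-password 'change-me'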
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support
I have a MongoDB instance provisioned on Azure used as IaaS. There is a load balancer behind which there is a shard cluster; each shard has 2 replicas, and each replica is a VM. So I can go inside that VM and check the storage space, RAM, etc., and look at the hardware details for that VM.
Now, I have Cosmos DB provisioned as well, which is a managed service, and I have no control over what it uses under the hood. For example, I would not know how much RAM, what storage space, etc. is used.
So if I have to compare the performance of mongoDb and cosmosDb on azure cloud, I am not sure how to compare apples to apples if I don't have the exact information about the underlying hardware.
Can someone suggest a way I can compare the performance of the two?
Why not compare on price?
Take the direct Azure charges for your IaaS MongoDB and allocate the same budget to purchase an allowance of Cosmos DB request units. This would represent a very basic comparison.
Next, fine-tune your comparison to genuinely reflect some advantages of PaaS Cosmos DB.
Assume you could dial down allocated RU by 30% for 10 hours per day.
Enable the new add-on provisioning for request units per minute. 20% cost savings have been cited by Microsoft when this feature is enabled.
Finally, add 10% of the salary of a Database Administrator to your total IaaS cost.
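As a back-of-the-envelope sketch of that comparison (every dollar amount below is a made-up placeholder, not real Azure or Cosmos DB pricing):

# Spend the IaaS budget on RU, scale RU down 30% for 10 of the 24 hours,
# take 20% off for the RU/minute add-on, and load 10% of a DBA onto the IaaS side
awk 'BEGIN {
  iaas = 1200; dba = 8000; cosmos_base = 1200          # monthly, illustrative only
  scaled = cosmos_base * (14 + 10 * 0.7) / 24          # partial-day scale-down
  cosmos = scaled * 0.8                                # RU/minute add-on savings
  printf "Cosmos DB ~%.0f/month vs IaaS ~%.0f/month\n", cosmos, iaas + 0.1 * dba
}'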