Currently, I have an application consisting of a backend, frontend, and database. The Postgres database has a table with around 60 million rows.
This table has a foreign key to another table, categories. If I want to count every row from a specific category (I know counting is one of the slowest operations in a DB), the query takes 5 minutes on my current setup. Currently, the DB, backend, and frontend are all just running on a VM.
I've now containerized the backend and the frontend and I want to spin them up in Google Kubernetes Engine.
So my question: will the performance of my queries go up if I also use a container DB and let Kubernetes do some load-balancing work, or should I use Google's Cloud SQL? Does anyone have experience with this?
will the performance of my queries go up if you also use a container DB
Raw performance will only go up if the new nodes have more capacity than your current node. If you reuse the same machine as a Kubernetes node, it will not go up. You won't get benefits from containers in this case, other than that updating your DB software might be a bit easier if you run it in Kubernetes. Many factors are in play here, including what kind of disk you use for storage (SSD, magnetic, clustered filesystem?).
If, say, your goal is to maximize resource use in your cluster by putting that capacity to work when few queries are being sent to your database, then Kubernetes/containers might be a good choice. (But that's not what the original question asks.)
should I use Google's Cloud SQL
The only reason I would use Cloud SQL is if you want to offload managing your SQL DB. Other than that, you'll get performance numbers similar to running on a same-size instance on GCE.
Related
I wonder if a few microservices in the same project can share the same database? For example, I'm using MongoDB and have a few microservices that need a database. Should I use one collection for every microservice, or create a new database for every microservice? Any suggestions?
It is typically recommended to have a database per service; however, there may be exceptions depending on the specific problem at hand. Your question does not have enough details to provide anything other than general advice.
The database per service has multiple advantages.
Each microservice is independently deployable, so it can change its database schema without breaking other services.
There is a clear boundary of data ownership: no two services write to the same data. If a service needs data owned by another service, there are patterns (materialized views, REST endpoints exposing data, eventual consistency using queues) to solve that problem.
Depending on the volume of data generated or managed by one microservice, its database can be scaled (sharding, read replicas) independently.
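As a rough illustration of the "eventual consistency using queues" pattern mentioned above, here is a minimal in-process sketch. All names (CustomerService, OrderService, CustomerRenamed) are hypothetical, and queue.Queue stands in for a real message broker; in a real deployment each service would have its own database and the queue would be an external system.

```python
import queue
from dataclasses import dataclass


@dataclass
class CustomerRenamed:
    """Hypothetical event published by the service that owns customer data."""
    customer_id: int
    new_name: str


class CustomerService:
    """Owns customer data and is the only writer to its own store."""

    def __init__(self, events: queue.Queue):
        self._customers = {}  # stand-in for this service's private database
        self._events = events

    def rename(self, customer_id: int, new_name: str) -> None:
        self._customers[customer_id] = new_name
        # Publish the change instead of letting other services read our store.
        self._events.put(CustomerRenamed(customer_id, new_name))


class OrderService:
    """Keeps a local copy (a simple materialized view) of the names it needs."""

    def __init__(self, events: queue.Queue):
        self._customer_names = {}  # lives in the order service's own store
        self._events = events

    def drain_events(self) -> None:
        # Apply every pending event; until this runs, the local view is stale.
        while True:
            try:
                event = self._events.get_nowait()
            except queue.Empty:
                break
            self._customer_names[event.customer_id] = event.new_name

    def customer_name(self, customer_id: int):
        return self._customer_names.get(customer_id)


events = queue.Queue()
customers = CustomerService(events)
orders = OrderService(events)

customers.rename(1, "Alice")
stale = orders.customer_name(1)  # None: the event has not been applied yet
orders.drain_events()
fresh = orders.customer_name(1)  # "Alice" once the view has caught up
```

The gap between `stale` and `fresh` is the price of the pattern: the order service never touches the customer database, but its copy lags behind until the events are processed.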
Database per service has significant challenges too (the overhead of managing multiple services, no ACID transactions across services). It is always recommended to start with a modular monolith and carve out services based on business need.
References
https://learn.microsoft.com/en-us/dotnet/architecture/cloud-native/distributed-data
Trying to figure out what would be better:
Multiple instances, one per DB
or
Single large instance which will hold multiple DBs inside
The scenario is similar to Jira Cloud where each customer has his own Jira Cloud server, with its own DB.
Now the question is, will it be better to manage all of the users' DBs in 1 large instance, or to have a DB instance for each customer?
What would be the cons and pros for the chosen alternative?
The first thing that came to our minds is backup management: will we be able to recover a specific customer's DB if it resides on the same large instance as all other DBs?
Similar question, but in a different scenario and other requirements - 1-big-google-cloud-sql-instance-2-small-google-cloud-sql-instances-or-1-medium
This answer is based on personal opinion. It is up to you to decide how you want to build your database. However, it is better to go with multiple smaller Cloud SQL instances, as is also stated in the Cloud SQL > Best practices documentation.
PROS of multiple databases
It is easier to manage smaller instances than big ones (see the documentation mentioned above).
You can choose the region and zone for each database, so if your customers are located in different geographical locations, you can always choose the zone closest to them for the Cloud SQL instance and reduce latency that way.
Also, if you are planning to have a lot of databases, with a lot of tables in each database and a lot of records in each table, the instance will be huge. Backups, and creating and maintaining read replicas or failover replicas, will therefore take longer as the databases grow.
However, if you have multiple databases per user, I would suggest keeping them inside one Cloud SQL instance, so that you manage one Cloud SQL instance per user. E.g., you have 100 users, and User1 has 4 databases, User2 has 6 databases, etc. Create 100 Cloud SQL instances rather than one Cloud SQL instance per database; otherwise you will end up with a lot of instances, and it will be hard to manage multiple instances per user.
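The grouping suggested above (one instance per user, holding all of that user's databases) can be sketched as a small planning step. This is only an illustration: the inventory, the database names, and the `sql-<user>` instance naming scheme are all made up, and no Cloud SQL API is called.

```python
from collections import defaultdict

# Hypothetical inventory of (user, database name) pairs.
databases = [
    ("user1", "app"), ("user1", "billing"), ("user1", "audit"), ("user1", "reports"),
    ("user2", "app"), ("user2", "billing"),
]


def plan_instances(databases):
    """Group databases so that each user gets exactly one instance."""
    plan = defaultdict(list)
    for user, db_name in databases:
        # The instance naming scheme is an assumption for illustration only.
        plan[f"sql-{user}"].append(db_name)
    return dict(plan)


plan = plan_instances(databases)
for instance, dbs in plan.items():
    print(instance, dbs)
```

With this layout the number of instances grows with the number of users (2 here), not with the number of databases (6 here).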
As per the standard Postgres documentation:
As with the plain file-system-backup technique, this method can only support restoration of an entire database cluster, not a subset.
From this, I understood that it is not possible to set up PITR for individual databases in a cluster (a.k.a. a database instance holding multiple databases).
If my understanding is incorrect, probably the next part of the question is not relevant, but if not, here it is:
I still do not get the problem in setting this up theoretically as each database is generating its own WAL archive.
The problem here is: I need to set up multiple Postgres clusters, and I only have 2 RHEL 7.6 machines to handle this. I am trying to reduce the number of clusters on these 2 machines to only 2. I am planning to create multiple databases rather than multiple instances to handle customer applications. But that means I have to sacrifice PITR, as PITR can only be performed at the instance/cluster level and not at the database level (as per the official documentation).
Could someone please help clarify my misunderstanding?
You are correct, you can only do PITR on a PostgreSQL database cluster, not on an individual database.
There is only one WAL stream for the complete database cluster; WAL is not split up per database.
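To see why a single WAL stream rules out per-database PITR, here is a toy model. It is not PostgreSQL's actual WAL format or recovery code; the record layout, the database names, and the `recover` function are invented for illustration. The only point it makes is that replaying one shared log up to a target position rewinds every database in the cluster together.

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    lsn: int        # position in the single, cluster-wide log
    database: str
    key: str
    value: str


# One shared log: records from different databases are interleaved.
wal = [
    WalRecord(1, "customer_a", "plan", "basic"),
    WalRecord(2, "customer_b", "plan", "basic"),
    WalRecord(3, "customer_a", "plan", "pro"),
    WalRecord(4, "customer_b", "plan", "enterprise"),
]


def recover(wal, target_lsn):
    """Replay the whole log up to target_lsn and return every database's state."""
    state = {}
    for record in wal:
        if record.lsn > target_lsn:
            break  # stop at the recovery target; later changes are not applied
        state.setdefault(record.database, {})[record.key] = record.value
    return state


# Recovering to LSN 3 to undo something in customer_a also drops
# customer_b's later change at LSN 4.
restored = recover(wal, target_lsn=3)
```

In this model, going back to position 3 leaves customer_b on "basic" even though nothing was wrong with its upgrade to "enterprise", which is the trade-off of sharing one cluster between customers.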
Don't hesitate to run several PostgreSQL clusters on a single machine if that is advantageous for you.
There is little overhead in running a second database cluster. The biggest resource that is hogged by a cluster is shared buffers, but you want that to be only a fraction of the available RAM anyway. Most of the memory should be left to the filesystem cache that is shared by all PostgreSQL clusters.
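As a back-of-the-envelope sketch of the memory argument above: the helper below splits a fixed share of RAM evenly across clusters. The function name is hypothetical, and the 25% default is an assumption (a commonly cited starting point for shared_buffers), not a tuned value. It also ignores other memory consumers such as per-connection work memory.

```python
def shared_buffers_per_cluster_mb(total_ram_mb: int, clusters: int,
                                  fraction: float = 0.25) -> int:
    """Split a fixed share of RAM evenly across clusters on one machine."""
    if clusters < 1:
        raise ValueError("need at least one cluster")
    budget_mb = int(total_ram_mb * fraction)  # total RAM set aside for shared buffers
    return budget_mb // clusters


total_ram_mb = 65536  # assumed 64 GB machine
clusters = 4
per_cluster = shared_buffers_per_cluster_mb(total_ram_mb, clusters)
left_for_cache_mb = total_ram_mb - per_cluster * clusters

print(per_cluster, left_for_cache_mb)  # 4096 49152
```

Under these assumptions each of the four clusters gets 4 GB of shared buffers, and roughly 48 GB stays available for the filesystem cache that all clusters share.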
I would like to have a Postgres database which is in sync with my production database like a read-replica, but I would also like to write to that database. AWS provides read-replicas to be writable for MySQL and MariaDB but not Postgres. Is there any other way to achieve this?
Well, by definition, read replicas are not writable, so I'm afraid I don't think you'll have much luck with that approach.
Amazon themselves state that read replicas are for read only traffic:
You can create one or more replicas of a given source DB Instance and
serve high-volume application read traffic from multiple copies of
your data, thereby increasing aggregate read throughput.
Now, as you say, for MySQL read replicas can be promoted to masters (and therefore become writable), but pay special attention to the "when needed" below:
Read replicas can also be promoted when needed to become standalone DB
instances.
However, RDS itself does not support multi-master deployments for MySQL.
For PostgreSQL things are even "worse". AWS RDS for Postgres does not (at the time of writing this) support automatic promotion of read-replicas, leaving you with Multi-AZ as your only option.
Outside RDS, multi-master deployments of PostgreSQL (which sounds like what you're looking for) require an even more elaborate setup. You can find more information in the clustering section of their wiki.
As a general note, horizontally scaling relational / SQL databases is probably not something you'll have a lot of fun with and you're bound to run into problems along the way.
That's because they were simply not designed for horizontal scaling the same way that newer "NoSQL" databases are (take a look at MongoDB or Cassandra, etc.). You are far better off scaling them vertically, for as far as that will take you (and it will take you quite some way).
The only relational database that I know of that's (being) built to scale out is CockroachDB, and although it is a very promising solution, it's still in beta: there's no 1.0 release of it yet.
I have an application using a MongoDB database and am about to add custom analytics. Is there a reason to use a separate database for the analytics, or am I better off building it into the application database?
Here are the reasons I can think of:
Name collisions between production collections and analytics collections
You need a different replica set configuration for analytics
You need a different sharding configuration for analytics
You want different physical media for some data (production data on fast disks, analytics on slow disk, for example)
Starting in Mongo 2.2, the "global write lock" will be a per-database lock, so different databases will isolate analytics traffic from production traffic a bit more.
Unless something on this list applies to you, you don't need to split them out. Also, it's much easier to move a collection across DBs in MongoDB than in an RDBMS (as you don't have foreign keys to cause trouble), so IMO it's a relatively easy decision to delay.