Database for every microservice - MongoDB

I wonder if a few microservices in the same project can share the same database. For example, I'm using MongoDB with a few microservices that each need a database. Should I use one collection per microservice, or create a new database for every microservice? Any suggestions?

It is typically recommended to have a database per service; however, there may be exceptions depending on the specific problem at hand. Your question does not have enough details to provide anything other than general advice.
The database-per-service pattern has multiple advantages.
Since each microservice is independently deployable, it can change its database schema without breaking other services.
There is a clear boundary of data ownership: no two services write the same data. If a service needs data owned by another service, there are patterns (materialized views, REST endpoints exposing data, eventual consistency using queues) to solve that problem.
Depending on the volume of data generated/managed by a microservice, its database can be scaled (sharding, read replicas) independently.
Database per service has significant challenges too (the overhead of managing multiple services, no ACID transactions across services), so it is generally recommended to start with a modular monolith and carve out services based on business need.
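As a minimal sketch of one of the patterns mentioned above, eventual consistency using queues: the service that owns the data publishes events, and other services build their own read models from them. The service names, event shape, and in-process queue here are illustrative stand-ins for a real message broker such as Kafka or RabbitMQ.

```python
import queue

# Hypothetical event bus: in production this would be Kafka, RabbitMQ, etc.
events = queue.Queue()

# The orders service owns its data and publishes facts about it.
def place_order(order_id, customer_id, amount):
    events.put({"type": "OrderPlaced", "order_id": order_id,
                "customer_id": customer_id, "amount": amount})

# A reporting service keeps its own materialized view, updated asynchronously.
customer_totals = {}

def consume_events():
    while not events.empty():
        event = events.get()
        if event["type"] == "OrderPlaced":
            cid = event["customer_id"]
            customer_totals[cid] = customer_totals.get(cid, 0) + event["amount"]

place_order(1, "c42", 30)
place_order(2, "c42", 20)
consume_events()  # the read model converges eventually, not atomically
```

The key point is that the reporting service never reads the orders service's database directly; it only consumes published events, so either side can change its schema freely.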
References
https://learn.microsoft.com/en-us/dotnet/architecture/cloud-native/distributed-data

Related

Relational DB in microservices

I have a monolithic application that currently uses a PostgreSQL DB and the schemas are set up as you would expect for most relational databases with various table data being linked back to the user via FKs on the user_id.
I'm trying to learn more about microservices and am trying to migrate my Python API to a microservice architecture. I have a reasonable understanding of how I'm going to break up the larger app into smaller parts; however, I'm not entirely clear on how I'm supposed to deal with the data side of things.
I understand that one single large DB is against general design principles of microservices but I'm not clear on what the alternative would be.
My biggest concern is cascading deletes across the individual databases that would hold microservice data. In a simple relational DB, I can just cascade on delete and the DB will handle the work across the various tables. In the case of microservices, how would that work? Would I need a separate service that handles deleting user data across the other services' DBs?
I don't really understand how I would migrate a traditional application with a relational DB to a microservice architecture.
EDIT:
To clarify - a specific architectural/design problem I'm facing is as follows:
I have split up my application into a few microservices. The ones that, in my mind, are still relational are:
Geolocation - A service that checks geometry data, records it in PostGIS, and returns certain information. Its primary purpose is to record the location of a particular user for later reference.
Image - A simple upload service to upload images and store their metadata in the DB.
Load-Image - A simple service that returns a random set of images based on parameters such as location and user profile data such as Age, Gender, etc.
Profile - A service that simply manages user data such as Age, Gender, etc.
Normally, these would each have a table in one larger DB rather than their own individual DBs. Filtering images by, say, location and age is a very simple JOIN and filter.
How would something like that work in a microservice architecture? If the data is held in entirely different DBs, how would I set up the logic to filter it? I could duplicate data that doesn't change often, like profile info, and add it to a MongoDB document that would contain image data including user_id and profile data. However, location data can change regularly, and constant updates don't sound practical.
What would be the best approach? Or should I stick with a shared RDBMS for just those few services?
It comes down to the duplication of data, why we want it, and how we manage it.
Early in our careers we were taught about the duplication of data to make it redundant, for example in database replication or backups. We were also taught that data can be modelled in a relational manner, with constraints enforcing the integrity of the model. In fact, the integrity of the model is sacrosanct. Without integrity, how can you have consistency? The answer is that you can't. Kinda.
When you work with distributed systems and service orientation, you do so because you want to minimise interactions thereby reducing coupling between components. However, there is a cost to this. The more distributed your architecture, the less coupling it has, and the more duplication of data will be necessary. This is taken to an extreme with microservices, where effectively the same data may be present in many different places, in varying degrees of consistency.
Instead of being bad, however, in this context data duplication is an essential feature of your system. It is an enabler of an architectural style with many great benefits. Put another way, without duplication of data, you get less distribution, you get more coupling, which makes your system more expensive to build, own, and change.
So, now that we understand duplication of data and why we want it, let's move on to how we manage having lots of it. Let's try an example:
In a relational database, let's say we have a table called Customers, which contains a customer ID, and customer details, and another table called Orders which contains the order ID, customer ID, and the order details. Let's say we also have an ordering application, which needs to delete all the customer's orders if the customer is deleted for GDPR.
Because we are migrating our system to microservices, we decide to create a service called Customers.
So we create a service with the following operation:
DELETE /customers/{customerId} - deletes a customer
We create another service called Orders with the following operations:
GET /orders/customers/{customerId} - gets all the orders for a customer
DELETE /orders/{orderId} - deletes an order
We build a UX screen for deleting a customer. The UX first calls the orders service to get all the orders for the customer. Then it iterates over the list, calling the orders service to delete each order. Then it calls the customers service to delete the customer.
This example is very simplistic, but as you can see, there is no option but to orchestrate the "Delete Customer" operation from the caller, which in this case is the user interface. Of course, what would be a single atomic transaction in a database does not translate to multiple HTTPS calls, so it is possible that some of the calls fail, leaving the system as a whole in an inconsistent state. In that case the inconsistency would need to be resolved via some recovery mechanism.
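A rough sketch of that orchestration, with in-memory stand-ins for the two services. The class names and seed data are hypothetical; in a real system each method would be an HTTP request to the corresponding endpoint, and each call could fail independently.

```python
# Stand-in for the Orders service (GET /orders/customers/{id}, DELETE /orders/{id}).
class OrdersService:
    def __init__(self):
        self.orders = {"o1": "c1", "o2": "c1", "o3": "c2"}  # order_id -> customer_id

    def get_orders_for_customer(self, customer_id):
        return [oid for oid, cid in self.orders.items() if cid == customer_id]

    def delete_order(self, order_id):
        del self.orders[order_id]

# Stand-in for the Customers service (DELETE /customers/{id}).
class CustomersService:
    def __init__(self):
        self.customers = {"c1", "c2"}

    def delete_customer(self, customer_id):
        self.customers.discard(customer_id)

def delete_customer_everywhere(orders, customers, customer_id):
    # Each call can fail independently; a real implementation would need
    # retries or a compensating/recovery mechanism to restore consistency.
    for order_id in orders.get_orders_for_customer(customer_id):
        orders.delete_order(order_id)
    customers.delete_customer(customer_id)

orders = OrdersService()
customers = CustomersService()
delete_customer_everywhere(orders, customers, "c1")
```

If the process dies between deleting the orders and deleting the customer, the customer record survives with no orders; that is exactly the partial state the recovery mechanism must detect and repair.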
In a microservice architecture we have both options: database per service or a shared database. There are advantages and disadvantages to each pattern. Database per service is the best practice, but when the monolithic application has lots of functions, procedures, or database-specific features at the database level, the shared-database approach can be used. This is not best practice, so if you have the time and bandwidth you should go for database per service.
As your concern is cascading across individual databases: you need to remove cascading from the database, implement global transaction handling in your application, and execute all cascade-related queries within that transaction.
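A small illustration of moving the cascade out of the database and into application code. In-memory SQLite databases stand in for the per-service PostgreSQL databases here, and the schema and helper function are made up; the point is only that the dependent deletes are issued by the application, and that the two commits are not atomic together.

```python
import sqlite3

# Stand-ins for two separate service databases (SQLite purely for illustration;
# the original discussion is about separate PostgreSQL databases).
users_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
orders_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
users_db.execute("INSERT INTO users VALUES (1)")
orders_db.execute("INSERT INTO orders VALUES (10, 1), (11, 1)")

def delete_user(user_id):
    # No FK cascade is possible across databases, so the application issues
    # the dependent deletes itself, each in its own local transaction.
    try:
        orders_db.execute("DELETE FROM orders WHERE user_id = ?", (user_id,))
        orders_db.commit()
        users_db.execute("DELETE FROM users WHERE id = ?", (user_id,))
        users_db.commit()
    except Exception:
        # The two commits are not atomic together: a real system needs a
        # recovery or compensation step here, not just a rollback.
        orders_db.rollback()
        users_db.rollback()
        raise

delete_user(1)
```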

Google CloudSQL - Instance per DB or single instance for all DBs?

Trying to figure out what would be better:
Multiple instances, one per DB
or
Single large instance which will hold multiple DBs inside
The scenario is similar to Jira Cloud where each customer has his own Jira Cloud server, with its own DB.
Now the question is, will it be better to manage all of the users' DBs in 1 large instance, or to have a DB instance for each customer?
What would be the cons and pros for the chosen alternative?
The first thing that came to our minds is backup management - will we be able to recover a specific customer's DB if it resides on the same large instance as all the other DBs?
Similar question, but in a different scenario and other requirements - 1-big-google-cloud-sql-instance-2-small-google-cloud-sql-instances-or-1-medium
This answer is based on a personal opinion; it is up to you to decide how you want to build your database. However, it is better to go with multiple smaller Cloud SQL instances, as is also stated in the Cloud SQL > Best practices documentation.
PROS of multiple instances
It is easier to manage smaller instances than big ones (per the documentation linked above).
You can choose the region and zone for each instance, so if your customers are located in different geographical locations, you can always choose the zone closest to them for the Cloud SQL instance, and this way you will reduce latency.
Also, if you are planning to have a lot of databases, with a lot of tables in each database and a lot of records in each table, the instance will be huge. Therefore backups, creating read replicas or failover replicas, and maintaining them will take more and more time as the databases expand.
However, if a user has multiple databases, I would suggest keeping them all inside one Cloud SQL instance, so that you manage one instance per user. E.g. if you have 100 users, and User1 has 4 databases, User2 has 6 databases, etc., create 100 Cloud SQL instances rather than one instance per database; otherwise you will end up with a lot of them and it will be hard to manage multiple instances per user.

Creating a highly available and heavily used database

Currently, I have an application consisting of a backend, frontend, and database. The Postgres database has a table with around 60 million rows.
This table has a foreign key to another table, categories. So if I want to count (I know it's one of the slowest operations in a DB) every row in a specific category, on my current setup this results in a 5-minute query. Currently, the DB, backend, and frontend are just running on a VM.
I've now containerized the backend and the frontend and I want to spin them up in Google Kubernetes Engine.
So my question: will the performance of my queries go up if I also use a containerized DB and let Kubernetes do some load-balancing work, or should I use Google's Cloud SQL? Does anyone have experience with this?
will the performance of my queries go up if I also use a containerized DB
Raw performance will only go up if the capacity of the new nodes is larger than that of your current node. If you use the same machine as a Kubernetes node, it will not go up. You won't get benefits from containers in this case, other than that updating your DB software might be a bit easier if you run it in Kubernetes. There are many factors in play here, including what disk you use for your storage (SSD, magnetic, clustered filesystem?).
If your goal is to maximize resources in your cluster by making use of spare capacity when, say, not many queries are being sent to your database, then Kubernetes/containers might be a good choice. (But that's not what the original question is about.)
should I use Google's Cloud SQL
The only reason I would use Cloud SQL is if you want to offload managing your SQL DB. Other than that, you'll get similar performance numbers to running on the same-size instance on GCE.

Multi Tenant vs Single Tenant?

I am about to build a SaaS product using Rails and Postgres. I would like to know whether I should follow schema-level, subdomain-based multi-tenancy, or whether a single-tenant application is a good enough architecture.
My requirements have no data dependencies between clients, hence a schema-based multi-tenant architecture seems right to me. Could anyone please explain further why it is good or bad, with relevant reasoning?
Here's a post from the creators of the Apartment gem suggesting they would not use the schema-per-tenant approach in the future.
The end result of the above mentioned problems have caused us to mostly abandon our separate schemas approach to multi-tenancy. For all services we build going forward, we use a more traditional column scoped approach and have written our own wrappers that effectively mimic the per-request tenanting approach that Apartment gave us.
If you are deploying to Heroku, there is a warning about schema-per-tenant affecting performance of the managed backup tool:
The most common use case for using multiple schemas in a database is building a software-as-a-service application wherein each customer has their own schema. While this technique seems compelling, we strongly recommend against it as it has caused numerous cases of operational problems. For instance, even a moderate number of schemas (> 50) can severely impact the performance of Heroku’s database snapshots tool, PG Backups.
For maximum data segregation, a database-per-tenant approach is appropriate.
For the simplest operations, a tenant_id column on each table can be used to scope your queries, and this can be enforced with row-level security policies.
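A minimal sketch of the tenant_id scoping approach, using an in-memory SQLite table as a stand-in for Postgres (the table and data are made up). The commented SQL shows roughly what a PostgreSQL row-level security policy for the same table could look like.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE images (id INTEGER, tenant_id TEXT, url TEXT)")
db.executemany("INSERT INTO images VALUES (?, ?, ?)",
               [(1, "acme", "a.png"), (2, "acme", "b.png"), (3, "globex", "c.png")])

# Every query is scoped by tenant_id. In PostgreSQL you could additionally
# enforce this at the database level with a row-level security policy, e.g.:
#   ALTER TABLE images ENABLE ROW LEVEL SECURITY;
#   CREATE POLICY tenant_isolation ON images
#       USING (tenant_id = current_setting('app.tenant_id'));
def images_for_tenant(tenant_id):
    rows = db.execute("SELECT url FROM images WHERE tenant_id = ?", (tenant_id,))
    return [r[0] for r in rows]

print(images_for_tenant("acme"))
```

The application-side filter and the database-side policy are belt and braces: even if a query forgets the WHERE clause, the policy keeps one tenant from seeing another's rows.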

Data sharding based on schemas in PostgreSQL

I would like to develop a multi-tenant web application using a PostgreSQL DB, with the data of each tenant in a dedicated schema.
Each query or update will access only a single tenant schema and/or the public schema.
Assuming I will, at some point, need to scale out and have several PostgreSQL servers, is there some automatic way in which I can connect to a single load balancer of some sort that will redirect the queries/updates to the relevant server, based on the required schema?
The challenging part of this question is the 'automatic way'. I have a feeling that Postgres is moving that way; maybe 9.5 or later will have multi-master tendencies, with partitioning allowing data to be spread across a cluster so that your frontend doesn't have to change.
Assuming that your tenants can operate in separate databases, and you are looking for a way to run a query against the correct database, perhaps something like DNS could be used when connecting to the database, using the tenant ID as a component of the DNS host name. Something like:
tenant_1.example.com -> 192.168.0.10
tenant_2.example.com -> 192.168.0.11
tenant_3.example.com -> 192.168.0.11
etc.example.com -> 192.168.0.X
Then you could use the connection as a map to the correct DB installation. The tricky part here is the overlapping data that all tenants need access to. If that overlapping data needs to be joined against, it will have to exist in all databases, either copied or accessed via dblink. If the overlapping data needs to be updated, then doing that automatically is going to be tough. Good question.
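The same routing idea can also live in application code instead of DNS: a tenant-to-host map that builds the connection string for the right server. Everything below is a hypothetical sketch; the hosts, DB name, and use of search_path to select the tenant schema are assumptions.

```python
# Hypothetical routing table: which Postgres host holds each tenant's schema.
TENANT_HOSTS = {
    "tenant_1": "192.168.0.10",
    "tenant_2": "192.168.0.11",
    "tenant_3": "192.168.0.11",
}

def dsn_for(tenant_id):
    """Build a libpq-style connection string for the tenant's server,
    pinning search_path to the tenant's schema plus public."""
    host = TENANT_HOSTS.get(tenant_id)
    if host is None:
        raise KeyError(f"unknown tenant: {tenant_id}")
    # Credentials omitted; in practice they'd come from configuration.
    return f"host={host} dbname=app options='-c search_path={tenant_id},public'"

print(dsn_for("tenant_2"))
```

A connection pool per host (rather than per tenant) keeps the number of open connections bounded while still letting every query land on the right server and schema.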