Relational DB in microservices - postgresql

I have a monolithic application that currently uses a PostgreSQL DB and the schemas are set up as you would expect for most relational databases with various table data being linked back to the user via FKs on the user_id.
I'm trying to learn more about microservices and am trying to migrate my Python API to a microservice architecture. I have a reasonable understanding of how I'm going to break up the larger app into smaller parts; however, I'm not entirely clear on how I'm supposed to deal with the data side of things.
I understand that one single large DB is against general design principles of microservices but I'm not clear on what the alternative would be.
My biggest concern is cascading across individual databases that would hold microservice data. In a simple rdb, I can just cascade on delete and the DB will handle the work across the various tables. In the case of microservices, how would that work? Would I need to have a separate service that handles deleting user data across the other service DBs?
I don't really understand how I would migrate a traditional application with a relational DB to a microservice architecture.
EDIT:
To clarify - a specific architectural/design problem I'm facing is as follows:
I have split up my application into a few microservices. The ones that are in my mind still relational are:
Geolocation - A service that checks geometry data, records in PostGIS, and returns certain information. A primary purpose is to record the location of a particular user for referencing later
Image - A simple upload service to upload images and store meta data in the db.
Load-Image - A simple service that returns a random set of images based on parameters such as location, and user profile data such as Age, Gender, etc
Profile - A service that simply manages user data such as Age, Gender, etc
Normally, these services would each have a table in one larger db rather than their own individual dbs. Filtering images by, say, location and age is a very simple JOIN and filter.
How would something like that work in a microservice architecture? If the data is held in entirely different dbs, how would I set up the logic to filter the data? I could duplicate data that doesn't change often, like profile info, and add it to a MongoDB document that would contain image data including user_id and profile data. However, location data can change regularly, and constant updates don't sound practical.
What would be the best approach? Or should I stick with a shared RDBMS for just those few services?

It comes down to the duplication of data, why we want it, and how we manage it.
Early in our careers we were taught about the duplication of data to make it redundant, for example in database replication or backups. We were also taught that data can be modelled in a relational manner, with constraints enforcing the integrity of the model. In fact, the integrity of the model is sacrosanct. Without integrity, how can you have consistency? The answer is that you can't. Kinda.
When you work with distributed systems and service orientation, you do so because you want to minimise interactions thereby reducing coupling between components. However, there is a cost to this. The more distributed your architecture, the less coupling it has, and the more duplication of data will be necessary. This is taken to an extreme with microservices, where effectively the same data may be present in many different places, in varying degrees of consistency.
Instead of being bad, however, in this context data duplication is an essential feature of your system. It is an enabler of an architectural style with many great benefits. Put another way, without duplication of data, you get less distribution, you get more coupling, which makes your system more expensive to build, own, and change.
So, now we understand duplication of data and why we want it, let's move onto how we manage having lots of duplication. Let's try an example:
In a relational database, let's say we have a table called Customers, which contains a customer ID, and customer details, and another table called Orders which contains the order ID, customer ID, and the order details. Let's say we also have an ordering application, which needs to delete all the customer's orders if the customer is deleted for GDPR.
Because we are migrating our system to microservices, we decide to create a service called Customers.
So we create a service with the following operation:
DELETE /customers/{customerId} - deletes a customer
We create another service called Orders with the following operations:
GET /orders/customers/{customerId} - gets all the orders for a customer
DELETE /orders/{orderId} - deletes an order
We build a UX screen for deleting a customer. The UX first calls the orders service to get all the orders for the customer. Then it iterates over the list of orders, calling the orders service to delete each order. Then it calls the customers service to delete the customer.
This example is very simplistic, but as you can see, there is no option but to orchestrate the "Delete Customer" operation from the caller, which in this case is the user interface. Of course, what would be a single atomic transaction in a database does not translate to multiple HTTP(S) calls, so it is possible that some of the calls may not succeed, leaving the system as a whole in an inconsistent state. In this instance the inconsistency would need to be resolved via some recovery mechanism.
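To make the orchestration and the recovery step concrete, here is a minimal sketch. The two classes are in-memory stand-ins I made up for the Customers and Orders services (real ones would be HTTP clients), and simply retrying the whole sequence is just one possible recovery mechanism. It relies on each delete being idempotent, so repeating a half-finished run does no harm.

```python
class OrdersService:
    """In-memory stand-in for the Orders service (hypothetical)."""

    def __init__(self, orders):
        self.orders = dict(orders)  # order_id -> customer_id

    def get_orders_for_customer(self, customer_id):
        return [oid for oid, cid in self.orders.items() if cid == customer_id]

    def delete_order(self, order_id):
        # Idempotent: deleting an already-deleted order is not an error.
        self.orders.pop(order_id, None)


class CustomersService:
    """In-memory stand-in for the Customers service (hypothetical)."""

    def __init__(self, customers):
        self.customers = set(customers)

    def delete_customer(self, customer_id):
        self.customers.discard(customer_id)


def delete_customer(customer_id, orders_svc, customers_svc, max_attempts=3):
    """Orchestrate the delete from the caller.

    No transaction spans both services, so a failed call leaves the work
    half done. Because every step is idempotent, recovery here is simply
    to run the whole sequence again.
    """
    for _ in range(max_attempts):
        try:
            for order_id in orders_svc.get_orders_for_customer(customer_id):
                orders_svc.delete_order(order_id)
            # Delete the customer last, so a failure never leaves
            # orders pointing at a customer that no longer exists.
            customers_svc.delete_customer(customer_id)
            return True
        except ConnectionError:
            continue  # try again from the start
    return False  # still inconsistent: needs manual or scheduled cleanup
```

If all attempts fail, the function returns False and something else (a scheduled cleanup job, an alert) has to finish the work.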

In a microservice architecture we have both options: database per service or a shared database. Both patterns have advantages and disadvantages. Database per service is the best practice, but when the monolithic application relies on many functions, stored procedures, or database-specific features at the database level, the shared database approach can be used. I know this is not the best practice; if you have the time and bandwidth, you should go for database per service.
As your concern is cascading across individual databases, you need to remove cascading from the database, implement global transaction handling in your application, and execute all cascade-related queries within that transaction.

Related

Database for every microservice

I wonder if few microservices in the same project can share the same database? For example I'm using MongoDB and I'm using few microservices that need a database, should I use one collection for every microservice? Or just create new database for every microservice? Any suggestions?
It is typically recommended to have a database per service; however, there may be exceptions depending on the specific problem at hand. Your question does not have enough details to provide anything other than general advice.
The database per service has multiple advantages.
Since each microservice is independently deployable, it can change its database schema without breaking other services.
There is a clear boundary of data ownership: no two services write to the same data. If a service needs data owned by another service, there are patterns (materialized views, REST endpoints exposing the data, eventual consistency using queues) to solve that problem.
Depending on the volume of data generated or managed by one microservice, its database can be scaled (sharding, read replicas) independently.
Database per service has significant challenges too (the overhead of managing multiple services, no ACID transactions across services), so it is generally recommended to start with a modular monolith and carve out services based on business need.
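As a rough illustration of the "eventual consistency using queues" pattern mentioned above, the sketch below lets one service own the data and publish change events, while a second service keeps its own read-only copy of the fields it needs. All names here (EventQueue, ProfileService, ImageService, the event shape) are hypothetical, and the in-memory deque stands in for a real message broker.

```python
from collections import deque


class EventQueue:
    """In-memory stand-in for a message broker (hypothetical)."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def drain(self):
        # Hand out events in the order they were published.
        while self._events:
            yield self._events.popleft()


class ProfileService:
    """Owns profile data and is its only writer."""

    def __init__(self, queue):
        self._profiles = {}
        self._queue = queue

    def update_profile(self, user_id, age):
        self._profiles[user_id] = {"age": age}
        self._queue.publish(
            {"type": "profile_updated", "user_id": user_id, "age": age}
        )


class ImageService:
    """Keeps a local, read-only view of the profile fields it filters on."""

    def __init__(self):
        self.images = []        # list of (image_id, user_id)
        self.profile_view = {}  # user_id -> age, fed only by events

    def apply(self, event):
        if event["type"] == "profile_updated":
            self.profile_view[event["user_id"]] = event["age"]

    def images_for_min_age(self, min_age):
        # Users whose profile event has not arrived yet are excluded.
        return [
            image_id
            for image_id, user_id in self.images
            if self.profile_view.get(user_id, 0) >= min_age
        ]
```

Until the queued events are applied, the reading service answers from stale data; that window is the price of not sharing a database.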
References
https://learn.microsoft.com/en-us/dotnet/architecture/cloud-native/distributed-data

Sharding from application development point of view

I have read a lot about sharding, and what I understand is that it is a DB management concept. Coming at it from the application side, let's take an example: a Spring Boot microservice with a huge orders table that needs to be sharded on a shard key (K1) in the table.
Let's say I decide to shard on the K1 field using range-based sharding, across multiple nodes of my MySQL DB.
Now I have the following questions:
How is this sharding performed on existing data? Is it a background job?
What changes need to be made in my existing application, given that it currently connects to the first instance of the MySQL DB? When fetching data based on my shard key, how can the application decide which instance it needs to query?
With application-level sharding you have a lot of options, because as the application developer/architect you have full control over it. Here is one idea that could lead you in the right direction:
"How is this sharding performed on existing data? Is it a background job?"
I guess by this you mean how do I separate or migrate the data from one Database to another database shard?
Background job. Yes having a background job is an option. With this background job you can move the data from one Db-shard to another Db-shard.
Migration Script. You can also write a migration script on your database level(SQL script) which will migrate all the data to other Db-shard.
With both of these options you have to consider whether your system needs to stay running and operational the whole time, or whether you can live with downtime.
If it must stay operational, this is more challenging, because you have to keep serving traffic while you migrate. Doing it outside business hours helps, as does migrating in chunks, key by key. Still, this depends on your business.
If you can afford downtime, it is much easier to move the data to the appropriate shards based on key, because you do not have to deal with a running system and data mismatches during the move. If you can do it this way, it is the easier route.
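A background job that moves data "in chunks, key by key" could look roughly like the sketch below. It uses Python's built-in sqlite3 purely as a stand-in for the MySQL shards so that it runs anywhere; the orders columns, the make_shard helper and the chunk size are my assumptions, and the "INSERT OR IGNORE" syntax is SQLite's (MySQL spells it "INSERT IGNORE").

```python
import sqlite3


def make_shard():
    """Create an in-memory database standing in for one shard (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, shard_key TEXT NOT NULL, details TEXT)"
    )
    return conn


def migrate_key(source, target, shard_key, chunk_size=500):
    """Move all rows for one shard key in small chunks; return rows moved."""
    moved = 0
    while True:
        rows = source.execute(
            "SELECT id, shard_key, details FROM orders "
            "WHERE shard_key = ? ORDER BY id LIMIT ?",
            (shard_key, chunk_size),
        ).fetchall()
        if not rows:
            return moved
        # Copy first and commit, then delete from the source. A crash in
        # between leaves a duplicate rather than a lost row, and the
        # OR IGNORE makes a re-run skip rows that were already copied.
        with target:
            target.executemany(
                "INSERT OR IGNORE INTO orders (id, shard_key, details) "
                "VALUES (?, ?, ?)",
                rows,
            )
        with source:
            source.executemany(
                "DELETE FROM orders WHERE id = ?", [(row[0],) for row in rows]
            )
        moved += len(rows)
```

Small chunks keep each transaction short, which matters if the system has to keep serving requests during the move. Writes that arrive for the key while it is being moved are not handled here; that part depends on your business rules.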
"What changes need to be made in my existing application, given that it currently connects to the first instance of the MySQL DB? When fetching data based on my shard key, how can the application decide which instance it needs to query?"
You have to provide that logic. Since it lives at the application level, you need to make that decision in code. In your data access code you need to know where to send your queries (or other SQL statements): Service-Db-Shard1 or Service-Db-Shard2.
For example, in your main instance (Db-Shard-1) you can keep one table called shards:
shards Table
shard_key | database_instance
key1 | Service-Db-Shard1
key2 | Service-Db-Shard2
The shards table
This table records where the data for each shard key can be found. For example, the data sharded on key2 can be found in Service-Db-Shard2. Depending on your architecture, you can put this table in one main/master shard (the preferred option, especially if you have read replicas to cover downtime of the main instance) or in all shards (not preferred, as it creates duplication). In addition, you can load this information into your microservice's cache on startup and reuse the cached values, so you do not have to read this table every time you execute an SQL statement against any other table.
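The startup cache could be as small as the following sketch. ShardDirectory and the load_rows callable are names I made up; in a real service load_rows would run something like a SELECT of shard_key and database_instance against the shards table.

```python
class ShardDirectory:
    """In-process cache of the shards table (illustrative sketch)."""

    def __init__(self, load_rows):
        # load_rows: callable returning (shard_key, database_instance) pairs.
        self._load_rows = load_rows
        self._cache = {}
        self.refresh()  # read the table once, at startup

    def refresh(self):
        """Reload the mapping, e.g. after keys were moved to a new instance."""
        self._cache = dict(self._load_rows())

    def instance_for(self, shard_key):
        try:
            return self._cache[shard_key]
        except KeyError:
            raise LookupError(
                f"no shard registered for key {shard_key!r}"
            ) from None
```

Every lookup after startup is a dictionary access. The catch is staleness: when you move a key to another instance, something has to call refresh() on every running service instance.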
The good thing about this is that you control it and can evolve it over time. For example, in the beginning, when you do not have much data, you can spread all your keys across 2 instances (to save money), and as the data grows you can increase the number of instances. Example:
shards Table
shard_key | database_instance
key1 | Service-Db-Shard1
key2 | Service-Db-Shard2
key3 | Service-Db-Shard1
key4 | Service-Db-Shard1
key5 | Service-Db-Shard2
Multiple shards in one instance
Doing it like this gives you the option to keep the data for multiple shard keys on the same instance, so you do not spend money on too many resources. Keep in mind this does not work well with every key type. For example, it could work quite well in a multi-customer/multi-tenant system where the data grows as the number of tenants grows. Usually not all tenants have the same amount of data, so giving each one a dedicated instance is not always the most efficient way to shard. This approach gives you that additional flexibility.
Keep the shard key column in every table
In addition, you want to add the shard key column to each of your tables so that you can identify what needs to be moved where. Even when your data is distributed across multiple shards (instances), you still might want to keep this column, because you might have multiple shard keys on the same instance and because it keeps the option open to migrate further (if needed).
Before executing sql statements
Before each SQL statement against your DB you will need to get the instance information from the "shards" table, and each SQL statement against your "orders" table, or any other sharded table, should contain the shard key in its filters.
Data Access layer
Consider the data access layer in your microservice code. This is a good example of why SOLID principles and a properly decoupled design of the data access classes/modules help you implement something like this more easily. If your data access code is done well, it is much easier to adjust a couple of classes to add the extra step of finding the instance based on the key and to include the key in each query.
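Putting the last two points together, a data access class along these lines keeps the routing in one place. This is only a sketch: sqlite3 stands in for MySQL, and OrdersRepository, open_shard and the table layout are names and a schema I assumed for illustration.

```python
import sqlite3


def open_shard():
    """In-memory database standing in for one MySQL instance (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "shard_key TEXT NOT NULL, details TEXT)"
    )
    return conn


class OrdersRepository:
    """Data access class that hides shard routing from its callers."""

    def __init__(self, shard_map, connections):
        self._shard_map = shard_map      # shard_key -> instance name
        self._connections = connections  # instance name -> DB-API connection

    def _connection_for(self, shard_key):
        # The extra step: find the instance before running any statement.
        return self._connections[self._shard_map[shard_key]]

    def add(self, shard_key, details):
        conn = self._connection_for(shard_key)
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO orders (shard_key, details) VALUES (?, ?)",
                (shard_key, details),
            )

    def find(self, shard_key):
        # The shard key is always part of the filter, because one
        # instance may hold the data of several shard keys.
        rows = self._connection_for(shard_key).execute(
            "SELECT details FROM orders WHERE shard_key = ? ORDER BY id",
            (shard_key,),
        )
        return [row[0] for row in rows]
```

Callers only ever pass the shard key; which instance is used stays an internal detail of the repository, so moving a key later means changing the map, not the callers.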
Conclusion
This was just to give you an Idea how you could approach this. There are many ways how you can do this. It will heavily depend on your domain, your current service structure, your data, its Architecture, the way you have your Infrastructure setup and db deployments and migration strategy.

Implementing multi tenant data structure using multiple schemas or by customerId table column

I am developing multi tenant store web application (software as a service) which will be used by many customers. I would like to use just one database. I would appreciate suggestions/feedback on how to go about this in the database:
Separate schemas for each customer. Whenever new customer signs up, I create separate schema.
Single schema with all the customers. And creating a CUSTOMER table with customerId that is referenced in all other tables (eg. orders, payments, etc). Whenever new customer signs up, I create an entry in CUSTOMER table.
In case you want to know what technologies are being used:
Postgres, Spring Boot MVC, REST, Maven, JPA.
Thanks.
There are major tradeoffs here. With customer id's your foreign keys become more complex (the customer id should probably be a part of every foreign key) and that means additional indexes. It also means you have to have some means of enforcing this restriction. The big issue is that bugs in your application can quite easily disclose material from other customers.
With multiple schemas you have an issue that you have many more tables and this can cause performance problems for pg_dump in particular. However with appropriate search paths it is a bit harder to compromise other clients' data. However this is harder to use with a connection pool.
In general I think the schema approach is better because you can always scale out by partitioning by customer set, and the better security is important. However it means you must have a good understanding of search_path and set it to a sensible value on every database transaction.
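To illustrate setting search_path on every transaction, here is a small sketch that only builds the statement; you would execute it as the first statement of each transaction with whatever driver you use. SET LOCAL lasts only until the end of the current transaction, which is what makes it workable with a connection pool. The function name, the naming rule for schemas and the trailing "public" entry are my assumptions.

```python
import re

# Assumed naming rule for tenant schemas: lower-case letters, digits and
# underscores. Rejecting anything else keeps untrusted input out of the SQL.
_SCHEMA_NAME = re.compile(r"[a-z_][a-z0-9_]*")


def search_path_statement(tenant_schema):
    """Build the statement to run first in each tenant transaction."""
    if not _SCHEMA_NAME.fullmatch(tenant_schema):
        raise ValueError(f"invalid schema name: {tenant_schema!r}")
    # SET LOCAL is undone at commit or rollback, so the setting cannot
    # leak to the next user of a pooled connection.
    return f'SET LOCAL search_path TO "{tenant_schema}", public'
```

Whether "public" belongs in the path depends on whether you keep shared tables or extensions there.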

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databases? For example, if I want to add a table or add a column to a table, I have to either write a program that loops through all databases and executes SQL against them or, using pgAdmin, go through them one by one.
Is there smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have the column TENANT and add this to the primary key. Then insert into this database every record for all tenants and drop the other databases (obviously considerably more work than this as your application will need to be changed).
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database then I'm afraid you have to alter them all individually, and doing it programmatically would be simplest.
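If you do go the programmatic route, the loop can be short. Below is a sketch under a few assumptions of mine: sqlite3 stands in for the per-tenant Postgres databases (with a Postgres driver the placeholder style would differ), and the schema_version bookkeeping table and the apply_patch name are made up for illustration.

```python
import sqlite3


def apply_patch(connections, version, statements):
    """Apply one versioned patch to every tenant database.

    connections maps a tenant name to an open DB-API connection.
    Returns the names of tenants where the patch failed, so they can
    be retried without touching the ones that already succeeded.
    """
    failed = []
    for name, conn in connections.items():
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS schema_version "
                    "(version INTEGER PRIMARY KEY)"
                )
                already_applied = conn.execute(
                    "SELECT 1 FROM schema_version WHERE version = ?",
                    (version,),
                ).fetchone()
                if already_applied:
                    continue  # makes re-running the whole loop safe
                for sql in statements:
                    conn.execute(sql)
                conn.execute(
                    "INSERT INTO schema_version (version) VALUES (?)",
                    (version,),
                )
        except sqlite3.Error:
            failed.append(name)
    return failed
```

Recording the applied version in each database is what lets you re-run the loop after a partial failure; a migration tool does the same bookkeeping for you.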
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when they evaluate these different approaches.
Isolation: Since you can put each tenant into its own database on one hand, and have tenants share the same table on the other, this becomes the most apparent dimension. If you provide your users raw SQL access or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row level security policies that make this less of a concern for most applications.
Extensibility: If your tenants are sharing the same schema (approach #3), and your tenants have fields that vary between them, then you need to think about how to merge these fields.
This article on multi-tenant databases has a great summary of different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in each column based on the tenant_id. PostgreSQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data.
Scaling: Another criterion is how easily your database would scale out. If you create a database or schema per tenant (#1 or #2 above), your application can make use of existing Ruby Gems or Django packages to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
It's time for me to give back to the community :) So after 4 years, our multi-tenant platform is in production and I would like to share the following observations/experiences with all of you.
We used a database per each tenant. This has given us extreme flexibility as the size of the databases in the backups are not huge and hence we can easily import them into our staging environment for customers issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of a learning curve here and there, but nothing substantial. Investing 2-5 days can save you significant time.
Given that we use Spring/JPA/Hibernate, we use a technique called Dynamic Data Source Routing. So when a user logs in, we find the related datasource with a lookup and connect their session to the right database. That's also when the Liquibase scripts get applied for updates.
That is it for now; I will come back with more later on.
Well, there are problems with one database for all tenants in our case for sure.
The backup file gets huge and becomes almost impractical to manage.
For troubleshooting, we need to restore customer's data in our dev env, we just use that customer's backup file and usually the file is not as big as if we were to use one database for all customers.
Again, Liquibase has been key in allowing to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So Liquibase, Liquibase and more Liquibase.
I also suspect that we would need a more powerful hardware to manage a huge database with large joins across millions of records vs much lighter database with much smaller queries.
In case of problems, the service doesn't go down for everyone; the impact is limited to one or a few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. One time we had one customer that didn't have their archiving active and their database size grew to over 3 GB. With offshore teams and slower internet as well as storage/bandwidth prices, one can see how things may become complicated very quickly.
Hope this helps someone.
--Rex

I'm accessing a MongoDB database using the repository pattern. Where should I check for data integrity?

I'm kind of new to mongodb and NoSQL data design in general.
I'm building a MongoDB database that will have some denormalized data. For example, my "User" documents contain a reference (just the id) to zero or more "Article" documents, and my Article documents contain references to zero or more users.
Since I'm using the repository pattern, no part of my Data Access Layer knows about both Articles AND Users. Where in my code should I check to make sure that all my documents are consistent with each other? Should I simply let the DAL's users' code do the checks?
Would it be a good idea to have a Data Integrity Script run once in a while to check if everything is consistent?
Here is Microsoft's write-up on the Repository Pattern. From that document:
Use a repository to separate the logic that retrieves the data and maps it to the entity model from the business logic that acts on the model.
You have a couple of questions:
Where in my code should I check to make sure that all my documents are consistent with each other?
Based on the statement above, I think it's clear that this logic belongs in the Repository. The relation between these objects only exists at the layer of "business logic", the database cannot enforce these types of rules.
Should I simply let the DAL's users code do the checks?
How could they? As the writer of the repository, you are the DAL user. For MongoDB, the DAL is basically the driver.
You could possibly write a wrapper around the driver that wraps the multiple writes in some form of transaction. But you would have to write this yourself on MongoDB versions before 4.0, which have no multi-document transactions; 4.0 and later support them on replica sets.
Would it be a good idea to have a Data Integrity Script run once in a while to check if everything is consistent?
At the end of the day, whoever writes the repository is going to be responsible for the integrity of the data. Such a script might be useful, but it would definitely suck a lot of CPU cycles.
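For what such a script would actually check, here is a minimal sketch. It works on plain dictionaries rather than on MongoDB itself; how you load the user and article references into that shape is up to your repositories, and the function name and the result format are my own.

```python
def find_inconsistencies(users, articles):
    """Report references that are not mirrored on the other side.

    users:    user_id -> set of article ids the user document references
    articles: article_id -> set of user ids the article document references
    """
    problems = []
    for user_id, article_ids in users.items():
        for article_id in article_ids:
            # Also catches references to articles that no longer exist.
            if user_id not in articles.get(article_id, set()):
                problems.append(("user", user_id, article_id))
    for article_id, user_ids in articles.items():
        for user_id in user_ids:
            if article_id not in users.get(user_id, set()):
                problems.append(("article", article_id, user_id))
    return sorted(problems)
```

Each tuple names the side holding the dangling reference, the document's id and the id it points to. The script reads every document on both sides, which is where the CPU cost comes from.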
My suggestion for N:M mappings is to start building some basic blocks for handling the multiple writes that are required to keep these two in sync. One idea is to Queue the changes and let a background job make the updates. This way you don't have to worry about multiple writes and roll-backs causing bad data.
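A very small sketch of the queue idea, with in-memory dictionaries standing in for the two collections. LinkQueue and its method names are hypothetical, and a real version would persist the queue so that pending changes survive a restart.

```python
from collections import deque


class LinkQueue:
    """Queue user/article links and apply both sides in a background job."""

    def __init__(self, users, articles):
        self.users = users        # user_id -> set of article ids
        self.articles = articles  # article_id -> set of user ids
        self._pending = deque()

    def request_link(self, user_id, article_id):
        # Callers only enqueue; they never write to either side directly.
        self._pending.append((user_id, article_id))

    def run_once(self):
        """Apply every queued link; return how many were processed."""
        processed = 0
        while self._pending:
            user_id, article_id = self._pending.popleft()
            # Both writes are idempotent set additions, so replaying a
            # change after a failure cannot produce bad data.
            self.users.setdefault(user_id, set()).add(article_id)
            self.articles.setdefault(article_id, set()).add(user_id)
            processed += 1
        return processed
```

Because the job is the only writer of these references, the two sides can only drift apart between runs, not permanently.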