Sharding from application development point of view - spring-data-jpa

I have read a lot about sharding, and as I understand it, it is a database management concept. I want to know how it looks from the application side. Take, for example, a Spring Boot microservice with a huge orders table that needs to be sharded on a shard key (K1) in the table.
Let's say I decide to use range-based sharding on the K1 field and spread the shards across multiple nodes of my MySQL DB.
Now I have the following questions:
How is this sharding performed on existing data? Is it a background job?
What changes need to be made in my existing application, given that it currently connects to the first instance of the MySQL DB? When fetching data by my shard key, how can the application decide which instance to query?

With application-level sharding you have a lot of options, because as the application developer/architect you have full control over it. Here is one idea that could lead you in the right direction:
How this sharding is performed in existing data. Is it a background job?
I guess by this you mean: how do I separate or migrate the data from one database to another database shard?
Background job. Yes, a background job is an option. With it you can move the data from one DB shard to another.
Migration script. You can also write a migration script at the database level (an SQL script) that migrates all the data to the other DB shard.
With both of these options you have to decide whether your system needs to stay running and operational the whole time, or whether you can live with downtime.
If it has to stay operational, this is more challenging, because you are migrating while the system is serving traffic. Doing it outside business hours helps, as does doing it in chunks, key by key. Still, this depends on your business.
If you can afford downtime, it is much easier to separate the data into the appropriate shards by key. You do not have to account for a running system or for data mismatches during the move. So if you can do it this way, it will be much easier.
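The "in chunks, key by key" idea can be sketched roughly as follows. This is only an illustration under assumptions: SQLite in-memory databases stand in for the two MySQL instances, and the orders columns (id, shard_key, amount) and the function name migrate_key_in_chunks are made up for the example.

```python
import sqlite3


def migrate_key_in_chunks(source, target, shard_key, chunk_size=2):
    """Move all rows of one shard key from source to target, chunk by chunk."""
    moved = 0
    while True:
        rows = source.execute(
            "SELECT id, shard_key, amount FROM orders "
            "WHERE shard_key = ? ORDER BY id LIMIT ?",
            (shard_key, chunk_size),
        ).fetchall()
        if not rows:
            return moved
        # Write to the target first: a crash between the two steps then
        # leaves a duplicate row (recoverable) instead of a lost row.
        target.executemany(
            "INSERT OR REPLACE INTO orders (id, shard_key, amount) VALUES (?, ?, ?)",
            rows,
        )
        target.commit()
        source.executemany(
            "DELETE FROM orders WHERE id = ?", [(row[0],) for row in rows]
        )
        source.commit()
        moved += len(rows)


def make_db():
    # Hypothetical schema, used only for this sketch.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, shard_key TEXT, amount REAL)"
    )
    return db


shard1, shard2 = make_db(), make_db()
shard1.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "key1", 10.0), (2, "key2", 20.0), (3, "key2", 30.0), (4, "key2", 40.0)],
)
shard1.commit()

moved = migrate_key_in_chunks(shard1, shard2, "key2")
```

Small chunks keep each transaction short, which matters most when the system has to stay online during the move. The sketch does not handle writes that arrive for the key while it is being moved; that part depends on your business rules.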
What are the changes need to done in my existing application as currently its connecting to first Instance of MySQL db. while fetching data based on my shard key how can this application decided from which instance It need to request?
You have to provide that logic. Since the sharding is at your application level, you need to make that decision in code. Your data access code needs to know where to send its queries (or other SQL statements): Service-Db-Shard1 or Service-Db-Shard2.
For example, in your main instance (Instance1, Db-Shard-1) you can keep one table called shards:
shards Table
shard_key    database_instance
key1         Service-Db-Shard1
key2         Service-Db-Shard2
The shards table
This table records where the data for each shard key can be found. So the data sharded under key2 can be found in Service-Db-Shard2. Depending on your architecture, you can put this table in one main/master shard (the preferred option, especially if you have read replicas to cover downtime of the main instance) or in all shards (not preferred, as it creates duplication). In addition, you can load this information into your microservice's cache on startup and reuse it from there, so you do not have to read the table every time you execute an SQL statement against any other table.
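A minimal sketch of the "read once on startup, then serve from cache" idea. It assumes the shards table has exactly the two columns shown above; the class name ShardDirectory is hypothetical and SQLite stands in for the main MySQL instance.

```python
import sqlite3


class ShardDirectory:
    """In-memory copy of the shards table, loaded once at startup."""

    def __init__(self, main_db):
        rows = main_db.execute(
            "SELECT shard_key, database_instance FROM shards"
        ).fetchall()
        self._instances = dict(rows)  # shard_key -> database_instance

    def instance_for(self, shard_key):
        try:
            return self._instances[shard_key]
        except KeyError:
            raise LookupError(f"no shard registered for key {shard_key!r}") from None


# Stand-in for the main instance that holds the shards table.
main_db = sqlite3.connect(":memory:")
main_db.execute(
    "CREATE TABLE shards (shard_key TEXT PRIMARY KEY, database_instance TEXT NOT NULL)"
)
main_db.executemany(
    "INSERT INTO shards VALUES (?, ?)",
    [("key1", "Service-Db-Shard1"), ("key2", "Service-Db-Shard2")],
)
main_db.commit()

directory = ShardDirectory(main_db)
```

After startup, directory.instance_for("key2") answers from memory without touching the database. If you later change the mapping, the cache has to be refreshed somehow (restart, timer, or an explicit reload); the sketch leaves that out.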
The good thing about this is that you control it and can evolve it over time. For example, in the beginning, when you do not have much data, spread all your keys across 2 instances (to save money), and as the data grows you can increase the number of instances. Example:
shards Table
shard_key    database_instance
key1         Service-Db-Shard1
key2         Service-Db-Shard2
key3         Service-Db-Shard1
key4         Service-Db-Shard1
key5         Service-Db-Shard2
Multiple shards in one instance
Doing it like this lets you keep the data of multiple shard keys on the same instance, so you do not spend money on too many resources. Keep in mind that this does not work well with every key type. It can work quite well, for example, in a multi-customer/tenant system where the data grows as the number of tenants grows. Usually not all tenants have the same amount of data, so giving each one a dedicated instance is not always the most efficient way to shard. This approach gives you that additional flexibility.
Keep the shard key column in every table
In addition, you want to add the shard key column to each of your tables so that you can identify what needs to be moved where. Even when your data is distributed across multiple shards (instances), you still might want to keep this column, because you might have multiple shard keys on the same instance and because it keeps the option of migrating further (if needed).
Before executing sql statements
Before each SQL statement against your DB you need to get the instance information from the "shards" table, and each SQL statement against your "orders" table (or any other sharded table) should contain the shard key in its filters.
Data Access layer
Consider the data access layer in your microservice code. This is a good example of why SOLID principles and a properly designed, loosely coupled data access layer help you implement something like this more easily. If your data access code is done well, it is much easier to adjust a couple of classes to add the extra step of finding the instance by key and to include the key in each query.
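To make that concrete, here is a rough sketch of a data access class that does both steps in one place: it picks the instance for the key and always filters by the shard key. The names (OrderRepository, the orders columns) are assumptions for illustration, and SQLite in-memory databases stand in for the MySQL instances. In a Spring application the same role would be played by your repository or DAO classes.

```python
import sqlite3


class OrderRepository:
    """Routes every statement to the instance that owns the shard key."""

    def __init__(self, shard_of_key, connections):
        self._shard_of_key = shard_of_key  # shard_key -> instance name
        self._connections = connections    # instance name -> connection

    def _connection_for(self, shard_key):
        return self._connections[self._shard_of_key[shard_key]]

    def add(self, shard_key, order_id, amount):
        db = self._connection_for(shard_key)
        db.execute(
            "INSERT INTO orders (id, shard_key, amount) VALUES (?, ?, ?)",
            (order_id, shard_key, amount),
        )
        db.commit()

    def find_by_key(self, shard_key):
        db = self._connection_for(shard_key)
        # The shard key stays in the filter, because one instance may
        # hold the data of several shard keys.
        return db.execute(
            "SELECT id, amount FROM orders WHERE shard_key = ? ORDER BY id",
            (shard_key,),
        ).fetchall()


def make_shard():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, shard_key TEXT NOT NULL, amount REAL)"
    )
    return db


connections = {"Service-Db-Shard1": make_shard(), "Service-Db-Shard2": make_shard()}
shard_of_key = {
    "key1": "Service-Db-Shard1",
    "key2": "Service-Db-Shard2",
    "key3": "Service-Db-Shard1",
}

repo = OrderRepository(shard_of_key, connections)
repo.add("key1", 1, 10.0)
repo.add("key2", 2, 20.0)
repo.add("key3", 3, 5.0)
key3_orders = repo.find_by_key("key3")
```

Here key1 and key3 share an instance, yet find_by_key("key3") returns only the key3 order, because the filter is part of every query. The rest of the service never needs to know which instance was used.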
Conclusion
This was just to give you an idea of how you could approach this. There are many ways to do it. The right one will depend heavily on your domain, your current service structure, your data, your architecture, the way your infrastructure and DB deployments are set up, and your migration strategy.

Related

Relational DB in microservices

I have a monolithic application that currently uses a PostgreSQL DB and the schemas are set up as you would expect for most relational databases with various table data being linked back to the user via FKs on the user_id.
I'm trying to learn more about microservices and am trying to migrate my Python API to a microservice architecture. I have a reasonable understanding of how I'm going to break up the larger app into smaller parts; however, I'm not entirely clear on how I'm supposed to deal with the data side of things.
I understand that one single large DB is against general design principles of microservices but I'm not clear on what the alternative would be.
My biggest concern is cascading across individual databases that would hold microservice data. In a simple rdb, I can just cascade on delete and the DB will handle the work across the various tables. In the case of microservices, how would that work? Would I need to have a separate service that handles deleting user data across the other service DBs?
I don't really understand how I would migrate a traditional application with a relational DB to a microservice architecture?
EDIT:
To clarify - a specific architectural/design problem I'm facing is as follows:
I have split up my application into a few microservices. The ones that are in my mind still relational are:
Geolocation - A service that checks geometry data, records in PostGIS, and returns certain information. A primary purpose is to record the location of a particular user for referencing later
Image - A simple upload service to upload images and store meta data in the db.
Load-Image - A simple service that returns a random set of images based on parameters such as location, and user profile data such as Age, Gender, etc
Profile - A service that simply manages user data such as Age, Gender, etc
Normally, these three items would have a table each in a larger db rather than their own individual dbs. Filtering images by say location and age is a very simple JOIN and filter.
How would something like that work in a microservice architecture? If the data is held in entirely different DBs, how would I set up the logic to filter the data? I could duplicate data that doesn't change often, like profile info, and add it to a MongoDB document that would contain image data including user_id and profile data. However, location data can change regularly, and constant updates don't sound practical.
What would be the best approach? Or should I stick with a shared RDBMS for just those few services?
It comes down to the duplication of data, why we want it, and how we manage it.
Early in our careers we were taught about the duplication of data to make it redundant, for example in database replication or backups. We were also taught that data can be modelled in a relational manner, with constraints enforcing the integrity of the model. In fact, the integrity of the model is sacrosanct. Without integrity, how can you have consistency? The answer is that you can't. Kinda.
When you work with distributed systems and service orientation, you do so because you want to minimise interactions thereby reducing coupling between components. However, there is a cost to this. The more distributed your architecture, the less coupling it has, and the more duplication of data will be necessary. This is taken to an extreme with microservices, where effectively the same data may be present in many different places, in varying degrees of consistency.
Instead of being bad, however, in this context data duplication is an essential feature of your system. It is an enabler of an architectural style with many great benefits. Put another way, without duplication of data, you get less distribution, you get more coupling, which makes your system more expensive to build, own, and change.
So, now we understand duplication of data and why we want it, let's move onto how we manage having lots of duplication. Let's try an example:
In a relational database, let's say we have a table called Customers, which contains a customer ID, and customer details, and another table called Orders which contains the order ID, customer ID, and the order details. Let's say we also have an ordering application, which needs to delete all the customer's orders if the customer is deleted for GDPR.
Because we are migrating our system to microservices, we decide to create a service called Customers.
So we create a service with the following operation:
DELETE /customers/{customerId} - deletes a customer
We create another service called Orders with the following operations:
GET /orders/customers/{customerId} - gets all the orders for a customer
DELETE /orders/{orderId} - deletes an order
We build a UX screen for deleting a customer. The UX first calls the orders service to get all the orders for the customer. Then it iterates over the list of orders, calling the orders service to delete the order. Then it calls the customers service to delete the user.
This example is very simplistic, but as you can see, there is no option but to orchestrate the "Delete Customer" operation from the caller, which in this case is the user interface. Of course, what would be a single atomic transaction in a database does not translate to multiple HTTP/s calls, so it is possible that some of the calls may not succeed, leaving the system as a whole in an inconsistent state. In this instance the inconsistency would need to be resolved via some recovery mechanism.
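The orchestration and one possible recovery mechanism (retrying an idempotent delete) can be sketched like this. The two classes are in-memory stand-ins for the Orders and Customers services, not real HTTP clients, and the fail_on set exists only to simulate a failed call.

```python
class OrdersService:
    """In-memory stand-in for the Orders service."""

    def __init__(self, orders):
        self.orders = dict(orders)  # order_id -> customer_id
        self.fail_on = set()        # order ids whose delete fails (simulated outage)

    def get_orders_for_customer(self, customer_id):
        return [oid for oid, cid in self.orders.items() if cid == customer_id]

    def delete_order(self, order_id):
        if order_id in self.fail_on:
            raise ConnectionError(f"could not delete order {order_id}")
        self.orders.pop(order_id, None)  # idempotent: deleting twice is harmless


class CustomersService:
    """In-memory stand-in for the Customers service."""

    def __init__(self, customers):
        self.customers = set(customers)

    def delete_customer(self, customer_id):
        self.customers.discard(customer_id)


def delete_customer(customer_id, orders_svc, customers_svc):
    """Orchestrate the delete; return the order ids that could not be deleted."""
    failed = []
    for order_id in orders_svc.get_orders_for_customer(customer_id):
        try:
            orders_svc.delete_order(order_id)
        except ConnectionError:
            failed.append(order_id)
    # Delete the customer only after all of their orders are gone, so a
    # partial failure can be retried later from the same starting point.
    if not failed:
        customers_svc.delete_customer(customer_id)
    return failed


orders_svc = OrdersService({1: "c1", 2: "c1", 3: "c2"})
customers_svc = CustomersService({"c1", "c2"})

orders_svc.fail_on = {2}  # the call for order 2 fails on the first attempt
first_attempt = delete_customer("c1", orders_svc, customers_svc)

orders_svc.fail_on = set()  # the outage is over, so the caller retries
second_attempt = delete_customer("c1", orders_svc, customers_svc)
```

Between the two attempts the system is in exactly the inconsistent state described above: order 1 is gone, order 2 and the customer still exist. Because each step is safe to repeat, running the whole operation again is enough to converge.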
In a microservice architecture we have both options: a database per service or a shared database. Both patterns have advantages and disadvantages. Database per service is the best practice, but when the monolithic application relies on many functions, procedures, or database-specific features at the database level, the shared database approach can be used. I know this is not the best practice, so if you have the time and bandwidth, you should go for database per service.
As your concern is cascading across individual databases, you need to remove the cascading from the database, implement global transaction handling in your application, and execute all the cascade-related queries within that transaction.

How does pglogical-2 handle logical replication on same table while allowing it to be writeable on both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writeable (in terms of data for INSERTs, UPDATEs, and DELETEs) at both the external database and the internal database. Hence at both databases, the data in the quotations and quotation_line_items table are writeable and should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables on both databases. I am okay with synchronous replication, so a bit of delay (say, a couple of seconds for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell based on the readme which one would be acting as publisher and subscriber and how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you want to prevent data from being modified in a conflicting fashion in the two databases? If you say that that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
It is possible with BDR replication, which uses pglogical. At a basic level, it works by allocating ranges of key IDs to each node, so that writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.
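The key-range idea itself is simple and can be illustrated without any replication product. The sketch below is not BDR's or pglogical's API; the class name and the ranges are made up. Each node hands out IDs only from its own range, so rows inserted on different nodes can never collide on the primary key.

```python
class RangeIdAllocator:
    """Hands out ids from a fixed, node-specific range (inclusive bounds)."""

    def __init__(self, start, end):
        self._next = start
        self._end = end

    def next_id(self):
        if self._next > self._end:
            raise RuntimeError("id range exhausted, allocate a new range")
        value = self._next
        self._next += 1
        return value


# Hypothetical ranges: one per database node.
internal = RangeIdAllocator(1, 1_000_000)
external = RangeIdAllocator(1_000_001, 2_000_000)

ids = [internal.next_id(), internal.next_id(), external.next_id()]
```

Note that this only prevents conflicts between inserts. If both sides update or delete the same existing row, you still have a conflict that something has to resolve.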

Asp Net Boilerplate - Setup Schema-Per-Tenant Multitenancy (EntityFrameworkCore & PostgreSQL)

We are looking into using Asp Net Boilerplate. Looks very promising. We love the framework, but we would like to be able to use a per-schema Multitenancy configuration. Instead of sharing the data in the same db & tables, each tenant would "have" a schema, in which the whole database structure would be replicated.
One of our data tables will be quite big (sometimes +1 million entries / tenant), and we were advised that for performance reasons, it's better to keep the number of entries as low as possible. Also, this particular table will be queried & inserted a lot. It would be unrealistic that this table would hold data for 40+ tenants. For that reason, and others, we would prefer to have a distinct schema per tenant.
Our DB is a single PostgreSQL server (might scale up to more in the future). We use EntityFramework & Npgsql. We already noticed that it is possible to set up a different ConnectionString for specific tenants that would have bigger data requirements.
http://www.summa.com/blog/2013/09/17/approaches-to-multi-tenancy See separate schema per tenant
Any idea on how to achieve schema-per-tenant multitenancy? There are a lot of moving parts in this, and I'm not sure where to start.

(How) Is it possible to convert tables into foreign tables in Postgres?

We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything on how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now, which should become our first "shard". In the future we will regularly add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware, since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
No way to use alter table to do this.
You really have to basically do it manually. This is no different (really) than doing table partitioning. You create your partitions, you load the data. You direct reads and writes to the partitions.
Now in your case, in terms of doing sharding there are a number of tools I would look at to make this less painful. First, if you make sure your tables are split the way you like them first, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management of standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to know how well it can manage changing shard criteria). However pretty much anything is possible with a little work and knowledge.

Postgres Multi-tenant administration/maintenance

We have a SaaS application where each tenant has its own database in Postgres. How would I apply a patch to all the databases? For example, if I want to add a table or add a column to a table, I have to either write a program that loops through all the databases and executes SQL against them or, using pgAdmin, go through them one by one.
Is there smarter and/or faster way?
Any help is greatly appreciated.
Yes, there's a smarter way.
Don't create a new database for each tenant. If everything is in one database then you only need to alter one database.
Pick one database, alter each table to have the column TENANT and add this to the primary key. Then insert into this database every record for all tenants and drop the other databases (obviously considerably more work than this as your application will need to be changed).
The differences with your approach are extensively discussed elsewhere:
What problems will I get creating a database per customer?
What are the advantages of using a single database for EACH client?
Multiple schemas versus enormous tables
Practicality of multiple databases per client vs one database
Multi-tenancy - single database vs multiple database
If you don't put everything in one database, then I'm afraid you have to alter them all individually, and doing it programmatically would be simplest.
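Doing it programmatically can be as small as the loop below. This is a sketch under assumptions: SQLite in-memory databases stand in for the per-tenant Postgres databases, and the items table and tenant names are invented. The one design point worth copying is that a failure for one tenant is recorded instead of stopping the loop, so you know afterwards which databases still need attention.

```python
import sqlite3


def apply_patch(connections, statements):
    """Run the same statements against every tenant database.

    Returns {tenant: None} on success or {tenant: error message} on failure.
    """
    results = {}
    for tenant, db in connections.items():
        try:
            for statement in statements:
                db.execute(statement)
            db.commit()
            results[tenant] = None
        except sqlite3.Error as exc:
            # Record the failure and keep going with the next tenant.
            results[tenant] = str(exc)
    return results


def make_tenant_db(with_note_column=False):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
    if with_note_column:
        db.execute("ALTER TABLE items ADD COLUMN note TEXT")
    return db


tenants = {
    "tenant_a": make_tenant_db(),
    "tenant_b": make_tenant_db(),
    # This tenant was already patched by hand, so the ALTER will fail here.
    "tenant_c": make_tenant_db(with_note_column=True),
}

results = apply_patch(tenants, ["ALTER TABLE items ADD COLUMN note TEXT"])
failed = [tenant for tenant, error in results.items() if error is not None]
```

In practice you would also want to record which patches each database has already received, so that reruns skip them. That bookkeeping is what schema migration tools do for you.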
At a higher level, all multi-tenant applications follow one of three approaches:
One tenant's data lives in one database,
One tenant's data lives in one schema, or
Add a tenant_id / account_id column to your tables (shared schema).
I usually find that developers use the following criteria when they evaluate these different approaches.
Isolation: Since you can put each tenant into its own database on the one hand, and have tenants share the same table on the other, this becomes the most apparent dimension. If you provide your users raw SQL access or you're in a regulated industry such as healthcare, you may need strict guarantees from your database. That said, PostgreSQL 9.5 comes with row level security policies that make this less of a concern for most applications.
Extensibility: If your tenants are sharing the same schema (approach #3), and your tenants have fields that vary between them, then you need to think about how to merge these fields.
This article on multi-tenant databases has a great summary of different approaches. For example, you can add a dozen columns, call them C1, C2, and so forth, and have your application infer the actual data in each column based on the tenant_id. PostgreSQL 9.4 comes with JSONB support and natively allows you to use semi-structured fields to express variations between different tenants' data.
Scaling: Another criterion is how easily your database would scale out. If you create a tenant per database or schema (#1 or #2 above), your application can make use of existing Ruby Gems or Django packages to simplify app integration. That said, you'll need to manually manage your tenants' data and the machines they live on. Similarly, you'll need to build your own sharding logic to propagate foreign key constraints and ALTER TABLE commands.
With approach #3, you can use existing open source scaling solutions, such as Citus. For example, this blog post describes how to easily shard a multi-tenant app with Postgres.
it's time for me to give back to the community :) So after 4 years, our multi-tenant platform is in production and I would like to share the following observations/experiences with all of you.
We used a database per each tenant. This has given us extreme flexibility as the size of the databases in the backups are not huge and hence we can easily import them into our staging environment for customers issues.
We use Liquibase for database development and upgrades. This has been a tremendous help to us, allowing us to package the entire build into a simple war file. All changes are easily versioned and managed very efficiently. There is a bit of a learning curve here and there, but nothing substantial. Investing 2-5 days can save you significant time.
Given that we use Spring/JPA/Hibernate, we use a technique called Dynamic Data Source Routing. So when a user logs-in, we find the related datasource with a lookup and connect them to the session to the right database. That's also when the Liquibase scripts get applied for updates.
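In Spring, routing like this is commonly built on AbstractRoutingDataSource, which picks a target data source from a lookup key bound to the current request; whether that is exactly what is used here is an assumption. The idea can be sketched in a language-neutral way as follows, with a context variable holding the current tenant. All names are hypothetical and SQLite stands in for the tenant databases.

```python
import contextvars
import sqlite3

# Holds the tenant of the current request/session; unset means "not logged in".
current_tenant = contextvars.ContextVar("current_tenant", default=None)


class RoutingDataSource:
    """Returns the connection that belongs to the tenant bound to the context."""

    def __init__(self, connections):
        self._connections = connections  # tenant name -> connection

    def connection(self):
        tenant = current_tenant.get()
        if tenant is None:
            raise RuntimeError("no tenant bound to the current context")
        return self._connections[tenant]


def make_tenant_db(company):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE settings (company TEXT)")
    db.execute("INSERT INTO settings VALUES (?)", (company,))
    db.commit()
    return db


source = RoutingDataSource(
    {"acme": make_tenant_db("Acme"), "globex": make_tenant_db("Globex")}
)

# What a login handler would do after looking up the user's tenant.
current_tenant.set("globex")
company = source.connection().execute("SELECT company FROM settings").fetchone()[0]
```

The code that runs queries only ever asks for source.connection(), so it stays the same no matter how many tenant databases exist.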
That's it for now; I will come back with more later on.
Well, there are problems with one database for all tenants in our case for sure.
The backup file gets huge and becomes almost impractical to manage.
For troubleshooting, we need to restore customer's data in our dev env, we just use that customer's backup file and usually the file is not as big as if we were to use one database for all customers.
Again, Liquibase has been key in allowing to manage updates across all the tenants seamlessly and without any issues. Without Liquibase, I can see lots of complications with this approach. So Liquibase, Liquibase and more Liquibase.
I also suspect that we would need more powerful hardware to manage a huge database with large joins across millions of records, versus much lighter databases with much smaller queries.
In case of problems, the service doesn't go down for everyone; the impact is limited to one or a few tenants.
In general, for our purposes, this has been a great architectural decision and we are benefiting from it every day. One time we had one customer that didn't have their archiving active and their database size grew to over 3 GB. With offshore teams and slower internet as well as storage/bandwidth prices, one can see how things may become complicated very quickly.
Hope this helps someone.
--Rex