I need to create two data sources in Spring, one pointing to a read replica and one to the primary DB. Both databases are identical, since the replica is just a copy of the primary, so they use exactly the same entities. Once I create two data sources, I can no longer use the @EntityScan annotation. Before, with @EntityScan on my single data source, my database connections worked great. With two data sources and a custom entity manager for each, I can still connect to both databases fine, but all my SQL queries fail. It is as if the schemas are now incorrect: I get errors saying columns don't exist when I know they do. Are there resources on how to achieve this design? I have been trying to follow this example:
https://www.baeldung.com/spring-boot-configure-multiple-datasources
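For context, the two-factory setup I mean looks roughly like this (a minimal sketch, not my actual code; package names and property prefixes are placeholders, and the explicitly re-applied naming strategy reflects the fact that Spring Boot's JPA defaults are not applied to manually built factories, which may be exactly what makes column names stop matching):

```java
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.boot.orm.jpa.EntityManagerFactoryBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean;

@Configuration
public class TwoDataSourceConfig {

    @Bean
    @Primary
    @ConfigurationProperties("app.datasource.primary") // placeholder prefix
    public DataSource primaryDataSource() {
        return DataSourceBuilder.create().build();
    }

    @Bean
    @ConfigurationProperties("app.datasource.replica") // placeholder prefix
    public DataSource replicaDataSource() {
        return DataSourceBuilder.create().build();
    }

    @Bean
    @Primary
    public LocalContainerEntityManagerFactoryBean primaryEntityManagerFactory(
            EntityManagerFactoryBuilder builder) {
        return builder.dataSource(primaryDataSource())
                // Both factories scan the SAME entity package, replacing @EntityScan.
                .packages("com.example.domain") // placeholder package
                .persistenceUnit("primary")
                .properties(jpaProperties())
                .build();
    }

    @Bean
    public LocalContainerEntityManagerFactoryBean replicaEntityManagerFactory(
            EntityManagerFactoryBuilder builder) {
        return builder.dataSource(replicaDataSource())
                .packages("com.example.domain")
                .persistenceUnit("replica")
                .properties(jpaProperties())
                .build();
    }

    private Map<String, Object> jpaProperties() {
        Map<String, Object> props = new HashMap<>();
        // Boot's camelCase-to-snake_case naming strategy is NOT applied to
        // manually built factories; omitting it is a classic source of
        // "column does not exist" errors.
        props.put("hibernate.physical_naming_strategy",
                "org.springframework.boot.orm.jpa.hibernate.SpringPhysicalNamingStrategy");
        return props;
    }
}
```

(The two transaction managers, one per factory, are omitted for brevity.)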
I have read a lot about sharding; as I understand it, it is a database management concept. Coming at it from the application side, let's take the example of a Spring Boot microservice with a huge orders table that needs to be sharded on a shard key (K1).
Let's say I decide to shard on the K1 field using range-based sharding, across multiple nodes of my MySQL database.
Now I have the following questions:
How is this sharding performed on existing data? Is it a background job?
What changes need to be made in my existing application, which currently connects to a single MySQL instance? When fetching data based on the shard key, how can the application decide which instance to query?
With application-level sharding you have a lot of options, since as the application developer/architect you have full control. There are many ways to do this, but here is one idea that could lead you in the right direction:
How is this sharding performed on existing data? Is it a background job?
I guess by this you mean: how do I separate or migrate the data from one database to another database shard?
Background job. Yes, a background job is an option. Such a job moves the data from one DB shard to another.
Migration script. You can also write a migration script at the database level (an SQL script) that migrates the data to the other DB shards.
With both options you have to consider whether your system needs to stay running and operational the whole time. Can you live with downtime?
If it must stay operational, this is more challenging, as you have to migrate while staying live. Doing it outside business hours helps, as does doing it in chunks, key by key (see the sketch below). Still, this depends on your business.
If you can afford downtime, it is much easier to separate the data into the appropriate shards by key, because you do not have to worry about a running system or mismatches in the data. So if you can do it this way, it will be much simpler.
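A minimal sketch of the chunked background-job approach, assuming plain JDBC, a MySQL orders table that already carries a shard_key column, and made-up column names; batch size and error handling would need tuning for a real system:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

/** Chunked migration sketch: copy all orders for one shard key from the
 *  source shard to the target shard, batch by batch, then delete them at
 *  the source. Table and column names are assumptions. */
public class ShardMigrationJob {
    private static final int BATCH_SIZE = 1000;

    public void migrateKey(DataSource source, DataSource target, String shardKey) throws Exception {
        try (Connection src = source.getConnection();
             Connection dst = target.getConnection()) {
            boolean more = true;
            while (more) {
                // Work chunk by chunk so the job can be paused and resumed.
                more = copyBatch(src, dst, shardKey);
            }
        }
    }

    private boolean copyBatch(Connection src, Connection dst, String shardKey) throws Exception {
        try (PreparedStatement select = src.prepareStatement(
                 "SELECT id, shard_key, payload FROM orders WHERE shard_key = ? LIMIT ?");
             PreparedStatement insert = dst.prepareStatement(
                 "INSERT IGNORE INTO orders (id, shard_key, payload) VALUES (?, ?, ?)");
             PreparedStatement delete = src.prepareStatement(
                 "DELETE FROM orders WHERE id = ?")) {
            select.setString(1, shardKey);
            select.setInt(2, BATCH_SIZE);
            int copied = 0;
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    insert.setLong(1, rs.getLong("id"));
                    insert.setString(2, rs.getString("shard_key"));
                    insert.setString(3, rs.getString("payload"));
                    insert.addBatch();
                    delete.setLong(1, rs.getLong("id"));
                    delete.addBatch();
                    copied++;
                }
            }
            insert.executeBatch(); // write to the new shard first...
            delete.executeBatch(); // ...then remove from the old one
            return copied == BATCH_SIZE; // a full batch means more rows likely remain
        }
    }
}
```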
What changes need to be made in my existing application, which currently connects to a single MySQL instance? When fetching data based on the shard key, how can the application decide which instance to query?
You have to provide that logic yourself. Since the sharding lives at the application level, you make that decision in code: your data-access code needs to know where to send each query (or other SQL statement), Service-Db-Shard1 or Service-Db-Shard2.
For example, in your main instance (Db-Shard-1) you can have a table called shards:
shards table

shard_key | database_instance
----------|------------------
key1      | Service-Db-Shard1
key2      | Service-Db-Shard2
The shards table
This table contains the information about where each shard's data lives: data sharded on key2 can be found in Service-Db-Shard2. Depending on your architecture you can put this table in one main/master shard (the preferred option, especially if you have read replicas to cover downtime of the main instance) or in all shards (not preferred, as it creates duplication). In addition, you can cache this information in your microservice at startup and reuse the cached values, so you do not have to read this table every time you execute an SQL statement against any other table.
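A minimal sketch of that lookup-plus-cache idea, assuming plain JDBC and a registry of connection pools keyed by the instance names from the table; names are placeholders:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.sql.DataSource;

/** Shard resolver sketch: loads the shards table from the main instance
 *  once at startup and answers lookups from memory afterwards. */
public class ShardResolver {
    private final Map<String, DataSource> instancesByName; // "Service-Db-Shard1" -> pool
    private final Map<String, String> instanceNameByKey = new ConcurrentHashMap<>();

    public ShardResolver(DataSource mainInstance,
                         Map<String, DataSource> instancesByName) throws Exception {
        this.instancesByName = instancesByName;
        // Cache the whole shards table at startup instead of querying it per statement.
        try (Connection con = mainInstance.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT shard_key, database_instance FROM shards")) {
            while (rs.next()) {
                instanceNameByKey.put(rs.getString("shard_key"),
                                      rs.getString("database_instance"));
            }
        }
    }

    /** Returns the connection pool of the instance that owns the given shard key. */
    public DataSource resolve(String shardKey) {
        String instance = instanceNameByKey.get(shardKey);
        if (instance == null) {
            throw new IllegalArgumentException("Unknown shard key: " + shardKey);
        }
        return instancesByName.get(instance);
    }
}
```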
The good thing about this is that you control it and can evolve it over time. For example, in the beginning, when you do not have much data, you can spread all your keys across two instances (to save money), and as the data grows you can increase the number of instances. Example:
shards table

shard_key | database_instance
----------|------------------
key1      | Service-Db-Shard1
key2      | Service-Db-Shard2
key3      | Service-Db-Shard1
key4      | Service-Db-Shard1
key5      | Service-Db-Shard2
Multiple shards in one instance
Doing it like this lets you keep the data for multiple shard keys on the same instance, saving money on resources. Keep in mind this does not work well with every key type. It can work quite well for a multi-customer/tenant system, where data grows as the number of tenants grows. Usually not all tenants have the same amount of data, so giving each one a dedicated instance is not always the most efficient way to shard; this approach gives you that additional flexibility.
Keep the shard key column in every table
In addition, add the shard key column to each of your tables so that you can identify what needs to be moved where. Even when your data is distributed across multiple shards (instances), you may still want to keep this column, both because you might have multiple shard keys on the same instance and because it keeps the option open to migrate further if needed.
Before executing SQL statements
Before each SQL statement against your DB, you need to get the instance information from the shards table, and each SQL statement against your orders table (or any other sharded table) should contain the shard key in its filters.
Data Access layer
Consider the data-access layer in your microservice code: this is a good example of why SOLID principles and a properly loose design of the data-access classes/modules make something like this easier to implement. If your data-access layer is done well, it is much easier to adjust a couple of classes to add the extra step of finding the instance by key and including the key in each query, as in the sketch below.
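A minimal sketch of such a data-access class, reusing the ShardResolver from above; table and column names remain placeholders:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

/** Data-access sketch: every query first resolves the owning instance by
 *  shard key, and always includes the shard key in its filter. */
public class OrderDao {
    private final ShardResolver resolver;

    public OrderDao(ShardResolver resolver) {
        this.resolver = resolver;
    }

    public List<String> findOrderPayloads(String shardKey, long customerId) throws Exception {
        // Step 1: pick the instance that owns this shard key.
        // Step 2: include the shard key in the statement's filter.
        try (Connection con = resolver.resolve(shardKey).getConnection();
             PreparedStatement ps = con.prepareStatement(
                 "SELECT payload FROM orders WHERE shard_key = ? AND customer_id = ?")) {
            ps.setString(1, shardKey);
            ps.setLong(2, customerId);
            List<String> payloads = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    payloads.add(rs.getString("payload"));
                }
            }
            return payloads;
        }
    }
}
```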
Conclusion
This was just to give you an idea of how you could approach this. There are many ways to do it, and the right one depends heavily on your domain, your current service structure, your data and its architecture, your infrastructure setup, and your DB deployment and migration strategy.
We are looking into using ASP.NET Boilerplate, and it looks very promising. We love the framework, but we would like to use a per-schema multitenancy configuration: instead of sharing data in the same DB and tables, each tenant would "have" a schema in which the whole database structure is replicated.
One of our data tables will be quite big (sometimes over 1 million entries per tenant), and we were advised that for performance reasons it is better to keep the number of entries as low as possible. This particular table will also be queried and inserted into a lot. It would be unrealistic for this table to hold data for 40+ tenants. For that reason, among others, we would prefer a distinct schema per tenant.
Our DB is a single PostgreSQL server (which might scale to more in the future). We use Entity Framework and Npgsql. We have already noticed that it is possible to set a different ConnectionString for specific tenants with bigger data requirements.
http://www.summa.com/blog/2013/09/17/approaches-to-multi-tenancy (see "separate schema per tenant")
Any idea how to achieve schema-per-tenant multitenancy? There are a lot of moving parts in this, and I'm not sure where to start.
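To illustrate what we are after at the database level: every tenant schema would share the same structure, and a session would be pointed at the right tenant via search_path. A minimal JDBC sketch of that Postgres mechanism (the tenant_ naming convention and the table are made up, and this is not ASP.NET Boilerplate code):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Sketch of schema-per-tenant at the Postgres level: identical tables in
 *  each tenant schema, with search_path selecting the tenant per session. */
public class SchemaPerTenant {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/app", "app", "secret");
             Statement st = con.createStatement()) {

            // Each tenant gets its own schema with the same table structure.
            st.execute("CREATE SCHEMA IF NOT EXISTS tenant_42");
            st.execute("CREATE TABLE IF NOT EXISTS tenant_42.measurements ("
                     + "id bigserial PRIMARY KEY, value numeric NOT NULL)");

            // Point this session at the tenant's schema; unqualified table
            // names now resolve inside tenant_42.
            st.execute("SET search_path TO tenant_42");
            st.execute("INSERT INTO measurements (value) VALUES (1.5)");
            try (ResultSet rs = st.executeQuery("SELECT count(*) FROM measurements")) {
                rs.next();
                System.out.println("rows for tenant 42: " + rs.getLong(1));
            }
        }
    }
}
```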
I'd like to preface this by saying I'm not a DBA, so sorry for any gaps in technical knowledge.
I am working within a microservices architecture where we have about a dozen applications, each supported by its own Postgres database instance (in RDS, if that helps). Each microservice's database contains a few tables. It's safe to assume that there are no naming conflicts across any of the schemas/tables, and that no data is sharded across the databases.
One of the issues we keep running into is wanting to analyze/join data across the databases. Right now, we rely on a third-party tool that caches our data and makes it possible to query across multiple database sources (via the shared cache).
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Are there any other ways to configure Postgres or RDS to make joining across our databases possible?
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Yes, that's possible and it's actually quite easy.
Set up one Postgres server that acts as the master.
For each remote server, create a foreign server, which you then use to create foreign tables that make the remote data accessible from the master server.
If you have multiple tables on multiple servers that should be viewed as a single table on the master, you can set up inheritance to make all those tables appear as one. If you can define a "sharding" key that distinguishes those servers, you can even make Postgres request the data only from the specific server.
All foreign tables can be joined as if they were local tables. Depending on the kind of query, some (or a lot) of the filter and join criteria can even be pushed down to the remote server to distribute the work.
As the Postgres foreign data wrapper is writable, you can even update the remote tables from the master server.
If the remote access and joins are too slow, you can create materialized views based on the foreign tables, giving you a local copy of the data. This, however, means it is not a real-time copy, and you have to manage regular refreshes of those views.
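A minimal sketch of wiring one remote microservice database into the central server with postgres_fdw, run here over JDBC; hostnames, credentials, and the orders schema are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Sketch: make one remote microservice DB queryable from the "master"
 *  reporting server via postgres_fdw. All names/credentials are made up. */
public class FdwSetup {
    public static void main(String[] args) throws Exception {
        try (Connection master = DriverManager.getConnection(
                 "jdbc:postgresql://master-db:5432/reporting", "admin", "secret");
             Statement st = master.createStatement()) {

            // One-time: enable the foreign data wrapper extension.
            st.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw");

            // Describe how to reach the remote (per-microservice) server.
            st.execute("CREATE SERVER orders_svc FOREIGN DATA WRAPPER postgres_fdw "
                     + "OPTIONS (host 'orders-db', port '5432', dbname 'orders')");

            // Map the local user to credentials on the remote server.
            st.execute("CREATE USER MAPPING FOR CURRENT_USER SERVER orders_svc "
                     + "OPTIONS (user 'report_reader', password 'secret')");

            // Pull in all remote tables as local foreign tables.
            st.execute("CREATE SCHEMA IF NOT EXISTS orders_svc");
            st.execute("IMPORT FOREIGN SCHEMA public FROM SERVER orders_svc "
                     + "INTO orders_svc");

            // From here on, orders_svc.* can be joined like local tables.
        }
    }
}
```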
Other (more complicated) options are the BDR project or pglogical. It seems that logical replication will be built into the next Postgres version (to be released at the end of this year).
Or you could use a distributed, shared-nothing system like Postgres-XL (which is probably the most complicated system to set up and maintain).
In the context of MongoDB and GORM: if we need different databases for different clients, are multi-tenancy (in DATABASE mode) and the multiple data sources approach two solutions for achieving the same thing, or is there a difference between them?
Multiple Data Source Solution:
http://gorm.grails.org/latest/mongodb/manual/#multipleDataSources
Multiple Tenant Solution:
http://gorm.grails.org/latest/mongodb/manual/#multiTenancy
Well, they are not meant to achieve the same purpose.
tldr;
Multiple data sources are meant for storing different objects in different databases (or collections, if you only plan to use MongoDB), while multi-tenancy stores the same objects but uses a tenant discriminator to keep each client's data separate.
If your question is about supporting different databases for different clients, the answer is to go with multi-tenancy.
Multiple Data Sources
Grails has long supported multiple databases in the same application (they can be from different vendors, or different databases from the same vendor). The purpose is to store specific data in a different database/namespace.
For example, you can decide to have one DB for all the core entities of your business and a dedicated DB for all audit/logging concerns. With multiple data sources, you map each object to a dedicated data source.
Multi-tenancy (with a database per tenant, as per the OP's context)
With multi-tenancy in DATABASE mode, on the other hand, Grails uses a separate database per client to store all the objects, so data from client A lives in a different DB than data from client B. Grails ships default tenant resolvers (which you can override if needed) that determine which database needs to be queried depending on the context.
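GORM does this resolution for you, but conceptually DATABASE mode boils down to something like the following sketch with the plain MongoDB Java driver (not GORM's actual implementation; the tenant-to-database naming is made up):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

/** Conceptual sketch of database-per-tenant resolution: each tenant id
 *  maps to its own MongoDB database holding the same collections. */
public class TenantDatabases {
    private final MongoClient client = MongoClients.create("mongodb://localhost:27017");

    MongoDatabase databaseFor(String tenantId) {
        // Hypothetical naming convention: one database per tenant.
        return client.getDatabase("app_" + tenantId);
    }

    public static void main(String[] args) {
        TenantDatabases tenants = new TenantDatabases();
        // Same collection name, different physical database per client.
        System.out.println(tenants.databaseFor("clientA").getCollection("orders").getNamespace());
        System.out.println(tenants.databaseFor("clientB").getCollection("orders").getNamespace());
    }
}
```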
I am using self-tracking entities with Entity Framework 4. I have two databases with exactly the same schema. However, tables in one database will be added to/edited (meaning the data, not the actual table definitions), and at certain points of the day I need to synchronize all the changes between this database and the other.
I can create a separate context for both of them. But if I read a large graph from one database, how can I update the other database with the graph? Is there an easy way?
My database model is large and complex and fully relational. So it would be a big job to go through every single entity and do a read from the other database to see if it exists or not, update/insert it if need be, and then carry this on through the full object graph!
Any ideas?
This is not a use case for EF; with EF you would have to do exactly what you've described. Self-tracking entities track changes made to those object instances; they know nothing about changes made to their own database over time, and they know nothing about the state of your second database either.
Look at SQL Server's native features (mirroring, transaction log shipping, SSIS) and the Microsoft Sync Framework. Depending on your detailed requirements, these tools may suit you better.