I would like to know if it's possible to sync only 1 MongoDB, instead of all databases inside my server.
Best Regards
I assume you're referring to Replica Sets when you say "sync". In that case, no, it's not possible to replicate only one database, all databases on the server are replicated. This is by design, so that in case of failure of the primary, a secondary can step up and become a primary without any loss of functionality or data (except for whatever data may not have been replicated since when the former primary failed).
Related
I want to setup PostgreSQL databases in a way that the one (primary) server receives data from multiple agents, does some post-processing and regularly syncs to read-only (archive) replica. The sync does not have to be real-time. The replica server is used to feed the processed data to other clients.
The idea is that the replica will keep a complete database in the historical sense but the primary will keep all data from all agents (before some aggregation etc).
The primary server would ideally keep only part of the database e.g. last month.
The more I think about it, the term replica is not correct here.
What I probably need is to setup a procedure (for the aggregation etc.) and as a last step send the resulted data the 'replica' and delete portion of the primary DB.
Is there a better term for my use case?
I read a lot regarding sharding, what i understand about this its a DB managment concept. When I come to know about application side, Lets take a example a spring boot microservice having huge table orders where it needs to be shard with a shard Key(K1) in table.
Let's say I decided to shard based on K1 fields using range based sharding and will shard in multiple node of my MySQL DB.
Now I have the following question:
How this sharding is performed in existing data. Is it a background job?
What are the changes need to done in my existing application as currently its connecting to first Instance of MySQL db. while fetching data based on my shard key how can this application decided from which instance It need to request?
With Application Level Sharding you have a lot of options as you are the Application Developer/Architect who has the full control of it. There are a lot of options what you could do but for example here is one option or one Idea which could lead you in the right direction:
How this sharding is performed in existing data. Is it a background
job?
I guess by this you mean how do I separate or migrate the data from one Database to another database shard?
Background job. Yes having a background job is an option. With this background job you can move the data from one Db-shard to another Db-shard.
Migration Script. You can also write a migration script on your database level(SQL script) which will migrate all the data to other Db-shard.
With both of these options you have to think about the fact if you system need to be running and operational all the time? Can you live with down-time?
If yes this can be more challenging. As while you are migrating you have to stay operational. Doing this in non-Business hours can help, doing it in chunks key by key and similar. Still this depends on your business.
If no and you can have a down-time to do this then it will be much easier to separate the data to the appropriate Shards based on key. Here you do not have to consider a running system and some data mismatches in data. So if you can somehow can do it like this this would be much easier.
What are the changes need to done in my existing application as
currently its connecting to first Instance of MySQL db. while fetching
data based on my shard key how can this application decided from which
instance It need to request?
You have to provide that logic. Since it is on your application level you need to make that decision in code. In you DataAccess level code you need to know where to send your querys(or other sql statements): Service-Db-Shard1 or Service-Db-Shard2.
What you can do is for example in your Main Instance1 Db-Shard-1 you can have one table called Shards:
shards Table
shard_key
database_instance
key1
Service-Db-Shard1
key2
Service-Db-Shard2
The shards table
This table will contain the information where each shard data can be found. So the data which is sharded based on the key2 can be found in Service-Db-Shard2. Depending on your architecture you can put this table in one Main/Master(preferred option especially if you have some Read replicas to support downtime of Main instance) shard or in all shards(not preferred as it creates duplication). In addition you can cache this information in your micro-service Cache on startup and reuse its values from cache so you do not have to read this Table every time you need to execute an SQL statement on any other table.
The good thing about this is that you can control this and evolve this over time. For example in beginning when you do not have so much data separate/spread all your keys to 2 Instances(save money) and as the data grows you can increase the number of instances. Example:
shards Table
shard_key
database_instance
key1
Service-Db-Shard1
key2
Service-Db-Shard2
key3
Service-Db-Shard1
key4
Service-Db-Shard1
key5
Service-Db-Shard2
Multiple shards in one instance
Doing it like this gives you the option to have multiple shard keys data on the same instance to save money on to many resources. Keep in mind this does not work well with every key type. For example it could work quite well if you have a system which is a multi Customer/Tenant system and as your number of Tenants grows the data grows as well. Usually not all the Tenants have the same amount of data so having them in a dedicated Instance is not always the most efficient way to shard. This gives you this additional flexibility.
Keep the shard key column in every table
In addition you want to add to each of your tables the shard key column so that you can identify what needs to be moved where. Event when your data is distributed to multiple shard(instances) you still might want to keep this column for the fact that you might have multiple shard keys on the same instance and also having the option to migrate further(if needed).
Before executing sql statements
Before each sql statement against your DB you will need to get the Instance information from the "shards" Table and each sql statement to your "orders" Table or any other table which is sharded should contain the Sharding key in its filters.
Data Access layer
Consider DataAccess layer in your micro-service code, this is one good example why SOLID Principles and proper loose design of an DataAccess layer classes/modules and proper design can help you implement something like this easier. It is much easier to adjust a couple of classes to add additional step to find the Instance based on key and include the key in each query if your DataAccess layer code is done well.
Conclusion
This was just to give you an Idea how you could approach this. There are many ways how you can do this. It will heavily depend on your domain, your current service structure, your data, its Architecture, the way you have your Infrastructure setup and db deployments and migration strategy.
My postresql database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different to RDS snapshots which are user initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see difference between RDS snapshots. But in the past we tested several solutions for similar problem. Maybe you can take some inspiration from it.
Obvious solution is of course auditing system. This way you can see in relatively simply way what was changed. Depending on granularity of your auditing system down to column values. Of course there is impact on your application due auditing triggers and queries into audit tables.
Another possibility is - for tables with primary keys you can store values of primary key and 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before updated and compare them with values after update. But this way you can identify only changed / inserted / deleted rows but not changes in different columns.
You can make streaming replica and set replication slots (and to be on the safe side also WAL log archiving ). Then stop replication on replica before updates and compare data after updates using dblink selects. But these queries can be very heavy.
Background: I have a very specific use case where I have an existing MongoDB that I need to interact with via reads, but I have to ensure that the data can never be modified. However I also need to trigger some form of event when new data comes in so I can do post processing on it.
The current plan is to use replication to get the data onto a slave for the read processing. However for my purposes I only care about new data in various document stores. Part of the issue is that I can not modify the existing MongoDB and not all the data is timestamped, so there is no incremental way to handle this that I can think of.
Question: Is it possible to fire an event from a slave that would tell me I have new data and what it is? I will only have access to the slave DB, as the master will be locked.
I may have some limited ability to change the master DB, but I can not expect to change the document structure at all.
Instead of using a master/slave configuration you could instead use a replica set with a priority 0 secondary (so that it can never become primary).
You can tail the oplog on that secondary looking for insert operations.
Need some way to push data from clients database to central database.Basically, there are several instances of MongoDB running on remote machines [clients] , and need some method to periodically update central mongo database with newly added and modified documents in clients.it must replicate its records to the single central server
Eg:
If I have 3 mongo instances running on 3 machines each having data of 10GB then after the data migration 4th machine's mongoDB must have 30GB of data. And cenral mongoDB machine must get periodically updated with data of all those 3 machines. But these 3 machines not only get new documents but existing documents in them may get updated. I would like the central mongoDB machine also to get these updations.
Your desired replication strategy is not formally supported by MongoDB.
A MongoDB replica set consists of a single primary with asynchronous replication to one or more secondary servers in the same replica set. You cannot configure a replica set with multiple primaries or replication to a different replica set.
However, there are a few possible approaches for your use case depending on how actively you want to keep your central server up to date and the volume of data/updates you need to manage.
Some general caveats:
Merging data from multiple standalone servers can create unexpected conflicts. For example, unique indexes would not know about documents created on other servers.
Ideally the data you are consolidating will still be separated by a unique database name per origin server so you don't have strange crosstalk between disparate documents that happen to have the same namespace and _id shared by different origin servers.
Approach #1: use mongodump and mongorestore
If you just need to periodically sync content to your central server, one way to do so is using mongodump and mongorestore. You can schedule a periodic mongodump from each of your standalone instances and use mongorestore to import them into the central server.
Caveats:
There is a --db parameter for mongorestore that allows you to restore into a different database from the original name (if needed)
mongorestore only performs inserts into the existing database (i.e. does not perform updates or upserts). If existing data with the same _id already exists on the target database, mongorestore will not replace it.
You can use mongodump options such as --query to be more selective on data to export (for example, only select recent data rather than all)
If you want to limit the amount of data to dump & restore on each run (for example, only exporting "changed" data), you will need to work out how to handle updates and deletions on the central server.
Given the caveats, the simplest use of this approach would be to do a full dump & restore (i.e. using mongorestore --drop) to ensure all changes are copied.
Approach #2: use a tailable cursor with the MongoDB oplog.
If you need more realtime or incremental replication, a possible approach is creating tailable cursors on the MongoDB replication oplog.
This approach is basically "roll your own replication". You would have to write an application which tails the oplog on each of your MongoDB instances and looks for changes of interest to save to your central server. For example, you may only want to replicate changes for selective namespaces (databases or collections).
A related tool that may be of interest is the experimental Mongo Connector from 10gen labs. This is a Python module that provides an interface for tailing the replication oplog.
Caveats:
You have to implement your own code for this, and learn/understand how to work with the oplog documents
There may be an alternative product which better supports your desired replication model "out of the box".
You should be aware that there are only replica set for doing replication there a replicat set always means: one primary, multiple secondary. Write always go to the primary server. Appearently you want multi-master replication which is not supported by MongoDB. So you want to look into a different technology like CouchDB or CouchBase. MongoDB is barrel burst here.
There may be a way since MongoDB 3.6 to achieve your goal: Change Streams.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.
There are some configuration options that affect whether you can use Change Streams or not, so please read about them.
Another option is Delayed Replica Set Members.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error. For example, a delayed member can make it possible to recover from unsuccessful application upgrades and operator errors including dropped databases and collections.
Hidden Replica Set Members may be another option to consider.
A hidden member maintains a copy of the primary's data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set.
Another option may be to configure a Priority 0 Replica Set Member.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error.
I am interested in these options myself, but I haven't decided what approach I will use.