Coming from Microsoft SQL Server, with its Database Services, Integration Services and Analysis Services, it was easy to set up full or incremental replication alongside SSIS packages to sync child databases to a master reporting database when all of them share the same schema. Now I am looking for a way to do this in real time, or close to real time, with PostgreSQL, while also recording which child database each row came from.
For example:
The goal is to gather data from selected tables and fields from child databases to a master reporting database. In this case all tables in child databases have the same schema and the master reporting database has one more column in all of the selected tables identifying the source database.
For reference:
My first thought was to use Kafka: based on a selected list of tables and fields, generate messages that populate the master reporting database as soon as possible after a new record is inserted into a child database. Is there a better idea?
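To make the target design concrete, here is a minimal sketch of the consuming side only. It assumes change messages arrive as plain dicts (for example, decoded from a Kafka topic) and uses an in-memory SQLite database as a stand-in for the master reporting database. The names `SELECTED`, `apply_message`, `orders` and `source_db` are hypothetical, not part of any real setup.

```python
import sqlite3

# Hypothetical whitelist: which tables and columns are copied to the master.
SELECTED = {"orders": ["id", "amount"]}

def create_master(conn):
    # Master table = selected child columns plus source_db. The key includes
    # the source because ids from different child databases may collide.
    conn.execute(
        "CREATE TABLE orders (source_db TEXT, id INTEGER, amount REAL, "
        "PRIMARY KEY (source_db, id))"
    )

def apply_message(conn, msg):
    """Apply one change message; return False if the table is not selected."""
    table = msg["table"]
    cols = SELECTED.get(table)
    if cols is None:
        return False
    names = ["source_db"] + cols
    values = [msg["source_db"]] + [msg["row"][c] for c in cols]
    placeholders = ", ".join("?" for _ in names)
    # INSERT OR REPLACE is SQLite syntax; on PostgreSQL the equivalent
    # would be INSERT ... ON CONFLICT ... DO UPDATE.
    conn.execute(
        f"INSERT OR REPLACE INTO {table} ({', '.join(names)}) "
        f"VALUES ({placeholders})",
        values,
    )
    return True

master = sqlite3.connect(":memory:")
create_master(master)
messages = [
    {"source_db": "child_a", "table": "orders",
     "row": {"id": 1, "amount": 10.0, "note": "not selected"}},
    {"source_db": "child_b", "table": "orders",
     "row": {"id": 1, "amount": 25.5, "note": "not selected"}},
    {"source_db": "child_a", "table": "audit_log", "row": {"id": 7}},
]
applied = [apply_message(master, m) for m in messages]
rows = master.execute(
    "SELECT source_db, id, amount FROM orders ORDER BY source_db, id"
).fetchall()
```

Because the write is an upsert keyed on (source_db, id), replaying a message after a consumer restart does not create duplicates.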
Related
I am trying to find the data source of a couple of tables in Redshift. I have gone through all the stored procedures in the Redshift instance and couldn't find any that populates these tables. I have also checked the Data Migration Service and didn't see these tables being migrated from the RDS instance. However, the tables are updated regularly each day.
What would be the way to find out how data is populated in those two tables? Are there any logs or system tables I can look into?
One place I'd look is svl_statementtext. It holds the text of queries and utility statements, so it will show anything that inserts into that table or runs COPY jobs against it. Just use WHERE text LIKE '%yourtablenamehere%' and see what comes back.
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
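As a rough sketch, the search could be wrapped in a small helper that returns a parameterized query. The function name `statement_search_sql` is hypothetical, and the `%s` placeholder assumes a psycopg2-style driver. Note that svl_statementtext stores statement text in chunks of up to 200 characters, one row per chunk, ordered by the sequence column, so a long statement spans several rows.

```python
def statement_search_sql(table_name):
    """Return (sql, params) for finding statements that mention table_name."""
    sql = (
        "SELECT userid, xid, pid, starttime, sequence, text "
        "FROM svl_statementtext "
        "WHERE text ILIKE %s "          # case-insensitive match on the chunk
        "ORDER BY starttime, xid, sequence"
    )
    # The wildcard pattern is passed as a parameter, not pasted into the SQL.
    return sql, (f"%{table_name}%",)

sql, params = statement_search_sql("daily_sales")
```

The xid and pid of a matching row can then be used to pull the remaining chunks of the same statement and the user that ran it.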
Also check scheduled queries in the Redshift UI console.
I created a database long ago using Django. Now that we are migrating the application, I need all the CREATE TABLE SQL statements that Django would have run to create the entire database for our service (around 70-80 tables, each with 30-70 columns on average).
Both the old and the new server use Postgres.
But the technology stack is completely different: a third-party proprietary application will host the service instead of Django.
If I start to write all the tables again from scratch, it will take at least a week or two.
Is there any way, either from Postgres or from Django, to generate the CREATE TABLE SQL schema for an entire database, keeping all the relationships as they are?
Also, I have to make minor modifications to that schema as per customer requirements.
P.S. pg_dump won't work, as I need the actual schema itself to get it reviewed by the client.
I need some advice about the following scenario.
I have multiple embedded systems, each running a PostgreSQL database, at different places, and we have a server running CentOS on our premises.
Each system runs at a remote location and has multiple tables in its database. The server has tables with the same names, but each system's table names differ from those of the other systems, e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table, which will run on every insert into the systems' tables and insert the same data into the server's tables. This trigger will also delete the records on the system after inserting the data into the server. The problem with this solution is that if the connection to the server cannot be established because of a network issue, then the trigger will not execute or the insert will be lost. I have checked the following solution for this:
Trigger to insert rows in remote database after deletion
The second solution is to replicate tables from system 1 and system 2 to the server's tables. The problem with replication is that if we delete data from the systems, it will also delete the records on the server. I could add a trigger on the server's tables that copies each row into a duplicate table, so the replicated table can become empty without affecting the data, but that will make for a long list of tables if we have more than 200 systems.
The third solution is to create foreign tables using postgres_fdw or dblink and update the data in the server's tables. But this will affect the data on the server when we delete the data in the system's table, right? And what will happen if there is no connectivity with the server?
The fourth solution is to write a Python application on each system that connects to the server's database and writes the data in real time. If there is no connectivity to the server, it stores the data in sys1_table1, sys2_table2 or whichever table the data belongs to, and after reconnecting it sends the buffered data to the server's tables.
Which option is best for this scenario? I like the trigger solution best, but is there any way to avoid data loss when the connection to the server is down?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
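A minimal store-and-forward sketch of that fourth approach, under stated assumptions: SQLite stands in for the embedded system's local PostgreSQL database, `send` is a hypothetical callable that writes one row to the server and raises ConnectionError when the server is unreachable, and the table name `outbox` is made up for illustration.

```python
import sqlite3

def make_outbox():
    conn = sqlite3.connect(":memory:")  # stand-in for the local database
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
    )
    return conn

def record(conn, payload):
    """Always write locally first, so an insert is never lost."""
    conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    conn.commit()

def flush(conn, send):
    """Forward buffered rows in order; delete each only after send() succeeds."""
    pending = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    sent = 0
    for row_id, payload in pending:
        try:
            send(payload)
        except ConnectionError:
            break  # server unreachable: keep the rest for the next attempt
        conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        conn.commit()
        sent += 1
    return sent

# Simulated server that is offline at first and comes back later.
state = {"online": False}
server_rows = []

def send(payload):
    if not state["online"]:
        raise ConnectionError("server unreachable")
    server_rows.append(payload)

outbox = make_outbox()
record(outbox, "reading-1")
record(outbox, "reading-2")
first_attempt = flush(outbox, send)   # offline: nothing leaves the outbox
state["online"] = True
second_attempt = flush(outbox, send)  # online: both rows are forwarded
remaining = outbox.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

This gives at-least-once delivery: if the process dies after a successful send but before the local delete, the row is sent again on the next flush, so the insert on the server side should be idempotent (for example, keyed on a unique row id).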
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you have a recent PostgreSQL version that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after they have been replicated.
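With many systems, the publication and subscription statements can be generated rather than typed by hand. The sketch below only builds the SQL text; the helper names, the publication and subscription names, and the connection string are all hypothetical, and the identifiers are assumed to come from a trusted list (they are not escaped).

```python
def publication_sql(name, tables):
    """Statement to run on the remote system: publish inserts and updates only."""
    table_list = ", ".join(tables)
    return (
        f"CREATE PUBLICATION {name} FOR TABLE {table_list} "
        "WITH (publish = 'insert, update');"
    )

def subscription_sql(name, conninfo, publication):
    """Statement to run on the central server to subscribe to one system."""
    return f"CREATE SUBSCRIPTION {name} CONNECTION '{conninfo}' PUBLICATION {publication};"

pub = publication_sql("sys1_pub", ["sys1_table1", "sys1_table2"])
sub = subscription_sql("sys1_sub", "host=sys1.example dbname=sys1", "sys1_pub")
```

The subscriber needs tables with matching names and compatible columns before the subscription is created, since logical replication does not copy table definitions.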
I'd like to preface this by saying I'm not a DBA, so sorry for any gaps in technical knowledge.
I am working within a microservices architecture, where we have about a dozen or so applications, each supported by its own Postgres database instance (which is in RDS, if that helps). Each of the microservices' databases contains a few tables. It's safe to assume that there are no naming conflicts across any of the schemas/tables, and that there's no sharding of any data across the databases.
One of the issues we keep running into is wanting to analyze/join data across the databases. Right now, we're relying on a 3rd Party tool that caches our data and makes it possible to query across multiple database sources (via the shared cache).
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Are there any other ways to configure Postgres or RDS to make joining across our databases possible?
Is it possible to create read-replicas of the schemas/tables from all of our production databases and have them available to query in a single database?
Yes, that's possible and it's actually quite easy.
Set up one Postgres server that acts as the master.
For each remote server, create a foreign server, which you then use to create foreign tables that make the data accessible from the master server.
If you have multiple tables on multiple servers that should be viewed as a single table in the master, you can set up inheritance to make all those tables appear as one. If you can define a "sharding" key that identifies a distinct attribute between those servers, you can even make Postgres request the data only from the specific server.
All foreign tables can be joined as if they were local tables. Depending on the kind of query, some (or a lot) of the filter and join criteria can even be pushed down to the remote server to distribute the work.
As the Postgres Foreign Data Wrapper is writeable, you can even update the remote tables from the master server.
If the remote access and joins are too slow, you can create materialized views based on the remote tables to create a local copy of the data. This, however, means that it's not a real-time copy and you have to manage the regular refresh of the views.
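For a dozen databases, the foreign-server setup is repetitive enough to script. The sketch below only generates the SQL text for one remote database using postgres_fdw; the helper names, host, credentials and the choice of one local schema per remote server are assumptions for illustration, and the values are not escaped.

```python
def fdw_setup_sql(server, host, dbname, user, password, port=5432):
    """Return the statements that expose one remote database as schema `server`."""
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER postgres_fdw "
        f"OPTIONS (host '{host}', port '{port}', dbname '{dbname}');",
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER {server} "
        f"OPTIONS (user '{user}', password '{password}');",
        f"CREATE SCHEMA IF NOT EXISTS {server};",
        # Creates one foreign table per table in the remote public schema.
        f"IMPORT FOREIGN SCHEMA public FROM SERVER {server} INTO {server};",
    ]

def local_copy_sql(server, table):
    """Optional local snapshot of a slow remote table (needs periodic REFRESH)."""
    return (
        f"CREATE MATERIALIZED VIEW {server}_{table}_copy AS "
        f"SELECT * FROM {server}.{table};"
    )

statements = fdw_setup_sql(
    "billing", "billing.example.internal", "billing", "report_ro", "secret"
)
snapshot = local_copy_sql("billing", "invoices")
```

After these run on the master, a query can join, say, billing.invoices with a foreign table from another service as if both were local.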
Other (more complicated) options are the BDR project or pglogical. It seems that logical replication will be built into the next Postgres version (to be released at the end of this year).
Or you could use a distributed, shared-nothing system like Postgres-XL (which is probably the most complicated system to set up and maintain).
I have created a custom script in Express that migrates a SQL Server database to MongoDB.
But I am facing problems with live syncing between the two databases.
Currently I have added a column updated_by in both databases.
Then I fetch the latest updated_by row from MongoDB and from the SQL Server database.
Then I check the date difference and, based on it, I update my MongoDB database.
There are lots of tables, and I am finding it difficult to identify which table is being updated.
Is there any log in SQL Server 2008 R2 that states which table was updated and at what time?
I need a mechanism whereby any data update in a table immediately syncs the affected rows into MongoDB.
Any more suggestions on live data syncing are also welcome.
Thanks in advance. :)
I had a similar requirement to sync between a relational DB (MySQL) and a non-relational DB (MongoDB).
I followed the steps below, which may help others in future. The concept is generally called Change Data Capture:
Capture the changes (for MySQL I am using triggers).
Transform the changes into a suitable form,
i.e. from RDBMS to non-RDBMS.
Apply the changes.
Remember to also sync structural changes of the database and the corresponding implementations.
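The three steps above can be sketched end to end. This is an illustration under stated assumptions, not the setup described in the answer: SQLite stands in for the relational source, a plain dict stands in for the MongoDB collection, and the names `users`, `change_log` and `sync` are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the relational source
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT, op TEXT, row_id INTEGER
);
-- Step 1: capture. Triggers record which row changed and how.
CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO change_log (table_name, op, row_id) VALUES ('users', 'upsert', NEW.id);
END;
CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT INTO change_log (table_name, op, row_id) VALUES ('users', 'upsert', NEW.id);
END;
CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    INSERT INTO change_log (table_name, op, row_id) VALUES ('users', 'delete', OLD.id);
END;
""")

def sync(db, documents, last_seq):
    """Apply logged changes newer than last_seq; return the new position."""
    changes = db.execute(
        "SELECT seq, op, row_id FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, row_id in changes:
        if op == "delete":
            documents.pop(row_id, None)            # Step 3: apply a delete
        else:
            row = db.execute(
                "SELECT id, name FROM users WHERE id = ?", (row_id,)
            ).fetchone()
            if row is not None:                    # row may be gone already
                # Step 2: transform the row into a document, then upsert it.
                documents[row_id] = {"_id": row[0], "name": row[1]}
        last_seq = seq
    return last_seq

db.execute("INSERT INTO users VALUES (1, 'Ann')")
db.execute("INSERT INTO users VALUES (2, 'Bob')")
db.execute("UPDATE users SET name = 'Anna' WHERE id = 1")
db.execute("DELETE FROM users WHERE id = 2")

documents = {}  # stand-in for a MongoDB collection keyed by _id
position = sync(db, documents, 0)
```

Persisting the returned position lets the next run pick up only newer changes, and the change log also answers which table was updated and in what order.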
The following link may help:
https://www.flydata.com/blog/what-change-data-capture-cdc-is-and-why-its-important/