Get incremental changes for RDS Postgres DB - postgresql

I have a process that extracts data from an origin DB, transforms it, and inserts it into a target DB.
Currently I'm using an AWS Lambda that runs every few minutes. I added a timestamp column that indicates when each row was last changed, and the Lambda filters based on it.
This process is fragile, since I need to "remember" to manually add a timestamp column to each new table. Is there a better way? Can I use a query (Postgres) or an API call (boto3) to get only the data that was changed?

Related

Copy/Sync changed data alone from one database to another in specific interval, Postgresql

I have a requirement where I need to capture the changes in one Postgres server and make the same updates to another database in another server after specific intervals.
So what I want is both databases to be in sync during an interval that I specify.
I don't want to take a whole database dump; only the changes that occurred after the last sync need to be applied.
What are the ways this can be achieved? If it is possible.

Set up delta load in azure data factory

I have an on-prem SQL database that I want to get data from. The database has a column called last_update that records when a row was last updated. The first time I run my pipeline, I want it to copy everything from the on-prem database to an Azure database. On subsequent runs, I want to copy only the rows that have been updated since the last run, that is, every row where last_update is later than the time of the last run. Is there a way of using the time of the last run in a pipeline? Is there any other good way of achieving what I want?
I think you can do this by developing custom copy activity. You can add your own transformation/processing logic and use the activity in a pipeline.
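Independent of Data Factory, the delta-load logic described in the question can be sketched in a few lines of Python. This is only an illustration under assumptions: SQLite (via the standard `sqlite3` module) stands in for both databases, the `items` table and the `make_db` and `copy_delta` helpers are hypothetical names, and `last_update` is assumed to hold ISO-8601 strings so that string comparison matches time order.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical `items` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, last_update TEXT)"
    )
    return conn


def copy_delta(src, dst, last_run):
    """Copy rows changed after `last_run` from src to dst.

    Returns the new watermark: the newest last_update that was copied,
    or the old watermark when nothing changed.
    """
    rows = src.execute(
        "SELECT id, payload, last_update FROM items "
        "WHERE last_update > ? ORDER BY last_update",
        (last_run,),
    ).fetchall()
    with dst:  # one transaction: either every changed row lands or none
        dst.executemany(
            "INSERT OR REPLACE INTO items (id, payload, last_update) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return rows[-1][2] if rows else last_run
```

Passing an empty string as `last_run` on the first run copies everything, because every non-empty timestamp string sorts after it. Taking the watermark from the copied rows rather than from the wall clock avoids skipping rows that were written while the copy was running.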

Mirth Connect save time-stamp

I have a Mirth Channel that takes data from MS SQL Server, creates an HL7 message for a file drop.
I want to run the query to only consider data younger than the last time we ran the query. How do we get Mirth Connect to save the old time stamp so that it can be used as part of the next query and survive between reboots? We cannot modify the database we are pulling the data from (otherwise we would just update the status table).
Do you have any suggestions for how, within Mirth Connect, we can save the timestamp of a given query to use in the next query?
A few options:
I think you could store some unique identifier and the timestamp in one of the global maps and it will stick around between channel calls. Not 100% on this one.
You can always write it to a file and then read it later. Depending on how your channel flows, this could be an advantage. A file reader source could read that file and run queries since the timestamp recorded in that file (or even the file timestamp itself!).
The next option is to create a table in the same or a different database (like a local SQLite instance) and handle it in SQL.
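To make the file option concrete, here is a minimal sketch of persisting the timestamp so it survives restarts. It is written in Python purely for illustration and is not Mirth Connect code; the `load_watermark` and `save_watermark` names and the JSON layout are assumptions of mine, not anything Mirth provides.

```python
import json
import os
import tempfile
from pathlib import Path


def load_watermark(path, default=None):
    """Return the saved timestamp, or `default` if nothing was saved yet."""
    try:
        return json.loads(Path(path).read_text())["last_run"]
    except FileNotFoundError:
        return default


def save_watermark(path, timestamp):
    """Persist the timestamp by writing a temp file and renaming it.

    os.replace swaps the file in a single step, so a crash mid-write
    leaves the previous timestamp in place instead of a truncated file.
    """
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump({"last_run": timestamp}, fh)
    os.replace(tmp, path)
```

The idea would be to call `load_watermark` before the query, use the value in the WHERE clause, and call `save_watermark` only after the messages were produced successfully.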

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table, and the records in it get their ids from the originating MongoDB collection. This allows us to run a delete against the target table (in our example, the employees table) where the id is present in the temp table. If any of the tables doesn't exist, it gets created on the fly, based on a schema provided by the plugin. We then insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so a deleted record simply arrives as an update with an is_deleted flag in Redshift.
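The delete-then-insert step can be sketched as follows. This is not our actual migrator (which is NodeJS against Redshift); it is a Python illustration in which SQLite stands in for Redshift, and `merge_from_temp` is a hypothetical helper name. Table and column names are interpolated into the SQL, so they are assumed to come from trusted plugin configuration.

```python
import sqlite3


def merge_from_temp(conn, table, columns):
    """Apply staged rows from temp_<table> to <table>.

    Rows whose id is already present in the target are deleted first and
    then re-inserted from the temp table, which handles both new and
    updated records in one pass.
    """
    temp = f"temp_{table}"
    cols = ", ".join(columns)
    with conn:  # delete and insert commit together or not at all
        conn.execute(f"DELETE FROM {table} WHERE id IN (SELECT id FROM {temp})")
        conn.execute(f"INSERT INTO {table} ({cols}) SELECT {cols} FROM {temp}")
```

Running both statements in one transaction matters: readers of the target table never see the window where the old rows are gone but the new ones have not arrived.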
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
A schema file, this is a SQL file containing the schema for the table at redshift
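As a rough illustration of that contract, here is a sketch in Python rather than the NodeJS we actually used. The `Plugin` dataclass and the `export_delimited` function are hypothetical names, and any iterable of dicts stands in for a real MongoDB cursor.

```python
import csv
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass
class Plugin:
    table: str                                              # target table name
    cursor: Callable[[Optional[datetime]], Iterable[dict]]  # delta query
    transform: Callable[[dict], list]                       # document -> row
    schema_sql: str                                         # DDL for the target table


def export_delimited(plugin, last_migrated, out):
    """Write one tab-delimited line per changed document to `out`.

    Returns the number of documents written.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for doc in plugin.cursor(last_migrated):
        writer.writerow(plugin.transform(doc))
        count += 1
    return count
```

The engine stays generic: it only knows how to iterate, write lines, upload, and record the timestamp, while everything collection-specific lives in the plugin.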
Redshift is a data warehousing product and MongoDB is a NoSQL database. They are not replacements for each other; they can coexist and serve different purposes. The question is how to save and update records in both places.
You can move all MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For near-real-time sync to Redshift, modify the program that writes into MongoDB
so that it also writes to S3 locations. The data can then be moved from S3 into Redshift at regular intervals.
Since MongoDB is a document storage engine, Apache Solr and Elasticsearch can be considered as possible replacements. However, they do not support SQL-style querying; they use a different filtering mechanism. For example, with Solr you might need to use the DisMax query parser.
In the cloud, Amazon CloudSearch or Azure Search would be compelling options to try as well.
You can now use AWS DMS to migrate data to Redshift easily; it can also replicate ongoing changes in near real time.

How to bulk-refresh postgres database

I've got a Postgres 9.1 database that contains weather information. The dataset consists of approximately 3.1 million rows.
It takes about 2 minutes to load the data from a CSV file, and a little less to create a multicolumn index.
Every 6 hours I need to completely refresh the dataset. My current thinking is I would import the new dataset into a different database name, such as "weather_imported" and once the import and index creation are finished, I would drop the original database and rename the imported database.
In theory, clients would continue to query the database during this operation, though if that has ill effects, I could probably arrange to have the clients silently ignore a few errors.
Questions:
Will that strategy work?
If a client happened to be in the process of running a query at the time of the DB drop, my assumption is the database would not complete the drop until the query were finished - true?
What if a query happened between the time the DB were dropped and the rename? I assume a "database not found" error.
Is there a better strategy?
Consider the following strategy as an alternative:
Include a "dataset version" field in the primary table.
Store the "current dataset version" in some central location, and write your selects to only search for rows which have the current dataset version.
To update the dataset:
Insert all the data with a new dataset version. (You could just use the start time of the update job as a version.)
Update the "current dataset version" atomically to the value you just inserted.
Delete all data with an older version than the version number you just inserted.
Presto -- no need to shuffle databases around.
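The steps above can be sketched roughly like this. It is an illustration under assumptions, not a drop-in solution: SQLite stands in for Postgres, integer versions stand in for the job start time, and the `weather` and `current_version` tables and their columns, along with the helper names, are made up for the example.

```python
import sqlite3


def init_schema(conn):
    """Create the data table and a one-row table holding the live version."""
    conn.execute(
        "CREATE TABLE weather (station TEXT, temp_c REAL, dataset_version INTEGER)"
    )
    conn.execute("CREATE TABLE current_version (version INTEGER)")
    conn.execute("INSERT INTO current_version VALUES (0)")
    conn.commit()


def refresh_dataset(conn, rows, version):
    """Load a new dataset, switch readers to it, then drop older data."""
    with conn:  # 1. insert everything under the new version
        conn.executemany(
            "INSERT INTO weather (station, temp_c, dataset_version) VALUES (?, ?, ?)",
            [(station, temp, version) for station, temp in rows],
        )
    with conn:  # 2. flip the pointer in a single statement
        conn.execute("UPDATE current_version SET version = ?", (version,))
    with conn:  # 3. clean up rows from older versions
        conn.execute("DELETE FROM weather WHERE dataset_version < ?", (version,))


def read_current(conn):
    """Select only rows that belong to the live dataset version."""
    return conn.execute(
        "SELECT station, temp_c FROM weather "
        "WHERE dataset_version = (SELECT version FROM current_version) "
        "ORDER BY station"
    ).fetchall()
```

Until step 2 commits, readers keep seeing the old dataset in full, and after it they see the new one in full, so clients never hit a missing database or a half-loaded table.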