Bulk data insertion and updating from one db server to another db server - mongodb

I have some set of tables which has 20 million records in a postgres server. As of now i m migrating some table data from one server to another server using insert and update queries with dependent tables in functions. It takes around 2 hours even after optimizing the query. I need a solution to migrate the data faster by using mongodb or cassandra. How?

Try putting your updates and inserts into a file and then load the file. I understand Postgresql will optimise loading the file contents. It's always worked for me although I haven't used that quantity.

Related

Spark data pipeline initial load impact on production DB

I want to write a Spark pipeline to perform aggregation on my production DB data and then write data back to the DB. My goal of writing the pipeline is to perform aggregation and not impact production DB while it runs, meaning I don't want users experiencing lag nor DB having heavy IOPS while the aggregation is performed. For example, an equivalent aggregation query just run as SQL would take a long time and also use up the RDS IOPS, which results in users not being able to get data - trying to avoid this. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default there will be a single partition in Glue to which the whole table is read into.But you can do parallel reads using this and make sure to chose a column that will not affect the DB performance.
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?
Yes when you pass a query instead of table you will be only reading the result of it from the DB and reducing the large n/w and IO transfer. This means you are delegating it to DB engine to calculate the result.Refer to this on how you can do it.
For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
Yes depending on the table size and query complexity this might affect DB performance and if you have a read replica then you can simply use that.
When writing data back to DB, does that incur load on DB as well?
Yes it depends on how you are writing the result back to DB. Few partitions is always good i.e, not too many and not too less.

How to replicate a Postgres DB with only a sample of the data

I'm attempting to mock a database for testing purposes. What I'd like to do is given a connection to an existing Postgres DB, retrieve the schema, limit the data pulled to 1000 rows from each table, and persist both of these components as a file which can later be imported into a local database.
pg_dump doesn't seem to fullfill my requirements as theres no way to tell it to only retrieve a limited amount of rows from tables, its all or nothing.
COPY/\copy commands can help fill this gap, however, it doesn't seem like theres a way to copy data from multiple tables into a single file. I'd rather avoid having to create a single file per table, is there a way to work around this?

How would you connect to two postgres databases in a sql script or a stored procedure?

I need to move some old data from one database to another - both have similar schemas. After the old rows in db1.mytable are inserted into db2.mytable - these same rows should be deleted from db1.mytable.
This is reduce db size and archive data that is not really needed that much but still important.
You should to use Foreign Data Wrapper for Postgres - https://www.postgresql.org/docs/current/static/postgres-fdw.html With foreign tables, you can send a query to another database.

Can Spring-JPA work with Postgres partitioning?

We have a Spring Boot project that uses Spring-JPA for data access. We have a couple of tables where we create/update rows once (or a few times, all within minutes). We don't update rows that are older than a day. These tables (like audit table) can get very large and we want to use Postgres' table partitioning features to help break up the data by month. So the main table always has this calendar month's data but if the query requires retrieval from previous months it would somehow read it from other partitions.
Two questions:
1) Is this a good idea for archiving older data but still leave it query-able?
2) Does Spring-JPA work with partitioned tables? Or do we have to figure out how to break up the query and do native queries and concatenate the restult set?
Thanks.
I am working with postgres partitioning with Hibernate & Spring JPA for a period of time. So I think, I can try to answer your questions.
1) Is this a good idea for archiving older data but still leave it query-able?
If you are applying indexes and not re-indexing table frequently, then partitioning of data may result faster query results.
Also you can use clustered index feature in postgres as well to fetch the data faster.
Because table with older data will not going to be updated, so clustered index will improve the performance efficiently.
2) Does Spring-JPA work with partitioned tables? Or do we have to figure out how to break up the query and do native queries and concatenate the restult set?
Spring JPA will work out of the box with partitioned table. It will retrieve the data from master as well as child tables and returns the concatenated result set.
Note : Issue with partitioned table
The only issue you will face with partitioned table is insertion in partitioned table.
Let me explain, when you partition a table, you will create a trigger over master table, and that trigger will return null. This is the key behind insertion issue in partitioned table using Spring JPA / Hibernate.
When you try to insert a row using Spring JPA or Hibernate you will face below issue
Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
To overcome this issue you need to override implementation of Batching batcher.
In hibernate you can provide the custom implementation of batcher factory using below configuration
hibernate.jdbc.factory_class=path.to.my.batcher.factory.implementation
In spring JPA you can achieve the same by custom implementation of batch builder using below configuration
hibernate.jdbc.batch.builder=path.to.my.batch.builder.implementation
References :
Custom Batch Builder/Batch in Spring-JPA
Demo Application
In addition to the #Anil Agrawal answer.
If you are using spring boot 2 then you need to define the customBatcher using the property.
spring.jpa.properties.hibernate.jdbc.batch.builder=net.xyz.jdbc.CustomBatchBuilder
You do not have to break down the JDBC query with postgres 11+.
If you execute select on the main table with plain jdbc, the DB would return the aggregated results from the partitioned tables.
In other words, the work is done by the Postgres DB, so Spring JPA will simply get the result and map it to objects as if there were no partitioning.
For having inserts work in a partitioned table you need to make sure that your partitions are already created, i think spring data will not create them for you.

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what redshift and MongoDB is, so I decided I'll take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table. And the records in this temp table get their ids from the originating MongoDB collection. This allows us to then run a delete against the target table, in our example, the employees table, where the id is present in the temp table. If any of the tables don't exist, it gets created on the fly, based on a schema provided by the plugin. And so we get to insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so it'll be updated with an is_deleted flag in redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a redshift table, in order to keep track of when the migration last run successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it
from the engine, in order to ensure that only deltas are moved
across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
A schema file, this is a SQL file containing the schema for the table at redshift
Redshift is a data ware housing product and Mongo DB is a NoSQL DB. Clearly, they are not a replacement of each other and can co-exist and serve different purpose. Now how to save and update records at both places.
You can move all Mongo DB data to Redshift as a one time activity.
Redshift is not a good fit for real time write. For Near Real Time Sync to Redshift, you should Modify program that writes into Mongo DB.
Let that program also writes into S3 locations. S3 location to redshift movement can be done on regular interval.
Mongo DB being a document storage engine, Apache Solr, Elastic Search can be considered as possible replacements. But they do not support SQL type querying capabilities.They basically use a different filtering mechanism. For eg, for Solr, you might need to use the Dismax Filter.
On Cloud, Amazon's Cloud Search/Azure Search would be compelling options to try as well.
You can use AWS DMS to migrate data to redshift now easily , you can also realtime ongoing changes with it.