MongoDB merging db - mongodb

Is there a way to merge two mongodb databases?
In a way all records and files from DB2 should be merged to DB1.
I have a Java based web application with several APIs to download file content from the MongoDB. So I'm thinking using bash curl download the file, read the records properties then re-upload (merge) to the destination DB1.
This however will have an issue since the same Mongo _id ObjectID("xxxx") from DB2 cannot be transfer to DB1. MongoDB will automatically generate and assign ObjectID("xxxx") value based on what I understand.

Yes, use Mongodump and Mongorestore.
the chance for a duplicate document id (assuming its not the same document) is extremely low.
and in that case mongo will let you know insertion has failed and you could choose to deal with it however you see fit.
You could also use the write concern flag with the restore to decide how to deal with it while uploading.

Related

Insert MongoDB document with an objectId that existed in the past

I've a bunch of collections (12) and I need to rename many fields of them. I can't do it live in the database, all I can do is download and reupload a dump of it.
So I've downloaded the collection with mongodump manipulated the data and I'm planning to use mongorestore to push it back on the database.
I'm wondering what will happen then with ObjectIds.. I know that an objectId is unique throughout the database so I'm thinking about deleting all the old data right before using mongorestore, is it ok or will I still have problems with the ids?
You can specify any value for MongoID whatever you want. You even can use string instead of MongoID.
If you have production app you need to perform upgrade and migrate data by application itself step by step.
If you have one proccess singlethreaded application or if you can run your app in that way - it is most simple case. Else you need synchronization service.
Be carefull with async/await and promises and so on asyncronous processes. They receive and have in memory one the data in one time and continue process with that data in another time, and it need to have in mind that.
You need to do:
modify service to be ready to both data format
create modification code which will go through all the data and migrate it
modify service to be ready only to new data format once all the data migrate done

Database level migrations in parse server

Is there a possible way of writing database migrations for Parse Server?
My use case is: I want some tables with pre-populated data whenever I connect my application to a fresh mongodb server. Say for example, setup a staging environment, setup a local development environment etc etc.
I could not really find anything in the docs.
Am i going in the right direction or am I missing something?
Part of the Parse.com -> Parse-Server migration is that you now have to manage your own database. Parse-Server gives you the necessary tools to connect to your database, but you have to do things like manage indexes (mlab gives you weekly tips on what indexes you could add for improvement!), uploading large amounts of data, etc.
So, if your question is "Does Parse-Server do this?" No, and they won't.
If your question is "Can this be done?" Well, yes! MongoDB has an upload feature that takes in JSON or CSV, if I recall correctly. I know you can upload a collection, I'm not sure if you can do multiple collections at a time. One caveat is that you need to set createdAt and updatedAt and objectId yourself for this, but yes you can do this.
If you aren't too familiar with working with raw data that mongo needs, you could always setup the tables you want if they aren't too big, export all the data, and then use that data export as the import for all the fresh databases thereafter. Only issue there is that updatedAt and createdAt will show old dates on new instances.

Sync postgreSql data with ElasticSearch

Ultimately I want to have a scalable search solution for the data in PostgreSql. My finding points me towards using Logstash to ship write events from Postgres to ElasticSearch, however I have not found a usable solution. The soluions I have found involve using jdbc-input to query all data from Postgres on an interval, and the delete events are not captured.
I think this is a common use case so I hope you guys could share with me your experience, or give me some pointers to proceed.
If you need to also be notified on DELETEs and delete the respective record in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution working around the binlog as suggested here
However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch and you can exclude them from your searches with a simple term query on the deleted field.
Whenever you need to perform some cleanup, you can delete all records flagged deleted in both PostgreSQL and Elasticsearch.
You can also take a look at PGSync.
It's similar to Debezium but a lot easier to get up and running.
PGSync is a Change data capture tool for moving data from Postgres to Elasticsearch.
It allows you to keep Postgres as your source-of-truth and expose structured denormalized
documents in Elasticsearch.
You simply define a JSON schema describing the structure of the data in
Elasticsearch.
Here is an example schema: (you can also have nested objects)
e.g
{
"nodes": {
"table": "book",
"columns": [
"isbn",
"title",
"description"
]
}
}
PGsync generates queries for your document on the fly.
No need to write queries like Logstash. It also supports and tracks deletion operations.
It operates both a polling and an event-driven model to capture changes made to date
and notification for changes that occur at a point in time.
The initial sync polls the database for changes since the last time the daemon
was run and thereafter event notification (based on triggers and handled by the pg-notify)
for changes to the database.
It has very little development overhead.
Create a schema as described above
Point pgsync at your Postgres database and Elasticsearch cluster
Start the daemon.
You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.
Have a look at the github repo for more details.
You can install the package from PyPI
Please take a look at Debezium. It's a change data capture (CDC) platform, which allow you to steam your data
I created a simple github repository, which shows how it works

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what redshift and MongoDB is, so I decided I'll take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table. And the records in this temp table get their ids from the originating MongoDB collection. This allows us to then run a delete against the target table, in our example, the employees table, where the id is present in the temp table. If any of the tables don't exist, it gets created on the fly, based on a schema provided by the plugin. And so we get to insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so it'll be updated with an is_deleted flag in redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a redshift table, in order to keep track of when the migration last run successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it
from the engine, in order to ensure that only deltas are moved
across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
A schema file, this is a SQL file containing the schema for the table at redshift
Redshift is a data ware housing product and Mongo DB is a NoSQL DB. Clearly, they are not a replacement of each other and can co-exist and serve different purpose. Now how to save and update records at both places.
You can move all Mongo DB data to Redshift as a one time activity.
Redshift is not a good fit for real time write. For Near Real Time Sync to Redshift, you should Modify program that writes into Mongo DB.
Let that program also writes into S3 locations. S3 location to redshift movement can be done on regular interval.
Mongo DB being a document storage engine, Apache Solr, Elastic Search can be considered as possible replacements. But they do not support SQL type querying capabilities.They basically use a different filtering mechanism. For eg, for Solr, you might need to use the Dismax Filter.
On Cloud, Amazon's Cloud Search/Azure Search would be compelling options to try as well.
You can use AWS DMS to migrate data to redshift now easily , you can also realtime ongoing changes with it.

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo but we found Redshift offered very large performance improvements for query. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift first you need to create a SQL DDL to store the schema in Redshift i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use MongoExport to generate CSV or JSON from Mongo and then run it through DDL generator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's Copy from JSON feature. You need to define a JSONpath to do this.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster query than the other approaches such as using JSON EXTRACT PATH TEXT.
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources although in the end I did not use them:
Spotify has a open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Redit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service(DMS) Adds Support for MongoDB and Amazon DynamoDB.So I think now onward best option to migrate from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture(CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions that would be helpful to know would be:
Is this an add-only always increasing incremental sync i.e. data is only being added and not being updated / removed or rather your redshift instance is interested only in additions?
Is the data inconsistency due to delete / updates happening at source and not being fed to redshift instance ok?
Does it need to be daily-incremental batch or can it be realtime as it is happening as well?
Depending on your situation may be mongoexport works for you, but you have to understand the shortcoming of it, which can be found at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
as ask mentioned, You can use mongoexport in order to export the data, but keep in mind that redshift doesn't support arrays, so in case your collections data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport into a small utility program I wrote that transforms the mongoexport json rows into my desired csv output.
piping the output also allows you to make the process parallel.
Mongoexport allows you to add a mongodb query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe it's results into the other program and decrease the total runtime of the migration process.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
Stitch Data is the best tool ever I've ever seen to replicate incrementally from MongoDB to Redshift within a few clicks and minutes.
Automatically and dynamically Detect DML, DDL for tables for replication.