We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large query performance improvements. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift, you first need to create SQL DDL for the schema in Redshift, i.e. a CREATE TABLE script.
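For example (purely illustrative - the table and column names here are made up), the DDL for a flattened events collection might look something like this:

psql "host=<your-redshift-endpoint> port=5439 dbname=<your db> user=<your user>" <<'SQL'
CREATE TABLE events (
    id          VARCHAR(24) NOT NULL,  -- Mongo ObjectId stored as a hex string
    user_id     VARCHAR(24),
    event_name  VARCHAR(256),
    created_at  TIMESTAMP,
    plan        VARCHAR(64)            -- a nested field flattened into its own column
);
SQL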
You can use a tool like Variety to help, as it can give you some insight into your Mongo schema. However, it does struggle with big datasets - you might need to subsample your dataset.
Alternatively, ddlgenerator can generate DDL from various sources including CSV or JSON. It also struggles with large datasets (at least with the 120GB dataset I was dealing with).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through ddlgenerator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature. You need to define a JSONPaths file to do this.
For more information check out the Snowplow blog - they use JSONPaths to map the JSON onto a relational schema. See their blog post about why people might want to load JSON into Redshift.
Turning the JSON into columns allows much faster queries than alternative approaches such as using JSON_EXTRACT_PATH_TEXT.
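As a rough sketch of the JSONPaths file and the COPY command (the bucket, IAM role and field names below are made up, and depending on how you exported you may first need to reshape the file into one JSON object per line):

cat > events_jsonpaths.json <<'EOF'
{
  "jsonpaths": [
    "$['_id']['$oid']",
    "$['user_id']",
    "$['event_name']",
    "$['created_at']"
  ]
}
EOF

aws s3 cp data.json s3://my-bucket/mongo/data.json
aws s3 cp events_jsonpaths.json s3://my-bucket/mongo/events_jsonpaths.json

psql "host=<your-redshift-endpoint> port=5439 dbname=<your db> user=<your user>" <<'SQL'
COPY events
FROM 's3://my-bucket/mongo/data.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS JSON 's3://my-bucket/mongo/events_jsonpaths.json';
SQL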
For incremental backups, it depends on whether data is only being added or existing data is changing. For analytics, it's normally the former. The approach I used is to export the analytics data once a day, then copy it into Redshift incrementally.
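A minimal sketch of such a daily job (the created_at field, file names, bucket and schedule are all assumptions about your setup):

# export only documents added since the last successful run
LAST_RUN=$(cat last_run.txt)                 # e.g. 2015-06-01T00:00:00Z
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
mongoexport --db <your db> --collection <your_collection> \
    --query "{ \"created_at\": { \"\$gte\": { \"\$date\": \"$LAST_RUN\" } } }" \
    --jsonArray > delta.json                 # query syntax may vary by mongoexport version
aws s3 cp delta.json s3://my-bucket/mongo/delta.json
# then COPY s3://my-bucket/mongo/delta.json into Redshift as described above
echo "$NOW" > last_run.txt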
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your Mongo collections and flatten them into their own tables in Redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think the best option for migrating from MongoDB to Redshift now is DMS.
MongoDB versions 2.6.x and 3.x supported as a database source
Document mode and table mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
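For reference, creating the replication task from the AWS CLI looks roughly like this (the endpoints and replication instance must already exist, and the ARNs below are placeholders):

cat > table-mappings.json <<'EOF'
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-everything",
      "object-locator": { "schema-name": "%", "table-name": "%" },
      "rule-action": "include"
    }
  ]
}
EOF

aws dms create-replication-task \
    --replication-task-identifier mongo-to-redshift \
    --source-endpoint-arn <mongodb-endpoint-arn> \
    --target-endpoint-arn <redshift-endpoint-arn> \
    --replication-instance-arn <replication-instance-arn> \
    --migration-type full-load-and-cdc \
    --table-mappings file://table-mappings.json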
A few questions whose answers would be helpful:
Is this an append-only, ever-growing incremental sync, i.e. is data only being added and never updated/removed, or is your Redshift instance only interested in additions?
Is data inconsistency acceptable, i.e. deletes/updates happening at the source but not being fed through to the Redshift instance?
Does it need to be a daily incremental batch, or can it also be real time, as it happens?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which are documented at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
As mentioned in another answer, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collections' data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output.
Piping the output also allows you to make the process parallel.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the transform program, and decrease the total runtime of the migration.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
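Putting those pieces together, a rough sketch of the pipeline (json_to_csv.py stands in for the transform utility, and the id split point, bucket and table names are placeholders):

# two mongoexport processes, each covering a slice of the collection,
# each piped through the transform utility that flattens JSON rows into TSV
mongoexport --db <your db> --collection <your_collection> \
    --query '{ "_id": { "$lt":  { "$oid": "555555555555555555555555" } } }' \
    | ./json_to_csv.py > part1.tsv &
mongoexport --db <your db> --collection <your_collection> \
    --query '{ "_id": { "$gte": { "$oid": "555555555555555555555555" } } }' \
    | ./json_to_csv.py > part2.tsv &
wait

aws s3 cp part1.tsv s3://my-bucket/export/part1.tsv
aws s3 cp part2.tsv s3://my-bucket/export/part2.tsv

# COPY picks up every object under the S3 prefix
psql "host=<your-redshift-endpoint> port=5439 dbname=<your db> user=<your user>" \
    -c "COPY my_table FROM 's3://my-bucket/export/part' IAM_ROLE '<role-arn>' DELIMITER '\t';"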
This should be a pretty easy solution.
Stitch Data is the best tool I've ever seen for replicating data incrementally from MongoDB to Redshift, within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes on the tables being replicated.
Related
I have a database currently in Mongo running on an EC2 instance and would like to migrate the data to DynamoDB. Is this possible and what is the most cost effective way to achieve this?
When you ask for a "cost effective way" to migrate data, I assume you are looking for existing technologies that can ease your life. If so, you could do the following:
Export your MongoDB data to a delimited text file (e.g. CSV) using mongoexport.
Upload that file somewhere in S3.
Import this data from S3 into DynamoDB using AWS Data Pipeline.
Of course, you should design & finalize your DynamoDB table schema before doing all this.
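The first two steps might look like this (database, field list and bucket are placeholders; the Data Pipeline import itself is then configured against that S3 location):

mongoexport --db <your db> --collection <your_collection> \
    --type=csv --fields _id,name,email,created_at > export.csv   # older mongoexport versions use --csv
aws s3 cp export.csv s3://my-bucket/dynamodb-import/export.csv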
Whenever you are changing databases, you have to be very careful about the way you migrate data. Certain data formats maintain type consistency, while others do not.
Then there are data formats that simply cannot handle your schema. For example, CSV is great at handling data when it is one row per entry, but how do you render an embedded array in CSV? It really isn't possible. JSON is good at this, but JSON has its own problems.
The easiest example of this is JSON and DateTime. JSON does not have a specification for storing DateTime values; they can end up as ISO 8601 dates, or perhaps Unix epoch timestamps, or really anything a developer can dream up. What about Longs, Doubles, and Ints? JSON doesn't discriminate between them (it has a single number type), which can cause loss of precision if values are not deserialized correctly.
This makes it very important that you choose the appropriate translation medium. That generally means you have to roll your own solution: loading up the drivers for both databases, reading an entry from one, translating, and writing to the other. This is the best way to be absolutely sure errors are handled properly for your environment, that types are kept consistent, and that the code properly translates the schema from source to destination (if necessary).
What does this all mean for you? It means a lot of leg work for you. It is possible somebody has already rolled something that is broad enough for your case, but I have found in the past that it is best for you to do it yourself.
I know this post is old, but Amazon has since made this possible with AWS DMS. Check this document:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
Some relevant parts:
Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service
You can use AWS DMS to migrate data to an Amazon DynamoDB table. Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. AWS DMS supports using a relational database or MongoDB as a source.
I'm trying to use tELTPostgresqlOutput with a Postgres 9.3 server and it does not produce the expected result.
With a simple tPostgresqlInput and a tLogRow it works perfectly.
This is not how to use the ELT components. These should be used to do in database server transformations such as creating a star schema table from multiple tables in the same database. This allows you to use the database to do the transformation and avoid reading the data into memory for your job. It's particularly useful when dealing with large datasets that can't be broken down for the transformation.
If you want to transfer data from one database server/vendor to another you will need to use ETL components (pretty much anything not explicitly marked ELT) to read data out of the source database and write it back to the target database.
In this case you should be using a tMSSQLInput component to read in the data you need, a tMap to transform the data in the way you want and a tPostgresqlOutput component to write the data out to the Postgres database.
I am receiving .dmp and .mdb files from a customer & need to get that data into MongoDB.
Is there any way to straight import these file types into Mongo?
The goal is to programmatically ingest these into mongo in any way I can. The only rule is that customer will not change their method of data delivery, meaning I'm stuck with the .dmp and .mdb files as a source.
Any assistance would be greatly appreciated.
Here are a few options/ideas:
Convert mdb to csv, then use mongoimport --type csv to import into MongoDB (see the sketch after this list).
Use an ETL tool, e.g. Pentaho, Informatica, etc. This will give you much more flexibility for doing any necessary transformation/conversion of data.
Write a custom ETL tool, using libraries that know how to read mdb and dmp files.
You don't mention how you plan to use this data, how many tables are in the database, and how normalized the tables are. Depending on the specifics of your use case, it's very possible that loading the data from Access "as is" will not be a good choice since normalized schemas are not a good fit for MongoDB and MongoDB does not natively support joins. This is where an ETL tool can help, by extracting the source data and transforming it into an appropriate JSON structure.
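For the first option, a rough sketch using mdbtools and mongoimport (the file, table and database names are placeholders):

mdb-tables -1 customer.mdb                    # list the tables inside the Access file
mdb-export customer.mdb Orders > orders.csv   # dump one table to CSV
mongoimport --db mydb --collection orders \
    --type csv --headerline --file orders.csv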
MongoDB has released ODBC drivers. The MongoDB ODBC drivers let you connect MS Access directly to MongoDB through ODBC. Voila!
We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB query capabilities (including aggregation framework) for insight to the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided I'll take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table, and the records in the temp table keep their ids from the originating MongoDB collection. This allows us to run a delete against the target table, in our example the employees table, where the id is present in the temp table. If any of the target tables don't exist, they get created on the fly, based on a schema provided by the plugin. We then insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so updates set an is_deleted flag in redshift.
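In SQL terms, the load-and-merge step looks roughly like this (following the employees example, and assuming the target table already exists; the S3 path and role are placeholders):

psql "host=<your-redshift-endpoint> port=5439 dbname=<your db> user=<your user>" <<'SQL'
CREATE TEMP TABLE temp_employees (LIKE employees);

COPY temp_employees
FROM 's3://my-bucket/migrations/employees.tsv'
IAM_ROLE '<role-arn>'
DELIMITER '\t';

-- remove target rows that are about to be replaced, then insert everything from the temp table
DELETE FROM employees
USING temp_employees
WHERE employees.id = temp_employees.id;

INSERT INTO employees SELECT * FROM temp_employees;
SQL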
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it provides to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file
A schema file: a SQL file containing the schema for the table in redshift
Redshift is a data warehousing product and MongoDB is a NoSQL DB. Clearly, they are not replacements for each other and can co-exist, serving different purposes. So how do you save and update records in both places?
You can move all the MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For near-real-time sync to Redshift, you should modify the program that writes into MongoDB.
Have that program also write into an S3 location; the S3-to-Redshift movement can then be done at a regular interval.
MongoDB being a document storage engine, Apache Solr and Elasticsearch can be considered possible replacements for it, but they do not support SQL-type querying capabilities; they basically use a different filtering mechanism. For example, for Solr you might need to use the DisMax query parser.
On the cloud, Amazon CloudSearch or Azure Search would be compelling options to try as well.
You can now easily use AWS DMS to migrate data to Redshift; it can also replicate ongoing changes in real time.
Dear all,
Can anyone suggest a Postgres tool for Linux that can find the difference between two given databases?
I tried apgdiff 2.3, but it gives the difference in terms of schema, not the data, and I need both!
Thanks in advance!
Comparing data is not easy, especially if your database is huge. I created a Python program that can dump the PostgreSQL database schema to a file that can easily be compared via a third-party diff program: http://code.activestate.com/recipes/576557-dump-postgresql-db-schema-to-text/?in=user-186902
I think this program can be extended by dumping all table data into separate CSV files, similar to those used by the PostgreSQL COPY command. Remember to add the same ORDER BY to the SELECT ... queries. I have created a tool that reads SELECT statements from a file and saves the results in separate files. This way I can manage which tables and fields I want to compare (not all fields can be used in ORDER BY, and not all are important to me). Such a configuration can easily be created using the "dump schema" utility.
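For example, each table can be dumped deterministically with psql's \copy and the files compared with diff (the table, key and database names are placeholders):

psql -d db_one -c "\copy (SELECT * FROM customers ORDER BY id) TO 'customers_a.csv' WITH CSV HEADER"
psql -d db_two -c "\copy (SELECT * FROM customers ORDER BY id) TO 'customers_b.csv' WITH CSV HEADER"
diff customers_a.csv customers_b.csv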
Check out DBSolo. It does both object and data compares and can create a sync script based on the results. It's free to try and $99 to buy. My guess is the 99 bucks will be money well spent to avoid trying to come up with your own software to do this.
Data Compare
http://www.dbsolo.com/help/datacomp.html
Object Compare
http://www.dbsolo.com/help/compare.html
apgdiff https://www.apgdiff.com/
It's an open-source solution. I used it before for checking differences between dumps. Quite useful.
[EDIT]
It only diffs the schema, though.