Import data from HDFS to MongoDB using mongoimport - mongodb

I have a set of files on HDFS. Can I load these files directly into MongoDB (using mongoimport) without first copying them from HDFS to my hard disk?

Have you tried MongoInsertStorage?
You can simply load the dataset using Pig and then use MongoInsertStorage to write directly into Mongo. Internally it launches a set of mappers that do exactly what David Gruzman's answer on this page describes. One advantage of this approach is the parallelism and speed you get from multiple mappers inserting into the Mongo collection simultaneously.
Here's a rough cut of what can be done with Pig:
REGISTER mongo-java-driver.jar;
REGISTER mongo-hadoop-core.jar;
REGISTER mongo-hadoop-pig.jar;
DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();
-- you need this here since multiple mappers could spawn with the same
-- data set and write duplicate records into the collection
SET mapreduce.map.speculative false;
SET mapreduce.reduce.speculative false;
-- or some equivalent loader
BIG_DATA = LOAD '/the/path/to/your/data' USING PigStorage('\t');
-- the output URI names the target as db.collection
STORE BIG_DATA INTO 'mongodb://hostname:27017/db.collection' USING MongoInsertStorage('', '');
More information here
https://github.com/mongodb/mongo-hadoop/tree/master/pig#inserting-directly-into-a-mongodb-collection

Are you storing CSV/JSON files in HDFS? If so, you just need some way of mapping them to your filesystem so you can point mongoimport to the file.
Alternatively, mongoimport reads its input from stdin when no file is specified.

You can use mongoimport without the --file argument, and load from stdin:
hadoop fs -text /path/to/file/in/hdfs/*.csv | mongoimport ...

If we are talking about big data, I would look into scalable solutions.
We had a similar case: a serious data set (several terabytes) sitting in HDFS. This data, after some transformation, had to be loaded into Mongo.
What we did was develop a MapReduce job that runs over the data, with each mapper inserting its split of the data into MongoDB via the API.
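The per-mapper logic described above can be sketched roughly as follows. This is only an illustration, not the job we ran: the field names, the tab separator, and the batch size are assumptions, and `insert_many` is a placeholder for whatever bulk-insert call your driver offers (with pymongo, a collection's `insert_many` method would fit).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def parse_line(line: str, fields: List[str], sep: str = "\t") -> dict:
    """Turn one delimited record into a document keyed by field name."""
    return dict(zip(fields, line.rstrip("\n").split(sep)))


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents until the input is exhausted."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_split(
    lines: Iterable[str],
    fields: List[str],
    insert_many: Callable[[List[dict]], object],
    batch_size: int = 1000,
) -> int:
    """Insert every non-blank record of one input split; return the count."""
    docs = (parse_line(line, fields) for line in lines if line.strip())
    total = 0
    for chunk in batches(docs, batch_size):
        insert_many(chunk)  # one round trip per batch, not per record
        total += len(chunk)
    return total
```

Each mapper would call `load_split` on the lines of its own split, so the inserts run in parallel across the cluster. Batching keeps the number of round trips to Mongo down.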

Related

How to read and write BSON files with Spark?

I have many MongoDB dumps in gzip compressed BSON files, each with multiple documents. I would like to read them directly to Spark, ideally partitioning on individual document level.
Previous discussions (1, 2) are old and use the deprecated Hadoop Mongo connector. The new, actively maintained Spark Mongo connector seems to implement a DefaultSource interface, a couple of custom partitioners, and a connection layer.
I would like to extract (or contribute) a way to read a multi-document BSON file from disk into a DataFrame, such that different documents can be loaded into different partitions. Writing would also be great to have for completeness, but I'm not sure how robust writing to a single file from multiple writers can be. I am new to Spark and unsure where to start.

MongoDB merging db

Is there a way to merge two mongodb databases?
That is, all records and files from DB2 should be merged into DB1.
I have a Java-based web application with several APIs to download file content from MongoDB. So I'm thinking of using bash and curl to download the files, read the record properties, and then re-upload (merge) them to the destination DB1.
This, however, has an issue: the same Mongo _id ObjectID("xxxx") from DB2 cannot be transferred to DB1. As I understand it, MongoDB will automatically generate and assign a new ObjectID("xxxx") value.
Yes, use mongodump and mongorestore.
The chance of a duplicate document _id (assuming it's not the same document) is extremely low.
In that case Mongo will let you know the insertion has failed, and you can choose to deal with it however you see fit.
You could also use the write concern flag with the restore to decide how to deal with it while uploading.
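As a rough sketch, the dump and restore steps could be driven from a script like the one below. It only builds the two command lines. The database names and the dump directory are placeholders, and it assumes a mongorestore version that supports `--nsFrom`/`--nsTo` for renaming the namespace on restore.

```python
from typing import List


def merge_commands(src_db: str, dst_db: str, dump_dir: str) -> List[List[str]]:
    """Build the mongodump / mongorestore argument lists for merging
    src_db into dst_db. Host and credentials are left at their defaults."""
    dump = ["mongodump", "--db", src_db, "--out", dump_dir]
    restore = [
        "mongorestore",
        "--nsFrom", f"{src_db}.*",  # every collection of the source db
        "--nsTo", f"{dst_db}.*",    # lands in the same-named collection of the target
        dump_dir,
    ]
    return [dump, restore]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Since mongorestore only inserts and does not update, documents already in DB1 are not overwritten when a restored document has the same _id.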

How to improve the speed of a lot of small files' read and write?

My job is to improve the speed of reading a lot of small files (1KB each) from disk and writing them into our database.
The database is open source to me, and I can change all the code from the client to the server.
The database is a simple master-slave distributed database built on HDFS, like HBase. Small files from disk can be inserted into the database, which automatically combines them into bigger blocks and writes those to HDFS. (Big files can likewise be split into smaller blocks by the database before being written to HDFS.)
One way to change the client is to increase the number of threads.
I don't have any other ideas. Alternatively, can you suggest how to do the performance analysis?
One way to process such small files is to convert them into a sequence file and store it in HDFS. Then use this file as the input of a MapReduce job that puts the data into HBase or a similar database.
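To illustrate the idea of packing many small files into one large container, here is a minimal sketch. Note that it does not produce Hadoop's actual SequenceFile format; it uses a made-up length-prefixed layout (key length, value length, key, value) just to show how file name and content pairs can be combined into a single stream and read back.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Tuple

# Hypothetical record header: key length and value length, big-endian uint32.
_HEADER = struct.Struct(">II")


def pack(records: Iterable[Tuple[str, bytes]], out: BinaryIO) -> int:
    """Append (file name, content) pairs to one stream; return the record count."""
    count = 0
    for name, payload in records:
        key = name.encode("utf-8")
        out.write(_HEADER.pack(len(key), len(payload)))
        out.write(key)
        out.write(payload)
        count += 1
    return count


def unpack(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Read back the (file name, content) pairs written by pack()."""
    while True:
        header = src.read(_HEADER.size)
        if not header:
            return
        if len(header) < _HEADER.size:
            raise ValueError("truncated record header")
        key_len, value_len = _HEADER.unpack(header)
        name = src.read(key_len).decode("utf-8")
        yield name, src.read(value_len)


def pack_to_bytes(records: Iterable[Tuple[str, bytes]]) -> bytes:
    """Convenience wrapper that packs into an in-memory buffer."""
    buf = io.BytesIO()
    pack(records, buf)
    return buf.getvalue()
```

The point is that the storage layer then sees one large sequential write instead of thousands of 1KB writes.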
This uses AWS as an example, but it could be any storage/queue setup:
If the files can live on shared storage such as S3, you could add one queue entry per file and then just start throwing servers at the queue to add the files to the db. At that point the bottleneck becomes the db instead of the client.
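The queue pattern can be sketched on a single machine with threads standing in for servers. This is only an illustration: `handle` is a placeholder for whatever reads one file and writes it to the db, and a real setup would use a shared queue service rather than an in-process `queue.Queue`.

```python
import queue
import threading
from typing import Callable, Iterable, List


def drain_queue(paths: Iterable[str], handle: Callable[[str], object], workers: int = 4) -> int:
    """Put one entry per file on a queue and let `workers` threads drain it.
    Returns the number of files handled."""
    pending: "queue.Queue[str]" = queue.Queue()
    for path in paths:
        pending.put(path)

    done: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                path = pending.get_nowait()
            except queue.Empty:
                return  # nothing left, this worker is finished
            handle(path)
            with lock:
                done.append(path)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(done)
```

Raising `workers` (or adding machines) scales the client side until the db becomes the limit, which is also a simple way to find out where the bottleneck is.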

Importing AccessDB and Oracle directly into MongoDB

I am receiving .dmp and .mdb files from a customer & need to get that data into MongoDB.
Is there any way to import these file types straight into Mongo?
The goal is to programmatically ingest these into mongo in any way I can. The only rule is that customer will not change their method of data delivery, meaning I'm stuck with the .dmp and .mdb files as a source.
Any assistance would be greatly appreciated.
Here are a few options/ideas:
Convert mdb to csv, then use mongoimport --type csv to import into MongoDB.
Use an ETL tool, e.g. Pentaho, Informatica, etc. This will give you much more flexibility for doing any necessary transformation/conversion of data.
Write a custom ETL tool, using libraries that know how to read mdb and dmp files.
You don't mention how you plan to use this data, how many tables are in the database, and how normalized the tables are. Depending on the specifics of your use case, it's very possible that loading the data from Access "as is" will not be a good choice since normalized schemas are not a good fit for MongoDB and MongoDB does not natively support joins. This is where an ETL tool can help, by extracting the source data and transforming it into an appropriate JSON structure.
MongoDB has released ODBC drivers (see MongoDB ODBC Drivers). They let you connect MS Access directly to MongoDB through ODBC. Voila!

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for queries. So first we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift first you need to create a SQL DDL to store the schema in Redshift i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature. You need to define a JSONPaths file to do this.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.
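As a sketch of what the JSONPaths step involves, the helper below generates a JSONPaths file from a list of dotted field names and builds a matching COPY statement. The table name, S3 URIs, and IAM role are hypothetical placeholders, and the code assumes the export holds one JSON object per record. The entries in the JSONPaths file must be in the same order as the columns of the target table.

```python
import json
from typing import List


def build_jsonpaths(columns: List[str]) -> str:
    """Return the text of a JSONPaths file, one path per table column.
    A dotted name such as 'user.name' becomes $['user']['name']."""
    paths = []
    for column in columns:
        keys = "".join(f"['{key}']" for key in column.split("."))
        paths.append("$" + keys)
    return json.dumps({"jsonpaths": paths}, indent=2)


def copy_statement(table: str, data_uri: str, jsonpaths_uri: str, iam_role: str) -> str:
    """Build a COPY statement that loads JSON using the JSONPaths file."""
    return (
        f"COPY {table} FROM '{data_uri}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS JSON '{jsonpaths_uri}';"
    )
```

You would upload the generated JSONPaths file to S3 next to the data and run the COPY statement against Redshift with your usual SQL client.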
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think DMS is now the best option for migrating from MongoDB to Redshift.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture(CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions that would be helpful to know would be:
Is this an add-only, always-increasing incremental sync, i.e. data is only being added and never updated or removed, or is your Redshift instance interested only in additions?
Is it OK for the data to become inconsistent because deletes and updates at the source are not fed to the Redshift instance?
Does it need to be a daily incremental batch, or can it happen in real time as well?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which are described at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
As ask mentioned, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collection data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output.
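A utility of that kind could look roughly like this. It is not the program I wrote, just a minimal sketch: it assumes one JSON document per input line (mongoexport's default output), flattens nested objects into dotted column names, and stores arrays as a JSON string in a single column, which is only one of several ways to deal with them.

```python
import csv
import io
import json
from typing import Iterable, List, TextIO


def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten nested objects into dotted keys; keep arrays as JSON text."""
    flat = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)  # no array type in Redshift
        else:
            flat[name] = value
    return flat


def json_lines_to_csv(lines: Iterable[str], columns: List[str], out: TextIO) -> int:
    """Write one CSV row per JSON line; return the number of rows written.
    Fields missing from a document are left empty, extra fields are dropped."""
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    rows = 0
    for line in lines:
        if not line.strip():
            continue
        writer.writerow(flatten(json.loads(line)))
        rows += 1
    return rows
```

Called with `sys.stdin` and `sys.stdout`, it can sit at the end of a `mongoexport ... |` pipe. Note that mongoexport writes the _id as an object, so the column shows up as `_id.$oid`.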
Piping the output also allows you to make the process parallel.
mongoexport lets you add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the other program, and decrease the total runtime of the migration process.
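One way to split the work is by ranges of a numeric field, with one mongoexport process per range. The sketch below only builds the queries and command lines. The field, database, and collection names are placeholders, and it assumes your collection actually has an evenly distributed integer field to split on.

```python
import json
from typing import List


def range_queries(field: str, lo: int, hi: int, parts: int) -> List[str]:
    """Split [lo, hi) into up to `parts` contiguous ranges, one query per range."""
    if parts < 1 or hi <= lo:
        raise ValueError("need parts >= 1 and hi > lo")
    step, extra = divmod(hi - lo, parts)
    queries = []
    start = lo
    for i in range(parts):
        end = start + step + (1 if i < extra else 0)
        if end > start:  # skip empty ranges when parts > hi - lo
            queries.append(json.dumps({field: {"$gte": start, "$lt": end}}))
        start = end
    return queries


def export_commands(db: str, collection: str, queries: List[str]) -> List[List[str]]:
    """Build one mongoexport argument list per query."""
    return [
        ["mongoexport", "--db", db, "--collection", collection, "--query", q]
        for q in queries
    ]
```

Each command can then be started with `subprocess.Popen` and its stdout piped into the conversion program, so the N exports run side by side.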
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
Stitch Data is the best tool I've ever seen for replicating incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes on the tables being replicated.