I used bsondump to export a huge (69 GB) BSON file to JSON. I expected to get a valid JSON array, but instead the output is just one root-level object after another, with no separators and no enclosing array.
mongoexport has a --jsonArray option to produce a proper JSON array, but this BSON file was dumped on another machine, and due to size and performance considerations I do not want to import this large file into a database just so I can re-export it with mongoexport.
How can I export a valid JSON array using bsondump?
EDIT
To give some more background on why I need to convert a BSON-based MongoDB export to JSON:
1) I first tried to use mongoexport to export JSON directly from MongoDB, like this:
mongoexport -d mydb -c notifications --jsonArray -o lv.json
The problem with this is that mongoexport reports no progress and runs significantly slower than mongodump (it never finished before I had to stop it), while putting significant strain on a production server. As I stated in my original question, it is not an option for that reason.
2) mongodump works much faster, likely because it doesn't have to convert anything to JSON and just dumps the raw internal data. It also shows progress, so I knew when it would finish. That made it the only thing I could run on the production server.
mongodump --db mydb
Edit 2
After exporting to .bson it is then possible to use bsondump to convert the .bson file into a .json file:
bsondump mydata.bson > mydata.json
To make the point clear here: bsondump has no --jsonArray option like mongoexport. So it cannot export a valid JSON array; instead it dumps multiple root-level objects into one file. The result is not a valid JSON document, which one would have to pre-parse.
/Edit2
3) That leaves me with basically two options: import the BSON dump into a local database and export it to a proper JSON file using mongoexport --jsonArray, or find a way around bsondump itself not being able to produce a proper JSON array. A third option, implementing a BSON parser in my tool, is something that I'm not really keen on...
The large file size is not a problem for my tool. My tool is written in C++ and specialized for large data streams. I use rapidjson with a SAX parser under the hood and filter out records via my own SQL-like evaluator. Memory usage is usually under 10 MB since I use a SAX parser instead of a DOM.
To answer my own question: bsondump is currently missing an option to create a JSON array as output (like mongoexport's --jsonArray option). I've created a feature request [1] and maybe it will be added in a future version of bsondump.
Meanwhile, I've created a small tool for my purpose which converts my data into a JSON array.
[1] https://jira.mongodb.org/browse/TOOLS-1734
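For reference, a minimal sketch of that kind of conversion (not the actual C++ tool): it assumes bsondump emits one JSON document per line and streams the input line by line, so memory stays flat regardless of file size.

import sys

# hypothetical stream_to_array.py: wraps line-delimited JSON documents in a JSON array
first = True
sys.stdout.write("[")
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if not first:
        sys.stdout.write(",")
    sys.stdout.write(line)
    first = False
sys.stdout.write("]\n")

Used as: bsondump mydata.bson | python stream_to_array.py > mydata.json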
Related
I'm trying to import a JSON file into my local database, but there seems to be an issue when the file is larger than a certain size.
From the command prompt I run:
C:\Program Files\MongoDB\Server\4.4\bin>mongoimport --host=localhost --port=27017 -c users --db=cptechdemolocal --type json --file=C:\data\importdata\users.json --numInsertionWorkers 8
This just throws an error of 'Failed: an inserted document is too large'.
Is there a way to do this without splitting the JSON file? Maybe some missing arguments or an alternative method I can look at?
Thank you so much in advance.
MongoDB documents (when serialized to BSON) must not exceed 16 MB. See https://docs.mongodb.com/manual/reference/limits/.
Inserting a larger document won't work; you need to reorganize your data.
I managed to fix the issue by adding --batchSize 1, which made the import run in small batches and removed the need to split the data in my JSON file.
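For anyone hitting the same error, the adjusted command would look something like the following (paths and names taken from the question; --batchSize 1 is the only addition):
mongoimport --host=localhost --port=27017 -c users --db=cptechdemolocal --type json --file=C:\data\importdata\users.json --numInsertionWorkers 8 --batchSize 1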
I have a site that I created using MongoDB, but now I want to create a new site with MySQL. I want to retrieve the data from my old site (the one using MongoDB). I use the Robomongo tool to connect to the MongoDB server, but I don't see my old files (*.pdf, *.doc). I think the data is stored in binary form, isn't it?
How can I retrieve this data?
The binary data you've highlighted is stored using a convention called GridFS. Robomongo 0.8.x doesn't support decoding GridFS binary data (see: issue #255).
In order to extract the files you'll either need to:
use the command line mongofiles utility included with MongoDB. For example:
mongofiles list to see files stored
mongofiles get filename to get a specific file
use a different program or driver that supports GridFS
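As a concrete (hypothetical) example, assuming the files live in the default GridFS bucket of a database called mydb and one of them is named report.pdf, the mongofiles commands above would look roughly like:
mongofiles --db mydb list
mongofiles --db mydb get report.pdf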
We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for queries. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift you first need SQL DDL that defines the schema in Redshift, i.e. a CREATE TABLE script.
You can use a tool like Variety to help, as it can give you some insight into your Mongo schema. However, it does struggle with big datasets; you might need to subsample your dataset.
Alternatively, DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (the dataset I was dealing with was 120 GB).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's COPY from JSON feature. You need to define a JSONPaths file to do this (an example follows below).
For more information check out the Snowplow blog - they use JSONPaths to map the JSON onto a relational schema. See their blog post about why people might want to load JSON into Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.
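Not from the original answer, but to make the JSONPaths idea concrete: the JSONPaths file maps fields of each source JSON document to table columns in order, and the COPY command references it. The table, bucket, role, and field names below are placeholders.
A hypothetical JSONPaths file (my_jsonpaths.json):
{
  "jsonpaths": [
    "$.user_id",
    "$.event_name"
  ]
}
And a hypothetical COPY statement that uses it:
COPY my_table
FROM 's3://my-bucket/data/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-role'
JSON 's3://my-bucket/jsonpaths/my_jsonpaths.json';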
For incremental backups, it depends on whether data is only being added or existing data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources, although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think from now on the best option to migrate from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions that would be helpful to know would be:
Is this an append-only, ever-growing incremental sync, i.e. is data only being added and never updated/removed, or is your Redshift instance only interested in additions anyway?
Is it acceptable for the data to be inconsistent because deletes/updates happen at the source and are not fed to the Redshift instance?
Does it need to be a daily incremental batch, or could it also be real-time as changes happen?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which are described at http://docs.mongodb.org/manual/reference/program/mongoexport/.
I had to tackle the same issue (not on a daily basis though).
As mentioned above, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collection data contains arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output.
Piping the output also allows you to make the process parallel.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the other program, and decrease the total runtime of the migration process (see the sketch after this answer).
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
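Not part of the original answer, but a rough illustration of that parallel approach, assuming a numeric field you can partition on; the field name, ranges, and the transform program are placeholders:
mongoexport --db mydb --collection events --query '{"user_id": {"$gte": 0, "$lt": 1000000}}' | ./transform > part1.csv &
mongoexport --db mydb --collection events --query '{"user_id": {"$gte": 1000000, "$lt": 2000000}}' | ./transform > part2.csv &
wait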
Stitch Data is the best tool I've ever seen for replicating data incrementally from MongoDB to Redshift, within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes on tables for replication.
I want to import data from a CSV file into MongoDB. There are some long and double values stored in the CSV. How do I import these long and double values as long and double into MongoDB using the mongoimport command? Writing a script is another option, and I have already written one, but I want to use mongoimport to make it easier. What I basically want is a MySQL-like import which allows me to assign data types while importing the data.
mongoimport will always use the default double representation.
It can't be used to differentiate between double and long.
See Reference > MongoDB Package Components > mongoimport
Note: Do not use mongoimport and mongoexport for full instance, production backups because they will not reliably capture data type information. Use mongodump and mongorestore as described in Backup Strategies for MongoDB Systems for this kind of functionality.
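Not from the original answer, but since the question mentions having written a script: a minimal sketch of what such a typed import could look like with pymongo, where bson.Int64 forces a 64-bit integer (NumberLong). The file name, column names, and types are assumptions.

import csv
from pymongo import MongoClient
from bson.int64 import Int64

# hypothetical typed CSV import; mongoimport (as noted above) would store all numbers as doubles
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycollection"]

docs = []
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        docs.append({
            "id": Int64(int(row["id"])),   # stored as long (NumberLong)
            "price": float(row["price"]),  # stored as double
            "name": row["name"],
        })
if docs:
    coll.insert_many(docs)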
I have a set of files on HDFS. Can I load these files directly into MongoDB (using mongoimport) without copying them from HDFS to my hard disk?
Have you tried MongoInsertStorage?
You can simply load the dataset using Pig and then use MongoInsertStorage to dump it directly into Mongo. It internally launches a bunch of mappers that do exactly what is mentioned in David Gruzman's answer on this page. One of the advantages of this approach is the parallelism and speed you achieve due to multiple mappers simultaneously inserting into the Mongo collection.
Here's a rough cut of what can be done with Pig:
REGISTER mongo-java-driver.jar
REGISTER mongo-hadoop-core.jar
REGISTER mongo-hadoop-pig.jar
DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();
-- you need this here since multiple mappers could spawn with the same
-- data set and write duplicate records into the collection
SET mapreduce.map.speculative false
SET mapreduce.reduce.speculative false
-- or some equivalent loader
BIG_DATA = LOAD '/the/path/to/your/data' using PigStorage('\t');
STORE BIG_DATA INTO 'mongodb://hostname:27017/db.collection' USING MongoInsertStorage('', '');
More information here
https://github.com/mongodb/mongo-hadoop/tree/master/pig#inserting-directly-into-a-mongodb-collection
Are you storing CSV/JSON files in HDFS? If so, you just need some way of mapping them to your filesystem so you can point mongoimport to the file.
Alternatively mongoimport will take input from stdin unless a file is specified.
You can use mongoimport without the --file argument, and load from stdin:
hadoop fs -text /path/to/file/in/hdfs/*.csv | mongoimport ...
If we are speaking about big data, I would look into scalable solutions.
We had a similar case of a serious data set (several terabytes) sitting in HDFS. This data, albeit with some transformation, had to be loaded into Mongo.
What we did was develop a MapReduce job which runs over the data, where each mapper inserts its split of the data into MongoDB via the API.
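Not from the original answer, but a minimal sketch of that idea using Hadoop Streaming and pymongo instead of a Java MapReduce job; the connection string, database, collection, and record layout are assumptions:

import sys
from pymongo import MongoClient

# hypothetical streaming mapper: each mapper inserts its input split into MongoDB
client = MongoClient("mongodb://hostname:27017")
coll = client["mydb"]["mycollection"]

batch = []
for line in sys.stdin:
    user_id, payload = line.rstrip("\n").split("\t", 1)  # assumed tab-separated layout
    batch.append({"user_id": user_id, "payload": payload})
    if len(batch) >= 1000:   # insert in batches to cut down round trips
        coll.insert_many(batch)
        batch = []
if batch:
    coll.insert_many(batch)

Run it with Hadoop Streaming and zero reducers so only the mappers write to Mongo.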