How do you import binary data with mongoimport?

I've tried every combination to import binary data for Mongo and I CANNOT get it to work. I've tried using new BinData(0, <bindata>) and I've tried using
{
    "$binary" : "<bindata>",
    "$type" : "0"
}
The first one gives me a parsing error. The second gives me an error reading "Invalid use of a reserved field name."
I can import other objects fine. For reference, I'm trying to import a Base64-encoded image string. Here is the current version of the JSON I'm using:
{"_id" : "72984ce4-de03-407f-8911-e7b03f0fec26","OriginalWidth" : 73, "OriginalHeight" : 150, { "$binary" : "", "$type" : "0" }, "ContentType" : "image/jpeg", "Name" : "test.jpg", "Type" : "5ade8812-e64a-4c64-9e23-b3aa7722cfaa"}

I actually figured out this problem and thought I'd come back to SO to help anyone out who might be struggling.
Essentially, what I was doing was using C# to generate a JSON file. That file was used on an import script that ran and brought in all kinds of data. One of the fields in a collection required storing binary image data as a Base64-encoded string. The Mongo docs (Import Export Tools and Importing Interesting Types) were helpful, but only to a certain point.
To format the JSON properly for this, I had to use the following C# snippet to get an image file as a byte array and dump it into a string. There is a more efficient way of doing this for larger strings (StringBuilder for starters), but I'm simplifying for the purpose of illustrating the example:
byte[] bytes = File.ReadAllBytes(imageFile);
FileInfo fileInfo = new FileInfo(imageFile); // gives us the file name used below
string output = "{\"Data\" : {\"$binary\" : \"" + Convert.ToBase64String(bytes) + "\", \"$type\" : \"00\"}, \"ContentType\" : \"" + GetMimeType(fileInfo.Name) + "\", \"Name\" : \"" + fileInfo.Name + "\"}";
I kept failing on the type part, by the way. The "00" type translates to generic binary data, as specified in the BSON spec here: http://bsonspec.org/#/specification.
If you want to skip straight to the JSON, the above code output a string very similar to this:
{"Data": {"$binary": "[Byte array as Base64 string]", "$type": "00"}, "ContentType": "image/jpeg", "Name": "test.jpg"}
Then, I just used the mongoimport tool to process the resulting JSON.
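A typical invocation would look something like this (the db, collection, and file names here are placeholders, not the ones I actually used):
mongoimport --db mydb --collection images --file images.json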
Note: since I'm already in C#, I could've just used the Mongo DLL and done processing there, but for this particular case, I had to create the JSON files raw in the code. Fun times.

Related

JSON data retrieved from mongodb has formats added explicitly inside fields, e.g. {"field": {"$numberInt": "20"}}. How do I process that data?

I have used mongoimport to import data into MongoDB from CSV files. I am trying to retrieve data from a MongoDB Realm service. The returned data for the entry is as follows:
{
    "_id": "6124edd04543fb222e",
    "Field1": "some string",
    "Field2": {
        "$numberDouble": "145.81"
    },
    "Field3": {
        "$numberInt": "0"
    },
    "Field4": {
        "$numberInt": "15"
    },
    "Field5": {
        "$numberInt": "0"
    }
}
How do I convert this into normal JSON by removing the $numberInt and $numberDouble wrappers, like:
{
    "_id": "6124edd04543fb222e",
    "Field1": "some string",
    "Field2": 145.81,
    "Field3": 0,
    "Field4": 15,
    "Field5": 0
}
The fields also differ between documents, so I cannot use Mongoose directly. Are there any solutions to this?
It would also help to know why the numbers are being stored as $numberInt: "".
Edit:
For anyone with the same problem, this is how I solved it.
The array of documents is in EJSON format instead of JSON, as the upvoted answer says. To convert it back into normal JSON, I used JSON.stringify to first turn each document I got from the map function into a string, and then parsed it with EJSON.parse, passing the {strict: false} option (this option is important) to convert it into normal JSON.
restaurants = restaurants.map((restaurant) =>
    EJSON.parse(JSON.stringify(restaurant), { strict: false })
);
The EJSON.parse documentation is here. The module to install and import is mongodb-extjson.
The format with $numberInt etc. is called (MongoDB) Extended JSON.
You are getting it on the output side either because this is how you inserted your data (meaning your inserted data was incorrect, you need to fix the ingestion side) or because you requested extended JSON serialization.
If the data in the database is correct, and you want non-extended JSON output, you generally need to write your own serializers to JSON since there are multiple possibilities of how to format the data. MongoDB's JSON output format is the Extended JSON you're seeing in your first quote.
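That said, if you happen to be working in Python, PyMongo ships a helper for exactly this conversion; here is a minimal sketch (field names borrowed from the question, everything else illustrative):
import json
from bson import json_util

ejson = '{"Field2": {"$numberDouble": "145.81"}, "Field3": {"$numberInt": "0"}}'
doc = json_util.loads(ejson)  # parses extended JSON into native Python values
print(json.dumps(doc))        # {"Field2": 145.81, "Field3": 0} -- plain JSON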

Does ReactiveMongo handle extended JSON to BSON conversion fully?

I have been trying to use reactivemongo to insert some documents into a mongodb collection with a few BSON types.
I am using the Play JSON library to parse and manipulate some documents in extended JSON, here is one example:
{
    "_id" : {"$oid": "5f3403dc7e562db8e0aced6b"},
    "some_datetime" : {"$date": 1597841586927}
}
I'm using reactivemongo-play-json, so I have to import the following for my JsObject to be automatically converted to a reactivemongo BSONDocument when passing it to collection.insert.one:
import reactivemongo.play.json.compat._
import json2bson._
Unfortunately, once I open my mongo shell and look at the document I just inserted, this is the result:
{
    "_id" : ObjectId("5f3403dc7e562db8e0aced6b"),
    "some_datetime" : {
        "$date" : NumberLong("1597244282116")
    }
}
Only the _id has been understood as a BSON type described using extended JSON; I'd expect the some_datetime field to be something like an ISODate(), just as I'd expect to see UUID()-type values instead of their extended JSON description, which looks like this:
{'$binary': 'oKQrIfWuTI6JpPbPlYGYEQ==', '$type': '04'}
How can I make sure this extended JSON is actually converted to proper BSON types?
It turns out the problem is that what I thought was extended JSON actually is not; my datetime should be formatted as:
{"$date": {"$numberLong": "1597841586927"}}
instead of
{"$date": 1597841586927}
The wrong format was introduced by my data source: a Kafka Connect MongoDB source connector that does not serialize documents to proper extended JSON by default (see this stackoverflow post).

How to import comma separated JSON documents into Mongodb?

I have dozens of JSON files given to me by a colleague; they are comma-separated documents that look like the following:
{
    "_id" : ObjectId("566a8d08b9ac7b7dc2ddb90a"),
    "login_ip" : "180.173.143.x",
    "login_time" : NumberLong("1478757697373"),
    "logout_time" : NumberLong("1478757878035"),
    "role" : NumberInt("5"),
    "server_ip" : "115.28.94.x",
    "thirdPartyId" : NumberInt("-1"),
    "ver" : NumberInt("1036")
},
{
    "_id" : ObjectId("566a8d0db9ac7b7dc2ddb90b"),
    "login_ip" : "116.226.162.x",
    "login_time" : NumberLong("1456103011531"),
    "logout_time" : NumberLong("1456111567354"),
    "role" : NumberInt("10002"),
    "server_ip" : "115.28.94.x",
    "thirdPartyId" : NumberInt("6"),
    "ver" : NumberInt("1056")
},
...
I've tried to import them into my local mongodb with the mongoimport tool, but it has trouble locating the start of the second document and complains about the syntax of these files, even though the first document is parsed into the db.
+ mongoimport.exe --db eques --collection users_2017_04_25 --type json --bypassDocumentValidation --file 'E:\sample\mongdo/users_2017_04_25.json'
2017-06-12T14:01:32.029+0800 connected to: localhost
2017-06-12T14:01:32.040+0800 Failed: error processing document #2: invalid character ',' looking for beginning of value
2017-06-12T14:01:32.040+0800 imported 0 documents
PS: there are many files to be imported, so please don't suggest turning them into JSON arrays.
Please help.
The documents above are expressed in the mongo shell's BSON notation (ObjectId(), NumberLong(), NumberInt()), much of which the mongoimport tool doesn't understand.
You should export your data in standard JSON format first (in the Mongo world, it's called Strict mode), then feed it to the mongoimport tool.
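For shell-style dumps like yours, a small one-off script can do the rewrite. Here is a minimal sketch (the regexes only cover the constructors visible in your sample, and the filenames come from your command; it only builds an array in memory for parsing — the output file stays one document per line, which mongoimport accepts):
import json
import re

def shell_to_strict(text):
    # Rewrite mongo shell constructors into strict-mode/plain JSON equivalents.
    text = re.sub(r'ObjectId\("([0-9a-f]{24})"\)', r'{"$oid": "\1"}', text)
    text = re.sub(r'NumberLong\("?(-?\d+)"?\)', r'{"$numberLong": "\1"}', text)
    text = re.sub(r'NumberInt\("?(-?\d+)"?\)', r'\1', text)  # plain ints are valid JSON
    return text

with open('users_2017_04_25.json') as f:
    # Wrap the comma-separated documents in brackets just to parse them...
    docs = json.loads('[' + shell_to_strict(f.read()) + ']')

with open('users_2017_04_25.strict.json', 'w') as f:
    # ...then write one document per line for mongoimport.
    for doc in docs:
        f.write(json.dumps(doc) + '\n')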

Pig's AvroStorage LOAD removes unicode chars from input

I am using pig to read avro files and normalize/transform the data before writing back out. The avro files have records of the form:
{
    "type" : "record",
    "name" : "KeyValuePair",
    "namespace" : "org.apache.avro.mapreduce",
    "doc" : "A key/value pair",
    "fields" : [ {
        "name" : "key",
        "type" : "string",
        "doc" : "The key"
    }, {
        "name" : "value",
        "type" : {
            "type" : "map",
            "values" : "bytes"
        },
        "doc" : "The value"
    } ]
}
I have used the AvroTools command-line utility in conjunction with jq to dump the first record to JSON:
$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
    "key": "some-record-uuid",
    "value": {
        "pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
    }
}
I run the following pig commands:
REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar
DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
USING AvroLoader()
AS (key: chararray, value: map[]);
Records = FILTER AllRecords BY value#'pf_v' is not null;
SmallRecords = LIMIT Records 10;
DUMP SmallRecords;
The output of the last command above, for the corresponding record, is as follows:
...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
As you can see, the unicode chars have been removed from the pf_v value. Those characters are actually used as delimiters in these values, so I need them in order to fully parse the records into their desired normalized state. The unicode characters are clearly present in the encoded .avro file (as demonstrated by dumping the file to JSON). Is anybody aware of a way to get AvroStorage to not remove the unicode chars when loading records?
Thank you!
Update:
I have also performed the same operation using Avro's python DataFileReader:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())
for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break
reader.close()
This prints a dict with what look like hex escapes substituted for the unicode chars (which is preferable to removing them entirely):
{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}

Storing a query in Mongo

This is the case: a webshop in which I want to configure which items should be listed in the shop based on a set of parameters.
I want this to be configurable, because that allows me to experiment with different parameters and change their values easily.
I have a Product collection that I want to query based on multiple parameters.
A couple of these are found here:
within product:
"delivery" : {
"maximum_delivery_days" : 30,
"average_delivery_days" : 10,
"source" : 1,
"filling_rate" : 85,
"stock" : 0
}
but also other parameters exist.
An example of such a query, used to decide whether or not to include a product, could be:
"$or" : [
{
"delivery.stock" : 1
},
{
"$or" : [
{
"$and" : [
{
"delivery.maximum_delivery_days" : {
"$lt" : 60
}
},
{
"delivery.filling_rate" : {
"$gt" : 90
}
}
]
},
{
"$and" : [
{
"delivery.maximum_delivery_days" : {
"$lt" : 40
}
},
{
"delivery.filling_rate" : {
"$gt" : 80
}
}
]
},
{
"$and" : [
{
"delivery.delivery_days" : {
"$lt" : 25
}
},
{
"delivery.filling_rate" : {
"$gt" : 70
}
}
]
}
]
}
]
Now to make this configurable, I need to be able to handle boolean logic, parameters and values.
So I got the idea, since such a query is itself JSON, to store it in Mongo and have my Java app retrieve it.
The next step is using it in the filter (e.g. find, or whatever) and working on the corresponding selection of products.
The advantage of this approach is that I can actually analyse the data and the effectiveness of the query outside of my program.
I would store it by name in the database. E.g.
{
    "name": "query1",
    "query": { the thing printed above starting with "$or"... }
}
using:
db.queries.insert({
    "name" : "query1",
    "query": { the thing printed above starting with "$or"... }
})
Which results in:
2016-03-27T14:43:37.265+0200 E QUERY Error: field names cannot start with $ [$or]
at Error (<anonymous>)
at DBCollection._validateForStorage (src/mongo/shell/collection.js:161:19)
at DBCollection._validateForStorage (src/mongo/shell/collection.js:165:18)
at insert (src/mongo/shell/bulk_api.js:646:20)
at DBCollection.insert (src/mongo/shell/collection.js:243:18)
at (shell):1:12 at src/mongo/shell/collection.js:161
Yet I CAN STORE it using Robomongo, though not always. Obviously I am doing something wrong, but I have NO IDEA what it is.
If it fails, and I create a brand new collection and try again, it succeeds. Weird stuff that goes beyond what I can comprehend.
But when I try updating values in the "query", changes are not going through. Never. Not even sometimes.
I can however create a new object and discard the previous one. So, the workaround is there.
db.queries.update(
    { "name": "query1" },
    { "$set": {
        ... update goes here ...
    }}
)
doing this results in:
WriteResult({
    "nMatched" : 0,
    "nUpserted" : 0,
    "nModified" : 0,
    "writeError" : {
        "code" : 52,
        "errmsg" : "The dollar ($) prefixed field '$or' in 'action.$or' is not valid for storage."
    }
})
seems pretty close to the other message above.
Needless to say, I am pretty clueless about what is going on here, so I hope some of the wizards here are able to shed some light on the matter.
I think the error message contains the important info you need to consider:
QUERY Error: field names cannot start with $
Since you are trying to store a query (or part of one) in a document, you'll end up with attribute names that contain mongo operator keywords (such as $or, $ne, $gt). The mongo documentation actually references this exact scenario - emphasis added
Field names cannot contain dots (i.e. .) or null characters, and they must not start with a dollar sign (i.e. $)...
I wouldn't trust 3rd party applications such as Robomongo in these instances. I suggest debugging/testing this issue directly in the mongo shell.
My suggestion would be to store an escaped version of the query in your document so as not to interfere with reserved operator keywords. You can use JSON.stringify(my_obj) to encode your partial query into a string, and then parse/decode it when you retrieve it later: JSON.parse(escaped_query_string_from_db)
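To illustrate that approach, here is a minimal sketch with PyMongo (the collection and field names are illustrative, not from the question):
import json
from pymongo import MongoClient

db = MongoClient().shop

query = {"$or": [{"delivery.stock": 1},
                 {"delivery.maximum_delivery_days": {"$lt": 60}}]}

# Store the query as an opaque string, so no stored field name starts with '$'.
db.queries.replace_one({"name": "query1"},
                       {"name": "query1", "query": json.dumps(query)},
                       upsert=True)

# Later: read it back, decode it, and run it against the products collection.
stored = db.queries.find_one({"name": "query1"})
products = db.products.find(json.loads(stored["query"]))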
Your approach of storing the query as a JSON object in MongoDB is not viable.
You could potentially store your query logic and fields in MongoDB, but you have to have an external app build the query with the proper MongoDB syntax.
MongoDB queries contain operators, and some of those have special characters in them.
There are rules for MongoDB field names, and these rules do not allow special characters.
Look here: https://docs.mongodb.org/manual/reference/limits/#Restrictions-on-Field-Names
The probable reason you can sometimes successfully create the doc using Robomongo is because Robomongo is transforming your query into a string and properly escaping the special characters as it sends it to MongoDB.
This also explains why your attempt to update them never works. You tried to create a document, but instead created something that is a string object, so your update conditions are probably not retrieving any docs.
I see two problems with your approach.
In the following query
db.queries.insert({
    "name" : "query1",
    "query": { the thing printed above starting with "$or"... }
})
valid JSON expects key/value pairs; here in "query" you are storing an object without a key. You have two options: either store the query as text, or create another key inside the curly braces.
The second problem is that you are storing query values without wrapping them in quotes. All string values must be wrapped in quotes.
so your final document should appear as
db.queries.insert({
    "name" : "query1",
    "query": 'the thing printed above starting with "$or"... '
})
Now try, it should work.
Obviously my attempt to store a query in mongo the way I did was foolish, as became clear from the answers from both @bigdatakid and @lix. So what I finally did was this: I altered the naming of the fields to comply with the mongo requirements.
E.g. instead of $or I used _$or etc. and instead of using a . inside the name I used a #. Both of which I am replacing in my Java code.
This way I can still easily try and test the queries outside of my program. In my Java program I just change the names and use the query. Using just 2 lines of code. It simply works now. Thanks guys for the suggestions you made.
// Undo the renaming (restore the $ operators and the dots), then parse the string back into a query object.
String documentAsString = query.toJson().replaceAll("_\\$", "\\$").replaceAll("#", ".");
Object q = JSON.parse(documentAsString);