How do bson arrays compare (in mongodb/pymongo)? - mongodb

I would like to store some very large integers exactly (several thousand decimal digits) in MongoDB. This of course will not work with the standard types supported by BSON, and I am trying to think of the most elegant workaround, considering that I would like to perform range searches and similar operations. This requirement rules out storing the integers as strings, since that makes range searches impractical.
One way I can think of is to encode the 2^32-expansion using (variable-length) arrays of standard ints, and add to this array a first entry for the length of the array itself. That way lexicographical ordering on these arrays corresponds to the usual ordering of arbitrarily large integers.
For instance, in a collection I could have the 5 documents
{"name": "me", "fortune": [1,1000]}
{"name": "scrooge mcduck", "fortune": [11,1,0,0,0,0,0,0,0,0,0,0]}
{"name": "bruce wayne","fortune": [2, 10,0]}
{"name": "bill gates", "fortune": [2,1,1000]}
{"name": "francis", "fortune": [0]}
Thus Bruce Wayne's net worth is 10*2^32, Bill Gates' 2^32+1000 and Scrooge McDuck's 2^320.
I can then do a sort using {"fortune":1} and on my machine (with pymongo) it returns them in the order francis < me < bill < bruce < scrooge, as expected.
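For concreteness, here is a small sketch (Python, since I am using pymongo) of the encoding just described; the helper name is mine:
def to_fortune(n):
    """Encode a nonnegative int as [limb count, most significant base-2^32 digit, ...]."""
    limbs = []
    while n:
        limbs.append(n % 2**32)  # least significant base-2^32 digit first
        n //= 2**32
    limbs.reverse()              # most significant digit first
    return [len(limbs)] + limbs  # prefix with the length so longer numbers sort after shorter ones

assert to_fortune(1000) == [1, 1000]
assert to_fortune(10 * 2**32) == [2, 10, 0]
assert to_fortune(2**32 + 1000) == [2, 1, 1000]
assert to_fortune(0) == [0]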
However, I am making assumptions about the way BSON arrays compare that I haven't seen documented anywhere, and the range searches don't seem to work the way I think. For instance,
find({"fortune":{$gte:[2,5,0]}})
returns no documents, but I would expect bruce and scrooge.
Can anyone help me? Thanks

You can instead store left-padded strings that represent the exact integer value of the fortune.
eg. "1000000" = 1 million
"0010000" = 10 thousand
"2000000" = 2 million
"0200000" = 2 hundred thousand
Left-padding with zeroes ensures that lexicographical comparison of these strings corresponds directly to their comparison as numeric values. You will have to assume a safe MAXIMUM possible value of the fortune here, say a 20-digit number, and pad with 0s accordingly.
So sample documents would be:
{"name": "scrooge mcduck", "fortune": "00001100000000000000" }
{"name": "bruce wayne", "fortune": "00000200000000000000" }
querying:
> db.test123.find()
{ "_id" : ObjectId("4f87e142f1573cffecd0f65e"), "name" : "bruce wayne", "fortune" : "00000200000000000000" }
{ "_id" : ObjectId("4f87e150f1573cffecd0f65f"), "name" : "donald", "fortune" : "00000150000000000000" }
{ "_id" : ObjectId("4f87e160f1573cffecd0f660"), "name" : "mickey", "fortune" : "00000000000000100000" }
> db.test123.find({ "fortune" : {$gte: "00000200000000000000"}});
{ "_id" : ObjectId("4f87e142f1573cffecd0f65e"), "name" : "bruce wayne", "fortune" : "00000200000000000000" }
> db.test123.find({ "fortune" : {$lt: "00000200000000000000"}});
{ "_id" : ObjectId("4f87e150f1573cffecd0f65f"), "name" : "donald", "fortune" : "00000150000000000000" }
{ "_id" : ObjectId("4f87e160f1573cffecd0f660"), "name" : "mickey", "fortune" : "00000000000000100000" }
The querying / sorting will work naturally, as MongoDB compares strings lexicographically.
However, to do other numeric operations on your data, you will have to write custom logic in your data-processing script (PHP, Python, Ruby, etc.).
For querying and data storage, this string version should do fine.
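For example, a minimal pymongo sketch of that custom logic (pad on write, range-query, decode on read) might look like the following; the 20-digit width and the database/collection names are assumptions:
from pymongo import MongoClient

WIDTH = 20  # assumed safe maximum number of digits

def encode_fortune(n):
    """Left-pad a nonnegative integer so string comparison matches numeric comparison."""
    s = str(n)
    if n < 0 or len(s) > WIDTH:
        raise ValueError("fortune out of supported range")
    return s.zfill(WIDTH)

def decode_fortune(s):
    return int(s)

coll = MongoClient().test.test123  # assumed database/collection names
coll.insert_one({"name": "bruce wayne", "fortune": encode_fortune(2 * 10**14)})

# range search: everyone worth at least 2 * 10^14
for doc in coll.find({"fortune": {"$gte": encode_fortune(2 * 10**14)}}):
    print(doc["name"], decode_fortune(doc["fortune"]))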

Unfortunately your assumption about array comparison is incorrect. Range queries that, for example, query for all array values smaller than 3 ({array:{$lt:3}}) will return all arrays where at least one element is less than three, regardless of the element's position. As such your approach will not work.
What does work, but is a bit less obvious, is using binary blobs for your very large integers, since those are byte-order compared. That requires you to set an upper bit limit for your integers, but that should be fairly straightforward. You can test it in the shell using the BinData(subType, base64) notation:
db.col.find({fortune:{$gt:BinData(0, "e8MEnzZoFyMmD7WSHdNrFJyEk8M=")}})
So all you'd have to do is create methods to convert your big integers from, say, strings to two's-complement binary and you're set. Good luck.
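A minimal pymongo sketch of that conversion, assuming nonnegative integers and a fixed byte width (fixed width matters because, as far as I know, BSON binaries compare by length before comparing bytes); the byte limit and the database/collection names are assumptions:
from bson.binary import Binary
from pymongo import MongoClient

NUM_BYTES = 512  # assumed upper limit: 512 bytes = 4096 bits

def encode_bigint(n):
    """Fixed-width big-endian bytes, so byte-order comparison matches numeric order."""
    return Binary(n.to_bytes(NUM_BYTES, "big"), 0)

def decode_bigint(b):
    return int.from_bytes(bytes(b), "big")

coll = MongoClient().test.col  # assumed database/collection names
coll.insert_one({"name": "scrooge mcduck", "fortune": encode_bigint(2**320)})

# all fortunes strictly greater than 2^32 + 1000
for doc in coll.find({"fortune": {"$gt": encode_bigint(2**32 + 1000)}}):
    print(doc["name"], decode_bigint(doc["fortune"]))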

Related

$reduce with $regex in $group aggregation so that length can be displayed

I think I have a pretty complex one here - not sure if I can do this or not.
I have data that has an address and a data field. The data field is a hex value. I would like to run an aggregation that groups the data by address and then by the length of the hex data. All of the data will come in as 16 characters long, but the length of that data should be calculated in bytes.
I think I have to take the data, strip the trailing 00s (using the regex 00+$), and divide the remaining length by 2 to get the byte length. After that, I would group by address and final byte length.
An example dataset would be:
{addr:829, data:'4100004822000000'}
{addr:829, data:'4100004813000000'}
{addr:829, data:'4100004804000000'}
{addr:506, data:'0000108000000005'}
{addr:506, data:'0000108000000032'}
{addr:229, data:'0065005500000000'}
And my desired output would be:
{addr:829, length:5}
{addr:506, length:8}
{addr:229, length:4}
Is this even possible in an aggregation query without having to use external code?
This is not too complicated if your "data" is in fact strings as you show in your sample data. Assuming data exists and is set to something (you can add error checking as needed) you can get the result you want like this:
db.coll.aggregate([
  {$addFields: {lastNonZero: {$add: [2, {$reduce: {
    initialValue: -2,
    input: {$range: [0, {$strLenCP: "$data"}, 2]},
    in: {$cond: {
      if: {$eq: ["00", {$substr: ["$data", "$$this", 2]}]},
      then: "$$value",
      else: "$$this"
    }}
  }}]}}},
  {$group: {_id: {
    addr: "$addr",
    length: {$divide: ["$lastNonZero", 2]}
  }}}
])
I used two stages but of course they could be combined into a single $group if you wish. Here in $reduce I step through data two characters at a time, checking whether they are equal to "00". Every time they are not, I update the value to where I am in the sequence. Since that leaves the position of the last non-"00" pair, we add 2 to it to find where the run of zeros that continues to the end starts, and later in $group we divide that by 2 to get the true length in bytes.
On your sample data, this returns:
{ "_id" : { "addr" : 229, "length" : 4 } }
{ "_id" : { "addr" : 506, "length" : 8 } }
{ "_id" : { "addr" : 829, "length" : 5 } }
You can add a $project stage to transform the field names into ones you want returned.
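If you are running this from Python, the same pipeline plus that renaming $project might look like the following sketch; the database and collection names are assumptions:
from pymongo import MongoClient

coll = MongoClient().test.coll  # assumed database/collection names

pipeline = [
    # position just past the last non-"00" pair, found by scanning 2 characters at a time
    {"$addFields": {"lastNonZero": {"$add": [2, {"$reduce": {
        "initialValue": -2,
        "input": {"$range": [0, {"$strLenCP": "$data"}, 2]},
        "in": {"$cond": {
            "if": {"$eq": ["00", {"$substr": ["$data", "$$this", 2]}]},
            "then": "$$value",
            "else": "$$this",
        }},
    }}]}}},
    # group by address and byte length (2 hex characters per byte)
    {"$group": {"_id": {"addr": "$addr", "length": {"$divide": ["$lastNonZero", 2]}}}},
    # reshape into the {addr, length} form from the desired output
    {"$project": {"_id": 0, "addr": "$_id.addr", "length": "$_id.length"}},
]

for doc in coll.aggregate(pipeline):
    print(doc)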

How to incorporate $lt and $divide in mongodb

I am a newbie in MongoDB, but I am trying to write a query to identify if any of my fields meet the requirements.
Consider the following:
I have a collection where EACH document is formatted as:
{
  "nutrition" : [
    {
      "name" : "Energy",
      "unit" : "kcal",
      "value" : 150.25,
      "_id" : ObjectId("fdsfdslkfjsdf")
    },
    { ---then there's more in the array--- }
  ],
  "serving" : 4,
  "id" : "Food 1"
}
My current code looks something like this:
db.recipe.find(
{"nutrition": {$elemMatch: {"name": "Energy", "unit": "kcal", "value": {$lt: 300}}}},
{"id":1, _id:0}
)
Under the nutrition array, there's an element whose name is "Energy" and whose value is a number. The query checks whether that value is less than 300 and outputs all the documents that meet this requirement (I made it output only the field called id).
Now my question is the following:
1) For each document, I have another field called "serving" and I am supposed to find out if "value"/"serving" is still less than 300. (As in divide value by serving and see if it's still less than 300)
2) Since I am using .find, I am guessing I can't use $divide operator from aggregation?
3) I have been trying to play around with aggregation operators like $divide + $cond, but no luck so far.
4) Normally in other languages, I would just create a variable a = value/serving then run it through an if statement to check if it's less than 300 but I am not sure if that's possible in MongoDB
Thank you.
In case anyone was struggling with a similar problem, I figured out how to do this.
db.recipe.aggregate([
  {$unwind: "$nutrition"}, // breaks open the nutrition array: one document per element
  {$match: {"nutrition.name": "Energy", "nutrition.unit": "kcal"}}, // keep only the Energy/kcal entries
  {$project: {
    "Calories per serving": {$divide: ["$nutrition.value", "$serving"]}, // calories divided by servings
    "id": 1, _id: 0 // show only the food id and the computed value
  }},
  {$match: {"Calories per serving": {$lt: 300}}} // drop anything with 300 or more calories per serving
])
So basically, you open up the array, filter out any sub-fields you don't want in the document, then display what you need with $project along with any math evaluation that needs to be done. Then you filter on your condition, which for me was that I don't want to see any foods with more than 300 calories per serving.
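For reference, the same pipeline expressed through pymongo might look like this sketch; it assumes the recipe collection and the serving field from the sample document, and the database name is made up:
from pymongo import MongoClient

recipes = MongoClient().test.recipe  # assumed database name

pipeline = [
    {"$unwind": "$nutrition"},  # one document per nutrition entry
    {"$match": {"nutrition.name": "Energy", "nutrition.unit": "kcal"}},
    {"$project": {
        "_id": 0,
        "id": 1,
        "calories_per_serving": {"$divide": ["$nutrition.value", "$serving"]},
    }},
    {"$match": {"calories_per_serving": {"$lt": 300}}},
]

for doc in recipes.aggregate(pipeline):
    print(doc)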

MongoDB: Should You Pre-Allocate a Document if Using $addToSet or $push?

I've been studying up on MongoDB and I understand that it is highly recommended that document structures be completely built out (pre-allocated) at the point of insert, so that future changes to that document do not require the document to be moved around on disk. Does this apply when using $addToSet or $push?
For example, say I have the following document:
"_id" : "rsMH4GxtduZZfxQrC",
"createdAt" : ISODate("2015-03-01T12:08:23.007Z"),
"market" : "LTC_CNY",
"type" : "recentTrades",
"data" : [
{
"date" : "1422168530",
"price" : 13.8,
"amount" : 0.203,
"tid" : "2435402",
"type" : "buy"
},
{
"date" : "1422168529",
"price" : 13.8,
"amount" : 0.594,
"tid" : "2435401",
"type" : "buy"
},
{
"date" : "1422168529",
"price" : 13.79,
"amount" : 0.594,
"tid" : "2435400",
"type" : "buy"
}
]
}
And I am using one of the following commands to add a new array of objects (newData) to the data field:
$addToSet to add to the end of the array:
Collection.update(
{ _id: 'rsMH4GxtduZZfxQrC' },
{
$addToSet: {
data: {
$each: newData
}
}
}
);
$push (with $position) to add to the front of the array:
Collection.update(
{ _id: 'rsMH4GxtduZZfxQrC' },
{
$push: {
data: {
$each: newData,
$position: 0
}
}
}
);
The data array in the document will grow due to new objects that were added from newData. So will this type of document update cause the document to be moved around on the disk?
For this particular system, the data array in these documents can grow to upwards of 75k objects, so if these documents are indeed being moved around on disk after every $addToSet or $push update, should the document be defined with 75k nulls (data: [null,null...null]) on insert, and then perhaps use $set to replace the values over time? Thanks!
I understand that it is highly recommended that documents structures are completely built-out (pre-allocated) at the point of insert, this way future changes to that document do not require the document to be moved around on the disk. Does this apply when using $addToSet or $push?
It's recommended if it's feasible for the use case, which it usually isn't. Time series data is a notable exception. It doesn't really apply with $addToSet and $push because they tend to increase the size of the document by growing an array.
the data array in these documents can grow to upwards of 75k objects within
Stop. Are you sure you want constantly growing arrays with tens of thousands of entries? Are you going to query wanting specific entries back? Are you going to index any fields in the array entries? You probably want to rethink your document structure. Maybe you want each data entry to be a separate document with fields like market, type, createdAt replicated in each? You wouldn't be worrying about document moves.
Why will the array grow to 75K entries? Can you do fewer entries per document? Is this time series data? It's great to be able to preallocate documents and do in-place updates with the MMAP storage engine, but it's not feasible for every use case and it's not a requirement for MongoDB to perform well.
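A sketch of what that restructuring might look like from pymongo; the collection and field names here are only illustrative:
from pymongo import MongoClient
from datetime import datetime, timezone

trades = MongoClient().test.trades  # hypothetical collection: one document per data entry

trades.insert_one({
    "market": "LTC_CNY",          # replicated from the old parent document
    "type": "recentTrades",       # replicated from the old parent document
    "createdAt": datetime.now(timezone.utc),
    "date": "1422168530",
    "price": 13.8,
    "amount": 0.203,
    "tid": "2435402",
    "side": "buy",                # renamed from the entry-level "type" to avoid clashing with the document-level "type"
})

# query specific entries without dragging a 75k-element array around
trades.create_index([("market", 1), ("tid", 1)])
recent = trades.find({"market": "LTC_CNY"}).sort("tid", -1).limit(100)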
should the document be defined with 75k nulls (data: [null,null...null]) on insert, and then perhaps use $set to replace the values over time?
No, this is not really helpful. The document size will be computed based on the BSON size of the null values in the array, so when you replace null with another type the size will increase and you'll get document rewrites anyway. You would need to preallocate the array with objects with all fields set to a default value for its type, e.g.
{
    "date" : ISODate("1970-01-01T00:00:00Z"), // use a date type instead of a string date
    "price" : 0,
    "amount" : 0,
    "tid" : "0000000", // assuming a 7 character code - strings are icky for default preallocation
    "type" : "none" // assuming it's "buy" or "sell"; the default should be as long as the longest real value
}
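If you do go the preallocation route, a pymongo sketch of the idea might be the following; the slot count, the "next" write-cursor field, and the collection name are illustrative only:
from datetime import datetime, timezone
from pymongo import MongoClient

coll = MongoClient().test.trades_preallocated  # hypothetical collection

SLOTS = 75000  # expected maximum number of entries

placeholder = {
    "date": datetime(1970, 1, 1, tzinfo=timezone.utc),
    "price": 0,
    "amount": 0,
    "tid": "0000000",
    "type": "none",
}

# insert the document at (roughly) its final size up front
coll.insert_one({"_id": "rsMH4GxtduZZfxQrC", "data": [placeholder] * SLOTS, "next": 0})

# later, overwrite slots in place instead of growing the array ("next" is a hypothetical write cursor)
coll.update_one(
    {"_id": "rsMH4GxtduZZfxQrC"},
    {"$set": {"data.0": {"date": datetime(2015, 1, 25, tzinfo=timezone.utc),
                         "price": 13.8, "amount": 0.203, "tid": "2435402", "type": "buy"},
              "next": 1}},
)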
MongoDB uses a power-of-two allocation strategy to store your documents, which means each record is placed in a slot whose size is the document size rounded up to the next power of two. Therefore, if your nested arrays don't grow the document beyond that allocated power-of-two size, MongoDB will not have to move the document.
See: http://docs.mongodb.org/manual/core/storage/
Bottom line here is that any "document growth" is pretty much always going to result in the "physical move" of the storage allocation unless you have "pre-allocated" by some means on the original document submission. Yes there is "power of two" allocation, but this does not always mean anything valid to your storage case.
The additional "catch" here is on "capped collections", where indeed the "hidden catch" is that such "pre-allocation" methods are likely not to be "replicated" to other members in a replica set if those instructions fall outside of the "oplog" period where the replica set entries are applied.
Growing any structure beyond what is allocated from an "initial allocation" or the general tricks that can be applied will result in that document being "moved" in storage space when it grows beyond the space it was originally supplied with.
In order to ensure this does not happen, you always "pre-allocate" to the expected provisions of your data on the original creation, with the obvious caveat of the condition already described.

How powerful is the query language of MongoDB and What is the Cap for Arrays?

OK, let's first look at a screenshot. This is a screenshot of a text file we call a VCF file. How many rows might it have? Maybe 100,000 rows of things like this:
I am totally new and a novice to MongoDB, so I thought of a schema like this:
So, for example, notice that REF in that text file is a key/value pair in my schema. But like I said, it might have 200,000 rows...
So:
Are arrays still a good thing to use here, storing 200,000 members in one array?
How powerful are the queries I can run on it? For example, in the text file we have rows where #CHROM 20 at POS 14370 has a REF of "G" and an ALT of "A"; with my schema, can we find and return it? Say we search for patients that have "G" in their REF field: are MongoDB queries powerful enough to search for and return such a result?
Is it a bad schema? Do you have better recommendations/advice?
Any sample query you could give for the question above would be very helpful to give me some ideas.
Sorry for the very slow reply, I had left for holiday when you replied. The following syntax achieves the desired outcome.
> db.refs.insert({ref:['A','T','ATC','G']})
> db.refs.findOne()
{
"_id" : ObjectId("4fda21bb8a807d87a65aba37"),
"ref" : [
"A",
"T",
"ATC",
"G"
]
}
> db.refs.insert({ref:['TCG','TA']})
> db.refs.find()
{ "_id" : ObjectId("4fda21bb8a807d87a65aba37"), "ref" : [ "A", "T", "ATC", "G" ] }
{ "_id" : ObjectId("4fda22438a807d87a65aba38"), "ref" : [ "TCG", "TA" ] }
> db.refs.find({ref :{$all : ['G']}})
{ "_id" : ObjectId("4fda21bb8a807d87a65aba37"), "ref" : [ "A", "T", "ATC", "G" ] }
Is this what you had in mind?
A big concern in schema design is avoiding the 16MB document limit. While you can have as many documents as can be addressed with a 64-bit address space, I don't know how your documents are likely to grow. This restriction may necessitate that you break some of the fields out into other documents that you reference.
Let's say we say search for patients that have "G" in their REF field
Does ref:[TCG,TA] count or only ref:[A,T,ATC,G] ?
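For what it's worth, a per-row layout (rather than one huge array) might look like this from pymongo; the database, collection, and field names simply mirror the VCF columns and are only illustrative:
from pymongo import MongoClient

variants = MongoClient().vcf.variants  # hypothetical database/collection

# one document per VCF row keeps each document tiny and far below the 16MB limit
variants.insert_many([
    {"chrom": "20", "pos": 14370, "ref": "G", "alt": "A", "patient": "p1"},
    {"chrom": "20", "pos": 17330, "ref": "T", "alt": "A", "patient": "p1"},
])

variants.create_index([("patient", 1), ("ref", 1)])

# patients that have "G" in their REF field
for doc in variants.find({"ref": "G"}):
    print(doc["patient"], doc["chrom"], doc["pos"])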

Search for a record where a value is between two item fields in MongoDB

I have a MongoDB collection with over 5 million items. Each item has "start" and "end" fields containing integer values.
Items don't have overlapping starts and ends.
e.g. this would be invalid:
{start:100, end:200}
{start:150, end:250}
I am trying to locate an item where a given value is between start and end
start <= VALUE <= end
The following query works, but it takes 5 to 15 seconds to return
db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1);
I've added the following indexes for testing with very little improvement
db.blocks.ensureIndex({start:1});
db.blocks.ensureIndex({end:1});
//also a compounded one
db.blocks.ensureIndex({start:1,end:1});
** Edit **
The result of explain() on the query results in:
> db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1).explain();
{
"cursor" : "BtreeCursor end_1",
"nscanned" : 1160982,
"nscannedObjects" : 1160982,
"n" : 0,
"millis" : 5779,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"end" : [
[
3232235521,
1.7976931348623157e+308
]
]
}
}
What would be the best approach to speeding this specific query up?
Actually I'm working on a similar problem and my friend found a nice way to solve it.
If you don't have overlapping data, you can do this:
query using the start field and a descending sort
validate with the end field
for example you can do
var x = 100;
// fetch the single candidate: the range that starts closest below (or at) x
var results = db.collection.find({start: {$lte: x}}).sort({start: -1}).limit(1).toArray();
if (results.length > 0) {
    var result = results[0];
    if (result.end > x) {
        return result;
    } else {
        return null; // no range contains x
    }
}
If you are sure that there will always be a range containing x, then you do not have to validate the result.
By using this piece of code, you only have to index by either the start or the end field, and your query becomes a lot faster.
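In pymongo the same lookup is essentially a one-liner; a sketch assuming the blocks collection from the question and a made-up database name, using the inclusive bounds (start <= value <= end) stated there:
from pymongo import MongoClient, DESCENDING

blocks = MongoClient().test.blocks  # assumed database name

def find_block(value):
    """Return the block whose [start, end] range contains value, or None."""
    candidate = blocks.find_one({"start": {"$lte": value}}, sort=[("start", DESCENDING)])
    if candidate is not None and candidate["end"] >= value:
        return candidate
    return None

print(find_block(3232235521))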
--- edit
I did some benchmarking: using the composite index takes 100-100,000 ms per query; on the other hand, using a single index takes 1-5 ms per query.
I guess a compound index should work faster for you:
db.blocks.ensureIndex({start:1, end:1});
You can also use explain() to see the number of scanned objects, etc., and choose the best index.
Also, if you are using MongoDB < 2.0 you need to upgrade to 2.0+, because indexes work faster there.
You can also limit results to optimize the query.
This might help: how about introducing some redundancy? If there is not a big variance in the lengths of the intervals, then you can introduce a tag field for each record. This tag field is a single value or string that represents a large interval - say, tag 50,000 is used to tag all records with intervals that are at least partially in the range 0-50,000, tag 100,000 is for all intervals in the range 50,000-100,000, and so on. Now you can index on the tag as primary and one of the endpoints of the record's range as secondary.
Records on the edge of a big interval would have more than one tag - so we are talking multikey indexes. In your query you would of course calculate the big-interval tag and use it in the query.
You would roughly want SQRT of total records per tag - just a starting point for tests; then you can fine-tune the big-interval size.
Of course this would make writes a bit slower.
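A rough pymongo sketch of that tagging idea, assuming a bucket width of 50,000 and intervals that never span more than a few buckets; the bucket numbering scheme, collection name, and helper names are all made up for illustration:
from pymongo import MongoClient

BUCKET = 50000
blocks = MongoClient().test.blocks_tagged  # hypothetical collection

def tags_for(start, end):
    """Every bucket index the interval [start, end] touches."""
    return list(range(start // BUCKET, end // BUCKET + 1))

def insert_block(start, end):
    blocks.insert_one({"start": start, "end": end, "tags": tags_for(start, end)})

blocks.create_index([("tags", 1), ("start", 1)])  # multikey index on the tags array

def find_block(value):
    # the tag narrows the search to roughly one bucket's worth of records
    return blocks.find_one({
        "tags": value // BUCKET,
        "start": {"$lte": value},
        "end": {"$gte": value},
    })

insert_block(100, 200)
print(find_block(150))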