How to incorporate $lt and $divide in mongodb

I am a newbie in MongoDB, but I am trying to write a query that identifies whether any of my fields meet the requirements.
Consider the following:
I have a collection where EACH document is formatted as:
{
    "nutrition" : [
        {
            "name" : "Energy",
            "unit" : "kcal",
            "value" : 150.25,
            "_id" : ObjectId("fdsfdslkfjsdf")
        },
        {---then there's more in the array---}
    ],
    "serving" : 4,
    "id" : "Food 1"
}
My current code looks something like this:
db.recipe.find(
{"nutrition": {$elemMatch: {"name": "Energy", "unit": "kcal", "value": {$lt: 300}}}},
{"id":1, _id:0}
)
Under the nutrition array, there is an element whose name is Energy and whose value is a number. The query checks whether that value is less than 300 and outputs all the documents that meet this requirement (I made it output only the id field).
Now my question is the following:
1) For each document, I have another field called "serving" and I am supposed to find out whether "value"/"serving" is still less than 300 (as in, divide value by serving and see if it's still less than 300).
2) Since I am using .find, I am guessing I can't use the $divide operator from aggregation?
3) I've been trying to play around with aggregation operators like $divide + $cond, but no luck so far.
4) Normally in other languages, I would just create a variable a = value/serving and then run it through an if statement to check whether it's less than 300, but I am not sure if that's possible in MongoDB.
Thank you.

In case anyone was struggling with similar problem, I figured out how to do this.
db.recipe.aggregate([
{$unwind: "$nutrition"}, // breaks open the nutrition array, one document per element
{$match: {"nutrition.name": "Energy", "nutrition.unit": "kcal"}}, // keeps only the elements named Energy with unit kcal
{$project: {"Calories per serving": {$divide: ["$nutrition.value", "$serving"]}, // divide the kcal value by the number of servings
"id": 1, _id: 0}}, // show only the food id
{$match: {"Calories per serving": {$lt: 300}}} // filter out any documents whose calories per serving are 300 or more
])
So basically, you unwind the array and filter out any elements you don't want, then use $project to display the fields you need along with any math that needs to be evaluated. Finally, you apply whatever condition you had; in my case, I don't want to see any foods whose calories per serving are 300 or more.
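For what it's worth, on MongoDB 3.6 or newer a similar check can also be expressed directly in find() using $expr, which lets you use aggregation operators such as $divide inside the query. This is only a minimal sketch, assuming the simplified field names from the question ("nutrition", "value", "serving"):
db.recipe.find({
    $expr: {
        $gt: [
            {$size: {$filter: {
                input: "$nutrition",
                as: "n",
                cond: {$and: [
                    {$eq: ["$$n.name", "Energy"]},
                    {$eq: ["$$n.unit", "kcal"]},
                    {$lt: [{$divide: ["$$n.value", "$serving"]}, 300]} // value per serving under 300
                ]}
            }}},
            0 // the document matches if at least one nutrition element qualifies
        ]
    }
}, {"id": 1, "_id": 0})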

Related

MongoDB find lowest missing value

I have a collection in MongoDB that looks something like:
{
"foo": "something",
"tag": 0,
},
{
"foo": "bar",
"tag": 1,
},
{
"foo": "hello",
"tag": 0,
},
{
"foo": "world",
"tag": 3,
}
If we consider this example, there are entries in the collection with tag values 0, 1 or 3, and these aren't unique values; a tag value can be repeated. My goal is to find that 2 is missing. Is there a way to do this with a query?
Query1
The upcoming MongoDB 5.2 will have sorting on arrays, which could make this query easier without a set operation, but this will be OK also:
group and find the min, the max and the set of all tag values
take the range from the min to the max
the missing values are the setDifference of that range and the tags
from them you take only the smallest => 2
db.collection.aggregate(
[{"$group":
  {"_id":null,
   "min":{"$min":"$tag"},
   "max":{"$max":"$tag"},
   "tags":{"$addToSet":"$tag"}}},
 {"$project":
  {"_id":0,
   "missing":
    {"$min":
      {"$setDifference":
        [{"$range":["$min", "$max"]}, "$tags"]}}}}]) // integers from min up to max, minus the tags that exist
Query2
In MongoDB 5 (the current version) we can also use $setWindowFields:
sort by tag, add the dense rank (same values = same rank) and the min
then find the difference tag - min
then keep only the documents where this difference is less than the rank
find the max of those (the largest tag that is still OK)
add 1 to find the first missing one
Test it before using it to be sure; I tested it 3-4 times and it seemed OK. For a big collection, if you have many different tags, I think this one is better (the $addToSet above can cause memory problems).
db.collection.aggregate(
[{"$setWindowFields":
{"output":{"rank":{"$denseRank":{}}, "min":{"$first":"$tag"}},
"sortBy":{"tag":1}}},
{"$set":{"difference":{"$subtract":["$tag", "$min"]}}},
{"$match":{"$expr":{"$lt":["$difference", "$rank"]}}},
{"$group":{"_id":null, "last":{"$max":"$tag"}}},
{"$project":{"_id":0, "missing":{"$add":["$last", 1]}}}])

Does Indexing small arrays of subdocuments in Mongodb affect performance?

My Mongodb collection has this document structure:
{
_id: 1,
my_dict: {
my_key: [
{id: x, other_fields: other_values},
...
]
},
...
},
I need to update the array subdocuments very often, so an index on the id field seems like a good idea. Still, I have many documents (millions) but the arrays inside them are small (max ~20 elements). Would indexing still improve performance a lot, compared to the cost of maintaining the index?
PS: I'm not using the id as a key (which would mean a dict instead of an array), as I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count as I am using MongoDB 3.2.
Followup question: If it would make a very big difference, I could instead use a dict like so:
{id: {others_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and update it myself (with possible errors, as I would need to use $inc each time I add/delete an item) instead of relying on "real" values. I would also have to handle the possibility that a key could be called _my_size, which would conflict with my logic. It would then look like this:
{
_id: 1,
my_dict: {
my_key: {
id: {other_fields: other_values},
_my_size: 1
},
},
},
Still not sure which is best for performance. I will need to update the subdocument (with the id field) a lot, as well as compute the $size a lot (maybe 1/10 as often as the updates).
Which schema/strategy would give me better performance? Or, even more important, would it actually make a big difference? (possibly thousands of calls per second)
Update example:
db.collection.update(
    {_id: 1, "my_dict.my_key.id": update_data_id},
    {$set: {"my_dict.my_key.$": update_data}} // positional operator to update only the matched subdocument
)
Getting the size example:
db.collection.aggregate([
    {$match: {_id: 1}},
    {$project: {_id: 0, nb_of_sub_documents: {$size: "$my_dict.my_key"}}}
])
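For reference, the index being considered above would be a multikey index on the embedded id field; a minimal sketch, assuming the field names from the question:
db.collection.createIndex({"my_dict.my_key.id": 1})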

What does $sum:1 mean in Mongo

I have a collection foo:
{ "_id" : ObjectId("5837199bcabfd020514c0bae"), "x" : 1 }
{ "_id" : ObjectId("583719a1cabfd020514c0baf"), "x" : 3 }
{ "_id" : ObjectId("583719a6cabfd020514c0bb0") }
I use this query:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:1}}})
Then I get a result:
{ "_id" : 1, "avg" : 2, "sum" : 3 }
What does {$sum:1} mean in this query?
From the official docs:
When used in the $group stage, $sum has the following syntax and returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key:
{ $sum: < expression > }
Since in your example the expression is 1, it will aggregate a value of one for each document in the group, thus yielding the total number of documents per group.
Basically it will add up the value of the expression for each document. In this case, since there are 3 documents in the group, it will be 1 + 1 + 1 = 3. For more details please check the MongoDB documentation: https://docs.mongodb.com/v3.2/reference/operator/aggregation/sum/
For example if the query was:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:"$x"}}})
then the sum value would be 1 + 3 = 4 (the third document has no x field, so it contributes nothing to the sum).
I'm not sure what MongoDB version existed 6 years ago or whether it had all these goodies, but it seems to stand to reason that {$sum:1} is nothing but a hack for {$count:{}}.
In fact, $sum here is more expensive than $count, as it is performed as an extra step, whereas $count is closer to the engine. And even if you don't put much stock in performance, think of why you're even asking: because that is a less-than-obvious hack.
My option would be:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$count:{}}}})
I just tried this on Mongo 5.0.14 and it runs fine.
The good old "Just because you can, doesn't mean you should." is still a thing, no?

mongo db design for fast queries on ranges

Currently, I have a mongoDb collection with documents of the following type (~1000 docs):
{ _id : "...",
id : "0000001",
gaps : [{start:100, end:110}, {start:132, end:166}...], // up to ~1k elems
bounds: [68, 88, 126, 228...], // up to 100 elems
length: 300,
seq : "ACCGACCCGAGGACCCCTGAGATG..."
}
"gaps" and "bounds" values in the array refer to coordinates in "seq" and "length" refers to "seq" length.
I have defined indexes on "id", "gaps" and "bounds".
I need to query based on coordinate ranges. For example, given "from=100" and "to=200", I want to retrieve for each document the sub-array of "gaps", the sub-array of "bounds" and the "seq" substring that lie inside the range (between 100 and 200 in this case).
Right now, this is done using the following aggregation pipeline:
db.annot.aggregate([
    {$match : {"id" : "000001"}},
    {$unwind : "$gaps"},
    {$unwind : "$bounds"},
    {$match: {
        $or : [
            {"gaps.start" : {$gte: from, $lte: to}},
            {"gaps.end" : {$gte: from, $lte: to}},
            {"bounds" : {$gte: from, $lte: to}}
        ]
    }},
    {$project: {
        id: 1,
        gaps: 1,
        bounds: 1,
        subseq: {$substr: ["$seq", from, (to - from)]}}},
    {$group : {
        _id : "$id",
        gaps : {"$addToSet" : "$gaps"},
        bounds : {"$addToSet" : "$bounds"},
        subseq : {"$first" : "$subseq"}}}
])
Querying the whole collection (leaving out the first "$match" in the pipeline) takes ~14 seconds.
Querying individually all the documents sequentially takes at most 50msec each (~19 secs in total).
Querying individually all the documents in parallel takes in total ~7 secs.
Querying with a pipeline that only matches the id (ie, using the first "$match" in the pipeline) takes ~5 secs in total.
What would be the best db and query design to maximize the performance of this kind of query?
What would be the best db and query design to maximize the performance of this kind of query?
Since you ask about improving your code and design, I suggest you switch to the latest version of MongoDB if you have not already. That should be a good starting point. For this type of problem, the basic idea is to reduce the number of documents being fed into each pipeline stage.
I suggest keeping an additional variable named range which contains all the numbers between from and to, inclusive of both. This allows us to apply operators like $setIntersection on the bounds array.
So the variables the aggregate operation needs from the environment are:
var from = x; var to = y; var range=[x,...,y];
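A quick way to build that range array in the mongo shell, using the example values from the question (just a sketch; any equivalent loop works):
var from = 100, to = 200;
var range = [];
for (var i = from; i <= to; i++) { range.push(i); } // all integers from "from" to "to", inclusive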
The first step is to match the documents whose id, gaps subdocuments and bounds values fall in our range. This reduces the number of documents being fed to the next stage to, say, 500.
The next step is to $redact the non-conforming gaps subdocuments. This stage now works only on the 500 documents filtered in the previous step.
The third step is to $project the fields as we need them.
Notice that we did not need to use the $unwind operator anywhere to achieve the task.
db.collection.aggregate([
{$match : {"id" : "0000001",
"gaps.start":{$gte:from},
"gaps.end":{$lte:to},
"bounds" : {$gte:from, $lte:to}}},
{$redact:{
$cond:[
{$and:[
{$gte:[{$ifNull: ["$start", from]},from]},
{$lte:[{$ifNull: ["$end", to]},to]}
]},"$$DESCEND","$$PRUNE"
]
}},
{$project: {
"bounds":{$setIntersection:[range,"$bounds"]},
"id":1,
"gaps":1,
"length":1,
"subseq":{$substr:["$seq", from, (to-from)]}}}
])

Count fields in a MongoDB Collection

I have a collection of documents like this one:
{
"_id" : ObjectId("..."),
"field1": "some string",
"field2": "another string",
"field3": 123
}
I'd like to be able to iterate over the entire collection and find the total number of fields there are. In this example document there are 3 (I don't want to include _id), but it ranges from 2 to 50 fields in a document. Ultimately, I'm just looking for the average number of fields per document.
Any ideas?
Iterate over the entire collection and find the total number of fields there are
Now you can utilise the aggregation operator $objectToArray (SERVER-23310) to turn keys into values and count them. This operator is available in MongoDB v3.4.4+.
For example:
db.collection.aggregate([
{"$project":{"numFields":{"$size":{"$objectToArray":"$$ROOT"}}}},
{"$group":{"_id":null, "fields":{"$sum":"$numFields"}, "docs":{"$sum":1}}},
{"$project":{"total":{"$subtract":["$fields", "$docs"]}, _id:0}}
])
The first $project stage turns all keys into an array so the fields can be counted. The $group stage sums the number of keys/fields in the collection, as well as the number of documents processed. The final $project stage subtracts the total number of documents from the total number of fields (since you don't want to count _id).
You can easily add $avg to compute the average in the last stage.
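For instance, a minimal sketch of a variant that returns the average number of fields per document (excluding _id), assuming the same collection:
db.collection.aggregate([
    {"$project":{"numFields":{"$subtract":[{"$size":{"$objectToArray":"$$ROOT"}}, 1]}}}, // per-document field count, minus _id
    {"$group":{"_id":null, "avgFields":{"$avg":"$numFields"}}} // average across all documents
])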
PRIMARY> var count = 0;
PRIMARY> db.my_table.find().forEach( function(d) { for(f in d) { count++; } });
PRIMARY> count
1074942
This is the simplest way I could figure out how to do this. On really large datasets, it probably makes sense to go the Map-Reduce path. But while your data set is small enough, this'll do.
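If you're after the average per document (which the question ultimately asks for), you could divide this total by the number of documents, e.g.:
PRIMARY> count / db.my_table.count()
Note that the loop above also counts _id, so subtract one per document if you want to exclude it.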
This is O(n^2), but I'm not sure there is a better way.
You could create a Map-Reduce job. In the map step, iterate over the properties of each document as a JavaScript object, emit the count, and reduce to get the total.
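A minimal sketch of what that might look like, assuming a collection named collection and excluding _id from the count:
db.collection.mapReduce(
    function () {
        var count = 0;
        for (var key in this) { if (key !== "_id") { count++; } } // count this document's fields
        emit("totalFields", count);
    },
    function (key, values) { return Array.sum(values); }, // sum the per-document counts
    { out: { inline: 1 } }
);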
For a simple way, just find() all the documents and, for each returned record, count its keys.
db.getCollection("<collection>").find(<condition>)
Then, for each document in the result set, get the number of keys, e.g. Object.keys(doc).length.