MongoDB sort using a custom function - mongodb

Let's say I have a collection that looks like:
{
  _id: 'aaaaaaaaaaaaaaaaaaaaaaaaa',
  score: 10,
  hours: 50
},
{
  _id: 'aaaaaaaaaaaaaaaaaaaaaaaab',
  score: 5,
  hours: 55
},
{
  _id: 'aaaaaaaaaaaaaaaaaaaaaaaac',
  score: 15,
  hours: 60
}
I want to sort this list by a custom order, namely
value = (score - 1) / (T + 2) ^ G
score: score
T: current_hours - hours
G: some constant
How do I do this? I assume this is going to require writing a custom sorting function that compares the score and hours fields, takes current_hours as an input, performs that comparison, and returns the sorted list. Note that hours and current_hours are simply the number of hours that have elapsed since some arbitrary starting point. So if I'm running this query 80 hours after the application started, current_hours takes the value of 80.
Creating an additional field value and keeping it constantly updated is probably too expensive for millions of documents.
I know that if this is possible, this is going to look something like
db.items.aggregate([
  { "$project" : {
      "_id" : 1,
      "score" : 1,
      "hours" : 1,
      "value" : { SOMETHING HERE, ALSO REQUIRES PASSING current_hours }
  }},
  { "$sort" : { "value" : 1 } }
])
but I don't know what goes into value

I think value will look something like this:
"value": {
$let: {
vars: {
score: "$score",
t: {
"$subtract": [
80,
"$hours"
]
},
g: 3
},
in: {
"$divide": [
{
"$subtract": [
"$$score",
1
]
},
{
"$pow": [
{
"$add": [
"$$t",
2
]
},
"$$g"
]
}
]
}
}
}
Playground example here
Although it's verbose, it should be reasonably straightforward to follow. It uses the arithmetic expression operators to build the calculation that you are requesting. A few specific notes:
We use $let here to set some vars for usage. This includes the "runtime" value for current_hours (80 in the example per the description) and 3 as an example for G. We also "reuse" score here which is not strictly necessary, but done for consistency of the next point.
$ refers to fields in the document, whereas $$ refers to variables. That's why everything in the vars definition uses $ and everything in the actual calculation inside in uses $$. The reference to score inside of in could have been done via just the field name ($), but I personally prefer the consistency of this approach.
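Putting the pieces together, a sketch of the whole pipeline might look like the following, with current_hours and G supplied as ordinary variables from the application at query time (80 and 3 here, matching the example values above):
// values supplied by the application before the pipeline is sent to the server
var currentHours = 80;
var G = 3;

db.items.aggregate([
  { "$project": {
    "_id": 1,
    "score": 1,
    "hours": 1,
    "value": {
      "$let": {
        "vars": {
          "score": "$score",
          "t": { "$subtract": [ currentHours, "$hours" ] },
          "g": G
        },
        "in": {
          "$divide": [
            { "$subtract": [ "$$score", 1 ] },
            { "$pow": [ { "$add": [ "$$t", 2 ] }, "$$g" ] }
          ]
        }
      }
    }
  }},
  { "$sort": { "value": 1 } }
])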

Related

What is the best way to use collection as round robin in Mongodb

I have a collection named items with three documents.
{
_id: 1,
item: "Pencil"
}
{
_id: 1,
item: "Pen"
}
{
_id: 1,
item: "Sharpner"
}
How could I query to get the documents in a round-robin fashion?
Consider that I get multiple user requests at the same time,
so one should get Pencil, another will get Pen, and the next will get Sharpner,
then start again from the first one.
If changing schema is a choice I am also ready for that.
I think I found a way to do this without changing the schema. It is based on skip() and limit(). You can also ask to keep the internal (natural) ordering for the returned documents, but as the guide says you should not rely on this, especially because you lose performance since indexing is bypassed:
The $natural parameter returns items according to their natural order
within the database. This ordering is an internal implementation
feature, and you should not rely on any particular structure within
it.
Anyway, this is the query:
db.getCollection('YourCollection').find().skip(counter).limit(1)
Where counter stores the current index for your documents.
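If the counter needs to survive concurrent requests, one possible sketch (not from the original answer; the "counters" collection and "seq" field are assumed names) is to keep it in a small single-document collection and advance it atomically before running the find:
var total = db.getCollection('YourCollection').countDocuments({});
var counterDoc = db.counters.findOneAndUpdate(
  { "_id": "roundRobin" },
  { "$inc": { "seq": 1 } },
  { "upsert": true, "returnNewDocument": true }
);
var counter = counterDoc.seq % total;   // wrap back to the first document
db.getCollection('YourCollection').find().skip(counter).limit(1)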
A few things to start:
_id has to be unique across a collection, especially when the collection is part of a replica set.
This is a very stateful requirement and would not work well with a distributed set of services, for example.
With that said, assuming you really just want to iterate from the database, I would use cursors to accomplish this. This will do a collection scan and is very inefficient, for the record.
var myCursor = db.items.find().sort({ _id: 1 });
while (myCursor.hasNext()) {
  printjson(myCursor.next());
}
My suggestion is that you should pull all results from the database at once and do your iteration in the application tier.
var myCursor = db.items.find().sort({ _id: 1 });
var documentArray = myCursor.toArray();
documentArray.forEach(doSomething);
If this is about distribution you may consider fetching random documents instead of round-robin via aggregation/$sample:
db.collection.aggregate([
  { "$sample": { "size": 1 } }
])
Or there are options to randomize via $rand ...
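As a hedged sketch of the $rand route (available from MongoDB 4.4.2), you could tag each document with a random value and take the lowest, which has roughly the same effect as $sample with a size of 1:
db.collection.aggregate([
  { "$addFields": { "r": { "$rand": {} } } },
  { "$sort": { "r": 1 } },
  { "$limit": 1 },
  { "$unset": "r" }
])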
Use findOneAndUpdate after restructuring the data objects
db.counter.findOneAndUpdate( {}, pipeline )
{
  "_id" : ObjectId("624317a681e72a1cfd7f2b7e"),
  "values" : [ "Pencil", "Pen", "Sharpener" ],
  "selected" : "Pencil",
  "counter" : 1
}

db.counter.findOneAndUpdate( {}, pipeline )
{
  "_id" : ObjectId("624317a681e72a1cfd7f2b7e"),
  "values" : [ "Pencil", "Pen", "Sharpener" ],
  "selected" : "Pen",
  "counter" : 2
}
where the data object is now:
{
  "_id" : ObjectId("6242fe3bc1551d0f3562bcb2"),
  "values" : [ "Pencil", "Pen", "Sharpener" ],
  "selected" : "Pencil",
  "counter" : 1
}
and the pipeline is:
[{ $project: {
  values: 1,
  selected: { $arrayElemAt: [ '$values', '$counter' ] },
  counter: {
    $mod: [
      { $add: [ '$counter', 1 ] },
      { $size: '$values' }
    ]
  }
}}]
This has some merits:
Firstly, using findOneAndUpdate means that moving the pointer to the
next item in the list and reading the object happen at once.
Secondly, by using {$size: "$values"}, adding a value to the list doesn't change the logic.
And an object could be used instead of a string.
Problems:
This method would be unwieldy with more than tens of entries.
It is hard to prove that this method works as advertised, so there is an accompanying Kotlin project on GitHub. The project uses coroutines, so it calls find/update asynchronously.
The alternative (assuming 50K items and not 3):
Set-up a simple counter {counter: 0} and update as follows:
db.counter.findOneAndUpdate({},
  [{ $project: {
    counter: {
      $mod: [
        { $add: [ '$counter', 1 ] },
        50000
      ]
    }
  }}]
)
Then use a simple select query to find the right document.
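A hedged sketch of that two-step flow, assuming each of the 50K items also stores its position in a field (idx here is an assumed name, not something from the project):
var c = db.counter.findOneAndUpdate(
  {},
  [{ $project: {
    counter: { $mod: [ { $add: [ '$counter', 1 ] }, 50000 ] }
  }}],
  { returnNewDocument: true }
);
// "idx" is an assumed 0..49999 position field stored on each item
db.items.findOne({ idx: c.counter })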
I've updated the GitHub project to include this example.

Searching with Precedence on Array Order

My gut feeling is that the answer is no, but is it possible to perform a search in Mongodb comparing the similarity of arrays where order is important?
E.g.
I have three documents like so
{'_id':1, "my_list": ["A",2,6,8,34,90]},
{'_id':2, "my_list": ["A","F",2,6,19,8,90,55]},
{'_id':3, "my_list": [90,34,8,6,3,"A"]}
1 and 2 are the most similar, 3 is wildly different irrespective of the fact it contains all of the same values as 1.
Ideally I would do a search similar to {"my_list" : ["A",2,6,8,34,90] } and the results would be document 1 and 2.
It's almost like a regex search with wild cards. I know I can do this in python easily enough, but speed is important and I'm dealing with 1.3 million documents.
Any "comparison" or "selection" is actually more or less subjective to the actual logic applied. But as a general principle you could always consider the product of the matched indices from the array to test against and the array present in the document. For example:
var sample = ["A",2,6,8,34,90];
db.getCollection('source').aggregate([
{ "$match": { "my_list": { "$in": sample } } },
{ "$addFields": {
"score": {
"$add": [
{ "$cond": {
"if": {
"$eq": [
{ "$size": { "$setIntersection": [ "$my_list", sample ] }},
{ "$size": { "$literal": sample } }
]
},
"then": 100,
"else": 0
}},
{ "$sum": {
"$map": {
"input": "$my_list",
"as": "ml",
"in": {
"$multiply": [
{ "$indexOfArray": [
{ "$reverseArray": "$my_list" },
"$$ml"
]},
{ "$indexOfArray": [
{ "$reverseArray": { "$literal": sample } },
"$$ml"
]}
]
}
}
}}
]
}
}},
{ "$sort": { "score": -1 } }
])
Would return the documents in order like this:
/* 1 */
{
"_id" : 1.0,
"my_list" : [ "A", 2, 6, 8, 34, 90],
"score" : 155.0
}
/* 2 */
{
"_id" : 2.0,
"my_list" : ["A", "F", 2, 6, 19, 8, 90, 55],
"score" : 62.0
}
/* 3 */
{
"_id" : 3.0,
"my_list" : [ 90, 34, 8, 6, 3, "A"],
"score" : 15.0
}
The key is that when $reverseArray is applied, the values produced by $indexOfArray are larger for matches nearer the start of the original array, which gives a larger "weight" to matches at the beginning of the array than to those towards the end.
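As a quick worked check of the first document's score of 155: the reversed-index products for its six matched values are 5*5 + 4*4 + 3*3 + 2*2 + 1*1 + 0*0 = 55, and because the $setIntersection size equals the sample size, the exact-match branch of the $cond adds the remaining 100.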
Of course you should take into consideration that the second document does in fact contain "most" of the matches, and having more array entries places a larger weight on its initial matches than in the first document.
From the above, "A" scores more in the second document than in the first because that array is longer, even though both matched "A" in the first position. However, there is also the effect that "F" is a mismatch and therefore has a greater negative effect than it would if it were later in the array. The same applies to "A" in the last document, where at the end of the array the match has little bearing on the overall weight.
The counter to this is to add some logic for the "exact match" case, such as the $size comparison here between the $setIntersection of the sample and the current array and the size of the sample itself. This adjusts the scores to ensure that something which matched all provided elements actually scores higher than a document with fewer positional matches but more elements overall.
With a "score" in place you can then filter out results (i.e. $limit) or apply whatever other logic is needed in order to only return the actual results wanted. But the first step is calculating a "score" to work from.
So it's all generally subjective to what logic actually defines a "nearest match", but the $reverseArray and $indexOfArray operations are generally the key to putting "more weight" on earlier index matches rather than later ones.
Overall you are looking for a "calculation" of the logic. The aggregation framework has some of the available operators, but which ones actually apply are up to your end implementation. I'm just showing something that "logically works" to put more weight on "earlier matches" in an array comparison rather than "later matches", and of course the "most weight" where the arrays are actually the same.
NOTE: Similar logic could be achieved using the includeArrayIndex option of $unwind for earlier versions of MongoDB without the main operators used above. However the process does require usage of $unwind to deconstruct the arrays in the first place, and the performance hit that would incur would probably negate the effectiveness of the operation.

mongodb: document with the maximum number of matched targets

I need help to solve the following issue. My collection has a "targets" field.
Each user can have 0 or more targets.
When I run my query I'd like to retrieve the document with the maximum number of matched targets.
Ex:
documents = [{
  targets: {
    "cluster": "01"
  }
}, {
  targets: {
    "cluster": "01",
    "env": "DC",
    "core": "PO"
  }
}, {
  targets: {
    "cluster": "01",
    "env": "DC",
    "core": "PO",
    "platform": "IG"
  }
}];
userTarget = {
  "cluster": "01",
  "env": "DC",
  "core": "PO"
}
You seem to be asking to return the document where the most conditions were met, and possibly not all conditions. The basic process is an $or query to return the documents that can match either of the conditions. Then you basically need a statement to calculate "how many terms" were met in the document, and return the one that matched the most.
So the combination here is an .aggregate() statement using the initial results from $or to calculate and then sort the results:
// initial targets object
var userTarget = {
  "cluster": "01",
  "env": "DC",
  "core": "PO"
};

// Convert to the $or condition
// and the calculation condition to match
var orCondition = [],
    scoreCondition = [];

Object.keys(userTarget).forEach(function(key) {
  var query = {},
      cond = { "$cond": [{ "$eq": ["$target." + key, userTarget[key]] }, 1, 0] };
  query["target." + key] = userTarget[key];
  orCondition.push(query);
  scoreCondition.push(cond);
});
// Run aggregation
Model.aggregate(
  [
    // Match with condition
    { "$match": { "$or": orCondition } },

    // Calculate a "score" based on matched fields
    { "$project": {
      "target": 1,
      "score": { "$add": scoreCondition }
    }},

    // Sort on the greatest "score" (descending)
    { "$sort": { "score": -1 } },

    // Return the first document
    { "$limit": 1 }
  ],
  function(err, result) {
    // check errors
    // Remember that result is an array, even if limited to one document
    console.log(result[0]);
  }
)
So before processing the aggregate statement, we are going to generate the dynamic parts of the pipeline operations based on the input in the userTarget object. This would produce an orCondition like this:
{ "$match": {
"$or": [
{ "target.cluster" : "01" },
{ "target.env" : "DC" },
{ "target.core" : "PO" }
]
}}
And the scoreCondition would expand to something like this:
"score": {
  "$add": [
    { "$cond": [{ "$eq": [ "$target.cluster", "01" ] }, 1, 0] },
    { "$cond": [{ "$eq": [ "$target.env", "DC" ] }, 1, 0] },
    { "$cond": [{ "$eq": [ "$target.core", "PO" ] }, 1, 0] }
  ]
}
Those are going to be used in the selection of possible documents and then for counting the terms that could match. In particular the "score" is made by evaluating each condition within the $cond ternary operator, and then either attributing a score of 1 where there was a match, or 0 where there was not a match on that field.
If desired, it would be simple to alter the logic to assign a higher "weight" to each field with a different value going towards the score depending on the deemed importance of the match. At any rate, you simply $add these score results together for each field for the overall "score".
Then it is just a simple matter of applying the $sort to the returned "score", and then using $limit to just return the top document.
It's not super efficient: even though a document may match all three conditions, the query cannot presume that one does, so it needs to look at all data where "at least one" condition was a match and then work out the "best match" from those possible results.
Ideally, I would personally run an additional query "first" to see if all three conditions were met, and if not then look for the other cases. That still is two separate queries, and would be different from simply just pushing the "and" conditions for all fields as the first statement in $or.
So the preferred implementation I think should be:
Look for a document that matches all given field values; if not then
Run the either/or on every field and count the condition matches.
That way, if all fields match then the first query is fastest, and it only needs to fall back to the slower but required implementation shown in the listing if there was no actual result.
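As a hedged sketch of that preferred two-step flow, reusing the orCondition and scoreCondition arrays built earlier (the exact-match filter is just all the userTarget keys applied together):
// 1. Try to match every supplied field first
var andQuery = {};
Object.keys(userTarget).forEach(function(key) {
  andQuery["target." + key] = userTarget[key];
});

Model.findOne(andQuery, function(err, doc) {
  if (doc) return console.log(doc);   // all conditions met, done

  // 2. Otherwise fall back to the $or match and score calculation shown above
  Model.aggregate(
    [
      { "$match": { "$or": orCondition } },
      { "$project": { "target": 1, "score": { "$add": scoreCondition } } },
      { "$sort": { "score": -1 } },
      { "$limit": 1 }
    ],
    function(err, result) {
      console.log(result[0]);
    }
  );
});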

Grouping documents in pairs using mongo aggregation

I have a collection of items,
[ a, b, c, d ]
And I want to group them in pairs such as,
[ [ a, b ], [ b, c ], [ c, d ] ]
This will be used in calculating the differences between each item in the original collection, but that part is solved using several techniques such as the one in this question.
I know that this is possible with map reduce, but I want to know if it's possible with aggregation.
Edit: Here's an example,
The collection of items; each item is an actual document.
[
  { val: 1 },
  { val: 3 },
  { val: 6 },
  { val: 10 }
]
Grouped version:
[
  [ { val: 1 }, { val: 3 } ],
  [ { val: 3 }, { val: 6 } ],
  [ { val: 6 }, { val: 10 } ]
]
The resulting collection (or aggregation result):
[
  { diff: 2 },
  { diff: 3 },
  { diff: 4 }
]
This is something that just cannot be done with the aggregation framework, and the only current MongoDB method available for this type of operation is mapReduce.
The reason being that the aggregation framework has no way of referring to any document in the pipeline other than the present one. This actually applies to "grouping" pipeline stages as well, since even though things are grouped on a "key" you can't really deal with individual documents in the way you want to.
MapReduce on the other hand has one feature available that allows you to do what you want here, and it's not even "directly" related to aggregation. It is in fact the ability to have "globally scoped variables" across all stages. And having a "variable" to basically "store the last document" is all you need to achieve your result.
So it's quite simple code, and there is in fact no "reduction" required:
db.collection.mapReduce(
  function () {
    if (lastVal != null)
      emit( this._id, this.val - lastVal );
    lastVal = this.val;
  },
  function() {}, // the reducer is never called since each key is emitted only once
  {
    "scope": { "lastVal": null },
    "out": { "inline": 1 }
  }
)
Which gives you a result much like this:
{
  "results" : [
    {
      "_id" : ObjectId("54a425a99b8bcd6f73e2d662"),
      "value" : 2
    },
    {
      "_id" : ObjectId("54a425a99b8bcd6f73e2d663"),
      "value" : 3
    },
    {
      "_id" : ObjectId("54a425a99b8bcd6f73e2d664"),
      "value" : 4
    }
  ],
  "timeMillis" : 3,
  "counts" : {
    "input" : 4,
    "emit" : 3,
    "reduce" : 0,
    "output" : 3
  },
  "ok" : 1
}
That's really just picking "something unique" as the emitted _id value rather than anything specific, because all this is really doing is calculating the difference between values on differing documents.
Global variables are usually the solution to these types of "pairing" aggregations or producing "running totals". Right now the aggregation framework has no access to global variables, even though it might well be a nice thing to have. The mapReduce framework has them, so it is probably fair to say that they should be available to the aggregation framework as well.
Right now they are not though, so stick with mapReduce.

MongoDB - Querying between a time range of hours

I have a MongoDB datastore set up with location data stored like this:
{
  "_id" : ObjectId("51d3e161ce87bb000792dc8d"),
  "datetime_recorded" : ISODate("2013-07-03T05:35:13Z"),
  "loc" : {
    "coordinates" : [
      0.297716,
      18.050614
    ],
    "type" : "Point"
  },
  "vid" : "11111-22222-33333-44444"
}
I'd like to be able to perform a query similar to the date range example but instead on a time range, i.e. retrieve all points recorded between 12 PM and 4 PM (can be done with 1200 and 1600 in 24-hour time as well).
e.g.
With points:
"datetime_recorded" : ISODate("2013-05-01T12:35:13Z"),
"datetime_recorded" : ISODate("2013-06-20T05:35:13Z"),
"datetime_recorded" : ISODate("2013-01-17T07:35:13Z"),
"datetime_recorded" : ISODate("2013-04-03T15:35:13Z"),
a query
db.points.find({
  'datetime_recorded': {
    $gte: Date(1200 hours),
    $lt: Date(1600 hours)
  }
});
would yield only the first and last point.
Is this possible? Or would I have to do it for every day?
Well, the best way to solve this is to store the minutes separately as well. But you can get around this with the aggregation framework, although that is not going to be very fast:
db.so.aggregate( [
  { $project: {
    loc: 1,
    vid: 1,
    datetime_recorded: 1,
    minutes: { $add: [
      { $multiply: [ { $hour: '$datetime_recorded' }, 60 ] },
      { $minute: '$datetime_recorded' }
    ] }
  } },
  { $match: { 'minutes' : { $gte : 12 * 60, $lt : 16 * 60 } } }
] );
In the first step, $project, we calculate the minutes as hour * 60 + min, which we then match against in the second step, $match.
Adding an answer since I disagree with the other answers in that even though there are great things you can do with the aggregation framework, this really is not an optimal way to perform this type of query.
If your identified application usage pattern is that you rely on querying for "hours" or other times of the day without wanting to look at the "date" part, then you are far better off storing that as a numeric value in the document. Something like "milliseconds from start of day" would be granular enough for as many purposes as a BSON Date, but of course gives better performance without the need to compute for every document.
Set Up
This does require some set-up in that you need to add the new fields to your existing documents and make sure you add these on all new documents within your code. A simple conversion process might be:
MongoDB 4.2 and upwards
This can actually be done in a single request due to aggregation operations being allowed in "update" statements now.
db.collection.updateMany(
  {},
  [{ "$set": {
    "timeOfDay": {
      "$mod": [
        { "$toLong": "$datetime_recorded" },
        1000 * 60 * 60 * 24
      ]
    }
  }}]
)
Older MongoDB
var batch = [];
db.collection.find({ "timeOfDay": { "$exists": false } }).forEach(doc => {
  batch.push({
    "updateOne": {
      "filter": { "_id": doc._id },
      "update": {
        "$set": {
          "timeOfDay": doc.datetime_recorded.valueOf() % (60 * 60 * 24 * 1000)
        }
      }
    }
  });

  // write once only per reasonable batch size
  if ( batch.length >= 1000 ) {
    db.collection.bulkWrite(batch);
    batch = [];
  }
})

if ( batch.length > 0 ) {
  db.collection.bulkWrite(batch);
  batch = [];
}
If you can afford to write to a new collection, then looping and rewriting would not be required:
db.collection.aggregate([
  { "$addFields": {
    "timeOfDay": {
      "$mod": [
        { "$subtract": [ "$datetime_recorded", new Date(0) ] },
        1000 * 60 * 60 * 24
      ]
    }
  }},
  { "$out": "newcollection" }
])
Or with MongoDB 4.0 and upwards:
db.collection.aggregate([
  { "$addFields": {
    "timeOfDay": {
      "$mod": [
        { "$toLong": "$datetime_recorded" },
        1000 * 60 * 60 * 24
      ]
    }
  }},
  { "$out": "newcollection" }
])
All using the same basic conversion of:
1000 milliseconds in a second
60 seconds in a minute
60 minutes in an hour
24 hours a day
Taking the modulo of the numeric milliseconds since epoch, which is the value actually stored internally for a BSON Date, is a simple way to extract the current milliseconds in the day.
Query
Querying is then really simple, and as per the question example:
db.collection.find({
  "timeOfDay": {
    "$gte": 12 * 60 * 60 * 1000,
    "$lt": 16 * 60 * 60 * 1000
  }
})
Of course using the same time scale conversion from hours into milliseconds to match the stored format. But just like before you can make this whatever scale you actually need.
Most importantly, as real document properties which don't rely on computation at run-time, you can place an index on this:
db.collection.createIndex({ "timeOfDay": 1 })
So not only is this negating run-time overhead for calculating, but also with an index you can avoid collection scans as outlined on the linked page on indexing for MongoDB.
For optimal performance you never want to calculate such things at query time: at any real-world scale it simply takes an order of magnitude longer to process every document in the collection just to work out which ones you want than it does to reference an index and only fetch those documents.
The aggregation framework may just be able to help you rewrite the documents here, but it really should not be used as a production system method of returning such data. Store the times separately.
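And to keep new writes consistent without a later back-fill, the same modulo can be applied in application code when each document is created; a minimal shell sketch using the document shape from the question:
var now = new Date();
db.collection.insertOne({
  "datetime_recorded": now,
  "loc": { "type": "Point", "coordinates": [ 0.297716, 18.050614 ] },
  "vid": "11111-22222-33333-44444",
  "timeOfDay": now.valueOf() % (1000 * 60 * 60 * 24)
})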