Sort a mongo collection by a string field

I want to sort a collection by putting items with specific values before other items.
For example, I want all the items with "getthisfirst": "yes" to come before all the others.
{"getthisfirst": "yes"}
{"getthisfirst": "yes"}
{"getthisfirst": "no"}
{"getthisfirst": "maybe"}

As a general concept, this is called "weighting". So without any other mechanism in place, you handle this in a MongoDB query by logically "projecting" a "weight" value into the document.
Your method for "projecting" and altering the fields present in your document is the .aggregate() method, and specifically its $project pipeline stage:
db.collection.aggregate([
    { "$project": {
        "getthisfirst": 1,
        "weight": {
            "$cond": [
                { "$eq": [ "$getthisfirst", "yes" ] },
                10,
                { "$cond": [
                    { "$eq": [ "$getthisfirst", "maybe" ] },
                    5,
                    0
                ]}
            ]
        }
    }},
    { "$sort": { "weight": -1 } }
]);
The $cond operator here is a "ternary" ( if/then/else ) condition where the first argument is a conditional statement evaluating to boolean true|false. If true, "then" the second argument is returned as the result; otherwise the "else" or third argument is returned in response.
In this "nested" case, where "yes" is a match a certain "weight" score is assigned; otherwise we move on to the next condition test, where a match on "maybe" assigns another score, or otherwise the score is 0 since we only have three possibilities to match.
Then the $sort condition is applied in order to, well, "order" ( in descending order ) the results with the largest "weight" on top.
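For the four sample documents above, the pipeline would emit output shaped roughly like this ( the _id values are omitted here for brevity, and documents of equal weight have no guaranteed relative order ):
{ "getthisfirst": "yes", "weight": 10 }
{ "getthisfirst": "yes", "weight": 10 }
{ "getthisfirst": "maybe", "weight": 5 }
{ "getthisfirst": "no", "weight": 0 }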


MongoDB query for finding number of people with conflicting schedules

I have startTime and endTime for all records like this:
{
    startTime: 21345678,
    endTime: 31345678
}
I am trying to find the number of all the conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)].
As an algorithm I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with a start time less than that end time. This will be O(n²) time. A better approach would be using an interval tree, inserting each record into the tree and counting the overlaps as they occur. This will be O(n log n) time.
I have not used MongoDB much, so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done; which one you implement depends on what your data and use case are best suited to.
Current Range Match
MongoDB 3.6 $lookup
The simplest approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6, which allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
    { "$lookup": {
        "from": "collection",
        "let": {
            "_id": "$_id",
            "starttime": "$starttime",
            "endtime": "$endtime"
        },
        "pipeline": [
            { "$match": {
                "$expr": {
                    "$and": [
                        { "$ne": [ "$$_id", "$_id" ] },
                        { "$or": [
                            { "$and": [
                                { "$gte": [ "$$starttime", "$starttime" ] },
                                { "$lte": [ "$$starttime", "$endtime" ] }
                            ]},
                            { "$and": [
                                { "$gte": [ "$$endtime", "$starttime" ] },
                                { "$lte": [ "$$endtime", "$endtime" ] }
                            ]}
                        ]}
                    ]
                }
            }},
            { "$count": "count" }
        ],
        "as": "overlaps"
    }},
    { "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection, allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in the subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime" $or "endtime" values of the "current document" fall between the same properties of the "processed document". Note here that these, as well as the respective $gte and $lte operators, are the "aggregation comparison operators" and not the "query operator" form, as the result returned and evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
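To make that distinction concrete, here is a minimal side-by-side sketch ( the constant value is just a placeholder ):
// query operator form: compares a named field against a given value
{ "starttime": { "$gte": 21345678 } }

// aggregation comparison form: takes two expressions and returns a boolean,
// which is what $expr requires
{ "$expr": { "$gte": [ "$$starttime", "$starttime" ] } }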
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents, and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but where you expect a large number of overlaps for a given document this can be a real concern. So it's really something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of the output array. Where there actually is some content in the array, and therefore "overlaps", the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
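With the $count stage in place, a matching document comes back shaped something like this ( the values here are illustrative only ):
{
    "_id": ObjectId("..."),
    "starttime": 21345678,
    "endtime": 31345678,
    "overlaps": [ { "count": 2 } ]
}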
Other versions - Queries to "join"
The alternate case, where your MongoDB lacks this support, is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
    var overlaps = db.getCollection('collection').find({
        "_id": { "$ne": d._id },
        "$or": [
            { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
            { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
        ]
    }).toArray();

    return ( overlaps.length !== 0 )
        ? Object.assign(
            d,
            {
                "overlaps": {
                    "count": overlaps.length,
                    "documents": overlaps
                }
            }
        )
        : null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put, we return the array rather than the cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is no document fetching overhead.
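As a minimal sketch of that variation, inside the same .map() callback the lookup could count rather than fetch, using the same conditions as above:
// count the overlapping documents without fetching them
var overlapCount = db.getCollection('collection').count({
    "_id": { "$ne": d._id },
    "$or": [
        { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
        { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
});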
The output is then simply merged with the existing document. The other important distinction is that, since these are "multiple queries", there is no way of providing the condition that they must "match" something. So this leaves us with results where the count ( or array length ) is 0, and all we can do at this point is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server, and this filtering is "post processing" in one form or another.
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore, whilst using $lookup allows for some "efficiency" in reducing transport and response overhead, it still suffers the same problem: you are essentially comparing each document to everything else.
A better solution, "where you can make it fit", is to instead store a "hard value" representative of the interval on each document. For instance we could "presume" that there are solid "booking" periods of one hour within a day, for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that, where there is a set indicator for the interval, the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
    { "$unwind": "$booking" },
    { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
    { "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
    { "$unwind": "$booking" },
    { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
    { "$match": { "docs.1": { "$exists": true } } },
    { "$unwind": "$docs" },
    { "$group": {
        "_id": "$docs",
        "intervals": { "$push": "$_id" }
    }}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where your data allows this sort of analysis it is the far more efficient approach. So if you can keep the "granularity" fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
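As a rough sketch of how such interval values might be derived at write time, assuming here that starttime and endtime are Unix timestamps in seconds and the chosen granularity is one hour ( the helper name and the numbering of slots as absolute hours since the epoch are this sketch's own assumptions ):
// hypothetical helper: expand a start/end pair into hourly slot numbers
function hourSlots(starttime, endtime) {
    var slots = [];
    for (var h = Math.floor(starttime / 3600); h <= Math.floor(endtime / 3600); h++) {
        slots.push(h);
    }
    return slots;
}

// store the computed slots on the document at write time,
// e.g. a two hour booking
db.booking.insert({
    "starttime": 1500000000,
    "endtime": 1500007200,
    "booking": hourSlots(1500000000, 1500007200)
});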
Essentially, this is how you would implement what you mentioned as the "better" approach anyway, with the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.

MongoDB sort by relevance (mix $and and $or)

With 2 documents like:
{
    "name": "hello",
    "family": 1
},
{
    "name": "world",
    "family": 1,
    "category": 2
}
and a query like:
doc.find({$or: [{family: 1}, {category: 2}]})
how can I have the results sorted with the one matching the 2 conditions ("world") as the first result, but still have the doc matching only 1 condition ("hello") as the last result?
I can't use the default $and operator, as I would not see the "hello" document that does not match both conditions.
I saw how aggregation could help, but for a more complex example than this it would be a lot of computation. I'm guessing this is a common use case and there must be something obvious I'm missing.
You cannot do that sort of query (pun not intended) with a simple .find() statement. What you are asking for involves "weighting", which is applying a "calculated precedence" to values.
Anything with "calculation" basically requires conditions to be programmatically applied, and the particular requirement here to "sort" rules out the "JavaScript runner" options like mapReduce, which simply leaves the Aggregation Framework or other handling of the results.
For the aggregation framework approach you would need to $project a calculated "weight" to each matched document based on the conditions:
db.collection.aggregate([
    // Same match conditions to filter
    { "$match": { "$or": [{ "family": 1 }, { "category": 2 }] } },

    // Assign the "weight" based on conditions
    { "$project": {
        "name": 1,
        "family": 1,
        "weight": {
            "$add": [
                { "$cond": {
                    "if": { "$eq": [ "$family", 1 ] },
                    "then": 1,
                    "else": 0
                }},
                { "$cond": {
                    "if": { "$eq": [ "$category", 2 ] },
                    "then": 1,
                    "else": 0
                }}
            ]
        }
    }},

    // Then sort "descending" with highest "weight" on top
    { "$sort": { "weight": -1 } }
])
Basically you are using $cond to evaluate the condition that the returned document actually has data meeting your condition, since in the selection either field being present is a valid response. Where the condition is present we assign a value, and where not the value is 0.
When "both" conditions are present the $add operation combines the total in the weight. So here documents that met only one condition have a 1 and for both they have 2. If you waned for example "family" to have the greater preference, then you would assign 2 in the condition, leaving you with possible document scores of:
3 : For both category and family
2 : For family only
1 : For category only
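A minimal sketch of that weighting change, altering only the "family" branch of the $add expression shown above:
"weight": {
    "$add": [
        { "$cond": {
            "if": { "$eq": [ "$family", 1 ] },
            "then": 2,
            "else": 0
        }},
        { "$cond": {
            "if": { "$eq": [ "$category", 2 ] },
            "then": 1,
            "else": 0
        }}
    ]
}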
You could shorten the syntax of the $project in MongoDB 3.4 or later with the $addFields pipeline stage instead, which is most useful when you have a "lot" of other document properties you want to return without needing to list them all in the $project.
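For example, a minimal sketch of the same pipeline using $addFields, which merges the computed "weight" into the existing documents without re-listing the other fields:
db.collection.aggregate([
    { "$match": { "$or": [{ "family": 1 }, { "category": 2 }] } },
    { "$addFields": {
        "weight": {
            "$add": [
                { "$cond": [{ "$eq": [ "$family", 1 ] }, 1, 0 ] },
                { "$cond": [{ "$eq": [ "$category", 2 ] }, 1, 0 ] }
            ]
        }
    }},
    { "$sort": { "weight": -1 } }
])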
Aside from this, the database services don't allow for "calculations" on the "sort". This is considered "manipulation", which is the purpose of the Aggregation Framework.
Whilst you can do the same sort of "weighting" by post processing the result set in client code, the issue here is of course where you want to "limit" the results to return in actions like "paging". This is where running the operations on the server comes into play, and the reason why you use the Aggregation Framework for this.

mongodb: document with the maximum number of matched targets

I need help to solve the following issue. My collection has a "targets" field.
Each user can have 0 or more targets.
When I run my query I'd like to retrieve the document with the maximum number of matched targets.
Ex:
documents = [
    {
        "targets": {
            "cluster": "01"
        }
    },
    {
        "targets": {
            "cluster": "01",
            "env": "DC",
            "core": "PO"
        }
    },
    {
        "targets": {
            "cluster": "01",
            "env": "DC",
            "core": "PO",
            "platform": "IG"
        }
    }
];
userTarget = {
    "cluster": "01",
    "env": "DC",
    "core": "PO"
};
You seem to be asking to return the document where the most conditions were met, and possibly not all conditions. The basic process is an $or query to return the documents that can match either of the conditions. Then you basically need a statement to calculate "how many terms" were met in the document, and return the one that matched the most.
So the combination here is an .aggregate() statement using the initial results from $or to calculate and then sort the results:
// initial targets object
var userTarget = {
    "cluster": "01",
    "env": "DC",
    "core": "PO"
};

// Convert to the $or condition
// and the calculation condition to match
var orCondition = [],
    scoreCondition = [];

Object.keys(userTarget).forEach(function(key) {
    var query = {},
        cond = { "$cond": [{ "$eq": ["$target." + key, userTarget[key]] }, 1, 0] };

    query["target." + key] = userTarget[key];
    orCondition.push(query);
    scoreCondition.push(cond);
});

// Run aggregation
Model.aggregate(
    [
        // Match with condition
        { "$match": { "$or": orCondition } },

        // Calculate a "score" based on matched fields
        { "$project": {
            "target": 1,
            "score": { "$add": scoreCondition }
        }},

        // Sort on the greatest "score" (descending)
        { "$sort": { "score": -1 } },

        // Return the first document
        { "$limit": 1 }
    ],
    function(err, result) {
        // check errors
        // Remember that result is an array, even if limited to one document
        console.log(result[0]);
    }
)
So before processing the aggregate statement, we are going to generate the dynamic parts of the pipeline operations based on the input in the userTarget object. This would produce a $match stage with an orCondition like this:
{ "$match": {
"$or": [
{ "target.cluster" : "01" },
{ "target.env" : "DC" },
{ "target.core" : "PO" }
]
}}
And the scoreCondition would expand to code like this:
"score": {
    "$add": [
        { "$cond": [{ "$eq": [ "$target.cluster", "01" ] }, 1, 0] },
        { "$cond": [{ "$eq": [ "$target.env", "DC" ] }, 1, 0] },
        { "$cond": [{ "$eq": [ "$target.core", "PO" ] }, 1, 0] }
    ]
}
Those are going to be used in the selection of possible documents and then for counting the terms that could match. In particular the "score" is made by evaluating each condition within the $cond ternary operator, and then either attributing a score of 1 where there was a match, or 0 where there was not a match on that field.
If desired, it would be simple to alter the logic to assign a higher "weight" to some fields, with a different value going towards the score depending on the deemed importance of the match, as sketched below. At any rate, you simply $add these score results together for each field for the overall "score".
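A minimal sketch of that variation, using a hypothetical per-field weights map when building the score conditions:
// hypothetical weights reflecting the deemed importance of each field
var weights = { "cluster": 4, "env": 2, "core": 1 };

Object.keys(userTarget).forEach(function(key) {
    var query = {},
        cond = { "$cond": [
            { "$eq": ["$target." + key, userTarget[key]] },
            weights[key],   // score by field importance instead of a flat 1
            0
        ]};

    query["target." + key] = userTarget[key];
    orCondition.push(query);
    scoreCondition.push(cond);
});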
Then it is just a simple matter of applying the $sort to the returned "score", and then using $limit to just return the top document.
It's not super efficient, since even though all three conditions may match, the basic question you are asking of the data cannot presume that they do. Hence it needs to look at all data where "at least one" condition was a match, and then just work out the "best match" from those possible results.
Ideally, I would personally run an additional query "first" to see if all three conditions were met, and if not then look for the other cases. That is still two separate queries, and it is different from simply pushing the "and" conditions for all fields as the first statement in $or.
So the preferred implementation I think should be:
Look for a document that matches all given field values; if not then
Run the either/or on every field and count the condition matches.
That way, if all fields match then the first query is fastest, and you only need to fall back to the slower but required implementation shown in the listing if there was no actual result.
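A minimal sketch of that two-step fallback, reusing the orCondition and scoreCondition built earlier ( matching "all" fields in the first step simply wraps the same conditions in $and ):
// Step 1: try a match on every field first
Model.findOne({ "$and": orCondition }, function(err, doc) {
    if (doc) return console.log(doc);

    // Step 2: fall back to the weighted either/or aggregation
    Model.aggregate(
        [
            { "$match": { "$or": orCondition } },
            { "$project": { "target": 1, "score": { "$add": scoreCondition } } },
            { "$sort": { "score": -1 } },
            { "$limit": 1 }
        ],
        function(err, result) {
            console.log(result[0]);
        }
    );
});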

MongoDB: Sort by field existing and then alphabetically

In my database I have a field of name. In some records it is an empty string, in others it has a name in it.
In my query, I'm currently doing:
db.users.find({}).sort({'name': 1})
However, this returns results with an empty name field first, then alphabetically returns results. As expected, doing .sort({'name': -1}) returns results with a name and then results with an empty string, but it's in reverse-alphabetical order.
Is there an elegant way to achieve this type of sorting?
How about:
db.users.find({ "name": { "$exists": true } }).sort({'name': 1})
Because, after all, when a field you want to sort on is not actually present, the returned value is null and therefore "lower" in the order than any positive result. So it makes sense to exclude those results if you really are only looking for something with a matching value.
If you really want all the results in there, regardless of null content, then I suggest you "weight" them via .aggregate():
db.users.aggregate([
    { "$project": {
        "name": 1,
        "score": {
            "$cond": [
                { "$ifNull": [ "$name", false ] },
                1,
                10
            ]
        }
    }},
    { "$sort": { "score": 1, "name": 1 } }
])
And that moves all null results to the "end of the chain" by assigning a value as such.
If you want to filter out documents with an empty "name" field, change your query: db.users.find({"name": {"$ne": ""}}).sort({"name": 1})

Filtering and Matching Arrays within Arrays

I am trying to query a JSON file which has a nested array structure. Each design element has multiple SLID and status values. I want to write a MongoDB query to get the designs whose highest SLID has a status of "OLD".
Here is the sample JSON:
{
    "_id" : ObjectId("55cddc30f1a3c59ca1e88f30"),
    "designs" : [
        {
            "Deid" : 1,
            "details" : [
                {
                    "SLID" : 1,
                    "status" : "OLD"
                },
                {
                    "SLID" : 2,
                    "status" : "NEW"
                }
            ]
        },
        {
            "Deid" : 2,
            "details" : [
                {
                    "SLID" : 1,
                    "status" : "NEW"
                },
                {
                    "SLID" : 2,
                    "status" : "NEW"
                },
                {
                    "SLID" : 3,
                    "status" : "OLD"
                }
            ]
        }
    ]
}
In this sample, the expected query should return the following, as its highest SLID has status "OLD".
{
    "_id" : ObjectId("55cddc30f1a3c59ca1e88f30"),
    "designs" : [
        {
            "Deid" : 2,
            "details" : [
                {
                    "SLID" : 3,
                    "status" : "OLD"
                }
            ]
        }
    ]
}
I have tried the following query, but it kept returning other details array elements (which have status "NEW") along with the element above.
db.Collection.find(
    { "designs": { "$all": [{ "$elemMatch": { "details.status": "OLD" } }] } },
    { "designs.details": { "$slice": -1 } }
)
Edit:
To summarize the problem:
The requirement is to get every design from the document set whose highest SLID (always the last item in the details array) has a status of "OLD".
Present Problem
What you should have been picking up from the previously linked question is that the positional $ operator is only capable of matching the first matched element within an array. When you have nested arrays like you do, this means "always" only the "outer" array position can be reported, never the actual matched position within the inner array, nor any more than a single match.
Other examples show usage of the aggregation framework for MongoDB in order to "filter" elements from the array, generally processing with $unwind and then using conditions to match the array elements that you require. This is generally what you need to do in this case to get matches from your "inner" array. While there have been improvements since the first answers were written, your "last match", or effectively a "slice", condition excludes the other present possibilities. Therefore:
db.junk.aggregate([
    { "$match": {
        "designs.details.status": "OLD"
    }},
    { "$unwind": "$designs" },
    { "$unwind": "$designs.details" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "Deid": "$designs.Deid"
        },
        "details": { "$last": "$designs.details" }
    }},
    { "$match": {
        "details.status": "OLD"
    }},
    { "$group": {
        "_id": "$_id",
        "details": { "$push": "$details" }
    }},
    { "$group": {
        "_id": "$_id._id",
        "designs": {
            "$push": {
                "Deid": "$_id.Deid",
                "details": "$details"
            }
        }
    }}
])
Which would return, for your document or any others like it, a result like:
{
    "_id" : ObjectId("55cddc30f1a3c59ca1e88f30"),
    "designs" : [
        {
            "Deid" : 2,
            "details" : [
                {
                    "SLID" : 3,
                    "status" : "OLD"
                }
            ]
        }
    ]
}
The key there being to $unwind both arrays and then $group back on the relevant unique elements in order to "slice" the $last element from each "inner" array.
The next of your conditions requires a $match in order to see that the "status" field of that element is the value that you want. Then of course, since the documents have been essentially "de-normalized" by the $unwind operations, and even with the subsequent $group, the following $group statements re-construct the document into its original form.
Aggregation pipelines can either be quite simple or quite difficult depending on what you want to do, and reconstruction of documents with filtering like this means you need to take care in the steps, particularly if there are other fields involved. As you should also appreciate here, this process of $unwind to de-normalize and $group operations is not very efficient, and can cause significant overhead depending on the number of possible documents that can be met by the initial $match query.
Better Solution
While currently only available in the present development branch, there are some new operators available to the aggregation pipeline that make this much more efficient, and effectively "on par" with the performance of a general query. These are notably the $filter and $slice operators, which can be employed in this case as follows:
db.junk.aggregate([
    { "$match": {
        "designs.details.status": "OLD"
    }},
    { "$redact": {
        "$cond": [
            { "$gt": [
                { "$size": {
                    "$filter": {
                        "input": "$designs",
                        "as": "designs",
                        "cond": {
                            "$anyElementTrue": [
                                { "$map": {
                                    "input": {
                                        "$slice": [ "$$designs.details", -1 ]
                                    },
                                    "as": "details",
                                    "in": {
                                        "$eq": [ "$$details.status", "OLD" ]
                                    }
                                }}
                            ]
                        }
                    }
                }},
                0
            ]},
            "$$KEEP",
            "$$PRUNE"
        ]
    }},
    { "$project": {
        "designs": {
            "$map": {
                "input": {
                    "$filter": {
                        "input": "$designs",
                        "as": "designs",
                        "cond": {
                            "$anyElementTrue": [
                                { "$map": {
                                    "input": {
                                        "$slice": [ "$$designs.details", -1 ]
                                    },
                                    "as": "details",
                                    "in": {
                                        "$eq": [ "$$details.status", "OLD" ]
                                    }
                                }}
                            ]
                        }
                    }
                },
                "as": "designs",
                "in": {
                    "Deid": "$$designs.Deid",
                    "details": { "$slice": [ "$$designs.details", -1 ] }
                }
            }
        }
    }}
])
This effectively makes the operations just a $match and a $project stage only, which is basically what is done with a general .find() operation. The only real addition here is a $redact stage, which allows the documents to be additionally filtered from the initial query condition by further logical conditions that can inspect the document.
In this case, we can see not only that the document contains an "OLD" status, but also that the last element of at least one of the inner arrays matches that status in its own last entry; otherwise the document is "pruned" from the results for not meeting that condition.
In both the $redact and $project, the $slice operator is used to get the last entry from the "details" array within the "designs" array. In the initial case it is applied with $filter to remove any elements where the condition did not match from the "outer" or "designs" array, and then later in the $project to just show the last element from the "designs" array in final presentation. That last "reshape" is done by $map to replace the whole arrays with the last element slice only.
Whilst the logic there seems much more long winded than the initial statement, the performance gain is potentially "huge" due to being able to treat each document as a "unit" without the need to denormalize or otherwise de-construct until the final projection is made.
Best Solution for Now
In summary, the current processes you can use to achieve the result are simply not efficient for solving the problem. It would in fact be more efficient to simply match the documents that meet the basic condition ( contain a "status" that is "OLD" ) in conjunction with a $where condition to test the last element of each array. However, the actual "filtering" of the arrays in the output is best left to client code:
db.junk.find({
    "designs.details.status": "OLD",
    "$where": function() {
        return this.designs.some(function(design) {
            return design.details.slice(-1)[0].status == "OLD";
        });
    }
}).forEach(function(doc) {
    doc.designs = doc.designs.filter(function(design) {
        return design.details.slice(-1)[0].status == "OLD";
    }).map(function(design) {
        design.details = design.details.slice(-1);
        return design;
    });
    printjson(doc);
});
So the query condition at least only returns the documents that match all conditions, and then the client side code filters out the content from arrays by testing the last elements and then just slicing out that final element as the content to display.
Right now, that is probably the most efficient way to do this, as it mirrors the future aggregation operation capabilities.
The problems are really in the structure of your data. While it may suit the display purposes of your application, the usage of nested arrays makes this notoriously difficult to query, as well as "impossible" to update atomically due to the limitations of the positional operator mentioned before.
While you can always $push new elements to the inner array by matching its existence within, or just the presence of, the outer array element ( a working sketch follows below ), what you cannot do is alter the "status" of an inner element in an atomic operation. In order to modify it in such a way, you need to retrieve the entire document, modify the contents in code, and save back the result.
The problems with that process mean you are likely to "collide" on updates and possibly overwrite the changes made by another concurrent request to the one you are currently processing.
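To illustrate the part that does work atomically, here is a minimal sketch of adding a new inner element by matching the outer array element ( the new SLID value is hypothetical ):
// append a new detail entry to the design with Deid 2;
// the positional $ here can only address the matched "outer" array element
db.junk.update(
    { "_id": ObjectId("55cddc30f1a3c59ca1e88f30"), "designs.Deid": 2 },
    { "$push": { "designs.$.details": { "SLID": 4, "status": "NEW" } } }
)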
For these reasons you really should reconsider all of your design goals for the application and the suitability to such a structure. Keeping a more denormalized form may cost you in some other areas, but it sure would make things much more simple to both query and update with the kind of inspection level this seems to require.
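As one possible flattened shape, and this is only an assumption about what might suit the application, each detail could become its own document keyed back to its parent:
// one document per design detail
{ "doc": ObjectId("55cddc30f1a3c59ca1e88f30"), "Deid": 2, "SLID": 3, "status": "OLD" }
Finding the highest "OLD" entry then becomes a plain indexed query:
db.details.find({ "status": "OLD" }).sort({ "SLID": -1 }).limit(1)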
The end conclusion here should be that you reconsider the design. Though getting your results is both possible now and in the future, the other operational blockers should be enough to warrant a change.