group in aggregate framework stopped working properly - mongodb

I hate this kind of question, but maybe you can point me to something obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents, which has a string field called username on which I have an index. The index was non-unique, but recently I made it unique. Suddenly the following query gives me false alarms that I have duplicates.
db.users.aggregate(
{ $group : {_id : "$username", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} } );
which returns
{
"result" : [
{
"_id" : "davidbeges",
"total" : 2
},
{
"_id" : "jesusantonio",
"total" : 2
},
{
"_id" : "elesitasweet",
"total" : 2
},
{
"_id" : "theschoolofbmx",
"total" : 2
},
{
"_id" : "longflight",
"total" : 2
},
{
"_id" : "thenotoriouscma",
"total" : 2
}
],
"ok" : 1
}
I tested this query on a sample collection with a few documents and it works as expected.

Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.

I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match, $sort, $limit, $skip
And they will work if placed before:
$project, $unwind, $group
However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.
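Concretely, the workaround for the question's pipeline would look like this (a sketch: the leading $sort lets the aggregation walk the username index, so, per the JIRA comment above, each username is seen only once even if documents move during the scan):

```javascript
db.users.aggregate(
    { $sort  : { username : 1 } },   // first stage: can use the username index
    { $group : { _id : "$username", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort  : { total : -1 } }
);
```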

Related

MongoDB index not used when sorting, although prefix matches

I'm trying to fetch a set of records in the most efficient way from MongoDB, but it goes wrong when I add a sorting stage to the pipeline: the server does not use my intended index. According to the documentation, however, it should match the prefix:
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/#sort-and-non-prefix-subset-of-an-index
I have an index which looks like this:
{
"v" : 2,
"key" : {
"account_id" : 1,
"cdr_block" : 1,
"billing.total_billed" : 1,
"times.created" : -1
},
"name" : "IDX_by_account_and_block_sorted"
}
So I would suppose that when I filter on account_id, cdr_block and billing.total_billed, followed by a sort on times.created, the index would be used.
However, that is not the case. When I check the query explanations in the MongoDB shell,
this one does NOT use the index, but instead uses an index composed of times.created only, so it takes a few minutes:
db.getCollection("cdr").aggregate(
[
{
"$match" : {
"account_id" : 160.0,
"cdr_block" : ObjectId("5d11e0364f853f15824aff47"),
"billing.total_billed" : {
"$gt" : 0.0
}
}
},
{
"$sort" : {
"times.created" : -1.0
}
}
],
{
"allowDiskUse" : true
}
);
If I leave out the $sort stage, it does use my above mentioned index.
I was thinking that it was perhaps due to the fact that it's an aggregation, but this 'regular' query also doesn't use the index:
db.getCollection("cdr").find({
"account_id" : 160.0,
"cdr_block" : ObjectId("5d11e0364f853f15824aff47"),
"billing.total_billed" : {
"$gt" : 0.0
}
}).sort({"times.created" : -1 });
$sort Operator and Performance
The $sort operator can take advantage of an index when placed at the beginning of the pipeline, or placed before the $project, $unwind, and $group aggregation operators. If $project, $unwind, or $group occur prior to the $sort operation, $sort cannot use any indexes.
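One hedged explanation, which is not from the original thread: billing.total_billed is a range predicate, and an index can generally only cover a sort when the sort field follows the equality-matched fields and precedes the range fields. A reordered index along these lines might let both the $match and the $sort use it (the index name here is made up):

```javascript
// Equality fields first, then the sort field, then the range field:
db.getCollection("cdr").createIndex(
    { "account_id" : 1, "cdr_block" : 1, "times.created" : -1, "billing.total_billed" : 1 },
    { "name" : "IDX_by_account_block_created" }
);
// Verify with explain() that the winning plan now uses this index.
```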

Sorting on index of array mongodb

I have a collection where I have objects like:
{
"_id" : ObjectId("5ab212249a639865c58b744e"),
"levels" : [
{
"levelId" : 0,
"siteId" : "5a0ff11dc7bd083ea6a706b1",
"title" : "Hospital Services"
},
{
"levelId" : 1,
"siteId" : "5a0ff220c7bd083ea6a706d0",
"title" : "Reference Testing"
},
{
"levelId" : 2,
"siteId" : "5a0ff24fc7bd083ea6a706da",
"title" : "Des Moines(Reference Testing)"
}
]
}
I want to sort on the title field of the 2nd object of the levels array, i.e. levels.2.title.
Currently my mongo query looks like:
db.getCollection('5aaf63a69a639865c58b2ab9').aggregate([
{$sort : {'levels.2.title':1}}
])
But it is not giving the desired results.
Please help.
You can try the below query in 3.6.
db.col.aggregate({$sort:{"levels.2.title":1}});
Aggregation and find semantics for numeric field paths are different in 3.4. More on the JIRA ticket here.
So
db.col.find().sort({"levels.2.title":1})
works as expected, while the aggregation sort does not.
Use the below aggregation in 3.4.
Use $arrayElemAt inside $addFields to project the second element, keeping the computed value as an extra field on the document, followed by a $sort on that field.
Then $project with exclusion drops the sort field to produce the expected output.
db.col.aggregate([
{"$addFields":{ "sort_element":{"$arrayElemAt":["$levels", 2]}}},
{"$sort":{"sort_element.title":-1}},
{"$project":{"sort_element":0}}
])
Also, you can use a $let expression to output the title field directly in the $addFields stage.
db.col.aggregate([
{"$addFields":{ "sort_field":{"$let:{"vars":{"ele":{$arrayElemAt":["$levels", 2]}}, in:"$$ele.title"}}}},
{"$sort":{"sort_field":-1}},
{"$project":{"sort_field":0}}
])
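To make the 3.4 workaround concrete, here is a plain-JavaScript (Node, not mongo shell) sketch of what the $arrayElemAt / $addFields / $sort / $project chain computes; the sample documents and titles are made up:

```javascript
// Sample documents shaped like the question's collection.
const docs = [
  { _id: 1, levels: [{ title: "A" }, { title: "B" }, { title: "Zebra" }] },
  { _id: 2, levels: [{ title: "A" }, { title: "B" }, { title: "Apple" }] },
];

const sorted = docs
  .map(d => ({ doc: d, key: d.levels[2].title }))  // $addFields + $arrayElemAt: ["$levels", 2]
  .sort((a, b) => b.key.localeCompare(a.key))      // $sort on the computed field, descending
  .map(x => x.doc);                                // $project with exclusion drops the key

console.log(sorted.map(d => d._id)); // → [ 1, 2 ]: "Zebra" sorts before "Apple" descending
```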

Possible to avoid $unwind / aggregation on large array using $elemMatch and regular query?

I have a collection of documents (call it 'logs') which looks similar to this:
{
"_id" : ObjectId("52f523892491e4d58e85d70a"),
"ds_id" : "534d35d72491de267ca08e96",
"eT" : NumberLong(1391784000),
"vars" : [{
"n" : "ActPow",
"val" : 73.4186401367188,
"u" : "kWh",
"dt" : "REAL",
"cM" : "AVE",
"Q" : 99
}, {
"n" : "WinSpe",
"val" : 3.06327962875366,
"u" : "m/s",
"dt" : "REAL",
"cM" : "AVE",
"Q" : 99
}]
}
The vars array holds about 150 subdocuments, not just the two I have shown above. What I'd like to do now is to run a query which retrieves the val of the two subdocuments in the vars array that I have shown above.
Using the aggregation framework, I've been able to come up with the following:
db.logs.aggregate( [
{ $match :
{ ds_id: "534d35d72491de267ca08e96",
eT: { $lt : 1391784000 },
vars: { $elemMatch: { n: "PowCrvVld", val: 3 }}
}
},
{ $unwind : "$vars" },
{ $match :
{ "vars.n" : { $in : ["WinSpe", "ActPow"] }}
},
{ $project : { "vars.n" : 1, "vars.val" : 1 } }
]);
While this works, I run up against the 16MB limit when running larger queries. Seeing as I have about 150 subdocuments in the vars array, I'd also like to avoid $unwind if it's possible.
Using a regular query and using $elemMatch I have been able to retrieve ONE of the values:
db.logs.TenMinLog.find({
ds_id : "534d35d72491de267ca08e96",
eT : { $lt : 1391784000 },
vars : { $elemMatch : { n : "PowCrvVld", val : 3 }
}
}, {
ds_id : 1,
vars : { $elemMatch : { n : "ActPow", cM : "AVE" }
});
What my question comes down to is whether there's a way to use $elemMatch on an array multiple times in the <projection> part of find. If not, is there another way to easily retrieve those two subdocuments without using $unwind? I am also open to other, more performant suggestions that I may not be aware of. Thanks!
If you're using MongoDB 2.6 you can use the $redact operator to prune the elements from the vars array.
In MongoDB 2.6 you can also return results as a cursor to avoid the 16MB limit. From the docs:
In MongoDB 2.6 the aggregate command can return results as a cursor or
store the results in a collection, which are not subject to the size
limit. The db.collection.aggregate() returns a cursor and can return
result sets of any size.
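A sketch of the $redact approach mentioned above, using the question's field names (hedged: this is the common $$DESCEND/$$PRUNE pattern for pruning array elements, not code from the original answer):

```javascript
db.logs.aggregate([
    { $match : {
        ds_id : "534d35d72491de267ca08e96",
        eT : { $lt : 1391784000 },
        vars : { $elemMatch : { n : "PowCrvVld", val : 3 } }
    } },
    { $redact : { $cond : {
        if : { $or : [
            { $eq : [ "$n", "WinSpe" ] },
            { $eq : [ "$n", "ActPow" ] },
            // the top-level document has no "n" field, so keep descending into it
            { $not : [ { $ifNull : [ "$n", false ] } ] }
        ] },
        then : "$$DESCEND",
        else : "$$PRUNE"
    } } }
]);
```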
I'd strongly consider a move to MongoDB version 2.6. Aggregation has been enhanced to return a cursor which eliminates the 16MB document limit:
Changed in version 2.6:
The db.collection.aggregate() method returns a cursor and can return
result sets of any size. Previous versions returned all results in a
single document, and the result set was subject to a size limit of 16
megabytes.
http://docs.mongodb.org/manual/core/aggregation-pipeline/
Also there are a number of enhancements that you may find useful for more complex aggregation queries:
Aggregation Enhancements
The aggregation pipeline adds the ability to return result sets of any
size, either by returning a cursor or writing the output to a
collection. Additionally, the aggregation pipeline supports variables
and adds new operations to handle sets and redact data.
The db.collection.aggregate() now returns a cursor, which enables the
aggregation pipeline to return result sets of any size. Aggregation
pipelines now support an explain operation to aid analysis of
aggregation operations. Aggregation can now use a more efficient
external-disk-based sorting process.
New pipeline stages:
$out stage to output to a collection.
$redact stage to allow additional control to accessing the data.
New or modified operators:
set expression operators.
$let and $map operators to allow for the use of variables.
$literal operator and $size operator.
$cond expression now accepts either an object or an array.
http://docs.mongodb.org/manual/release-notes/2.6/
Maybe this works ($or has to sit at the top level of the query document, not nested under vars):
db.logs.TenMinLog.find({
ds_id : "534d35d72491de267ca08e96",
eT : { $lt : 1391784000 },
$or : [
{ vars : { $elemMatch : { n : "PowCrvVld", val : 3 } } },
{ vars : { $elemMatch : { n : <whatever>, val : <whatever> } } }
]
}, {
ds_id : 1,
vars : { $elemMatch : { n : "ActPow", cM : "AVE" } }
});
Hope it works as you want.

Mongodb: Get documents sorted by a dynamic ranking

I have these documents:
{ "_id" : ObjectId("52abac78f8b13c1e6d05aeed"), "score" : 125494, "updated" : ISODate("2013-12-14T00:55:20.339Z"), "url" : "http://pictwittrer.com/1crfS1t" }
{ "_id" : ObjectId("52abac86f8b13c1e6d05af0f"), "score" : 123166, "updated" : ISODate("2013-12-14T00:55:34.354Z"), "url" : "http://bit.ly/JghJ1N" }
Now, I would like to get all documents sorted by this dynamic ranking:
ranking = score / (NOW - updated).abs
ranking is a float value where:
- score is the value of the score property of my document
- the denominator is the difference between NOW (when I'm executing this query) and the updated field of my document
I want to do this so that older documents are sorted last.
I'm new to Mongodb and aggregation frameworks but considering the answer Tim B gave I came up with this:
db.coll.aggregate(
{ $project : {
"ranking" : {
"$divide" : ["$score", {"$subtract":[new Date(), "$updated"]}]
}
}
},
{ $sort : {"ranking" : 1}})
Using $project you can reshape documents to insert precomputed values, in your case the ranking field. After that using $sort you can sort the documents by rank in the order you like by specifying 1 for ascending or -1 for descending.
I'm sorry for the terrible code formatting, I tried to make it as readable as possible.
Look at the MongoDB aggregation framework, you can do a project to create the score you want and then a sort to sort by that created score.
http://docs.mongodb.org/manual/core/aggregation-pipeline/
http://docs.mongodb.org/manual/reference/command/aggregate/#dbcmd.aggregate
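As a plain-JavaScript (not mongo shell) sanity check of the formula: $subtract on two dates yields milliseconds, so with equal scores an older document gets a larger denominator and a smaller ranking, and a descending sort therefore puts older documents last. The sample values below are made up:

```javascript
const now = Date.now();
const docs = [
  { url: "a", score: 1000, updated: now - 60 * 60 * 1000 },      // 1 hour old
  { url: "b", score: 1000, updated: now - 24 * 60 * 60 * 1000 }, // 1 day old
];

const ranked = docs
  .map(d => ({ ...d, ranking: d.score / (now - d.updated) })) // $project + $divide + $subtract
  .sort((a, b) => b.ranking - a.ranking);                     // descending: newest first

console.log(ranked.map(d => d.url)); // → [ 'a', 'b' ]: the older document sorts last
```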

MongoDB fetch documents with sort by count

I have a document with sub-document which looks something like:
{
"name" : "some name1"
"like" : [
{ "date" : ISODate("2012-11-30T19:00:00Z") },
{ "date" : ISODate("2012-12-02T19:00:00Z") },
{ "date" : ISODate("2012-12-01T19:00:00Z") },
{ "date" : ISODate("2012-12-03T19:00:00Z") }
]
}
Is it possible to fetch the "most liked" documents (average value for the last 7 days) and sort by the count?
There are a few different ways to solve this problem. The solution I will focus on uses mongodb's aggregation framework. First, here is an aggregation pipeline that will solve your problem, following it will be an explanation/breakdown of what is happening in the command.
db.testagg.aggregate(
{ $unwind : '$likes' },
{ $group : { _id : '$_id', numlikes : { $sum : 1 }}},
{ $sort : { 'numlikes' : 1}})
This pipeline has 3 main commands:
1) Unwind: this splits up the likes field so that there is one like element per document.
2) Group: this regroups the documents by the _id field, incrementing the numlikes field for every document it finds. This fills numlikes with a number equal to the number of elements that were in likes before.
3) Sort: finally, we sort the return values in ascending order based on numlikes. In a test I ran, the output of this command is:
{"result" : [
{
"_id" : 1,
"numlikes" : 1
},
{
"_id" : 2,
"numlikes" : 2
},
{
"_id" : 3,
"numlikes" : 3
},
{
"_id" : 4,
"numlikes" : 4
}....
This is for data inserted via:
for (var i=0; i < 100; i++) {
db.testagg.insert({_id : i})
for (var j=0; j < i; j++) {
db.testagg.update({_id : i}, {'$push' : {'likes' : j}})
}
}
Note that this does not completely answer your question as it avoids the issue of picking the date range, but it should hopefully get you started and moving in the right direction.
Of course, there are other ways to solve this problem. One solution might be to just do all of the sorting and manipulations client-side. This is just one method for getting the information you desire.
EDIT: If you find this somewhat tedious, there is a ticket to add a $size operator to the aggregation framework; I invite you to watch and potentially upvote it to try to speed the addition of this new operator if you are interested.
https://jira.mongodb.org/browse/SERVER-4899
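For intuition, the three stages above can be mimicked in plain JavaScript (Node, not mongo shell) on a couple of made-up documents:

```javascript
const docs = [
  { _id: 1, likes: ["a"] },
  { _id: 2, likes: ["a", "b", "c"] },
];

// $unwind: one entry per array element, carrying the parent _id.
const unwound = docs.flatMap(d => d.likes.map(() => d._id));

// $group + $sum: count entries per _id.
const counts = {};
for (const id of unwound) counts[id] = (counts[id] || 0) + 1;

// $sort: ascending by numlikes.
const result = Object.keys(counts)
  .map(id => ({ _id: Number(id), numlikes: counts[id] }))
  .sort((a, b) => a.numlikes - b.numlikes);

console.log(result); // → [ { _id: 1, numlikes: 1 }, { _id: 2, numlikes: 3 } ]
```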
A better solution would be to keep a count field recording how many likes this document has. While you can use aggregation to do this, the performance will likely not be very good. Having an index on the count field will make read operations fast, and you can use an atomic operation to increment the counter when inserting new likes.
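That counter pattern could be sketched like this (hedged: posts, someId and likeCount are made-up names, not from the question):

```javascript
// Push the new like and bump the counter in one atomic update:
db.posts.update(
    { _id : someId },
    { $push : { like : { date : new Date() } }, $inc : { likeCount : 1 } }
);
// An index on the counter keeps the "most liked" read fast:
db.posts.createIndex({ likeCount : -1 });
db.posts.find().sort({ likeCount : -1 }).limit(10);
```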
From MongoDB v3.4 onwards, you can simplify the above aggregation query with the following:
> db.test.aggregate([
{ $unwind: "$like" },
{ $sortByCount: "$_id" }
]).pretty()
{ "_id" : ObjectId("5864edbfa4d3847e80147698"), "count" : 4 }
Also, as @ACE said, you can now use $size within a projection instead:
db.test.aggregate([
{ $project: { count: { $size : "$like" } } }
]);
{ "_id" : ObjectId("5864edbfa4d3847e80147698"), "count" : 4 }