MongoDB index not used when sorting, although prefix matches - mongodb

I'm trying to fetch a set of records from MongoDB in the most efficient way, but it goes wrong when I add a sort stage to the pipeline: the server does not use my intended index. According to the documentation, however, it should match the prefix:
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/#sort-and-non-prefix-subset-of-an-index
I have an index which looks like this:
{
    "v" : 2,
    "key" : {
        "account_id" : 1,
        "cdr_block" : 1,
        "billing.total_billed" : 1,
        "times.created" : -1
    },
    "name" : "IDX_by_account_and_block_sorted"
}
So I would suppose that when I filter on account_id, cdr_block and billing.total_billed, followed by a sort on times.created, the index would be used.
However, that is not the case. When I check the query explanation in the MongoDB shell, this one does NOT use that index, but instead uses an index composed of times.created only, so it takes a few minutes:
db.getCollection("cdr").aggregate(
    [
        {
            "$match" : {
                "account_id" : 160.0,
                "cdr_block" : ObjectId("5d11e0364f853f15824aff47"),
                "billing.total_billed" : { "$gt" : 0.0 }
            }
        },
        {
            "$sort" : { "times.created" : -1.0 }
        }
    ],
    {
        "allowDiskUse" : true
    }
);
If I leave out the $sort stage, it does use my above-mentioned index.
I thought it was perhaps because it's an aggregation, but this 'regular' query doesn't use the index either:
db.getCollection("cdr").find({
    "account_id" : 160.0,
    "cdr_block" : ObjectId("5d11e0364f853f15824aff47"),
    "billing.total_billed" : { "$gt" : 0.0 }
}).sort({ "times.created" : -1 });

$sort Operator and Performance
The $sort operator can take advantage of an index when placed at the beginning of the pipeline, or when placed before the $project, $unwind, and $group aggregation operators. If $project, $unwind, or $group occurs prior to the $sort operation, $sort cannot use any indexes.
Note also that, per the docs page linked in the question, an index can only support a sort on a non-prefix subset of its keys when the query contains equality conditions on all of the prefix keys that precede the sort key. Here billing.total_billed is filtered with a range ($gt), not an equality, so the index cannot supply the times.created order.
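To see why the range predicate blocks the index-backed sort, it helps to picture the compound index as an array kept sorted by (account_id, cdr_block, billing.total_billed, times.created). A minimal sketch in plain JavaScript (in-memory toy data, not the driver API): with equality on everything before times.created, the matching index entries form one contiguous run that is already in times.created order; with a range on billing.total_billed, several runs are concatenated and the global order is lost.

```javascript
// Model the compound index as an array of key tuples kept fully sorted:
// [account_id, cdr_block, total_billed, created], created descending per run.
const index = [
  [160, "blockA", 0, 500],
  [160, "blockA", 0, 100],
  [160, "blockA", 5, 400],
  [160, "blockA", 5, 300],
  [160, "blockA", 9, 450],
];

// Helper: is a list already sorted in descending order?
const isSortedDesc = (xs) => xs.every((x, i) => i === 0 || xs[i - 1] >= x);

// Equality on account_id, cdr_block AND total_billed: one contiguous run,
// whose `created` values come out of the index already sorted.
const equalityMatch = index
  .filter(([a, b, t]) => a === 160 && b === "blockA" && t === 5)
  .map(([, , , created]) => created);
console.log(isSortedDesc(equalityMatch)); // true -> index can supply the sort

// Range on total_billed ($gt: 0): several runs are concatenated, so the
// `created` values are NOT globally sorted; the server must sort in memory.
const rangeMatch = index
  .filter(([a, b, t]) => a === 160 && b === "blockA" && t > 0)
  .map(([, , , created]) => created);
console.log(isSortedDesc(rangeMatch)); // false -> blocking in-memory sort
```

This is only an illustration of the index key order, but it matches the equality-prefix rule the linked docs describe.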

Related

mongodb aggregation, use $sum 2 individual fields and group

As I am new to Mongo, I have a slight problem getting something done the way I need it.
I am trying to group the collection by "token_address", count the occurrences of that as totalTransfers, and also, for each token_address, sum the values of the "decimal" property.
The desired output would look like this:
{
    "token_address" : "0x2a746fb4d7338e4677c102f0ce46dae3971de1cc",
    "totalTransfers" : 4.0, // occurrences per token_address in collection
    "decimal" : 132.423 // the $sum of each decimal per token_address
}
This is how the documents look:
{
    "_id" : "BrBr1vuhNRmmaZliYopQocD2",
    "from_address" : "0x7ed77e237fa0a87fc12afb92d2999e6f90e1d43b",
    "log_index" : 442,
    "to_address" : "0x31d5e41636c2331d8be9ea9c4393a0ff4e597b6c",
    "transaction_hash" : "0x1a80b66839b021ef9c1a902f19d28b77d8e688b2e3ebb9bfc185443ae1830403",
    "_created_at" : ISODate("2022-03-21T14:09:49.894Z"),
    "_updated_at" : ISODate("2022-03-21T14:09:49.894Z"),
    "block_hash" : "0x80dfe8642f998ce7fb3e692ab574d9786efdd81ba6aeace060ae9cc919a8acbf",
    "block_number" : 14209975,
    "block_timestamp" : ISODate("2022-02-15T09:47:30.000Z"),
    "confirmed" : true,
    "decimal" : NumberDecimal("0.1206"),
    "historical" : true,
    "token_address" : "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2",
    "transaction_index" : 278,
    "value" : "120600000000000000"
}
This is my aggregation, which results in the output below:
db.EthTokenTransfers.aggregate([
    { $project : {
        token_address : 1 // Inclusion mode
    } },
    { $group : {
        _id : '$token_address',
        totalTransfers : { $sum : 1 },
        decimal : { $sum : "$decimal" }
    } }
])
{
    "_id" : "0x2a746fb4d7338e4677c102f0ce46dae3971de1cc",
    "totalTransfers" : 4.0,
    "decimal" : 0
}
Can someone point me towards the correct way of doing this? I've been trying for over an hour for such a simple task.
The problem is you are removing fields in the $project stage.
Consider running only the first stage: its output contains nothing but the token_address field.
So the incoming data for the next aggregation stage is only the token_address. That's why totalTransfers is correct (you group by the existing token_address field and $sum one for each document), but decimal is always 0, because the field no longer exists.
One solution is to also add decimal to the $project stage.
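The effect is easy to reproduce outside the server; a minimal sketch of the two pipelines in plain JavaScript (hypothetical in-memory documents, not the real aggregation engine):

```javascript
// Two hypothetical transfer documents with the same token_address.
const docs = [
  { token_address: "0x2a74", decimal: 0.25 },
  { token_address: "0x2a74", decimal: 0.5 },
];

// $project in inclusion mode keeps ONLY token_address; decimal is dropped.
const projected = docs.map(({ token_address }) => ({ token_address }));

// The $group's $sum over "$decimal": a missing field contributes nothing,
// exactly like the 0 in the question's output.
const groupSum = (input) => input.reduce((acc, d) => acc + (d.decimal ?? 0), 0);

console.log(groupSum(projected)); // 0    -> field was removed by $project
console.log(groupSum(docs));      // 0.75 -> what keeping decimal would give
```

Including decimal in the $project (or dropping the $project entirely, since $group only reads the fields it references) gives the $group stage the field it needs.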

MongoDB Sorting: Equivalent Aggregation Query

I have the following students collection:
{ "_id" : ObjectId("5f282eb2c5891296d8824130"), "name" : "Rajib", "mark" : "1000" }
{ "_id" : ObjectId("5f282eb2c5891296d8824131"), "name" : "Rahul", "mark" : "1200" }
{ "_id" : ObjectId("5f282eb2c5891296d8824132"), "name" : "Manoj", "mark" : "1000" }
{ "_id" : ObjectId("5f282eb2c5891296d8824133"), "name" : "Saroj", "mark" : "1400" }
My requirement is to sort the collection based on the 'mark' field in descending order, but it should not display the 'mark' field in the final result. The result should come as:
{ "name" : "Saroj" }
{ "name" : "Rahul" }
{ "name" : "Rajib" }
{ "name" : "Manoj" }
I tried the following query and it works fine:
db.students.find({},{"_id":0,"name":1}).sort({"mark":-1})
My MongoDB version is v4.2.8. Now the question is: what is the equivalent aggregation query for the above? I tried the following two queries, but neither gave me the desired result:
db.students.aggregate([{"$project":{"name":1,"_id":0}},{"$sort":{"mark":-1}}])
db.students.aggregate([{"$project":{"name":1,"_id":0,"mark":1}},{"$sort":{"mark":-1}}])
Why does it work in find()?
As per Cursor.Sort, When a set of results are both sorted and projected, the MongoDB query engine will always apply the sorting first.
Why doesn't it work in aggregate()?
As per Aggregation Pipeline, The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as they pass through the pipeline. Pipeline stages do not need to produce one output document for every input document; e.g., some stages may generate new documents or filter out documents.
What you need to correct:
You should change the pipeline order, because if you have not selected the mark field in $project, it is no longer available to later stages, so the $sort has nothing to operate on.
db.students.aggregate([
{ "$sort": { "mark": -1 } },
{ "$project": { "name": 1, "_id": 0 } }
])
Playground: https://mongoplayground.net/p/xtgGl8AReeH
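The difference between the two stage orders can also be checked with a quick in-memory sketch of the pipeline in plain JavaScript (using the sample documents from the question):

```javascript
const students = [
  { name: "Rajib", mark: "1000" },
  { name: "Rahul", mark: "1200" },
  { name: "Manoj", mark: "1000" },
  { name: "Saroj", mark: "1400" },
];

// $sort first (mark is still present), then $project it away.
const sortThenProject = [...students]
  .sort((a, b) => b.mark.localeCompare(a.mark)) // descending on mark
  .map(({ name }) => ({ name }));

// $project first: mark is gone, so the comparator sees missing values on
// both sides and the sort can no longer order anything meaningfully.
const projectThenSort = students
  .map(({ name }) => ({ name }))
  .sort((a, b) => (b.mark ?? "").localeCompare(a.mark ?? ""));

console.log(sortThenProject.map((d) => d.name));
// [ 'Saroj', 'Rahul', 'Rajib', 'Manoj' ]
```

With the stages reversed, the "sorted" output is just the original insertion order, mirroring what the broken aggregation returns.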

Using multikey indexes for mongo max/min - Find the newest record for a given key

I am trying to use a multi-key index to find the newest records by another key. I can't seem to make it perform.
In pseudo sql I would say
create table my_table (user_id int, post_time timestamp, content text);
create index my_index (user_id,post_time) on my_table;
I can then hit the index to find the newest post_time for each user:
select user_id,max(post_time) from my_table group by user_id
All nice and fast even with many millions of records, data will come from the index and we don't hit the table at all.
With Mongo
db.my_table.ensureIndex( { user_id:1,post_time:1} )
And query
db.my_table.aggregate([ { $group : { '_id' : '$user_id', 'max' : { $max : '$post_time' } } } ])
But this doesn't hit the index - it seems to do a (slow) table scan.
{
    "stages" : [
        {
            "$cursor" : {
                "query" : { },
                "fields" : {
                    "post_time" : 1,
                    "user_id" : 1,
                    "_id" : 0
                },
                "plan" : {
                    "cursor" : "BasicCursor",
                    "isMultiKey" : false,
                    "scanAndOrder" : false,
                    "allPlans" : [
                        {
                            "cursor" : "BasicCursor",
                            "isMultiKey" : false,
                            "scanAndOrder" : false
                        }
                    ]
                }
            }
        },
        {
            "$group" : {
                "_id" : "$user_id",
                "max" : {
                    "$max" : "$post_time"
                }
            }
        }
    ],
    "ok" : 1
}
What do I need to do here to make this query perform? Is there a better approach / data structure I should use with mongo?
Unfortunately, your aggregate query cannot be covered by any index you create.
Only the $match, $sort, and $geoNear stages can make use of indexes, and only when they occur at the beginning of the pipeline.
From the docs,
The $match and $sort pipeline operators can take advantage of an index
when they occur at the beginning of the pipeline. New in version 2.4:
The $geoNear pipeline operator takes advantage of a geospatial index.
When using $geoNear, the $geoNear pipeline operation must appear as
the first stage in an aggregation pipeline. Even when the pipeline
uses an index, aggregation still requires access to the actual
documents; i.e. indexes cannot fully cover an aggregation pipeline.
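For contrast, the trick the SQL index exploits can be sketched in plain JavaScript: with entries kept sorted by (user_id, post_time), the per-user maximum is simply the last entry of each user's run, so a single pass in index order suffices. (A toy in-memory illustration with hypothetical data, not something this aggregation pipeline will do for you.)

```javascript
// Index entries sorted by (user_id, post_time), as ensureIndex keeps them.
const indexEntries = [
  { user_id: 1, post_time: 100 },
  { user_id: 1, post_time: 250 },
  { user_id: 2, post_time: 50 },
  { user_id: 2, post_time: 300 },
  { user_id: 2, post_time: 900 },
  { user_id: 3, post_time: 10 },
];

// Walk the sorted entries once; the last entry of each user's run wins.
function maxPostTimePerUser(entries) {
  const result = {};
  for (const e of entries) {
    result[e.user_id] = e.post_time; // overwritten until the run ends
  }
  return result;
}

console.log(maxPostTimePerUser(indexEntries));
// { '1': 250, '2': 900, '3': 10 }
```

This is the access pattern the SQL plan gets from the index; in Mongo of this era the equivalent is to maintain the answer yourself (for example a separate "latest post per user" collection updated on write).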

Possible to avoid $unwind / aggregation on large array using $elemMatch and regular query?

I have a collection of documents (call it 'logs') which looks similar to this:
{
    "_id" : ObjectId("52f523892491e4d58e85d70a"),
    "ds_id" : "534d35d72491de267ca08e96",
    "eT" : NumberLong(1391784000),
    "vars" : [{
        "n" : "ActPow",
        "val" : 73.4186401367188,
        "u" : "kWh",
        "dt" : "REAL",
        "cM" : "AVE",
        "Q" : 99
    }, {
        "n" : "WinSpe",
        "val" : 3.06327962875366,
        "u" : "m/s",
        "dt" : "REAL",
        "cM" : "AVE",
        "Q" : 99
    }]
}
The vars array holds about 150 subdocuments, not just the two I have shown above. What I'd like to do now is to run a query which retrieves the val of the two subdocuments in the vars array that I have shown above.
Using the aggregation framework, I've been able to come up with the following:
db.logs.aggregate([
    { $match : {
        ds_id : "534d35d72491de267ca08e96",
        eT : { $lt : 1391784000 },
        vars : { $elemMatch : { n : "PowCrvVld", val : 3 } }
    } },
    { $unwind : "$vars" },
    { $match : { "vars.n" : { $in : ["WinSpe", "ActPow"] } } },
    { $project : { "vars.n" : 1, N : 1 } }
]);
While this works, I run up against the 16MB limit when running larger queries. Seeing as I have about 150 subdocuments in the vars array, I'd also like to avoid $unwind if it's possible.
Using a regular query and using $elemMatch I have been able to retrieve ONE of the values:
db.logs.TenMinLog.find({
    ds_id : "534d35d72491de267ca08e96",
    eT : { $lt : 1391784000 },
    vars : { $elemMatch : { n : "PowCrvVld", val : 3 } }
}, {
    ds_id : 1,
    vars : { $elemMatch : { n : "ActPow", cM : "AVE" } }
});
What my question comes down to is whether there's a way to use $elemMatch on an array multiple times in the <projection> part of find. If not, is there another way to easily retrieve those two subdocuments without using $unwind? I am also open to other, more performant suggestions that I may not be aware of. Thanks!
If you're using MongoDB 2.6 you can use the $redact operator to prune the elements from the vars array.
In MongoDB 2.6 you can also return results as a cursor to avoid the 16MB limit. From the docs:
In MongoDB 2.6 the aggregate command can return results as a cursor or
store the results in a collection, which are not subject to the size
limit. The db.collection.aggregate() returns a cursor and can return
result sets of any size.
I'd strongly consider a move to MongoDB version 2.6. Aggregation has been enhanced to return a cursor which eliminates the 16MB document limit:
Changed in version 2.6:
The db.collection.aggregate() method returns a cursor and can return
result sets of any size. Previous versions returned all results in a
single document, and the result set was subject to a size limit of 16
megabytes.
http://docs.mongodb.org/manual/core/aggregation-pipeline/
Also there are a number of enhancements that you may find useful for more complex aggregation queries:
Aggregation Enhancements
The aggregation pipeline adds the ability to return result sets of any
size, either by returning a cursor or writing the output to a
collection. Additionally, the aggregation pipeline supports variables
and adds new operations to handle sets and redact data.
The db.collection.aggregate() now returns a cursor, which enables the
aggregation pipeline to return result sets of any size. Aggregation
pipelines now support an explain operation to aid analysis of
aggregation operations. Aggregation can now use a more efficient
external-disk-based sorting process.
New pipeline stages:
$out stage to output to a collection.
$redact stage to allow additional control to accessing the data.
New or modified operators:
set expression operators.
$let and $map operators to allow for the use of variables.
$literal operator and $size operator.
$cond expression now accepts either an object or an array.
http://docs.mongodb.org/manual/release-notes/2.6/
Maybe this works.
db.logs.TenMinLog.find({
    ds_id : "534d35d72491de267ca08e96",
    eT : { $lt : 1391784000 },
    $or : [
        { vars : { $elemMatch : { n : "PowCrvVld", val : 3 } } },
        { vars : { $elemMatch : { n : <whatever>, val : <whatever> } } }
    ]
}, {
    ds_id : 1,
    vars : { $elemMatch : { n : "ActPow", cM : "AVE" } }
});
Hope it works as you want.
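Whichever server-side route you take ($redact on 2.6, or keeping the documents whole some other way), the pruning itself is simple to express per document; a plain JavaScript sketch with a trimmed-down version of the sample document (hypothetical values), showing what a $redact-style prune would leave behind:

```javascript
// Trimmed-down version of a log document (values are illustrative).
const doc = {
  _id: "52f523892491e4d58e85d70a",
  vars: [
    { n: "ActPow", val: 73.41, u: "kWh" },
    { n: "WinSpe", val: 3.06, u: "m/s" },
    { n: "PowCrvVld", val: 3, u: "" },
  ],
};

// Keep only the array elements we care about -- the document keeps its
// shape and no per-element documents are ever produced (unlike $unwind).
const wanted = new Set(["WinSpe", "ActPow"]);
const pruned = {
  ...doc,
  vars: doc.vars.filter((v) => wanted.has(v.n)),
};

console.log(pruned.vars.map((v) => v.val)); // [ 73.41, 3.06 ]
```

On newer servers (3.2+) the $filter aggregation operator expresses exactly this per-document prune inside a $project, without unwinding the array.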

group in aggregate framework stopped working properly

I hate this kind of question, but maybe you can point me to the obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents, which has a string field called username on which I have an index. The index was non-unique, but recently I made it unique. Suddenly, the following query gives me false alarms that I have duplicates:
db.users.aggregate(
    { $group : { _id : "$username", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } }
);
which returns
{
    "result" : [
        {
            "_id" : "davidbeges",
            "total" : 2
        },
        {
            "_id" : "jesusantonio",
            "total" : 2
        },
        {
            "_id" : "elesitasweet",
            "total" : 2
        },
        {
            "_id" : "theschoolofbmx",
            "total" : 2
        },
        {
            "_id" : "longflight",
            "total" : 2
        },
        {
            "_id" : "thenotoriouscma",
            "total" : 2
        }
    ],
    "ok" : 1
}
I tested this query on sample collection with few documents and it works as expected.
Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.
I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match, $sort, $limit, $skip
And they will work if placed before:
$project, $unwind, $group
However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.
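The moved-document explanation can be mimicked with a toy model in plain JavaScript (hypothetical data): a document rewritten at the end of the file after the cursor has passed it is returned twice by a collection scan, while a scan in unique-index order yields each username exactly once.

```javascript
const docs = [
  { _id: 1, username: "davidbeges" },
  { _id: 2, username: "longflight" },
  { _id: 3, username: "elesitasweet" },
];

// Toy disk-order scan: mid-scan, the given document grows, is rewritten at
// the end of the file, and the cursor meets it a second time.
function collectionScanWithMove(input, movedId) {
  const moved = input.find((d) => d._id === movedId);
  return [...input, { ...moved }]; // old copy already yielded, new copy too
}

// The $group stage: count documents per username.
const groupCounts = (seen) =>
  seen.reduce((acc, d) => {
    acc[d.username] = (acc[d.username] || 0) + 1;
    return acc;
  }, {});

// Collection scan: davidbeges is counted twice -> a false duplicate.
console.log(groupCounts(collectionScanWithMove(docs, 1)));

// Index-order scan: the unique index has one entry per username, so a
// $sort on username in front of the $group sees each document only once.
const indexScan = [...docs].sort((a, b) => a.username.localeCompare(b.username));
console.log(groupCounts(indexScan));
```

This is only a model of the behavior described in the JIRA response, but it shows why putting {$sort: {username: 1}} at the front of the pipeline makes the false duplicates disappear.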