Aggregate framework can't use indexes - mongodb

I run this command:
db.ads_view.aggregate({$group: {_id : "$campaign", "action" : {$sum: 1} }});
ads_view : 500 000 documents.
this queries take 1.8s . this is its profile : https://gist.github.com/afecec63a994f8f7fd8a
indexed : db.ads_view.ensureIndex({campaign: 1});
But mongodb don't use index. Anyone know if can aggregate framework use indexes, how to index this query.

This is a late answer, but since $group in Mongo as of version 4.0 still won't make use of indexes, it may be helpful for others.
To speed up your aggregation significantly, performe a $sort before $group.
So your query would become:
db.ads_view.aggregate({$sort:{"campaign":1}},{$group: {_id : "$campaign", "action" : {$sum: 1} }});
This assumes an index on campaign, which should have been created according to your question. In Mongo 4.0, create the index with db.ads_view.createIndex({campaign:1}).
I tested this on a collection containing 5.5+ Mio. documents. Without $sort, the aggregation would not have finished even after several hours; with $sort preceeding $group, aggregation is taking a couple of seconds.

The $group operator is not one of the ones that will use an index currently. The list of operators that do (as of 2.2) are:
$match
$sort
$limit
$skip
From here:
http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes
Based on the number of yields going on in the gist, I would assume you either have a very active instance or that a lot of this data is not in memory when you are doing the group (it will yield on page fault usually too), hence the 1.8s
Note that even if $group could use an index, and your index covered everything being grouped, it would still involve a full scan of the index to do the group, and would likely not be terrible fast anyway.

$group doesn't use an index because it doesn't have to. When you $group your items you're essentially indexing all documents passing through the $group stage of the pipeline using your $group's _id. If you used an index that matched the $group's _id, you'd still have to pass through all the docs in the index so it's the same amount of work.

Related

Can an index on a subfield cover queries on projections of that field?

Imagine you have a schema like:
[{
name: "Bob",
naps: [{
time: 2019-05-01T15:35:00,
location: "sofa"
}, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
{$unwind: naps},
{$group: { _id: {$day: "$naps.time"}, napsOnDay: {"$sum": 1 } }
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the time Date field could have been. Why is this? How can I get mongo to use the index for the more optimal query?
Indexes stores pointers to actual documents, and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match or $sort does not mutate the actual documents, and thus indexes can be used in these stages.
In contrast, $unwind, $group, or any other stages that changes the actual document representations basically loses the connection between the index and the material documents.
Additionally, when those stages are processed without $match, you're basically saying that you want to process the whole collection. There is no point in using the index if you want to process the whole collection.

MongoDB aggregation $lookup to a field that is an indexed array

I am trying a fairly complex aggregate command on two collections involving $lookup pipeline. This normally works just fine on simple aggregation as long as index is set on foreignField.
But my $lookup is more complex as the indexed field is not just a normal Int64 field but actually an array of Int64. When doing a simple find(), it is easy to verify using explain() that the index is being used. But explaining the aggregate pipeline does not explain whether index is being used in the $lookup pipeline. All my timing tests seem to indicate that the index is not being used. MongoDB version is 3.6.2. Db compatibility is set to 3.6.
As I said earlier, I am not using simple foreignField lookup but the 3.6-specific pipeline + $match + $expr...
Could using pipeline be showstopper for the index? Does anyone have any deep experience with the new $lookup pipeline syntax and/or the index on an array field?
Examples
Either of the following works fine and if explained, shows that index on followers is being used.
db.col1.find({followers: {$eq : 823778}})
db.col1.find({followers: {$in : [823778]}})
But the following one does not seem to make use of the index on followers [there are more steps in the pipeline, stripped for readability].
db.col2.aggregate([
{$match:{field: "123"}},
{$lookup:{
from: "col1",
let : {follower : "$follower"},
pipeline: [{
$match: {
$expr: {
$or: [
{ $eq : ["$follower", "$$follower"] },
{ $in : ["$$follower", "$followers"]}
]
}
}
}],
as: "followers_all"
}
}])
This is a missing feature which is going to part of 3.8 version.
Currently eq matches in lookup sub pipeline are optimised to use indexes.
Refer jira fixed in 3.7.1 ( dev version).
Also, this may be relevant as well for non-multi key indexes.

Combine MonogDB aggregate function with $in or $nin

I'm trying to query data using MongoDB query which combine $match together with $nin operators:
coll = mdb.get_collection_by_name('Client')
query = [
{"$group": {"_id": {"Domain": "$Domain"}, "count":{"$sum":1}}},
{"$match": { "UserName": { "$nin": excluded_users}}},
{"$sort": { "count": -1 } }
]
res = list(coll.aggregate(query))
print(res)
As it seems, MongoDB ignores my $match line and calculate records which contains excluded users as well. I tried to search for similar examples but didn't find queries which combine $match with $in or $nin. Is that supported?
Am I doing something wrong? What is the best way to combine aggregate queries with $nin or $in operators?
Thanks
A $group step removes all fields which are not explicitly stated. That means the documents which arrive in your $match stage only have the fields _id and count. There is no field UserName anymore you could match by. If you want to filter by UserName, you need to make sure the UserName's are also part of the documents generated by $group.
Because $group squashes multiple documents into one, you need to specify how to deal if multiple of the grouped documents have different values for UserName. Options are $last/$first to use the value of the last/first aggregated document (warning: without an explicit $sort before $group, the order is unpredictable), $addToSet to create an array of all unique values or $push to create an array of all values with duplicates.
But chances are, that you want to filter the documents with the unwanted documents before you group them, so a better solution might be to move $match before $group.

difference between aggregate ($match) and find, in MongoDB?

What is the difference between the $match operator used inside the aggregate function and the regular find in Mongodb?
Why doesn't the find function allow renaming the field names like the aggregate function?
e.g. In aggregate we can pass the following string:
{ "$project" : { "OrderNumber" : "$PurchaseOrder.OrderNumber" , "ShipDate" : "$PurchaseOrder.ShipDate"}}
Whereas, find does not allow this.
Why does not the aggregate output return as a DBCursor or a List? and also why can't we get a count of the documents that are returned?
Thank you.
Why does not the aggregate output return as a DBCursor or a List?
The aggregation framework was created to solve easy problems that otherwise would require map-reduce.
This framework is commonly used to compute data that requires the full db as input and few document as output.
What is the difference between the $match operator used inside the aggregate function and the regular find in Mongodb?
One of differences, like you stated, is the return type. Find operations output return as a DBCursor.
Other differences:
Aggregation result must be under 16MB. If you are using shards, the full data must be collected in a single point after the first $group or $sort.
$match only purpose is to improve aggregation's power, but it has some other uses, like improve the aggregation performance.
and also why can't we get a count of the documents that are returned?
You can. Just count the number of elements in the resulting array or add the following command to the end of the pipe:
{$group: {_id: null, count: {$sum: 1}}}
Why doesn't the find function allow renaming the field names like the aggregate function?
MongoDB is young and features are still coming. Maybe in a future version we'll be able to do that. Renaming fields is more critical in aggregation than in find.
EDIT (2014/02/26):
MongoDB 2.6 aggregation operations will return a cursor.
EDIT (2014/04/09):
MongoDB 2.6 was released with the predicted aggregation changes.
I investigated a few things about the aggregation and find call:
I did this with a descending sort in a table of 160k documents and limited my output to a few documents.
The Aggregation command is slower than the find command.
If you access to the data like ToList() the aggregation command is faster than the find.
if you watch at the total times (point 1 + 2) the commands seem to be equal
Maybe the aggregation automatically calls the ToList() and does not have to call it again. If you dont call ToList() afterwards the find() call will be much faster.
7 [ms] vs 50 [ms] (5 documents)

Time Complexity of $addToset vs $push when element does not exist in the Array

Given: Connection is Safe=True so Update's return will contain update information.
Say I have a documents that look like:
[{'a': [1]}, {'a': [2]}, {'a': [1,2]}]
And I issue:
coll.update({}, {'$addToSet': {'a':1}}, multi=True)
The result would be:
{u'connectionId': 28,
u'err': None,
u'n': 3,
u'ok': 1.0,
u'updatedExisting': True
}
Even when come documents already have that value. To avoid this I could issue a command.
coll.update({'a': {'$ne': 1}}, {'$push': {'a':1}}, multi=True)
What's the Time Complexity Comparison for $addToSet vs. $push with a $ne check ?
Looks like $addToSet is doing the same thing as your command: $push with a $ne check. Both would be O(N)
https://github.com/mongodb/mongo/blob/master/src/mongo/db/ops/update_internal.cpp
if speed is really important then why not use a hash:
instead of:
{'$addToSet': {'a':1}}
{'$addToSet': {'a':10}}
use:
{$set: {'a.1': 1}
{$set: {'a.10': 1}
Edit
Ok since I read your question wrong all along it turns out that actually you are looking at two different queries and judging the time complexity between them.
The first query being:
coll.update({}, {'$addToSet': {'a':1}}, multi=True)
And the second being:
coll.update({'a': {'$ne': 1}}, {'$push': {'a':1}}, multi=True)
First problem springs to mind here, no indexes. $addToSet, being an update modifier, I do not believe it uses an index as such you are doing a full table scan to accomplish what you need.
In reality you are looking for all documents that do not have 1 in a already and looking to $push the value 1 to that a array.
So 2 points to the second query even before we get into time complexity here because the first query:
Does not use indexes
Would be a full table scan
Would then do a full array scan (with no index) to $addToSet
So I have pretty much made my mind up here that the second query is what your looking for before any of the Big O notation stuff.
There is a problem to using big O notation to explain the time complexity of each query here:
I am unsure of what perspective you want, whether it is per document or for the whole collection.
I am unsure of indexes as such. Using indexes will actually create a Log algorithm on a however not using indexes does not.
However the first query would look something like: O(n) per document since:
The $addToSet would need to iterate over each element
The $addToSet would then need to do an O(1) op to insert the set if it does not exist. I should note I am unsure whether the O(1) is cancelled out or not (light reading suggests my version), I have cancelled it out here.
Per collection, without the index it would be: O(2n2) since the complexity of iterating a will expodentially increase with every new document.
The second query, without indexes, would look something like: O(2n2) (O(n) per document) I believe since $ne would have the same problems as $addToSet without indexes. However with indexes I believe this would actually be O(log n log n) (O(log n) per document) since it would first find all documents with a in then all documents without 1 in their set based upon the b-tree.
So based upon time complexity and the notes at the beginning I would say query 2 is better.
If I am honest I am not used to explaining in "Big O" Notation so this is experimental.
Hope it helps,
Adding my observation in difference between addToSet and push from bulk update of 100k documents.
when you are doing bulk update. addToSet will be executed separately.
for example,
bulkInsert.find({x:y}).upsert().update({"$set":{..},"$push":{ "a":"b" } , "$setOnInsert": {} })
will first insert and set the document. And then it executes addToSet query.
I saw clear difference of 10k between
db.collection_name.count() #gives around 40k
db.collection_name.count({"a":{$in:["b"]}}) # it gives only around 30k
But when replaced $addToSet with $push. both count query returned same value.
note: when you're not concerned about duplicate entry in array. you can go with $push.