Why is sorting documents by _id slower with $match than without it in MongoDB? - mongodb

So, I tried this query:
db.collection('collection_name').aggregate([
  {
    $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
  },
  {
    $sort: { _id: -1 }
  }
])
The query above takes 20s.
But if I try this query:
db.collection('collection_name').aggregate([{ $sort: { _id: -1 } }])
it only takes 0.7s.
Why is the query without $match actually faster than the one with $match?
Update: when I try this query
db.getCollection('callbackvirtualaccounts').aggregate([
  {
    $match: { owner_id: '5860457640b4fe652bd9c3eb' }
  },
  {
    $sort: { created: -1 }
  }
])
it only takes 0.781s.
Why is sorting by _id slower than sorting by the created field?
Note: I'm using MongoDB v3.0.0

db.collection('collection_name').aggregate([
  {
    $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
  },
  {
    $sort: { _id: -1 }
  }
])
This collection probably doesn't have an index on owner_id. Try one of the index creation commands below and rerun your query.
db.collection('collection_name').createIndex({ owner_id: 1 }) // simple index
or
db.collection('collection_name').createIndex({ owner_id: 1, _id: -1 }) // compound index
Note: if you aren't familiar with compound indexes yet, you can create simple indexes individually on all the keys used in either $match or $sort, and that should make the query efficient as well.

Query speed depends on many factors: the size of the collection, the size of the documents, the indexes defined on the collection (and whether the queries actually use them, and use them properly), the hardware (CPU, RAM, network), and other processes running at the time the query runs.
You have to tell us what indexes are defined on the collection being discussed for further analysis. This command retrieves them: db.collection.getIndexes()
Note the unique index on the _id field is created by default, and cannot be modified or deleted.
(i)
But if I try this query: db.collection.aggregate([ { $sort: { _id: -1 } } ]) it
only takes 0.7s.
The query is faster because there is an index on the _id field and it is used in the sort process. Aggregation queries can use indexes for a sort stage when the sort happens early in the pipeline. You can verify whether the index is used by generating a query plan (use explain with executionStats mode); there will be an index scan (IXSCAN) in the generated query plan.
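For example, a hedged sketch for the fast query (on v3.0, the plan for an aggregation is requested via the explain option rather than the explain() helper):
db.getCollection('collection_name').aggregate(
  [ { $sort: { _id: -1 } } ],
  { explain: true }
)
// The winning plan should contain an IXSCAN over the default _id index.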
(ii)
db.collection.aggregate([
  {
    $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
  },
  {
    $sort: { _id: -1 }
  }
])
The query above takes 20s, but the following query takes only 0.781s:
db.collection.aggregate([
  {
    $match: { owner_id: '5860457640b4fe652bd9c3eb' }
  },
  {
    $sort: { created: -1 }
  }
])
Why is sorting by _id slower than sorting by the created field?
We cannot come to any conclusions with the available information. In general, $match and $sort stages placed early in the aggregation pipeline can use indexes created on the fields used in those operations.
Generating a query plan will reveal what the issues are.
Please run explain with executionStats mode and post the query plan details for all the queries in question. There is documentation for MongoDB v3.0.0 on generating query plans using explain: db.collection.explain()
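For example, the same explain option applied to the slow query from the question (the explain() helper gained aggregate support only in later versions):
db.getCollection('collection_name').aggregate(
  [
    { $match: { owner_id: '5be9b2f03ef77262c2bd49e6' } },
    { $sort: { _id: -1 } }
  ],
  { explain: true }
)
// A COLLSCAN here, in contrast to the IXSCAN for the sort-only query, would
// point to the missing owner_id index.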

Related

How does mongodb use an index to count documents?

According to the docs, db.collection.countDocuments() wraps this:
db.collection.aggregate([
  { $match: <query> },
  { $group: { _id: null, n: { $sum: 1 } } }
])
Even if there is an index, all of the matched docs will be passed into the $group to be counted, no?
If not, how is mongodb able to count the docs without processing all matching docs?
The MongoDB query planner can make some optimizations.
In that sample aggregation, it can see that no fields are required except for the ones referenced in <query>, so it can add an implicit $project stage to select only those fields.
If those fields and the _id are all included in a single index, there is no need to fetch the documents to execute that query, all the necessary information is available from the index.
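As an illustration, a hedged sketch (the orders collection, status field, and value here are hypothetical, not from the question):
db.orders.createIndex({ status: 1 })  // hypothetical index on the only queried field
db.orders.explain('executionStats').aggregate([
  { $match: { status: 'shipped' } },
  { $group: { _id: null, n: { $sum: 1 } } }
])
// Because the index covers the only referenced field, the winning plan has
// no FETCH stage: the count is answered from the index alone.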

Why does $sort on indexed fields with a $group stage not exceed the RAM limit, while $sort alone does?

I have a collection with about 50,000 items and indexes created on, e.g., name and _id.
If I use db.items.find().sort({ name: 1, _id: 1 })
or:
db.items.aggregate([
  {
    $match: {}
  },
  {
    $sort: {
      name: 1,
      _id: 1
    }
  }
])
then it exceeds the RAM limit: Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit. and I have to pass { allowDiskUse: true } to aggregate if I want this to work.
However, when I use a $group stage in the aggregation pipeline, it does not exceed the RAM limit and it works:
db.items.aggregate([
  {
    $match: {}
  },
  {
    $sort: {
      name: 1,
      _id: 1
    }
  },
  {
    $group: {
      _id: 1,
      x: {
        $push: {
          _id: '$_id'
        }
      }
    }
  }
])
Why is this happening with $sort alone, but not with $sort + $group?
I have a theory that it's connected to this feature:
If a pipeline sorts and groups by the same field and the $group stage only uses the $first accumulator operator, consider adding an index on the grouped field which matches the sort order. In some cases, the $group stage can use the index to quickly find the first document of each group.
While the pipeline optimizations and the way things "actually" run are a black box, this is the only thing I can think of (that is mentioned in the docs, at least).
I'm assuming this "optimization" kicks in, making the $group stage utilize the index, meaning the pipeline might hold less memory as it uses the index for the scan. Eventually you're not returning the name, which makes the total result smaller.
Again, this is pure speculation, but it's the best I've got.
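One way to probe the theory (a sketch on a newer shell that supports explain() for aggregate; it assumes a compound { name: 1, _id: 1 } index exists):
db.items.explain('executionStats').aggregate([
  { $sort: { name: 1, _id: 1 } },
  { $group: { _id: 1, x: { $push: { _id: '$_id' } } } }
])
// If an index satisfies the sort, the plan shows an IXSCAN and no blocking
// SORT stage, which would explain why the 32MB limit is never hit.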

MongoDb aggregate with limit and without limit

There is a collection in Mongo containing 40 million records.
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        {
          "isOfficial": true
        },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  {
    $sort: { timeline: -1 }
  }
])
This query never finishes.
But if you add a $limit before the sort, with the limit higher than the total number of records (for example, 1,000,000,000,000,000), the query is processed almost instantly:
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        {
          "isOfficial": true
        },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  {
    $limit: 10000000000000000
  },
  {
    $sort: { timeline: -1 }
  }
])
Please tell me why this is happening.
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Well, basically, for every query you run, Mongo simulates a quick "competition" between the relevant indexes in order to choose which one to use; the first index to retrieve 1001 documents "wins".
Now, usually this situation of picking the wrong index occurs with ascending or descending fields and a matching index, which lets that index win the fetching competition under certain conditions. This is very risky, as stable code can suddenly become a huge bottleneck.
What can we do?
You have a few options:
Use the hint option to make Mongo use the compound index you have ready for this pipeline (see the sketch after this list).
Drop the rogue index to ensure this will never happen again elsewhere (which is my recommended option).
Keep doing what you're doing. Basically, by adding this random $limit stage you're throwing Mongo's competition off and ensuring the right index is picked.
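A minimal sketch of option 1, the hint approach (the index name below is hypothetical; substitute the compound index that matches your $match and $sort, and note that hint as an aggregate option requires MongoDB 3.6+):
db.getCollection('feedposts').aggregate(
  [
    {
      "$match": {
        "$or": [
          { "isOfficial": true },
          {
            "creator": ObjectId("537f267c984539401ff448d2"),
            type: { $nin: ['challenge_answer', 'challenge_win'] }
          }
        ]
      }
    },
    { $sort: { timeline: -1 } }
  ],
  { hint: "isOfficial_1_timeline_-1" }  // hypothetical index name
)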

How to order MongoDB Aggregation with match, sort, and limit

My current aggregation is:
db.group_members.aggregate({
  $match: { user_id: { $in: [1, 2, 3] } }
}, {
  $group: { _id: "$group_id" }
}, {
  $sort: { last_post_at: -1 }
}, {
  $limit: 5
})
For a document structure of:
{
  _id: '...',
  user_id: '...',
  group_id: '...',
  last_post_at: Date,
}
I've also got an index on {user_id: 1, last_post_at: -1}
Since my index already includes last_post_at, is the sort useless? I'm not 100% sure how the ordering of this works.
My end goal is to replicate this SQL:
SELECT DISTINCT ON (group_id) *
FROM group_members
WHERE user_id IN (1, 2, 3)
ORDER BY last_post_at DESC
LIMIT 5
I'm wondering how to make it performant for a very large group_members and still return it in the right order.
UPDATE:
I'm hoping to find a solution that will limit the number of documents loaded into memory. This will be a fairly large collection and accessed very frequently.
Put the $sort before the $group, otherwise MongoDB can't use the index to help with sorting.
However, in your query it looks like you want to query for a relatively small number of user_ids compared to the total size of your group_members collection. So I recommend an index on user_id only. In that case MongoDB will have to sort your results in memory by last_post_at, but this is worthwhile in exchange for using an index for the initial lookup by user_id.
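A minimal sketch of the reordered pipeline under that advice; the $first accumulator and the trailing $sort/$limit are additions of mine to replicate the DISTINCT ON semantics, not something from the original question:
db.group_members.aggregate([
  { $match: { user_id: { $in: [1, 2, 3] } } },
  { $sort: { last_post_at: -1 } },
  // keep each group's most recent post time
  { $group: { _id: "$group_id", last_post_at: { $first: "$last_post_at" } } },
  { $sort: { last_post_at: -1 } },  // group output order is not guaranteed
  { $limit: 5 }
])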

Combine two fields from different documents in Mongodb

I have these documents in a collection:
{
  topic: "a",
  messages: [ObjectId("21312321321323"), ObjectId("34535345353"), ...]
},
{
  topic: "b",
  messages: [ObjectId("1233232323232"), ObjectId("6556565656565"), ...]
}
Is it possible to get a result with the combination of the messages fields? For example, I'd like to get this:
{[
ObjectId(""),ObjectId(""),ObjectId(""),ObjectId("")
]}
I thought this was possible with MapReduce, but in my case the documents don't have anything in common. Right now I'm doing this in the backend using JavaScript and loops, but I think this isn't the best option. Thanks.
You could use the $group operator in the Aggregation Framework. To use the Aggregation Framework you will want to be sure you're running MongoDB 2.2 or newer, of course.
Note that every $group stage requires an _id key (use _id: null to put everything into a single group), and since messages is an array field you need a $unwind stage first; otherwise $push would build an array of arrays.
If used with $push you will get all the lists of messages concatenated together.
db.myCollection.aggregate([{ $unwind: '$messages' },
  { $group: { _id: null, messages: { $push: '$messages' } } }]);
If used with $addToSet you will get only the distinct values.
db.myCollection.aggregate([{ $unwind: '$messages' },
  { $group: { _id: null, messages: { $addToSet: '$messages' } } }]);
And if you want to filter down the candidate documents first, you can use $match.
db.myCollection.aggregate([
  { $match: { topic: { $in: ['a', 'b'] } } },
  { $unwind: '$messages' },
  { $group: { _id: null, matches: { $sum: 1 }, messages: { $push: '$messages' } } }
]);
(After the $unwind, matches counts individual messages rather than matched documents.)
One option is to use the aggregation framework.
However, if you're planning on having a large number of results (beyond just a "lightweight" result), a result document exceeding 16MB in size, or excessive system memory usage, you'll need to loop through the objects in the collection and concatenate the results manually (as you suggest you might be doing now), or risk MongoDB throwing an exception.
Aggregation limits may be found at the bottom of this page:
http://docs.mongodb.org/manual/applications/aggregation/
Given the limitations, you may want to just use find with a projection to return just the messages.
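For instance, a minimal sketch of that approach in the shell (the client-side concatenation is my assumption about how you'd consume the cursor):
var all = [];
db.myCollection.find({}, { messages: 1, _id: 0 }).forEach(function (doc) {
  all = all.concat(doc.messages);  // append each document's array
});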
(And with anything like this, I'd strongly recommend you run some performance benchmarks to compare the options with your data on your servers, as the "Internet" currently suggests that some people have found the Aggregation support to be slower than other techniques.)