How to find max sum after $group sorting - mongodb

I have the following query:
db.pmusers.aggregate([
{
$unwind: '$preferableUsersIds'
},
{
$group:{_id: '$preferableUsersIds', number:{$sum: 1}}
},
{
$sort:{number:-1}
},
{
$limit:1
}
])
I understand that this is not the optimal solution, because it sorts all entries instead of finding only the largest one.
Does MongoDB support a more efficient way to write this?
P.S.
I know about $max but don't see how it can help me.
This variant also works, but it is no faster:
db.pmusers.aggregate([
{
$unwind: '$preferableUsersIds'
},
{
$sortByCount: "$preferableUsersIds"
},
{
$limit:1
}
])

See the answer here
MongoDB coalesces the sort and limit in an aggregation to optimise the query. You can also add an index if you're doing this a lot to make sure it's a covered query.
Text from linked answer:
According to the MongoDB documentation:
When a $sort immediately precedes a $limit, the optimizer can coalesce
the $limit into the $sort. This allows the sort operation to only
maintain the top n results as it progresses, where n is the specified
limit, and MongoDB only needs to store n items in memory.
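To make the quoted behaviour concrete, here is a minimal sketch in plain JavaScript of a sort that only keeps the top n results as it goes. It is an illustration of the idea, not MongoDB's actual implementation; the function name topN and the sample counts are assumptions made up for this example.

```javascript
// Keep only the n largest entries seen so far instead of sorting everything.
// `key` extracts the value to sort by (descending).
function topN(items, n, key) {
  const best = []; // sorted descending by key, never longer than n
  for (const item of items) {
    // Find the position that keeps `best` in descending order.
    let pos = best.length;
    while (pos > 0 && key(best[pos - 1]) < key(item)) pos--;
    if (pos < n) {
      best.splice(pos, 0, item);
      if (best.length > n) best.pop(); // drop the smallest entry
    }
  }
  return best;
}

// Hypothetical output of the $group stage from the question.
const counts = [
  { _id: "u1", number: 3 },
  { _id: "u2", number: 7 },
  { _id: "u3", number: 5 },
];

// Equivalent in spirit to { $sort: { number: -1 } } followed by { $limit: 1 }.
const top = topN(counts, 1, (d) => d.number);
console.log(top); // [ { _id: 'u2', number: 7 } ]
```

With n = 1 this degenerates into a single pass that tracks the maximum, which is why the $sort + $limit pair in the question is not as wasteful as it looks.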

Related

Why $sort on indexed fields with $group stage does not exceed RAM limit, but $sort alone does?

I have a collection of about 50,000 items, with indexes on e.g. name and _id.
If I use db.items.find().sort({ name: 1, _id: 1 })
or:
db.items.aggregate([
{
$match: {}
},
{
$sort: {
name: 1,
_id: 1
}
}
])
then it exceeds the RAM limit: Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit. and I have to pass { allowDiskUse: true } to aggregate to make it work.
However, when I add a $group stage to the aggregation pipeline, it does not exceed the RAM limit and it works:
db.items.aggregate([
{
$match: {}
},
{
$sort: {
name: 1,
_id: 1
}
},
{
$group: {
_id: 1,
x: {
$push: {
_id: '$_id'
}
}
}
}
])
Why is this happening with $sort alone, but not with $sort + $group?
I have a theory it's connected to this feature.
If a pipeline sorts and groups by the same field and the $group stage only uses the $first accumulator operator, consider adding an index on the grouped field which matches the sort order. In some cases, the $group stage can use the index to quickly find the first document of each group.
How the pipeline optimizations actually run is a black box, but this is the only thing I can think of (that is mentioned in the docs, at least).
I'm assuming this optimization kicks in and lets the $group stage use the index, so the pipeline may hold less in memory because it scans the index instead. Also, you are not returning name, which makes the total result smaller.
Again, this is pure speculation, but it's the best I've got.

MongoDb aggregate with limit and without limit

There is a MongoDB collection with 40 million records.
db.getCollection('feedposts').aggregate([
{
"$match": {
"$or": [
{
"isOfficial": true
},
{
"creator": ObjectId("537f267c984539401ff448d2"),
type: { $nin: ['challenge_answer', 'challenge_win']}
}
],
}
},
{
$sort: {timeline: -1}
}
])
This request never finishes.
But if you add a $limit before the $sort, with a limit known to be higher than the total number of records (for example 10,000,000,000,000,000), the request completes instantly:
db.getCollection('feedposts').aggregate([
{
"$match": {
"$or": [
{
"isOfficial": true
},
{
"creator": ObjectId("537f267c984539401ff448d2"),
type: { $nin: ['challenge_answer', 'challenge_win']}
}
],
}
},
{
$limit: 10000000000000000
},
{
$sort: {timeline: -1}
}
])
Please tell me why this is happening?
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Basically, for every query Mongo runs a quick "competition" between the relevant indexes to choose which one to use; the first index to retrieve 101 documents "wins".
Picking the wrong index usually happens with ascending or descending fields and a matching index, which lets that index win the fetching competition under certain conditions. This is very risky, because stable code can suddenly become a huge bottleneck.
What can we do?
You have a few options:
Use the hint option to make Mongo use the compound index you have ready for this pipeline.
Drop the rogue index to ensure this never happens again elsewhere (this is my recommended option).
Keep doing what you're doing. By adding this arbitrary $limit stage you throw Mongo's competition off and ensure the right index is picked.
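As a rough sketch of the first option, the hint is passed in the options document of aggregate (supported since MongoDB 3.6), not as a pipeline stage. The key pattern below is purely hypothetical, since the question does not say which indexes exist; the creator id is kept as a plain string so the snippet runs outside the shell.

```javascript
// Stand-in for ObjectId("537f267c984539401ff448d2"); wrap it in ObjectId(...) in mongosh.
const creatorId = "537f267c984539401ff448d2";

// Hypothetical key pattern (assumption): use an index that really exists,
// as listed by db.getCollection('feedposts').getIndexes().
const hintedIndex = { creator: 1, timeline: -1 };

const pipeline = [
  {
    $match: {
      $or: [
        { isOfficial: true },
        { creator: creatorId, type: { $nin: ["challenge_answer", "challenge_win"] } },
      ],
    },
  },
  { $sort: { timeline: -1 } },
];

// The hint goes in the options document, not in the pipeline.
const options = { hint: hintedIndex };

// In mongosh this would be run as:
//   db.getCollection('feedposts').aggregate(pipeline, options)
console.log(JSON.stringify(options)); // {"hint":{"creator":1,"timeline":-1}}
```

Comparing the explain output with and without the hint is a reasonable way to confirm that the hinted index is the one you actually want.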

Why sort document by id is slower with $match than not in mongodb?

So, I tried to query
db.collection('collection_name').aggregate([
{
$match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
},
{
$sort: { _id: -1 }
}])
The query above takes about 20s,
but if I query
db.collection('collection_name').aggregate([{$sort : {_id : -1}}])
it only takes 0.7s.
Why is the query without $match actually faster than the one with $match?
Update:
When I try this query
db.getCollection('callbackvirtualaccounts').aggregate([
{
$match: { owner_id: '5860457640b4fe652bd9c3eb' }
},
{
$sort: { created: -1 }
}
])
it only takes 0.781s.
Why is sorting by _id slower than sorting by the created field?
Note: I'm using MongoDB v3.0.0
db.collection('collection_name').aggregate([
{
$match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
},
{
$sort: { _id: -1 }
}])
This collection probably doesn't have an index on owner_id. Try creating one with one of the commands below, then rerun your previous code.
db.collection('collection_name').createIndex({ owner_id: 1 }) // Simple index
or
db.collection('collection_name').createIndex({ owner_id: 1, _id: -1 }) // Compound index
Note: If you don't know how to use compound indexes yet, you can create simple indexes individually on all keys used in either the match or the sort, and that should make the query more efficient as well.
Query speed depends on a lot of factors: the size of the collection, the size of the documents, the indexes defined on the collection (and whether the queries use them properly), the hardware (CPU, RAM, network) and other processes running at the same time as the query.
For further analysis, you need to say which indexes are defined on the collection in question. This command retrieves them: db.collection.getIndexes()
Note the unique index on the _id field is created by default, and cannot be modified or deleted.
(i)
But If I tried to query: db.collection.aggregate( [ { $sort : { _id : -1 } } ] ) it's
only take 0.7s.
The query is faster because there is an index on the _id field and it is used for the sort. Aggregation queries can use an index for a $sort stage when the sort happens early in the pipeline. You can verify whether the index is used by generating a query plan (use explain with executionStats mode). If it is, there will be an index scan (IXSCAN) in the generated query plan.
(ii)
db.collection.aggregate([
{
$match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
},
{
$sort: { _id: -1 }
}
])
The query above takes up 20s.
When I try this query it's only takes 0.781s.
db.collection.aggregate([
{
$match: { owner_id: '5860457640b4fe652bd9c3eb' }
},
{
$sort: { created: -1 }
}
])
Why sort by _id is slower than by created field ?
No conclusion can be drawn from the available information. In general, $match and $sort stages that appear early in an aggregation query can use indexes created on the fields used in those operations.
Generating a query plan will reveal what the issues are.
Please run explain in executionStats mode and post the query plan details for all the queries in question. The MongoDB v3.0 documentation covers generating query plans with explain: db.collection.explain()
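When reading a query plan, the first thing to look for is whether it contains an IXSCAN or a COLLSCAN stage. Below is a small helper sketch in plain JavaScript that walks a plan tree and lists its stage names. The sample plan is hand-written for illustration and is not real output; in an actual explain result the tree usually sits under queryPlanner.winningPlan (possibly nested inside a $cursor stage for aggregations), and the exact shape varies between server versions.

```javascript
// Collect the stage names of a plan tree, outermost stage first.
// Plans nest through `inputStage` (one child) or `inputStages` (several).
function collectStages(plan, out = []) {
  if (!plan || typeof plan !== "object") return out;
  if (plan.stage) out.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, out);
  for (const child of plan.inputStages || []) collectStages(child, out);
  return out;
}

// Hand-written sample shaped like a winningPlan (illustrative only).
const samplePlan = {
  stage: "SORT",
  inputStage: {
    stage: "FETCH",
    inputStage: { stage: "IXSCAN", indexName: "owner_id_1" },
  },
};

const stages = collectStages(samplePlan);
console.log(stages); // [ 'SORT', 'FETCH', 'IXSCAN' ]
console.log(stages.includes("COLLSCAN")); // false
```

A SORT stage in the list means the sort was done in memory rather than served from an index, which is one plausible explanation for a slow $match + $sort combination.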

How to return documents where two fields have same value [duplicate]

Is it possible to find only those documents in a collection that have the same value in two given fields?
{
_id: 'fewSFDewvfG20df',
start: 10,
end: 10
}
As here start and end have the same value, this document would be selected.
I think about something like...
Collection.find({ start: { $eq: end } })
... which wouldn't work, as end has to be a value.
You can use $expr in MongoDB 3.6 and newer to compare two fields from the same document.
db.collection.find({ "$expr": { "$eq": ["$start", "$end"] } })
or with aggregation
db.collection.aggregate([
{ "$match": { "$expr": { "$eq": ["$start", "$end"] }}}
])
You have two options here. The first one is to use the $where operator.
Collection.find( { $where: "this.start === this.end" } )
The second option is to use the aggregation framework and the $redact operator.
Collection.aggregate([
{ "$redact": {
"$cond": [
{ "$eq": [ "$start", "$end" ] },
"$$KEEP",
"$$PRUNE"
]
}}
])
Which one is better?
The $where operator evaluates JavaScript and can't take advantage of indexes, so a query using $where can hurt the performance of your application. See considerations. With $where, each document is converted from BSON to a JavaScript object before the $where expression runs, which adds further overhead. Of course, the query can be improved if it also includes a filter that can use an index. There is also a security risk if you build the query dynamically based on user input.
Like $where, $redact doesn't use indexes and performs a collection scan, but it performs better because it uses standard MongoDB operators. That said, the aggregation option is far better, because you can always filter the documents first with the $match operator.
$where is fine here but could be avoided. I also believe that you only need $where when you have a schema design problem. For example, adding an indexed boolean field to the document can be a good option here.
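A minimal sketch of that last suggestion, assuming the application controls all writes: compute the flag whenever a document is inserted or updated, then query the flag directly. The field name startEqualsEnd and the helper name are hypothetical.

```javascript
// Return a copy of the document with a precomputed boolean flag.
// `startEqualsEnd` is a made-up field name for this example.
function withEqualityFlag(doc) {
  return { ...doc, startEqualsEnd: doc.start === doc.end };
}

const stored = withEqualityFlag({ _id: "fewSFDewvfG20df", start: 10, end: 10 });
console.log(stored.startEqualsEnd); // true

// In the shell, the flag could then be indexed and queried without $where:
//   db.collection.createIndex({ startEqualsEnd: 1 })
//   db.collection.find({ startEqualsEnd: true })
```

The trade-off is that every code path that changes start or end has to recompute the flag, otherwise the stored value goes stale.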
This query is fast, since the fewest function calls are involved:
Collection.find("this.start == this.end");

Combine two fields from different documents in Mongodb

I have these documents in a collection :
{topic : "a",
messages : [ObjectId("21312321321323"),ObjectId("34535345353"),...]
},
{topic : "b",
messages : [ObjectId("1233232323232"),ObjectId("6556565656565"),...]
}
Is there a possibility to get a result that combines the messages fields? For example, I would like to get this:
{[
ObjectId(""),ObjectId(""),ObjectId(""),ObjectId("")
]}
I thought that this was possible with MapReduce, but in my case the documents don't have anything in common. Right now I'm doing this in the backend using JavaScript and loops, but I don't think this is the best option. Thanks.
You could use the $group operator in the Aggregation Framework. To use the Aggregation Framework you will want to be sure you're running on MongoDB 2.2 or newer, of course.
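If used with $push you will get all the lists of messages collected together, as an array of arrays (add a $unwind on messages before the $group to get one flat list). Note that $group requires an _id; null puts every document in a single group.
db.myCollection.aggregate([{ $group: { _id: null, messages: { $push: '$messages' } } }]);
If used with $addToSet you will get only the distinct values (distinct whole arrays, unless you $unwind first).
db.myCollection.aggregate([{ $group: { _id: null, messages: { $addToSet: '$messages' } } }]);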
And if you want to filter down the candidate documents first, you can use $match.
db.myCollection.aggregate([
{ $match: { topic: { $in: [ 'a', 'b' ] } } },
{ $group: { _id: null, matches: { $sum: 1 }, messages: { $push: '$messages' } } }
]);
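To see what these accumulators would produce, here is a plain JavaScript emulation on two tiny documents. It is only a sketch of the semantics, with short strings standing in for the ObjectId values, and it does not talk to a database.

```javascript
// Two sample documents; strings stand in for ObjectId values.
const docs = [
  { topic: "a", messages: ["m1", "m2"] },
  { topic: "b", messages: ["m2", "m3"] },
];

// What { $push: '$messages' } collects: one inner array per document.
const pushed = docs.map((d) => d.messages);
console.log(JSON.stringify(pushed)); // [["m1","m2"],["m2","m3"]]

// What a $unwind on messages followed by $push collects: one flat list.
const flattened = docs.flatMap((d) => d.messages);
console.log(JSON.stringify(flattened)); // ["m1","m2","m2","m3"]

// What a $unwind followed by $addToSet collects: distinct values.
// (MongoDB does not guarantee the order of $addToSet results.)
const distinct = [...new Set(flattened)];
console.log(JSON.stringify(distinct)); // ["m1","m2","m3"]

// The corresponding pipeline in the shell would look like:
//   db.myCollection.aggregate([
//     { $unwind: '$messages' },
//     { $group: { _id: null, messages: { $addToSet: '$messages' } } }
//   ])
```

If a single flat list is the goal, the $unwind stage is what makes the difference; without it each element of the result is a whole messages array.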
One option is to use the aggregation framework.
However, if you're planning on having a large number of results (beyond just a "lightweight" result), a result document exceeding 16MB in size, or using excessive system memory, you'll need to just loop through the objects in the collection and concatenate the results manually (as you suggest you might be doing now) or risk mongodb throwing an exception.
Aggregation limits may be found at the bottom of this page:
http://docs.mongodb.org/manual/applications/aggregation/
Given the limitations, you may want to just use find with a projection to return just messages.
(And with anything like this, I'd strongly recommend you do some performance benchmarks to compare options with your data on your servers as the "Internet" would suggest right now that some people have found the Aggregation support to be slower than other techniques).