When I sort a collection directly, I can sort on any field without a problem, for example:
db.getCollection('collection_1').find({SOME_ID: 20246}).sort({SOME_STATUS: -1})
But when I sort the same collection inside an aggregation that joins another collection, it no longer sorts on some of the fields. For example, the SOME_STATUS field mentioned above is no longer sorted:
db.getCollection('collection_1').aggregate([
{ $match: { SOME_ID: 20246 } },
{ $skip: 0 },
{ $limit: 10 },
{$lookup: { from: 'collection_2', localField: 'SOME_OTHER_ID', foreignField: 'SOME_OTHER_ID', as: 'SOME_OTHER_INFO'}},
{ $sort: { SOME_STATUS: 1} },
])
The $sort stage in this query has no visible effect.
What could be the catch here?
UPDATE: The problem was the order of the stages passed to the aggregate function; $sort should come before $skip. Placed last, $sort only receives the few documents that survive $skip and $limit, which may or may not contain more than one value of SOME_STATUS.
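A minimal sketch of the reordered pipeline described in the update, assuming the same collection and field names as in the question. The pipeline is built as a plain array so the stage order is easy to inspect before running it.

```javascript
// Sketch only: same stages as the question, with $sort moved ahead of paging.
const pipeline = [
  { $match: { SOME_ID: 20246 } },
  { $sort: { SOME_STATUS: 1 } },   // sort the full matched set first
  { $skip: 0 },                    // then page through the sorted result
  { $limit: 10 },
  { $lookup: {
      from: 'collection_2',
      localField: 'SOME_OTHER_ID',
      foreignField: 'SOME_OTHER_ID',
      as: 'SOME_OTHER_INFO'
  } }
];

// In the mongo shell this would be passed as:
// db.getCollection('collection_1').aggregate(pipeline)
console.log(pipeline.map((stage) => Object.keys(stage)[0]));
```

Keeping $lookup after $limit means the join only runs for the ten documents on the current page.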
CollectionB has a field (let's call it "otherId") that maps to _id fields on CollectionA.
I have a query that filters CollectionB to show specific documents. I want to return all documents from CollectionA, each with an additional boolean field: a flag that is true if that document's _id appears in the filtered results of the CollectionB query.
I came up with this aggregation:
{
from: "CollectionB",
pipeline: [
{
$match: {
< Basically the filtering query goes here >
},
}
],
as: "isOverThere"
}
This adds an isOverThere field to every result, but the field always contains the whole filtered result set. Not quite what I need...
The filtering query is a geometry query - I don't think it's relevant to specify exactly what it is for this question (it works correctly and returns the right results).
Answering my own question. Figured out a way. Left here for others + to see if anyone has comments/a better solution.
Aggregation stages:
$lookup:
{
from: "CollectionB",
localField: "_id",
foreignField: "otherId",
pipeline: [
{
$match: {
geometry: {
<< Geometry query here >>
},
},
},
],
as: "filtered",
}
$addFields
{
isOverThere: {
$toBool: {
$ne: ["$filtered", []],
},
}
}
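To make the intent of the two stages concrete, here is a small in-memory sketch in plain JavaScript. The sample documents and the flagMatches helper are hypothetical; only the _id, otherId and isOverThere names come from the question, and the filteredB array stands in for whatever the geometry query returns.

```javascript
// Hypothetical sample data: three CollectionA documents and the
// CollectionB documents that survived the (unspecified) geometry filter.
const collectionA = [{ _id: 1 }, { _id: 2 }, { _id: 3 }];
const filteredB = [{ otherId: 1 }, { otherId: 3 }];

// Rough equivalent of $lookup + $addFields: every A document is kept,
// and the flag is true when at least one filtered B document points at it.
function flagMatches(docsA, docsB) {
  const matchedIds = new Set(docsB.map((b) => b.otherId));
  return docsA.map((a) => ({ ...a, isOverThere: matchedIds.has(a._id) }));
}

const flagged = flagMatches(collectionA, filteredB);
console.log(flagged);
```

This mirrors the check `$ne: ["$filtered", []]`: a non-empty join result becomes true, an empty one becomes false.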
I have two collections:
user ( userID, liveID )
live ( liveID )
I want to get all lives with a count of how many users have the corresponding liveID associated. This is simple, here is what I did:
db.getCollection('live').aggregate([
{ $match: { /* whatever if needed */ }},
{ $lookup: {
from: 'user',
localField: 'liveID',
foreignField: 'liveID',
as: 'count'
}},
{ $addFields: { 'count': { $size: '$count' }}}, // I do this since I don't want the results, just the count
]
);
This query is pretty fast and in a dataset of 10,000 lives and 10,000 users it takes roughly 0.031 seconds.
Now I need to filter the results and keep only the lives whose count value is greater than zero. I tried adding a simple $match stage to my pipeline, { $match: { 'count': { $gt : 0 }}}, and it increases the operation time significantly, up to 1.91 seconds.
I figure I'm probably doing something suboptimal here. I tried using $project, but it only lets me modify a document, not remove it from the final dataset. I also found some examples using $lookup pipelines, but I couldn't write a query that works.
Is there something I'm missing here?
Instead of using $addFields to get the size of the count array field and then $match to keep the documents with a size greater than zero, you can combine both into a single $match stage. The $expr operator allows aggregation operators to be used within the $match stage (and also within the find method). Using $expr, build the $match stage as follows:
{ $match: { $expr: { $gt: [ { $size: "$count" }, 0 ] } } }
This stage follows the $lookup in the pipeline. Doing the work in fewer pipeline stages is good practice, and it can improve performance, especially when the number of documents being processed is large.
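Put together, the pipeline might look like the sketch below. The trailing $addFields stage is an assumption on my part: it is only needed if the numeric count should still appear in the output, as in the question.

```javascript
// Sketch of the combined pipeline for the 'live' collection.
const pipeline = [
  { $match: { /* whatever if needed */ } },
  { $lookup: {
      from: 'user',
      localField: 'liveID',
      foreignField: 'liveID',
      as: 'count'
  } },
  // Keep only lives that have at least one matching user.
  { $match: { $expr: { $gt: [ { $size: '$count' }, 0 ] } } },
  // Assumption: replace the joined array with its size if the count is still wanted.
  { $addFields: { count: { $size: '$count' } } }
];

// In the mongo shell: db.getCollection('live').aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```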
Depending on how many lives match your initial condition, it might be better to start from the users, then join with live and apply the match afterwards:
db.getCollection('user').aggregate([
{
$group: {
_id: '$liveID',
count: { $sum: 1 } // if you need the count
}
},
{
$lookup: {
from: 'live',
localField: '_id',
foreignField: 'liveID',
as: 'live'
}
},
{
$unwind: '$live'
}, {
$replaceRoot: { newRoot: { $mergeObjects: ['$live', { count: '$count' }] } }
}, {
$match: { /* whatever if needed */ }
}]);
I'm joining two collections using aggregation and using $out to generate a new collection with the joined data, but it takes too long (it never finishes).
How can I enhance the performance?
The collection A has 260k documents and collection B has 17.2m.
I have already tested the same script with different data sets and it works fine. At first glance, the issue seems to be related to the size of the collections.
db.collection_A.aggregate([
{
$match : { property_X: "X" }
},
{ "$lookup":
{
from: "collection_B",
localField: "property_A",
foreignField: "property_B",
as: "joined_data"
}
},
{ $unwind:
{
path: "$joined_data",
preserveNullAndEmptyArrays: false
}
},
{ $project:
{
"_id": 0,
"joined_data": 1
}
},
{ $replaceRoot:
{ newRoot: "$joined_data" }
},
//{ $limit : 1 }
{ $out: "new_collection"}
]);
The expected result is the creation of the collection "new_collection" containing the data selected by the $match and $lookup conditions.
The aggregation query itself was fine; the problem was a missing index in MongoDB.
After creating the index, the query performed much better.
The index was created with:
db.collection_B.createIndex( { property_B: 1 } );
I have two collections
Posts:
{
"_Id": "1",
"_PostTypeId": "1",
"_AcceptedAnswerId": "192",
"_CreationDate": "2012-02-08T20:02:48.790",
"_Score": "10",
...
"_OwnerUserId": "6",
...
},
...
and users:
{
"_Id": "1",
"_Reputation": "101",
"_CreationDate": "2012-02-08T19:45:13.447",
"_DisplayName": "Geoff Dalgas",
...
"_AccountId": "2"
},
...
and I want to find users who have written between 5 and 15 posts.
This is what my query looks like:
db.posts.aggregate([
{
$lookup: {
from: "users",
localField: "_OwnerUserId",
foreignField: "_AccountId",
as: "X"
}
},
{
$group: {
_id: "$X._AccountId",
posts: { $sum: 1 }
}
},
{
$match : {posts: {$gte: 5, $lte: 15}}
},
{
$sort: {posts: -1 }
},
{
$project : {posts: 1}
}
])
and it is terribly slow. For 6k users and 10k posts it took over 40 seconds to get a response, while in a relational database I get a response in a split second.
Where's the problem? I'm just getting started with MongoDB and it's quite possible that I messed up this query.
from https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
foreignField Specifies the field from the documents in the from
collection. $lookup performs an equality match on the foreignField to
the localField from the input documents. If a document in the from
collection does not contain the foreignField, the $lookup treats the
value as null for matching purposes.
This will be performed the same as any other query.
If you don't have an index on the _AccountId field, it will do a full collection scan of users for each one of the 10,000 posts. The bulk of the time will be spent in those scans.
db.users.createIndex({ _AccountId: 1 })
speeds up the process, so it does 10,000 index lookups instead of 10,000 collection scans.
In addition to bauman.space's suggestion to put an index on the _AccountId field (which is critical), you should also do your $match stage as early as possible in the aggregation pipeline (ideally as the first stage). Even though it won't use any indexes (unless you index the posts field), it will filter the result set before doing the $lookup (join) stage.
The reason why your query is terribly slow is that for every post, it is doing a non-indexed lookup (sequential read) for every user. That's around 60m reads!
Check out the Pipeline Optimization section of the MongoDB Aggregation Docs.
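The "around 60m reads" figure follows directly from the sizes given in the question (10k posts, 6k users). A rough back-of-the-envelope sketch, assuming one full scan of users per post without an index and counting one index probe per post with an index (the cost of walking the index itself is ignored here):

```javascript
// Sizes taken from the question; the cost model is a deliberate simplification.
const posts = 10000;
const users = 6000;

// Without an index on _AccountId, each post's $lookup reads every user document.
const readsWithoutIndex = posts * users;

// With an index, each post's $lookup is a single index probe.
const probesWithIndex = posts;

console.log(readsWithoutIndex); // 60000000
console.log(probesWithIndex);   // 10000
```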
Use $match first, then $lookup. $match reduces the number of documents that $lookup has to examine, which makes the pipeline more efficient.
Since you're going to group by the user's _AccountId anyway, you should $group by _OwnerUserId first, and run the $lookup only after filtering for accounts with 5 <= postsCount <= 15. This reduces the number of lookups:
db.posts.aggregate([{
$group: {
_id: "$_OwnerUserId",
postsCount: {
$sum: 1
},
posts: {
$push: "$$ROOT"
} //if you need to keep original posts data
}
},
{
$match: {
postsCount: {
$gte: 5,
$lte: 15
}
}
},
{
$lookup: {
from: "users",
localField: "_id",
foreignField: "_AccountId",
as: "X"
}
},
{
$unwind: "$X"
},
{
$sort: {
postsCount: -1
}
},
{
$project: {
postsCount: 1,
X: 1
}
}
])
The title says it all. If a document has no matching outer document for its matching field, how come it's not included in the pipeline's result set?
I'm testing out the new aggregators in Mongo 3.2 and I've gone so far as to perform a nested array lookup by first unwinding, and then grouping the documents back up. All I have left is to have the results include all local documents that didn't meet the $lookup criteria, which is what I thought was the standard definition of "left outer join".
Here's the query:
db.users.aggregate([
{
$unwind: "$profile",
$unwind: "$profile.universities"
},
{
$lookup: {
from: "universities",
localField: "profile.universities._id",
foreignField: "_id",
as: "profile.universities"
}
},
{
$group: {
_id: "$_id",
universities: {
$addToSet: "$profile.universities"
}
}
}
]).pretty()
So if I have a user that has an empty profile.universities array, then I need it to be included in the result set regardless of the $lookup returning any matches, but it does not. How can I do this, and any reason why Mongo constructed $lookup to operate this way?
This behavior isn't related to $lookup; it's because the default behavior of $unwind is to omit documents where the referenced field is missing or an empty array.
To preserve the unwound documents even when profile.universities is an empty array, you can set its preserveNullAndEmptyArrays option to true:
db.users.aggregate([
{
$unwind: "$profile"
},
{
// Separate stage: two $unwind keys in one object literal would keep only the last one
$unwind: {
path: "$profile.universities",
preserveNullAndEmptyArrays: true
}
},
{
$lookup: {
from: "universities",
localField: "profile.universities._id",
foreignField: "_id",
as: "profile.universities"
}
},
{
$group: {
_id: "$_id",
universities: {
$addToSet: "$profile.universities"
}
}
}
]).pretty()
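To see why the option matters, here is a simplified in-memory stand-in for $unwind in plain JavaScript. The unwind helper and the sample users are hypothetical, and it only handles a top-level field (not a dotted path such as profile.universities), so treat it as an illustration of the documented behavior rather than a reimplementation.

```javascript
// Simplified emulation of $unwind on a top-level field.
function unwind(docs, field, preserveNullAndEmptyArrays = false) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    const isEmptyArray = Array.isArray(value) && value.length === 0;
    if (value === undefined || value === null || isEmptyArray) {
      // Default: the document is dropped. With the option set, it is kept.
      if (preserveNullAndEmptyArrays) {
        const kept = { ...doc };
        if (isEmptyArray) delete kept[field]; // an empty array field is omitted from the kept document
        out.push(kept);
      }
      continue;
    }
    // A non-array value is treated as a single-element array.
    const items = Array.isArray(value) ? value : [value];
    for (const item of items) out.push({ ...doc, [field]: item });
  }
  return out;
}

// Hypothetical users: one with two universities, one with none.
const users = [
  { _id: 1, universities: ['u1', 'u2'] },
  { _id: 2, universities: [] }
];

const dropped = unwind(users, 'universities');
const preserved = unwind(users, 'universities', true);
console.log(dropped.map((d) => d._id));   // [ 1, 1 ]
console.log(preserved.map((d) => d._id)); // [ 1, 1, 2 ]
```

User 2 disappears before any join happens in the default case, which is why the missing rows look like a $lookup problem even though they are not.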