MongoDB: aggregation group by a field in a large collection

I have a large collection (millions of documents) of files, where tags is an array field like this:
{
    "volume" : "abc",
    "name" : "file1.txt",
    "type" : "txt",
    "tags" : [ "Interesting", "Weird" ],
    ... many other fields
}
Now I want to return the count for each unique tag across the entire collection. I am using aggregate for that. Here's my query:
db.files.aggregate(
    { "$match" : { "volume" : "abc" } },
    { "$project" : { "tags" : 1 } },
    { "$unwind" : "$tags" },
    { "$group" : { "_id" : "$tags", "count" : { "$sum" : 1 } } },
    { "$sort" : { "count" : 1 } }
)
I am seeing that it takes around 3 seconds for this to return on a collection of 1.2M files, and I do have indexes on the tags and volume fields.
I am using MongoDB 2.4; since 2.6 is not out yet, I cannot use .explain() on the aggregation.
Any ideas how I can improve this performance? I do need the summary counts, and I cannot pre-compute them, as my $match will vary based on volume, type, a particular tag, the file's date/time, etc.
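Side note for readers on MongoDB 2.6 or newer: aggregate accepts an explain option, which would show whether the initial $match stage uses the { volume: 1 } index. A minimal sketch with the same pipeline:
db.files.aggregate(
    [
        { "$match" : { "volume" : "abc" } },
        { "$project" : { "tags" : 1 } },
        { "$unwind" : "$tags" },
        { "$group" : { "_id" : "$tags", "count" : { "$sum" : 1 } } },
        { "$sort" : { "count" : 1 } }
    ],
    { "explain" : true }
)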

Related

Indexing MongoDB for sort consistency

The MongoDB documentation says that MongoDB doesn't store documents in a collection in a particular order. So if you have this collection:
db.restaurants.insertMany( [
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan"},
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"},
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn"},
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan"},
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn"},
] );
and sorting like this:
db.restaurants.aggregate(
[
{ $sort : { borough : 1 } }
]
)
Then the sort order can be inconsistent since:
the borough field contains duplicate values for both Manhattan and Brooklyn. Documents are returned in alphabetical order by borough, but the order of those documents with duplicate values for borough might not be the same across multiple executions of the same sort.
To return a consistent result, it's recommended to modify the query to:
db.restaurants.aggregate(
[
{ $sort : { borough : 1, _id: 1 } }
]
)
My question relates to the efficiency of such a query. Let's say you have millions of documents: should you create a compound index, something like { borough: 1, _id: -1 }, to make it efficient? Or is it enough to index { borough: 1 }, given the potentially special nature of the _id field?
I'm using MongoDB 4.4.
If you need a stable sort, you will have to sort on both fields, and for a performant query you will need a compound index on both fields. Note that an index can only support a sort whose directions match the index key pattern or its exact reverse, so for { $sort: { borough: 1, _id: 1 } } the index should be:
{ borough: 1, _id: 1 }
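A minimal sketch of creating that index in the shell (collection name taken from the question):
db.restaurants.createIndex( { borough: 1, _id: 1 } )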

Mongodb accessing documents

I've the following db:
{ "_id" : 1, "results" : [ { "product" : "abc", "score" : 10 }, { "product" : "xyz", "score" : 5 } ] }
{ "_id" : 2, "results" : [ { "product" : "abc", "score" : 8 }, { "product" : "xyz", "score" : 7 } ] }
{ "_id" : 3, "results" : [ { "product" : "abc", "score" : 7 }, { "product" : "xyz", "score" : 8 } ] }
I want to show the first score of each _id. I tried the following:
db.students.find({},{"results.$":1})
But it doesn't seem to work. Any advice?
You can take advantage of the aggregation pipeline to solve this.
Use $project in conjunction with $arrayElemAt to point at the appropriate index in the array.
So, to extract the document holding the first score, I have written the query below:
db.students.aggregate([ {$project: { scoredoc:{$arrayElemAt:["$results",0]}} } ]);
If you just wish to have the scores without the product, use $results.score as shown below:
db.students.aggregate([ {$project: { scoredoc:{$arrayElemAt:["$results.score",0]}} } ]);
In both cases scoredoc will hold the first element of each results array (the whole sub-document in the first query, just its score in the second).
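Against the three sample documents above, the first query should return:
{ "_id" : 1, "scoredoc" : { "product" : "abc", "score" : 10 } }
{ "_id" : 2, "scoredoc" : { "product" : "abc", "score" : 8 } }
{ "_id" : 3, "scoredoc" : { "product" : "abc", "score" : 7 } }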
Hope this helps!
Per the description above, please try executing the following query in the MongoDB shell:
db.students.find(
    { results: { $elemMatch: { score: { $exists: true } } } },
    { 'results.$': 1 }
)
According to the MongoDB documentation:
The positional $ operator limits the contents of an array from the
query results to contain only the first element matching the query
document.
Hence, in the query above, the positional $ operator is used in the projection to retrieve the first matching results element of each document.
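Against the sample documents, this should return the first element of each results array:
{ "_id" : 1, "results" : [ { "product" : "abc", "score" : 10 } ] }
{ "_id" : 2, "results" : [ { "product" : "abc", "score" : 8 } ] }
{ "_id" : 3, "results" : [ { "product" : "abc", "score" : 7 } ] }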

MongoDB select documents where field1 equals nested.field2 in aggregate pipeline

I have joined two collections on one field using '$lookup', while actually I need two fields to have a unique match. My next step would be to unwind the array containing the different values of the second field and then compare these to the value of the second field higher up in the document. However, the second line in the snippet below returns no results.
// Request only the page that has been viewed
{ '$unwind' : '$DSpub.PublicationPages'},
{ '$match' : {'pageId' : '$DSpub.PublicationPages.PublicationPageId' } }
Is there a more appropriate way to do this? Or can I avoid doing this altogether by unwinding the "from" collection before performing the '$lookup', and then match both fields?
This is not as easy as it looks.
$match does not operate on dynamic data; it compares fields against static values, not against other fields in the same document. To overcome that, we can use a $project phase to add a static boolean flag that can then be utilized by $match.
Please see example below:
Having input collection like this:
[{
"_id" : ObjectId("56be1b51a0f4c8591f37f62b"),
"name" : "Alice",
"sub_users" : [{
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
]
}, {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a"),
"name" : "Bob",
"sub_users" : [{
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
]
}
]
We want to get only the documents where _id and docs.sub_users._id are the same, where docs is the $lookup output:
db.collection.aggregate([{
$lookup : {
from : "collecction",
localField : "_id",
foreignField : "_id",
as : "docs"
}
}, {
$unwind : "$docs"
}, {
$unwind : "$docs.sub_users"
}, {
$project : {
_id : 0,
fields : "$$ROOT",
matched : {
$eq : ["$_id", "$docs.sub_users._id"]
}
}
}, {
$match : {
matched : true
}
}
])
that gives output:
{
"fields" : {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a"),
"name" : "Bob",
"sub_users" : [
{
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
],
"docs" : {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a"),
"name" : "Bob",
"sub_users" : {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
}
},
"matched" : true
}
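Side note: on MongoDB 3.6 or newer, $match supports $expr, which can compare two fields directly, so the boolean-flag projection becomes unnecessary. A minimal sketch of the same pipeline:
db.collection.aggregate([
    { $lookup : { from : "collection", localField : "_id", foreignField : "_id", as : "docs" } },
    { $unwind : "$docs" },
    { $unwind : "$docs.sub_users" },
    { $match : { $expr : { $eq : [ "$_id", "$docs.sub_users._id" ] } } }
])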

MongoDB: Retrieve most referenced document

I have a MongoDB collection (called 'links') with documents like this one:
{
"_id" : ObjectId("544bc8abd4c66b0e3cf12665"),
"name" : "Pet 4056 AgR",
"file" : "P0001J01",
"quotes" : [
{
"_id" : ObjectId("544bc8afd4c66b0e3cf15173"),
"name" : "Pet 4837 ED",
"file" : "P1103J03"
},
{
"_id" : ObjectId("544bc8b6d4c66b0e3cf19425"),
"name" : "ACO 845 AgR",
"file" : "P2810J07"
},
{
"_id" : ObjectId("544bc8afd4c66b0e3cf14a77"),
"name" : "ACO 1574 AgR",
"file" : "P0924J05"
}
]
}
In my db, this means that this document references 3 other documents.
Within each document's quotes array, no two entries share the same _id/name/file. The name field is unique in the collection.
Now, I need to get the document that is the most referenced. It's the document that appears in most quotes arrays. How can I do that?
I believe this is achieved through an aggregation, but I can't figure out how to do it, especially because the names are inside an array.
Thanks! :)
You can do this using the aggregation framework, but a key feature to working with arrays is that you use the $unwind pipeline operation to first "de-normalize" the array content as separate documents:
db.links.aggregate([
// Unwind the array
{ "$unwind": "$quotes" },
// Group by the inner "name" value and count the occurrences
{ "$group": {
"_id": "$quotes.name",
"count": { "$sum": 1 }
}},
// Sort with the highest count on top
{ "$sort": { "count": -1 } },
// Just return the largest value
{ "$limit": 1 }
])
So what $unwind does here is, for each array element, take a copy of the "outer" document that owns the array and produce a new document containing the outer fields and just that single array element. Basically like this:
{
"_id" : ObjectId("544bc8abd4c66b0e3cf12665"),
"name" : "Pet 4056 AgR",
"file" : "P0001J01",
"quotes" :
{
"_id" : ObjectId("544bc8afd4c66b0e3cf15173"),
"name" : "Pet 4837 ED",
"file" : "P1103J03"
}
},
{
"_id" : ObjectId("544bc8abd4c66b0e3cf12665"),
"name" : "Pet 4056 AgR",
"file" : "P0001J01",
"quotes" :
{
"_id" : ObjectId("544bc8b6d4c66b0e3cf19425"),
"name" : "ACO 845 AgR",
"file" : "P2810J07"
}
}
This allows other aggregation pipeline stages to access the content just as in any normal document, so you can $group the occurrences on "quotes.name" without a problem.
Take a good look at all of the aggregation pipeline operators; it is worth understanding what they all do.
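Side note: on MongoDB 3.4 or newer, the $sortByCount stage combines the $group and the descending $sort into a single step. A minimal sketch:
db.links.aggregate([
    { "$unwind": "$quotes" },
    { "$sortByCount": "$quotes.name" },
    { "$limit": 1 }
])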

MongoDB: Query by size of array with a filtered value

In MongoDB, I have a collection ("users") with the following basic schema:
{
"_id" : ObjectId("50e5de00b623143995c5b739")
"name" : "Jon",
"emails_sent" : [
{
"type" : "invite",
"sent_time" : ISODate("2013-04-21T21:11:50.999Z")
},
{
"type" : "invite",
"sent_time" : ISODate("2013-04-15T21:10:35.999Z")
},
{
"type" : "follow",
"sent_time" : ISODate("2013-04-21T21:11:50.999Z")
}
]
}
I'd like to query for users based on the $size of emails_sent of a certain "type" only, e.g. only count "invite" emails. Is there any way to achieve this sort of "filtered count" in a standard mongo query?
Many thanks.
db.users.aggregate([
{$unwind:'$emails_sent'},
{$match: {'emails_sent.type':'invite'}},
{$group : {_id : '$name', sumOfEmailsSent:{$sum:1} }}
]);
BTW, in your original post you were missing a square bracket closing the emails_sent array.
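Side note: on MongoDB 3.2 or newer you can avoid the $unwind entirely by combining $size with $filter (the invitesSent field name below is just illustrative):
db.users.aggregate([
    { $project: {
        name: 1,
        // count only the "invite" entries in emails_sent
        invitesSent: {
            $size: {
                $filter: {
                    input: "$emails_sent",
                    as: "e",
                    cond: { $eq: [ "$$e.type", "invite" ] }
                }
            }
        }
    }}
])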