MongoDB: Retrieve most referenced document - mongodb

I have a MongoDB collection (called 'links') with documents like this one:
{
"_id" : ObjectId("544bc8abd4c66b0e3cf12665"),
"name" : "Pet 4056 AgR",
"file" : "P0001J01",
"quotes" : [
{
"_id" : ObjectId("544bc8afd4c66b0e3cf15173"),
"name" : "Pet 4837 ED",
"file" : "P1103J03"
},
{
"_id" : ObjectId("544bc8b6d4c66b0e3cf19425"),
"name" : "ACO 845 AgR",
"file" : "P2810J07"
},
{
"_id" : ObjectId("544bc8afd4c66b0e3cf14a77"),
"name" : "ACO 1574 AgR",
"file" : "P0924J05"
}
]
}
In my db, this means that this document references 3 other documents.
For each document, in its quotes array there are no two documents with the same id/name/file. The name field is unique in the collection.
Now, I need to get the document that is the most referenced. It's the document that appears in most quotes arrays. How can I do that?
I believe this is achieved through an aggregation, but I can't figure out how to do it, especially because the names are inside an array.
Thanks! :)

You can do this with the aggregation framework. The key to working with arrays is the $unwind pipeline stage, which first "de-normalizes" the array content into separate documents:
db.links.aggregate([
// Unwind the array
{ "$unwind": "$quotes" },
// Group by the inner "name" value and count the occurrences
{ "$group": {
"_id": "$quotes.name",
"count": { "$sum": 1 }
}},
// Sort descending so the highest count is on top
{ "$sort": { "count": -1 } },
// Just return the largest value
{ "$limit": 1 }
])
What $unwind does here is, for each array element, take a copy of the "outer" document that owns the array and produce a new document containing the outer fields and just that single array element. Basically like this:
{
"_id" : ObjectId("544bc8abd4c66b0e3cf12665"),
"name" : "Pet 4056 AgR",
"file" : "P0001J01",
"quotes" :
{
"_id" : ObjectId("544bc8afd4c66b0e3cf15173"),
"name" : "Pet 4837 ED",
"file" : "P1103J03"
}
},
{
"_id" : ObjectId("544bc8abd4c66b0e3cf12665"),
"name" : "Pet 4056 AgR",
"file" : "P0001J01",
"quotes" :
{
"_id" : ObjectId("544bc8b6d4c66b0e3cf19425"),
"name" : "ACO 845 AgR",
"file" : "P2810J07"
}
}
This allows other aggregation pipeline stages to access content just as any normal document, so you can $group the occurrences on "quotes.name" without a problem.
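To make the stages concrete, here is a minimal plain-JavaScript sketch that emulates $unwind, $group, a descending $sort and $limit on in-memory objects. It is only an illustration: the function name mostReferenced and the sample data are hypothetical, and MongoDB runs the real pipeline on the server.

```javascript
// Plain-JavaScript emulation of $unwind -> $group -> $sort -> $limit.
// Illustration only; mostReferenced is a hypothetical helper name.
function mostReferenced(docs) {
  // $unwind: one entry per element of each "quotes" array
  const unwound = docs.flatMap((doc) =>
    (doc.quotes || []).map((quote) => ({ ...doc, quotes: quote }))
  );

  // $group: count how often each quoted name occurs
  const counts = new Map();
  for (const doc of unwound) {
    counts.set(doc.quotes.name, (counts.get(doc.quotes.name) || 0) + 1);
  }

  // $sort by count descending, then $limit: 1
  const sorted = [...counts.entries()]
    .map(([name, count]) => ({ _id: name, count }))
    .sort((a, b) => b.count - a.count);
  return sorted.length > 0 ? sorted[0] : null;
}

// Hypothetical sample data shaped like the 'links' collection
const links = [
  { name: "Pet 4056 AgR", quotes: [{ name: "Pet 4837 ED" }, { name: "ACO 845 AgR" }] },
  { name: "ACO 1574 AgR", quotes: [{ name: "Pet 4837 ED" }] },
  { name: "Pet 4837 ED", quotes: [] },
];

console.log(mostReferenced(links)); // { _id: 'Pet 4837 ED', count: 2 }
```

Note that the sort has to be descending for the first result to be the most referenced name.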
Take a good look at all of the aggregation pipeline operators; it is worth understanding what they all do.

Related

MongoDB select documents where field1 equals nested.field2 in aggregate pipeline

I have joined two collections on one field using '$lookup', while actually I needed two fields to have a unique match. My next step would be to unwind the array containing different values of the second field I need for a unique match and then compare these to the value of the second field it needs to match higher up. However, the second line in the snippet below returns no results.
// Request only the page that has been viewed
{ '$unwind' : '$DSpub.PublicationPages'},
{ '$match' : {'pageId' : '$DSpub.PublicationPages.PublicationPageId' } }
Is there a more appropriate way to do this? Or can I avoid doing this altogether by unwinding the "from" collection before performing the '$lookup', and then match both fields?
This is not as easy as it looks.
A plain $match condition compares a field against a constant value; it does not resolve a field path such as '$DSpub.PublicationPages.PublicationPageId' on the right-hand side, so the string is treated as a literal. To work around that, use a $project stage to compute a boolean flag that $match can then test.
Please see the example below:
Having input collection like this:
[{
"_id" : ObjectId("56be1b51a0f4c8591f37f62b"),
"name" : "Alice",
"sub_users" : [{
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
]
}, {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a"),
"name" : "Bob",
"sub_users" : [{
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
]
}
]
We want to keep only the documents where _id and docs.sub_users._id are the same, where docs is the $lookup output.
db.collecction.aggregate([{
$lookup : {
from : "collecction",
localField : "_id",
foreignField : "_id",
as : "docs"
}
}, {
$unwind : "$docs"
}, {
$unwind : "$docs.sub_users"
}, {
$project : {
_id : 0,
fields : "$$ROOT",
matched : {
$eq : ["$_id", "$docs.sub_users._id"]
}
}
}, {
$match : {
matched : true
}
}
])
that gives output:
{
"fields" : {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a"),
"name" : "Bob",
"sub_users" : [
{
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
],
"docs" : {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a"),
"name" : "Bob",
"sub_users" : {
"_id" : ObjectId("56be1b51a0f4c8591f37f62a")
}
}
},
"matched" : true
}
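On MongoDB 3.6 or later, $match can also evaluate an aggregation expression through $expr, which compares two fields without a separate flag. A minimal sketch for the stages in the question, assuming that server version; the field names come from the question, and the variable name matchViewedPage is only for illustration:

```javascript
// Stages to place after the $lookup from the question (assumes MongoDB 3.6+).
const matchViewedPage = [
  { $unwind: "$DSpub.PublicationPages" },
  // $expr lets $match evaluate an aggregation expression, so both
  // operands of $eq can be field paths rather than literal values.
  {
    $match: {
      $expr: { $eq: ["$pageId", "$DSpub.PublicationPages.PublicationPageId"] },
    },
  },
];

console.log(JSON.stringify(matchViewedPage, null, 2));
```

The array would be spread into the pipeline passed to aggregate(), after the $lookup stage.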

How to create an index on the "name" field of a document in mongodb

I want to create an index on the name field of a document in MongoDB so that when I do a find, all the names are displayed in alphabetical order. How can I achieve this? Can anyone please help me out?
My documents in mongodb:
db.col.find();
{ "_id" : ObjectId("5696256b0c50bf42dcdfeae1"), "name" : "Daniel", "age" : 24 }
{ "_id" : ObjectId("569625850c50bf42dcdfeae2"), "name" : "Asha", "age" : 21 }
{ "_id" : ObjectId("569625a40c50bf42dcdfeae3"), "name" : "Hampi", "age" : 34 }
{ "_id" : ObjectId("5696260f0c50bf42dcdfeae5"), "name" : "Bhavana", "age" : 14 }
Actually you don't need an index in order to display your result alphabetically. What you need is the .sort() method.
db.collection.find().sort({'name': 1})
Which returns
{ "_id" : ObjectId("569625850c50bf42dcdfeae2"), "name" : "Asha", "age" : 21 }
{ "_id" : ObjectId("5696260f0c50bf42dcdfeae5"), "name" : "Bhavana", "age" : 14 }
{ "_id" : ObjectId("5696256b0c50bf42dcdfeae1"), "name" : "Daniel", "age" : 24 }
{ "_id" : ObjectId("569625a40c50bf42dcdfeae3"), "name" : "Hampi", "age" : 34 }
Creating an index on a field does not automatically sort your results by that field; you still need to use the .sort() method. See "Use Indexes to Sort Query Results" in the MongoDB documentation.
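An index can still help with the sort itself: when one exists on the sort field, MongoDB can read the documents in index order instead of sorting them in memory. A small sketch, assuming the col collection from the question and a mongo shell database handle passed in as db; the function name findSortedByName is hypothetical:

```javascript
// `db` is assumed to be a mongo shell database handle.
function findSortedByName(db) {
  // Ascending index on "name"; re-creating an identical index is a no-op.
  db.col.createIndex({ name: 1 });
  // The sort is still requested explicitly; the index only makes it cheaper.
  return db.col.find().sort({ name: 1 });
}
```

In the shell this would be called as findSortedByName(db).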
If you want to return an array of all names in your documents in ascending order then you will need to use the .aggregate() method.
The first stage in the pipeline is $sort, which sorts your documents by "name" in ascending order. The last stage is $group, which groups all documents together and uses the $push accumulator operator to build an array of names.
db.collection.aggregate([
{ "$sort": { "name": 1 } },
{ "$group": { "_id": null, "names": { "$push": "$name" } } }
])
Which yields:
{ "_id" : null, "names" : [ "Asha", "Bhavana", "Daniel", "Hampi" ] }

Mongo: how to retrieve ONLY subdocs that match certain properties

Having, for example, a collection named test and the following document is inside:
{
"_id" : ObjectId("5692ac4562c824cc5167379f"),
"list" : [
{
"name" : "elem1",
"type" : 1
},
{
"name" : "elem2",
"type" : 2
},
{
"name" : "elem3",
"type" : 1
},
{
"name" : "elem4",
"type" : 3
},
{
"name" : "elem4",
"type" : 2
}
]
}
Let's say I would like to retrieve a list of only those subdocuments inside list that match:
type = 1.
I've tried the following query:
db.getCollection('test').find({
'_id': ObjectId("5692ac4562c824cc5167379f"),
'list.type': 1
})
But the result I get contains every subdocument inside list, and I guess this is because list contains at least one subdocument whose type equals 1.
Instead, the result I am interested in would contain only the subdocuments inside list that match 'list.type': 1:
{
"_id" : ObjectId("5692ac4562c824cc5167379f"),
"list" : [
{
"name" : "elem1",
"type" : 1
},
{
"name" : "elem3",
"type" : 1
}
]
}
...so $ and $elemMatch are not what I am really looking for, as they return just the first matching element.
Does anyone know how to achieve what I am looking for?
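db.test.aggregate([
// One document per element of "list"
{ $unwind: "$list" },
// Keep only the elements with the wanted type
{ $match: { "list.type": 1 } },
// Reassemble the surviving elements into one array per original document
{ $group: { "_id": "$_id", list: { $push: "$list" } } }
])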
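On MongoDB 3.2 or later, the $filter expression can trim the array in place without unwinding and regrouping. A hedged sketch: the pipeline assumes the test collection from the question, and filterList is a hypothetical plain-JavaScript equivalent included only to show what the stage computes.

```javascript
// Pipeline for db.test.aggregate(filterByType)  (assumes MongoDB 3.2+).
const filterByType = [
  {
    $project: {
      list: {
        $filter: {
          input: "$list",
          as: "item",
          cond: { $eq: ["$$item.type", 1] },
        },
      },
    },
  },
];

// Plain-JavaScript equivalent of the $filter above, for illustration only.
function filterList(doc, type) {
  return { _id: doc._id, list: doc.list.filter((item) => item.type === type) };
}

const doc = {
  _id: "5692ac4562c824cc5167379f", // plain string stands in for the ObjectId
  list: [
    { name: "elem1", type: 1 },
    { name: "elem2", type: 2 },
    { name: "elem3", type: 1 },
    { name: "elem4", type: 3 },
    { name: "elem4", type: 2 },
  ],
};

console.log(filterList(doc, 1).list.map((item) => item.name)); // [ 'elem1', 'elem3' ]
```

Unlike the $unwind approach, this keeps a document in the output even when none of its elements match; it then has an empty list.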

MongoDB: aggregation group by a field in large collection

I have a large (millions) collection of files where tags is an array field like this.
{
"volume" : "abc",
"name" : "file1.txt",
"type" : "txt",
"tags" : [ "Interesting", "Weird" ], ...many other fields
}
Now I want to return count of unique tags for the entire collection. I am using aggregate for that. Here's my query.
db.files.aggregate(
{ "$match" : {"volume":"abc"}},
{ "$project" : { "tags" : 1}},
{ "$unwind" : "$tags"},
{ "$group" : { "_id" : "$tags" , "count" : { "$sum" : 1}}},
{ "$sort" : { "count" : 1}}
)
I am seeing that it takes around 3 seconds for this to return for a collection of 1.2M files. I do have index on tags and volume fields.
I am using MongoDB 2.4. Since 2.6 is not out, I cannot use the .explain() here.
Any ideas how I can improve this performance? I do need to have a summary count. Also, I cannot pre-compute these counts as my $match will be variable based on volume, type, a particular tag, some date time of file etc.

What's the $unwind operator in MongoDB?

This is my first day with MongoDB so please go easy with me :)
I can't understand the $unwind operator, maybe because English is not my native language.
db.article.aggregate(
{ $project : {
author : 1 ,
title : 1 ,
tags : 1
}},
{ $unwind : "$tags" }
);
The project operator is something I can understand, I suppose (it's like SELECT, isn't it?). But then, $unwind (citing) returns one document for every member of the unwound array within every source document.
Is this like a JOIN? If yes, how the result of $project (with _id, author, title and tags fields) can be compared with the tags array?
NOTE: I've taken the example from MongoDB website, I don't know the structure of tags array. I think it's a simple array of tag names.
The thing to remember is that MongoDB employs a "NoSQL" approach to data storage, so put thoughts of selects, joins, etc. out of your mind. It stores your data in the form of documents and collections, which allows for a dynamic means of adding and obtaining the data from your storage locations.
That being said, to understand the concept behind the $unwind operator, you first need to understand the example you are quoting. The example document from mongodb.org is as follows:
{
title : "this is my title" ,
author : "bob" ,
posted : new Date () ,
pageViews : 5 ,
tags : [ "fun" , "good" , "fun" ] ,
comments : [
{ author :"joe" , text : "this is cool" } ,
{ author :"sam" , text : "this is bad" }
],
other : { foo : 5 }
}
Notice how tags is actually an array of 3 items, in this case "fun", "good" and "fun".
What $unwind does is peel off one document for each element of the array and return each resulting document.
To think of this in a classical approach, it would be the equivalent of "for each item in the tags array, return a document with only that item".
Thus, the result of running the following:
db.article.aggregate(
{ $project : {
author : 1 ,
title : 1 ,
tags : 1
}},
{ $unwind : "$tags" }
);
would return the following documents:
{
"result" : [
{
"_id" : ObjectId("4e6e4ef557b77501a49233f6"),
"title" : "this is my title",
"author" : "bob",
"tags" : "fun"
},
{
"_id" : ObjectId("4e6e4ef557b77501a49233f6"),
"title" : "this is my title",
"author" : "bob",
"tags" : "good"
},
{
"_id" : ObjectId("4e6e4ef557b77501a49233f6"),
"title" : "this is my title",
"author" : "bob",
"tags" : "fun"
}
],
"ok" : 1
}
Notice that the only thing changing in the result array is what is being returned in the tags value. If you need an additional reference on how this works, I've included a link here.
$unwind duplicates each document in the pipeline, once per array element.
So if your input pipeline contains one article doc with two elements in tags, {$unwind: '$tags'} would transform the pipeline to be two article docs that are the same except for the tags field. In the first doc, tags would contain the first element from the original doc's array, and in the second doc, tags would contain the second element.
Consider the example below to understand this.
Data in a collection
{
"_id" : 1,
"shirt" : "Half Sleeve",
"sizes" : [
"medium",
"XL",
"free"
]
}
Query -- db.test1.aggregate( [ { $unwind : "$sizes" } ] );
output
{ "_id" : 1, "shirt" : "Half Sleeve", "sizes" : "medium" }
{ "_id" : 1, "shirt" : "Half Sleeve", "sizes" : "XL" }
{ "_id" : 1, "shirt" : "Half Sleeve", "sizes" : "free" }
According to the official MongoDB documentation:
$unwind Deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.
Explanation through a basic example:
A collection inventory has the following documents:
{ "_id" : 1, "item" : "ABC", "sizes": [ "S", "M", "L"] }
{ "_id" : 2, "item" : "EFG", "sizes" : [ ] }
{ "_id" : 3, "item" : "IJK", "sizes": "M" }
{ "_id" : 4, "item" : "LMN" }
{ "_id" : 5, "item" : "XYZ", "sizes" : null }
The following $unwind operations are equivalent and return a document for each element in the sizes field. If the sizes field does not resolve to an array but is not missing, null, or an empty array, $unwind treats the non-array operand as a single element array.
db.inventory.aggregate( [ { $unwind: "$sizes" } ] )
or
db.inventory.aggregate( [ { $unwind: { path: "$sizes" } } ] )
Above query output :
{ "_id" : 1, "item" : "ABC", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "sizes" : "L" }
{ "_id" : 3, "item" : "IJK", "sizes" : "M" }
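The rules above can be emulated in a few lines of plain JavaScript. This is only an illustration of the documented behaviour, not how the server implements it. The unwind function and its preserve flag are hypothetical names; preserve mirrors the preserveNullAndEmptyArrays option of the real stage (MongoDB 3.2+).

```javascript
// Emulates $unwind on in-memory objects; hypothetical helper, illustration only.
function unwind(docs, field, preserve = false) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    const isEmpty =
      value === undefined ||
      value === null ||
      (Array.isArray(value) && value.length === 0);
    if (isEmpty) {
      // By default, missing, null, and empty-array documents are dropped.
      if (!preserve) return [];
      // With preserve, the document is kept; an empty array is omitted from it.
      const copy = { ...doc };
      if (Array.isArray(value)) delete copy[field];
      return [copy];
    }
    // A non-array value is treated as a single-element array.
    const elements = Array.isArray(value) ? value : [value];
    return elements.map((element) => ({ ...doc, [field]: element }));
  });
}

const inventory = [
  { _id: 1, item: "ABC", sizes: ["S", "M", "L"] },
  { _id: 2, item: "EFG", sizes: [] },
  { _id: 3, item: "IJK", sizes: "M" },
  { _id: 4, item: "LMN" },
  { _id: 5, item: "XYZ", sizes: null },
];

console.log(unwind(inventory, "sizes").map((d) => d._id)); // [ 1, 1, 1, 3 ]
console.log(unwind(inventory, "sizes", true).length); // 7
```

The default call drops documents 2, 4 and 5, while the preserve variant keeps one output document for each of them.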
Why is it needed?
$unwind is very useful when performing aggregation. It breaks a complex/nested document into simple documents before you perform operations such as sorting, searching, etc.
To know more about $unwind :
https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/
To know more about aggregation :
https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
Let me explain it in a way that relates to RDBMS concepts. This is the statement:
db.article.aggregate(
{ $project : {
author : 1 ,
title : 1 ,
tags : 1
}},
{ $unwind : "$tags" }
);
to apply to the document / record:
{
title : "this is my title" ,
author : "bob" ,
posted : new Date () ,
pageViews : 5 ,
tags : [ "fun" , "good" , "fun" ] ,
comments : [
{ author :"joe" , text : "this is cool" } ,
{ author :"sam" , text : "this is bad" }
],
other : { foo : 5 }
}
The $project / SELECT simply returns these fields/columns, as in:
SELECT author, title, tags FROM article
Next is the fun part of Mongo: consider the array tags : [ "fun" , "good" , "fun" ] as another related table named "tags" (it can't be a lookup/reference table because its values contain duplicates). Remember that SELECT generally produces rows vertically, so unwinding "tags" splits the array vertically into rows of the "tags" table.
The end result of $project + $unwind, translated to JSON:
{ "author": "bob", "title": "this is my title", "tags": "fun"},
{ "author": "bob", "title": "this is my title", "tags": "good"},
{ "author": "bob", "title": "this is my title", "tags": "fun"}
Because we didn't tell Mongo to omit the "_id" field, it is also included in each output document (it is left out above for brevity).
The key is to make it table-like to perform aggregation.