MongoDB Aggregate a sub field of an array (Array is main field) - mongodb

I have a database in MongoDB. There are three main fields: _id, Reviews, HotelInfo. Reviews and HotelInfo are arrays. Reviews has a field called Author. I would like to print out every author name (once) and the number of times they appear within the dataset.
I tried:
db.reviews.aggregate( {$group : { _id : '$Reviews.Author', count : {$sum : 1}}} ).pretty()
Part of the result was:
"_id" : [
"VijGuy",
"Stephtastic",
"dakota431",
"luce_sociator",
"ArkansasMomOf3",
"ccslilqt6969",
"RJeanM",
"MissDusty",
"sammymd",
"A TripAdvisor Member",
"A TripAdvisor Member"
],
"count" : 1
How it should be:
{ "_id" : "VijGuy", "count" : 1 }, { "_id" : "Stephtastic", "count" : 1 }
I posted the JSON format below.
Any idea on how to do this would be appreciated
JSON Format

Let's assume that this is our collection.
[{
_id: 1,
Reviews: [{Author: 'elad' , txt: 'good'}, {Author: 'chen', txt: 'bad'}]
},
{
_id: 2,
Reviews: [{Author: 'elad', txt : 'nice'}]
}]
To get the data in the shape you want, first use the $unwind stage and then the $group stage.
[{ $unwind: '$Reviews' }, {$group : { _id : '$Reviews.Author', count : {$sum : 1}}}]
You need to first unwind the collection by the Reviews field.
After the $unwind stage, the data in the pipeline will look like this:
{_id:1, Reviews: {Author: 'elad' , txt: 'good'}},
{_id:1, Reviews: {Author: 'chen' , txt: 'bad'}},
{_id:2, Reviews: {Author: 'elad', txt : 'nice'}}
$unwind creates one document for each element of the Reviews array, containing that element together with the other fields of its parent document. Now it is easy to group the data however you want: we can use the same $group stage that you wrote and get the expected results.
After the $group stage, our data will look like this:
[{_id: 'elad', count: 2}, {_id: 'chen', count: 1}]
$unwind is a very important pipeline stage in the aggregation framework. It helps us flatten complex, nested documents into simple ones, which lets us query the data in different ways.
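To make the two stages concrete, here is a minimal plain-JavaScript sketch that mimics what $unwind and $group do on the sample collection above. It runs without a MongoDB server. The helper names unwind and countBy are made up for this illustration and are not part of any MongoDB API.

```javascript
// Sample collection from the answer above.
const hotels = [
  { _id: 1, Reviews: [{ Author: 'elad', txt: 'good' }, { Author: 'chen', txt: 'bad' }] },
  { _id: 2, Reviews: [{ Author: 'elad', txt: 'nice' }] }
];

// Hypothetical helper: emit one copy of each document per array element,
// with the array field replaced by that single element (like $unwind).
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(element => ({ ...doc, [field]: element }))
  );
}

// Hypothetical helper: count documents per key (like $group with {$sum: 1}).
function countBy(docs, keyFn) {
  const counts = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([_id, count]) => ({ _id, count }));
}

const unwound = unwind(hotels, 'Reviews');
const authorCounts = countBy(unwound, doc => doc.Reviews.Author);
console.log(authorCounts);
```

Grouping directly on '$Reviews.Author' without unwinding uses the whole array of names as the key, which is why the original attempt returned one group per document rather than one per author.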
What's the $unwind operator in MongoDB?

Related

How to store a list of values from mongo subdocument

I have a collection with documents like this
{
"_id" : ObjectId("59de9454e4b03289d79eeab4"),
"type" : "Draft",
.
.
.
headers
{
"payload" : [
{
"_id" : "ABC",
},
{
"_id" :"DEF",
}]
How should I store the _id values in a variable like
result: ["ABC","DEF"]
I tried a query like this, but it is not working:
var result = []
db.request.find({"_id" : ObjectId("59de9454e4b03289d79eeab4")}).forEach(function(u) { result.push(u.headers.payload._id) })
Your input doc is a tad malformed, but assuming payload is an array that sits directly under headers (that is, headers itself is not an array), and you want to $match or otherwise filter the documents before you extract the _id values, and you want all the _id values (not a distinct set) returned neatly in a field called result, then this does the trick:
db.foo.aggregate([
{$match: {"_id": 1}}
,{$project: {result: {$map: {
input: "$headers.payload",
as: "z",
in: "$$z._id"
}}
}}
]);
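For comparison, here is a small plain-JavaScript sketch of the same idea outside the pipeline. The forEach attempt in the question fails because payload is an array, so payload._id is undefined; the _id has to be read from each element. The request object below is a trimmed stand-in for the document in the question, with the elided fields left out.

```javascript
// Trimmed stand-in for the document in the question (elided fields omitted).
const request = {
  type: 'Draft',
  headers: { payload: [{ _id: 'ABC' }, { _id: 'DEF' }] }
};

// payload is an array, so request.headers.payload._id is undefined;
// read _id from each element instead (the same idea as $map).
const result = request.headers.payload.map(item => item._id);
console.log(result);
```

In the shell loop from the question, the equivalent change would presumably be to iterate over u.headers.payload and push each element's _id.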

MongoDB aggregation/map-reduce

I'm new to MongoDB and I need to do an aggregation which seems to me quite difficult. A document looks something like this
{
"_id" : ObjectId("568192aef8bd6b0cd0f649c6"),
"conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
"prism:aggregationType" : "Conference Proceeding",
"children-id" : [
"SCOPUS_ID:84948148564",
"SCOPUS_ID:84927603733",
"SCOPUS_ID:84943521758",
"SCOPUS_ID:84905234683",
"SCOPUS_ID:84876113709"
],
"dc:identifier" : "SCOPUS_ID:84867598678"
}
The example contains just the fields I need in the aggregation. prism:aggregationType can have 5 different values (conference proceeding, book, journal, etc.). children-id says that this document is cited by an array of other documents (SCOPUS_ID is a unique ID for each document).
What I want to do is group first by conference, and then, for each conference and each prism:aggregationType, find out how many citing documents there are (only counts greater than 0).
For example, let's say there are 100 documents that have the conference from above. These 100 documents are cited by 250 documents. I want to know, out of these 250 documents, how many have "prism:aggregationType" : "Conference Proceeding", "prism:aggregationType" : "Journal", etc.
An output could look like this:
{
"conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
"aggregationTypes" : [{"Conference Proceeding" : 50} , {"Journal" : 200}]
}
It is not important if it is done with aggregation pipeline or map-reduce.
EDIT
Is there any way to combine these 2 into one aggregation:
db.articles.aggregate([
{ $match:{
conference : {$ne : null}
}},
{$unwind:'$children-id'},
{$group: {
_id: {conference: '$conference'},
'cited-by':{$push:{'dc:identifier':"$children-id"}}
}}
]);
db.articles.find( { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733'] } }, {'prism:aggregationType':1} );
In the query I want to replace the array from $in with the array created with $push
Please try this one through aggregation:
db.collections.aggregate([
// 1. get the size of `children-id` array through $project
{$project: {
conference: 1,
IEEE1: 1,
'prism:aggregationType': 1,
'children-id': {$size: '$children-id'}
}},
// 2. group by `conference` and `prism:aggregationType` and sum the size of `children-id`
{$group: {
_id: {
conference:'$conference',
aggregationType: '$prism:aggregationType'
},
ids: {$sum: '$children-id'}
}},
// 3. group by `conference` and collect one {type, count} pair per aggregation type
//    ($push must be the top-level accumulator; it cannot be nested inside $cond)
{$group: {
_id: '$_id.conference',
aggregationTypes: {
$push: {type: '$_id.aggregationType', count: '$ids'}
}
}}
]);
As we discussed in chat, using $lookup in the aggregation pipeline unfortunately requires MongoDB 3.2, which is not an option here because the R driver only works with MongoDB 2.6, and the source documents are in more than one collection.
The code I wrote in the EDIT section is also the final result I came up with (slightly modified):
db.articles.aggregate([
{ $match:{
conference : {$ne : null}
}},
{$unwind:'$children-id'},
{$group: {
_id: '$conference',
'cited-by':{$push:"$children-id"}
}}
]);
db.articles.find( { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733'] } }, {'prism:aggregationType':1} );
The result will look like this for each conference:
{
"_id" : "Annual Conference on Privacy, Security and Trust",
"cited-by" : [
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84928151617",
"SCOPUS_ID:84939229259",
"SCOPUS_ID:84946407175",
"SCOPUS_ID:84933039513",
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84942607254",
"SCOPUS_ID:84948165954",
"SCOPUS_ID:84926379258",
"SCOPUS_ID:84946771354",
"SCOPUS_ID:84944223683",
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84939169499",
"SCOPUS_ID:84947104346",
"SCOPUS_ID:84948764343",
"SCOPUS_ID:84938075139",
"SCOPUS_ID:84946196118",
"SCOPUS_ID:84930820238",
"SCOPUS_ID:84947785321",
"SCOPUS_ID:84933496680",
"SCOPUS_ID:84942789431"
]
}
I iterate through all the documents I get (around 250) and then use each cited-by array inside $in. There is an index on dc:identifier, so the lookup is practically instant.
$lookup could be an alternative for doing everything inside the aggregation pipeline, but the R packages do not support versions above 2.6.
Thank you for your time anyway :)
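The two-step approach described above (collect the citing ids per conference, then look up each citing document's type) can be sketched in plain JavaScript. This runs in memory without a server. The sample data and the names byId, citedBy, and typeCounts are invented for the illustration, and the Map lookup stands in for the indexed find with $in.

```javascript
// Invented sample data: two conference papers and the documents citing them.
const articles = [
  { 'dc:identifier': 'ID:1', conference: 'Conf A', 'children-id': ['ID:3', 'ID:4'] },
  { 'dc:identifier': 'ID:2', conference: 'Conf A', 'children-id': ['ID:4', 'ID:5'] },
  { 'dc:identifier': 'ID:3', 'prism:aggregationType': 'Journal' },
  { 'dc:identifier': 'ID:4', 'prism:aggregationType': 'Conference Proceeding' },
  { 'dc:identifier': 'ID:5', 'prism:aggregationType': 'Journal' }
];

// Index by identifier, standing in for the indexed find({... $in ...}) lookup.
const byId = new Map(articles.map(a => [a['dc:identifier'], a]));

// Step 1: collect the citing ids per conference ($match, $unwind, $group/$push).
const citedBy = {};
for (const a of articles) {
  if (a.conference == null) continue;
  for (const id of a['children-id'] || []) {
    (citedBy[a.conference] = citedBy[a.conference] || []).push(id);
  }
}

// Step 2: look up each citing document and tally its aggregation type.
// $in matches each citing document once, so duplicate ids are dropped first.
const typeCounts = {};
for (const [conference, ids] of Object.entries(citedBy)) {
  const counts = {};
  for (const id of new Set(ids)) {
    const citing = byId.get(id);
    if (!citing) continue; // citing document not present in the collection
    const type = citing['prism:aggregationType'];
    counts[type] = (counts[type] || 0) + 1;
  }
  typeCounts[conference] = counts;
}
console.log(typeCounts);
```

Whether a document that cites two papers of the same conference should count once or twice is a choice; dropping the Set would count every citation instead.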

Most frequent word in MongoDB collection

I have a MongoDB collection where each entry has a product field containing a string array. What I would like to do is find the most frequent word in the whole collection. Any ideas on how to do that?
Here is a sample object:
{
"_id" : ObjectId("55e02d333b88f425f84191af"),
"product" : [
" x bla y "
],
"hash_key" : "ecfe355b2f45dfbaf361cff4d314d4cc",
"price" : [
"z"
],
"image" : "image_url"
}
Looking at the sample object, what I would like to do is count "x", "bla" and "y" individually.
I recently had to do something similar. I had a collection of objects and each object had a list of keywords. To count the frequency of each keyword, I used the following aggregation pipeline, which uses the MongoDB version 4.4 $accumulator group operation.
db.collectionname.aggregate(
{$match: {available: true}}, // Some criteria to filter the documents
{$project:
{ _id: 0, keywords: 1}}, // Only keep keywords
{$group:
{_id: null, keywords: // Accumulate keywords into one array
{$accumulator: {
init: function(){return new Array()},
accumulate: function(state, value){return state.concat(value)},
accumulateArgs: ["$keywords"],
merge: function(state1, state2){return state1.concat(state2)},
lang: "js"}}}},
{$unwind: "$keywords"}, // Split array into fields
{$group: {_id: "$keywords", freq: {$sum: 1}}}, // Group keywords and count frequencies
{$sort: {freq: -1}}, // Sort in reverse order
{$limit: 5} // Take first five
)
I have no idea if this is the most efficient solution. However, it solved the problem for me.
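Note that the pipeline above counts whole array elements, while the question asks about words inside the strings of the product array. Here is a minimal plain-JavaScript sketch of that extra splitting step combined with the same count-and-sort idea. It runs in memory without a server, and the second sample document is invented for the illustration.

```javascript
// Sample documents shaped like the one in the question (other fields omitted).
// The second document is invented so that the counts differ.
const products = [
  { product: [' x bla y '] },
  { product: ['bla z', 'x bla'] }
];

const freq = new Map();
for (const doc of products) {
  for (const text of doc.product || []) {
    // Split on runs of whitespace; filter(Boolean) drops the empty strings
    // caused by leading or trailing spaces.
    for (const word of text.split(/\s+/).filter(Boolean)) {
      freq.set(word, (freq.get(word) || 0) + 1);
    }
  }
}

// Sort by descending frequency and keep the top entry.
const ranked = [...freq].sort((a, b) => b[1] - a[1]);
const [mostFrequentWord, occurrences] = ranked[0];
console.log(mostFrequentWord, occurrences);
```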

how to use mongodb aggregate and retrieve entire documents

I am seriously baffled by mongodb's aggregate function. All I want is to find the newest document in my collection. Let's say each record has a field "created"
db.collection.aggregate({
$group: {
_id:0,
'id':{$first:"$_id"},
'max':{$max:"$created"}
}
})
yields the correct result, but I want the entire document in the result? How would I do that?
This is the structure of the document:
{
"_id" : ObjectId("52310da847cf343c8c000093"),
"created" : 1389073358,
"image" : ObjectId("52cb93dd47cf348786d63af2"),
"images" : [
ObjectId("52cb93dd47cf348786d63af2"),
ObjectId("52f67c8447cf343509d63af2")
],
"organization" : ObjectId("522949d347cf3402c3000001"),
"published" : 1392601521,
"status" : "PUBLISHED",
"tags" : [ ],
"updated" : 1392601521,
"user_id" : ObjectId("52214ce847cf344902000000")
}
In the documentation I found that the $$ROOT system variable addresses this problem.
From the docs:
http://docs.mongodb.org/manual/reference/operator/aggregation/group/#group-documents-by-author
query = [
{
'$sort': {
'created': -1
}
},
{
$group: {
'_id':null,
'max':{'$first':"$$ROOT"}
}
}
]
db.collection.aggregate(query)
db.collection.aggregate([
{
$group: {
'_id':"$_id",
'otherFields':{ $push: { fields: "$$ROOT" } }
}
}
])
I think I figured it out. For example, I have a collection containing an array of images (or pointers). Now I want to find the document with the most images
results=[];
db.collection.aggregate([
{$unwind: "$images"},
{$group:{_id:"$_id", 'imagecount':{$sum:1}}},
{$group:{_id:"$_id",'max':{$max: "$imagecount"}}},
{$sort:{max:-1}},
{$group:{_id:0,'id':{$first:'$_id'},'max':{$first:"$max"}}}
]).forEach(function(d){ // aggregate() returns a cursor in MongoDB 2.6+; older shells exposed the documents as .result
results.push(db.stories.findOne({_id:d.id}));
});
Now the results array will contain the document with the most images. Since images is an array, I use $unwind, then group by document id with $sum: 1, pipe that into a $group that finds the max, pipe it into a descending $sort on max, and $group out the first result. Finally I findOne the document and push it into the results array.
You should be using db.collection.find() rather than db.collection.aggregate():
db.collection.find().sort({"created":-1}).limit(1)
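Both the $sort plus $first approach and the find().sort().limit(1) approach come down to "order by created, newest first, and keep the first whole document". A minimal in-memory sketch in plain JavaScript, using invented sample records:

```javascript
// Invented sample records; only `created` matters here.
const records = [
  { _id: 'a', created: 1389073358, status: 'PUBLISHED' },
  { _id: 'b', created: 1392601521, status: 'DRAFT' },
  { _id: 'c', created: 1380000000, status: 'PUBLISHED' }
];

// Copy before sorting so the input is not mutated, sort newest first,
// then take the first element: the whole document, not just its max value.
const newest = [...records].sort((x, y) => y.created - x.created)[0];
console.log(newest);
```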

MongoDB aggregation by array field

Suppose my collection contains documents like:
{"email": "jim#abc.com", "fb" : { "name" : {
"full" : "Jim Bent"} }, "apps": ["com.abc.1", "com.abc.2"]}
{"email": "john#abc.com", "fb" : { "name" : {
"full" : "John Smith"}}, "apps": ["com.abc.1", "com.abc.3" ]}
I want to write a query and export it out via a csv file that outputs the emails, fb.name.full grouping by the "apps" array fields in this entire collection.
That is: for "com.abc.1", it outputs Jim Bent with his email and John Smith and email.
For "com.abc.2", it will output only Jim Bent, whilst for "com.abc.3", it will output only John Smith.
I have researched a bit, but mongoexport doesn't allow complex queries, and I am not able to write an $unwind function either. So I am hitting a wall.
Any advice is appreciated. Thank you.
You can do this with JavaScript and the mongo shell by creating a file (e.g. myquery.js) with the following code:
printjson(
db.collection.aggregate([
{$unwind: '$apps'},
{$group: { _id: '$apps', info: { '$push': { email: '$email', name: '$fb.name.full'}}}},
{$project: {app: '$_id', info: 1, '_id': 0}}
]))
then you can perform the query from the command line as:
mongo database myquery.js
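Since the question asks for CSV output, here is a minimal plain-JavaScript sketch of turning the unwound data into CSV rows. It works in memory on the two sample documents from the question rather than on a live collection, and the csvField helper and the app,email,name column layout are assumptions made for the illustration.

```javascript
// The two sample documents from the question.
const users = [
  { email: 'jim#abc.com', fb: { name: { full: 'Jim Bent' } }, apps: ['com.abc.1', 'com.abc.2'] },
  { email: 'john#abc.com', fb: { name: { full: 'John Smith' } }, apps: ['com.abc.1', 'com.abc.3'] }
];

// Hypothetical helper: quote a CSV field and double any embedded quotes.
function csvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

// One row per (app, user) pair, which is what $unwind on `apps` yields.
const rows = [];
for (const user of users) {
  for (const app of user.apps || []) {
    rows.push([app, user.email, user.fb.name.full]);
  }
}
// Sort by app name so rows for the same app are adjacent.
rows.sort((a, b) => a[0].localeCompare(b[0]));

const csv = [['app', 'email', 'name'], ...rows]
  .map(row => row.map(csvField).join(','))
  .join('\n');
console.log(csv);
```

One flat row per app and user tends to be easier to load into a spreadsheet than the nested info array that the $group stage produces.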