MongoDB aggregation by array field

Suppose my collection contains documents like:
{"email": "jim#abc.com", "fb" : { "name" : {
"full" : "Jim Bent"} }, "apps": ["com.abc.1", "com.abc.2"]}
{"email": "john#abc.com", "fb" : { "name" : {
"full" : "John Smith"}}, "apps": ["com.abc.1", "com.abc.3" ]}
I want to write a query that outputs email and fb.name.full, grouped by the values of the "apps" array across the entire collection, and export the results to a CSV file.
That is: for "com.abc.1", it outputs Jim Bent and John Smith, each with their email.
For "com.abc.2", it will output only Jim Bent, whilst for "com.abc.3", it will output only John Smith.
I have researched a bit, but mongoexport doesn't allow complex queries, and I haven't been able to write an $unwind pipeline either, so I am hitting a wall.
Any advice is appreciated. Thank you.

You can do this with JavaScript and the mongo shell by creating a file (e.g. myquery.js) with the following code:
printjson(
  db.collection.aggregate([
    {$unwind: '$apps'},
    {$group: {_id: '$apps', info: {$push: {email: '$email', name: '$fb.name.full'}}}},
    {$project: {app: '$_id', info: 1, _id: 0}}
  ]).toArray()
)
then you can perform the query from the command line as:
mongo database myquery.js
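If you need CSV rather than JSON output, a minimal sketch (saved as, say, mycsv.js; the header line and column order here are my own choice, and database is a placeholder) is to print the rows yourself and redirect the shell output:
print('app,email,name');
db.collection.aggregate([
  {$unwind: '$apps'},
  {$group: {_id: '$apps', info: {$push: {email: '$email', name: '$fb.name.full'}}}}
]).forEach(function(doc) {
  // one CSV row per (app, user) pair
  doc.info.forEach(function(i) {
    print(doc._id + ',' + i.email + ',' + i.name);
  });
});
then run it as:
mongo --quiet database mycsv.js > out.csv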

Related

MongoDB Aggregate a sub field of an array (Array is main field)

I have a database in MongoDB. There are three main fields: _id, Reviews, HotelInfo. Reviews and HotelInfo are arrays. Reviews has a field called Author. I would like to print out every author name (once) and the number of times they appear within the dataset.
I tried:
db.reviews.aggregate( {$group : { _id : '$Reviews.Author', count : {$sum : 1}}} ).pretty()
A part of the results were:
"_id" : [
"VijGuy",
"Stephtastic",
"dakota431",
"luce_sociator",
"ArkansasMomOf3",
"ccslilqt6969",
"RJeanM",
"MissDusty",
"sammymd",
"A TripAdvisor Member",
"A TripAdvisor Member"
],
"count" : 1
How it should be:
{ "_id" : "VijGuy", "count" : 1 }, { "_id" : "Stephtastic", "count" : 1 }
I posted the JSON format below.
Any idea on how to do this would be appreciated
JSON Format
Let's assume that this is our collection.
[{
_id: 1,
Reviews: [{Author: 'elad' , txt: 'good'}, {Author: 'chen', txt: 'bad'}]
},
{
_id: 2,
Reviews: [{Author: 'elad', txt : 'nice'}]
}]
To get the data you want, we first need to use the $unwind stage and then the $group stage.
[{ $unwind: '$Reviews' }, {$group : { _id : '$Reviews.Author', count : {$sum : 1}}}]
You need to first unwind the collection by the Reviews field.
After the $unwind stage, our data in the pipeline will look like this:
{_id: 1, Reviews: {Author: 'elad', txt: 'good'}},
{_id: 1, Reviews: {Author: 'chen', txt: 'bad'}},
{_id: 2, Reviews: {Author: 'elad', txt: 'nice'}}
$unwind creates a document for each element of the Reviews array, pairing the element with its host document. Now it's easy to group the data in whatever way you want; we can use the same $group stage that you wrote and get the results.
After the $group stage, our data will look like this:
[{_id: 'elad', count: 2}, {_id: 'chen', count: 1}]
$unwind is a very important pipeline stage in the aggregation framework. It helps us transform complex, nested documents into flat, simple ones, which lets us query the data in different ways.
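Putting it together, here is a minimal runnable sketch against the sample collection above (using the reviews collection name from the question):
db.reviews.insertMany([
  {_id: 1, Reviews: [{Author: 'elad', txt: 'good'}, {Author: 'chen', txt: 'bad'}]},
  {_id: 2, Reviews: [{Author: 'elad', txt: 'nice'}]}
]);
// unwind, then count one document per author occurrence
db.reviews.aggregate([
  {$unwind: '$Reviews'},
  {$group: {_id: '$Reviews.Author', count: {$sum: 1}}}
]);
// => { "_id" : "chen", "count" : 1 }, { "_id" : "elad", "count" : 2 }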
What's the $unwind operator in MongoDB?

Exporting nested data from a MongoDB

I am trying to export nested fields from MongoDB to a CSV.
From the code below, I would like to extract the scale name (e.g. "Security" & "Power") and the raw_score (e.g. 2 & 3, respectively) fields. These four fields would be stored in four columns in a CSV, where each column is an extracted field.
"results" : {
"scales" : [
{
"scale" : {
"name" : "Security",
"code" : "SEC",
"multiplier" : 1
},
"raw_score" : 2
},
{
"scale" : {
"name" : "Power",
"code" : "POW",
"multiplier" : -1
},
"raw_score" : 3
}
],
In the past I have been successful using dot notation to extract nested fields (a working example below from a previous extraction), yet I am unsure how do to extract fields that share the same name.
mongoexport -d production_hoganx_collector_061817 -c records --type=csv -o col_liwc_summary_061817.csv -f user_id,post_analysis.liwc_scores.tone
How can I extract the name and raw_score fields using the mongoexport command? I have tried to export the database to a JSON file and then extract the data via R, however this method takes too long to complete.
If mongoexport is not suitable, I am open to hearing alternatives!
Many thanks,
I'm assuming this is a one-time thing, so I suggest using an aggregate to build a new collection with the scales array unwound.
$unwind fans out a document into n documents, where n is the number of elements in the specified array field. So, for example, if you had a document like this one:
{
name: "Some name",
email: ["somename@somedomain.com", "name@someotherdomain.com"]
}
An unwind on the email field would result in two documents:
{
name: "Some name",
email: "somename@somedomain.com"
},
{
name: "Some name",
email: "name@someotherdomain.com"
}
So in your case I think you should use that to unwind your scales field like this:
db.collection.aggregate([
{$match: yourCondition},
{$unwind: "$scales"},
{$project: {
_id: false,
scales: true,
... other fields ...
}},
{$out: "unwindedcollection"}
]);
At this point you should be able to use mongoexport on the new collection (unwindedcollection), using the dot notation you used before.
Be sure to set _id to false; otherwise you'll end up with duplicate _id errors. You don't want to project that field, so that new ids are created when the aggregate results are inserted into the new collection.
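For example, the follow-up export might look like this (a sketch; the database name is a placeholder and the field paths are assumptions based on the structure above):
mongoexport -d yourdb -c unwindedcollection --type=csv -o scales.csv -f scales.scale.name,scales.raw_score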
I'll leave the links to the docs of the concepts I used for this:
aggregate: https://docs.mongodb.com/manual/reference/method/db.collection.aggregate/
$project: https://docs.mongodb.com/manual/reference/operator/aggregation/project/
$unwind: https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/
$out: https://docs.mongodb.com/manual/reference/operator/aggregation/out/

MongoDB aggregation/map-reduce

I'm new to MongoDB and I need to do an aggregation which seems to me quite difficult. A document looks something like this
{
"_id" : ObjectId("568192aef8bd6b0cd0f649c6"),
"conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
"prism:aggregationType" : "Conference Proceeding",
"children-id" : [
"SCOPUS_ID:84948148564",
"SCOPUS_ID:84927603733",
"SCOPUS_ID:84943521758",
"SCOPUS_ID:84905234683",
"SCOPUS_ID:84876113709"
],
"dc:identifier" : "SCOPUS_ID:84867598678"
}
The example contains just the fields I need in the aggregation. prism:aggregationType can have 5 different values (Conference Proceeding, Book, Journal, etc.). children-id lists the documents that cite this one (SCOPUS_ID is a unique ID for each document).
What I want to do is group first by conference, and then for each conference I want to know how many citing documents there are of each prism:aggregationType (counts greater than 0).
For example, let's say there are 100 documents with the conference above. These 100 documents are cited by 250 documents. I want to know how many of these 250 documents have "prism:aggregationType" : "Conference Proceeding", "prism:aggregationType" : "Journal", etc.
An output could look like this:
{
"conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
"aggregationTypes" : [{"Conference Proceeding" : 50} , {"Journal" : 200}]
}
It is not important if it is done with aggregation pipeline or map-reduce.
EDIT
Is there any way to combine these 2 into one aggregation:
db.articles.aggregate([
{ $match:{
conference : {$ne : null}
}},
{$unwind:'$children-id'},
{$group: {
_id: {conference: '$conference'},
'cited-by':{$push:{'dc:identifier':"$children-id"}}
}}
]);
db.articles.find( { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733'] } }, {'prism:aggregationType':1} );
In the query I want to replace the array from $in with the array created with $push
Please try this aggregation:
> db.collection.aggregate([
  // 1. get the size of the `children-id` array through $project
  {$project: {
    conference: 1,
    'prism:aggregationType': 1,
    'children-id': {$size: '$children-id'}
  }},
  // 2. group by `conference` and `prism:aggregationType` and sum the `children-id` sizes
  {$group: {
    _id: {
      conference: '$conference',
      aggregationType: '$prism:aggregationType'
    },
    ids: {$sum: '$children-id'}
  }},
  // 3. group by `conference` and pair each aggregation type with its citation count
  // (note: $push must wrap the $cond expression; $cond cannot wrap an accumulator)
  {$group: {
    _id: '$_id.conference',
    aggregationTypes: {
      $push: {
        $cond: [{$eq: ['$_id.aggregationType', 'Conference Proceeding']},
          {'Conference Proceeding': '$ids'},
          {'Journal': '$ids'}
        ]
      }
    }
  }}
]);
As we discussed in chat, using $lookup in the aggregation pipeline unfortunately requires MongoDB 3.2, which is not an option here, since the R driver only supports MongoDB 2.6 and the source documents live in more than one collection.
The code I wrote in the EDIT section is also the final result I came up with (a little bit modified):
db.articles.aggregate([
{ $match:{
conference : {$ne : null}
}},
{$unwind:'$children-id'},
{$group: {
_id: '$conference',
'cited-by':{$push:"$children-id"}
}}
]);
db.articles.find( { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733'] } }, {'prism:aggregationType':1} );
The result will look like this for each conference:
{
"_id" : "Annual Conference on Privacy, Security and Trust",
"cited-by" : [
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84928151617",
"SCOPUS_ID:84939229259",
"SCOPUS_ID:84946407175",
"SCOPUS_ID:84933039513",
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84942607254",
"SCOPUS_ID:84948165954",
"SCOPUS_ID:84926379258",
"SCOPUS_ID:84946771354",
"SCOPUS_ID:84944223683",
"SCOPUS_ID:84942789431",
"SCOPUS_ID:84939169499",
"SCOPUS_ID:84947104346",
"SCOPUS_ID:84948764343",
"SCOPUS_ID:84938075139",
"SCOPUS_ID:84946196118",
"SCOPUS_ID:84930820238",
"SCOPUS_ID:84947785321",
"SCOPUS_ID:84933496680",
"SCOPUS_ID:84942789431"
]
}
I iterate through all the documents I get (around 250) and use each cited-by array inside $in. I have an index on dc:identifier, so it runs instantly.
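A sketch of that loop in the shell, counting the types on the client side (collection and field names are taken from the pipelines above):
db.articles.aggregate([
  {$match: {conference: {$ne: null}}},
  {$unwind: '$children-id'},
  {$group: {_id: '$conference', 'cited-by': {$push: '$children-id'}}}
]).forEach(function(conf) {
  var counts = {};
  // look up the aggregation type of every citing document
  db.articles.find(
    {'dc:identifier': {$in: conf['cited-by']}},
    {'prism:aggregationType': 1}
  ).forEach(function(doc) {
    var type = doc['prism:aggregationType'];
    counts[type] = (counts[type] || 0) + 1;
  });
  printjson({conference: conf._id, aggregationTypes: counts});
});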
$lookup could be an alternative that gets everything done inside the aggregation pipeline, but the R packages do not support versions above 2.6.
Thank you for your time anyway :)

how to use mongodb aggregate and retrieve entire documents

I am seriously baffled by MongoDB's aggregate function. All I want is to find the newest document in my collection. Let's say each record has a field "created".
db.collection.aggregate({
$group: {
_id:0,
'id':{$first:"$_id"},
'max':{$max:"$created"}
}
})
yields the correct result, but I want the entire document in the result. How would I do that?
This is the structure of the document:
{
"_id" : ObjectId("52310da847cf343c8c000093"),
"created" : 1389073358,
"image" : ObjectId("52cb93dd47cf348786d63af2"),
"images" : [
ObjectId("52cb93dd47cf348786d63af2"),
ObjectId("52f67c8447cf343509d63af2")
],
"organization" : ObjectId("522949d347cf3402c3000001"),
"published" : 1392601521,
"status" : "PUBLISHED",
"tags" : [ ],
"updated" : 1392601521,
"user_id" : ObjectId("52214ce847cf344902000000")
}
In the documentation I found that the $$ROOT expression addresses this problem.
From the docs:
http://docs.mongodb.org/manual/reference/operator/aggregation/group/#group-documents-by-author
query = [
{
'$sort': {
'created': -1
}
},
{
$group: {
'_id':null,
'max':{'$first':"$$ROOT"}
}
}
]
db.collection.aggregate(query)
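As a usage note: in shells where aggregate returns a cursor, the newest document itself can then be pulled out like this:
var newest = db.collection.aggregate(query).toArray()[0].max; // the $$ROOT we kept
printjson(newest);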
A related pattern pushes the whole document for each group:
db.collection.aggregate([
{
$group: {
'_id': "$_id",
'otherFields': {$push: {fields: '$$ROOT'}}
}
}
])
I think I figured it out. For example, I have a collection containing an array of images (or pointers). Now I want to find the document with the most images
results = [];
db.collection.aggregate([
{$unwind: "$images"},
{$group: {_id: "$_id", 'imagecount': {$sum: 1}}},
{$group: {_id: "$_id", 'max': {$max: "$imagecount"}}},
{$sort: {max: -1}},
{$group: {_id: 0, 'id': {$first: '$_id'}, 'max': {$first: "$max"}}}
]).forEach(function(d) {
results.push(db.collection.findOne({_id: d.id}));
});
Now the final array will contain the document with the most images. Since images is an array, I use $unwind, then group by document id with $sum: 1, pipe that into a $group that finds the max, pipe that into a descending $sort on max, and $group out the first result. Finally I findOne the document and push it into the results array.
You should be using db.collection.find() rather than db.collection.aggregate():
db.collection.find().sort({"created":-1}).limit(1)

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
'_id': ObjectId("50ad8d451d41c8fc58000003"),
'name': 'Sample Log 1',
'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
'case_id': '50ad8d451d41c8fc58000099',
'tag_doc': {
'group_x': ['TAG-1', 'TAG-2'],
'group_y': ['XYZ']
}
},
{
'_id': ObjectId("50ad8d451d41c8fc58000004"),
'name': 'Sample Log 2',
'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
'case_id': '50ad8d451d41c8fc58000099',
'tag_doc': {
'group_x': ['TAG-1'],
'group_y': ['XYZ']
}
}
> db.cases.findOne()
{
'_id': ObjectId("50ad8d451d41c8fc58000099"),
'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag: '$tag_doc.group_x', case: '$case_id', latest: '$uploaded_at'}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution so I am thinking if the document structure itself is not optimized for my use-case. Do I have to update the fields to support what I want to achieve? Suggestions very much appreciated.
Edit:
I am actually looking for an implementation in mongodb similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? except it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging to matching tags or within a range of dates.
Edit:
Due to the complexity of my use-case I tried to use a simple analogy but this proves to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. It's not possible with $match alone, only with a single $group stage. The trick is to use a compound key with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id & address and I want the message with the latest date, we need to create a key like this:
{ user_id: 1, address: 1, date_sent: -1 }
Then you are able to perform the aggregate without a sort, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order, you can add a $sort stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
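For example, the key above might be created like this (a sketch; the collection name is taken from the aggregate below):
db.user_messages.createIndex({user_id: 1, address: 1, date_sent: -1});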
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly, by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)
Hmmm, there is no good way of doing this optimally such that you only pick out the latest document for each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said this is not optimal however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
Each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
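A sketch of that upsert (all collection, variable, and field names here are assumptions for illustration):
// for each tag of the newly created post:
db.preAggregated.update(
  {author: post.author_id, tag: tag},
  {$set: {created_at: post.created_at, post_id: post._id}},
  {upsert: true}
);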
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$Name" },
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields, you would add them along with (or instead of) "Name" and "tag_doc".
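For example, the supporting index might be created like this (a sketch; collection and field names are from the question's documents):
db.logs.createIndex({uploaded_at: -1});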