Map reduce in mongodb

I have mongo documents in this format.
{"_id" : 1,"Summary" : {...},"Examples" : [{"_id" : 353,"CategoryId" : 4},{"_id" : 239,"CategoryId" : 28}, ... ]}
{"_id" : 2,"Summary" : {...},"Examples" : [{"_id" : 312,"CategoryId" : 2},{"_id" : 121,"CategoryId" : 12}, ... ]}
How can I map/reduce them to get a hash like:
{ [ result[categoryId] : count_of_examples , .....] }
I.e., the count of examples in each category.
I have 30 categories in total, all specified in a Categories collection.

If you can use 2.1 (the dev version of the upcoming 2.2 release) then you can use the Aggregation Framework, and it would look something like this:
db.collection.aggregate( [
{$project:{"CatId":"$Examples.CategoryId","_id":0}},
{$unwind:"$CatId"},
{$group:{_id:"$CatId","num":{$sum:1} } },
{$project:{CategoryId:"$_id",NumberOfExamples:"$num",_id:0 }}
] );
The first step projects the subfield of Examples (CategoryId) into a top-level field of the document (not necessary, but it helps readability). Then we unwind the array of examples, which creates a separate document for each array value of CatId. Next we do a "group by" and count them (I assume each instance of CategoryId is one example, right?), and finally we use projection again to relabel the fields and make the result look like this:
"result" : [
{
"CategoryId" : 12,
"NumberOfExamples" : 1
},
{
"CategoryId" : 2,
"NumberOfExamples" : 1
},
{
"CategoryId" : 28,
"NumberOfExamples" : 1
},
{
"CategoryId" : 4,
"NumberOfExamples" : 1
}
],
"ok" : 1

Related

Updating nested List in mongoDB Query working sometimes but with large data set it fails [duplicate]

Following is a MongoDB document:
{
"_id" : 2,
"mem_id" : M002,
"email" : "xyz#gmail.com",
"event_type" : [
{
"name" : "MT",
"count" : 1,
"language" : [
{
"name" : "English",
"count" : 1,
"genre" : [
{
"name" : "Action",
"count" : 6
},
{
"name" : "Sci-Fi",
"count" : 3
}
],
"cast" : [
{
"name" : "Sam Wortington",
"count" : 2
},
{
"name" : "Bruce Willis",
"count" : 4
},
{
"name" : "Will Smith",
"count" : 7
},
{
"name" : "Irfan Khan",
"count" : 1
}
]
}
]
}
]
}
I'm not able to update fields that are of array type, especially event_type, language, genre and cast, because of the nesting. Basically, I want to update all four of those fields along with the count field for each subdocument. The update statement should insert a value into the tree if the value is new, else it should increment the count for that value.
What can be the query in mongo shell?
Thanks
You are directly hitting one of the current limitations of MongoDB.
The problem is that the engine does not support multiple positional operators in a single update.
See: Multiple use of the positional `$` operator to update nested arrays
There is an open ticket for this: https://jira.mongodb.org/browse/SERVER-831 (also mentioned there)
You can also read this one on how to change your data model: Updating nested arrays in mongodb
If it is feasible for you, you can do:
db.collection.update({_id:2,"event_type.name":'MT' ,"event_type.language.name":'English'},{$set:{"event_type.0.language.$.count":<number>}})
db.collection.update({_id:2,"event_type.name":'MT' ,"event_type.language.name":'English'},{$set:{"event_type.$.language.0.count":<number>}})
But you cannot do:
db.collection.update({_id:2,"event_type.name":'MT' ,"event_type.language.name":'English'},{$set:{"event_type.$.language.$.count":<number>}})
Let's take case by case:
To update the field name in event_type array:
db.testnested.update({"event_type.name" : "MT"}, {$set : {"event_type.name" : "GMT"}})
This command will update the name of the matching object inside the event_type list from MT to GMT:
BEFORE:
db.testnested.find({}, {"event_type.name" : 1})
{ "_id" : 2, "event_type" : [ { "name" : "MT" } ] }
AFTER:
db.testnested.find({}, {"event_type.name" : 1})
{ "_id" : 2, "event_type" : [ { "name" : "GMT" } ] }
2. To update fields inside event_type, such as language and genre, which are themselves nested lists:
There is no direct query for this. You need to read the document, update it using JavaScript (or the language of your choice), and then save() it. I don't think there is any other way available up to Mongo 2.4.
For further documentation, you can refer to save().
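As a rough sketch of that read-modify-save() approach (the matched values "MT", "English" and "Action" are only illustrative, taken from the document above):
// Read the document, modify the nested arrays in JavaScript, then write
// the whole document back with save().
var doc = db.testnested.findOne({ _id : 2, "event_type.name" : "MT" });
doc.event_type.forEach(function (et) {
    if (et.name !== "MT") return;
    et.language.forEach(function (lang) {
        if (lang.name !== "English") return;
        var genre = lang.genre.filter(function (g) { return g.name === "Action"; })[0];
        if (genre) {
            genre.count += 1;                                // existing value: increment
        } else {
            lang.genre.push({ name : "Action", count : 1 }); // new value: insert
        }
    });
});
db.testnested.save(doc);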
Thanks!

Mongo aggregation on array elements

I have a mongo document like
{ "_id" : 12, "location" : [ "Kannur","Hyderabad","Chennai","Bengaluru"] }
{ "_id" : 13, "location" : [ "Hyderabad","Chennai","Mysore","Ballary"] }
From this, how can I get the location aggregation (distinct area count)?
Something like:
Hyderabad 2,
Kannur 1,
Chennai 2,
Bengaluru 1,
Mysore 1,
Ballary 1
Using aggregation you cannot get the exact output that you want. One of the limitations of the aggregation pipeline is its inability to transform values into keys in the output document.
For example, Kannur is one of the values of the location field in the input document. In your desired output structure it needs to be the key ("Kannur" : 1). This is not possible using aggregation. While that shape can be achieved with map-reduce (a sketch follows the sample output below), you can get a very closely related and useful structure using aggregation:
Unwind the location array.
Group by the location field and get the count of individual locations using the $sum operator.
Group all the documents once again to get a consolidated array of results.
Code:
db.collection.aggregate([
{$unwind:"$location"},
{$group:{"_id":"$location","count":{$sum:1}}},
{$group:{"_id":null,"location_details":{$push:{"location":"$_id",
"count":"$count"}}}},
{$project:{"_id":0,"location_details":1}}
])
Sample output:
{
"location_details" : [
{
"location" : "Ballary",
"count" : 1
},
{
"location" : "Mysore",
"count" : 1
},
{
"location" : "Bengaluru",
"count" : 1
},
{
"location" : "Chennai",
"count" : 2
},
{
"location" : "Hyderabad",
"count" : 2
},
{
"location" : "Kannur",
"count" : 1
}
]
}
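For the value-as-key shape mentioned above, a map-reduce sketch along these lines should work; it yields one { _id, value } document per location, which you can fold into a single hash on the client if needed:
// Rough sketch: emit each location as a key, sum the 1s in reduce.
var map = function () {
    this.location.forEach(function (loc) { emit(loc, 1); });
};
var reduce = function (key, values) {
    return Array.sum(values);
};
db.collection.mapReduce(map, reduce, { out : { inline : 1 } });
// results: { "_id" : "Hyderabad", "value" : 2 }, { "_id" : "Kannur", "value" : 1 }, ...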

How to sum in a MongoDB nested document when the KEY is uncertain?

First of all, the status codes ("200", "404", etc.) and the times ("1000", "2000", ...) are not known in advance.
I want to sum up the numbers (5, 6, ...) for each status code.
For example: {"200" : 11}, {"404" : 11} or {"total" : 22}
Data Structure :
"_id" : "xxxxx"
"domain" : "www.test.com"
"status" : [
{"200" : [ {"1000" : 5}, {"2000": 6} ...]},
{"404" : [ {"1000" : 5}, {"2000": 6} ...]}
....
]
Any fantastic methods in MongoDB ?
Thank you for your help
Don't use data, like dates, as keys. Data belongs in values. The HTTP status codes are enumerated - you know all the possibilities - so you can use those as keys if you want to. From the look of the documents, you are storing information about requests to a page in a page document with the requests in an array. It's not a great idea to have an unbounded, constantly growing array in a document. I'd suggest refactoring the data to be request documents with the address denormalized into each:
{
"_id" : ObjectId(...),
"status" : 404,
"date" : ISODate("2014-10-30T18:23:09.471Z"),
"domain" : "www.test.com"
}
and then you can get the request counts per status code for www.test.com (including the 404 total) with this aggregation:
db.requests.aggregate([
{ "$match" : { "domain" : "www.test.com" } },
{ "$group" : { "_id" : "$status", "count" : { "$sum" : 1 } } }
])
Index on domain to make it fast.
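For example (ensureIndex was the shell helper at the time; newer shells use createIndex):
db.requests.ensureIndex({ "domain" : 1 })   // supports the $match on domain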
I think you can use the aggregation framework to pull off something like that.
Check this:
db.errors.aggregate([{$unwind: "$status"}, {$group: {_id: "$status", total:{$sum:1}}}])
It will render a result like this:
...
"result" : [
{
"_id" : {
"500" : [
{
"1000" : 5
},
{
"2000" : 6
}
]
},
"total" : 1
},
...
The "total" field has the count that you're looking for.
Hope this helps.
Regards!

How can I select a number of records per a specific field using mongodb?

I have a collection of documents in mongodb, each of which have a "group" field that refers to a group that owns the document. The documents look like this:
{
group: <objectID>
name: <string>
contents: <string>
date: <Date>
}
I'd like to construct a query which returns the most recent N documents for each group. For example, suppose there are 5 groups, each of which have 20 documents. I want to write a query which will return the top 3 for each group, which would return 15 documents, 3 from each group. Each group gets 3, even if another group has a 4th that's more recent.
In the SQL world, I believe this type of query is done with "partition by" and a counter. Is there such a thing in mongodb, short of doing N+1 separate queries for N groups?
You cannot do this using the aggregation framework yet. You can get the $max or top date value for each group, but the aggregation framework does not yet have a way to accumulate the top N, and there is no way to push the entire document into the result set (only individual fields).
So you have to fall back on MapReduce. Here is something that would work, but I'm sure there are many variants (all of them require somehow sorting an array of objects based on a specific attribute; I borrowed my solution from one of the answers to this question).
The map function outputs the group name as the key and the entire rest of the document as the value, wrapped in a document containing an array, because we will accumulate an array of results per group:
map = function () {
emit(this.name, {a:[this]});
}
The reduce function accumulates all the documents belonging to the same group into one array (via concat). Note that if you optimize reduce to keep only the top five array elements by checking the date, then you won't need the finalize function and you will use less memory while running mapReduce (it will also be faster); a sketch of that variant appears after the finalize function below.
reduce = function (key, values) {
result={a:[]};
values.forEach( function(v) {
result.a = v.a.concat(result.a);
} );
return result;
}
Since I'm keeping all values for each key, I need a finalize function to pull out only the latest five elements per key.
final = function (key, value) {
Array.prototype.sortByProp = function(p){
return this.sort(function(a,b){
return (a[p] < b[p]) ? 1 : (a[p] > b[p]) ? -1 : 0;
});
}
value.a.sortByProp('date');
return value.a.slice(0,5);
}
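For reference, a sketch of the optimized reduce mentioned earlier, which trims to the newest five elements inside reduce itself so that no finalize step is needed:
// Same sorting idea as finalize, applied inside reduce; returns the same
// { a : [...] } shape so re-reduces stay correct.
reduceTop5 = function (key, values) {
    var all = [];
    values.forEach(function (v) { all = all.concat(v.a); });
    all.sort(function (a, b) {
        return (a.date < b.date) ? 1 : (a.date > b.date) ? -1 : 0;   // newest first
    });
    return { a : all.slice(0, 5) };
};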
Using a template document similar to the one you provided, you run this by calling the mapReduce command:
> db.top5.mapReduce(map, reduce, {finalize:final, out:{inline:1}})
{
"results" : [
{
"_id" : "group1",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe13"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.498Z"),
"contents" : 0.23778377776034176
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0e"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.467Z"),
"contents" : 0.4434165076818317
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe09"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.436Z"),
"contents" : 0.5935856597498059
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe04"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.405Z"),
"contents" : 0.3912118375301361
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfdff"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.372Z"),
"contents" : 0.221651989268139
}
]
},
{
"_id" : "group2",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe14"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.504Z"),
"contents" : 0.019611883210018277
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0f"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.473Z"),
"contents" : 0.5670706110540777
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0a"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.442Z"),
"contents" : 0.893193120136857
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe05"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.411Z"),
"contents" : 0.9496864483226091
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe00"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.378Z"),
"contents" : 0.013748752186074853
}
]
},
{
"_id" : "group3",
...
}
]
}
],
"timeMillis" : 15,
"counts" : {
"input" : 80,
"emit" : 80,
"reduce" : 5,
"output" : 5
},
"ok" : 1,
}
Each result has _id set to the group name and value set to an array of the five most recent documents in the collection for that group.
You need the aggregation framework's $group stage piped into a $limit stage.
You also want to $sort the records in some way, or else the limit will have undefined behaviour; the returned documents will be pseudo-random (the order used internally by Mongo).
Something like this:
db.collection.aggregate([{$group:...},{$sort:...},{$limit:...}])
Here is the documentation if you want to know more.

Listing, counting factors of unique Mongo DB values over all keys

I'm preparing a descriptive "schema" (quelle horreur) for a MongoDB I've been working with.
I used the excellent variety.js to create a list of all keys and show coverage of each key. However, in cases where a key takes only a small set of values, I'd like to be able to list the entire set as "available values." In R, I'd be thinking of these as the "factors" for the categorical variable, i.e., gender : ["M", "F"].
I know I could just use R + RMongo, query each variable, and basically do the same procedure I would to create a histogram, but I'd like to know the proper Mongo.query()/javascript/Map,Reduce way to approach this. I understand the db.collection.aggregate() functions are designed for exactly this.
Before asking this, I referenced:
http://docs.mongodb.org/manual/reference/aggregation/
http://docs.mongodb.org/manual/reference/method/db.collection.distinct/
How to query for distinct results in mongodb with python?
Get a list of all unique tags in mongodb
http://cookbook.mongodb.org/patterns/count_tags/
But I can't quite get the pipeline order right. So, for example, if I have documents like these:
{_id : 1, "key1" : "value1", "key2": "value3"}
{_id : 2, "key1" : "value2", "key2": "value3"}
I'd like to return something like:
{"key1" : ["value1", "value2"]}
{"key2" : ["value3"]}
Or better, with counts:
{"key1" : ["value1" : 1, "value2" : 1]}
{"key2" : ["value3" : 2]}
I recognize that one problem with doing this will be keys whose values span a wide range, such as text fields or continuous variables. Ideally, if there were more than x different possible values, it would be nice to truncate, say to no more than 20 unique values. If I find there are actually more, I'd query that variable directly.
Is this something like:
db.collection.aggregate(
{$limit: 20,
$group: {
_id: "$??varname",
count: {$sum: 1}
}})
First, how can I reference ??varname? for the name of each key?
I saw this link which had 95% of it:
Binning and tabulate (unique/count) in Mongo
with...
input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
This script:
db.collection.aggregate(
{$project: {gender:1}},
{$group: {
_id: "$gender",
count: {$sum: 1}
}})
Produces:
{"result" :
[
{"_id" : "m", "count" : 2},
{"_id" : "f", "count" : 3}
],
"ok" : 1
}
But what I don't understand is how I could do this generically for an unknown number of keys with unknown names and a potentially large number of return values. This sample knows the key name is gender and that the response set will be small (2 values).
If you already ran a script that outputs the names of all keys in the collection, you can generate your aggregation framework pipeline dynamically. That means either extending the variety.js-type script or just writing your own.
Here is what it might look like in JS if passed an array called "keys" containing several non-"_id" field names (I'm assuming top-level fields and that you don't care about arrays, embedded documents, etc.).
keys = ["key1", "key2"];
group = { "$group" : { "_id" : null } } ;
keys.forEach( function(f) {
group["$group"][f+"List"] = { "$addToSet" : "$" + f }; } );
db.collection.aggregate(group);
{
"result" : [
{
"_id" : null,
"key1List" : [
"value2",
"value1"
],
"key2List" : [
"value3"
]
}
],
"ok" : 1
}
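If you also want the per-value counts from the question, one possible follow-up is to run a short pipeline per key (workable for a modest number of keys); on 2.2/2.4 shells aggregate returns a document whose result field holds the output:
// Hypothetical follow-up: value counts for each key, capped at 20 values
// as suggested in the question.
keys.forEach(function (f) {
    var res = db.collection.aggregate(
        { "$group" : { "_id" : "$" + f, "count" : { "$sum" : 1 } } },
        { "$sort"  : { "count" : -1 } },
        { "$limit" : 20 }
    );
    printjson({ key : f, values : res.result });
});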