I might be a bit in over my head on this as I'm still learning the ins and outs of MongoDB, but here goes.
Right now I'm working on a tool to search/filter through a dataset, sort it by an arbitrary datapoint (e.g. popularity) and then group it by an id. The only way I can see to do this is through Mongo's MapReduce functionality.
I can't use .group() because I'm working with more than 10,000 keys, and I also need to be able to sort the dataset.
My MapReduce code is working just fine, except for one thing: sorting. Sorting just doesn't want to work at all.
db.runCommand({
    'mapreduce': 'products',
    'map': function() {
        emit({
            product_id: this.product_id,
            popularity: this.popularity
        }, 1);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) {
            sum += v;
        });
        return sum;
    },
    'query': {category_id: 20},
    'out': {inline: 1},
    'sort': {popularity: -1}
});
I already have a descending index on the popularity field, so the sort definitely isn't failing for lack of an index:
{
    "v" : 1,
    "key" : { "popularity" : -1 },
    "ns" : "app.products",
    "name" : "popularity_-1"
}
I just cannot figure out why it doesn't want to sort.
I also can't output the result set to another collection and then run a .find().sort({popularity: -1}) on it instead of inlining it, because of the way this feature is going to work.
First of all, Mongo's map/reduce is not designed to be used as a query tool (as it is in CouchDB); it is designed for running background tasks. I use it at work to analyze traffic data.
What you are doing wrong, however, is applying sort() to your input. That is useless, because when the map() stage is done, the intermediate documents are sorted by their keys. Because your key is a document, it is being sorted by product_id, then popularity.
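Roughly speaking, composite emit keys are compared field by field in the order the fields appear in the key document. A small plain-JavaScript sketch (a simplified stand-in for BSON key ordering, not Mongo's actual comparator) shows why popularity never gets a say while product_id values differ:

```javascript
// Simplified sketch of how composite emit keys are ordered:
// fields are compared in the order they appear in the key document,
// so {product_id, popularity} sorts by product_id first.
function compareKeys(a, b) {
    for (var field of Object.keys(a)) {
        if (a[field] < b[field]) return -1;
        if (a[field] > b[field]) return 1;
    }
    return 0;
}

var keys = [
    { product_id: 2, popularity: 50 },
    { product_id: 1, popularity: 10 },
    { product_id: 1, popularity: 99 }
];
keys.sort(compareKeys);
// product_id wins over popularity, so the most popular product
// is not necessarily first.
```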
This is how I generated my dataset
function generate_dummy_data() {
    for (var i = 2; i < 1000000; i++) {
        db.foobar.save({
            _id: i,
            category_id: parseInt(Math.random() * 30),
            popularity: parseInt(Math.random() * 50)
        });
    }
}
And this my map/reduce task:
var data = db.runCommand({
    'mapreduce': 'foobar',
    'map': function() {
        emit({
            sorting: this.popularity * -1,
            product_id: this._id,
            popularity: this.popularity
        }, 1);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) {
            sum += v;
        });
        return sum;
    },
    'query': {category_id: 20},
    'out': {inline: 1}
});
And this is the end result (very long to paste it here):
http://cesarodas.com/results.txt
This works because now we're sorting by sorting, product_id, popularity. You can play with the sorting however you like; just remember that the final ordering is by key, regardless of how your input is sorted.
Anyway, as I said before, you should avoid doing queries with map/reduce; it was designed for background processing. If I were you, I would design my data in such a way that I could access it with simple queries. There is always a trade-off; in this case, more complex inserts/updates in exchange for simple queries (that's how I see MongoDB).
As noted in discussion on the original question:
Map/Reduce with inline output currently cannot use an explicit sort key (see SERVER-3973). Possible workarounds include relying on the emitted key order (see #crodas's answer); outputting to a collection and querying that collection with sort order; or sorting the results in your application using something like usort().
OP's preference is for inline results rather than creating/deleting temporary collections.
The Aggregation Framework in MongoDB 2.2 (currently a production release candidate) would provide a suitable solution.
Here's an example of a similar query to the original Map/Reduce, but instead using the Aggregation Framework:
db.products.aggregate(
    { $match: { category_id: 20 }},
    { $group : {
        _id : "$product_id",
        'popularity' : { $sum : "$popularity" }
    }},
    { $sort: { 'popularity': -1 }}
)
.. and sample output:
{
    "result" : [
        {
            "_id" : 50,
            "popularity" : 139
        },
        {
            "_id" : 150,
            "popularity" : 99
        },
        {
            "_id" : 123,
            "popularity" : 55
        }
    ],
    "ok" : 1
}
Related
I have a collection in MongoDB which looks like this:
collection:
doc1
{ field1 : {
field1_1 : 'val1',
field1_2 : 'val2',
field1_3 : 'val3',
...
field1_N : 'valN' } }
doc2
{ field1 : {
field1_1 : 'val1',
field1_2 : 'val2',
field1_3 : 'val3',
...
field1_N : 'valN' } }
I want to compute aggregations (sum, avg, min, max) over val1, val2, val3 ... valN. Is there any way to use Mongo's aggregation feature? The keys are always different, and the aggregation should happen over all the values of field1.
Edited:
The final output should look like,
doc1
{ field1 : {
sum: sumOf(val1, val2... valN),
avg: avgOf(val1, val2... valN)
... } }
doc2
{ field1 : {
sum: sumOf(val1, val2... valN),
avg: avgOf(val1, val2... valN)
... } }
You can try Map-Reduce instead of aggregate for your requirement; map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.
var mapFunction = function() {
    for (var key in this.field1) {
        emit(this._id, parseInt(this.field1[key]));
    }
};

var reduceFunction = function(key, values) {
    return {sum: Array.sum(values), avg: Array.avg(values)};
};

db.getCollection('collectionName').mapReduce(mapFunction, reduceFunction, {out: {inline: 1}});
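To see what the reducer computes, here is a rough plain-JavaScript simulation of one reduce pass over the numbers emitted for a single document (assuming, as above, that each value parses to an integer):

```javascript
// Simulate the reducer for one document's emitted values.
// Note: this assumes reduce runs exactly once per key; Mongo may
// re-reduce partial results, in which case the values would be
// {sum, avg} objects rather than numbers.
function reduceFunction(key, values) {
    var sum = values.reduce(function (a, b) { return a + b; }, 0);
    return { sum: sum, avg: sum / values.length };
}

var emitted = [1, 2, 3]; // e.g. parseInt('1'), parseInt('2'), parseInt('3')
var result = reduceFunction('doc1', emitted);
```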
As far as I know, if your document structure were like this:
field1 : [
    {field1 : 'val1'},
    {field1 : 'val2'},
    {field1 : 'val3'},
    ...
    {field1 : 'valN'}]
then you could solve it easily using aggregate. So, for your structure, mapReduce may work better.
I'm not a MongoDB ninja by any means, but I have a solution for your question. It's not the most efficient one, but if you are royally stuck, you can use it.
The code below creates a projection of the average of the fields that you want to aggregate on. The downside is that you have to specify each field, rather than iterating through all fields and aggregating on them. You can definitely do such iteration using JS (I've briefly come across it in the past), but it would take a bit more tweaking than the code I've provided below.
Also, I've assumed that field1.field1_1 is an integer and not a string value.
db.random.aggregate([
    {
        $project: {
            _id: "$field1",
            avgAmount: { $avg: ["$field1.field1_1", "$field1.field1_2", "$field1.field1_3"] }
        }
    }
])
Best of luck anyway #1love.
You can try $sum for sums and $multiply for multiplication. You can find all the aggregation operators available for MongoDB at the following link:
https://docs.mongodb.com/manual/reference/operator/aggregation/sum/
Hope this helps.
In Mongo 2.6, my documents look like the few below:
nms:PRIMARY> db.checkpointstest4.find()
{ "_id" : 1, "cpu" : [ 100, 20, 60 ], "hostname" : "host1" }
{ "_id" : 2, "cpu" : [ 40, 30, 80 ], "hostname" : "host1" }
I need to find the average cpu (per cpu array index) per host, i.e. based on the two documents above, the average for host1 will be [70, 25, 70], because cpu[0] is (100+40)/2 = 70, etc.
I get lost when I have 3 array elements instead of two; see mongodb aggregate average of array elements.
Finally below worked for me:
var map = function () {
    for (var idx = 0; idx < this.cpu.length; idx++) {
        var mapped = {
            idx: idx,
            val: this.cpu[idx]
        };
        emit(this.hostname, {"cpu": mapped});
    }
};

var reduce = function (key, values) {
    var cpu = [], sum = [0, 0, 0], cnt = [0, 0, 0];
    values.forEach(function (value) {
        sum[value.cpu.idx] += value.cpu.val;
        cnt[value.cpu.idx] += 1;
        cpu[value.cpu.idx] = sum[value.cpu.idx] / cnt[value.cpu.idx];
    });
    return {"cpu": cpu};
};

db.checkpointstest4.mapReduce(map, reduce, {out: "checkpointstest4_result"});
In MongoDB 3.2, where includeArrayIndex was introduced, you can do this:
db.test.aggregate(
{$unwind: {path:"$cpu", includeArrayIndex:"index"}},
{$group: {_id:{h:"$hostname",i:"$index"}, cpu:{$avg:"$cpu"}}},
{$sort:{"_id.i":1}},
{$group:{_id:"$_id.h", cpu:{$push:"$cpu"}}}
)
// Make a row for each array element with an index field added.
{$unwind: {path:"$cpu", includeArrayIndex:"index"}},
// Group by hostname+index, calculate average for each group.
{$group: {_id:{h:"$hostname",i:"$index"}, cpu:{$avg:"$cpu"}}},
// Sort by index (to get the array in the next step sorted correctly)
{$sort:{"_id.i":1}},
// Group by host, pushing the averages into an array in order.
{$group:{_id:"$_id.h", cpu:{$push:"$cpu"}}}
Upgrading would be your best option, as mentioned, with includeArrayIndex available to $unwind from MongoDB 3.2 onwards.
If you cannot do that, then you can always process with mapReduce instead:
db.checkpointstest4.mapReduce(
    function() {
        var mapped = this.cpu.map(function(val) {
            return { "val": val, "cnt": 1 };
        });
        emit(this.hostname, { "cpu": mapped });
    },
    function(key, values) {
        var cpu = [];
        values.forEach(function(value) {
            value.cpu.forEach(function(item, idx) {
                if (cpu[idx] == undefined)
                    cpu[idx] = { "val": 0, "cnt": 0 };
                cpu[idx].val += item.val;
                cpu[idx].cnt += item.cnt;
            });
        });
        return { "cpu": cpu };
    },
    {
        "out": { "inline": 1 },
        "finalize": function(key, value) {
            return {
                "cpu": value.cpu.map(function(cpu) {
                    return cpu.val / cpu.cnt;
                })
            };
        }
    }
)
So the steps there are: in the "mapper" function, transform the array content into an array of objects containing the "value" from each element and a "count" for later reference, as input to the "reduce" function. This needs to be consistent with how the reducer will work with the data, and is necessary to get the overall counts needed for the average.
In the "reducer" itself, you are basically summing the array contents at each position, for both the "value" and the "count". This is important because the "reduce" function can be called multiple times in the overall reduction process, feeding its output back as "input" in a subsequent call. That is why both mapper and reducer work with this format.
With the final reduced results, the finalize function is called to simply look at each summed "value" and "count" and divide the value by the count to return an average.
Mileage may vary on whether modern aggregation pipeline processing or this mapReduce process will perform best, mostly depending on the data. Using $unwind in the prescribed way will certainly increase the number of documents to be analyzed and thus produce overhead. On the other hand, JavaScript processing, as opposed to native operators in the aggregation framework, is generally slower, but the document-processing overhead here is reduced since the arrays are kept intact.
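The val/cnt pattern can be exercised in plain JavaScript to see why reduce output must be valid reduce input; reducing everything at once and reducing incrementally give the same result:

```javascript
// The reducer from the mapReduce above, as a standalone function.
function reduceCpu(key, values) {
    var cpu = [];
    values.forEach(function (value) {
        value.cpu.forEach(function (item, idx) {
            if (cpu[idx] === undefined) cpu[idx] = { val: 0, cnt: 0 };
            cpu[idx].val += item.val;
            cpu[idx].cnt += item.cnt;
        });
    });
    return { cpu: cpu };
}

var a = { cpu: [{ val: 100, cnt: 1 }, { val: 20, cnt: 1 }] };
var b = { cpu: [{ val: 40, cnt: 1 }, { val: 30, cnt: 1 }] };

// Reducing everything at once...
var once = reduceCpu('host1', [a, b]);
// ...matches reducing incrementally, because the output shape
// ({val, cnt} per index) is itself a valid input value.
var twice = reduceCpu('host1', [reduceCpu('host1', [a]), b]);
```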
The advice I would give: use this if upgrading to 3.2 is not an option; if upgrading is an option, then at least benchmark the two approaches on your data and expected growth to see which works best for you.
Returns
{
    "results" : [
        {
            "_id" : "host1",
            "value" : {
                "cpu" : [ 70, 25, 70 ]
            }
        }
    ],
    "timeMillis" : 38,
    "counts" : {
        "input" : 2,
        "emit" : 2,
        "reduce" : 1,
        "output" : 1
    },
    "ok" : 1
}
Say I have the following four documents in a collection called "Store":
{ item: 'chair', modelNum: 1154, votes: 75 }
{ item: 'chair', modelNum: 1152, votes: 16 }
{ item: 'table', modelNum: 1017, votes: 24 }
{ item: 'table', modelNum: 1097, votes: 52 }
I would like to find only the documents with the highest number of votes for each item type.
The result of this simple example would return modelNum: 1154 and modelNum: 1097, showing me the most popular model of chair and table based on the customer-entered vote score.
What is the best way to write this query and sort the results by votes in descending order? I'm developing with Meteor, but I don't think that should have an impact.
Store.find({????}).sort({votes: -1});
You can use $first or $last aggregation operators to achieve what you want. These operators are only useful when $group follows $sort. An example using $first:
db.collection.aggregate([
// Sort by "item" ASC, "votes" DESC
{"$sort" : {item : 1, votes : -1}},
// Group by "item" and pick the first "modelNum" (which will have the highest votes)
{"$group" : {_id : "$item", modelNum : {"$first" : "$modelNum"}}}
])
Here's the output:
{
"result" : [
{
"_id" : "table",
"modelNum" : 1097
},
{
"_id" : "chair",
"modelNum" : 1154
}
],
"ok" : 1
}
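A rough plain-JavaScript sketch of what $sort followed by $group/$first computes over the four sample documents:

```javascript
var docs = [
    { item: 'chair', modelNum: 1154, votes: 75 },
    { item: 'chair', modelNum: 1152, votes: 16 },
    { item: 'table', modelNum: 1017, votes: 24 },
    { item: 'table', modelNum: 1097, votes: 52 }
];

// Sort by item ASC, votes DESC (the $sort stage)...
docs.sort(function (a, b) {
    if (a.item !== b.item) return a.item < b.item ? -1 : 1;
    return b.votes - a.votes;
});

// ...then keep the first modelNum seen per item (the $group/$first stage).
var top = {};
docs.forEach(function (d) {
    if (!(d.item in top)) top[d.item] = d.modelNum;
});
// top → { chair: 1154, table: 1097 }
```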
If you are looking to do this in Meteor and on the client, I would just use an each loop and a basic find. Minimongo keeps the data in memory, so I don't think additional find calls are expensive. Like this:
Template.itemsList.helpers({
items: function(){
var itemNames = Store.find({}, {fields: {item: 1}}).map(
function( item ) { return item.item; }
);
var itemsMostVotes = _.uniq( itemNames ).map(
function( item ) {
return Store.findOne({item: item}, {sort: {votes: -1}});
}
);
return itemsMostVotes;
}
});
I have switched to findOne, so this returns an array of objects rather than a cursor, as find would. If you really want the cursor, you could query Minimongo with the _ids from itemsMostVotes.
You could also use the underscore groupBy and sortBy functions to do this.
You would need to use the aggregation framework.
So
db.Store.aggregate(
{$group:{_id:"$item", "maxVotes": {$max:"$votes"}}}
);
I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of documents):
First attempt
(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)
db.myCollection.distinct("myIndexedNonUniqueField").length
Obviously I got this error as my dataset is huge
Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
"errmsg" : "exception: distinct too big, 16mb cap",
"code" : 10044,
"ok" : 0
}
Second attempt
I decided to try and do a group
db.myCollection.group({key: {myIndexedNonUniqueField: 1},
initial: {count: 0},
reduce: function (obj, prev) { prev.count++;} } );
But I got this error message instead:
exception: group() can't handle more than 20000 unique keys
Third attempt
I haven't tried yet but there are several suggestions that involve mapReduce
e.g.
this one how to do distinct and group in mongodb? (not accepted, answer author / OP didn't test it)
this one MongoDB group by Functionalities (seems similar to Second Attempt)
this one http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
this one https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
this one http://cookbook.mongodb.org/patterns/unique_items_map_reduce/
Also
It seems there is a pull request on GitHub fixing the .distinct method so that it can return just a count, but it's still open: https://github.com/mongodb/mongo/pull/34
But at this point I thought it worth asking here: what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts, or is there an efficient way?
Update:
This comment on the MongoDB official docs is not encouraging, is this accurate?
http://www.mongodb.org/display/DOCS/Aggregation#comment-430445808
Update2:
Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)
http://docs.mongodb.org/manual/applications/aggregation/
1) The easiest way to do this is via the aggregation framework. This takes two $group stages: the first groups by the distinct values, the second counts all of the distinct values.
pipeline = [
    { $group: { _id: "$myIndexedNonUniqueField" } },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

//
// Run the aggregation command
//
R = db.runCommand({
    "aggregate": "myCollection",
    "pipeline": pipeline
});
printjson(R);
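What the two $group stages compute can be sketched in plain JavaScript: the first stage collapses to the distinct values, the second counts them.

```javascript
// Stand-in for the field values across a collection.
var values = ['a', 'b', 'a', 'c', 'b', 'a'];

// Stage 1: collapse to distinct values (what the first $group does).
var distinct = values.filter(function (v, i) {
    return values.indexOf(v) === i;
});

// Stage 2: count the distinct values (what the second $group does).
var count = distinct.length;
// count → 3
```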
2) If you want to do this with map/reduce, you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key; in the second we do a count() on the new collection.
var SOURCE = db.myCollection;
var DEST = db.distinct;
DEST.drop();

map = function() {
    emit(this.myIndexedNonUniqueField, {count: 1});
};

reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v['count']; // count each distinct value
    });
    return {count: count};
};

//
// run map/reduce
//
res = SOURCE.mapReduce(map, reduce, {
    out: 'distinct',
    verbose: true
});

print("distinct count= " + res.counts.output);
print("distinct count=", DEST.count());
Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
db.myCollection.aggregate(
    { $group: { _id: "$myIndexedNonUniqueField" } },
    { $group: { _id: 1, count: { $sum: 1 } } }
);
straight to result:
db.myCollection.aggregate(
    { $group: { _id: "$myIndexedNonUniqueField" } },
    { $group: { _id: 1, count: { $sum: 1 } } }
).result[0].count;
The following solution worked for me:
db.test.distinct('user');
[ "alex", "England", "France", "Australia" ]
db.countries.distinct('country').length
4
I have the following structure of a mongo document:
{
    "_id": ObjectId("4fba2558a0787e53320027eb"),
    "replies": {
        "0": {
            "email": ObjectId("4fb89a181b3129fe2d000000"),
            "sentDate": "2012-05-21T11:22:01.418Z"
        },
        "1": {
            "email": ObjectId("4fb89a181b3129fe2d000000"),
            "sentDate": "2012-05-21T11:22:01.418Z"
        },
        "2": ...
    }
}
How do I count all the replies from all the documents in the collection?
Thank you!
In the following answer, I'm working with a simple data set with five replies across the collection:
> db.foo.find()
{ "_id" : ObjectId("4fba6b0c7c32e336fc6fd7d2"), "replies" : [ 1, 2, 3 ] }
{ "_id" : ObjectId("4fba6b157c32e336fc6fd7d3"), "replies" : [ 1, 2 ] }
Since we're not simply counting documents, db.collection.count() won't help us here. We'll need to resort to MapReduce to scan each document and aggregate the reply array lengths. Consider the following:
db.foo.mapReduce(
    function() { emit('totalReplies', { count: this.replies.length }); },
    function(key, values) {
        var result = { count: 0 };
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    },
    { out: { inline: 1 }}
);
The map function (first argument) runs across the entire collection and emits the number of replies in each document under a constant key. Mongo will then consider all emitted values and run the reduce function (second argument) a number of times to consolidate (literally reduce) the result. Hopefully the code here is straightforward. If you're new to map/reduce, one caveat is that the reduce method must be capable of processing its own output. This is explained in detail in the MapReduce docs linked above.
Note: if your collection is quite large, you may have to use another output mode (e.g. collection output); however, inline works well for small data sets.
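That caveat is easy to demonstrate in plain JavaScript: because the reducer's output ({ count: N }) has the same shape as its input values, partial results can safely be fed back in.

```javascript
// The reducer from the mapReduce above, as a standalone function.
function reduceReplies(key, values) {
    var result = { count: 0 };
    values.forEach(function (value) {
        result.count += value.count;
    });
    return result;
}

// A partial result can be mixed with fresh emitted values...
var partial = reduceReplies('totalReplies', [{ count: 3 }]);
// ...and the final count still comes out right.
var total = reduceReplies('totalReplies', [partial, { count: 2 }]);
// total.count → 5
```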
Lastly, if you're using MongoDB 2.1+, we can take advantage of the Aggregation Framework to avoid writing JS functions and make this even easier:
db.foo.aggregate(
    { $project: { replies: 1 }},
    { $unwind: "$replies" },
    { $group: {
        _id: "result",
        totalReplies: { $sum: 1 }
    }}
);
Three things are happening here. First, we tell Mongo that we're interested in the replies field. Secondly, we want to unwind the array so that we can iterate over all elements across the fields in our projection. Lastly, we'll tally up results under a "result" bucket (any constant will do), adding 1 to the totalReplies result for each iteration. Executing this query will yield the following result:
{
    "result" : [{
        "_id" : "result",
        "totalReplies" : 5
    }],
    "ok" : 1
}
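The unwind-then-count logic can also be sketched in plain JavaScript over the two sample documents:

```javascript
// The two sample documents from the answer.
var docs = [
    { replies: [1, 2, 3] },
    { replies: [1, 2] }
];

// $unwind turns each array element into its own document...
var unwound = [];
docs.forEach(function (d) {
    d.replies.forEach(function (r) {
        unwound.push({ replies: r });
    });
});

// ...and $group with { $sum: 1 } counts the unwound documents.
var totalReplies = unwound.length;
// totalReplies → 5
```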
Although I wrote the above answers with respect to the Mongo client, you should have no trouble translating them to PHP. You'll need to use MongoDB::command() to run either MapReduce or aggregation queries, as the PHP driver currently has no helper methods for either. There's currently a MapReduce example in the PHP docs, and you can reference this Google group post for executing an aggregation query through the same method.
I haven't checked your code; it might work as well. I did the following and it just works:
$replies = $db->command(
    array(
        "distinct" => "foo",
        "key" => "replies"
    )
);
$all = count($replies['values']);
I did it again using the group command of the PHP Mongo driver. It's similar to a MapReduce command.
$keys = array("replies.type" => 1); //keys for group by
$initial = array("count" => 0); //initial value of the counter
$reduce = "function (obj, prev) { prev.count += obj.replies.length; }";
$condition = array('replies' => array('$exists' => true), 'replies.type' => 'follow');
$g = $db->foo->group($keys, $initial, $reduce, $condition);
echo $g['count'];
Thanks jmikola for giving links to Mongo.
The JSON should be:
{
    "_id": ObjectId("4fba2558a0787e53320027eb"),
    "replies": [
        {
            "email": ObjectId("4fb89a181b3129fe2d000000"),
            "sentDate": "2012-05-21T11:22:01.418Z"
        },
        {
            "email": ObjectId("4fb89a181b3129fe2d000000"),
            "sentDate": "2012-05-21T11:22:01.418Z"
        },
        ...
    ]
}