MongoDB MapReduce: second argument of reduce function is a multidimensional array

I tried to use mapReduce on my collection. Just for debugging I returned the vals value passed as the second argument to the reduce function, like this:
db.runCommand({
    "mapreduce": "MyCollection",
    "map": function() {
        emit({
            country_code: this.cc,
            partner: this.di,
            registeredPeriod: Math.floor((this.ca - 1399240800) / 604800)
        },
        {
            count: Math.ceil((this.lla - this.ca) / 86400)
        });
    },
    "reduce": function(k, vals) {
        return {
            'count': vals
        };
    },
    "query": {
        "ca": { "$gte": 1399240800 },
        "di": 405,
        "cc": "1"
    },
    "out": { "inline": true }
});
And I got a result like this:
{
    "results" : [
        {
            "_id" : {
                "country_code" : "1",
                "partner" : 405,
                "registeredPeriod" : 0
            },
            "value" : {
                "count" : [
                    { "count" : 37 },
                    { "count" : 38 }
                ]
            }
        },
        {
            "_id" : {
                "country_code" : "1",
                "partner" : 405,
                "registeredPeriod" : 1
            },
            "value" : {
                "count" : 36
            }
        },
        {
            "_id" : {
                "country_code" : "1",
                "partner" : 405,
                "registeredPeriod" : 4
            },
            "value" : {
                "count" : [
                    {
                        "count" : [
                            { "count" : 16 },
                            { "count" : 16 }
                        ]
                    },
                    { "count" : 15 }
                ]
            }
        }
    ],
    "timeMillis" : 38,
    "counts" : {
        "input" : 130,
        "emit" : 130,
        "reduce" : 5,
        "output" : 6
    },
    "ok" : 1
}
I really don't know why I got a multidimensional array as the second argument to my reduce function. I mean this part of the result:
{
    "_id" : {
        "country_code" : "1",
        "partner" : 405,
        "registeredPeriod" : 4
    },
    "value" : {
        "count" : [
            {
                "count" : [ // <= Why is this multidimensional?
                    {
                        "count" : 16
                    }
Why is this multidimensional? And why is the key of the embedded array the same as the one returned from the reduce function?

The reason is that this is how mapReduce works. From the documentation:
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
And at a later point:
the type of the return object must be identical to the type of the value emitted by the map function to ensure that the following operations are true:
So even though you have not "changed the signature" as that documentation warns about, you are still only processing n items at once in one reduce pass, and then another n items in the next pass. What happens in the eventual processing is that the array returned by one fragment is combined with the array from another fragment.
So your reduce returns an array, but it is not "all" of the items you emitted for the key, just some of them. Then another reduce on the same "key" processes more items. Finally those two arrays (or probably more) are sent to reduce once again, in an attempt to actually "reduce" those items as intended.
That is the general concept, so it is no surprise that when you just push back the array, that is what you get.
Short version: mapReduce processes the output "keys" in chunks and not all at once. Better to learn that now, before it becomes a problem for you later.
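To make that concrete, here is a minimal sketch of a reduce that stays safe under re-reduce by returning exactly the shape the map emits (summing the counts is an assumption about what you ultimately want to compute):
"reduce": function(k, vals) {
    // vals may contain raw emitted values and/or the output of earlier
    // reduce passes for this key, so both must share the same shape
    var total = 0;
    vals.forEach(function(v) {
        total += v.count;
    });
    return { count: total };
}
Because every input and every output is a { count: <number> } document, it no longer matters how many passes the engine makes or how it chunks the values.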

Related

MongoDB: How to update a nested array value after checking if greater than zero?

I'm able to decrease a value in a nested array, but I want to check that the value is greater than zero first (so it never goes negative). For example, I have this:
{
    "_id" : ObjectId("5a3b0bd69c0000c2a1d839af"),
    "slots" : [
        {
            "id" : "V2qlAEk7Wp0tWwlyWSfX7KRZ",
            "number" : 5.0
        },
        .......
        .......
        {
            "id" : "VfB4f8G1KcgRA8qx0aby5nI0",
            "number" : 0.0
        }
    ]
}
{
    "_id" : ObjectId("5a3b0bd69c0000c2a1d839ag"),
    "slots" : [
        {
            "id" : "V2qlAEEESbrEB4bwberbResbd",
            "number" : 10.0
        },
        .......
        .......
        {
            "id" : "DFwseEb5enRbfsbre54rtFfds",
            "number" : 1.0
        }
    ]
}
If I want to decrease the number value of slots.id = V2qlAEk7Wp0tWwlyWSfX7KRZ in the first document [_id=5a3b0bd69c0000c2a1d839af], I use:
db.getCollection('schedulers').update(
    {
        "_id" : ObjectId("5a3b0bd69c0000c2a1d839af"),
        "slots.id" : "V2qlAEk7Wp0tWwlyWSfX7KRZ"
    },
    {
        "$inc" : { "slots.$.number" : -1 }
    })
But I don't know how to check that the number is greater than 0 before decreasing the value. I tried to use the filtered positional operator $[<identifier>], but I get an error:
db.getCollection('schedulers').update(
    {
        "_id" : ObjectId("5a3b0bd69c0000c2a1d839af"),
        "slots.id" : "V2qlAEk7Wp0tWwlyWSfX7KRZ"
    },
    {
        "$inc" : { "slots.$[elm].number" : -1 }
    },
    {
        arrayFilters: [ { "elm.number" : { "$gt" : 0 } } ]
    })
The error is:
cannot use the part (slots of slots.$[elm].number) to traverse the element....
Also, applying the condition in the first part of the query doesn't fix the problem, because mongo traverses the whole array: if the matched element's number is zero, it simply moves on to another element in the array until it finds number > 0, and when it finds one it starts to decrease the number value of a completely different element, ignoring the two id conditions [because the $ then points to another element of the array]:
db.getCollection('schedulers').update(
    {
        "_id" : ObjectId("5a3b0bd69c0000c2a1d839af"),
        "slots.id" : "V2qlAEk7Wp0tWwlyWSfX7KRZ",
        "slots.number" : { "$gt" : 0 }
    },
    {
        "$inc" : { "slots.$.number" : -1 }
    })
I think the correct way to obtain what I want is to use arrayFilters, but I don't understand what the problem is.
Fixed using:
db.getCollection('schedulers').update(
    {
        "_id" : ObjectId("5a3b0bd69c0000c2a1d839af"),
        "slots" : {
            "$elemMatch" : {
                "id" : "V2qlAEk7Wp0tWwlyWSfX7KRZ",
                "number" : { "$gt" : 0 }
            }
        }
    },
    {
        "$inc" : { "slots.$.number" : -1 }
    })
Thanks
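For completeness: on MongoDB 3.6+ the arrayFilters route also works if both conditions are moved into the filter identifier itself, so the $[elm] target carries its own guard (a sketch, not verified against the original error):
db.getCollection('schedulers').update(
    {
        "_id" : ObjectId("5a3b0bd69c0000c2a1d839af")
    },
    {
        "$inc" : { "slots.$[elm].number" : -1 }
    },
    {
        // elm matches only elements with the right id AND number > 0,
        // so nothing is decremented once that slot reaches zero
        arrayFilters: [ { "elm.id" : "V2qlAEk7Wp0tWwlyWSfX7KRZ", "elm.number" : { "$gt" : 0 } } ]
    })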

mongodb: fetch hundreds of data points out of millions

In my database, I have millions of documents. Each of them has a timestamp, and some share the same timestamp. I want to get some points (a few hundred, or potentially more like a few thousand) to draw a graph. I don't want all the points; for every n points I want to pick just 1. I know there's the aggregation framework and I tried that, but the problem is that my data is huge: when I do the aggregation work, the result easily exceeds the 16MB maximum document size. There's also a skip function in mongodb, but it only skips the first n documents. Are there good ways to achieve what I want? Or is there a way to make the aggregation result bigger? Thanks in advance!
I'm not sure how you can do this with either A/F or M/R - just skipping so that you have (e.g.) each 10th point is not something M/R allows you to do - unless you select each point based on a random value with a 10% chance... which is probably not what you want. But that does work:
db.output.drop();
db.so.find().count();
map = function() {
    // Math.random() returns 0-1, so < 0.1 means a 10% sample
    if (Math.random() < 0.1) {
        emit(this._id, this);
    }
}
reduce = function(key, values) {
    // keys are unique _ids, so reduce is never actually invoked here
    return values;
}
db.so.mapReduce(map, reduce, { out: 'output' });
db.output.find();
Which outputs something like:
{
    "result" : "output",
    "timeMillis" : 4,
    "counts" : {
        "input" : 23,
        "emit" : 3,
        "reduce" : 0,
        "output" : 3
    },
    "ok" : 1
}
> db.output.find();
{ "_id" : ObjectId("51ffc4bc16473d7b84172d85"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d85"), "date" : ISODate("2013-08-05T15:24:45Z") } }
{ "_id" : ObjectId("51ffc75316473d7b84172d8e"), "value" : { "_id" : ObjectId("51ffc75316473d7b84172d8e") } }
{ "_id" : ObjectId("51ffc75316473d7b84172d8f"), "value" : { "_id" : ObjectId("51ffc75316473d7b84172d8f") } }
or:
> db.so.mapReduce( map, reduce, { out: 'output' } );
{
    "result" : "output",
    "timeMillis" : 19,
    "counts" : {
        "input" : 23,
        "emit" : 2,
        "reduce" : 0,
        "output" : 2
    },
    "ok" : 1
}
> db.output.find();
{ "_id" : ObjectId("51ffc4bc16473d7b84172d83"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d83"), "date" : ISODate("2013-08-05T15:24:25Z") } }
{ "_id" : ObjectId("51ffc4bc16473d7b84172d86"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d86"), "date" : ISODate("2013-08-05T15:25:15Z") } }
Depending on a random factor.
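As a side note for newer deployments: MongoDB 3.2+ has a $sample aggregation stage that does the random pick server-side, and returning the result as a cursor avoids the 16MB inline-result limit. A sketch, assuming the collection is still called so and the timestamp field is named date (both names are assumptions):
db.so.aggregate([
    // optionally restrict the time window first
    { $match: { date: { $gte: ISODate("2013-01-01") } } },
    // pull roughly the number of points you want to plot
    { $sample: { size: 500 } },
    // put the sampled points back in time order for graphing
    { $sort: { date: 1 } }
])
Unlike the Math.random() trick, $sample returns an exact sample size rather than one that fluctuates around a percentage.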

MongoDB MapReduce producing different results for each document

This is a follow-up from this question, where I tried to solve this problem with the aggregation framework. Unfortunately, I have to wait before being able to update this particular mongodb installation to a version that includes the aggregation framework, so I have had to use MapReduce for this fairly simple pivot operation.
I have input data in the format below, with multiple daily dumps:
"_id" : "daily_dump_2013-05-23",
"authors_who_sold_books" : [
{
"id" : "Charles Dickens",
"original_stock" : 253,
"customers" : [
{
"time_bought" : 1368627290,
"customer_id" : 9715923
}
]
},
{
"id" : "JRR Tolkien",
"original_stock" : 24,
"customers" : [
{
"date_bought" : 1368540890,
"customer_id" : 9872345
},
{
"date_bought" : 1368537290,
"customer_id" : 9163893
}
]
}
]
}
I'm after output in the following format, which aggregates across all instances of each (unique) author across all daily dumps:
{
    "_id" : "Charles Dickens",
    "original_stock" : 253,
    "customers" : [
        {
            "date_bought" : 1368627290,
            "customer_id" : 9715923
        },
        {
            "date_bought" : 1368622358,
            "customer_id" : 9876234
        },
        etc...
    ]
}
I have written this map function...
function map() {
    for (var i in this.authors_who_sold_books) {
        author = this.authors_who_sold_books[i];
        emit(author.id, { customers: author.customers, original_stock: author.original_stock, num_sold: 1 });
    }
}
...and this reduce function.
function reduce(key, values) {
    sum = 0;
    for (i in values) {
        sum += values[i].customers.length;
    }
    return { num_sold: sum };
}
However, this gives me the following output:
{
    "_id" : "Charles Dickens",
    "value" : {
        "customers" : [
            {
                "date_bought" : 1368627290,
                "customer_id" : 9715923
            },
            {
                "date_bought" : 1368622358,
                "customer_id" : 9876234
            }
        ],
        "original_stock" : 253,
        "num_sold" : 1
    }
}
{ "_id" : "JRR Tolkien", "value" : { "num_sold" : 3 } }
{
    "_id" : "JK Rowling",
    "value" : {
        "customers" : [
            {
                "date_bought" : 1368627290,
                "customer_id" : 9715923
            },
            {
                "date_bought" : 1368622358,
                "customer_id" : 9876234
            }
        ],
        "original_stock" : 183,
        "num_sold" : 1
    }
}
{ "_id" : "John Grisham", "value" : { "num_sold" : 2 } }
The even-indexed documents have the customers and original_stock listed, but an incorrect sum for num_sold.
The odd-indexed documents only have num_sold listed, but it is the correct number.
Could anyone tell me what it is I'm missing, please?
Your problem is due to the fact that the format of the output of the reduce function must be identical to the format of the value emitted by the map function (see the requirements for the reduce function for an explanation).
You need to change the code to something like the following to fix the problem:
function map() {
    for (var i in this.authors_who_sold_books) {
        author = this.authors_who_sold_books[i];
        emit(author.id, { customers: author.customers, original_stock: author.original_stock, num_sold: author.customers.length });
    }
}
function reduce(key, values) {
    // the returned object has the same shape as the emitted values,
    // so it is safe to feed back into another reduce pass
    var result = { customers: [], num_sold: 0, original_stock: (values.length ? values[0].original_stock : 0) };
    for (i in values) {
        result.num_sold += values[i].num_sold;
        result.customers = result.customers.concat(values[i].customers);
    }
    return result;
}
I hope that helps.
Note the change to num_sold: author.customers.length in the map function; I think that's what you want.
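A hedged way to run the fixed functions, assuming the daily dumps live in a collection named dumps (the question never states the collection name):
// inline output keeps the example self-contained; use out: "authors_pivot"
// instead to materialize the pivot into a collection
db.dumps.mapReduce(map, reduce, { out: { inline: 1 } });
Every result should now carry customers, original_stock, and a num_sold equal to customers.length, regardless of how many reduce passes ran.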

How can I select a number of records per a specific field using mongodb?

I have a collection of documents in mongodb, each of which has a "group" field that refers to the group that owns the document. The documents look like this:
{
    group: <objectID>,
    name: <string>,
    contents: <string>,
    date: <Date>
}
I'd like to construct a query which returns the most recent N documents for each group. For example, suppose there are 5 groups, each of which have 20 documents. I want to write a query which will return the top 3 for each group, which would return 15 documents, 3 from each group. Each group gets 3, even if another group has a 4th that's more recent.
In the SQL world, I believe this type of query is done with "partition by" and a counter. Is there such a thing in mongodb, short of doing N+1 separate queries for N groups?
You cannot do this using the aggregation framework yet - you can get the $max or top date value for each group, but the aggregation framework does not yet have a way to accumulate the top N, and there is no way to push the entire document into the result set (only individual fields).
So you have to fall back on MapReduce. Here is something that would work, but I'm sure there are many variants (all require somehow sorting an array of objects based on a specific attribute; I borrowed my solution from one of the answers to this question).
Map function - outputs the group name as the key and the entire rest of the document as the value, wrapped in a document containing an array, because we will accumulate an array of results per group:
map = function () {
    emit(this.name, { a: [this] });
}
The reduce function accumulates all the documents belonging to the same group into one array (via concat). Note that if you optimize reduce to keep only the top five array elements by checking date, then you won't need the finalize function and you will use less memory while running mapReduce (it will also be faster) - a sketch of that variant follows the example output below.
reduce = function (key, values) {
    result = { a: [] };
    values.forEach(function (v) {
        result.a = v.a.concat(result.a);
    });
    return result;
}
Since I'm keeping all values for each key, I need a finalize function to pull out only the latest five elements per key.
final = function (key, value) {
    // sort descending by the given property, newest first
    Array.prototype.sortByProp = function (p) {
        return this.sort(function (a, b) {
            return (a[p] < b[p]) ? 1 : (a[p] > b[p]) ? -1 : 0;
        });
    }
    value.a.sortByProp('date');
    return value.a.slice(0, 5);
}
Using a template document similar to the one you provided, you run this by calling the mapReduce command:
> db.top5.mapReduce(map, reduce, {finalize: final, out: {inline: 1}})
{
    "results" : [
        {
            "_id" : "group1",
            "value" : [
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe13"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.498Z"),
                    "contents" : 0.23778377776034176
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe0e"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.467Z"),
                    "contents" : 0.4434165076818317
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe09"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.436Z"),
                    "contents" : 0.5935856597498059
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe04"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.405Z"),
                    "contents" : 0.3912118375301361
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfdff"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.372Z"),
                    "contents" : 0.221651989268139
                }
            ]
        },
        {
            "_id" : "group2",
            "value" : [
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe14"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.504Z"),
                    "contents" : 0.019611883210018277
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe0f"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.473Z"),
                    "contents" : 0.5670706110540777
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe0a"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.442Z"),
                    "contents" : 0.893193120136857
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe05"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.411Z"),
                    "contents" : 0.9496864483226091
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe00"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.378Z"),
                    "contents" : 0.013748752186074853
                }
            ]
        },
        {
            "_id" : "group3",
            ...
        }
    ],
    "timeMillis" : 15,
    "counts" : {
        "input" : 80,
        "emit" : 80,
        "reduce" : 5,
        "output" : 5
    },
    "ok" : 1
}
Each result has _id as the group name and value as an array of the most recent five documents from the collection for that group name.
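As mentioned above, here is a sketch of the optimized variant that trims inside reduce itself, so no finalize is needed and each pass keeps at most five elements per group (the comparator is inlined rather than patched onto Array.prototype; I have not benchmarked it):
reduce = function (key, values) {
    result = { a: [] };
    values.forEach(function (v) {
        result.a = v.a.concat(result.a);
    });
    // sort newest first and keep only the top five, so the partial
    // result stays small and remains valid input for a re-reduce
    result.a.sort(function (a, b) {
        return (a.date < b.date) ? 1 : (a.date > b.date) ? -1 : 0;
    });
    result.a = result.a.slice(0, 5);
    return result;
}
Without the finalize step the output value is { a: [...] } rather than a bare array, so adjust any downstream consumers accordingly.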
You need the aggregation framework: a $group stage piped into a $limit stage...
You also want to $sort the records in some way, or else the limit will have undefined behaviour; the returned documents will be pseudo-random (the order used internally by mongo).
Something like that:
db.collection.aggregate([{$group:...},{$sort:...},{$limit:...}])
The documentation is here if you want to know more.
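On newer servers (3.2+), where $push accepts $$ROOT and $slice is available as an aggregation expression, a hedged sketch that actually returns the top N per group rather than a single global limit:
db.collection.aggregate([
    // newest first, so the first documents pushed per group are the most recent
    { $sort: { date: -1 } },
    { $group: { _id: "$group", docs: { $push: "$$ROOT" } } },
    // keep only the top 3 per group
    { $project: { docs: { $slice: ["$docs", 3] } } }
])
This can be memory-hungry, since each group accumulates all of its documents before the slice, so treat it as a sketch rather than a production recipe.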

Mongodb Map/Reduce - Multiple Group By

I am trying to run a map/reduce function in mongodb where I group by 3 different fields contained in objects in my collection. I can get the map/reduce function to run, but all the emitted fields run together in the output collection. I'm not sure whether this is normal, but outputting the data for analysis takes more work to clean up. Is there a way to separate them and then use mongoexport?
Let me show you what I mean:
The fields I am trying to group by are the day, user ID (or uid) and destination.
I run these functions:
map = function() {
    var day = (this.created_at.getFullYear() + "-" + (this.created_at.getMonth() + 1) + "-" + this.created_at.getDate());
    emit({ day: day, uid: this.uid, destination: this.destination }, { count: 1 });
}
/* Reduce Function */
reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v['count'];
    });
    return { count: count };
}
/* Output Function */
db.events.mapReduce(map, reduce, {query: {destination: {$ne:null}}, out: "TMP"});
The output looks like this:
{ "_id" : { "day" : "2012-4-9", "uid" : "1234456", "destination" : "Home" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "2345678", "destination" : "Home" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "3456789", "destination" : "Login" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "4567890", "destination" : "Contact" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "5678901", "destination" : "Help" }, "value" : { "count" : 1 } }
When I attempt to use mongoexport, I cannot separate day, uid, or destination into columns because the map combines the fields together.
What I would like to have would look like this:
{ { "day" : "2012-4-9" }, { "uid" : "1234456" }, { "destination" : "Home"}, { "count" : 1 } }
Is this even possible?
As an aside - I was able to make the output work by applying sed to the file and cleaning up the CSV. More work, but it worked. It would be ideal if I could get it out of mongodb in the correct format.
MapReduce only returns documents of the form {_id:some_id, value:some_value}
see: How to change the structure of MongoDB's map-reduce results?
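If a little post-processing inside mongodb is acceptable, one workaround (a sketch; TMP_flat is a hypothetical collection name) is to flatten the TMP collection before exporting, so each grouped field becomes a top-level column:
db.TMP.find().forEach(function(doc) {
    // lift the compound key and the reduced value to top level
    db.TMP_flat.insert({
        day: doc._id.day,
        uid: doc._id.uid,
        destination: doc._id.destination,
        count: doc.value.count
    });
});
Then mongoexport against TMP_flat with -f day,uid,destination,count (plus --csv on older releases, --type=csv on newer ones) can emit the columns directly, without the sed cleanup.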