MongoDB - reduce function does not work properly - mongodb

My map function returns key-value pairs where key is the name of a field and the value is an object {type: <field type>, count : 1}.
For example suppose I have these documents:
{
"_id" : ObjectId("57611ad6bcc0d7e01be886c8"),
"index" : NumberInt(0)
}
{
"_id" : ObjectId("57611ad6bcc0d7e01be886c9"),
"index" : NumberInt(7)
}
{
"_id" : ObjectId("57611ad6bcc0d7e01be886c7"),
"index" : NumberInt(9)
}
I have to retrieve the name of each field, its type and the number of occurrences of the field in my collection.
My map function works and I get:
"_id", [{type:"ObjectId", count:1},{type:"ObjectId", count:1},{type:"ObjectId", count:1}]
"index",[{type:"number", count:1},{type:"number", count:1},{type:"number", count:1}]
I want to delete duplicates from type.
I have the following reduce function:
function (key, stuff) {
var reduceVal = {type:"", count:0};
var array = [];
for(var idx = 0; idx < stuff.length; idx++) {
reduceVal.count += stuff[idx].count;
if(array.indexOf(stuff[idx].type) > -1) {
array.push(stuff[idx].type);
}
}
reduceVal.type = array.toString();
return reduceVal;
}
The if clause does not work. My target is to add an element to my array only if it is not a duplicate.
Expected output:
"_id", {type:"ObjectId", count:3}
"index", {type:"number", count:3}
How can I fix it?

The reduce function works. The if statement was wrong: I have to add an element to my array when
if(array.indexOf(stuff[idx].type) === -1).
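With that one-character fix (plus the missing return), the reducer can be sanity-checked in plain JavaScript against input shaped like the map output above:

```javascript
// Corrected reduce: dedupe types and sum counts (runnable sketch, Node or mongo shell)
function reduceFn(key, stuff) {
  var reduceVal = { type: "", count: 0 };
  var array = [];
  for (var idx = 0; idx < stuff.length; idx++) {
    reduceVal.count += stuff[idx].count;
    // add the type only when it is not already in the array
    if (array.indexOf(stuff[idx].type) === -1) {
      array.push(stuff[idx].type);
    }
  }
  reduceVal.type = array.toString();
  return reduceVal;
}

var out = reduceFn("_id", [
  { type: "ObjectId", count: 1 },
  { type: "ObjectId", count: 1 },
  { type: "ObjectId", count: 1 }
]);
// out is { type: "ObjectId", count: 3 }, matching the expected output
```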

It looks like you just jumbled up your reduce function. As far as I can interpret this, you assume that the reducer is called once globally. This is not the case. Instead, it is called per key, i.e. the input to the reducer is something like:
First call:
key = "_id", val = [{type:"ObjectId", count:1},{type:"ObjectId", count:1},{type:"ObjectId", count:1}]
Second call:
key = "index", val = [{type:"number", count:1},...]
Therefore, you just need to sum up the counts for each key (this code is not tested and will have its shortcomings):
function(key, vals) {
var sum = 0;
for(var i = 0; i < vals.length; i++) {
sum += vals[i].count;
}
// key is the field name; the type travels in the values
return { "type" : vals[0].type, "count" : sum };
}

Related

update multi level document in mongodb [duplicate]

I have document like
{
id : 100,
heros:[
{
nickname : "test",
spells : [
{spell_id : 61, level : 1},
{spell_id : 1, level : 2}
]
}
]
}
I can't $set the spell's level to 3 for spell_id 1 inside spells, which is inside heros with nickname "test". I tried this query:
db.test.update({"heros.nickname":"test", "heros.spells.spell_id":1},
{$set:{"heros.spells.$.level":3}});
The error I see is:
can't append to array using string field name [spells]
Thanks for help.
You can only use the $ positional operator for single-level arrays. In your case, you have a nested array (heros is an array, and within that each hero has a spells array).
If you know the indexes of the arrays, you can use explicit indexes when doing an update, like:
> db.test.update({"heros.nickname":"test", "heros.spells.spell_id":1}, {$set:{"heros.0.spells.1.level":3}});
Try something like this:
db.test.find({"heros.nickname":"test"}).forEach(function(x) {
var match = false;
for (var i = 0; i < x.heros[0].spells.length; i++) {
if (x.heros[0].spells[i].spell_id == 1) {
x.heros[0].spells[i].level = 3;
match = true;
}
}
if (match === true) db.test.update( { id: x.id }, x );
});
Apparently someone opened a ticket to add the ability to put a function inside the update clause, but it hasn't been addressed yet: https://jira.mongodb.org/browse/SERVER-458
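The logic of the forEach workaround can be sanity-checked in plain JavaScript against an in-memory copy of the document (a sketch; the setSpellLevel name is mine, not part of any API):

```javascript
// Sketch: set the level of the spell with spell_id 1 for the hero nicknamed "test"
function setSpellLevel(doc, nickname, spellId, newLevel) {
  var match = false;
  for (var h = 0; h < doc.heros.length; h++) {
    if (doc.heros[h].nickname !== nickname) continue;
    var spells = doc.heros[h].spells;
    for (var s = 0; s < spells.length; s++) {
      if (spells[s].spell_id === spellId) {
        spells[s].level = newLevel;
        match = true;
      }
    }
  }
  return match; // save the document back only when a spell was changed
}

var doc = {
  id: 100,
  heros: [
    { nickname: "test", spells: [{ spell_id: 61, level: 1 }, { spell_id: 1, level: 2 }] }
  ]
};
var matched = setSpellLevel(doc, "test", 1, 3);
// matched is true and doc.heros[0].spells[1].level is now 3
```

On MongoDB 3.6 and later, the filtered positional operator ($[<identifier>] together with the arrayFilters update option) can perform this nested-array update server-side, without the read-modify-write round trip.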

Efficient Median Calculation in MongoDB

We have a Mongo collection named analytics and it tracks user visits by a cookie id. We want to calculate medians for several variables as users visit different pages.
Mongo does not yet have an internal method for calculating the median. I have used the below method for determining it, but I'm afraid there may be a more efficient way, as I'm pretty new to JS. Any comments would be appreciated.
// Saves the JS function for calculating the Median. Makes it accessible to the Reducer.
db.system.js.save({_id: "myMedianValue",
value: function (sortedArray) {
var m = 0.0;
if (sortedArray.length % 2 === 0) {
// Even-length array: average the middle two values
var idx2 = sortedArray.length / 2;
var idx1 = idx2 - 1;
m = (sortedArray[idx1] + sortedArray[idx2]) / 2;
} else {
// Odd-length array: take the middle value
var idx = Math.floor(sortedArray.length / 2);
m = sortedArray[idx];
}
return m;
}
});
var mapFunction = function () {
var key = this.cookieId;
var value = {
// If there is only 1 view it will look like this
// If there are multiple it gets passed to the reduceFunction
medianVar1: this.Var1,
medianVar2: this.Var2,
viewCount: 1
};
emit(key, value);
};
var reduceFunction = function(keyCookieId, valueDicts) {
var Var1Array = [];
var Var2Array = [];
var views = 0;
for (var idx = 0; idx < valueDicts.length; idx++) {
Var1Array.push(valueDicts[idx].medianVar1);
Var2Array.push(valueDicts[idx].medianVar2);
views += valueDicts[idx].viewCount;
}
var reducedDict = {
medianVar1: myMedianValue(Var1Array.sort(function(a, b){return a-b})),
medianVar2: myMedianValue(Var2Array.sort(function(a, b){return a-b})),
viewCount: views
};
return reducedDict;
};
db.analytics.mapReduce(mapFunction,
reduceFunction,
{ out: "analytics_medians",
query: {Var1: {$exists:true},
Var2: {$exists:true}
}}
)
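The myMedianValue helper above can be exercised outside of Mongo; here is the same even/odd logic as a runnable plain-JavaScript sketch:

```javascript
// Median of a pre-sorted numeric array (same logic as the myMedianValue helper)
function medianOfSorted(sortedArray) {
  var n = sortedArray.length;
  if (n % 2 === 0) {
    // even length: average the two middle values
    return (sortedArray[n / 2 - 1] + sortedArray[n / 2]) / 2;
  }
  // odd length: take the middle value
  return sortedArray[Math.floor(n / 2)];
}

var odd = medianOfSorted([18, 23, 45, 52, 94]);  // 45
var even = medianOfSorted([18, 23, 45, 52]);     // 34
```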
The simple way to get the median value is to index on the field, then skip to the value halfway through the results.
> db.test.drop()
> db.test.insert([
{ "_id" : 0, "value" : 23 },
{ "_id" : 1, "value" : 45 },
{ "_id" : 2, "value" : 18 },
{ "_id" : 3, "value" : 94 },
{ "_id" : 4, "value" : 52 },
])
> db.test.ensureIndex({ "value" : 1 })
> var get_median = function() {
// may want { "value" : { "$exists" : true } } if some documents are missing the value field
var T = db.test.count()
// may want to adjust the skip a bit depending on how you compute the median, e.g. in case of even T
return db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(Math.floor(T / 2)).limit(1).toArray()[0].value
}
> get_median()
45
It's not amazing because of the skip, but at least the query will be covered by the index. For updating the median you could be fancier: when a new document comes in, or the value of a document is updated, you compare its value to the current median. If the new value is higher, you adjust the median up by finding the next highest value after the current median doc (or taking an average with it, or whatever computes the new median correctly according to your rules):
> db.test.find({ "value" : { "$gt" : median } }, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).limit(1)
You'd do the analogous thing if the new value is smaller than the current median. This bottlenecks your writes on this updating process, and has various cases to think about (how would you allow yourself to update multiple docs at once? update the doc that has the median value? update a doc whose value is smaller than the median to one whose value is larger than the median?), so it might be better just to update occasionally based on the skip procedure.
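The adjust-on-insert idea can be sketched in plain JavaScript with an in-memory stand-in for the indexed collection (helper names are illustrative, not Mongo API):

```javascript
// Maintain a sorted array of values and read its median as documents arrive
function insertSorted(arr, v) {
  var i = 0;
  while (i < arr.length && arr[i] < v) i++;
  arr.splice(i, 0, v); // insert while keeping sort order, like an index would
  return arr;
}

function currentMedian(arr) {
  var n = arr.length;
  return n % 2 === 0
    ? (arr[n / 2 - 1] + arr[n / 2]) / 2 // even: average the middle pair
    : arr[Math.floor(n / 2)];           // odd: middle value
}

var values = [];
[23, 45, 18, 94, 52].forEach(function (v) { insertSorted(values, v); });
// values is [18, 23, 45, 52, 94]; currentMedian(values) is 45
```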
We ended up updating the medians on every page request, rather than in bulk with a cron job or similar. We have a Node API that uses Mongo's aggregation framework to match and sort the user's results. The array of results is then passed to a median function within Node, and the result is written back to Mongo for that user. Not super pleased with it, but it doesn't appear to have locking issues and is performing well.

Delete all _id field from subdocuments

I have been using Mongoose to insert a large amount of data into a mongodb database. I noticed that by default, Mongoose adds _id fields to all subdocuments, leaving me with documents which look like this (I've removed many fields for brevity - I've also shrunken each array to one entry, they generally have more)
{
"start_time" : ISODate("2013-04-05T02:30:28Z"),
"match_id" : 165816931,
"players" : [
{
"account_id" : 4294967295,
"_id" : ObjectId("51daffdaa78cee5c36e29fba"),
"additional_units" : [ ],
"ability_upgrades" : [
{
"ability" : 5155,
"time" : 141,
"level" : 1,
"_id" : ObjectId("51daffdaa78cee5c36e29fca")
},
]
},
],
"_id" : ObjectId("51daffdca78cee5c36e2a02e")
}
I have found how to prevent Mongoose adding these by default (http://mongoosejs.com/docs/guide.html, see option: id), however I now have 95 million records with these extraneous _id fields on all subdocuments. I am interested in finding the best way of deleting all of these fields (leaving the _id on the top level document). My initial thoughts are to use a bunch of for...in loops on each object but this seems very inefficient.
Given Derick's answer, I have created a function to do this:
var deleteIdFromSubdocs = function (obj, isRoot) {
for (var key in obj) {
if (isRoot == false && key == "_id") {
delete obj[key];
} else if (typeof obj[key] == "object") {
deleteIdFromSubdocs(obj[key], false);
}
}
return obj;
};
And run it against a test collection using:
db.testobjects.find().forEach(function (x){ y = deleteIdFromSubdocs(x, true); db.testobjects.save(y); } )
This appears to work for my test collection. I'd like to see if anyone has any opinions on how this could be done better/any risks involved before I run it against the 95 million document collection.
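Before running it against the 95 million documents, the recursion can be exercised in plain JavaScript on a trimmed-down document (a sketch; the sample document is abridged from the question):

```javascript
// Remove _id from every nested object and array element, keeping the top-level _id
function deleteIdFromSubdocs(obj, isRoot) {
  for (var key in obj) {
    if (isRoot === false && key === "_id") {
      delete obj[key];
    } else if (typeof obj[key] === "object" && obj[key] !== null) {
      // arrays are objects too, so this recurses into array elements as well
      deleteIdFromSubdocs(obj[key], false);
    }
  }
  return obj;
}

var doc = {
  _id: "top",
  players: [{ _id: "p1", ability_upgrades: [{ _id: "a1", ability: 5155 }] }]
};
deleteIdFromSubdocs(doc, true);
// doc._id survives; players[0]._id and ability_upgrades[0]._id are gone
```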
The players._id could be removed using an update operation, like the following:
db.collection.update({'players._id': {$exists : 1}}, { $unset : { 'players.$._id' : 1 } }, false, true)
However, it's not possible to use the positional operator in nested arrays. So, one solution is to run a script directly on your database:
var cursor = db.collection.find({'players.ability_upgrades._id': {$exists : 1}});
cursor.forEach(function(doc) {
for (var i = 0; i < doc.players.length; i++) {
var player = doc.players[i];
delete player['_id'];
for (var j = 0; j < player.ability_upgrades.length; j++) {
delete player.ability_upgrades[j]['_id'];
}
}
db.collection.save(doc);
});
Save the script to a file and call mongo with the file as parameter:
> mongo remove_oid.js --shell
The only solution is to do this one by one, exactly with a for...in loop as you described.
Just another version, try this with AngularJS and MongoDB ;-)
function removeIds (obj, isRoot) {
var target = obj._doc || obj; // Mongoose documents keep their fields on _doc
for (var key in target) {
if (isRoot == false && key == "_id") {
delete target._id;
} else if (Object.prototype.toString.call(target[key]) === '[object Array]') {
for (var i = 0; i < target[key].length; i++)
removeIds(target[key][i], false);
}
}
return obj;
}
Usage:
var newObj = removeIds(oldObj, true);
delete newObj._id;

Why am I losing some values every 100 documents?

I'm trying to understand a behavior with map/reduce.
Here's the map function:
function() {
var klass = this.error_class;
emit('klass', { model : klass, count : 1 });
}
And the reduce function:
function(key, values) {
var results = { count : 0, klass: { foo: 'bar' } };
values.forEach(function(value) {
results.count += value.count;
results.klass[value.model] = 0;
printjson(results);
});
return results;
}
Then I run it:
{
"count" : 85,
"klass" : {
"foo" : "bar",
"Twitter::Error::BadRequest" : 0
}
}
{
"count" : 86,
"klass" : {
"foo" : "bar",
"Twitter::Error::BadRequest" : 0,
"Stream:DirectMessage" : 0
}
}
At this point, everything is good, but here comes the yielding of the read lock every 100 documents:
{
"count" : 100,
"klass" : {
"foo" : "bar",
"Twitter::Error::BadRequest" : 0,
"Stream:DirectMessage" : 0
}
}
{ "count" : 100, "klass" : { "foo" : "bar", "undefined" : 0 } }
I kept my key foo and my count attribute kept being incremented. The problem is everything else became undefined.
So why am I losing the dynamic keys for my object while my count attribute is still good?
A thing to remember about your reduce function is that the values passed to it are either the output of your map function, or the return value of previous calls to reduce.
This is key - it means mapping / reducing of parts of the data can be farmed off to different machines (eg different shards of a mongo cluster) and then reduce used again to reassemble the data. It also means that mongo doesn't have to first map every value, keeping all the results in memory and then reduce them all: it can map and reduce in chunks, re-reducing where necessary.
In other words the following must be true:
reduce(k, [A, B, C]) == reduce(k, [C, reduce(k, [A, B])])
Your reduce function's output doesn't have a model property, so if it gets used in a re-reduce those undefined values will crop up.
You either need to have your reduce function return something similar in format to what your map function emits, so that you can process the two without distinction (usually the easiest), or else handle re-reduced values differently.
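A reducer that satisfies this property emits the same shape it consumes. Here is a plain-JavaScript sketch (sample values are illustrative) that folds klass keys whether a value is raw map output or a previous reduce result:

```javascript
// Re-reduce-safe reducer: output shape matches input shape
function reduceFn(key, values) {
  var results = { count: 0, klass: {} };
  values.forEach(function (value) {
    results.count += value.count;
    if (value.klass) {
      // value is a previous reduce output: merge its klass keys
      for (var k in value.klass) results.klass[k] = 0;
    } else {
      // value is a raw map output with a model property
      results.klass[value.model] = 0;
    }
  });
  return results;
}

var A = { model: "Twitter::Error::BadRequest", count: 1 };
var B = { model: "Stream:DirectMessage", count: 1 };
var C = { model: "Twitter::Error::BadRequest", count: 1 };

// Reducing everything at once and re-reducing a partial result agree
var direct = reduceFn("klass", [A, B, C]);
var rereduced = reduceFn("klass", [C, reduceFn("klass", [A, B])]);
// both give { count: 3, klass: { "Twitter::Error::BadRequest": 0, "Stream:DirectMessage": 0 } }
```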
