Efficient Median Calculation in MongoDB

We have a Mongo collection named analytics and it tracks user visits by a cookie id. We want to calculate medians for several variables as users visit different pages.
Mongo does not yet have an internal method for calculating the median. I have used the method below to determine it, but I'm afraid there may be a more efficient way, as I'm pretty new to JS. Any comments would be appreciated.
// Saves the JS function for calculating the median and makes it accessible to the reducer.
db.system.js.save({
    _id: "myMedianValue",
    value: function (sortedArray) {
        var m = 0.0;
        if (sortedArray.length % 2 === 0) {
            // Even-length array: average the middle two values
            var idx2 = sortedArray.length / 2;
            var idx1 = idx2 - 1;
            m = (sortedArray[idx1] + sortedArray[idx2]) / 2;
        } else {
            // Odd-length array: take the middle value
            var idx = Math.floor(sortedArray.length / 2);
            m = sortedArray[idx];
        }
        return m;
    }
});
var mapFunction = function () {
    var key = this.cookieId;
    var value = {
        // If there is only 1 view the value stays in this shape;
        // if there are multiple views it gets passed to the reduce function
        medianVar1: this.Var1,
        medianVar2: this.Var2,
        viewCount: 1
    };
    emit(key, value);
};
var reduceFunction = function (keyCookieId, valueDicts) {
    var Var1Array = [];
    var Var2Array = [];
    var views = 0;
    for (var idx = 0; idx < valueDicts.length; idx++) {
        Var1Array.push(valueDicts[idx].medianVar1);
        Var2Array.push(valueDicts[idx].medianVar2);
        views += valueDicts[idx].viewCount;
    }
    var reducedDict = {
        medianVar1: myMedianValue(Var1Array.sort(function (a, b) { return a - b; })),
        medianVar2: myMedianValue(Var2Array.sort(function (a, b) { return a - b; })),
        viewCount: views
    };
    return reducedDict;
};
db.analytics.mapReduce(mapFunction,
    reduceFunction,
    {
        out: "analytics_medians",
        query: {
            Var1: { $exists: true },
            Var2: { $exists: true }
        }
    }
)

The simple way to get the median value is to index on the field, then skip to the value halfway through the results.
> db.test.drop()
> db.test.insert([
{ "_id" : 0, "value" : 23 },
{ "_id" : 1, "value" : 45 },
{ "_id" : 2, "value" : 18 },
{ "_id" : 3, "value" : 94 },
{ "_id" : 4, "value" : 52 },
])
> db.test.ensureIndex({ "value" : 1 })
> var get_median = function() {
    // may want { "value" : { "$exists" : true } } if some documents are missing the value field
    var T = db.test.count()
    // may want to adjust the skip a bit depending on how you compute the median, e.g. for even T
    return db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(Math.floor(T / 2)).limit(1).toArray()[0].value
}
> get_median()
45
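For even T you could average the two middle values with a second read; here is a minimal sketch of that adjustment (same test collection as above, illustrative only, not part of the original get_median):
> var get_median_even_aware = function() {
    var T = db.test.count()
    var mid = Math.floor(T / 2)
    if (T % 2 === 1) {
        // odd count: single middle value
        return db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(mid).limit(1).toArray()[0].value
    }
    // even count: average the two values straddling the middle
    var pair = db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(mid - 1).limit(2).toArray()
    return (pair[0].value + pair[1].value) / 2
}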
It's not amazing because of the skip, but at least the query will be covered by the index. For updating the median, you could be fancier. When a new document comes in or the value of a document is updated, you compare its value to the median. If the new value is higher, you need to adjust the median up by finding the next highest value after the current median doc (or taking an average with it, or whatever computes the new median correctly according to your rules):
> db.test.find({ "value" : { "$gt" : median } }, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).limit(1)
You'd do the analogous thing if the new value is smaller than the current median. This bottlenecks your writes on this updating process, and has various cases to think about (how would you allow yourself to update multiple docs at once? update the doc that has the median value? update a doc whose value is smaller than the median to one whose value is larger than the median?), so it might be better just to update occasionally based on the skip procedure.

We ended up updating the medians on every page request, rather than in bulk with a cron job or something. We have a Node API that uses Mongo's aggregation framework to match and sort the user's results. The array of results then passes to a median function within Node, and the result is written back to Mongo for that user. Not super pleased with it, but it doesn't appear to have locking issues and is performing well.
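Roughly, the flow looks like the sketch below (the pipeline shape, the someCookieId placeholder, and the median helper are illustrative assumptions; the field and collection names come from the original question):
var median = function (sorted) {
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0
        ? (sorted[mid - 1] + sorted[mid]) / 2
        : sorted[mid];
};

// someCookieId is a placeholder for the visitor's cookie id
// match the user's documents, sort by the variable, keep only that field
var values = db.analytics.aggregate([
    { $match: { cookieId: someCookieId, Var1: { $exists: true } } },
    { $sort: { Var1: 1 } },
    { $project: { _id: 0, Var1: 1 } }
]).toArray().map(function (doc) { return doc.Var1; });

// compute the median in JS and write it back for that user
db.analytics_medians.update(
    { _id: someCookieId },
    { $set: { medianVar1: median(values) } },
    { upsert: true }
);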

Related

MongoDB - reduce function does not work properly

My map function returns key-value pairs where key is the name of a field and the value is an object {type: <field type>, count : 1}.
For example suppose I have these documents:
{
    "_id" : ObjectId("57611ad6bcc0d7e01be886c8"),
    "index" : NumberInt(0)
}
{
    "_id" : ObjectId("57611ad6bcc0d7e01be886c9"),
    "index" : NumberInt(7)
}
{
    "_id" : ObjectId("57611ad6bcc0d7e01be886c7"),
    "index" : NumberInt(9)
}
I have to retrieve the name of each field, its type and the number of occurrences of the field in my collection.
My map function works and I get:
"_id", [{type:"ObjectId", count:1},{type:"ObjectId", count:1},{type:"ObjectId", count:1}]
"index",[{type:"number", count:1},{type:"number", count:1},{type:"number", count:1}]
I want to delete duplicates from type.
I have the following reduce function:
function (key, stuff) {
    var reduceVal = { type: "", count: 0 };
    var array = [];
    for (var idx = 0; idx < stuff.length; idx++) {
        reduceVal.count += stuff[idx].count;
        if (array.indexOf(stuff[idx].type) > -1) {
            array.push(stuff[idx].type);
        }
    }
    reduceVal.type = array.toString();
    return reduceVal;
}
The if clause does not work. My goal is to add an element to my array only if it is not a duplicate.
Expected output:
"_id", {type:"ObjectId", count:3}
"index", {type:"number", count:3}
How can I fix this?
The reduce function works; the if condition was wrong. I have to add an element to my array only when:
if (array.indexOf(stuff[idx].type) === -1)
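With that change, the whole reduce would read roughly like this (a sketch keeping the original variable names, with the missing return filled in; not tested):
function (key, stuff) {
    var reduceVal = { type: "", count: 0 };
    var array = [];
    for (var idx = 0; idx < stuff.length; idx++) {
        reduceVal.count += stuff[idx].count;
        // only record a type we have not seen yet
        if (array.indexOf(stuff[idx].type) === -1) {
            array.push(stuff[idx].type);
        }
    }
    reduceVal.type = array.toString();
    return reduceVal;
}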
It looks like you just jumbled up your reduce function. As far as I can interpret this, you assume that the reducer is called once globally. This is not the case. Instead, it is called per key, i.e. the input to the reducer is something like:
First call:
key = "ObjectId", val = [{type:"ObjectId", count:1},{type:"ObjectId", count:1},{type:"ObjectId", count:1}]
Second call:
key = "number", val = [{type:"number", count:1},...]
Therefore, you need to sum up, knowing that the key is already set (this code is not tested and will have its shortcomings):
function(key, vals) {
    var sum = 0;
    for (var i = 0; i < vals.length; i++) {
        sum += vals[i].count;
    }
    return { "type" : key, "count" : sum };
}

MongoDb Summing multiple columns

I'm dabbling in MongoDB and trying to use map-reduce queries. I need to sum up multiple values from different columns (num1, num2, num3, num4, num5). Going off this guide, http://docs.mongodb.org/manual/tutorial/map-reduce-examples/, I'm trying to alter the first example there to sum up all the values.
This is what I tried. I'm not sure if emit can take in multiple values like this; I just assumed it could.
var query1 = function(){ emit("sumValues", this.num1, this.num2, this.num3, this.num4, this.num5)};
var query1query = function(sumValues, totalSumOfValues){ return Array.sum(totalSumOfValues); }
db.testData.mapReduce( query1, query1query, {out: "query_one_results"})
This is the error I get.
Sun Dec 1 18:52:24.627 JavaScript execution failed: map reduce failed: {
    "errmsg" : "exception: JavaScript execution failed: Error: fast_emit takes 2 args near 'ction (){ emit(\"sumValues\", this.c' ",
    "code" : 16722,
    "ok" : 0
}
Is there another way to sum up all these values? Or where is the error in what I have?
I also tried this and it seemed to work, but when I do a .find() on the collection it creates, it only seems to be retrieving the values of sum5 and adding them together.
var query1 = function(){ emit({sum1: this.sum1,sum2: this.sum2,sum3: this.sum3,sum4: this.sum4,sum5: this.sum5},{count:1});}
var query1query = function(key, val){ return Array.sum(val); };
db.testData.mapReduce( query1, query1query, {out: "query_one_results"})
{
    "result" : "query_one_results",
    "timeMillis" : 19503,
    "counts" : {
        "input" : 173657,
        "emit" : 173657,
        "reduce" : 1467,
        "output" : 166859
    },
    "ok" : 1
}
OK, I think I got it. This is what I ended up doing:
> var map1= function(){
... emit("Total Sum", this.sum1);
... emit("Total Sum", this.sum2);
... emit("Total Sum", this.sum3);
... emit("Total Sum", this.sum4);
... emit("Total Sum", this.sum5);
... emit("Total Sum", this.sum6);
... };
> var reduce1 = function(key, val){
... return Array.sum(val)
... };
> db.delayData.mapReduce(map1,reduce1,{out:'query_one_result'});
{
    "result" : "query_one_result",
    "timeMillis" : 9264,
    "counts" : {
        "input" : 173657,
        "emit" : 1041942,
        "reduce" : 1737,
        "output" : 1
    },
    "ok" : 1
}
> db.query_one_result.find()
{ "_id" : "Total Sum", "value" : 250 }
You were getting the two values that you need to emit in map backwards. The first one represents a unique key value over which you are aggregating something. The second one is the thing you want to aggregate.
Since you are trying to get a single sum for the entire collection (it seems), you need to output a single key, and for the value you need to output the sum of the five fields of each document.
map = function() { emit(1, this.num1+this.num2+this.num3+this.num4+this.num5); }
reduce = function(key, values) { return Array.sum(values); }
This will work as long as num1 through num5 are set in every document - you can make the function more robust by simply checking that each of those fields exists and if so, adding its value in.
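For instance, a hedged sketch of that guard (the field names num1 through num5 come from the question; the rest is an illustrative assumption, not the original answer's code):
map = function() {
    var fields = ["num1", "num2", "num3", "num4", "num5"];
    var total = 0;
    for (var i = 0; i < fields.length; i++) {
        // only add fields that are present and numeric
        if (typeof this[fields[i]] === "number") {
            total += this[fields[i]];
        }
    }
    emit(1, total);
}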
In real life, you can do this faster and more simply with the aggregation framework:
db.collection.aggregate([{ $group: { _id: 1, sum: { $sum: { $add: ["$num1", "$num2", "$num3", "$num4", "$num5"] } } } }])

Can the field names in a MongoDB document be queried, perhaps using aggregation?

In the MongoDB blog article "Schema Design for Time Series Data in MongoDB", the author proposed storing multiple time-series values in a single document as numbered children of a base timestamp (i.e. one document per minute, with the seconds as numbered fields of a values object).
{
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used",
    values: {
        0: 999999,
        …
        37: 1000000,
        38: 1500000,
        …
        59: 2000000
    }
}
The proposed schema sounds like a good one, but they fail to mention how to query the "values" field names, which would be required if you wanted to know when the last sample occurred.
How would you go about constructing a query to find something like the time of the most recent metric (combining timestamp_minute and highest field name in the values)?
Thanks so much!
You can just query the minute document and then use a loop on the client to determine which timestamps have been set:
doc = c.findOne(...)
var last = 0
for (var i = 0; i < 60; i++)
    if (i in doc.values)
        last = i
Another approach, which is a little more efficient, is to use an array instead of a document for the per-second samples, and then use the length of the array to determine how many second samples have been stored:
doc = c.findOne(...)
last = doc.values.length - 1
I found the answer to "can the field names be queried" in another blog post, which showed iterating over the keys (as Bruce suggests), only doing so in a MapReduce function, à la:
var d = 0;
for (var key in this.values)
    d = Math.max(d, parseInt(key));
For the MMS example schema (swapping in month for timestamp_minute and days in the values array labeled v below) here is the data and a query that produces the most recent metric date:
db.metricdata.find();
/* 0 */
{
    "_id" : ObjectId("5277e223be9974e8415f66f6"),
    "month" : ISODate("2013-10-01T04:00:00.000Z"),
    "type" : "ga-pv",
    "v" : {
        "10" : 57,
        "11" : 49,
        "12" : 91,
        "13" : 27,
        ...
    }
}
/* 1 */
{
    "_id" : ObjectId("5277e223be9974e8415f66f7"),
    "month" : ISODate("2013-11-01T04:00:00.000Z"),
    "type" : "ga-pv",
    "v" : {
        "1" : 145,
        "2" : 51,
        "3" : 63,
        "4" : 29
    }
}
And the map reduce function:
db.metricdata.mapReduce(
    function() {
        var y = this.month.getFullYear();
        var m = this.month.getMonth();
        var d = 0;
        // Here is where the field names are used
        for (var key in this.v)
            d = Math.max(d, parseInt(key));
        emit(this._id, new Date(y, m, d));
    },
    function(key, val) {
        return null;
    },
    { out: "idandlastday" }
).find().sort({ value: -1 }).limit(1)
This produces something like
/* 0 */
{
    "_id" : ObjectId("5277e223be9974e8415f66f7"),
    "value" : ISODate("2013-11-04T05:00:00.000Z")
}

Why am I losing some values every 100 documents?

I'm trying to understand a behavior with map/reduce.
Here's the map function:
function() {
    var klass = this.error_class;
    emit('klass', { model : klass, count : 1 });
}
And the reduce function:
function(key, values) {
    var results = { count : 0, klass: { foo: 'bar' } };
    values.forEach(function(value) {
        results.count += value.count;
        results.klass[value.model] = 0;
        printjson(results);
    });
    return results;
}
Then I run it:
{
    "count" : 85,
    "klass" : {
        "foo" : "bar",
        "Twitter::Error::BadRequest" : 0
    }
}
{
    "count" : 86,
    "klass" : {
        "foo" : "bar",
        "Twitter::Error::BadRequest" : 0,
        "Stream:DirectMessage" : 0
    }
}
At this point, everything is good, but here comes the yielding of the read lock every 100 documents:
{
    "count" : 100,
    "klass" : {
        "foo" : "bar",
        "Twitter::Error::BadRequest" : 0,
        "Stream:DirectMessage" : 0
    }
}
{ "count" : 100, "klass" : { "foo" : "bar", "undefined" : 0 } }
I kept my key foo and my count attribute kept being incremented. The problem is everything else became undefined.
So why am I losing the dynamic keys for my object while my count attribute is still good?
A thing to remember about your reduce function is that the values passed to it are either the output of your map function, or the return value of previous calls to reduce.
This is key: it means mapping/reducing of parts of the data can be farmed off to different machines (e.g. different shards of a mongo cluster) and then reduce used again to reassemble the data. It also means that mongo doesn't have to first map every value, keep all the results in memory, and then reduce them all: it can map and reduce in chunks, re-reducing where necessary.
In other words the following must be true:
reduce(k, [A, B, C]) == reduce(k, [C, reduce(k, [A, B])])
Your reduce function's output doesn't have a model property so if it gets used in a re-reduce those undefined values will crop up.
You either need to have your reduce function return something similar in format to what your map function emits, so that you can process the two without distinction (usually the easiest), or else handle re-reduced values differently.
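One hedged way to take the first option for the question above: keep the map as-is and make reduce tolerate both map output and earlier reduce output (a sketch, not tested):
function(key, values) {
    var results = { count: 0, klass: {} };
    values.forEach(function(value) {
        results.count += value.count;
        if (value.model !== undefined) {
            // value came straight from map: record its single model
            results.klass[value.model] = 0;
        } else {
            // value is the result of a previous reduce: carry its keys forward
            for (var k in value.klass) {
                results.klass[k] = 0;
            }
        }
    });
    return results;
}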

In MongoDB mapreduce, how can I flatten the values object?

I'm trying to use MongoDB to analyse Apache log files. I've created a receipts collection from the Apache access logs. Here's an abridged summary of what my models look like:
db.receipts.findOne()
{
    "_id" : ObjectId("4e57908c7a044a30dc03a888"),
    "path" : "/videos/1/show_invisibles.m4v",
    "issued_at" : ISODate("2011-04-08T00:00:00Z"),
    "status" : "200"
}
I've written a MapReduce function that groups all data by the issued_at date field. It summarizes the total number of requests, and provides a breakdown of the number of requests for each unique path. Here's an example of what the output looks like:
db.daily_hits_by_path.findOne()
{
    "_id" : ISODate("2011-04-08T00:00:00Z"),
    "value" : {
        "count" : 6,
        "paths" : {
            "/videos/1/show_invisibles.m4v" : {
                "count" : 2
            },
            "/videos/1/show_invisibles.ogv" : {
                "count" : 3
            },
            "/videos/6/buffers_listed_and_hidden.ogv" : {
                "count" : 1
            }
        }
    }
}
How can I make the output look like this instead:
{
    "_id" : ISODate("2011-04-08T00:00:00Z"),
    "count" : 6,
    "paths" : {
        "/videos/1/show_invisibles.m4v" : {
            "count" : 2
        },
        "/videos/1/show_invisibles.ogv" : {
            "count" : 3
        },
        "/videos/6/buffers_listed_and_hidden.ogv" : {
            "count" : 1
        }
    }
}
It's not currently possible, but I would suggest voting for this case: https://jira.mongodb.org/browse/SERVER-2517.
Taking the best from previous answers and comments:
db.items.find().hint({_id: 1}).forEach(function(item) {
    db.items.update({_id: item._id}, item.value);
});
From http://docs.mongodb.org/manual/core/update/#replace-existing-document-with-new-document
"If the update argument contains only field and value pairs, the update() method replaces the existing document with the document in the update argument, except for the _id field."
So you need neither to $unset value, nor to list each field.
From https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#cursor-snapshot
"MongoDB cursors can return the same document more than once in some situations. ... use a unique index on this field or these fields so that the query will return each document no more than once. Query with hint() to explicitly force the query to use that index."
AFAIK, by design Mongo's map reduce will spit results out in "value tuples" and I haven't seen anything that will configure that "output format". Maybe the finalize() method can be used.
You could try running a post-process that will reshape the data using:
results.find({}).forEach( function(result) {
    results.update({_id: result._id}, {count: result.value.count, paths: result.value.paths})
});
Yep, that looks ugly. I know.
You can do Dan's code with a collection reference:
function clean(collection) {
    collection.find().forEach(function(result) {
        var value = result.value;
        delete value._id;
        collection.update({_id: result._id}, value);
        collection.update({_id: result._id}, {$unset: {value: 1}});
    });
}
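For example, pointing it at the output collection from the question above (the call itself is just an assumed usage):
clean(db.daily_hits_by_path);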
A similar approach to that of @ljonas, but with no need to hardcode document fields:
db.results.find().forEach( function(result) {
    var value = result.value;
    delete value._id;
    db.results.update({_id: result._id}, value);
    db.results.update({_id: result._id}, {$unset: {value: 1}});
});
All the proposed solutions are far from optimal. The fastest you can do so far is something like:
var flattenMRCollection = function(dbName, collectionName) {
    var collection = db.getSiblingDB(dbName)[collectionName];

    var i = 0;
    var bulk = collection.initializeUnorderedBulkOp();
    collection.find({ value: { $exists: true } }).addOption(16).forEach(function(result) {
        print((++i));
        //collection.update({_id: result._id}, result.value);
        bulk.find({ _id: result._id }).replaceOne(result.value);

        if (i % 1000 == 0) {
            print("Executing bulk...");
            bulk.execute();
            bulk = collection.initializeUnorderedBulkOp();
        }
    });
    bulk.execute();
};
Then call it:
flattenMRCollection("MyDB","MyMRCollection")
This is WAY faster than doing sequential updates.
While experimenting with Vincent's answer, I found a couple of problems. Basically, if you perform updates within a forEach loop, this will move the document to the end of the collection and the cursor will reach that document again (example). This can be circumvented if $snapshot is used. Hence, I am providing a Java example below.
final List<WriteModel<Document>> bulkUpdate = new ArrayList<>();
// You should enable $snapshot if performing updates within forEach
collection.find(new Document().append("$query", new Document()).append("$snapshot", true)).forEach(new Block<Document>() {
    @Override
    public void apply(final Document document) {
        // Note that I used incrementing long values for '_id'. Change to String if
        // you used string '_id's
        long docId = document.getLong("_id");
        Document subDoc = (Document) document.get("value");
        WriteModel<Document> m = new ReplaceOneModel<>(new Document().append("_id", docId), subDoc);
        bulkUpdate.add(m);
        // If you used non-incrementing '_id's, then you need to use a final object with a counter.
        if (docId % 1000 == 0 && !bulkUpdate.isEmpty()) {
            collection.bulkWrite(bulkUpdate);
            bulkUpdate.removeAll(bulkUpdate);
        }
    }
});
// Fixing bug related to Vincent's answer: flush any remaining models.
if (!bulkUpdate.isEmpty()) {
    collection.bulkWrite(bulkUpdate);
    bulkUpdate.removeAll(bulkUpdate);
}
Note: This snippet takes an average of 7.4 seconds to execute on my machine with 100k records and 14 attributes (IMDB dataset). Without batching, it takes an average of 25.2 seconds.