Can the field names in a MongoDB document be queried, perhaps using aggregation? - mongodb

In this article from the MongoDB blog, "Schema Design for Time Series Data in MongoDB" the author proposed storing multiple time series values in a single document as numbered children of a base timestamp (i.e. document per minute, seconds as array of values).
{
timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
type: "memory_used",
values: {
0: 999999,
…
37: 1000000,
38: 1500000,
…
59: 2000000
}
}
The proposed schema sounds like a good one but they fail to mention how to query the "values" field names which would be required if you wanted to know when the last sample occurred.
How would you go about constructing a query to find something like the time of the most recent metric (combining timestamp_minute and highest field name in the values)?
Thanks so much!

You can just query the minute document and then use a loop on the client to
determine which timestamps have been set:
doc = c.findOne(...)  // findOne returns the document itself; find() would return a cursor
var last = 0
for (var i = 0; i < 60; i++)
    if (i in doc.values)
        last = i
Another approach which is a little more efficient is to use an array
instead of a document for the per-second samples, and then use the
length of the array to determine how many second samples have been
stored:
doc = c.findOne(...)
last = doc.values.length - 1
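To get the actual time of the most recent sample, the same client-side idea can combine timestamp_minute with the highest key in values. A minimal sketch, assuming the minute document has already been fetched; lastSampleTime and the sample object are hypothetical names used only for illustration:

```javascript
// Hypothetical helper: works on a plain object shaped like the minute document.
function lastSampleTime(doc) {
  // Field names arrive as strings ("0", "37", ...), so convert them to numbers.
  var seconds = Object.keys(doc.values).map(function (k) {
    return parseInt(k, 10);
  });
  if (seconds.length === 0) {
    return null; // no samples stored for this minute
  }
  var last = Math.max.apply(null, seconds);
  // Add the second offset to the base timestamp of the minute.
  return new Date(doc.timestamp_minute.getTime() + last * 1000);
}

// Sample data modeled on the document in the question.
var sample = {
  timestamp_minute: new Date("2013-10-10T23:06:00.000Z"),
  type: "memory_used",
  values: { 0: 999999, 37: 1000000, 38: 1500000 }
};
var lastSeen = lastSampleTime(sample);
```

This still transfers the whole minute document to the client, which is usually acceptable for at most 60 small values.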

I found the answer to "can the field names be queried" in another blog post, which showed iterating over the keys (as Bruce suggests), but doing so inside a MapReduce function, à la:
var d = 0;
for (var key in this.values)
    d = Math.max(d, parseInt(key, 10));
For the MMS example schema (swapping in month for timestamp_minute, and days for the seconds in the values sub-document, labeled v below), here is the data and a query that produces the most recent metric date:
db.metricdata.find();
/* 0 */
{
"_id" : ObjectId("5277e223be9974e8415f66f6"),
"month" : ISODate("2013-10-01T04:00:00.000Z"),
"type" : "ga-pv",
"v" : {
"10" : 57,
"11" : 49,
"12" : 91,
"13" : 27,
...
}
}
/* 1 */
{
"_id" : ObjectId("5277e223be9974e8415f66f7"),
"month" : ISODate("2013-11-01T04:00:00.000Z"),
"type" : "ga-pv",
"v" : {
"1" : 145,
"2" : 51,
"3" : 63,
"4" : 29
}
}
And the map reduce function:
db.metricdata.mapReduce(
    function() {
        var y = this.month.getFullYear();
        var m = this.month.getMonth();
        var d = 0;
        // Here is where the field names are used
        for (var key in this.v)
            d = Math.max(d, parseInt(key, 10));
        emit(this._id, new Date(y, m, d));
    },
    // Each _id is emitted exactly once, so this reduce function is never actually invoked
    function(key, val) {
        return null;
    },
    {out: "idandlastday"}
).find().sort({ value: -1 }).limit(1)
This produces something like
/* 0 */
{
"_id" : ObjectId("5277e223be9974e8415f66f7"),
"value" : ISODate("2013-11-04T05:00:00.000Z")
}
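On servers newer than the ones this thread was written against, the same idea can probably be expressed as an aggregation pipeline. The sketch below assumes that $objectToArray (MongoDB 3.4.4+) and $toInt (4.0+) are available and that every key in v is a numeric string; the variable name pipeline and the output fields days and lastDay are made up for illustration:

```javascript
// Pipeline definition only; in the shell it would be run as
//   db.metricdata.aggregate(pipeline)
var pipeline = [
  // Turn the "v" sub-document into [{ k: "10", v: 57 }, ...] and keep the keys as integers.
  { $project: {
      month: 1,
      days: {
        $map: {
          input: { $objectToArray: "$v" },
          as: "kv",
          in: { $toInt: "$$kv.k" }
        }
      }
  } },
  // Highest day number recorded in each month document.
  { $project: { month: 1, lastDay: { $max: "$days" } } },
  // Most recent month first, then keep only that document.
  { $sort: { month: -1, lastDay: -1 } },
  { $limit: 1 }
];
```

The result carries the month and the last day as separate fields, so the client still has to combine them into a date.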

Related

mongodb find element within a hash within a hash

I am attempting to build a query to run from Mongo client that will allow access to the following element of a hash within a hash within a hash.
Here is the structure of the data:
"_id" : ObjectId("BSONID"),
"e1" : "value",
"e2" : "value",
"e3" : "value",
"updated_at" : ISODate("2015-08-31T21:04:37.669Z"),
"created_at" : ISODate("2015-01-05T07:20:17.833Z"),
"e4" : 62,
"e5" : {
"sube1" : {
"26444745" : {
"subsube1" : "value",
"subsube2" : "value",
"subsube3" : "value I am looking for",
"subsube4" : "value",
"subsube5" : "value"
},
"40937803" : {
"subsube1" : "value",
"subsube2" : "value",
"subsube3" : "value I am looking for",
"subsube4" : "value",
"subsube5" : "value"
},
"YCPGF5SRTJV2TVVF" : {
"subsube1" : "value",
"subsube2" : "value",
"subsube3" : "value I am looking for",
"subsube4" : "value",
"subsube5" : "value"
}
}
}
So I have tried dotted notation, based on a suggestion for "diving" into a wildcard-named hash, using db.my_collection.find({"e5.sube1.subsube4": "value I am looking for"}), which keeps coming back with an empty result set. I have also tried the find with a regex match instead of an exact value, using /value I am lo/, and still get an empty result set. I know there is at least 1 document which has the "value I am looking for".
Any ideas - note I am restricted to using the Mongo shell client.
Thanks.
Since this cannot be turned into a JavaScript/mongo shell array, I will go to plan B, which is to write some code, be it Perl or Ruby, that pulls the result set into an array of hashes and walks each document/sub-document.
Thanks Mario for the help.
You have two issues:
You're missing one level.
You are checking subsube4 instead of subsube3.
Depending on which subdocument of sube1 you want to check, you should do
db.my_collection.find({"e5.sube1.26444745.subsube3": "value I am looking for"})
or
db.my_collection.find({"e5.sube1.40937803.subsube3": "value I am looking for"})
or
db.my_collection.find({"e5.sube1.YCPGF5SRTJV2TVVF.subsube3": "value I am looking for"})
You could use the $or operator if you want to look in any one of the three.
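When the keys are known up front, the $or filter can be generated rather than typed by hand. A small sketch, where buildOrQuery is a hypothetical helper and the key list is taken from the sample document:

```javascript
// Builds { $or: [ { "e5.sube1.<key>.subsube3": value }, ... ] } for a list of known keys.
function buildOrQuery(keys, value) {
  return {
    $or: keys.map(function (key) {
      var clause = {};
      clause["e5.sube1." + key + ".subsube3"] = value;
      return clause;
    })
  };
}

var query = buildOrQuery(
  ["26444745", "40937803", "YCPGF5SRTJV2TVVF"],
  "value I am looking for"
);
// In the shell: db.my_collection.find(query)
```

This only helps when the set of keys is known and small; it does not solve the unknown-key case.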
If you don't know the keys of your documents, that's an issue with your schema design: you should use arrays instead of objects. Similar case: How to query a dynamic key - mongodb schema design
EDIT
Since you explain that you have a special request to know the count of "value I am looking for" only one time, we can run a map reduce. You can run those commands in the shell.
Define map function
var iurMapFunction = function() {
for (var key in this.e5.sube1) {
if (this.e5.sube1[key].subsube3 == "value I am looking for") {
var value = {
count: 1,
subkey: key
}
emit(key, value);
}
}
};
Define reduce function
var iurReduceFunction = function(keys, countObjVals) {
    // Declared with var so it does not leak into the global scope
    var reducedVal = {
        count: 0
    };
    for (var idx = 0; idx < countObjVals.length; idx++) {
        reducedVal.count += countObjVals[idx].count;
    }
    return reducedVal;
};
Run mapreduce command
db.my_collection.mapReduce(iurMapFunction,
iurReduceFunction, {
out: {
replace: "map_reduce_result"
},
}
);
Find your counts
db.map_reduce_result.find()
This should give you, for each dynamic key in your object, the number of times it had an embedded field subsube3 with value value I am looking for.
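Before running this against a real collection, the map and reduce logic can be dry-run in plain JavaScript. The harness below is only a rough stand-in for the server: emit, dryRun, and the sample documents are illustrative assumptions, and the emitted value is trimmed to the count alone.

```javascript
// Stand-in for the server-side emit(); collects emitted values per key.
var groups = {};
function emit(key, value) {
  (groups[key] = groups[key] || []).push(value);
}

var mapFn = function () {
  for (var key in this.e5.sube1) {
    if (this.e5.sube1[key].subsube3 == "value I am looking for") {
      emit(key, { count: 1 });
    }
  }
};

var reduceFn = function (key, vals) {
  var out = { count: 0 };
  for (var i = 0; i < vals.length; i++) {
    out.count += vals[i].count;
  }
  return out;
};

function dryRun(docs) {
  groups = {};
  docs.forEach(function (doc) {
    mapFn.call(doc); // the server binds each document to `this`
  });
  var result = {};
  Object.keys(groups).forEach(function (key) {
    var vals = groups[key];
    // MongoDB skips reduce for keys with a single value; mirror that here.
    result[key] = vals.length === 1 ? vals[0] : reduceFn(key, vals);
  });
  return result;
}

var docs = [
  { e5: { sube1: { "26444745": { subsube3: "value I am looking for" },
                   "40937803": { subsube3: "something else" } } } },
  { e5: { sube1: { "26444745": { subsube3: "value I am looking for" } } } }
];
var counts = dryRun(docs);
```

Here counts holds one entry per dynamic key that matched at least once.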

How can i store a bulleted list or a list of values in a mongo field

The mongo collection below has a field called 'responsibilities'. The value of this field is a long string, as it contains bulleted values (as shown in the sample document below). Is there a better way of storing this value, instead of storing long string values?
{ "_id" : ObjectId("551d6f4c40cd93dd6bec7dbf"),
"name" : "xxxx",
"desc" : "xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"teamsize" : 11,
"location" : "xxxxx",
"startDate" : ISODate("2014-06-01T00:00:00Z"),
"endDate" : ISODate("2015-03-01T00:00:00Z"),
"responsibilities" : "1. xxxxxxx , 2.xxxxxx, 3.xxxxxxxx",
"organisationName" : "xxxxxxxx"
}
You could split the string and store the elements in an array field. Splitting the string would require some regex manipulation:
var responsibilities = "1. Jnflkvkbfjvb1 2. Kjnfbhvjbv2 3. kbvrjvbrjvb3 • Jnflkvkbfjvb4 • Kjnfbhvjbv5 • kbvrjvbrjvb6 A. Jnflkvkbfjvb7 B. Kjnfbhvjbv8 C. kbvrjvbrjvb9 I. Jnflkvkbfjvb10 II. Kjnfbhvjbv11 III. kbvrjvbrjvb12";
var myarray = responsibilities.split(/([0-9A-Z]+[.)]|•)\s+/);
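var res_array = myarray.filter(function(el, index) {
    // split() also returns the captured markers at odd indexes; keep the even-indexed text and drop the empty leading piece
    return index % 2 === 0 && el.trim() !== "";
}).map(function(el) {
    return el.trim(); // remove the whitespace left before the next marker
});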
console.log(res_array[0]); // Jnflkvkbfjvb1
console.log(res_array[4]); // Kjnfbhvjbv5
console.log(res_array[10]); // Kjnfbhvjbv11
Regex meaning:
( # group 1
[0-9A-Z]+ # any combination of digits 0-9 or letters A-Z
[.)] # either a dot or a closing paren
| # ...or
• # a bullet sign
)\s+ # end group 1, match any following whitespace
Once you get the array then do an update on your collection as follows:
db.collection.update(
{ name: "xxxx" },
{ $push: { duties: { $each: res_array } } }
)

Efficient Median Calculation in MongoDB

We have a Mongo collection named analytics and it tracks user visits by a cookie id. We want to calculate medians for several variables as users visit different pages.
Mongo does not yet have an internal method for calculating the median. I have used the method below for determining it, but I'm afraid there may be a more efficient way, as I'm pretty new to JS. Any comments would be appreciated.
// Saves the JS function for calculating the Median. Makes it accessible to the Reducer.
db.system.js.save({_id: "myMedianValue",
    value: function (sortedArray) {
        var m = 0.0;
        if (sortedArray.length % 2 === 0) {
            // Even-length array: average the middle two values
            var idx2 = sortedArray.length / 2;
            var idx1 = idx2 - 1;
            m = (sortedArray[idx1] + sortedArray[idx2]) / 2;
        } else {
            // Odd-length array: take the middle value
            var idx = Math.floor(sortedArray.length / 2);
            m = sortedArray[idx];
        }
        return m;
    }
});
var mapFunction = function () {
    var key = this.cookieId;
    var value = {
        // If there is only 1 view it will look like this
        // If there are multiple it gets passed to the reduceFunction
        medianVar1: this.Var1,
        medianVar2: this.Var2,
        viewCount: 1
    };
    emit(key, value);
};
var reduceFunction = function(keyCookieId, valueDicts) {
    var Var1Array = [];
    var Var2Array = [];
    var views = 0;
    for (var idx = 0; idx < valueDicts.length; idx++) {
        Var1Array.push(valueDicts[idx].medianVar1);
        Var2Array.push(valueDicts[idx].medianVar2);
        views += valueDicts[idx].viewCount;
    }
    var reducedDict = {
        medianVar1: myMedianValue(Var1Array.sort(function(a, b){return a-b})),
        medianVar2: myMedianValue(Var2Array.sort(function(a, b){return a-b})),
        viewCount: views
    };
    return reducedDict;
};
db.analytics.mapReduce(mapFunction,
reduceFunction,
{ out: "analytics_medians",
query: {Var1: {$exists:true},
Var2: {$exists:true}
}}
)
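One caveat with a reduce function that returns a median: MongoDB may call reduce more than once for the same key, feeding earlier reduce output back in, so the result can end up being a median of medians. A sketch of one way around that, assuming it is acceptable to carry the raw values through reduce and compute the median in a finalize function. The names median and finalizeFunction and the fields var1 and var2 are illustrative, and the per-key arrays must stay within the document size limit:

```javascript
// Plain-JS median; sorts a copy so the input array is left untouched.
// On the server this helper would have to be made available to finalize,
// for example by saving it in system.js as the question does.
function median(values) {
  var s = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(s.length / 2);
  return s.length % 2 === 0 ? (s[mid - 1] + s[mid]) / 2 : s[mid];
}

var mapFunction = function () {
  // Emit arrays so that map output and reduce output have the same shape.
  emit(this.cookieId, { var1: [this.Var1], var2: [this.Var2], viewCount: 1 });
};

var reduceFunction = function (key, values) {
  var out = { var1: [], var2: [], viewCount: 0 };
  for (var i = 0; i < values.length; i++) {
    out.var1 = out.var1.concat(values[i].var1);
    out.var2 = out.var2.concat(values[i].var2);
    out.viewCount += values[i].viewCount;
  }
  return out;
};

// Runs once per key after all reducing is done.
var finalizeFunction = function (key, reduced) {
  return {
    medianVar1: median(reduced.var1),
    medianVar2: median(reduced.var2),
    viewCount: reduced.viewCount
  };
};
```

finalizeFunction would be passed through the finalize option of mapReduce, next to out and query.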
The simple way to get the median value is to index on the field, then skip to the value halfway through the results.
> db.test.drop()
> db.test.insert([
{ "_id" : 0, "value" : 23 },
{ "_id" : 1, "value" : 45 },
{ "_id" : 2, "value" : 18 },
{ "_id" : 3, "value" : 94 },
{ "_id" : 4, "value" : 52 },
])
> db.test.ensureIndex({ "value" : 1 })
> var get_median = function() {
var T = db.test.count() // may want { "value" : { "$exists" : true } } if some documents may be missing the value field
return db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(Math.floor(T / 2)).limit(1).toArray()[0].value // may want to adjust the skip a bit depending on how you compute the median, e.g. in case of even T
}
> get_median()
45
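For an even number of documents, the usual convention is to average the two middle values, which changes the skip and limit. A small sketch of that arithmetic, with medianWindow and averageSlice as hypothetical helper names:

```javascript
// Returns which slice of the sorted values is needed for the median of `total` items.
function medianWindow(total) {
  if (total <= 0) {
    return null; // no documents, no median
  }
  var half = Math.floor(total / 2);
  return total % 2 === 0
    ? { skip: half - 1, limit: 2 }  // even: the two middle values
    : { skip: half, limit: 1 };     // odd: the single middle value
}

// Averages the one or two values fetched for that window.
function averageSlice(values) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) {
    sum += values[i];
  }
  return sum / values.length;
}

// In the shell the window would feed the sorted query, along the lines of:
//   var w = medianWindow(db.test.count());
//   db.test.find({}, { _id: 0, value: 1 }).sort({ value: 1 }).skip(w.skip).limit(w.limit)
var oddWindow = medianWindow(5);
var evenWindow = medianWindow(4);
```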
It's not amazing because of the skip, but at least the query will be covered by the index. For updating the median, you could be fancier. When a new document comes in or the value of a document is updated, you compare its value to the median. If the new value is higher, you need to adjust the median up by finding the next highest value from the current median doc (or taking an average with it, or whatever to compute the new median correctly according to your rules)
> db.test.find({ "value" : { "$gt" : median } }, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).limit(1)
You'd do the analogous thing if the new value is smaller than the current median. This bottlenecks your writes on this updating process, and has various cases to think about (how would you allow yourself to update multiple docs at once? update the doc that has the median value? update a doc whose value is smaller than the median to one whose value is larger than the median?), so it might be better just to update occasionally based on the skip procedure.
We ended up updating the medians on every page request, rather than in bulk with a cron job or something. We have a Node API that uses Mongo's aggregation framework to match and sort the user's results. The array of results is then passed to a median function within Node, and the results are written back to Mongo for that user. Not super pleased with it, but it doesn't appear to have locking issues and is performing well.

Can MongoDB aggregate "top x" results in this document schema?

{
"_id" : "user1_20130822",
"metadata" : {
"date" : ISODate("2013-08-22T00:00:00.000Z"),
"username" : "user1"
},
"tags" : {
"abc" : 19,
"123" : 2,
"bca" : 64,
"xyz" : 14,
"zyx" : 12,
"321" : 7
}
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags: E.g., Top 3 "tags" sorted descending?
Is this possible in a single document? e.g., top tags for a user on a given day
What if i have multiple documents that need to be combined together before getting the top? e.g., top tags for a user in a given month
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in place $inc's easier (many more of these happening than reads).
Or do I need to return back the whole document, and defer to the client on the sorting/limiting?
When you use object keys as tag names, you are making this kind of reporting very difficult. The aggregation framework has no $unwind equivalent for objects, at least before the $objectToArray operator that arrived in MongoDB 3.4.4. But there is always MapReduce.
Have your map function emit one document for each key/value pair in the tags subdocument. It should look something like this:
var mapFunction = function() {
for (var key in this.tags) {
emit(key, this.tags[key]);
}
}
Your reduce-function would then sum up the values emitted for the same key.
var reduceFunction = function(key, values) {
var sum = 0;
for (var i = 0; i < values.length; i++) {
sum += values[i];
}
return sum;
}
The complete MapReduce command would look something like this:
db.runCommand(
    {
        mapReduce: "yourcollection", // the collection where your data is stored
        query: { _id : "user1_20130822" }, // or however you want to limit the results
        map: mapFunction,
        reduce: reduceFunction,
        out: { inline: 1 } // means that the output is returned directly
    }
)
This will return all tags in unpredictable order. MapReduce has a sort and a limit option, but these only work on a field which has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results at the application level. If you insist on doing the sorting and limiting on the database, define an output collection to store the mapReduce results in (with the out option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards.
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduces with different queries into the same collection. When you have multiple users who want to view your top-3 list, you could let them query the output collection and do the MapReduce in the background at regular intervals.
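For the "defer to the client" option, the sorting and limiting is only a few lines once the documents have been fetched. A minimal sketch, with topTags and sumTags as hypothetical helper names and the sample data taken from the question:

```javascript
// Adds up the tag counters of several documents (e.g. all days of a month for one user).
function sumTags(docs) {
  var totals = {};
  docs.forEach(function (doc) {
    Object.keys(doc.tags).forEach(function (name) {
      totals[name] = (totals[name] || 0) + doc.tags[name];
    });
  });
  return totals;
}

// Returns the n tags with the highest counts, in descending order.
function topTags(tags, n) {
  return Object.keys(tags)
    .map(function (name) { return { tag: name, count: tags[name] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, n);
}

var day1 = { tags: { abc: 19, "123": 2, bca: 64, xyz: 14, zyx: 12, "321": 7 } };
var day2 = { tags: { abc: 50, zyx: 1 } };

var top3ForDay = topTags(day1.tags, 3);
var top3ForBoth = topTags(sumTags([day1, day2]), 3);
```

This keeps the in-place $inc schema untouched, at the cost of shipping each matching document to the client.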

MongoDb Summing multiple columns

I'm dabbling in MongoDB and trying to use map-reduce queries. I need to sum up multiple values from different columns (num1, num2, num3, num4, num5). Going off this guide, http://docs.mongodb.org/manual/tutorial/map-reduce-examples/ , I'm trying to alter the first example there to sum up all the values.
This is what I am trying/tried. I'm not sure if it can take in multiple values like this, I just assumed.
var query1 = function(){ emit("sumValues", this.num1, this.num2, this.num3, this.num4, this.num5)};
var query1query = function(sumValues, totalSumOfValues){ return Array.sum(totalSumOfValues); }
db.testData.mapReduce( query1, query1query, {out: "query_one_results"})
This is the error I get.
Sun Dec 1 18:52:24.627 JavaScript execution failed: map reduce failed:{
"errmsg" : "exception: JavaScript execution failed: Error: fast_emit takes 2 args near 'ction (){ emit(\"sumValues\", this.c' ",
"code" : 16722,
"ok" : 0
Is there another way to sum up all these values? Or where is my error in what I have.
I also tried this and it seemed to work, but when I do a .find() on the collection it creates, it only seems to be retrieving the values of sum5 and adding them together.
var query1 = function(){ emit({sum1: this.sum1,sum2: this.sum2,sum3: this.sum3,sum4: this.sum4,sum5: this.sum5},{count:1});}
var query1query = function(key, val){ return Array.sum(val); };
db.testData.mapReduce( query1, query1query, {out: "query_one_results"})
{
"result" : "query_one_results",
"timeMillis" : 19503,
"counts" : {
"input" : 173657,
"emit" : 173657,
"reduce" : 1467,
"output" : 166859
},
"ok" : 1,
}
Ok I think I got it. This is what I ended up doing
> var map1= function(){
... emit("Total Sum", this.sum1);
... emit("Total Sum", this.sum2);
... emit("Total Sum", this.sum3);
... emit("Total Sum", this.sum4);
... emit("Total Sum", this.sum5);
... emit("Total Sum", this.sum6);
... };
> var reduce1 = function(key, val){
... return Array.sum(val)
... };
> db.delayData.mapReduce(map1,reduce1,{out:'query_one_result'});
{
"result" : "query_one_result",
"timeMillis" : 9264,
"counts" : {
"input" : 173657,
"emit" : 1041942,
"reduce" : 1737,
"output" : 1
},
"ok" : 1,
}
> db.query_one_result.find()
{ "_id" : "Total Sum", "value" : 250 }
You were getting the two values that you need to emit in map backwards. The first one represents a unique key value over which you are aggregating something. The second one is the thing you want to aggregate.
Since you are trying to get a single sum for the entire collection (it seems), you need to output a single key, and for the value you need to output the sum of the five fields of each document.
map = function() { emit(1, this.num1+this.num2+this.num3+this.num4+this.num5); }
reduce = function(key, values) { return Array.sum(values); }
This will work as long as num1 through num5 are set in every document - you can make the function more robust by simply checking that each of those fields exists and if so, adding its value in.
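A sketch of that more robust map function, assuming the fields are stored as plain numbers (values wrapped in shell types such as NumberLong would need a different check). The emit function here is a local stand-in so the snippet can be tried outside the server:

```javascript
// Stand-in for the server-side emit(); records what map would output.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

var map = function () {
  var fields = ["num1", "num2", "num3", "num4", "num5"];
  var total = 0;
  for (var i = 0; i < fields.length; i++) {
    var v = this[fields[i]];
    // Skip fields that are missing or not numeric instead of producing NaN.
    if (typeof v === "number") {
      total += v;
    }
  }
  emit(1, total);
};

// Try it on a document where num3 and num5 are missing.
map.call({ num1: 1, num2: 2, num4: 4 });
```

Without the check, a single missing field would turn that document's sum into NaN and poison the overall total.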
In real life, you can do this faster and simpler with aggregation framework:
db.collection.aggregate({$group:{_id:1, sum:{$sum:{$add:["$num1","$num2","$num3","$num4","$num5"]}}}})