MongoDB mapreduce missing data with 'null' in return

So this is strange. I'm trying to use mapreduce to group datetime/metrics under a unique port:
Document layout:
{
    "_id" : ObjectId("5069d68700a2934015000000"),
    "port_name" : "CL1-A",
    "metric" : "340.0",
    "port_number" : "0",
    "datetime" : ISODate("2012-09-30T13:44:00Z"),
    "array_serial" : "12345"
}
and mapreduce functions:
var query = {
    'array_serial' : array,
    'port_name' : { $in : ports },
    'datetime' : { $gte : from, $lte : to }
}

var map = function() {
    emit( { portname : this.port_name },
          { datetime : this.datetime, metric : this.metric } );
}

var reduce = function(key, values) {
    var res = { dates : [], metrics : [], count : 0 }
    values.forEach(function(value) {
        res.dates.push(value.datetime);
        res.metrics.push(value.metric);
        res.count++;
    })
    return res;
}

var command = {
    mapreduce : collection,
    map : map.toString(),
    reduce : reduce.toString(),
    query : query,
    out : { inline : 1 }
}

mongoose.connection.db.executeDbCommand(command, function(err, dbres) {
    if (err) throw err;
    console.log(dbres.documents);
    res.json(dbres.documents[0].results);
})
If a small number of records is requested, say 5 or 10, or even 60, I get back all the data I'm expecting. Larger queries return truncated values...
I just did some more testing and it seems like it's limiting the record output to 100?
This is minutely data, and when I run a query for a 24-hour period I would expect 1440 records back. I just ran it and received 80. :\
Is this expected? I'm not specifying a limit anywhere that I can tell...
More data:
Query for records from 2012-10-01T23:00 - 2012-10-02T00:39 (100 minutes) returns correctly:
[
{
"_id": {
"portname": "CL1-A"
},
"value": {
"dates": [
"2012-10-01T23:00:00.000Z",
"2012-10-01T23:01:00.000Z",
"2012-10-01T23:02:00.000Z",
...cut...
"2012-10-02T00:37:00.000Z",
"2012-10-02T00:38:00.000Z",
"2012-10-02T00:39:00.000Z"
],
"metrics": [
"1596.0",
"1562.0",
"1445.0",
...cut...
"774.0",
"493.0",
"342.0"
],
"count": 100
}
}
]
...add one more minute to the query, 2012-10-01T23:00 - 2012-10-02T00:40 (101 minutes):
[
{
"_id": {
"portname": "CL1-A"
},
"value": {
"dates": [
null,
"2012-10-02T00:40:00.000Z"
],
"metrics": [
null,
"487.0"
],
"count": 2
}
}
]
the dbres.documents object shows the correct expected emitted records:
[ { results: [ [Object] ],
timeMillis: 8,
counts: { input: 101, emit: 101, reduce: 2, output: 1 },
ok: 1 } ]
...so is the data getting lost somewhere?

Rule number one of MapReduce:
Thou shalt return from reduce the exact same format that you emit with your key in map.
Rule number two of MapReduce:
Thou shalt reduce the array of values passed to reduce as many times as necessary. The reduce function may be called many times.
You've broken both of those rules in your implementation of reduce.
Your Map function is emitting key, value pairs.
key: port name (you should simply emit the name as the key, not a document)
value: a document representing three things you need to accumulate (date, metric, count)
Try this instead:
map = function() { // if you want to reduce to an array you have to emit arrays
    emit ( this.port_name, { dates : [this.datetime], metrics : [this.metric], count: 1 });
}

reduce = function(key, values) { // for each key you get an array of values
    var res = { dates: [], metrics: [], count: 0 }; // you must reduce them to one
    values.forEach(function(value) {
        res.dates = value.dates.concat(res.dates);
        res.metrics = value.metrics.concat(res.metrics);
        res.count += value.count; // VERY IMPORTANT reduce result may be re-reduced
    })
    return res;
}
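If you want to sanity-check the corrected functions from the mongo shell before wiring them back into your mongoose command, a minimal sketch (reusing the collection name and the same query object from your question) could be:
db.runCommand({
    mapreduce: collection,   // your collection name
    map: map,
    reduce: reduce,
    query: query,            // same serial/port/date filter as before
    out: { inline: 1 }
})
// each result should now have _id = port name and value = { dates, metrics, count }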

Try outputting the map-reduce data to a temporary collection instead of in memory; maybe that is the reason (a sketch follows the quote below). From the Mongo docs:
{ inline : 1} - With this option, no collection will be created, and
the whole map-reduce operation will happen in RAM. Also, the results
of the map-reduce will be returned within the result object. Note that
this option is possible only when the result set fits within the 16MB
limit of a single document. In v2.0, this is your only available
option on a replica set secondary.
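As a rough sketch of that change (the output collection name mapreduce_tmp is just an example), swap the inline output for a named collection and read the results back from it:
var command = {
    mapreduce : collection,
    map : map.toString(),
    reduce : reduce.toString(),
    query : query,
    out : { replace : "mapreduce_tmp" }   // write results to a collection instead of RAM
}

mongoose.connection.db.executeDbCommand(command, function(err, dbres) {
    if (err) throw err;
    // read the results back from the output collection
    mongoose.connection.db.collection("mapreduce_tmp", function(err, coll) {
        if (err) throw err;
        coll.find({}).toArray(function(err, docs) {
            if (err) throw err;
            res.json(docs);
        });
    });
})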
Also, it may not be the reason, but MongoDB has a data size limitation (about 2GB) on a 32-bit machine.


How do you find aggregate in mongo array with size greater than two?

In MongoDB 2.6, I have documents like the two below:
nms:PRIMARY> db.checkpointstest4.find()
{ "_id" : 1, "cpu" : [ 100, 20, 60 ], "hostname" : "host1" }
{ "_id" : 2, "cpu" : [ 40, 30, 80 ], "hostname" : "host1" }
I need to find the average cpu (per cpu array index) per host, i.e. based on the two documents above, the average for host1 will be [70, 25, 70], because cpu[0] is (100 + 40) / 2 = 70, etc.
I get lost when I have 3 array elements instead of two; see mongodb aggregate average of array elements.
Finally, the below worked for me:
var map = function () {
    for (var idx = 0; idx < this.cpu.length; idx++) {
        var mapped = {
            idx: idx,
            val: this.cpu[idx]
        };
        emit(this.hostname, {"cpu": mapped});
    }
};

var reduce = function (key, values) {
    var cpu = []; var sum = [0,0,0]; var cnt = [0,0,0];
    values.forEach(function (value) {
        sum[value.cpu.idx] += value.cpu.val;
        cnt[value.cpu.idx] += 1;
        cpu[value.cpu.idx] = sum[value.cpu.idx] / cnt[value.cpu.idx];
    });
    return {"cpu": cpu};
};

db.checkpointstest4.mapReduce(map, reduce, {out: "checkpointstest4_result"});
In MongoDB 3.2, where includeArrayIndex showed up, you can do this:
db.test.aggregate(
{$unwind: {path:"$cpu", includeArrayIndex:"index"}},
{$group: {_id:{h:"$hostname",i:"$index"}, cpu:{$avg:"$cpu"}}},
{$sort:{"_id.i":1}},
{$group:{_id:"$_id.h", cpu:{$push:"$cpu"}}}
)
// Make a row for each array element with an index field added.
{$unwind: {path:"$cpu", includeArrayIndex:"index"}},
// Group by hostname+index, calculate average for each group.
{$group: {_id:{h:"$hostname",i:"$index"}, cpu:{$avg:"$cpu"}}},
// Sort by index (to get the array in the next step sorted correctly)
{$sort:{"_id.i":1}},
// Group by host, pushing the averages into an array in order.
{$group:{_id:"$_id.h", cpu:{$push:"$cpu"}}}
Upgrading would be your best option, as mentioned, with includeArrayIndex available to $unwind from MongoDB 3.2 onwards.
If you cannot do that, then you can always process with mapReduce instead:
db.checkpointstest4.mapReduce(
    function() {
        var mapped = this.cpu.map(function(val) {
            return { "val": val, "cnt": 1 };
        });
        emit(this.hostname, { "cpu": mapped });
    },
    function(key, values) {
        var cpu = [];
        values.forEach(function(value) {
            value.cpu.forEach(function(item, idx) {
                if ( cpu[idx] == undefined )
                    cpu[idx] = { "val": 0, "cnt": 0 };
                cpu[idx].val += item.val;
                cpu[idx].cnt += item.cnt;
            });
        });
        return { "cpu": cpu };
    },
    {
        "out": { "inline": 1 },
        "finalize": function(key, value) {
            return {
                "cpu": value.cpu.map(function(cpu) {
                    return cpu.val / cpu.cnt;
                })
            };
        }
    }
)
So the steps there are: in the "mapper" function, transform the array content into an array of objects containing the "value" from the element and a "count" for later reference as input to the "reduce" function. This needs to be consistent with how the reducer works with it, and is necessary to get the overall counts needed to compute the average.
In the "reducer" itself you are basically summing the array contents for each position, for both the "value" and the "count". This is important as the "reduce" function can be called multiple times in the overall reduction process, feeding its output as "input" to a subsequent call. That is why both mapper and reducer work in this format.
With the final reduced results, the finalize function is called to simply look at each summed "value" and "count" and divide the value by the count to return an average.
Mileage may vary on whether modern aggregation pipeline processing or this mapReduce process performs best, mostly depending on the data. Using $unwind in the prescribed way will certainly increase the number of documents to be analyzed and thus produce overhead. On the other hand, JavaScript processing is generally slower than the native operators in the aggregation framework, but the document processing overhead here is smaller since the arrays are kept intact.
The advice I would give is to use this if upgrading to 3.2 is not an option; but even if upgrading is an option, at least benchmark the two on your data and expected growth to see which works best for you.
Returns
{
"results" : [
{
"_id" : "host1",
"value" : {
"cpu" : [
70,
25,
70
]
}
}
],
"timeMillis" : 38,
"counts" : {
"input" : 2,
"emit" : 2,
"reduce" : 1,
"output" : 1
},
"ok" : 1
}

mongodb delete nested object without knowledge of object nodes

For the below document, I am trying to delete the node which contains id = 123
{
'_id': "1234567890",
"image" : {
"unknown-node-1" : {
"id" : 123
},
"unknown-node-2" : {
"id" : 124
}
}
}
Result should be as below.
{
'_id': "1234567890",
"image" : {
"unknown-node-2" : {
"id" : 124
}
}
}
The below query achieves the result, but I have to know unknown-node-1 in advance. How can I achieve the result without prior knowledge of the node name, when the only info I have is image.*.id = 123 (where * means an unknown node)?
Is this possible in mongo, or should I do this lookup in my app code?
db.test.update({'_id': "1234567890"}, {$unset: {'image.unknown-node-1': ""}})
Faiz,
There is no operator to help match and project a single key-value pair without knowing the key. You'll have to write post-processing code to scan each of the documents to find the node with the matching id and then perform your removal.
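As a rough sketch of that post-processing (using the collection name test and the ids from your example; the fuller collection-wide approach is shown further down):
// fetch the document, find whichever node has the matching id, then $unset it
var doc = db.test.findOne({ '_id': "1234567890" });
for (var node in doc.image) {
    if (doc.image[node].id === 123) {
        var unsetSpec = {};
        unsetSpec["image." + node] = "";
        db.test.update({ '_id': doc._id }, { $unset: unsetSpec });
    }
}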
If you have the liberty of changing your schema, you'll have more flexibility. With a document design like this:
{
'_id': "1234567890",
"image" : [
{"id" : 123, "name":"unknown-node-1"},
{"id" : 124, "name":"unknown-node-2"},
{"id" : 125, "name":"unknown-node-3"}
]
}
You could remove documents from the array like this:
db.collectionName.update(
{'_id': "1234567890"},
{ $pull: { image: { id: 123} } }
)
This would result in:
{
'_id': "1234567890",
"image" : [
{"id" : 124, "name":"unknown-node-2"},
{"id" : 125, "name":"unknown-node-3"}
]
}
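A nice side effect of the array design is that you can match on the id directly, without knowing which element it lives in, for example:
// matches any document whose image array contains an element with id 123
db.collectionName.find({ "image.id": 123 })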
With your current schema, you will need a mechanism to get the list of dynamic keys so you can assemble the query before doing the update, and one way of doing this is with MapReduce. Take for instance the following map-reduce operation, which will populate a separate collection with all the keys as the _id values:
mr = db.runCommand({
    "mapreduce": "test",
    "map": function() {
        for (var key in this.image) { emit(key, null); }
    },
    "reduce": function(key, stuff) { return null; },
    "out": "test_keys"
})
To get a list of all the dynamic keys, run distinct on the resulting collection:
> db[mr.result].distinct("_id")
[ "unknown-node-1", "unknown-node-2" ]
Now, given the list above, you can assemble your query by creating an object whose properties are set within a loop. Normally, if you knew the keys beforehand, your query would have this structure:
var query = {
        "image.unknown-node-1.id": 123
    },
    update = {
        "$unset": {
            "image.unknown-node-1": ""
        }
    };
db.test.update(query, update);
But since the nodes are dynamic, you will have to iterate the list returned from the mapReduce operation and, for each element, create the query and update parameters as above to update the collection. The list could be huge, so for maximum efficiency, and if your MongoDB server is 2.6 or newer, it would be better to take advantage of the Bulk API for write commands, which allows the execution of bulk update operations. These are simply abstractions on top of the server that make it easy to build bulk operations and thus get performance gains with your updates over large collections. These bulk operations come mainly in two flavours:
Ordered bulk operations. These operations execute all the operations in order and error out on the first write error.
Unordered bulk operations. These operations execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
Note: for servers older than 2.6 the API will down-convert the operations. However, it's not possible to down-convert 100%, so there might be some edge cases where it cannot correctly report the right numbers.
In your case, you could implement the Bulk API update operation like this:
mr = db.runCommand({
    "mapreduce": "test",
    "map": function() {
        for (var key in this.image) { emit(key, null); }
    },
    "reduce": function(key, stuff) { return null; },
    "out": "test_keys"
})

// Get the dynamic keys
var dynamic_keys = db[mr.result].distinct("_id");

// Get the collection and bulk api artefacts
var bulk = db.test.initializeUnorderedBulkOp(), // Initialize the unordered batch
    counter = 0;

// Queue an update for each dynamic key
dynamic_keys.forEach(function(key) {
    // Create the query and update documents
    var query = {},
        update = {
            "$unset": {}
        };
    query["image." + key + ".id"] = 123;
    update["$unset"]["image." + key] = "";
    bulk.find(query).update(update);
    counter++;

    if (counter % 100 == 0) {
        // Execute per 100 queued operations and re-initialise the batch
        bulk.execute();
        bulk = db.test.initializeUnorderedBulkOp();
    }
});

// Flush any remaining queued operations
if (counter % 100 != 0) { bulk.execute(); }
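If your server is older than 2.6 (or the key list is small), a plain loop of individual multi-updates is a reasonable fallback sketch that does the same thing without the batching:
db[mr.result].distinct("_id").forEach(function(key) {
    var query = {},
        update = { "$unset": {} };
    query["image." + key + ".id"] = 123;
    update["$unset"]["image." + key] = "";
    db.test.update(query, update, false, true);   // upsert = false, multi = true
});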

"Map Reduce" reduce function finds value undefined

I have the following collection:
{
"_id" : ObjectId("51f1fcc08188d3117c6da351"),
"cust_id" : "abc123",
"ord_date" : ISODate("2012-10-03T18:30:00Z"),
"status" : "A",
"price" : 25,
"items" : [{
"sku" : "ggg",
"qty" : 7,
"price" : 2.5
}, {
"sku" : "ppp",
"qty" : 5,
"price" : 2.5
}]
}
My map function is:
var map=function(){emit(this._id,this);}
For debugging purposes I override the emit method as follows:
var emit = function (key,value){
print("emit");
print("key: " + key + "value: " + tojson(value));
reduceFunc2(key, toJson(value));
}
and the reduce function as follows:
var reduceFunc2 = function reduce(key, values) {
    var val = values;
    print("val", val);
    var items = [];
    val.items.some(function (entry) {
        print("entry is:::" + entry);
        if (entry.qty > 5 && entry.sku == 'ggg') {
            items.push(entry)
        }
    });
    val.items = items;
    return val;
}
But when I apply map as:
var myDoc = db.orders.findOne({
_id: ObjectId("51f1fcc08188d3117c6da351")
});
map.apply(myDoc);
I get the following error:
emit key: 51f1fcc08188d3117c6da351 value:
{
"_id":" ObjectId(\"51f1fcc08188d3117c6da351\")",
"cust_id":"abc123",
"ord_date":" ISODate(\"2012-10-03T18:30:00Z\")",
"status":"A",
"price":25,
"items":[
{
"sku":"ggg",
"qty":7,
"price":2.5
},
{
"sku":"ppp",
"qty":5,
"price":2.5
}
]
}
value:: undefined
Tue Jul 30 12:49:22.920 JavaScript execution failed: TypeError: Cannot call method 'some' of undefined
You can see that there is an items field in the printed value, and it is an array; even so it throws the error "cannot call some on undefined". Can someone tell me where I am going wrong?
You have an error in your reduceFunc2 function:
var reduceFunc2 = function reduce(key,values){
var val = values[0]; //values is an array!!!
// ...
}
The reduce function is meant to reduce an array of elements, emitted with the same key, to a single document. So it accepts an array. You're emitting each key only once, so it's an array with a single element in it.
Now you'll be able to call your MapReduce normally:
db.orders.mapReduce(map, reduceFunc2, {out: {inline: 1}});
The way you've overridden the emit function is broken, so you shouldn't use it.
Update. Mongo may skip the reduce operation if there is only one document associated with the given key, because there is no point in reducing a single document.
The idea of MapReduce is that you map each document into key-value pairs to be reduced in the next step. If there is more than one value associated with a given key, Mongo runs a reduce operation to reduce them to a single document. Mongo expects the reduce function to return the reduced document in the same format as the elements that were emitted. That's why Mongo may run the reduce operation any number of times for each key (up to the number of emits). There is also no guarantee that the reduce operation will be called at all if there is nothing to reduce (e.g. if there is only one element).
So it's best to move the map logic to the proper place.
Update 2. Anyway, why are you using MapReduce here? You can just query for the documents you need:
db.orders.find({}, {
    items: {
        $elemMatch: {
            qty: {$gt: 5},
            sku: 'ggg'
        }
    }
})
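Note that $elemMatch used as a projection returns only the first matching array element, and on its own it does not filter out orders with no matching item; a sketch that also filters would use the same $elemMatch in the query part:
db.orders.find(
    // filter: only orders that actually contain a matching item
    { items: { $elemMatch: { qty: { $gt: 5 }, sku: 'ggg' } } },
    // projection: return just the first matching item from the array
    { items: { $elemMatch: { qty: { $gt: 5 }, sku: 'ggg' } } }
)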
Update 3. If you really want to do it with MapReduce, try this:
db.runCommand({
    mapreduce: 'orders',
    query: {
        items: {
            $elemMatch: {
                qty: {$gt: 5},
                sku: 'ggg'
            }
        }
    },
    map: function map() {
        this.items = this.items.filter(function (entry) {
            return (entry.qty > 5 && entry.sku == 'ggg')
        });
        emit(this._id, this);
    },
    reduce: function reduce(key, values) {
        return values[0];
    },
    verbose: true,
    out: {
        merge: 'map_reduce'
    }
})

Merge changeset documents in a query

I have recorded changes from an information system in a mongo database. Every time a set of values is set or changed, a record is saved in the mongo database.
The change collection is in the following form:
{ "user_id": 1, "timestamp": { "date" : "2010-09-22 09:28:02", "timezone_type" : 3, "timezone" : "Europe/Paris" } }, "changes: { "fieldA": "valueA", "fieldB": "valueB", "fieldC": "valueC" } }
{ "user_id": 1, "timestamp": { "date" : "2010-09-24 19:01:52", "timezone_type" : 3, "timezone" : "Europe/Paris" } }, "changes: { "fieldA": "new_valueA", "fieldB": null, "fieldD": "valueD" } }
{ "user_id": 1, "timestamp": { "date" : "2010-10-01 11:11:02", "timezone_type" : 3, "timezone" : "Europe/Paris" } }, "changes: { "fieldD": "new_valueD" } }
Of course there are thousands of records per user with different attributes, which represents millions of records. What I want to do is to see a user's status at a given time. For example, user_id 1 at 2010-09-30 would be:
fieldA: new_valueA
fieldC: valueC
fieldD: valueD
This means I need to flatten all the changes prior to a given date for a given user into a single record. Can I do that directly in mongo ?
Edit: I am using version 2.0 of mongodb, hence I cannot benefit from the aggregation framework.
Edit: It seems I have found the answer to my question.
var mapTimeAndChangesByUserId = function() {
    var key = this.user_id;
    var value = { timestamp: this.timestamp.date, changes: this.changes };
    emit(key, value);
}

var reduceMergeChanges = function(user_id, changeset) {
    var mergeFunction = function(a, b) { for (var attr in b) a[attr] = b[attr]; };
    var result = {};
    changeset.forEach(function(e) { mergeFunction(result, e.changes); });
    return { timestamp: changeset.pop().timestamp, changes: result };
}
The reduce function merges the changes in the order they come and returns the result.
db.user_change.mapReduce(
    mapTimeAndChangesByUserId,
    reduceMergeChanges,
    {
        out: { inline: 1 },
        query: { user_id: 1, "timestamp.date": { $lt: "2010-09-30" } },
        sort: { "timestamp.date": 1 }
    });
"results" : [
    {
        "_id": 1,
        "value": {
            "timestamp": "2010-09-24 19:01:52",
            "changes": {
                "fieldA": "new_valueA",
                "fieldB": null,
                "fieldC": "valueC",
                "fieldD": "valueD"
            }
        }
    }
]
Which is fine to me.
You could write an MR to do this.
Since the fields are a lot like tags, you can modify a nice cookbook example of counting tags here: http://cookbook.mongodb.org/patterns/count_tags/. Of course, instead of counting, you want the latest value applied for that field (an assumption, since this is not clear in your question).
So let's get our map function:
map = function() {
    if (!this.changes) {
        // If there were no changes for some reason, let's bail on this record
        return;
    }
    // We iterate the changes
    for (index in this.changes) {
        emit(index /* we emit the field name */, this.changes[index] /* we emit the field value */);
    }
}
And now for our reduce:
reduce = function(key, values) {
    // This part is dependent upon your input query. If you add a sort of
    // date (ts) DESC then you will probably want the first index (0), not the
    // last element as taken here via values.length - 1
    return values[values.length - 1];
}
And this will output a single document per field change of the type:
{
_id: your_field_ie_fieldA,
value: whoop
}
You can then iterate the (most likely inline) output and, bam, you have your changes.
This is of course one way of doing it, and it is not designed to be run completely inline with your app; however, that all depends on the size of the data you're working on. It could be run very close to it.
I am unsure whether group or distinct can run on this, but it looks like group might: http://docs.mongodb.org/manual/reference/method/db.collection.group/#db-collection-group. I should note that group is basically a MR wrapper, but you could do something like this (untested, just like the MR above):
db.col.group({
    key: { 'changes.fieldA': 1 /* ...and the rest of the fields */ },
    cond: { 'timestamp.date': { $gt: new Date('01/01/2012') } },
    reduce: function (curr, result) { },
    initial: { }
})
But it does require you to define the keys instead of just iterating them programmatically (so maybe there is a better way).

MongoDB amount of entries

How do I get the number of entries (in a collection) that hold unique values for a certain key?
For example:
{
"_id": ObjectId("4f9d996eba6a7aa62b0005ed"),
"tag": "web"
}
{
"_id": ObjectId("4f9d996eba6a7aa62b0006ed"),
"tag": "net"
}
{
"_id": ObjectId("4f9d996eba6a7aa62b0007ed"),
"tag": "web"
}
{
"_id": ObjectId("4f9d996eba6a7aa62b0008ed"),
"tag": "page"
}
The number of entries with unique values for the key "tag"; here it should return 3.
ps: If it is possible, how do I get a list of all the unique values found for the key "tag"?
You can use Map/Reduce to group entries.
I created a collection named "content" and inserted the 4 documents you listed above into it. Then, using the following map/reduce code, you can group them by their tag. At the end, the result is written into the collection named "result".
var map = function() {
    emit(this.tag, 1);
};

var reduce = function(key, values) {
    var result = 0;
    for (v in values) {
        result += values[v];
    }
    return result;
};

res = db.content.mapReduce(map, reduce, {out: "result"});
Result is shown as follows.
{ "_id" : "net", "value" : 1 }
{ "_id" : "page", "value" : 1 }
{ "_id" : "web", "value" : 2 }
To get a unique list of values you need either DISTINCT or GROUP BY. MongoDB supports both paradigms:
Distinct
Group
To get a list of all distinct tag values you would run:
db.mycollection.distinct('tag');
or
db.runCommand({ distinct: 'mycollection', key: 'tag' })
and you can get the count by just looking at the length of the result:
db.mycollection.distinct('tag').length
Note on the distinct command from the documentation: the distinct command results are returned as a single BSON object. If the results could be large (greater than the max document size, 4/16MB), use map/reduce instead.
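If your server has the aggregation framework (MongoDB 2.2+), a $group stage is another way to get both the distinct values and their counts in one pass; a sketch:
// one output document per distinct tag, with its number of occurrences
db.mycollection.aggregate([
    { $group: { _id: "$tag", count: { $sum: 1 } } }
])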