Trouble with mongo map reduce and aggregating key names - mongodb

I have a collection in my database representing IP addresses pulled from various sources. A sample of which looks like this:
{ "_id" : ObjectId("4e71060444dce16174378b79"), "ip" : "xxx.xxx.xxx.xxx", "sources" : { "Source1" : NumberLong(52), "Source2" : NumberLong(7) } }
Each object will have one or more sources.
My goal is to show the number of entries reported by each source without necessarily knowing every possible source name in advance (because new ones can be added at any time). I have attempted to address this with map/reduce by simply emitting a 1 for each key in the sources hash of each object, but something seems to be wrong with my approach. If I do the following:
var map_s = function(){
    for(var source in this.sources) {
        emit(source, 1);
    }
}
var red_s = function(key, values){
    var total = 0;
    values.forEach(function(){
        total++;
    });
    return total;
}
var op = db.addresses.mapReduce(map_s, red_s, {out: 'results'});
db.results.find().forEach(printjson);
I get
{ "_id" : "Source1", "value" : 12 }
{ "_id" : "Source2", "value" : 230 }
{ "_id" : "Source3", "value" : 358 }
{ "_id" : "Source4", "value" : 398 }
{ "_id" : "Source5", "value" : 39 }
{ "_id" : "Source6", "value" : 420 }
{ "_id" : "Source7", "value" : 156 }
These counts are far too small for the size of the database. For instance, counting a single source directly in the shell gives:
> db.addresses.count({"sources.Source4": {$exists: true}});
1260538
Where is my error?

Yes, there is a problem in your reduce function: it must be idempotent, because reduce() may be called many times on intermediate results. For example, the server might first reduce [1, 1, 1] to 3 and later call reduce again with [3, 1, 1]; counting the elements there yields 3 instead of the correct 5.
Instead of
    values.forEach(function(){
        total++;
    });
you need:
    values.forEach(function(x){
        total += x;
    });
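Equivalently, the legacy mongo shell ships an Array.sum() helper, so the whole reduce can be written as a one-liner (a stylistic alternative to the fix above, assuming the shell helper is available):

var red_s = function(key, values){
    // values holds partial counts rather than raw documents,
    // so summing stays correct when reduce re-runs on its own output.
    return Array.sum(values);
}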

Related

How do I maintain data types when copying document?

I need to switch from the String _id I was using to a generated ObjectId, but in the process the count field's data type changes from Int to Double.
For example say we have a document
{_id: "Product Name", count: 415 }
Now I want to create a document
{_id: "some object id", name: "Product Name", count: 415 }
I am using code similar to the snippet below, but it turns count into a Double.
var cursor = db.products.find()
cursor.forEach(function(item) {
    var old_id = item._id;
    item.name = old_id;
    delete item._id;
    db.products.insert(item);
    db.products.remove({_id: old_id});
});
I can add item.count = NumberInt(item.count) inside the loop to make sure it stays an Int, but I really don't want to do this for every field that I have.
Is there any way to do this without manually casting them? I don't understand why it takes an Int and turns it into a Double. I know Double is the default, but the fields I am working with are already Integers.
Well if I understand you, your documents look like this:
{ "_id" : "Apple", "count" : 187 }
{ "_id" : "Google", "count" : 123 }
{ "_id" : "Amazon", "count" : 325 }
{ "_id" : "Oracle", "count" : 566 }
You can use the Bulk API to update your collection.
var bulk = db.collection.initializeUnorderedBulkOp();
var count = 0;
db.collection.aggregate([{ $project: { '_id': 0, 'name': '$_id', 'count': 1 } }]).forEach(function(doc){
    // Queue a delete of the old document and an insert of the reshaped one.
    bulk.find({'_id': doc.name}).remove();
    bulk.insert(doc);
    count++;
    if (count % 1000 == 0){
        // Execute per 1000 operations and re-init.
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
})
// Flush any remaining queued operations.
if (count % 1000 != 0){
    bulk.execute();
}
Then:
db.collection.find()
yields the following documents:
{ "_id" : ObjectId("55a7e2c7eb68594275546c7c"), "count" : 187, "name" : "Apple" }
{ "_id" : ObjectId("55a7e2c7eb68594275546c7d"), "count" : 123, "name" : "Google" }
{ "_id" : ObjectId("55a7e2c7eb68594275546c7e"), "count" : 325, "name" : "Amazon" }
{ "_id" : ObjectId("55a7e2c7eb68594275546c7f"), "count" : 566, "name" : "Oracle" }
Is there any way to do this without manually casting them? I don't understand why it takes an Int and turns it into a Double. I know Double is the default, but the fields I am working with are already Integers.
You don't really need to worry about that if you are using the shell, but as pointed out in the comments, you can always use a language with native support for integers to preserve the data type.
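If you do stay in the shell and only a few fields are integers, one workaround (a sketch based on the question's products example, writing to a hypothetical products_new collection to avoid re-reading freshly inserted documents) is to wrap just those fields in NumberInt while copying:

db.products.find().forEach(function(item){
    db.products_new.insert({
        name: item._id,               // the old string _id becomes the name field
        count: NumberInt(item.count)  // force int32; shell numbers default to double
    });
});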

Incremental MapReduce with time interval in mongoDB

I get records from a server at 10-minute intervals (so in 1 hour I will get 6 files).
I want to run map/reduce over each hour of data; in the next hour I have to run map/reduce over the next group of 6 files together with the previous hour's output.
How can I solve this problem?
I have been confused about this for the last month.
Thank you,
Sushil Kr Singh
In order to summarize your 10-minute log files by the hour, you could round down the timestamp of each logfile to the nearest hour in the map function and group the results by hours in the reduce function.
Here is a little dummy example that illustrates this from the mongo shell:
Create 100 log files, each 10 minutes apart and containing a random number between 0 and 10, and insert them into the logs collection:
for (var i = 0; i < 100; i++) {
    d = new ISODate();
    d.setMinutes(d.getMinutes() + i * 10);
    r = Math.floor(Math.random() * 11);
    db.logs.insert({timestamp: d, number: r});
}
To check what the logs collection looks like, send a query like db.logs.find().limit(3).pretty(), which results in:
{
    "_id" : ObjectId("50455a3570537f9433c1efb2"),
    "timestamp" : ISODate("2012-09-04T01:32:37.370Z"),
    "number" : 2
}
{
    "_id" : ObjectId("50455a3570537f9433c1efb3"),
    "timestamp" : ISODate("2012-09-04T01:42:37.370Z"),
    "number" : 3
}
{
    "_id" : ObjectId("50455a3570537f9433c1efb4"),
    "timestamp" : ISODate("2012-09-04T01:52:37.370Z"),
    "number" : 8
}
Define a map function (in this example called mapf) that rounds the timestamps to the nearest hour (rounded down), which is used for the emit key. The emit value is the number for that log file.
mapf = function () {
    // round the timestamp down to the nearest hour
    d = this.timestamp;
    d.setMinutes(0);
    d.setSeconds(0);
    d.setMilliseconds(0);
    emit(d, this.number);
}
Define a reduce function that sums over all the emitted values (i.e. the numbers).
reducef = function (key, values) {
    var sum = 0;
    values.forEach(function (v) {
        sum += v;
    });
    return sum;
}
Now execute map/reduce on the logs collection. The out parameter here specifies that we want to write the results to the hourly_logs collection and merge existing documents with new results. This ensures that log files submitted later (e.g. after a server failure or other delay) will be included in the results once they appear in the logs.
db.logs.mapReduce(mapf, reducef, {out: { merge : "hourly_logs" }})
Lastly, to see the results, you can query a simple find on hourly_logs:
db.hourly_logs.find()
{ "_id" : ISODate("2012-09-04T02:00:00Z"), "value" : 33 }
{ "_id" : ISODate("2012-09-04T03:00:00Z"), "value" : 31 }
{ "_id" : ISODate("2012-09-04T04:00:00Z"), "value" : 21 }
{ "_id" : ISODate("2012-09-04T05:00:00Z"), "value" : 40 }
{ "_id" : ISODate("2012-09-04T06:00:00Z"), "value" : 26 }
{ "_id" : ISODate("2012-09-04T07:00:00Z"), "value" : 26 }
{ "_id" : ISODate("2012-09-04T08:00:00Z"), "value" : 25 }
{ "_id" : ISODate("2012-09-04T09:00:00Z"), "value" : 46 }
{ "_id" : ISODate("2012-09-04T10:00:00Z"), "value" : 27 }
{ "_id" : ISODate("2012-09-04T11:00:00Z"), "value" : 42 }
{ "_id" : ISODate("2012-09-04T12:00:00Z"), "value" : 43 }
{ "_id" : ISODate("2012-09-04T13:00:00Z"), "value" : 35 }
{ "_id" : ISODate("2012-09-04T14:00:00Z"), "value" : 22 }
{ "_id" : ISODate("2012-09-04T15:00:00Z"), "value" : 34 }
{ "_id" : ISODate("2012-09-04T16:00:00Z"), "value" : 18 }
{ "_id" : ISODate("2012-09-04T01:00:00Z"), "value" : 13 }
{ "_id" : ISODate("2012-09-04T17:00:00Z"), "value" : 25 }
{ "_id" : ISODate("2012-09-04T18:00:00Z"), "value" : 7 }
The result is an hourly summary of your 10-minute logs, with the _id field containing the start of the hour and the value field the sum of the random numbers. In your case you may need different aggregation operators; modify the reduce function according to your needs.
As Sammaye mentioned in the comment, you could automate the map/reduce call with a cron job entry to run every hour.
If you don't want to process the full logs collection every time, you can run incremental updates by limiting the documents to hourly time windows like so:
var q = {
    $and: [
        {timestamp: {$gte: new Date(2012, 8, 4, 12, 0, 0)}},
        {timestamp: {$lt:  new Date(2012, 8, 4, 13, 0, 0)}}
    ]
};
db.logs.mapReduce(mapf, reducef, {query: q, out: {merge: "hourly_logs"}})
This would only include log files between the hours of 12 and 13. Note that the month value in the Date() object starts at 0 (8=September). Because of the merge option, it is safe to run the m/r on already processed log files.
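To run this from an hourly cron job without hard-coding dates, the window can be computed at run time. A minimal sketch, assuming the job runs shortly after the top of each hour and should process the hour that just ended:

// Round the current time down to the hour, then step back one hour.
var end = new Date();
end.setMinutes(0); end.setSeconds(0); end.setMilliseconds(0);
var start = new Date(end.getTime() - 60 * 60 * 1000);
db.logs.mapReduce(mapf, reducef, {
    query: { timestamp: { $gte: start, $lt: end } },
    out: { merge: "hourly_logs" }
});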

Mongodb Map/Reduce - Multiple Group By

I am trying to run a map/reduce job in MongoDB that groups by 3 different fields contained in the objects in my collection. The map/reduce runs, but all the emitted fields end up bundled together in the output collection. I'm not sure whether this is normal, but it makes the exported data harder to clean up for analysis. Is there a way to separate them before using mongoexport?
Let me show you what I mean:
The fields I am trying to group by are the day, user ID (or uid) and destination.
I run these functions:
map = function() {
    day = (this.created_at.getFullYear() + "-" + (this.created_at.getMonth() + 1) + "-" + this.created_at.getDate());
    emit({day: day, uid: this.uid, destination: this.destination}, {count: 1});
}
/* Reduce Function */
reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v['count'];
    });
    return {count: count};
}
/* Output Function */
db.events.mapReduce(map, reduce, {query: {destination: {$ne:null}}, out: "TMP"});
The output looks like this:
{ "_id" : { "day" : "2012-4-9", "uid" : "1234456", "destination" : "Home" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "2345678", "destination" : "Home" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "3456789", "destination" : "Login" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "4567890", "destination" : "Contact" }, "value" : { "count" : 1 } }
{ "_id" : { "day" : "2012-4-9", "uid" : "5678901", "destination" : "Help" }, "value" : { "count" : 1 } }
When I attempt to use mongoexport, I cannot split day, uid, and destination into separate columns, because the map step combines those fields into a single compound _id.
What I would like to have would look like this:
{ { "day" : "2012-4-9" }, { "uid" : "1234456" }, { "destination" : "Home"}, { "count" : 1 } }
Is this even possible?
As an aside - I was able to make the output work by applying sed to the file and cleaning up the CSV. More work, but it worked. It would be ideal if I could get it out of mongodb in the correct format.
MapReduce only returns documents of the form {_id: some_id, value: some_value};
see: How to change the structure of MongoDB's map-reduce results?
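As a workaround (a sketch, not from the linked answer), you can post-process the output collection into flat documents so mongoexport can address each field as a top-level column; TMP follows the question's example and TMP_flat is a hypothetical name:

db.TMP.find().forEach(function(doc){
    db.TMP_flat.insert({
        day: doc._id.day,
        uid: doc._id.uid,
        destination: doc._id.destination,
        count: doc.value.count
    });
});

Alternatively, mongoexport accepts dotted field paths (e.g. -f "_id.day,_id.uid,_id.destination,value.count"), which may be enough without building a second collection.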

How can I find elements of a MongoDB collection that are taking up a large amount of space?

If I have a collection with thousands of elements, is there a way I can easily find which elements are taking up the most space (in terms of MB)?
There's no built-in query for this; you have to iterate the collection, gather the size of each document, and sort afterwards. Here's how that would work:
var cursor = db.coll.find();
var doc_size = {};
cursor.forEach(function (x) {
    var size = Object.bsonsize(x);
    doc_size[x._id] = size;
});
At this point you'll have a hashmap with document ids as keys and their sizes as values.
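To actually surface the largest documents from that map, you still need the client-side sort mentioned above; one way to finish the snippet (a small, assumed continuation):

// Turn the id->size map into an array and sort it descending by size.
var sorted = Object.keys(doc_size).map(function(id){
    return { _id: id, size: doc_size[id] };
}).sort(function(a, b){ return b.size - a.size; });
sorted.slice(0, 5).forEach(printjson);  // five largest documents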
Note that with this approach you will be fetching the entire collection over the wire. An alternative is to use MapReduce and do this server-side (inside mongo):
> function mapper() { emit(this._id, Object.bsonsize(this)); }
> // Each emitted key (_id) is unique, so reduce is never actually invoked;
> // return the single value unchanged so the shape stays consistent if it ever is.
> function reducer(key, values) { return values[0]; }
>
> var results = db.coll.mapReduce(mapper, reducer, { out: { inline: 1 } }).results
> results.sort(function (r1, r2) { return r2.value - r1.value; })
inline: 1 tells mongo not to create a temporary collection for the results; everything is kept in RAM.
And a sample output from one of my collections:
[
{
"_id" : ObjectId("4ce9339942a812be22560634"),
"value" : 1156115
},
{
"_id" : ObjectId("4ce9340442a812be24560634"),
"value" : 913413
},
{
"_id" : ObjectId("4ce9340642a812be26560634"),
"value" : 866833
},
{
"_id" : ObjectId("4ce9340842a812be28560634"),
"value" : 483614
},
...
{
"_id" : ObjectId("4ce9340742a812be27560634"),
"value" : 61268
}
]
Figured this out! I did this in two steps using Object.bsonsize():
db.myCollection.find().forEach(function(myObject) {
    db.objectSizes.save({object_id: myObject._id, size: Object.bsonsize(myObject)});
});
db.objectSizes.find().sort({size: -1}).limit(5).pretty();
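On modern servers this no longer needs map/reduce at all: MongoDB 4.4 added the $bsonSize aggregation operator, so the same five-largest-documents query can run entirely server-side (a sketch, assuming MongoDB 4.4 or later):

db.myCollection.aggregate([
    { $project: { size: { $bsonSize: "$$ROOT" } } },  // BSON size of the whole document, in bytes
    { $sort: { size: -1 } },
    { $limit: 5 }
]);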

MongoDB: _id Cannot Be An Array

I have a large dataset (about 1.1M documents) that I need to run mapreduce on.
The field to group on is an array named xref. Due to the size of the collection and the fact that I'm doing this in a 32-bit environment, I'm trying to reduce the collection into another collection in a new database.
First, here's a data sample:
{ "_id" : ObjectId("4ec6d3aa61910ad451f12e01"),
"bii" : -32.9867,
"class" : 2456,
"decdeg" : -82.4856,
"lii" : 297.4896,
"name" : "HD 22237",
"radeg" : 50.3284,
"vmag" : 8,
"xref" : ["HD 22237", "CPD -82 65", "-82 64","PPM 376283", "SAO 258336",
"CP-82 65","GC 4125" ] }
{ "_id" : ObjectId("4ec6d44661910ad451f78eba"),
"bii" : -32.9901,
"class" : 2450,
"decdeg" : -82.4781,
"decpm" : 0.013,
"lii" : 297.4807,
"name" : "PPM 376283",
"radeg" : 50.3543,
"rapm" : 0.0357,
"vmag" : 8.4,
"xref" : ["HD 22237", "CPD -82 65", "-82 64","PPM 376283", "SAO 258336",
"CP-82 65","GC 4125" ] }
{ "_id" : ObjectId("4ec6d48a61910ad451feae04"),
"bii" : -32.9903,
"class" : 2450,
"decdeg" : -82.4779,
"decpm" : 0.027,
"hd_component" : 0,
"lii" : 297.4806,
"name" : "SAO 258336",
"radeg" : 50.3543,
"rapm" : 0.0355,
"vmag" : 8,
"xref" : ["HD 22237", "CPD -82 65", "-82 64","PPM 376283", "SAO 258336",
"CP-82 65","GC 4125" ] }
Here are the map and reduce functions (right now I'm only using the lii and bii fields):
function map() {
    try {
        emit(this.xref, {lii: this.lii, bii: this.bii});
    } catch (e) {
    }
}
function reduce(key, values) {
    var result = {xref: key, lii: 0.0, bii: 0.0};
    try {
        values.forEach(function (value) {
            if (value.lii && value.bii) {
                result.lii += value.lii;
                result.bii += value.bii;
            }
        });
        result.bii /= values.length;
        result.lii /= values.length;
    } catch (e) {
    }
    return result;
}
Unfortunately, running this eventually comes up with an error message:
db.catalog.mapReduce(map, reduce, {out:{replace:"catalog2", db:"astro2"}});
Wed Nov 23 10:12:25 uncaught exception: map reduce failed: {
    "assertion" : "_id cannot be an array",
    "assertionCode" : 10099,
    "errmsg" : "db assertion failure",
    "ok" : 0
}
The xref field IS an array, but all the values in that array are identical across these documents. Is it trying to use that array as the _id field in the new collection?
Yes, it is not possible to use an array as _id, because arrays have special indexing behavior (each element is indexed individually).
The key you emit by is used as the _id in the output collection.
This could potentially work with "inline" output mode if the result is small enough, since that mode does not write to a collection.
But ideally you would translate the array into a string (for example by concatenating the values) and use that as _id, or make it a sub-object instead of an array.
Also note that the result of your reduce function should not include the key;
just return {lii: .., bii: ..}, as sketched below.
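Putting both suggestions together, a corrected job might look like the sketch below. The join("|") key format and the count/finalize averaging are assumptions rather than part of the answer; carrying a running count instead of dividing by values.length keeps the average correct when reduce re-runs on partial results:

function map() {
    if (this.xref && this.lii != null && this.bii != null) {
        // Use a string key instead of the array to satisfy the _id restriction.
        emit(this.xref.join("|"), {lii: this.lii, bii: this.bii, count: 1});
    }
}
function reduce(key, values) {
    var result = {lii: 0.0, bii: 0.0, count: 0};
    values.forEach(function (v) {
        result.lii += v.lii;
        result.bii += v.bii;
        result.count += v.count;
    });
    return result;  // same shape as the emitted value, and no key inside it
}
function finalize(key, value) {
    // Average once, after all partial sums have been combined.
    value.lii /= value.count;
    value.bii /= value.count;
    return value;
}
db.catalog.mapReduce(map, reduce, {
    finalize: finalize,
    out: {replace: "catalog2", db: "astro2"}
});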