Incremental MapReduce with time interval in MongoDB

I get records from a server at a 10-minute interval (so in 1 hour I get 6 files).
I want to run a map/reduce every hour, and in the next hour I have to map/reduce the next group of 6 files together with the previous hour's result.
How can I solve this problem? I have been stuck on it for the last month.
Thank you,
Sushil Kr Singh

In order to summarize your 10-minute log files by the hour, you could round the timestamp of each log file down to the nearest hour in the map function and sum the values per hour in the reduce function.
Here is a little dummy example that illustrates this from the mongo shell:
Create 100 log files, each 10 minutes apart and containing a random number between 0 and 10, and insert them into the logs collection of the database:
for (var i = 0; i < 100; i++) {
    d = new ISODate();
    d.setMinutes(d.getMinutes() + i * 10);
    r = Math.floor(Math.random() * 11);
    db.logs.insert({ timestamp: d, number: r });
}
To check what the logs collection looks like, send a query like db.logs.find().limit(3).pretty(), which results in:
{
    "_id" : ObjectId("50455a3570537f9433c1efb2"),
    "timestamp" : ISODate("2012-09-04T01:32:37.370Z"),
    "number" : 2
}
{
    "_id" : ObjectId("50455a3570537f9433c1efb3"),
    "timestamp" : ISODate("2012-09-04T01:42:37.370Z"),
    "number" : 3
}
{
    "_id" : ObjectId("50455a3570537f9433c1efb4"),
    "timestamp" : ISODate("2012-09-04T01:52:37.370Z"),
    "number" : 8
}
Define a map function (in this example called mapf) that rounds each timestamp down to the hour, which is used as the emit key. The emit value is the number for that log file.
mapf = function () {
    // round the timestamp down to the start of the hour
    var d = this.timestamp;
    d.setMinutes(0);
    d.setSeconds(0);
    d.setMilliseconds(0);
    emit(d, this.number);
}
Define a reduce function that sums over all the emitted values (i.e. the numbers).
reducef = function (key, values) {
    var sum = 0;
    values.forEach(function (v) {
        sum += v;
    });
    return sum;
}
Now execute map/reduce on the logs collection. The out parameter here specifies that we want to write the results to the hourly_logs collection and merge existing documents with new results. This ensures that log files submitted later (e.g. after a server failure or other delay) will be included in the results once they appear in the logs collection.
db.logs.mapReduce(mapf, reducef, {out: { merge : "hourly_logs" }})
Lastly, to see the results, you can query a simple find on hourly_logs:
db.hourly_logs.find()
{ "_id" : ISODate("2012-09-04T02:00:00Z"), "value" : 33 }
{ "_id" : ISODate("2012-09-04T03:00:00Z"), "value" : 31 }
{ "_id" : ISODate("2012-09-04T04:00:00Z"), "value" : 21 }
{ "_id" : ISODate("2012-09-04T05:00:00Z"), "value" : 40 }
{ "_id" : ISODate("2012-09-04T06:00:00Z"), "value" : 26 }
{ "_id" : ISODate("2012-09-04T07:00:00Z"), "value" : 26 }
{ "_id" : ISODate("2012-09-04T08:00:00Z"), "value" : 25 }
{ "_id" : ISODate("2012-09-04T09:00:00Z"), "value" : 46 }
{ "_id" : ISODate("2012-09-04T10:00:00Z"), "value" : 27 }
{ "_id" : ISODate("2012-09-04T11:00:00Z"), "value" : 42 }
{ "_id" : ISODate("2012-09-04T12:00:00Z"), "value" : 43 }
{ "_id" : ISODate("2012-09-04T13:00:00Z"), "value" : 35 }
{ "_id" : ISODate("2012-09-04T14:00:00Z"), "value" : 22 }
{ "_id" : ISODate("2012-09-04T15:00:00Z"), "value" : 34 }
{ "_id" : ISODate("2012-09-04T16:00:00Z"), "value" : 18 }
{ "_id" : ISODate("2012-09-04T01:00:00Z"), "value" : 13 }
{ "_id" : ISODate("2012-09-04T17:00:00Z"), "value" : 25 }
{ "_id" : ISODate("2012-09-04T18:00:00Z"), "value" : 7 }
The result is an hourly summary of your 10-minute logs, with the _id field containing the start of the hour and the value field the sum of the random numbers. In your case you may need different aggregation operators; modify the reduce function according to your needs.
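For instance, if you wanted an hourly average instead of a sum, a minimal sketch could look like this (the mapf_avg, reducef_avg and finalizef names are just for illustration; since reduce must return the same shape it receives, the division is deferred to a finalize step):
// Hypothetical sketch: hourly average instead of a sum.
// Emit {sum, count} pairs so partial results can be re-reduced safely.
mapf_avg = function () {
    var d = this.timestamp;
    d.setMinutes(0);
    d.setSeconds(0);
    d.setMilliseconds(0);
    emit(d, { sum: this.number, count: 1 });
}
reducef_avg = function (key, values) {
    var r = { sum: 0, count: 0 };
    values.forEach(function (v) { r.sum += v.sum; r.count += v.count; });
    return r;
}
// finalize runs once per key after all reduces and computes the average
finalizef = function (key, v) { return v.sum / v.count; }
db.logs.mapReduce(mapf_avg, reducef_avg,
                  { out: { merge: "hourly_avg" }, finalize: finalizef });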
As Sammaye mentioned in the comment, you could automate the map/reduce call with a cron job entry to run every hour.
If you don't want to process the full logs collection every time, you can run incremental updates by limiting the documents to hourly time windows like so:
var q = { $and: [ { timestamp: { $gte: new Date(2012, 8, 4, 12, 0, 0) } },
                  { timestamp: { $lt:  new Date(2012, 8, 4, 13, 0, 0) } } ] }
db.logs.mapReduce(mapf, reducef, {query: q, out: { merge : "hourly_logs" }})
This would only include log files between the hours of 12 and 13. Note that the month value in the Date() object starts at 0 (8 = September). Because of the merge option, it is safe to run the map/reduce on already processed log files.
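To combine this with the hourly cron job idea, a small sketch (assuming the job runs shortly after each hour boundary) could compute the previous full hour on the fly instead of hard-coding the dates:
// Hypothetical sketch for an hourly scheduled run:
// process only the last completed hour of logs.
var end = new Date();
end.setMinutes(0);
end.setSeconds(0);
end.setMilliseconds(0);
var start = new Date(end.getTime() - 60 * 60 * 1000);
var q = { timestamp: { $gte: start, $lt: end } };
db.logs.mapReduce(mapf, reducef, { query: q, out: { merge: "hourly_logs" } });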

Related

How do I maintain data types when copying a document?

I need to change my documents to use a generated ObjectId instead of the String _id I was using, but in the process the count field's data type changes from Int to Double.
For example, say we have a document:
{ _id: "Product Name", count: 415 }
Now I want to create a document:
{ _id: "some object id", name: "Product Name", count: 415 }
I am using code similar to the below, but it turns count into a Double.
var cursor = db.products.find();
cursor.forEach(function (item) {
    var old_id = item._id;
    item.name = old_id;
    delete item._id;
    db.products.insert(item);
    db.products.remove({ _id: old_id });
});
I can add item.count = NumberInt(item.count) in the loop to make sure it stays an Int, but I really don't want to do this for every field that I have.
Is there any way to do this without manually casting them? I don't understand why it takes an Int and turns it into a Double. I know Double is the default, but the fields I am working with are already Integers.
Well, if I understand you correctly, your documents look like this:
{ "_id" : "Apple", "count" : 187 }
{ "_id" : "Google", "count" : 123 }
{ "_id" : "Amazon", "count" : 325 }
{ "_id" : "Oracle", "count" : 566 }
You can use the Bulk API to update your collection.
var bulk = db.collection.initializeUnorderedBulkOp();
var count = 0;
db.collection.aggregate([{ $project: { '_id': 0, 'name': '$_id', 'count': 1 } }]).forEach(function (doc) {
    bulk.find({ '_id': doc.name }).remove();
    bulk.insert(doc);
    count++;
    if (count % 1000 == 0) {
        // Execute per 1000 operations and re-initialize.
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});
// Flush the remaining queued operations.
if (count % 1000 != 0) {
    bulk.execute();
}
Then:
db.collection.find()
Yields the following documents:
{ "_id" : ObjectId("55a7e2c7eb68594275546c7c"), "count" : 187, "name" : "Apple" }
{ "_id" : ObjectId("55a7e2c7eb68594275546c7d"), "count" : 123, "name" : "Google" }
{ "_id" : ObjectId("55a7e2c7eb68594275546c7e"), "count" : 325, "name" : "Amazon" }
{ "_id" : ObjectId("55a7e2c7eb68594275546c7f"), "count" : 566, "name" : "Oracle" }
As for whether there is a way to do this without manually casting: you really don't need to worry about that if you are using the shell, but as pointed out in the comments you can always use a language with native support for integers to preserve the data type.
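If you do want to stay in the shell, a hedged sketch of the generic casting workaround is to re-wrap every whole-number field as NumberInt rather than naming each field individually (this assumes every whole number in your documents really should be a 32-bit int):
// Hypothetical sketch: rewrap whole numbers as NumberInt generically.
db.products.find().forEach(function (item) {
    var doc = { _id: new ObjectId(), name: item._id };
    Object.keys(item).forEach(function (k) {
        if (k === "_id") return;
        var v = item[k];
        // Shell numbers are doubles; re-wrap whole values as 32-bit ints.
        doc[k] = (typeof v === "number" && v % 1 === 0) ? NumberInt(v) : v;
    });
    db.products.insert(doc);
    db.products.remove({ _id: item._id });
});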

MongoDB find method confusion with implicit AND

I am new to MongoDB and recently started learning the basic syntax. I was trying out operators with the find method and ran into a confusing case with implicit AND.
My collection mathtable has 400 documents and looks as follows:
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b2") , "index" : 1 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b3") , "index" : 2 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b4") , "index" : 3 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b5") , "index" : 4 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b6") , "index" : 5 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b7") , "index" : 6 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b8") , "index" : 7 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4b9") , "index" : 8 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4ba") , "index" : 9 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4bb") , "index" : 10 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4bc") , "index" : 11 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4bd") , "index" : 12 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4be") , "index" : 13 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4bf") , "index" : 14 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4c0") , "index" : 15 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4c1") , "index" : 16 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4c2") , "index" : 17 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4c3") , "index" : 18 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4c4") , "index" : 19 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4c5") , "index" : 20 }
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4d1") , "index" : 1 }
..
..
{ "_id" : ObjectId("540efc2bd8af78d9b0f5d4z5") , "index" : 20 }
There are 400 documents in the mathtable collection:
The value of index ranges from 1 to 20.
For each value of index there are 20 documents with different _id values.
I tried the operations below expecting the same results, since each should be an implicit AND case: counting even index values greater than 5.
Using classic explicit AND (returns 160 records):
db.mathtable.count({
    $and: [
        { index: { $mod: [2, 0] } },
        { index: { $gt: 5 } }
    ]
});
Using the field name only once (returns 160 records):
db.mathtable.count({
    index: { $mod: [2, 0], $gt: 5 }
});
Using the field name with every condition (returns 300 records):
db.mathtable.find({
    index: { $mod: [2, 0] },
    index: { $gt: 5 }
});
Using the field name with every condition, in the opposite order (returns 200 records):
db.mathtable.find({
    index: { $gt: 5 },
    index: { $mod: [2, 0] }
});
There is no mention of implicit OR in the MongoDB documentation (or at least I did not find a direct reference, as there is for implicit AND).
I was expecting the same count of records (160) in all cases, and I am unable to understand why the code above behaves differently.
Also, the order in which the conditions are specified changes the number of results. From what I observe, when the same field is specified multiple times, only the last condition in the find() is applied. That seems weird and incorrect.
NOTE: I am using MongoDB 2.6, and the code is executed in the mongo shell that ships with the distribution.
A JSON object, like an associative array or a map, cannot contain duplicate keys:
db.mathtable.find({
    index: { $mod: [2, 0] },
    index: { $gt: 5 }
});
The above will be considered equivalent to:
db.mathtable.find({
    index: { $gt: 5 }
});
The first condition is overwritten by the second. Similarly,
db.mathtable.find({
    index: { $gt: 5 },
    index: { $mod: [2, 0] }
});
is equivalent to:
db.mathtable.find({
    index: { $mod: [2, 0] }
});
However, in the first case,
db.mathtable.count({
    $and: [
        { index: { $mod: [2, 0] } },
        { index: { $gt: 5 } }
    ]
});
the $and operator takes an array of two documents as input and behaves as expected,
and in the second case, count takes a single document with no duplicate keys and also behaves as expected:
db.mathtable.count({
    index: { $mod: [2, 0], $gt: 5 }
});
Hence the difference in the number of documents returned. Hope this helps.
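By the way, you can watch the key-overwriting happen directly in the shell, since a query document is just a JavaScript object:
var q = { index: { $mod: [2, 0] }, index: { $gt: 5 } };
printjson(q);
// { "index" : { "$gt" : 5 } }  -- the $mod condition is silently dropped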

Return range of documents around ID in MongoDB

I have an ID of a document and need to return the document plus the 10 documents that come before and the 10 documents after it. 21 docs total.
I do not have a start or end value from any key. Only the limit in either direction.
What's the best way to do this? Thank you in advance.
Did you know that ObjectIds contain a timestamp, and that they therefore represent the natural insertion order? So if you are looking for documents before and after a known document _id, you can do this:
Our documents:
{ "_id" : ObjectId("5307f2d80f936e03d1a1d1c8"), "a" : 1 }
{ "_id" : ObjectId("5307f2db0f936e03d1a1d1c9"), "b" : 1 }
{ "_id" : ObjectId("5307f2de0f936e03d1a1d1ca"), "c" : 1 }
{ "_id" : ObjectId("5307f2e20f936e03d1a1d1cb"), "d" : 1 }
{ "_id" : ObjectId("5307f2e50f936e03d1a1d1cc"), "e" : 1 }
{ "_id" : ObjectId("5307f2e90f936e03d1a1d1cd"), "f" : 1 }
{ "_id" : ObjectId("5307f2ec0f936e03d1a1d1ce"), "g" : 1 }
{ "_id" : ObjectId("5307f2ee0f936e03d1a1d1cf"), "h" : 1 }
{ "_id" : ObjectId("5307f2f10f936e03d1a1d1d0"), "i" : 1 }
{ "_id" : ObjectId("5307f2f50f936e03d1a1d1d1"), "j" : 1 }
{ "_id" : ObjectId("5307f3020f936e03d1a1d1d2"), "j" : 1 }
So we know the _id of "f", get it and the next 2 documents:
> db.items.find({ _id: {$gte: ObjectId("5307f2e90f936e03d1a1d1cd") } }).limit(3)
{ "_id" : ObjectId("5307f2e90f936e03d1a1d1cd"), "f" : 1 }
{ "_id" : ObjectId("5307f2ec0f936e03d1a1d1ce"), "g" : 1 }
{ "_id" : ObjectId("5307f2ee0f936e03d1a1d1cf"), "h" : 1 }
And do the same in reverse:
> db.items.find({ _id: { $lte: ObjectId("5307f2e90f936e03d1a1d1cd") } }).sort({ _id: -1 }).limit(3)
{ "_id" : ObjectId("5307f2e90f936e03d1a1d1cd"), "f" : 1 }
{ "_id" : ObjectId("5307f2e50f936e03d1a1d1cc"), "e" : 1 }
{ "_id" : ObjectId("5307f2e20f936e03d1a1d1cb"), "d" : 1 }
And that's a much better approach than scanning a collection.
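Putting the two halves together for the original question (10 documents before, the document itself, and 10 after), a sketch could look like this:
// Sketch: stitch both directions into one 21-document window.
var targetId = ObjectId("5307f2e90f936e03d1a1d1cd");
var before = db.items.find({ _id: { $lt: targetId } })
                     .sort({ _id: -1 }).limit(10).toArray().reverse();
var rest = db.items.find({ _id: { $gte: targetId } })
                   .sort({ _id: 1 }).limit(11).toArray();
var windowDocs = before.concat(rest);   // up to 21 documents, in order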
Neil's answer is a good answer to the question as stated (assuming that you are using automatically generated ObjectIds), but keep in mind that there's some subtlety around the concept of the 10 documents before and after a given document.
The complete format of an ObjectId is described in the MongoDB documentation. It consists of the following fields:
a timestamp (1-second resolution),
a machine identifier,
a process id,
a counter.
Generally, if you don't specify your own _ids, they are automatically generated by the driver on the client machine. So as long as the ObjectIds are generated by a single process on a single client machine, their order does indeed reflect the order in which they were generated, which in a typical application will also be the insertion order (but need not be). However, if you have multiple processes or multiple client machines, the order of the ObjectIds for objects generated within a given second by those multiple sources has an unpredictable relationship to the insertion order.
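For reference, the embedded timestamp can be read back directly in the shell:
ObjectId("5307f2e90f936e03d1a1d1cd").getTimestamp()
// ISODate("2014-02-22T00:44:25Z") -- when this ObjectId was generated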

MongoDB: fetch hundreds of documents out of millions

In my database I have millions of documents, each with a timestamp; some have the same timestamp. I want to get some points (a few hundred, potentially thousands) to draw a graph. I don't want all the points: out of every n points I want to pick 1. I know there's the aggregation framework and I tried that, but since my data is huge, the result easily exceeds the 16MB maximum document size. There's also a skip function in MongoDB, but it only skips the first n documents. Are there good ways to achieve what I want, or is there a way to make the aggregation result bigger? Thanks in advance!
I'm not sure how you can do this with either the aggregation framework or map/reduce. Skipping so that you keep (e.g.) every 10th point is not something map/reduce allows you to do, unless you select each point based on a random value with a 10% chance... which is probably not what you want. But that does work:
db.output.drop();
db.so.find().count();
map = function () {
    // Math.random() returns 0-1, so < 0.1 means a 10% chance
    if (Math.random() < 0.1) {
        emit(this._id, this);
    }
}
reduce = function (key, values) {
    // never actually called here, since every emitted key (_id) is unique
    return values;
}
db.so.mapReduce(map, reduce, { out: 'output' });
db.output.find();
Which outputs something like:
{
    "result" : "output",
    "timeMillis" : 4,
    "counts" : {
        "input" : 23,
        "emit" : 3,
        "reduce" : 0,
        "output" : 3
    },
    "ok" : 1
}
> db.output.find();
{ "_id" : ObjectId("51ffc4bc16473d7b84172d85"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d85"), "date" : ISODate("2013-08-05T15:24:45Z") } }
{ "_id" : ObjectId("51ffc75316473d7b84172d8e"), "value" : { "_id" : ObjectId("51ffc75316473d7b84172d8e") } }
{ "_id" : ObjectId("51ffc75316473d7b84172d8f"), "value" : { "_id" : ObjectId("51ffc75316473d7b84172d8f") } }
or:
> db.so.mapReduce( map, reduce, { out: 'output' } );
{
    "result" : "output",
    "timeMillis" : 19,
    "counts" : {
        "input" : 23,
        "emit" : 2,
        "reduce" : 0,
        "output" : 2
    },
    "ok" : 1
}
> db.output.find();
{ "_id" : ObjectId("51ffc4bc16473d7b84172d83"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d83"), "date" : ISODate("2013-08-05T15:24:25Z") } }
{ "_id" : ObjectId("51ffc4bc16473d7b84172d86"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d86"), "date" : ISODate("2013-08-05T15:25:15Z") } }
The exact documents differ between runs because of the random factor.
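If you need a deterministic every-nth sample instead of a random one, a plain cursor walk in the shell (not map/reduce) is one alternative; this sketch assumes the collection fits a single ordered scan and that the documents have the date field shown above:
// Sketch: keep every 10th document by walking a sorted cursor.
var n = 10, i = 0;
db.so.find().sort({ date: 1 }).forEach(function (doc) {
    if (i++ % n === 0) {
        db.output.insert(doc);
    }
});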

Trouble with mongo map reduce and aggregating key names

I have a collection in my database representing IP addresses pulled from various sources. A sample of which looks like this:
{ "_id" : ObjectId("4e71060444dce16174378b79"), "ip" : "xxx.xxx.xxx.xxx", "sources" : { "Source1" : NumberLong(52), "Source2" : NumberLong(7) } }
Each object will have one or more sources.
My goal is to show the number of entries reported by each source without necessarily knowing the names of every possible source (because new ones can potentially be added at any time). I have attempted to address this with map/reduce by simply emitting a 1 for each key in the sources hash of each object, but something seems to be wrong with my approach. If I do the following:
var map_s = function () {
    for (var source in this.sources) {
        emit(source, 1);
    }
}
var red_s = function (key, values) {
    var total = 0;
    values.forEach(function () {
        total++;
    });
    return total;
}
var op = db.addresses.mapReduce(map_s, red_s, { out: 'results' });
db.results.find().forEach(printjson);
I get
{ "_id" : "Source1", "value" : 12 }
{ "_id" : "Source2", "value" : 230 }
{ "_id" : "Source3", "value" : 358 }
{ "_id" : "Source4", "value" : 398 }
{ "_id" : "Source5", "value" : 39 }
{ "_id" : "Source6", "value" : 420 }
{ "_id" : "Source7", "value" : 156 }
Which is far too small for the database size. For instance, I get the following in the shell if I count on a specific source:
> db.addresses.count({"sources.Source4": {$exists: true}});
1260538
Where is my error?
Yes, there is a problem in your reduce method: it must be able to consume its own output.
Remember that reduce() may be called many times on intermediate results. When that happens, the incoming values are previous partial counts rather than 1s, so incrementing by one per element undercounts.
Instead of
values.forEach(function () {
    total++;
});
you need:
values.forEach(function (x) {
    total += x;
});
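Putting it together, the corrected reduce function becomes:
var red_s = function (key, values) {
    var total = 0;
    // values may already contain partial sums from earlier reduce
    // passes, so add the values instead of counting the elements
    values.forEach(function (x) {
        total += x;
    });
    return total;
}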