MapReduce in MongoDB doesn't output

I was trying to use MongoDB 2.4.3 (also tried 2.4.4) with mapReduce on a cluster with 2 shards, each with 3 replicas. I have a problem with the results of the mapReduce job not being reduced into the output collection. I tried an incremental map-reduce, and I also tried "merging" instead of reducing, but neither worked.
The map-reduce command run on mongos (coll isn't sharded):
db.coll.mapReduce(map, reduce, {out: {reduce: "events", "sharded": true}})
Which yields the following output:
{
    "result" : "events",
    "counts" : {
        "input" : NumberLong(2),
        "emit" : NumberLong(2),
        "reduce" : NumberLong(0),
        "output" : NumberLong(28304112)
    },
    "timeMillis" : 418,
    "timing" : {
        "shardProcessing" : 11,
        "postProcessing" : 407
    },
    "shardCounts" : {
        "stats2/192.168.…:27017,192.168.…" : {
            "input" : 2,
            "emit" : 2,
            "reduce" : 0,
            "output" : 2
        }
    },
    "postProcessCounts" : {
        "stats1/192.168.…:27017,…" : {
            "input" : NumberLong(0),
            "reduce" : NumberLong(0),
            "output" : NumberLong(14151042)
        },
        "stats2/192.168.…:27017,…" : {
            "input" : NumberLong(0),
            "reduce" : NumberLong(0),
            "output" : NumberLong(14153070)
        }
    },
    "ok" : 1
}
So I can see that mapReduce ran over 2 records, which resulted in 2 records being output. However, in the postProcessCounts for both shards the input count stays 0. Also, trying to find a record by searching on _id yields no result. In MongoDB's log file I wasn't able to find any error messages related to this.
After trying to reproduce this with a newly created output collection, which I also sharded on hashed _id and gave the same indexes, I wasn't able to reproduce it. When outputting the same input to a different collection:
db.coll.mapReduce(map, reduce, {out: {reduce: "events_test2", "sharded": true}})
The result is stored in the output collection and I got the following output:
{
    "result" : "events_test2",
    "counts" : {
        "input" : NumberLong(2),
        "emit" : NumberLong(2),
        "reduce" : NumberLong(0),
        "output" : NumberLong(4)
    },
    "timeMillis" : 321,
    "timing" : {
        "shardProcessing" : 68,
        "postProcessing" : 253
    },
    "shardCounts" : {
        "stats2/192.168.…:27017,…" : {
            "input" : 2,
            "emit" : 2,
            "reduce" : 0,
            "output" : 2
        }
    },
    "postProcessCounts" : {
        "stats1/192.168.…:27017,…" : {
            "input" : NumberLong(2),
            "reduce" : NumberLong(0),
            "output" : NumberLong(2)
        },
        "stats2/192.168.…:27017,…" : {
            "input" : NumberLong(2),
            "reduce" : NumberLong(0),
            "output" : NumberLong(2)
        }
    },
    "ok" : 1
}
When running the script again with the same input, outputting again to the second collection, it shows that it is reducing in postProcessCounts. So the map and reduce functions do their job fine. Why doesn't it work on the larger first collection? Am I doing something wrong here? Are there any special limitations on collections that can be used as output for map-reduce?

mapReduce ran over 2 records, which resulted in 2 records being output. However, in the postProcessCounts for both shards the input count stays 0.
Map ran over 2 records. If those two records have different keys, the map will emit 2 keys with a value for each, which is normal.
But something I noticed in an older version of MongoDB (not sure if this applies in your case) is that if the values array for the reduce phase has a length of 1, the reduce step is skipped: MongoDB does not call the reduce function for a key that has only a single value.
Is the output collection empty in the first case?
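A minimal sketch of that reduce-skipping behavior in the mongo shell (the collection and field names here are hypothetical, just to illustrate; this is not the asker's code):
// every document emits a unique key, so each values array has length 1 ...
map = function() {
    emit(this._id, 1);
};
// ... and this reduce function is therefore never called:
// "reduce" stays 0 in the resulting counts
reduce = function(key, values) {
    return Array.sum(values);
};
db.events_src.mapReduce(map, reduce, { out: "events_out" });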

Related

MongoDB optimization

I need to optimize my MongoDB performance but can't figure out how. Maybe there are some tips, or maybe I should use another storage engine. Any ideas are welcome.
I have the following log output describing the query:
2015-08-04T15:09:56.226+0300 [conn129682] command mongodb_db1.$cmd command: aggregate { aggregate: "collection", pipeline: [ { $match: { _id.index_id_1: 4931359 } } ] } keyUpdates:0 numYields:39 locks(micros) r:83489 reslen:177280 286ms
I have a collection named collection which contains the following data structure:
{
    "_id" : {
        "x" : "x",
        "index_id_1" : NumberLong(5617088)
    },
    "value" : {
        "value_1" : 1.0000000000000000,
        "value_2" : 0.0000000000000000,
        "value_3" : 1.0000000000000000
    }
}
Querying the collection stats gives the following details:
{
    "ns" : "mongodb_db1.collection",
    "count" : 2.07e+007,
    "size" : 4968000000.0000000000000000,
    "avgObjSize" : 240,
    "storageSize" : 5524459408.0000000000000000,
    "numExtents" : 25,
    "nindexes" : 3,
    "lastExtentSize" : 5.36601e+008,
    "paddingFactor" : 1.0000000000000000,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 4475975728.0000000000000000,
    "indexSizes" : {
        "_id_" : 2884043120.0000000000000000,
        "_id.x.index_id_1" : 1.07118e+009,
        "_id.index_id_1" : 5.20754e+008
    },
    "ok" : 1.0000000000000000
}
Running on a single node (no shards).
MongoDB version: 2.4.
Installed RAM (MB): 24017 (index size ~120 GB)
10gen / MongoDB are running a series of free online courses that cover all you need to know (the latest iteration starts today). Simply head over and sign up for the DBA course, and if you're feeling brave, a couple of the others, though there is a lot of common, duplicated material between all the variants at the beginning.
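Beyond the courses, a first diagnostic step for a query like this (a sketch, not part of the original answer; collection and field names are taken from the question above) is to check whether the equivalent find can use the _id.index_id_1 index:
// in the 2.4 explain format, a "BasicCursor" here would mean the
// index is not used and the whole collection is scanned
db.collection.find({ "_id.index_id_1": 4931359 }).explain();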

MongoDB Text search fails for stop words

I'm trying to do a query in my collection, but it's not returning anything.
Here's my query:
{'$match': {'$text': {'$search': 'a'}}},
{'$group': {'_id': {'texto': '$texto'},
            'somanumero': {'$sum': '$numero'}}}
My collection:
{ "_id" : ObjectId("555cdc4fe13823315537042d"), "texto" : ObjectId("555cdc4fe13823315537042c"), "numero" : ObjectId("555cdc4fe13823315537042e") }
{ "_id" : ObjectId("555cdc5ee13823315537042f"), "numero" : 5, "texto" : "a", "lattexto" : "-15.79506", "lontexto" : "-47.88322" }
{ "_id" : ObjectId("555cdc6ae138233155370430"), "numero" : 10, "texto" : "a", "lattexto" : "-15.79506", "lontexto" : "-47.88322" }
{ "_id" : ObjectId("555cdc73e138233155370431"), "numero" : 3, "texto" : "b", "lattexto" : "-15.79506", "lontexto" : "-47.88322" }
And here's my text index:
{
    "v" : 1,
    "key" : {
        "_fts" : "text",
        "_ftsx" : 1
    },
    "name" : "texto_text",
    "ns" : "OSA.teste_texto",
    "default_language" : "portuguese",
    "weights" : {
        "texto" : 1
    },
    "language_override" : "language",
    "textIndexVersion" : 2
}
When I use $group or $match alone, it works.
Am I doing something wrong?
From the docs:
MongoDB supports text search for various languages. text indexes drop language-specific stop words (e.g. in English, "the", "an", "a", "and", etc.) and uses simple language-specific suffix stemming.
The problem with your data is that some of the records contain the language-specific stop word a, which is considered a stop word in Portuguese too. Some of the Portuguese stop words are listed below, and a is at the top of the list:
a
ao
aos
aquela
aquelas
aquele
aqueles
aquilo
as
até
com
como
These words are never indexed, and hence whenever you query for stop words, you get no results.
At the same time, if you query for b, you would get results, since it is not a stop word and would be indexed.
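A possible workaround (a sketch, not from the original answer) is to rebuild the text index with no language, which disables stop-word filtering and stemming so that a becomes searchable:
db.teste_texto.dropIndex("texto_text");
// default_language "none" turns off stop words and stemming entirely
db.teste_texto.createIndex({ texto: "text" }, { default_language: "none" });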

MongoDB: fetch hundreds of documents out of millions

In my database, I have millions of documents. Each of them has a timestamp, and some share the same timestamp. I want to get some points (a few hundred, or potentially more, like thousands) to draw a graph. I don't want all the points: for every n points, I want to pick 1. I know there's the aggregation framework and I tried that. The problem is that my data is huge, so when I do the aggregation work the result easily exceeds the 16 MB maximum document size. There's also a function called skip in MongoDB, but it only skips the first n documents. Are there good ways to achieve what I want? Or is there a way to make the aggregation result bigger? Thanks in advance!
I'm not sure how you can do this with either the aggregation framework or map-reduce: just skipping so that you keep (e.g.) every 10th point is not something map-reduce allows you to do, unless you select each point based on a random value with a 10% chance... which is probably not what you want. But that does work:
db.output.drop();
db.so.find().count();
map = function() {
    // Math.random() returns 0-1, so < 0.1 means a 10% chance
    if (Math.random() < 0.1) {
        // emitting on _id makes every key unique, so reduce is never called
        emit(this._id, this);
    }
}
reduce = function(key, values) {
    return values;
}
db.so.mapReduce( map, reduce, { out: 'output' } );
db.output.find();
Which outputs something like:
{
    "result" : "output",
    "timeMillis" : 4,
    "counts" : {
        "input" : 23,
        "emit" : 3,
        "reduce" : 0,
        "output" : 3
    },
    "ok" : 1
}
> db.output.find();
{ "_id" : ObjectId("51ffc4bc16473d7b84172d85"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d85"), "date" : ISODate("2013-08-05T15:24:45Z") } }
{ "_id" : ObjectId("51ffc75316473d7b84172d8e"), "value" : { "_id" : ObjectId("51ffc75316473d7b84172d8e") } }
{ "_id" : ObjectId("51ffc75316473d7b84172d8f"), "value" : { "_id" : ObjectId("51ffc75316473d7b84172d8f") } }
or:
> db.so.mapReduce( map, reduce, { out: 'output' } );
{
    "result" : "output",
    "timeMillis" : 19,
    "counts" : {
        "input" : 23,
        "emit" : 2,
        "reduce" : 0,
        "output" : 2
    },
    "ok" : 1
}
> db.output.find();
{ "_id" : ObjectId("51ffc4bc16473d7b84172d83"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d83"), "date" : ISODate("2013-08-05T15:24:25Z") } }
{ "_id" : ObjectId("51ffc4bc16473d7b84172d86"), "value" : { "_id" : ObjectId("51ffc4bc16473d7b84172d86"), "date" : ISODate("2013-08-05T15:25:15Z") } }
Depending on a random factor.
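As an aside, newer MongoDB versions (3.2 and later, so not the 2.4-era servers discussed elsewhere on this page) have an aggregation stage built for exactly this kind of random pick, which avoids the map-reduce detour:
// pseudo-randomly selects 500 documents from the collection;
// the $sort assumes a "date" field like the one in the output above
db.so.aggregate([
    { $sample: { size: 500 } },
    { $sort: { date: 1 } }
]);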

update with $inc doesn't increment the field

Could you guys tell me what I'm doing wrong ?
I have the following document:
{
    "_id" : ObjectId("5013df3e4a4314271d002476"),
    "bgtiled" : NumberLong(1),
    "post_count" : NumberLong(0),
    "url" : "fb_test",
    "url_lower" : "fb_test",
    "user_id" : NumberLong(4044217),
    "verified" : NumberLong(0)
}
I want to increment post_count. According to the documentation, I do this:
db.users.update({user_id : 4044217}, {$inc : {post_count : 1}});
and it doesn't increment the field. I also tried this:
db.users.update({user_id : 4044217}, {$inc : {"post_count" : 1}});
and this:
db.users.update({user_id : 4044217}, {"$inc" : {"post_count" : 1}});
with no result :(
If you were doing this in the shell, one way of finding out what is happening would be to check what the getLastError() call returned for this update.
Since the shell automatically calls getLastError() for you after every operation, you can do it one of two ways:
> db.users.update({user_id : 4044217}, {$inc : {post_count : 1}})
> db.getPrevError()
{ "err" : null, "updatedExisting" : true, "n" : 1, "nPrev" : 8, "ok" : 1 }
Or you can fetch the error before the shell calls GLE for you via
> db.users.update({user_id : 4044217}, {$inc : {post_count : 1}}); db.getLastErrorObj();
{
    "updatedExisting" : false,
    "n" : 0,
    "connectionId" : 2,
    "err" : null,
    "ok" : 1
}
Note: in the second example, the GLE call must be on the same line as the update.
In your case it looks like you would have gotten back information that one existing document was updated, and that might have prompted you to check the query portion of your update statement via
> db.users.find( {user_id : 4044217} )
which would have returned more than one document.
Turned out I had multiple documents matching the query and only one was updated.
Sorry.
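Given that resolution, if the goal was to increment every matching document, here is a sketch (not from the original answer): by default update() modifies only the first match, so pass the multi option to update all of them.
// updates all documents matching the query, not just the first one
db.users.update(
    { user_id : 4044217 },
    { $inc : { post_count : 1 } },
    { multi : true }
);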

Querying a Sub Object in MongoDB is not using the Index

I am recording site usage events in a sub-object of a visitor document. Here is a basic example of the data structure:
{ "_id" : ObjectId("4d4c695794b332a0740009bd"), "evs" : [
{
"ev" : "Visit Home Page",
"d" : 1,
"s" : 1
},
{
"ev" : "Buy Product",
"d" : "110.10",
"upc" : 1234,
"s" : 1
},
{
"ev" : "Sign up to newsletter",
"d" : "1",
"s" : 1
}
]}
I have an index on 'evs.s', but when I search on evs.s, the index is not used:
db.visitors.find({'evs.s':0}).explain()
{
    "cursor" : "BtreeCursor evs.s_1",
    "nscanned" : 33361,
    "nscannedObjects" : 33361,
    "n" : 33361,
    "millis" : 311,
    "nYields" : 105,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "evs.s" : [
            [
                0,
                0
            ]
        ]
    }
}
That query takes 311 milliseconds and scans through every object.
Here is the index: db.visitors.getIndexes()
{
    "ns" : "tracking.visitors",
    "unique" : false,
    "key" : {
        "evs.s" : 1
    },
    "name" : "evs.s_1",
    "v" : 0
}
Your query actually is using an index, as indicated by the cursor type in the explain output ("BtreeCursor evs.s_1"). If you were not using an index, it would be "BasicCursor".
From your input data, it looks like evs.s might not be a very efficient key to index on. If all of the values of evs.s are either 1 or 0, your index will always hit a large number of matches.
My guess is that your query did not do a full table scan, but that there are actually that many records with a value of evs.s = 0 in your index.
You might compare the output of
db.visitors.find({'evs.s': 0}).count();
db.visitors.find({'evs.s': 1}).count();
db.visitors.find().count();
to verify this.
There are several things you can do to speed this up:
1) You can use a different index that has more distinct values. This will reduce the search space on each query.
2) You can add a limit statement to your query. This will stop scanning the index once limit documents have been found. A sketch of both suggestions follows.
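A sketch of what those two suggestions could look like in the shell (the second field in the compound index, evs.ev, is an illustrative choice, not something from the original answer):
// a compound index whose second field has more distinct values
// narrows the candidate set for queries that include both fields
db.visitors.ensureIndex({ "evs.s" : 1, "evs.ev" : 1 });
// a limit stops the index scan once 100 matching documents are found
db.visitors.find({ "evs.s" : 0 }).limit(100);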
"cursor" : "BtreeCursor evs.s_1"
means that the index is used.