MongoDB mapReduce doesn't return the right results in a sharded cluster

Very interesting: mapReduce works fine on a single instance, but not on a sharded collection. As you can see below, I have a collection and wrote a simple map-reduce function.
mongos> db.tweets.findOne()
{
"_id" : ObjectId("5359771dbfe1a02a8cf1c906"),
"geometry" : {
"type" : "Point",
"coordinates" : [
131.71778292855996,
0.21856835860911106
]
},
"type" : "Feature",
"properties" : {
"isflu" : 1,
"cell_id" : 60079,
"user_id" : 35,
"time" : ISODate("2014-04-24T15:42:05.048Z")
}
}
mongos> db.tweets.find({"properties.user_id":35}).count()
44247
mongos> map_flow
function () { var key=this.properties.user_id; var value={ "cell_id":1}; emit(key,value); }
mongos> reduce2
function (key,values){ var ros={flows:[]}; values.forEach(function(v){ros.flows.push(v.cell_id);});return ros;}
mongos> db.tweets.mapReduce(map_flow,reduce2, { out:"flows2", sort:{"properties.user_id":1,"properties.time":1} })
But the results are not what I want:
mongos> db.flows2.find({"_id":35})
{ "_id" : 35, "value" : { "flows" : [ null, null, null ] } }
I got lots of nulls, and interestingly they always come in threes.
Does MongoDB mapReduce not work correctly on a sharded collection?

The number one rule of MapReduce is:
thou shall emit the value of the same type as reduce function returneth
You broke this rule, so your MapReduce only works for small collections where reduce is called at most once for each key (that's the second rule of MapReduce: the reduce function may be called zero, one, or many times).
Your map function emits exactly this value {cell_id:1} for each document.
How does your reduce function use this value? Well, you return a value which is a document with an array, into which you push the cell_id value. This is strange already, because that value was 1, so I'm not sure why you wouldn't just emit 1 (if you wanted to count).
But look what happens when multiple shards push a bunch of 1's into this flows array (whether or not it's what you intended, that's what your code is doing) and reduce is now called on several already reduced values:
reduce(key, [ {flows:[1,1,1,1]},{flows:[1,1,1,1,1,1,1,1,1]}, etc ] )
Your reduce function now takes each member of the values array (each of which is a document with a single field, flows) and pushes v.cell_id onto your flows array. There is no cell_id field there, so of course you end up with null. And three nulls could be because you have three shards?
I would recommend that you articulate to yourself what exactly you are trying to aggregate in this code, and then rewrite your map and your reduce to comply with the rules that mapReduce in MongoDB expects your code to follow.
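One way to rewrite this so that the emitted value has the same shape as the reduced value might look like the sketch below. It assumes you want the list of cell_id values per user; note that mapReduce does not guarantee the order in which values arrive at reduce, even with the sort option, so the flows array may not end up in time order on a sharded cluster.
var map_flow = function () {
    // emit a value with the same shape that the reduce function returns
    emit(this.properties.user_id, { flows: [ this.properties.cell_id ] });
};
var reduce2 = function (key, values) {
    var ros = { flows: [] };
    values.forEach(function (v) {
        // each v already has the { flows: [...] } shape, so concatenate
        ros.flows = ros.flows.concat(v.flows);
    });
    return ros;
};
db.tweets.mapReduce(map_flow, reduce2, { out: "flows2", sort: { "properties.user_id": 1, "properties.time": 1 } })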

Related

Matching elements in array documents sometimes gets very slow

I have a MongoDB collection with about 100,000 documents.
Each document has an array with roughly 100 elements. It is an array of strings like this:
features: [
"0_Toyota",
"29776_Grey",
"101037_Hybrid",
"240473_Iron Gray",
"46290_Aluminium,Magnesium",
"2787_14",
"9350_1920 x 1080",
"36303_Y",
"310870_N",
"57721_Y"
...
Queries like the one below are usually very fast, but they sometimes get very slow when I include a specific extra condition inside $and. I have no idea why this happens. When it gets slow, it takes more than 40 seconds. It always happens with the same extra condition, and it is quite possible that it happens with other conditions too.
db.products.find({
$and:[
{
"features" : {
"$eq" : "36303_N"
}
},
{
"features" : {
"$eq" : "91135_IPS"
}
},
{
"features" : {
"$eq" : "9350_1366 x 768"
}
},
{
"features" : {
"$eq" : "178874_Y"
}
},
{
"features" : {
"$eq" : "43547_Y"
}
}
...
I'm running the same MongoDB on my Unix laptop and on a Linux server instance.
I also tried indexing the field "features", with the same results.
Using $all in your Mongo query helps when you want to match documents whose array contains all of the given values.
First, create an index on features.
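For example, a single-field (multikey) index on the array can be created like this, using the products collection from the question:
db.products.createIndex({ features: 1 })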
Then a query like this may help:
db.products.find( { features: { $all: ["36303_N", "91135_IPS","others..."] } } )
By the way:
If your query is very slow, get the slow operation from your mongod log.
Show your MongoDB version.
Check whether anything is writing while you query (writes will block reads in some versions).
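If you need to capture the slow operation itself, the database profiler is one way to do it; a sketch, where the 100 ms threshold is just an example value:
// record operations slower than 100 ms in the system.profile collection
db.setProfilingLevel(1, 100)
// later, inspect the slowest recorded operations
db.system.profile.find().sort({ millis: -1 }).limit(5).pretty()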
I have realized that the order inside $all matters. I changed the order of the elements according to how many documents in the collection contain each one, ascending, making the query more selective.
Before, the query took ~40 seconds to execute; now, with the elements ordered, it takes ~22 seconds.
Still many seconds anyway.

Mongodb Sorting returns null instead of data

My Mongodb dataset is like this
{
"_id" : ObjectId("5a27cc4783800a0b284c7f62"),
"action" : "1",
"silent" : "0",
"createdate" : ISODate("2017-12-06T10:53:59.664Z"),
"__v" : 0
}
Now I have to find the documents whose action value is 1 and silent value is 0. One more thing: all the data should be returned in descending order.
My MongoDB query is:
db.collection.find({'action': 1, 'silent': 0}).sort({createdate: -1}).exec(function(err, post) {
console.log(post.length);
});
Earlier it worked fine for me, but now I have 121,000 entries in this collection and it returns null.
I know there is some confusion around .sort().
If I remove the sort from the query then everything is fine. Example:
db.collection.find({'action': 1, 'silent': 0}).exec(function(err, post) {
console.log(post.length); // now it returns data, but not in descending order
});
MongoDB limits the amount of data it will attempt to sort without an index.
This is because Mongo has to sort the data in memory or on disk, both of which can be expensive operations, particularly for queries run frequently.
In most cases, this can be alleviated by creating indexes on the fields you sort on.
You can create the index with:
db.myColl.createIndex( { createdate: 1 })
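Since the query in the question also filters on action and silent before sorting, a compound index covering both the filter and the sort may serve it even better; a sketch using the collection and field names from the question:
db.collection.createIndex({ action: 1, silent: 1, createdate: -1 })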
Thanks!

why is mongodb not indexing my collection

I have created a collection and added just a name field and tried to apply the following index.
db.names.createIndex({"name":1})
Even after applying the index I see the below result.
db.names.find()
{ "_id" : ObjectId("57d14139eceab001a19f7e82"), "name" : "kkkk" } {
"_id" : ObjectId("57d1413feceab001a19f7e83"), "name" : "aaaa" } {
"_id" : ObjectId("57d14144eceab001a19f7e84"), "name" : "zzzz" } {
"_id" : ObjectId("57d14148eceab001a19f7e85"), "name" : "dddd" } {
"_id" : ObjectId("57d1414ceceab001a19f7e86"), "name" : "rrrrr" }
What am I missing here?
Khans...
The way you built your index is correct; however, building an ascending index on name won't return the results in ascending order.
If you need the results ordered by name you have to use:
db.names.find().sort({ name: 1 })
What happens when you build an index is that, when you search for data, the Mongo process performs the search behind the scenes in an ordered fashion for faster outcomes.
Please note: if you just want to see the output in sorted order, you don't even need an index.
You won't be able to see if an index has been successfully created (unless there is a considerable speed improvement) by running a find() command.
Instead, use db.names.getIndexes() to see if the index has been created (it may take some time if you're running the index in the background for it to appear in the index list)
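To confirm that a query actually uses the index, rather than just checking that it exists, explain() shows the chosen plan; a quick sketch:
db.names.find({ name: "aaaa" }).explain("executionStats")
// an IXSCAN stage on { name: 1 } in the winning plan means the index was used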

Map Reduce does not emit large data sets

I am facing issues with mapReduce: whenever the expected result data set is large it returns nothing, while it works for small data sets, e.g. around 40 thousand documents. Below is the code and my understanding of the problem. I used this code:
search = "breaking bad f"
var emit = function(a,b){
print(a);
}
map = function() {
if(this.torrent_name.indexOf(search) > -1){
emit(this._id, this.torrent_name);
}
}
reduce = function(key,values){
return values;
}
res = db.torrents.mapReduce(map,reduce,{out: { inline: 1 },query:{$text:{$search:search}},scope:{search:search},sort:{'seeders':-1}})
printjson(res);
Now the result of this job is:
{
"results" : [ ],
"timeMillis" : 503,
"counts" : {
"input" : 39859,
"emit" : 0,
"reduce" : 0,
"output" : 0
},
"ok" : 1
}
which makes sense because the mapReduce input count is the same as the result of the query below:
db.torrents.find({$text:{$search:"breaking bad f"}}).count()
output => 39859
Now the main issue comes when I change the search string in the mapReduce job to "breaking bad s"; the result shown is:
{
"results" : [ ],
"timeMillis" : 329,
"counts" : {
"input" : 0,
"emit" : 0,
"reduce" : 0,
"output" : 0
},
"ok" : 1
}
which does not make any sense, because the mapReduce input count is not equal to the result of the query below:
db.torrents.find({$text:{$search:"breaking bad s"}}).count()
output => 71484
From the above results I have come to the conclusion that there is some memory issue, but I don't know where and why. Please help.
Your process here is flawed in a number of ways.
Text search does not work like that
You are asking a $text search query to match on partial words such as
"breaking bad s"
"breaking bad f"
In each case, the "s" and the "f" here are ignored as they are not a whole word. So the only terms looked for are "breaking" and "bad". And I do mean "terms" here as opposed to a "phrase" such as "breaking bad", because the syntax you are using does not do that, but only looks for the terms instead.
They "might" be the phrase, but generally they will not be if the data being searched contains "breaking" or "bad" in any other combination.
I don't know where you think those counts are coming from, but it certainly has nothing to do with the "non-word" that is appended there.
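If a phrase match is what was actually intended, $text supports it by wrapping the phrase in escaped quotes; a quick sketch against the same collection:
db.torrents.find({ "$text": { "$search": "\"breaking bad\"" } }).count()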
The mapper is also wrong
Following on from the above, since what you are actually matching here is "breaking" and "bad" as individual words, it makes sense to only check that those "words" are present in the string. They will be, of course, but the test is wrong and should be written like this:
map = function() {
if ( /breaking|bad/.test(this.torrent_name) ) {
emit(null,1);
}
};
Reducer is wrong as well
More to the point, besides the "emits" failing earlier, the reducer would never be called at all. With mapReduce the idea is that all "common" _id values are sent to the reducer for "reduction" to a single value and then returned.
The way you had this written, you are just trying to pass out values, which is actually an array. If the reducer had fired, this would produce a "big" error, in that you can only return a "single" value. That is in fact why it is called a "reduce" stage: you want to "reduce" the grouped data down to a single common point.
So again we rewrite this to something logical:
var reduce = function(key,values) {
return Array.sum(values);
};
Also noting here that what comes "out" of the emitted data from the mapper must be the same structure and type as what comes "out" of the reducer as well.
This is because in order to handle "large data", mapReduce does not process the same grouped key "all at once", but rather in small "chunks". So data that comes "out" of a reducer, can end up going back "in" as one of the values to further reduce.
So finally if you run this:
res = db.torrents.mapReduce(
map,
reduce,
{
"out": { "inline": 1 },
"query": { "$text":{ "$search" :search } }
}
);
You may actually just get a sane response that tells you it did something.
But as you develop this further, take note of what is said above and fully read the documentation, which also explains these points.

Array intersection in MongoDB

OK, there are a couple of things going on here. I have two collections: test and test1. The documents in both collections have an array field (tags and tags1, respectively) that contains some tags. I need to find the intersection of these tags and also fetch the whole document from collection test1 if even a single tag matches.
> db.test.find();
{
"_id" : ObjectId("5166c19b32d001b79b32c72a"),
"tags" : [
"a",
"b",
"c"
]
}
> db.test1.find();
{
"_id" : ObjectId("5166c1c532d001b79b32c72b"),
"tags1" : [
"a",
"b",
"x",
"y"
]
}
> db.test.find().forEach(function(doc){db.test1.find({tags1:{$in:doc.tags}})});
Surprisingly this doesn't return anything. However when I try it with a single document, it works:
> var doc = db.test.findOne();
> db.test1.find({tags1:{$in:doc.tags}});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1" : [ "a", "b", "x", "y" ] }
But this is part of what I need. I need intersection as well. So I tried this:
> db.test1.find({tags1:{$in:doc.tags}},{"tags1.$":1});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1" : [ "a" ] }
But it returned just "a", whereas both "a" and "b" were in tags1. Does the positional operator return just the first match? Also, using $in won't exactly give me an intersection. How can I get an intersection (it should return "a" and "b") irrespective of which array is compared against the other?
Now say there's an operator that can do this:
> db.test1.find({tags1:{$intersection:doc.tags}},{"tags1.$":1});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1" : [ "a", "b" ] }
My requirement is, I need the entire tags1 array PLUS this intersection, in the same query like this:
> db.test1.find({tags1:{$intersection:doc.tags}},{"tags1":1, "tags1.$":1});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1": [ "a", "b", "x", "y" ],
"tags1" : [ "a", "b" ] }
But this is invalid JSON. Is renaming a key possible, or is this possible only through the aggregation framework (and across different collections)? I tried the above query with $in, but it behaved as if it totally ignored the "tags1": 1 projection.
PS: I am going to have at least 10k docs in test1 and very few (<10) in test. And this query is in real-time, so I want to avoid mapreduce :)
Thanks for any help!
In newer versions you can use aggregation to accomplish this.
db.test1.aggregate(
{
$match: {
tags1: {
$in: doc.tags
}
}
},
{
$project: {
tags1: 1,
intersection: {
$setIntersection: [doc.tags, "$tags1"]
}
}
}
);
As you can see, the match portion is exactly the same as your initial find() query. The project portion generates the result fields. In this case, it selects tags1 from the matching documents and also creates intersection from the input and the matching docs.
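With the sample documents from the question, where doc.tags is [ "a", "b", "c" ], the output of that pipeline should look roughly like this:
{
    "_id" : ObjectId("5166c1c532d001b79b32c72b"),
    "tags1" : [ "a", "b", "x", "y" ],
    "intersection" : [ "a", "b" ]
}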
Mongo doesn't have any inherent ability to retrieve array intersections. If you really need ad-hoc querying, get the intersection on the client side.
On the other hand, consider using Map-Reduce and storing its output as a collection. You can augment the returned objects in the finalize section to add the intersecting tags. Cron the MR job to run every few seconds. You get the benefit of a permanent collection you can query from the client side.
If you want to have this in real time, you should consider moving away from server-side JavaScript, which runs on only one thread and is therefore quite slow (this is no longer true as of v2.4: http://docs.mongodb.org/manual/core/server-side-javascript/).
The positional operator only returns the first matching value. Without knowing the internal implementation, from a performance point of view it doesn't even make sense to look for further matches once the document has already been evaluated as a match. So I doubt that you can go for this.
I don't know if you need the Cartesian product for your search, but I would consider joining the tags from your few test documents into one array and then running an $in search with it against test1, returning all matching documents. On your local machine you could have multiple threads that generate the intersection for each document.
Depending on how frequently your test and test1 collections change and how often you perform this query, you might precalculate this information, which would allow you to easily query the field that contains the intersection information; a rough sketch follows.
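A rough sketch of that precalculation idea, run periodically from the shell (the sharedTags field name is just an example):
var tags = db.test.findOne().tags;
db.test1.find({ tags1: { $in: tags } }).forEach(function (doc) {
    // store the intersection on the document so later queries can hit it directly
    var shared = doc.tags1.filter(function (t) { return tags.indexOf(t) !== -1; });
    db.test1.update({ _id: doc._id }, { $set: { sharedTags: shared } });
});
// afterwards the precalculated field can be queried directly
db.test1.find({ sharedTags: "a" })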
The document is invalid because you have two fields named tags1.