Map Reduce does not emit large data sets - mongodb

I am facing an issue with mapReduce: whenever the expected result set is large it returns nothing, yet it works for smaller sets of around 40 thousand documents. Following are the code and my understanding of the problem. I used this code:
search = "breaking bad f"
var emit = function(a,b){
print(a);
}
map = function() {
if(this.torrent_name.indexOf(search) > -1){
emit(this._id, this.torrent_name);
}
}
reduce = function(key,values){
return values;
}
res = db.torrents.mapReduce(map,reduce,{out: { inline: 1 },query:{$text:{$search:search}},scope:{search:search},sort:{'seeders':-1}})
printjson(res);
Now the result of this job is:
{
    "results" : [ ],
    "timeMillis" : 503,
    "counts" : {
        "input" : 39859,
        "emit" : 0,
        "reduce" : 0,
        "output" : 0
    },
    "ok" : 1
}
which makes sense, because the mapReduce input count is the same as the result of the query below:
db.torrents.find({$text:{$search:"breaking bad f"}}).count()
output => 39859
Now the main issue comes when I change the search string in the mapReduce job to "breaking bad s"; the result shown is:
{
    "results" : [ ],
    "timeMillis" : 329,
    "counts" : {
        "input" : 0,
        "emit" : 0,
        "reduce" : 0,
        "output" : 0
    },
    "ok" : 1
}
which does not make any sense, because the mapReduce input is not equal to the result of the query below:
db.torrents.find({$text:{$search:"breaking bad s"}}).count()
output => 71484
From the above results I have come to the conclusion that there is some memory issue somewhere, but I don't know where or why. Please help.

Your process here is flawed in a number of ways.
Text search does not work like that
You are asking a $text search query to match on partial words such as
"breaking bad s"
"breaking bad f"
In each case, the "s" and the "f" here are ignored as they are not a whole word. So the only terms looked for are "breaking" and "bad". And I do mean "terms" here a opposed to a "phrase" as in "breaking bad", as the syntax you are using does not do that, but only looks for the terms instead.
They "might" be the phrase, but generally they will not be if the data being searched contains "breaking" or "bad" in any other combination.
I don't know where you think those counts are coming from, but it certainly has nothing to do with the "non-word" that is appended there.
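If an exact phrase is what was intended, $text does support that, but the phrase has to be wrapped in escaped quotes inside the search string. A quick sketch against the question's collection:

// The inner quotes make "breaking bad" match as a phrase rather than as two independent terms.
db.torrents.find({ "$text": { "$search": "\"breaking bad\"" } }).count()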
The mapper is also wrong
Following on from the above, since what you are actually matching here is "breaking" and "bad" as individual words, it makes sense to only check that those "words" are present in the string. They will be, of course, but the test is wrong and should be written like this:
map = function() {
if ( /breaking|bad/.test(this.torrent_name) ) {
emit(null,1);
}
};
Reducer is wrong as well
More to the point, besides the "emits" failing earlier, the reducer would never be called at all: you emit with the unique _id as the key, and with mapReduce the idea is that all values sharing a "common" key are sent to the reducer for "reduction" to a single value; a key with only one value skips the reduce stage entirely.
The way you had this written, you are just trying to pass out values, which is actually an array. If the reducer had fired, this would produce a "big" error, in that you can only return a "single" value. That is in fact why it is called a "reduce" stage: you want to "reduce" the grouped data down to a single common point.
So again we rewrite this to something logical:
var reduce = function(key,values) {
return Array.sum(values);
};
Also noting here that what comes "out" of the emitted data from the mapper must be the same structure and type as what comes "out" of the reducer as well.
This is because in order to handle "large data", mapReduce does not process the same grouped key "all at once", but rather in small "chunks". So data that comes "out" of a reducer, can end up going back "in" as one of the values to further reduce.
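You can sanity-check that requirement directly in the shell. A quick sketch using the corrected reducer above:

// Output fed back in as input must give the same answer as one big reduce.
var partial = reduce("x", [1, 1, 1]);        // 3
var total   = reduce("x", [partial, 1, 1]);  // 5, same as reduce("x", [1, 1, 1, 1, 1])
// If those disagree, the reducer will silently corrupt results on large inputs.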
So finally if you run this:
res = db.torrents.mapReduce(
map,
reduce,
{
"out": { "inline": 1 },
"query": { "$text":{ "$search" :search } }
}
);
You may actually just get a sane response that tells you it did something.
But as you develop this further, take note of what is said above, and fully read the documentation, which also explains the points described above.
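For completeness, if all you ultimately want is a count of matches, the aggregation framework avoids the mapReduce type rules altogether. A rough sketch of the same count (the regex mirrors the corrected mapper; adjust to taste):

db.torrents.aggregate([
    // $text is only allowed in the first $match stage of a pipeline.
    { "$match": { "$text": { "$search": "breaking bad" }, "torrent_name": /breaking|bad/ } },
    { "$group": { "_id": null, "count": { "$sum": 1 } } }
])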

Related

search phrase or words in document with timestamped words

I've been trying to do this for some days; I guess it's time to ask for a little help.
I'm using Elasticsearch 6.6 (I believe it could be upgraded if needed) and NEST for C# (.NET 5).
The task is to create an index where the documents are the result of speech-to-text recognition, where all the recognized words have a timestamp (so that the timestamp can be used to find where the word is spoken in the original file). There are 1000+ texts from media files, and every file is 4 hours long (that usually means 5000~15000 words).
The main idea was to split every text into 3-second-long segments, creating a document with the words in each time segment, and index it so that it can be searched.
I thought that would not work that well, so the next idea was to create a document for every window of 10~12 words, scanning the text and jumping 2 words at a time, so that a search could at least match a decent phrase, and have highlighting of the hits too.
Since that is still far from perfect, I thought it would be nice to index every whole text as a document so as to maintain its coherency; the problem is the timestamp associated with every word. To keep this relationship I tried to use nested objects in the document:
PUT index-tapes-nested
{
  "mappings" : {
    "_doc" : {
      "properties" : {
        "$type" : { "type" : "text" },
        "ContentId" : { "type" : "long" },
        "Inserted" : { "type" : "date" },
        "TrackId" : { "type" : "long" },
        "Words" : {
          "type" : "nested",
          "properties" : {
            "StartMillisec" : { "type" : "integer" },
            "Word" : { "type" : "text" }
          }
        }
      }
    }
  }
}
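For reference, a document indexed under that mapping would look something like this (all values invented for illustration):

PUT index-tapes-nested/_doc/1
{
  "ContentId" : 1,
  "TrackId" : 1,
  "Inserted" : "2021-01-01T00:00:00Z",
  "Words" : [
    { "StartMillisec" : 0,    "Word" : "a" },
    { "StartMillisec" : 350,  "Word" : "bunch" },
    { "StartMillisec" : 700,  "Word" : "of" },
    { "StartMillisec" : 1050, "Word" : "things" }
  ]
}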
This kinda works, but I don't know exactly how to write the query to search in the index.
A very basic query could be for example:
GET index-tapes-nested/_search
{
  "query" : {
    "nested" : {
      "path" : "Words",
      "score_mode" : "avg",
      "query" : {
        "match" : {
          "Words.Word" : "a bunch of things"
        }
      },
      "inner_hits" : {}
    }
  }
}
but something like that, especially with the avg scoring, gives low-quality results; the right document may be among the hits, but since word order is ignored it's neither certain nor clear.
As far as I understand it, span_near should come in handy in these situations, but I get no results:
GET index-tapes-nested/_search
{
  "query" : {
    "nested" : {
      "path" : "Words",
      "score_mode" : "avg",
      "query" : {
        "span_near" : {
          "clauses" : [
            { "span_term" : { "Words.Word" : "bunch" } },
            { "span_term" : { "Words.Word" : "of" } },
            { "span_term" : { "Words.Word" : "things" } }
          ],
          "slop" : 2,
          "in_order" : true
        }
      }
    }
  }
}
I don't know much about Elasticsearch; maybe I should change the approach and the model, or maybe rewriting the query is enough. I don't know. This is pretty time consuming, so any help is really appreciated (is this a fairly common task?). For the sake of brevity I'm cutting some stuff and some ideas; I'm available to give some data or other examples if needed.
I also had problems with the C# NEST client when managing the nested index, but that is another story.
This could be interpreted in a few ways, I guess: having something like an "alternative stream" for a field, or metadata for every word, and so on. What I needed was this: https://github.com/elastic/elasticsearch/issues/5736 but it's not done yet, so for now I think I'll go with the annotated_text plugin or the 10-word window.
I have no idea whether, in the case of indexing single words, there is a query that 'restores' the integrity of the original text (which means 1. grouping the words by an id and 2. ordering them) so that Elasticsearch can give the desired results.
I'll keep searching the docs for something interesting, or see whether I can hack something to get what I need (like require_field_match or the intervals query).
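For anyone curious about the annotated_text route, a minimal sketch of what that could look like (this assumes the mapper-annotated-text plugin is installed; the index name and the ms_* annotation keys are invented for illustration):

PUT index-tapes-annotated
{
  "mappings" : {
    "_doc" : {
      "properties" : {
        "Text" : { "type" : "annotated_text" }
      }
    }
  }
}

PUT index-tapes-annotated/_doc/1
{
  "Text" : "[a](ms_0) [bunch](ms_350) [of](ms_700) [things](ms_1050)"
}

Because annotations sit at the same token positions as the words they wrap, ordinary phrase queries should keep working while each word still carries its timestamp.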

Storing a query in Mongo

This is the case: a webshop in which I want to configure which items should be listed in the shop based on a set of parameters.
I want this to be configurable, because that allows me to experiment with different parameters and also change their values easily.
I have a Product collection that I want to query based on multiple parameters.
A couple of these are found here:
within a product:
"delivery" : {
    "maximum_delivery_days" : 30,
    "average_delivery_days" : 10,
    "source" : 1,
    "filling_rate" : 85,
    "stock" : 0
}
but also other parameters exist.
An example of such a query, used to decide whether or not to include a product, could be:
"$or" : [
{
"delivery.stock" : 1
},
{
"$or" : [
{
"$and" : [
{
"delivery.maximum_delivery_days" : {
"$lt" : 60
}
},
{
"delivery.filling_rate" : {
"$gt" : 90
}
}
]
},
{
"$and" : [
{
"delivery.maximum_delivery_days" : {
"$lt" : 40
}
},
{
"delivery.filling_rate" : {
"$gt" : 80
}
}
]
},
{
"$and" : [
{
"delivery.delivery_days" : {
"$lt" : 25
}
},
{
"delivery.filling_rate" : {
"$gt" : 70
}
}
]
}
]
}
]
Now to make this configurable, I need to be able to handle boolean logic, parameters and values.
So, since such a query is itself JSON, I got the idea to store it in Mongo and have my Java app retrieve it.
The next thing is to use it in the filter (e.g. find, or whatever) and work on the corresponding selection of products.
The advantage of this approach is that I can actually analyse the data and the effectiveness of the query outside of my program.
I would store it by name in the database. E.g.
{
"name": "query1",
"query": { the thing printed above starting with "$or"... }
}
using:
db.queries.insert({
"name" : "query1",
"query": { the thing printed above starting with "$or"... }
})
Which results in:
2016-03-27T14:43:37.265+0200 E QUERY Error: field names cannot start with $ [$or]
at Error (<anonymous>)
at DBCollection._validateForStorage (src/mongo/shell/collection.js:161:19)
at DBCollection._validateForStorage (src/mongo/shell/collection.js:165:18)
at insert (src/mongo/shell/bulk_api.js:646:20)
at DBCollection.insert (src/mongo/shell/collection.js:243:18)
at (shell):1:12 at src/mongo/shell/collection.js:161
But I CAN STORE it using Robomongo, just not always. Obviously I am doing something wrong. But I have NO IDEA what it is.
If it fails, and I create a brand new collection and try again, it succeeds. Weird stuff that goes beyond what I can comprehend.
But when I try updating values in the "query", changes are not going through. Never. Not even sometimes.
I can however create a new object and discard the previous one. So, the workaround is there.
db.queries.update(
    { "name" : "query1" },
    { "$set" : {
        ... update goes here ...
    } }
)
Doing this results in:
WriteResult({
    "nMatched" : 0,
    "nUpserted" : 0,
    "nModified" : 0,
    "writeError" : {
        "code" : 52,
        "errmsg" : "The dollar ($) prefixed field '$or' in 'action.$or' is not valid for storage."
    }
})
which seems pretty close to the other message above.
Needless to say, I am pretty clueless about what is going on here, so I hope some of the wizards here are able to shed some light on the matter.
I think the error message contains the important info you need to consider:
QUERY Error: field names cannot start with $
Since you are trying to store a query (or part of one) in a document, you'll end up with attribute names that contain mongo operator keywords (such as $or, $ne, $gt). The mongo documentation actually references this exact scenario - emphasis added
Field names cannot contain dots (i.e. .) or null characters, and they must not start with a dollar sign (i.e. $)...
I wouldn't trust 3rd party applications such as Robomongo in these instances. I suggest debugging/testing this issue directly in the mongo shell.
My suggestion would be to store an escaped version of the query in your document, so as not to interfere with reserved operator keywords. You can use the available JSON.stringify(my_obj); to encode your partial query into a string, and then parse/decode it when you choose to retrieve it later on: JSON.parse(escaped_query_string_from_db)
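A small shell sketch of that round trip (collection names follow the question; the products collection is an assumption):

// Store the query as an opaque string so no $-prefixed field names reach storage.
var q = { "$or" : [ { "delivery.stock" : 1 } ] };  // trimmed version of the query above
db.queries.insert({ "name" : "query1", "query" : JSON.stringify(q) });

// Later: decode it and run it.
var stored = db.queries.findOne({ "name" : "query1" });
db.products.find(JSON.parse(stored.query));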
Your approach of storing the query as a JSON object in MongoDB is not viable.
You could potentially store your query logic and fields in MongoDB, but you have to have an external app build the query with the proper MongoDB syntax.
MongoDB queries contain operators, and some of those have special characters in them.
There are rules for MongoDB field names, and these rules do not allow special characters.
Look here: https://docs.mongodb.org/manual/reference/limits/#Restrictions-on-Field-Names
The probable reason you can sometimes successfully create the doc using Robomongo is that Robomongo transforms your query into a string and properly escapes the special characters as it sends it to MongoDB.
This also explains why your attempt to update them never works. You tried to create a document, but instead created something that is a string object, so your update conditions are probably not retrieving any docs.
I see two problems with your approach.
In the following query
db.queries.insert({
"name" : "query1",
"query": { the thing printed above starting with "$or"... }
})
a valid JSON document expects key/value pairs. Here, in "query", you are storing an object without a key. You have two options: either store the query as text, or create another key inside the curly braces.
The second problem is that you are storing query values without wrapping them in quotes. All string values must be wrapped in quotes.
So your final document should appear as:
db.queries.insert({
"name" : "query1",
"query": 'the thing printed above starting with "$or"... '
})
Now try, it should work.
Obviously my attempt to store a query in Mongo the way I did was foolish, as became clear from the answers from both @bigdatakid and @lix. So what I finally did was this: I altered the naming of the fields to comply with the Mongo requirements.
E.g. instead of $or I used _$or etc., and instead of using a . inside a name I used a #. Both of these I replace in my Java code.
This way I can still easily try and test the queries outside of my program. In my Java program I just change the names back and use the query, in just two lines of code. It simply works now. Thanks guys for the suggestions you made.
// Restore the real operator names and dots before handing the query to the driver.
String documentAsString = query.toJson().replaceAll("_\\$", "\\$").replaceAll("#", ".");
Object q = JSON.parse(documentAsString);

How to query a relative element using MongoDB

I have a document like this:
{
    "whoKnows" : {
        "name" : "Jeff",
        "phone" : "123-123-1234"
    },
    "anotherElement" : {
        "name" : "Jeff",
        "phone" : "321-321-3211"
    }
}
How can any instance of "name" be queried? For example, using a wildcard might look something like:
db.collection.find( { "*.name" : "Jeff" } )
Or if regex were supported in the element position, it might look like:
db.collection.find( { /.*\.name/ : "Jeff" } )
Is it possible to accomplish this using MongoDB?
Side note: I'm not looking for a solution like,
db.collection.find({
"$or": [
{ "whoKnows.name" : "Jeff" },
{ "anotherElement.name" : "Jeff" }
]
})
I need a truly relative-path solution, as I do not know what the parent element will be (unless there is a way to enumerate the name of every element; then I could dynamically generate the $or clause at runtime).
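Edit: for what it's worth, that dynamic generation would look something like the sketch below in the shell, under the (big) assumption that every document shares the same top-level keys as a sampled one:

// Hypothetical sketch: derive the $or clause from one document's top-level keys.
var sample = db.collection.findOne();
var clauses = Object.keys(sample)
    .filter(function(k) { return k !== "_id"; })
    .map(function(k) {
        var c = {};
        c[k + ".name"] = "Jeff";
        return c;
    });
db.collection.find({ "$or" : clauses });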
Everything about this is fairly horrible: you cannot possibly index something like the "name" values, and your "path" to each attribute varies from document to document. So this is really bad for queries.
I notice you mention "nested" structures, and you still could accommodate this with a similar proposal and some additional tagging, but I want you to consider this "phone book" type example:
{
    "phones" : [
        {
            "type" : "Home",
            "name" : "Jeff",
            "phone" : "123-123-1234"
        },
        {
            "type" : "Work",
            "name" : "Jeff",
            "phone" : "123-123-1234"
        }
    ]
}
Since these are actually sub-documents within an array, fields like "name" always share the same path, so not only can you index them (which is going to be good for performance), but the query is very basic:
db.collection.find({ "phones.name" : "Jeff" })
That does exactly what you need by finding "Jeff" in any "name" entry. If you need a hierarchy, then add some fields in those sub-documents to indicate the parent/child relationship, which you can use in post-processing, or even as a materialized path, which could aid your queries.
It really is the better approach.
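And because the path is now fixed, the supporting index is a one-liner (older shells spell it ensureIndex):

db.collection.createIndex({ "phones.name" : 1 })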
If you really must keep this kind of structure, then at least do something like the following with JavaScript, which bails out on the first match at depth:
db.collection.find(function () {
    var found = false;
    var finder = function( obj, field, value ) {
        if ( obj.hasOwnProperty(field) && obj[field] == value )
            found = true;
        if (found) return true;
        for ( var n in obj ) {
            if ( Object.prototype.toString.call(obj[n]) === "[object Object]" ) {
                finder( obj[n], field, value );
                if (found) return true;
            }
        }
    };
    finder( this, "name", "Jeff" );
    return found;
})
The format there is shorthand notation for the $where operator, which is pretty bad news for performance, but your structure isn't offering much other choice. At any rate, the function should recurse into each nested document until the "field" with the "value" is found.
For anything of production scale, really look at changing the structure to something that can be indexed and accessed quickly. The first example should give you a starting point; relying on arbitrary JavaScript for queries, as your present structure forces you to, is bad news.
If these are similar instances, what stops you from putting them in an array? That would be easier to query.
In its current form this is as good as writing your own $where condition to parse the whole document structure, which is not an efficient operation!
Although it is highly inefficient, and I wouldn't suggest using it in a production environment, the following is one of the simplest ways (with its own various catches) you can query:
db.query.find({ $where: function() { var x = tojsononeline(this); return x.indexOf('"name" : "Jeff",') >= 0; } })
Please note that this will cause a full collection scan, and if you have a pre-condition you may want to specify it before the where clause in the query.

mongodb mapreduce doesn't return right in a sharded cluster

Very interesting: mapReduce works fine on a single instance, but not on a sharded collection. As shown below, I have a collection and wrote a simple map-reduce function:
mongos> db.tweets.findOne()
{
    "_id" : ObjectId("5359771dbfe1a02a8cf1c906"),
    "geometry" : {
        "type" : "Point",
        "coordinates" : [
            131.71778292855996,
            0.21856835860911106
        ]
    },
    "type" : "Feature",
    "properties" : {
        "isflu" : 1,
        "cell_id" : 60079,
        "user_id" : 35,
        "time" : ISODate("2014-04-24T15:42:05.048Z")
    }
}
mongos> db.tweets.find({"properties.user_id":35}).count()
44247
mongos> map_flow
function () { var key=this.properties.user_id; var value={ "cell_id":1}; emit(key,value); }
mongos> reduce2
function (key,values){ var ros={flows:[]}; values.forEach(function(v){ros.flows.push(v.cell_id);});return ros;}
mongos> db.tweets.mapReduce(map_flow,reduce2, { out:"flows2", sort:{"properties.user_id":1,"properties.time":1} })
but the results are not what I want:
mongos> db.flows2.find({"_id":35})
{ "_id" : 35, "value" : { "flows" : [ null, null, null ] } }
I get lots of nulls, and interestingly they always come in threes.
Does MongoDB mapReduce not work correctly on a sharded collection?
The number one rule of MapReduce is:
thou shall emit the value of the same type as reduce function returneth
You broke this rule, so your MapReduce only works for a small collection where reduce is only called once for each key (that's the second rule of MapReduce: the reduce function may be called zero, one, or many times).
Your map function emits exactly this value {cell_id:1} for each document.
How does your reduce function use this value? Well, you return a value which is a document with an array, into which you push the cell_id value. This is strange already, because that value was 1, so I'm not sure why you wouldn't just emit 1 (if you wanted to count).
But look what happens when multiple shards push a bunch of 1's into this flows array (whether or not it's what you intended, that's what your code is doing) and reduce is then called on several already-reduced values:
reduce(key, [ {flows:[1,1,1,1]},{flows:[1,1,1,1,1,1,1,1,1]}, etc ] )
Your reduce function now tries to take each member of the values array (which is a document with a single field flows) and you push v.cell_id to your flows array. There is no cell_id field here, so of course you end up with null. And three nulls could be because you have three shards?
I would recommend that you articulate to yourself what exactly you are trying to aggregate in this code, and then rewrite your map and your reduce to comply with the rules that mapReduce in MongoDB expects your code to follow.
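To illustrate, here is a sketch that follows both rules, assuming the goal is to collect each user's cell_id values (note that mapReduce gives no ordering guarantee across reduce chunks):

// Emit the same shape the reducer returns: a document holding a flows array.
var map_flow = function() {
    emit(this.properties.user_id, { flows: [ this.properties.cell_id ] });
};

// Concatenate the partial arrays; the output shape equals the input shape,
// so re-reducing already-reduced values is safe on a sharded cluster.
var reduce2 = function(key, values) {
    var ros = { flows: [] };
    values.forEach(function(v) {
        ros.flows = ros.flows.concat(v.flows);
    });
    return ros;
};

db.tweets.mapReduce(map_flow, reduce2, {
    out: "flows2",
    sort: { "properties.user_id": 1, "properties.time": 1 }
});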

Most efficient way to generate a list of Unigrams from a text field in MongoDB

I need to generate a vector of unigrams, i.e. a vector of all the unique words which appear in a specific text field that I have stored as part of a broader JSON object in MongoDB.
I'm not really sure what the easiest and most efficient way to generate this vector is. I was thinking of writing a simple Java app which could handle the tokenization (using something like OpenNLP); however, I think a better approach may be to tackle this using Mongo's Map-Reduce feature... but I'm not really sure how I could go about that.
Another option would be to use Apache Lucene indexing, but it would mean I'd still need to export the data one document at a time, which is really the same issue I would have with a custom Java or Ruby approach...
Map-Reduce sounds good; however, the Mongo data is growing by the day as more documents are inserted. This isn't really a one-off task, as new documents are being added all the time. Updates are very rare. I really don't want to run a Map-Reduce over millions of documents every time I want to update my unigram vector, as I fear that would be a very inefficient use of resources...
What would be the most efficient way to generate the unigram vector and then keep it updated?
Thanks!
Since you have not provided a sample document (object) format, take this as a sample collection called 'stories'.
{ "_id" : ObjectId("4eafd693627b738f69f8f1e3"), "body" : "There was a king", "author" : "tom" }
{ "_id" : ObjectId("4eafd69c627b738f69f8f1e4"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd72c627b738f69f8f1e5"), "body" : "There was a queen", "author" : "tom" }
{ "_id" : ObjectId("4eafd74e627b738f69f8f1e6"), "body" : "There was a jack", "author" : "tom" }
{ "_id" : ObjectId("4eafd785627b738f69f8f1e7"), "body" : "There was a humpty and dumpty . Humtpy was tall . Dumpty was short .", "author" : "jane" }
{ "_id" : ObjectId("4eafd7cc627b738f69f8f1e8"), "body" : "There was a cat called Mini . Mini was clever cat . ", "author" : "jane" }
For the given dataset, you can use the following JavaScript code to get to your solution. The collection "authors_unigrams" contains the result. All the code is meant to be run using the mongo console (http://www.mongodb.org/display/DOCS/mongo+-+The+Interactive+Shell).
First, we need to mark all the new documents that have come afresh into the 'stories' collection. We do it using the following command, which adds a new attribute called "mr_status" to each document and assigns it the value "inprocess". Later, we will see that the map-reduce operation only takes into account documents whose "mr_status" field has the value "inprocess". This way, we avoid reconsidering documents that were already processed in a previous run, making the operation efficient, as asked.
db.stories.update({mr_status:{$exists:false}},{$set:{mr_status:"inprocess"}},false,true);
Second, we define both map() and reduce() function.
var map = function() {
    var uniqueWords = function(words) {
        var arrWords = words.split(" ");
        var arrNewWords = [];
        var seenWords = {};
        for (var i = 0; i < arrWords.length; i++) {
            if (!seenWords[arrWords[i]]) {
                seenWords[arrWords[i]] = true;
                arrNewWords.push(arrWords[i]);
            }
        }
        return arrNewWords;
    };
    var unigrams = uniqueWords(this.body);
    emit(this.author, { unigrams: unigrams });
};
var reduce = function(key, values) {
    Array.prototype.uniqueMerge = function( a ) {
        for ( var nonDuplicates = [], i = 0, l = a.length; i < l; ++i ) {
            if ( this.indexOf( a[i] ) === -1 ) {
                nonDuplicates.push( a[i] );
            }
        }
        return this.concat( nonDuplicates );
    };
    var unigrams = [];
    values.forEach(function(i) {
        unigrams = unigrams.uniqueMerge(i.unigrams);
    });
    return { unigrams: unigrams };
};
Third, we actually run the map-reduce function.
var result = db.stories.mapReduce( map,
reduce,
{query:{author:{$exists:true},mr_status:"inprocess"},
out: {reduce:"authors_unigrams"}
});
Fourth, we mark all the records considered for map-reduce in the last run as processed, by setting "mr_status" to "processed".
db.stories.update({mr_status:"inprocess"},{$set:{mr_status:"processed"}},false,true);
Optionally, you can see the result collection "authors_unigrams" by running the following command:
db.authors_unigrams.find();
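For convenience, the four steps can be wrapped into a single function and re-run on whatever schedule suits the insert rate (map and reduce as defined above):

function updateUnigrams() {
    // Step 1: flag documents that have not been processed yet.
    db.stories.update({ mr_status: { $exists: false } }, { $set: { mr_status: "inprocess" } }, false, true);
    // Steps 2 and 3: fold the flagged documents into the existing result collection.
    db.stories.mapReduce(map, reduce, {
        query: { author: { $exists: true }, mr_status: "inprocess" },
        out: { reduce: "authors_unigrams" }
    });
    // Step 4: mark them as done.
    db.stories.update({ mr_status: "inprocess" }, { $set: { mr_status: "processed" } }, false, true);
}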