MongoDB: Slow text search when searching a very frequent term

I have a collection of about 1 million documents (mainly movies), and I created a text index on a field. Everything works fine for almost all searches: less than 20 ms to get a result. The exception is when searching for a very frequent term; that can take up to 3000 ms!
For example,
if I search for 'pulp' in the collection (only 40 documents have it), it takes 1 ms;
if I search for 'movie' (750,000 documents have it), it takes 3000 ms.
When profiling the request, explain('executionStats') shows that all 'movie' documents are scanned. I tried many indexes, sorting + limiting, and hinting, but all 750,000 documents are still scanned and the result is still slow to come back...
Is there a strategy to search for a very frequent term in a database faster?
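For reference, the profiling call described above looks roughly like this in the shell (the collection name movies is an assumption, not from the original post):
db.movies.find({ $text: { $search: "movie" } }).explain("executionStats")
// executionStats.totalDocsExamined shows how many documents were scanned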

I ended up building my own stop word list by coding something like this:
import pymongo
from bson.code import Code

# Maximum number of occurrences of a word in the collection after which
# it is considered a stop word.
NB_MAX_COUNT = 20000
STOP_WORDS_FILE = 'stop_words.py'

db = pymongo.MongoClient().mydatabase  # connect to your database here (name assumed)

mapfn = Code("""function() {
    var words = this.field_that_is_text_indexed;
    if (words) {
        // quick lowercase to normalize per your requirements
        words = words.toLowerCase().split(/[ \/]/);
        for (var i = words.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (words[i]) {        // make sure there's something
                emit(words[i], 1); // store a 1 for each word
            }
        }
    }
};""")

reducefn = Code("""function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v;
    });
    return count;
};""")

with open(STOP_WORDS_FILE, 'w') as fh:
    fh.write('# -*- coding: utf-8 -*-\n'
             'stop_words = [\n')
    result = db.mycollection.map_reduce(mapfn, reducefn, 'words_count')
    for doc in result.find({'value': {'$gt': NB_MAX_COUNT}}):
        fh.write("'%s',\n" % doc['_id'])
    fh.write(']\n')
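The generated stop_words list can then be used to strip over-frequent terms from a query before it reaches the text index. A minimal shell sketch of that usage (not part of the original script; the collection name and the sample words are assumptions):
// Hypothetical usage: drop stop words from a user's query before running the $text search.
var stopWords = ["movie", "film", "man"]; // loaded from the generated stop_words list
function searchWithoutStopWords(query) {
    var terms = query.toLowerCase().split(/\s+/).filter(function(w) {
        return stopWords.indexOf(w) === -1;
    });
    return db.mycollection.find({ $text: { $search: terms.join(" ") } });
}
searchWithoutStopWords("pulp fiction movie"); // effectively searches for "pulp fiction"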

Related

Search in a limited number of records in MongoDB

I want to search only in the first 1000 records of my collection, named CityDB. I used the following code:
db.CityDB.find({'index.2':"London"}).limit(1000)
but it does not work: it returns the first 1000 results of the search, whereas I want to search only within the first 1000 records, not all of them. Could you please help me?
Thanks,
Amir
Note that there is no guarantee that your documents are returned in any particular order by a query as long as you don't sort explicitly. Documents in a new collection are usually returned in insertion order, but various things can cause that order to change unexpectedly, so don't rely on it. By the way: auto-generated _ids start with a timestamp, so when you sort by _id, the objects are returned by creation date.
Now about your actual question. When you first want to limit the documents and then perform a filter operation on this limited set, you can use the aggregation pipeline. It allows you to apply the $limit operator first and then the $match operator on the remaining documents.
db.CityDB.aggregate([
    // { $sort: { _id: 1 } }, // <- uncomment when you want the first 1000 by creation time
    { $limit: 1000 },
    { $match: { 'index.2': "London" } }
])
I can think of two ways to achieve this:
1) Keep a global counter: every time you insert data into your collection, add a field count = currentCounter and increase currentCounter by 1. When you need to select your first k elements, you find them this way:
db.CityDB.find({
    'index.2': "London",
    count: {
        '$gte': currentCounter - k
    }
})
This is not atomic and might sometimes give you more than k elements on a heavily loaded system (but it can use indexes); a sketch of maintaining such a counter follows.
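Not part of the original answer, but here is a minimal shell sketch of how such a counter is typically maintained, using an atomic $inc on a separate counters collection (collection and field names are assumptions):
// Atomically fetch the next counter value, creating the counter on first use.
function nextCount() {
    return db.counters.findAndModify({
        query: { _id: "cityCounter" },
        update: { $inc: { seq: 1 } },
        new: true,
        upsert: true
    }).seq;
}
// Each insert stamps the document with the current counter value.
db.CityDB.insert({ index: ["a", "b", "London"], count: nextCount() });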
Here is another approach which works nicely in the shell:
2) Create your dummy data:
var k = 100;
for (var i = 1; i < k; i++) {
    db.a.insert({
        _id: i,
        z: Math.floor(1 + Math.random() * 10)
    })
}
output = [];
And now find in the first k records where z == 3
k = 10;
db.a.find().sort({$natural: 1}).limit(k).forEach(function(el) {
    if (el.z == 3) {
        output.push(el)
    }
})
As you can see, output now holds the matching elements:
output
I think it is pretty straightforward to modify my example for your needs.
P.S. Also take a look at the aggregation framework; there might be a way to achieve what you need with it (see the sketch below).
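Not from the original answer, but one possible aggregation version of the same idea on the dummy collection above, with ascending _id standing in for insertion order:
db.a.aggregate([
    { $sort: { _id: 1 } },  // approximate "first" by ascending _id
    { $limit: 10 },         // keep only the first k records
    { $match: { z: 3 } }    // then filter within that window
])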

MongoDB Map/Reduce very poor performance

I am executing Map/Reduce in MongoDB and it is unexpectedly, extremely slow. I am using a very wide document with 700 fields, and the reduce function performs a simple sum and multiplication of each value in the document by another value.
Performance is very poor: 20 minutes for only 20K docs, growing linearly with the number of docs. Looking at mongotop, 99% of the time the process is busy writing into the temp collection. Any idea what it is doing, any ways to optimize, or anything I am missing?
I have followed the usual recommendations: sorted by all emitted keys (Key1, Key2, Key3)
and created a compound index: db.coll.ensureIndex({Key1: 1, Key2: 1, Key3: 1})
map = function() {
    var key = {
        key1: this.Key1,
        key2: this.Key2,
        key3: this.Key3
    };
    var values = this;
    emit(key, values);
}
reduce = function(key, values) {
    // initialize all values to 0
    var ret = {
        val1: 0,
        val2: 0,
        .
        .
        val700: 0
    };
    values.forEach(function(v) {
        ret.val1 += v.val1 * v.quantity;
        ret.val2 += v.val2 * v.quantity;
        .
        .
        ret.val700 += v.val700 * v.quantity;
    });
    return ret;
}
db.runCommand({mapReduce: coll, map: map, reduce: reduce, out: coucoll, sort: {Key1: 1, Key2: 1, Key3: 1}})
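Not part of the original question, but for comparison: a per-key sum of products like this can also be expressed with the aggregation framework, which avoids the temp-collection writes that mongotop shows. Field names are taken from the snippets above and the 700 accumulators are abbreviated:
db.getCollection(coll).aggregate([
    { $group: {
        _id: { key1: "$Key1", key2: "$Key2", key3: "$Key3" },
        val1: { $sum: { $multiply: ["$val1", "$quantity"] } },
        // ...one accumulator per field, up to val700
        val700: { $sum: { $multiply: ["$val700", "$quantity"] } }
    } }
])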

Find out HDD usage of binary data in a field

I have a collection with a field image that contains images (BinData). I want to find out what percentage of the DB is used by the images. What is the most efficient way to calculate the total size of all images?
I want to avoid fetching all images from the DB server, so I came up with this code:
mapper = Code("""
    function() {
        var n = 0;
        if (this.image) {
            n = this.image.length();
        }
        emit('sizes', n);
    }
""")
reducer = Code("""
    function(key, sizes) {
        var total = 0;
        for (var i = 0; i < sizes.length; i++) {
            total += sizes[i];
        }
        return total;
    }
""")
result = db.files.map_reduce(mapper, reducer, "image_sizes")
During execution, the memory usage of MongoDB gets quite high; it looks as if all the data is loaded into memory. How can this be optimized? Also, does it make sense to call this.image.length() in order to find out how many bytes the images occupy on the hard drive?
You cannot avoid loading all the data into memory. MongoDB treats a document as its atomic unit, and by querying all documents you pull every document into memory.
As an alternative, what might help you is simply to check how many bytes the collection takes up, but that of course only works if the only things stored in your collection are the images. In the shell, you can do this with:
db.files.stats()
This has a storageSize field that shows approximately how much storage your images need. It is not nearly as accurate as going through all your images, though.
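Not from the original answer, but on MongoDB 4.4 or newer the $binarySize aggregation operator can sum the image sizes on the server without MapReduce (collection and field names as in the question):
db.files.aggregate([
    { $group: {
        _id: null,
        totalImageBytes: { $sum: { $binarySize: { $ifNull: ["$image", ""] } } }
    } }
])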

optimizing hourly statistics retrieval with mongodb

I've collected about 10 million documents spanning a few weeks in my MongoDB database, and I want to be able to calculate some simple statistics and output them.
The statistic I'm trying to get is the average rating of the documents within a timespan, in one-hour intervals.
To give an idea of what I'm trying to do, here is some pseudocode:
var dateTimeStart;
var dateTimeEnd;
var distinctHoursBetweenDateTimes = getHours(dateTimeStart, dateTimeEnd);
var totalResult=[];
foreach (distinctHour in distinctHoursBetweenDateTimes)
    tmpResult = mapreduce_getAverageRating(distinctHour, distinctHour + 1)
    totalResult[distinctHour] = tmpResult
return totalResult;
My document structure is something like:
{_id, rating, topic, created_at}
created_at is the date I'm basing my statistics on (time of insertion and time of creation are not always the same).
I've created an index on the created_at field.
The following is my mapreduce:
map = function() {
    emit(this.Topic, { 'total': this.Rating, num: 1 });
};
reduce = function(key, values) {
    var n = { 'total': 0, num: 0 };
    for (var i = 0; i < values.length; i++) {
        n.total += values[i].total;
        n.num += values[i].num;
    }
    return n;
};
finalize = function(key, res) {
    res.avg = res.total / res.num;
    return res;
};
I'm pretty sure this can be done more efficiently, possibly by letting Mongo do more work instead of running several map-reduce statements in a row.
At this point each map-reduce takes about 20-25 seconds, so computing statistics for all the hours over a few days quickly adds up to a very long time.
My impression is that mongo should be suited for this kind of work - hence I must obviously be doing something wrong.
Thanks for your help!
I assume the time is part of the documents you are MapReducing?
If you run the MapReduce over all documents, determine the hour in the map function, and add it to the key you emit, you can do all of this in a single MapReduce; a sketch of such a map function follows.
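Not from the original answer, a minimal sketch of that single-MapReduce map function, reusing the field names from the question (created_at, Topic, Rating); the reduce and finalize functions above can stay unchanged:
map = function() {
    // truncate the timestamp to the hour and make it part of the emitted key
    var d = this.created_at;
    var hour = new Date(d.getFullYear(), d.getMonth(), d.getDate(), d.getHours());
    emit({ topic: this.Topic, hour: hour }, { 'total': this.Rating, num: 1 });
};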

Random Sampling from Mongo

I have a Mongo collection of documents. There is one field in every document which is 0 or 1. I need to randomly sample 1000 records from the database and count the number of documents that have that field set to 1. I need to do this sampling 1000 times. How do I do it?
For people coming to this answer now: you should use the $sample aggregation stage, new in 3.2.
https://docs.mongodb.org/manual/reference/operator/aggregation/sample/
db.collection_of_things.aggregate([
    { $sample: { size: 15 } }
])
Then add another step to count up the 0s and 1s using $group to get the count; a sketch of the combined pipeline is shown below.
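Not from the original answer, but a minimal sketch of such a pipeline, assuming the 0/1 field from the question is named thefield:
db.collection_of_things.aggregate([
    { $sample: { size: 1000 } },                      // draw 1000 random documents
    { $group: { _id: "$thefield", n: { $sum: 1 } } }  // count how many are 0 and how many are 1
])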
For MongoDB 3.0 and before, I use an old trick from SQL days (which I think Wikipedia uses for its random-page feature). I store a random number between 0 and 1 in every object I need to randomize; let's call that field "r". Then add an index on "r":
db.coll.ensureIndex({r: 1});
Now to get x random objects, you use:
var startVal = Math.random();
db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x);
This gives you random objects in a single find query. Depending on your needs, this may be overkill, but if you are going to be doing lots of sampling over time, it is a very efficient way that puts little load on your backend.
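One caveat, not from the original answer: if startVal happens to be larger than the biggest stored r, the query above returns nothing, so a common workaround is to wrap around and take the rest from the other end:
var startVal = Math.random();
var docs = db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x).toArray();
if (docs.length < x) {
    // wrap around: take the remainder from the low end of the r range
    docs = docs.concat(db.coll.find({r: {$lte: startVal}}).sort({r: -1}).limit(x - docs.length).toArray());
}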
Here's an example in the mongo shell, assuming a collection named collname and a value of interest in thefield:
var total = db.collname.count();
var count = 0;
var numSamples = 1000;
for (var i = 0; i < numSamples; i++) {
    var random = Math.floor(Math.random() * total);
    var doc = db.collname.find().skip(random).limit(1).next();
    if (doc.thefield) {
        count += (doc.thefield == 1);
    }
}
I was going to edit my comment on @Stennie's answer with this, but you could also use a separate auto-incrementing ID index here as an alternative if you need to skip over a huge number of records.
I wrote an answer to another question a lot like this one, where someone was trying to find the nth record of a collection:
php mongodb find nth entry in collection
The second half of my answer basically describes one potential method by which you could approach this problem. You would still need to loop 1000 times to get the random row of course.
If you are using mongoengine, you can use a SequenceField to generate an incremental counter.
class User(db.DynamicDocument):
    counter = db.SequenceField(collection_name="user.counters")
Then to fetch a random list of, say, 100, do the following:
import random

def get_random_users(number_requested):
    users_to_fetch = random.sample(range(1, User.objects.count() + 1),
                                   min(number_requested, User.objects.count()))
    return User.objects(counter__in=users_to_fetch)
where you would call
get_random_users(100)