MongoDB fulltext search for common words - mongodb

Hi, I have a big problem with full-text search. I have a collection of 10 million documents with a lot of common words in the indexed field, for example: what, as, like, how, hi, hello, etc.
When I search for the word "hi", the search becomes extremely slow and takes about 30 minutes to return results. On the other hand, when I run the same search with an uncommon word, it is much faster and takes less than 30 ms.
I don't know what the problem could be.
My text index:
db.themes.createIndex({"theme":"text"})
And the query that I run:
db.themes.find({$text: {$search: "hi"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}}).limit(20)

Well, that's how it is. MongoDB provides full-text capabilities, but the performance is not on par with popular text search engines.
You can find online that most implementations run Elasticsearch alongside MongoDB to optimize search. You can use either Elasticsearch or Solr for your operations.

MongoDB's text search is disgracefully slow on large collections. I also don't like its way of automatically thinking that "James Bond" is OR, but that's another story... (for AND, need to search for "\"James\" \"Bond\"" which is inelegant at best).
One way around it, if your application allows it, is to Limit the Number of Entries Scanned by filtering on other fields. The filter has to be an equality match; it can't be $gt or similar. You might have to be creative to make that work... I've grouped my cities into "metropolitan areas" (this took a while...), and now I can search with {metro: "DC", $text: {$search: "pizza"}}.
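A minimal sketch of the two ideas in this answer: a compound text index with an equality key in front of the text key, and a search string where every term is quoted so all of them must match. The field name "metro" comes from the answer; "description" and the collection name "places" are hypothetical, and the helpers only build plain objects, so nothing here talks to a server.

```javascript
// "metro" is from the answer above; "description" is a hypothetical name
// for the text-indexed field.
const scopedTextIndex = { metro: 1, description: "text" };

// Wrap every term in double quotes so each one must match (the AND behaviour
// described in the answer above). Embedded double quotes are dropped.
function allTermsSearch(terms) {
  return terms.map((t) => '"' + t.replace(/"/g, "") + '"').join(" ");
}

// Equality condition on the prefix key plus the $text condition.
function scopedTextQuery(metro, terms) {
  return { metro: metro, $text: { $search: allTermsSearch(terms) } };
}

// Roughly, in the mongo shell (collection name is an assumption):
//   db.places.createIndex(scopedTextIndex)
//   db.places.find(scopedTextQuery("DC", ["James", "Bond"]))
```

With an index shaped like this, a $text query has to include an equality match on the keys that precede the text key, which is why a range operator such as $gt does not work for the scoping field.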

Related

MongoDB text search with sort by score is very slow

Looking for some assistance with MongoDB.
I have a MongoDB collection with 7.3 million documents, running on a server with 24 cores and 64GB RAM. I have created a text index for a single field.
When I perform text searches using the code below (in MongoShell):
db.getCollection('mycoll').find({$text: {$search: 'my search query'}}).limit(10);
I get results almost instantly. The only problem is that these results are basically useless because they aren't sorted by score (and it's beyond me why this isn't the default behavior, but that's a different topic).
I modify the command to include the sorting by score, as follows:
db.getCollection('mycollection').find({$text: {$search: 'my search query'}}, {score: {$meta: 'textScore'}}).sort({score: {$meta: 'textScore'}}).limit(10);
and the command works, but it now takes about 10 seconds.
For a single or occasional query, it's not a big deal, but when I need to perform 1,000 - 10,000 such queries in a batch, we are now talking about several hours of wait time.
I have seen some recommendations about adding "project" and "aggregation" but I don't see any improvement. Maybe I am not doing it right, so I don't want to put additional code here.
Any MongoDB expert that can provide some help would be greatly appreciated!

Tips for a search system with tags and sorting (questions about database design, performances issues, ...)

I am developing a site around sharing texts in Java/Scala, with Play Framework 2 and MongoDB for storage.
I am currently developing the search page. There will be of course a classic textfield search, but also two types of filters :
Tag
Top rated / most viewed per year / month / week
For example, it will be possible to get the xxx best texts of the week among those with the tag "fantasy". If you do not see what I mean, think Pornhub. ;)
I see how to do the query, but I'm afraid of performance issues.
I'm a real noob when it comes to performance and query optimization, and a MongoDB beginner, so I'm worried about the impact of big queries that would search, sort and rank among tens of thousands of texts.
Naturally, I thought of a cache system but I do not know how to implement it because each query may be different. I also thought of temporary collections updated every day at midnight (for example) with a job, but again there are too many different combinations.
So what are the techniques and "tricks" that I could use to model it? Do you have any ideas? Is there a search framework designed for that?
Or maybe I worry too much about it, and MongoDB handles that kind of sorting and ranking very well?
I hope that's clear. Thank you very much for your help!
Some notes:
MongoDB is coming out with full text search soon. (v2.4)
You can always send data to Elastic Search or Solr at the same time it's written to Mongo. Then you can search with Elastic or Solr.
You can definitely tag text documents in Mongo and then index them and search. For example:
{ "_id" : 123, "content" : "...", "tags" : [ "fun", "cool read" ] }
You index the "tags" field and then you can search for "tag : 'fun'" and Mongo will retrieve that document really fast.
You didn't describe how you're doing top rated, but you could definitely write that information to the text and query on that.
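As a rough illustration of the tag note above combined with the "top rated per week" filter from the question, here is one way the query could be shaped. The "tags" field is from the example document; "createdAt", "rating" and the collection name "texts" are assumptions made only for this sketch, and the function just builds plain objects.

```javascript
// "tags" comes from the example document above; "createdAt" and "rating"
// are hypothetical field names.
function topRatedByTag(tag, days, limit, now = new Date()) {
  // Start of the time window: `days` days before `now`.
  const since = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return {
    filter: { tags: tag, createdAt: { $gte: since } },
    sort: { rating: -1 }, // best rated first
    limit: limit,
  };
}

// One index shape that could support this query: equality on tags,
// then the sort key, then the range key.
const suggestedIndex = { tags: 1, rating: -1, createdAt: 1 };

// Roughly, in the mongo shell (collection name is an assumption):
//   db.texts.createIndex(suggestedIndex)
//   const q = topRatedByTag("fantasy", 7, 10)
//   db.texts.find(q.filter).sort(q.sort).limit(q.limit)
```

Whether this index is actually the best choice depends on the data, so checking the plan with .explain() is still worthwhile.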

Fast Substring Search in Mongo

We have a Mongo database with about 400,000 entries, each of which has a relatively short (< 20 characters) title. We want to be able to do fast substring searches on these titles (fast enough to use the results in things like autocomplete bars). We are also only searching for prefixes (does the title start with the substring?). What can we do?
If you only do prefix searches, then indexing that field should be enough. Rooted regex queries use the index and should be fast.
Sergio is correct, but to be more specific, an index on that and a left-rooted prefix without the i (case-insensitivity) flag will make efficient use of the index. This is noted in the docs in fact:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-RegularExpressions
Don't forget to use .explain() if you want to benchmark the queries too.
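A small sketch of the left-rooted, case-sensitive prefix query described above. The helper names, the "title" field and the "entries" collection are hypothetical; the function escapes regex metacharacters in user input and anchors the pattern with ^ and no i flag.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a left-anchored, case-sensitive prefix filter for one field.
function prefixQuery(field, prefix) {
  return { [field]: new RegExp("^" + escapeRegex(prefix)) };
}

// Roughly, in the mongo shell (collection and field names are assumptions):
//   db.entries.createIndex({ title: 1 })
//   db.entries.find(prefixQuery("title", "Mon")).limit(10)
//   db.entries.find(prefixQuery("title", "Mon")).explain()
```

If case-insensitive autocomplete is needed, one common workaround is to store a lowercased copy of the title and run the same anchored query against that copy, rather than adding the i flag.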

Full Text Search & Inverted Indexes in MongoDB

I’m playing around with MongoDB for the moment to see what nice features it has. I’ve created a small test suite representing a simple blog system with posts, authors and comments, very basic.
I’ve experimented with a search function which uses the MongoRegEx class (PHP Driver), where I’m just searching through all post content and post titles for the phrase ‘lorem ipsum’ with the case-insensitive flag “/i”.
My code looks like this:
$regex = new MongoRegEx('/lorem ipsum/i');
$query = array('post' => $regex, 'post_title' => $regex);
But I’m confused and stunned about what happens. I check every query for running time (set microtime before and after the query and get the time with 15 decimals).
For my first test I’ve added 110.000 blog documents and 5000 authors, everything randomly generated. When I do my search, it finds 6824 posts with the sentence “lorem ipsum” and it takes 0.000057935714722 seconds to do the search. And this is after I’ve reset the MongoDB service (using Windows) and this is without any index other than the default on _id.
MongoDB uses a B-tree index, which most definitely isn’t very efficient for a full text search. If I create an index on my post content attribute, the same query as above runs in 0.000150918960571 seconds, which funnily enough is slower than without any index (slower by 0.000092983245849 seconds). Now this can happen for several reasons because it uses a B-tree cursor.
But I’ve tried to search for an explanation as to how it can query it so fast. I guess that it probably keeps everything in my RAM (I’ve got 4GB and the database is about 500MB). This is why I try to restart the mongodb service to get a full result.
Can anyone with experience with MongoDB help me understand what is going on with this kind of full text search with or without index and definitely without an inverted index?
Sincerely
- Mestika
I think you simply didn't iterate over the results? With just a find(), the driver will not send a query to the server. You need to fetch at least one result for that. I don't believe MongoDB is this fast, and I believe your error to be in your benchmark.
As a second thing, for regular expression search that is not anchored at the beginning of the field's value with an ^, no index is used at all. You should play with explain() to see what is actually happening.
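To make the benchmarking point concrete, here is a simulation of a lazy cursor. It is not the MongoDB driver API; makeLazyCursor and the stats counter are invented for this sketch. It only shows that creating a cursor-like object does no scanning, and that the work happens when the results are iterated.

```javascript
// Simulated lazy cursor: a stand-in for a driver cursor, NOT the real API.
function makeLazyCursor(docs, predicate, stats) {
  return {
    *[Symbol.iterator]() {
      for (const doc of docs) {
        stats.examined += 1; // work happens only while iterating
        if (predicate(doc)) yield doc;
      }
    },
  };
}

// 1000 fake posts, every tenth one containing the phrase.
const docs = Array.from({ length: 1000 }, (_, i) => ({
  post: i % 10 === 0 ? "Lorem ipsum dolor" : "other text",
}));
const stats = { examined: 0 };

// "Running the query" only creates the cursor; nothing is scanned yet.
const cursor = makeLazyCursor(docs, (d) => /lorem ipsum/i.test(d.post), stats);
const examinedBeforeIteration = stats.examined;

// Consuming the cursor is what triggers the scan.
const results = [...cursor];
const examinedAfterIteration = stats.examined;
```

Timing only the line that creates the cursor would measure almost nothing, which matches the suspiciously small numbers in the question. A fair benchmark has to include fetching the results.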

Lucene "contains" search on subset of index

I have an index with around 5 million documents that I am trying to do a "contains" search on. I know how to accomplish this and I have explained the performance cost to the customer, but that is what they want. As expected doing a "contains" search on the entire index is very slow, but sometimes I only want to search a very small subset of the index (say 100 documents or so). I've done this by adding a Filter to the search that should limit the results correctly. However I find that this search and the entire index search perform almost exactly the same. Is there something I'm missing here? It feels like this search is also searching the entire index.
Adding a filter to the search will not limit the scope of the index.
You need to be more clear about what you need from your search, but I don't believe what you want is possible.
Is the subset of documents always the same? If so, maybe you can get clever with multiple indices. (e.g. search the smaller index and if there aren't enough hits, then search the larger index).
You can try SingleCharTokenAnalyzer