We have a Mongo database with about 400,000 entries, each of which has a relatively short (< 20 characters) title. We want to be able to do fast substring searches on these titles (fast enough to use the results in things like autocomplete bars). We are also only searching for prefixes (does the title start with the substring?). What can we do?
If you only do prefix searches, then indexing that field should be enough. Rooted regex queries use the index and should be fast.
Sergio is correct, but to be more specific: an index on that field, plus a left-rooted prefix regex without the i (case-insensitivity) flag, will make efficient use of the index. This is noted in the docs, in fact:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-RegularExpressions
Don't forget to use .explain() if you want to benchmark the queries too.
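As a language-agnostic sketch (not MongoDB internals), the reason a left-anchored prefix query can use the index is that an ordered index reduces the search to two binary searches over sorted keys. The names `prefix_search` and `titles` below are hypothetical:

```python
from bisect import bisect_left, bisect_right

def prefix_search(sorted_titles, prefix):
    """Return all titles starting with `prefix`, using binary search on an
    ordered list -- the same reason a left-anchored regex like /^foo/ can
    use a B-tree index instead of scanning every document."""
    lo = bisect_left(sorted_titles, prefix)
    # "\xff" sorts after any printable character, bounding the key range
    hi = bisect_right(sorted_titles, prefix + "\xff")
    return sorted_titles[lo:hi]

titles = sorted(["alpha", "alphabet", "beta", "gamma", "alpine"])
print(prefix_search(titles, "alp"))  # ['alpha', 'alphabet', 'alpine']
```

An unanchored or case-insensitive regex cannot be mapped to such a key range, which is why it falls back to a full scan.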
Related
I want to have full-text search without the stemming; it feels like too naive an approach.
What I want is:
select * from content order by levenshtein_distance(document, :query_string)
I want it to be blazing fast.
I still want to have my queries / documents split by spaces.
Would the GIN index on document still be relevant?
Is this possible? Would it be blazing fast?
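For reference, a minimal Python sketch of the Levenshtein distance the query above orders by (the SQL function name is the asker's own; this is the standard dynamic-programming formulation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```

Note it costs O(len(a) * len(b)) per pair, which is why ordering a whole table by it with no index support is slow.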
There is no index support for that, so it would be slow.
Perhaps pg_trgm can help; it provides similarity matching that can be supported by a GIN index.
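A rough Python sketch of the idea behind pg_trgm (the real extension differs in details such as word-boundary handling; the function names here are illustrative):

```python
def trigrams(s: str) -> set:
    """pg_trgm-style trigrams: lowercase, pad with two leading spaces
    and one trailing space, then take every 3-character window."""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams over total distinct trigrams -- the quantity
    pg_trgm's similarity() function is based on."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("word", "word"))   # 1.0
print(similarity("abc", "xyz"))     # 0.0
```

A GIN index over these trigram sets lets Postgres find candidate rows that share trigrams with the query string without scanning every row, so it stays fast without stemming.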
I want to query using part of an id to get all the matching documents. I tried "starts with" and "contains", which work fine, but is there any performance issue for large collections?
The best ways to make this search optimal:
Add a $text index on the fields you want to search in. This is really important because internally it tokenizes your string so that you can search for parts of it.
Use a regex, which is also quick.
If you are using aggregation, read the official MongoDB doc about aggregation pipeline optimization, which might help you implement this efficiently: https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
Last but not least, if you are not yet fully committed to MongoDB and the project is fresh, look into Elasticsearch, which is based on Lucene. It is extremely powerful for these kinds of searches.
Hi, I have a big problem with full-text search. I have a collection of 10 million documents with a lot of common words in the indexed field, for example: what, as, like, how, hi, hello, etc.
When I search for the word "hi", the search becomes super slow and takes about 30 minutes to return results; on the other hand, when I do the same with an uncommon word, the search is super fast and takes less than 30 ms.
I don't know what the problem could be.
My text index:
db.themes.createIndex({"theme":"text"})
and the query that i run:
db.themes.find({$text: {$search: "hi"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}}).limit(20)
Well, that's how it is. MongoDB provides full-text capabilities, but the performance is not on par with popular text search engines.
You can probably find on the internet that most implementations run Elasticsearch alongside MongoDB for search optimization. You can use either Elasticsearch or Solr for your operations.
MongoDB's text search is disgracefully slow on large collections. I also don't like the way it automatically treats "James Bond" as an OR, but that's another story... (for an AND, you need to search for "\"James\" \"Bond\"", which is inelegant at best).
One way to work around it, if your application allows it, is to limit the number of entries scanned by filtering on other fields. For that, the filter needs to be an equality; it can't be $gt or the like. You might have to be creative to make that work... I've grouped my cities into "metropolitan areas" (this took a while...), and now I can search with {metro: "DC", $text: {$search: "pizza"}}.
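A toy Python sketch of the scan-reduction idea in that answer (the documents and field names are invented for illustration; this is not how MongoDB executes the query internally):

```python
docs = [
    {"metro": "DC",  "text": "best pizza in town"},
    {"metro": "DC",  "text": "museum guide"},
    {"metro": "NYC", "text": "pizza by the slice"},
]

def search(docs, metro, word):
    # The cheap equality filter runs first, so the expensive text match
    # only touches the matching "bucket" -- the same effect the equality
    # field has on the $text scan above.
    bucket = [d for d in docs if d["metro"] == metro]
    return [d for d in bucket if word in d["text"]]

print(search(docs, "DC", "pizza"))
```

The more selective the equality field, the smaller the bucket the text search has to scan.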
I'm playing around with MongoDB at the moment to see what nice features it has. I've created a small test suite representing a simple blog system with posts, authors and comments, very basic.
I've experimented with a search function which uses the MongoRegex class (PHP driver), where I search through all post content and post titles for the phrase 'lorem ipsum' with the case-insensitive flag /i.
My code looks like this:
$regex = new MongoRegex('/lorem ipsum/i');
$query = array('post' => $regex, 'post_title' => $regex); // matches documents where both fields contain the phrase
But I'm confused and stunned by what happens. I time every query (setting microtime before and after the query and taking the difference to 15 decimal places).
For my first test I added 110,000 blog documents and 5,000 authors, everything randomly generated. When I run my search, it finds 6,824 posts with the phrase "lorem ipsum" and takes 0.000057935714722 seconds. This is after restarting the MongoDB service (on Windows), and without any index other than the default on _id.
MongoDB uses a B-tree index, which most definitely isn't very efficient for full-text search. If I create an index on my post content attribute, the same query runs in 0.000150918960571 seconds, which, funnily enough, is slower than without any index (slower by 0.000092983245849 seconds). This can happen for several reasons because it uses a B-tree cursor.
But I've tried to find an explanation for how it can query so fast. I guess it probably keeps everything in RAM (I have 4 GB and the database is about 500 MB). This is why I restart the MongoDB service, to get a clean result.
Can anyone with experience with MongoDB help me understand what is going on with this kind of full text search with or without index and definitely without an inverted index?
Sincerely
- Mestika
I think you simply didn't iterate over the results. With just a find(), the driver will not send a query to the server; you need to fetch at least one result for that. I don't believe MongoDB is this fast, and I believe the error is in your benchmark.
Secondly, for a regular expression search that is not anchored at the beginning of the field's value with ^, no index is used at all. You should play with explain() to see what is actually happening.
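The lazy-cursor behaviour described above can be illustrated with a Python generator (a stand-in for the driver, not the PHP driver itself): building the "cursor" costs nothing, and the simulated server round-trip only happens on the first fetch:

```python
import time

def fake_query():
    """Stands in for a driver cursor: the body runs only on iteration."""
    time.sleep(0.05)          # pretend this is the server round-trip
    yield from ["doc1", "doc2"]

start = time.perf_counter()
cursor = fake_query()                 # like find(): no query sent yet
build_time = time.perf_counter() - start

first = next(cursor)                  # first fetch actually does the work
fetch_time = time.perf_counter() - start

print(build_time, fetch_time, first)
```

Benchmarking only the find() call therefore measures cursor construction, not query execution, which would explain the implausibly small timings.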
I have an index with around 5 million documents that I am trying to do a "contains" search on. I know how to accomplish this and I have explained the performance cost to the customer, but that is what they want. As expected doing a "contains" search on the entire index is very slow, but sometimes I only want to search a very small subset of the index (say 100 documents or so). I've done this by adding a Filter to the search that should limit the results correctly. However I find that this search and the entire index search perform almost exactly the same. Is there something I'm missing here? It feels like this search is also searching the entire index.
Adding a filter to the search will not limit the scope of the index.
You need to be clearer about what you need from your search, but I don't believe what you want is possible.
Is the subset of documents always the same? If so, maybe you can get clever with multiple indices (e.g. search the smaller index and, if there aren't enough hits, then search the larger index).
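A sketch of that tiered fallback in Python (plain lists stand in for the Lucene indices here; all names are invented for illustration):

```python
def tiered_search(small_index, large_index, needle, min_hits=3):
    """Try the cheap small index first; only fall back to the slow
    full-index "contains" scan when there are too few hits."""
    hits = [doc for doc in small_index if needle in doc]
    if len(hits) >= min_hits:
        return hits
    return [doc for doc in large_index if needle in doc]

small = ["red apple", "green apple", "ripe apple"]
large = small + ["apple pie recipe", "banana bread"]

print(tiered_search(small, large, "apple"))   # small index suffices
print(tiered_search(small, large, "bread"))   # falls back to the large one
```

The win depends on the common case being answerable from the small index; worst-case queries still pay the full-scan cost.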
You can try SingleCharTokenAnalyzer.