Sort results of search by regex in MongoDB - mongodb

So, there's a web server with a number of methods that are used for autocompleting input fields on the client. Each method takes a string and scans a specific property of a MongoDB collection using a regexp.
Pretty common stuff, right? But here's the problem: these methods need to sort results by how close the searched string is to the start of the result string. For example, if I search for countries and type "ru", "Russia" should come before "Peru".
I don't see how I can sort results like this without performing multiple searches. Right now I can only think of something like this:
const limit = 20;
const resultsStartOfLine = db.countries.find({name: /^ru/i})
.limit(limit)
.toArray();
const resultsRest = db.countries.find({
name: /ru/i,
_id: {$nin: _.map(resultsStartOfLine, '_id')}
})
.limit(limit - resultsStartOfLine.length)
.toArray();
I know that Mongo can't do this kind of sort by default, but maybe there's a better way to do it?

As I've learned, searching by regex is usually bad practice because an unanchored or case-insensitive regex can't use indexes efficiently, and as a result it is pretty slow.
So I created an index for full-text search and now sort results by weights.
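For comparison, the ordering asked for in the question can also be produced in application code after a single regex query. This is only a sketch of that alternative, not the text-index approach described above; the helper name rankByMatchPosition and the sample country list are made up for illustration.

```javascript
// Rank autocomplete candidates by where the search term first appears.
// Assumption: `names` holds the strings returned by ONE unanchored,
// case-insensitive regex query; the helper name is hypothetical.
function rankByMatchPosition(names, term) {
  const needle = term.toLowerCase();
  return names
    .map((name) => ({ name, pos: name.toLowerCase().indexOf(needle) }))
    .filter((entry) => entry.pos !== -1) // drop anything that does not match
    .sort((a, b) => a.pos - b.pos || a.name.localeCompare(b.name))
    .map((entry) => entry.name);
}

const ranked = rankByMatchPosition(['Peru', 'Russia', 'Belarus', 'Cyprus'], 'ru');
console.log(ranked); // "Russia" (match at position 0) comes before "Peru" (position 2)
```

Note that this only gives the right top results if the limit is applied after ranking, so it suits small collections such as a list of countries better than large ones.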

Related

mongodb index on regex fields not working

I'm new to MongoDB and I'm facing a performance issue that I need your help with. I have a collection with 400k records. With no index on any field, each query takes 20-30s, so I created indexes for the fields that are usually used in search queries. The problem is that when I use $regex to search a string field that has an index on it, MongoDB does not use the index and still scans all records in the collection. I searched the internet for "index on regex fields mongodb" and found some answers which say that "MongoDB use prefix of RegEx to lookup indexes", which means you have to use the "^" prefix for the index to work, like "db.users.find({name: /^key word/})", but that is not working for me. Does "index on $regex field" need MongoDB Atlas to work? I ask because I'm using the community version of MongoDB. Thanks!
There's a lot to unpack here. We'll split the answer into two parts, the first to try and answer some of the direct questions about index usage and the second to explore solutions to satisfy the application requirements.
Index Usage with $regex
As is true with an index in any database that captures the full string value as the key, MongoDB can use the index for a $regex operation but its efficiency in doing so greatly depends on the regex being applied. That is what the Index Use documentation from the comments and the other answers you reference are describing.
In the comments you mention that an example query might be db.users.find({name: {$regex: '.*keyword.*', $options: 'i'}}). That means that the regex is both unanchored and case-insensitive. The aforementioned documentation states directly:
Case insensitive regular expression queries generally cannot use indexes effectively.
Why is this? Because the substring that you are searching for can be found in any string value captured by the index. So the document with matching value {name: 'a keyword'} might be located near one end of the index, {name: 'keyWord'} may be somewhere in the middle, and {name: 'Z keyword'} may be near the other end. The only way to ensure correct results is for the database to scan the index for all string values. So while it is still using the index, it may not be efficient, as most of the scanned values will not match and will be discarded.
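To make the difference concrete, here is a small illustration in which a sorted array stands in for the index keys. This is only an analogy under that assumption (a real B-tree index is not implemented this way), and the function names and sample keys are invented for the example.

```javascript
// A sorted array stands in for the index keys (illustrative assumption only).
const indexKeys = ['Z keyword', 'a keyword', 'apple', 'keyWord', 'kiwi', 'zebra'].sort();

// Lower-bound binary search: first position whose key is >= target.
function lowerBound(keys, target) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (keys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Anchored, case-sensitive prefix: seek to the prefix, stop at the first miss.
function prefixScan(keys, prefix) {
  let examined = 0;
  const matches = [];
  for (let i = lowerBound(keys, prefix); i < keys.length; i++) {
    examined++;
    if (!keys[i].startsWith(prefix)) break;
    matches.push(keys[i]);
  }
  return { matches, examined };
}

// Unanchored, case-insensitive substring: every key has to be tested.
function substringScan(keys, term) {
  const re = new RegExp(term, 'i');
  let examined = 0;
  const matches = keys.filter((key) => {
    examined++;
    return re.test(key);
  });
  return { matches, examined };
}

console.log(prefixScan(indexKeys, 'ki').examined);           // only a couple of keys
console.log(substringScan(indexKeys, 'keyword').examined);   // all keys
```

The prefix search touches only the keys around the matching range, while the substring search has to look at every key and discard the ones that do not match.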
You may always use .explain() to better understand how the database is answering the query, such as if and how it is using an index.
Solutions
So what do we do about this?
Well, as @rickhg12hs suggests in the comments, it depends on exactly what you are trying to achieve. You reiterate that you are looking for 'full regex search capability', but that is really an approach/solution rather than a goal. If what you really need, for example, is just to match an exact string in a case-insensitive manner, then something as simple as a case-insensitive index would likely do the trick.
However, if you truly do wish to perform arbitrary substring searching, then you are really looking at search engine capabilities. In that situation your best bets would probably be to emulate their indexes directly in MongoDB (e.g. have the application manually tokenize the strings to be indexed), stand up something like Solr/Elasticsearch next to MongoDB, or use MongoDB's Atlas Search offering. The $text operator mentioned in the comment has limitations when it comes to substring searching (such as just part of a word), which may or may not be relevant for your needs.
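As a rough idea of what "manually tokenize the strings" could look like, the sketch below splits a value into trigrams that the application would store in an indexed array field. The field name name_ngrams and both helper names are hypothetical; the answer does not prescribe this particular scheme.

```javascript
// Break a string into lowercase n-grams (trigrams by default).
// Hypothetical helper: the result would be saved in an array field
// (imagined here as `name_ngrams`) that carries an ordinary index.
function ngrams(text, n = 3) {
  const s = text.toLowerCase();
  const grams = new Set();
  for (let i = 0; i + n <= s.length; i++) {
    grams.add(s.slice(i, i + n));
  }
  return [...grams];
}

// Build a filter that finds candidate documents containing every n-gram of
// the search term. Candidates still need a final substring check, and terms
// shorter than n produce no n-grams and need separate handling.
function candidateFilter(term, n = 3) {
  return { name_ngrams: { $all: ngrams(term, n) } };
}

console.log(ngrams('Keyword'));
console.log(JSON.stringify(candidateFilter('eywo')));
```

The trade-off is extra storage and write cost for every indexed string in exchange for substring lookups that can use an index.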

Mongodb $in implementation and complexity

Here is our document schema
{
name: String
}
Here is our query
{
name: {$in: ["Jack", "Tom"]}
}
I believe that even if there isn't an index on name, the query engine will turn the array in the $in into a hashset and then check for presence as it scans through each record with a COLLSCAN, which is O(n). It will never do a naive O(m*n) search, right?
I'm trying to find supporting documentation online but I've come up short. I've tried searching in the source code but I can't seem to find the exact section responsible for this either.
If the index exists, I believe that it will use it directly instead and be faster. If I'm not wrong, I think it will be O(m*log(n)), as it gets the result set in log(n) time from the b-tree for every element in the $in array and returns the union of them all. Although in big-O terms this looks slower than the O(n) hashset approach for large m, it's faster in practice because disk reads are much more expensive.
Is this line of thinking correct when there is an index?
And if there isn't an index, does it do the COLLSCAN with a naive search, or will it use a hashset to speed up the process?
When setting up the query, the $in expression sorts the non-regex elements in the setEqualities function:
if (!std::is_sorted(_originalEqualityVector.begin(),
_originalEqualityVector.end(),
_eltCmp.makeLessThan())) {
std::sort(
_originalEqualityVector.begin(), _originalEqualityVector.end(), _eltCmp.makeLessThan());
}
It then tests the element from each document using the contains function, which uses a binary search:
bool InMatchExpression::contains(const BSONElement& e) const {
return std::binary_search(_equalitySet.begin(), _equalitySet.end(), e, _eltCmp.makeLessThan());
}
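The same sort-once, binary-search-per-document idea can be sketched in JavaScript. This is an illustration of the technique shown in the C++ excerpts, not a port of the server code; makeInMatcher and the sample documents are invented, and only string values are handled.

```javascript
// Sort the $in values once, then binary-search them for each scanned document.
// Illustrative only: handles plain strings, unlike the BSON comparison above.
function makeInMatcher(values) {
  const sorted = [...values].sort(); // one-time O(m log m) set-up
  return function contains(value) {
    let lo = 0;
    let hi = sorted.length - 1;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      if (sorted[mid] === value) return true;
      if (sorted[mid] < value) lo = mid + 1;
      else hi = mid - 1;
    }
    return false; // O(log m) comparisons per lookup
  };
}

// Simulated collection scan: n documents, each checked in O(log m).
const docs = [{ name: 'Jack' }, { name: 'Ann' }, { name: 'Tom' }, { name: 'Bob' }];
const inMatch = makeInMatcher(['Tom', 'Jack']);
const hits = docs.filter((doc) => inMatch(doc.name));
console.log(hits);
```

So without an index the scan is neither the naive O(m*n) nor the O(n) hashset version from the question, but roughly O(n*log(m)).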

Pymongo - find multiple different documents

My question is very similar to how-to-get-multiple-document-using-array-of-mongodb-id; however, I would like to find multiple documents without using the _id.
That is, consider that I have documents such as
document = { _id: _id, key_1: val_1, key_2: val_2, key_3: val_3}
I need to be able to .find() by multiple parameters, as for example,
query_1 = {key_1: foo, key_2: bar}
query_2 = {key_1: foofoo, key_2: barbar}
Right now, I am running a query for query_1, followed by a query for query_2.
As it turns out, this method is extremely inefficient.
I tried adding concurrency to make it faster, but the speedup was not even 2x.
Is it possible to query multiple documents at once?
I am looking for a method that returns the union of the matches for query_1 and query_2.
If this is not possible, do you have any suggestions that might speed up a query of this type?
Thank you for your help.
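One way to get the union in a single round trip is MongoDB's $or operator, which accepts a list of sub-queries. The sketch below builds such a filter and checks it against in-memory documents; unionFilter and matchesAny are hypothetical helpers, and the same filter shape can be passed to pymongo's find() as a Python dict.

```javascript
// Combine several equality queries into one filter using $or.
function unionFilter(...queries) {
  return { $or: queries };
}

// Tiny in-memory matcher (equality only) to show what the filter selects.
function matchesAny(doc, filter) {
  return filter.$or.some((query) =>
    Object.entries(query).every(([key, value]) => doc[key] === value)
  );
}

const query_1 = { key_1: 'foo', key_2: 'bar' };
const query_2 = { key_1: 'foofoo', key_2: 'barbar' };
const filter = unionFilter(query_1, query_2);

const sampleDocs = [
  { _id: 1, key_1: 'foo', key_2: 'bar', key_3: 'x' },
  { _id: 2, key_1: 'foofoo', key_2: 'barbar', key_3: 'y' },
  { _id: 3, key_1: 'foo', key_2: 'barbar', key_3: 'z' },
];
const matchedIds = sampleDocs.filter((doc) => matchesAny(doc, filter)).map((doc) => doc._id);
console.log(matchedIds);
```

Whether this is faster than two separate queries depends on the data; a compound index on key_1 and key_2 would likely matter more than how the query is phrased.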

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" using a MongoDB full-text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full-text index, so that I can use the words as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (e.g. "running" should match "run"). This is different from the prefix match (e.g. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
name: 'mysearch',
prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
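The keyword-list step described above could look roughly like the following in application code. This is a plain in-memory sketch standing in for the Map/Reduce job; buildKeywords, the tokenizing regex and the sample texts are assumptions made for the example.

```javascript
// Count word frequencies across documents and keep the most popular ones.
// In-memory stand-in for the Map/Reduce job; all names are hypothetical.
function buildKeywords(texts, maxKeywords = 1000) {
  const counts = new Map();
  for (const text of texts) {
    // Naive tokenizer: lowercase runs of ASCII letters and digits.
    for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // frequency, then A-Z
    .slice(0, maxKeywords)
    .map(([word]) => word);
}

const keywords = buildKeywords(
  ['Blueberry pie', 'blueberry jam and blue cheese', 'Cheese pie'],
  3
);
// The JSON below is what an endpoint like /data/keywords.json would serve.
console.log(JSON.stringify(keywords));
```

Limiting the list to the most frequent words keeps the prefetched JSON small enough to cache on the client.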
A simple workaround I am using right now is to break the text into individual characters and store them as a text-indexed array.
Then, when you run the $search query, you simply break the query up into characters again.
Please note that this only works for short strings, say shorter than 32 characters; otherwise building the index takes a really long time, so performance drops significantly when inserting new records.
You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words; they are stemmed. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can search with $regex after creating the proper index. E.g.:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.createIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON document-driven database, structurally similar to MongoDB. It offers edge n-gram analysis, which is really efficient and quick for partial (prefix) matching, and it can also give you "did you mean" suggestions for misspelled searches.
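To show what edge n-grams are, here is a plain JavaScript illustration of the idea. It is not Elasticsearch's implementation or API; the function name and the minGram/maxGram defaults are made up for the example.

```javascript
// Produce the leading substrings ("edge n-grams") of a word.
// Illustration of the concept only; names and defaults are hypothetical.
function edgeNgrams(word, minGram = 2, maxGram = 10) {
  const w = word.toLowerCase();
  const grams = [];
  for (let len = minGram; len <= Math.min(maxGram, w.length); len++) {
    grams.push(w.slice(0, len));
  }
  return grams;
}

const grams = edgeNgrams('Blueberry', 2, 5);
console.log(grams);
// Because "blue" is among the indexed grams, a search for "blue"
// can find "blueberry" with an exact term lookup.
console.log(grams.includes('blue'));
```

Indexing every prefix ahead of time is what turns a partial-word search into a cheap exact lookup at query time.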

mongodb- indexes on list fields for $all queries

I am making an application using pymongo wrapper for which my schema is like:
{
_id: <some_id>,
name: <some_name>,
my_tags: [<list_of_tags>]
}
Now I want to return the entries that fall under the user-specified tags. For example,
I want the entries where my_tags contains at least ["college", "USA", "engineering"]. I read that the $all construct can be used for that. What I want to know is: would it be of any use making an index on my_tags? For my app, this type of query is used extensively.
would it be of any use making an index on my_tags. For my app, this type of queries are used extensively.
Yes, $all will use an index, so it is still good to make one there. However, there are still optimisations that can be done for it: https://jira.mongodb.org/browse/SERVER-5331 and https://jira.mongodb.org/browse/SERVER-1000
Normally the docs will only warn you of when something can not use an index.
The syntax for the $all query is:
db.collection.find({'my_tags': {'$all': ['college', 'USA', 'engineering']}})
The documentation can be found at:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%24all
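For readers unsure what $all selects, the following in-memory sketch mimics its "contains at least these values" semantics for arrays of strings. matchesAll and the sample entries are hypothetical; an index on the array field would be created separately, for example with db.collection.createIndex({my_tags: 1}).

```javascript
// In-memory equivalent of {my_tags: {$all: requiredTags}} for string arrays.
function matchesAll(docTags, requiredTags) {
  const have = new Set(docTags);
  return requiredTags.every((tag) => have.has(tag));
}

const entries = [
  { name: 'a', my_tags: ['college', 'USA', 'engineering', 'public'] },
  { name: 'b', my_tags: ['college', 'USA'] },
  { name: 'c', my_tags: ['engineering', 'USA', 'college'] },
];
const required = ['college', 'USA', 'engineering'];
const selected = entries
  .filter((entry) => matchesAll(entry.my_tags, required))
  .map((entry) => entry.name);
console.log(selected);
```

Extra tags and tag order do not matter; only entries missing one of the required tags are excluded.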