MongoDB index on regex fields not working

I'm new to MongoDB and I'm facing a performance issue that I need help with. I have a collection with 400k records. Without an index on any field, each query takes 20-30s, so I created indexes on the fields that are usually used in search queries. The problem is that when I use $regex to search a string field that has an index on it, MongoDB does not use the index and still scans all records in the collection. I searched the internet for "index on regex fields mongodb" and found some answers saying that "MongoDB use prefix of RegEx to lookup indexes", which means you have to use the "^" prefix for the index to work, like "db.users.find({name: /^key word/})", but that is not working for me. Does "index on $regex field" need MongoDB Atlas to work? I ask because I'm using the community version of MongoDB. Thanks!

There's a lot to unpack here. We'll split the answer into two parts, the first to try and answer some of the direct questions about index usage and the second to explore solutions to satisfy the application requirements.
Index Usage with $regex
As is true with an index in any database that captures the full string value as the key, MongoDB can use the index for a $regex operation but its efficiency in doing so greatly depends on the regex being applied. That is what the Index Use documentation from the comments and the other answers you reference are describing.
In the comments you mention that an example query might be db.users.find({name: {$regex: '.*keyword.*', $options: 'i'}}). That means that the regex is both unanchored and case-insensitive. The aforementioned documentation states directly:
Case insensitive regular expression queries generally cannot use indexes effectively.
Why is this? Because the substring that you are searching for can be found in any string value captured by the index. So the document with matching value {name: 'a keyword'} might be located at one end of the index, {name: 'keyWord'} may be somewhere in the middle, and {name: 'Z keyword'} may be at the other end. The only way to ensure correct results is for the database to scan the index for all string values. So while it is still using the index, it may not be efficient, as most of the scanned values will not match and will be discarded.
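To make the difference concrete, here is a minimal sketch in plain Python that treats a sorted list as a stand-in for the index on name. It is not how MongoDB is implemented; the sample keys and the function names prefix_scan and substring_scan are made up for illustration. It only shows why an anchored, case-sensitive prefix can seek into a sorted structure while an unanchored, case-insensitive pattern has to look at every key.

```python
from bisect import bisect_left
import re

# Hypothetical stand-in for an index on `name`: the keys in sorted order.
index_keys = sorted([
    "Z keyword", "a keyword", "alpha", "key word list",
    "keyWord", "keyword", "zebra",
])

def prefix_scan(keys, prefix):
    """Anchored, case-sensitive prefix: seek to the first candidate key,
    then stop at the first key that no longer starts with the prefix."""
    start = bisect_left(keys, prefix)
    matches, examined = [], 0
    for key in keys[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, examined

def substring_scan(keys, pattern):
    """Unanchored, case-insensitive regex: matches can sit anywhere in the
    sorted order, so every key has to be examined."""
    rx = re.compile(pattern, re.IGNORECASE)
    matches = [key for key in keys if rx.search(key)]
    return matches, len(keys)

prefix_matches, prefix_examined = prefix_scan(index_keys, "key")
sub_matches, sub_examined = substring_scan(index_keys, "keyword")
print(prefix_matches, prefix_examined)
print(sub_matches, sub_examined)
```

In this toy data the prefix scan touches only the neighbouring keys that start with "key" plus one more to detect the end of the range, while the substring scan examines all seven keys to find matches scattered across the whole sorted order.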
You may always use .explain() to better understand how the database is answering the query, such as if and how it is using an index.
Solutions
So what do we do about this?
Well, as @rickhg12hs suggests in the comments, it depends on exactly what you are trying to achieve. You reiterate that you are looking for 'full regex search capability', but that is really an approach/solution rather than a goal. If what you really need, for example, is just to match an exact string in a case insensitive manner, then something as simple as a case insensitive index would likely do the trick.
However, if you truly do wish to perform arbitrary substring searching, then you are really looking at search engine capabilities. In that situation your best bets would probably be to emulate their indexes directly in MongoDB (e.g. have the application manually tokenize the strings to be indexed), stand up something like Solr/Elasticsearch next to MongoDB, or use MongoDB's Atlas Search offering. The $text operator mentioned in the comment has limitations when it comes to substring searching (such as just part of a word), which may or may not be relevant for your needs.
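As a rough illustration of the "manually tokenize" option, the sketch below splits a string into lowercase trigrams that the application could store in an extra array field and cover with an ordinary index. The field name name_ngrams, the trigram size, and the helper names are assumptions made for this example, not something prescribed by MongoDB.

```python
def ngrams(text, n=3):
    """Return the sorted, de-duplicated lowercase n-grams of `text`.
    Strings shorter than n produce no n-grams in this sketch."""
    s = text.lower()
    return sorted({s[i:i + n] for i in range(len(s) - n + 1)})

def with_ngrams(doc, field="name", target="name_ngrams"):
    """Copy `doc` and add the n-gram array (hypothetical field name)
    that the application would store and index alongside the original."""
    out = dict(doc)
    out[target] = ngrams(doc[field])
    return out

def substring_query(term, target="name_ngrams"):
    """Build a filter selecting documents whose n-gram array contains
    every n-gram of the search term."""
    return {target: {"$all": ngrams(term)}}

stored = with_ngrams({"name": "A Keyword"})
query = substring_query("word")
print(stored["name_ngrams"])
print(query)
```

Two caveats with this approach: search terms shorter than the n-gram size need separate handling, and a document can contain all of a term's n-grams without containing the term contiguously, so the application may still want to verify candidates (for example with a regex over the much smaller result set).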

Related

Case Insensitive query taking advantage of index

I have tried several ways of making a case-insensitive query on mongo. My ideal scenario is to use a case insensitive index, which is clearly documented here. The problem is that I need the query to find, from within the collection, values starting with a specific phrase, while not being case sensitive. The method that I have found that effectively achieves this is using a regex, but according to the documentation
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
Furthermore, text indexes will NOT satisfy my needs, as according to the documentation
For case insensitive and diacritic insensitive text searches, the $text operator matches on the complete stemmed word. So if a document field contains the word blueberry, a search on the term blue will not match. However, blueberry or blueberries will match.
Is there a way of achieving the desired functionality while making use of indexes, or a way that I could optimize my database to achieve this? The field in question is a short string limited to 30 characters

Reasons for creating an Index for a string field in MongoDB

When I create an Index on a string-type field in MongoDB I get no significant speed boost from it. In fact, when I use the query:
db.movies.find({plot: /some text/}).explain("executionStats")
An Index is slowing down the query by 30-50% in my Database (~55k Docs).
I know, that I can use a "text" Index, which is working fine for me, but I was wondering, why you would create a "normal" Index on a string field.
An index on a string field will improve the performance of exact matches, such as:
db.movies.find({name: "some movie"})
Indexes will also be used for find queries with a prefix expression:
db.movies.find({plot: /^begins with/})

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" by using a mongodb full text search, I want to help my users to complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a mongodb full text index -> that I can use the words as suggestions i.e. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
  name: 'mysearch',
  prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
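The keyword-list step described above could look roughly like the following sketch. It counts words in application code rather than with Map/Reduce, which is a simplification on my part, and the function names, the tokenizing regex, and the limit and min_length parameters are all hypothetical. The input is assumed to be the text of the fields you have the text index on.

```python
import json
import re
from collections import Counter

def keyword_list(texts, limit=1000, min_length=3):
    """Return the most frequent words across `texts`, most popular first."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:limit]]

def keywords_json(texts, **kwargs):
    """Serialize the keyword list as a JSON array of strings."""
    return json.dumps(keyword_list(texts, **kwargs))

sample = ["Blueberry pie and blueberry jam", "Fresh blueberries, fresh jam"]
print(keywords_json(sample, limit=4))
```

The resulting JSON array is what the application would serve at a path such as /data/keywords.json for the prefetch option shown earlier.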
A simple workaround I am using right now is to break the text into individual chars stored as a text-indexed array.
Then when you do the $search query, you simply break up the query into chars again.
Please note that this only works for short strings, say with a length smaller than 32; otherwise the index building process takes very long, so performance drops significantly when inserting new records.
You can not query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words, but are stemmed anyway. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search by $regex, by creating the proper index. E.g:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.createIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON document-driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partially.

Checking position of an entry in an index MongoDB

I have a query using pymongo that is outputting some values based on the following:
cursor = db.collect.find({"index_field": {"$regex": r"\s"}})
for document in cursor:
    print(document["_id"])
Now this query has been running for a long time (over 500 million documents) as I expected. I was wondering though if there is a way to check where the query is in its execution by perhaps finding out where the last printed "_id" is in the indexed field. Like is the last printed _id halfway through the btree index? Is it near the end?
I want to know this just to see if I should cancel the query and reoptimize and/or let it finish, but I have no way of knowing where the _id exists in the query.
Also, if anyone has a way to optimize my whitespace query, that would be helpful too. Based on the doc, it seems that if I had used ignorecase it would have been faster, although it doesn't make sense for whitespace checking.
Thanks so much,
J
Query optimization
Your query cannot be optimized, because it's an inefficient $regex search that's looking for the whitespace \s anywhere in the document. What you can do is search with $regex for a prefix of \s, e.g.
db.collect.find({"index_field": {"$regex": '^\\s'}})
Check out the notes in the link
Indexing problem
$regex can only use an index efficiently when the regular
expression has an anchor for the beginning (i.e. ^) of a string and is
a case-sensitive match. Additionally, while /^a/, /^a.*/, and
/^a.*$/ match equivalent strings, they have different performance
characteristics. All of these expressions use an index if an
appropriate index exists; however, /^a.*/, and /^a.*$/ are slower.
/^a/ can stop scanning after matching the prefix.
DB op's info
Use db.currentOp() to get info on all of your running ops.

MongoDB query array containing search text

I have the following query (MongoMapper/Rails):
Card.where(:card_tags => {:$all => search_tags})
Where card_tags is an array of string tags and search_tags is an array of the search strings. At the moment if someone searches 'snow', no results with tag 'snowboarding' will be returned.
How can I modify this query to search whether any strings in card_tags contains any of the strings in search_tags? Regular expressions come to mind but not sure of the syntax given these are arrays...
Thanks
You can use regular expressions but you will be doing full collection scans - this is going to be bad for performance.
You can use regex with an index only if you do "starts with" type searches, but I doubt you want to limit yourself to that.
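If "starts with" matching is acceptable, one way to express it is an $in filter whose values are anchored regular expressions, one per search string. The sketch below builds such a filter in Python rather than MongoMapper; the helper names are hypothetical, and it relies on my understanding that $in accepts regular-expression values and that PyMongo encodes compiled Python patterns as BSON regexes. The second function only emulates the match locally so the patterns can be checked without a server.

```python
import re

def tag_prefix_filter(search_tags, field="card_tags"):
    """Build a filter matching documents where any tag in the array
    starts with any of the search strings (case-sensitive)."""
    patterns = [re.compile("^" + re.escape(tag)) for tag in search_tags]
    return {field: {"$in": patterns}}

def matches_locally(filter_doc, doc, field="card_tags"):
    """Emulate the filter in plain Python, only to sanity-check patterns."""
    patterns = filter_doc[field]["$in"]
    return any(p.search(tag) for p in patterns for tag in doc.get(field, []))

snow_filter = tag_prefix_filter(["snow"])
print(matches_locally(snow_filter, {"card_tags": ["snowboarding", "winter"]}))
print(matches_locally(snow_filter, {"card_tags": ["powder snow"]}))
```

Dropping the "^" anchor would turn this into a "contains" search, at which point the performance warning above applies again.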
For fulltext searching, you are better off using some external search service for this - like Lucene, ElasticSearch, or Solr.
Refer to this post too: like query in mongoDB