Tackling MongoDB single text index limit

I have a collection of around 200m documents. I need to build a search for it that looks for substrings. Using a regex search is incredibly slow even with a regular index on the field being searched for.
The answer seems to be a text index, but only one text index is allowed per collection. I can make the text index cover multiple fields, but that would break the intended functionality by making results inaccurate: I need to specify the exact field a substring should appear in.
Is there any way around this limitation? The documentation says MongoDB's cloud offering allows multiple search indexes, but for this project I need to keep the data on our own servers.

Yeah, even if you index your field, an unanchored regex search still has to examine every key (effectively a full scan). And you can only have one text index per collection. Also, a text index is based on words, not substrings, so a text index would not help here anyway.
These indexes, including the text index, essentially pre-sort the documents by the indexed field in alphabetical (or reverse) order. A text index is similar, but a little better, because it indexes each word of the selected fields. Since you are searching for substrings, though, a text index would be equally useless.
To solve your problem, you would typically reach for a dedicated search database such as Elasticsearch.
Fortunately, MongoDB Atlas recently released the Atlas Search index, and it should solve your problem. You can index multiple (or all) fields, and it can also search substrings. It's basically a "search engine": just like Elasticsearch, it is built on the popular open-source search engine Lucene. Once you apply an Atlas Search index, you can use aggregate with a $search pipeline stage.
But in order to use this feature, you need MongoDB Atlas; as far as I know, this search index can only be created there. Once you have Atlas set up, applying and using the search feature is straightforward: go to your collection in MongoDB Atlas and apply the search index with a few clicks. You can fine-tune it (check the docs), but the default settings are a good start.
Using it in your backend is straightforward (adapted from the docs; the collection name, field, and index name below are illustrative):
db.articles.aggregate([
  {
    $search: {
      index: "default",  // the name of your Atlas Search index
      text: { query: "cake", path: "title" }
    }
  },
  { $limit: 10 }
])
Note that $search must be the first stage of the aggregation pipeline.

Related

What is the best index to create for such a collection in MongoDB?

I have a collection with documents like this:
{
  "_id": { "$oid": "60c5316cbc885e00c6e5abeb" },
  "name": "<name>",
  "addedAt": {
    "date": { "$date": "2021-06-12T22:13:00.316Z" },
    "timestamp": 1623535980.316648
  },
  "lastUsed": {
    "date": { "$date": "2021-06-22T14:17:23.339Z" },
    "timestamp": 1624371443.339323
  },
  "connStr": "http://<user>:<pwd>#<host>:<port>",
  "resetIpUri": "http://<host>:<port>/api/changeIP?apiToken=<token>",
  "lastResetIP": 1623535980.316648
}
And pretty simple queries:
db.collection.find({connStr: <connStr>})
db.collection.find({}).sort({"lastUsed.timestamp": 1})
But I'm not quite sure whether I need a text index or a regular one for the field connStr. I don't fully understand how text indexes work: do I always need them when I have to find a document by a field's value, and if so, should I use text indexes for integer or float fields too?
Text indexes and regular indexes are entirely different and should be used under entirely different scenarios.
Go for a normal index if:
You will be matching the exact string value of a key, or using regex operations on the stored string value.
For your queries, the normal index is what you are looking for.
Go for a text index if:
You want the values of a key to be stemmed for searching purposes. For example, likes, liked, likely, and liking are all stems of the root word like.
If you want a key's value to be searchable, like a textbook name or description, you can use text indexing and perform a text search on the key, which will do a stemmed search across all the words.
Note: I am no MongoDB Text Index expert and just have a vague idea of what it is. Any corrections and edits are welcomed.
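For the queries in this question, a minimal sketch (the collection name follows the question; regular indexes cover both the exact match and the sort):
db.collection.createIndex({ connStr: 1 })               // exact-match lookups
db.collection.createIndex({ "lastUsed.timestamp": 1 })  // supports the sort
// A text index, by contrast, tokenizes and stems words, which is not useful
// for opaque values like connection strings.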

Mongodb searching on array / indexing

I'm using the airbnb sample set and it has a field that looks like:
"amenities": ["TV", "Cable TV", "Wifi"....
So I'm trying to do a case-INsensitive, wildcard search (on one or more values passed in).
Only thing I've found that works is:
{ amenities: { $in: [ /wi/ ] }}
Is that the best way?
So I ran it in Compass as the dataset was imported (5600 docs), and the explain plan says it took ~20 ms on my machine and warned there was no index. I then created an index on the amenities field, and the same search jumped up to ~100 ms. I created the index through the Compass UI, so I'm not sure why it's taking 5x as long with an index. Is there a better way to do this?
The way to run that query is:
{ amenities: /wi/i }
// better, but not always useful: project only the indexed field
{ amenities: /wi/i }, { amenities: 1, _id: 0 }
The query already traverses the array, and to be case-insensitive the i flag must be in the regex options.
With a multikey index, the second query won't be a covered query; otherwise, it would be blazing fast.
I've tested a similar search with and without an index, though. Execution time was reduced 10x (1500 ms to 150 ms, in a huge collection), measured with Mongo Hacker.
As you report, executionTimeMillis is not that different in your case, but it is still smaller.
The reason you don't see a huge decrease in time is that the index stores each array entry separately. When it finds a match, it goes back to the collection to fetch the whole array field instead of answering from the index alone.
So indexes probably aren't very useful for this kind of query on arrays.
When querying with an unanchored regex, the query executor will have to scan every index key to see if there is a match.
You might find a collated index to be helpful.
Create an index with the appropriate collation, like this (strength 1 and strength 2 are both case-insensitive):
db.collection.createIndex({ amenities: 1 }, { collation: { locale: "en", strength: 1 } })
Then query using the same collation:
db.collection.find({ amenities: "wifi" }).collation({ locale: "en", strength: 1 })
The search will be case insensitive, and it can efficiently use the index.
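As a design note, strength 1 ignores both case and diacritics, while strength 2 ignores only case. To confirm the index is actually used, a quick check with explain (a sketch, reusing the hypothetical collection above):
db.collection.find({ amenities: "wifi" })
  .collation({ locale: "en", strength: 1 })
  .explain("queryPlanner")
// The winning plan should show an IXSCAN stage rather than a COLLSCAN.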

Reasons for creating an Index for a string field in MongoDB

When I create an index on a string-type field in MongoDB, I get no significant speed boost from it. In fact, when I use the query:
db.movies.find({plot: /some text/}).explain("executionStats")
the index slows the query down by 30-50% in my database (~55k docs).
I know that I can use a "text" index, which works fine for me, but I was wondering why you would create a "normal" index on a string field at all.
An index on a string field will improve the performance of exact matches like:
db.movies.find({name: "some movie"})
Indexes will also be used for find queries with a prefix expression (a regex anchored at the start of the string):
db.movies.find({plot: /^begins with/})
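A quick way to see the difference (a sketch, reusing the question's collection and fields) is to compare the query plans: the anchored, case-sensitive regex should show an IXSCAN stage, while the unanchored one has to examine every key or document.
db.movies.find({ plot: /^begins with/ }).explain("executionStats")  // IXSCAN
db.movies.find({ plot: /some text/ }).explain("executionStats")     // full scan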

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" using a MongoDB full-text search, I want to help my users complete the word "blue" to "blueberry". Is it possible to query all the words in a MongoDB full-text index, so that I can use them as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency counts) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic incremental Map/Reduce as new documents are added (a sketch of this step follows the list below).
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
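For the keyword-building step, a minimal sketch using an aggregation pipeline in place of the Map/Reduce mentioned above (the articles collection and body field are illustrative; $merge requires MongoDB 4.2+):
db.articles.aggregate([
  // naive whitespace tokenization of the indexed text field
  { $project: { words: { $split: [{ $toLower: "$body" }, " "] } } },
  { $unwind: "$words" },
  { $group: { _id: "$words", count: { $sum: 1 } } },  // word frequency
  { $sort: { count: -1 } },
  { $limit: 1000 },                                   // keep only "popular" words
  { $merge: { into: "keywords" } }                    // materialize the list
])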
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
  name: 'mysearch',
  prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
A simple workaround I am using right now is to break the text into individual characters and store them as a text-indexed array.
Then, when you run the $text search query, you simply break the query up into characters again.
Please note that this only works for short strings (say, length smaller than 32); otherwise the index-building process takes really long, and insert performance drops significantly.
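A minimal sketch of this workaround (the items collection and name field are hypothetical; default_language: "none" disables stemming and stop words so single characters are kept):
// store each string as an array of characters and text-index it
db.items.createIndex({ nameChars: "text" }, { default_language: "none" })
db.items.find().forEach(function (doc) {
  db.items.updateOne(
    { _id: doc._id },
    { $set: { nameChars: doc.name.toLowerCase().split("") } }
  );
});
// break the query into characters too; $text ORs the terms, so sort by score
db.items.find(
  { $text: { $search: "b l u e" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })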
You cannot query for all the words in the index, but you can of course query the original documents' fields. Note also that the words in the search index are not always full words; they are stemmed, so you probably wouldn't find "blueberry" in the index, just "blueberri".
This might be useful to people newly facing this problem.
Depending on the size of your collection and how much RAM you have available, you can search with $regex by creating the proper index. E.g.:
db.collection.find({ query: { $regex: /querywords/ } }).sort({ criteria: -1 }).limit(limit)
You would need an index as follows:
db.collection.createIndex({ query: 1, criteria: -1 })
This can be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution: go for Elasticsearch. It's a JSON-document-driven database, structurally similar to MongoDB. It has an edge_ngram analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partial words.

MongoDB indexing large text field doesn't seem to make query faster?

I have 1.5 million records, each with a text field "body" that contains a lot of text. I'm performing a full-text search against these documents using a regular expression, but I haven't noticed any difference in query times between indexing the data and not indexing it.
I ensured there was an index on the "body" field via
db.documents.ensureIndex({ body: 1 });
MongoDB took a few moments to index the data, and when I ran
db.documents.getIndexes()
it showed that I had an index on the collection's "body" field. But queries still take the same amount of time before and after indexing.
If I run the query
db.documents.find({ body: /test/i });
I would expect it to run faster because the data is indexed. When I do
db.documents.find({ body: /test/i }).explain();
mongo tells me that it's using the BTreeCursor on the body field.
Am I doing something wrong here? Why would there not be any decrease in query time after the text data has been indexed?
Check the docs for indexes and regex queries:
http://www.mongodb.org/display/DOCS/Advanced+Queries
For simple prefix queries (also called rooted regexps) like /^prefix/, the database will use an index when available and appropriate (much like most SQL databases use indexes for a LIKE 'prefix%' expression). This only works if you don't have i (case-insensitivity) in the flags.
Full-text search is a dedicated area where MongoDB doesn't really fit.
If you're looking for something open source and fast, you should try Apache Solr. We've been using it for 4 years now; it's been great value!
http://lucene.apache.org/solr/
You need to create a text search index on the field:
db.documents.createIndex({ body: "text" });
Once the text index is created, you can search as below. Note that $search takes a string, not a regex, and the search is case-insensitive by default:
db.documents.find({ $text: { $search: "test" } });