I have 1.5 million records, each with a text field "body" that contains a lot of text. I'm performing a full-text search against these documents using a regular expression but haven't noticed any difference in query times between indexing the data and not indexing it.
I ensured there was an index on the "body" field via
db.documents.ensureIndex({ body: 1 });
MongoDB took a few moments to index the data, and when I ran
db.documents.getIndexes()
it showed that I had an index on the collection's "body" field. But queries still take the same amount of time before and after indexing.
If I run the query
db.documents.find({ body: /test/i });
I would expect it to run faster because the data is indexed. When I do
db.documents.find({ body: /test/i }).explain();
mongo tells me that it's using the BTreeCursor on the body field.
Am I doing something wrong here? Why would there not be any decrease in query time after the text data has been indexed?
Check the docs for indexes and regex queries:
http://www.mongodb.org/display/DOCS/Advanced+Queries
For simple prefix queries (also called rooted regexps) like /^prefix/, the database will use an index when available and appropriate (much like most SQL databases that use an index for a LIKE 'prefix%' expression). This only works if you don't have i (case-insensitivity) in the flags.
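To illustrate, with the { body: 1 } index from the question in place, only a rooted, case-sensitive regex can take advantage of it (a sketch, not a benchmark):
// Anchored and case-sensitive: the planner can use tight bounds on the { body: 1 } index.
db.documents.find({ body: /^test/ });
// Unanchored and case-insensitive: every index key (or document) still has to be
// examined, which is why the timings look the same with and without the index.
db.documents.find({ body: /test/i });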
Full-text search is a dedicated area that MongoDB doesn't really cover.
If you're looking for something open source and fast, you should try Apache Solr. We've been using it for four years now, and it has been great value!
http://lucene.apache.org/solr/
You need to create a TEXT search index on the field.
db.documents.ensureIndex({ body: "text" });
Once the text search index is created, you can search as below (note that $search takes a string, not a regular expression, and text search is case-insensitive by default):
db.documents.find({ $text: { $search: "test" } });
I have a collection of around 200m documents. I need to build a search for it that looks for substrings. Using a regex search is incredibly slow even with a regular index on the field being searched for.
The answer seems to be a text index but there is only one text index allowed per collection. I can make the text index search multiple fields but that will actually break intended functionality as it will make results inaccurate. I need to specify the exact fields the substring should appear in.
Are there any ways around this limitation? The documentation says their cloud databases allow multiple indexes but for this project I need to keep data on our own servers.
Yes, even if you index your field, an unanchored regex search will still perform a collection scan. And you can only have one text index per collection. Also, a text index is based on words, not substrings, so it would not help here.
These indexes, including the text index, essentially pre-sort the documents by the indexed field in alphabetical (or reverse) order. A text index is similar, but a little better, because it indexes each word of the selected field. Since you are searching for substrings, though, a text index is equally useless in your case.
To solve this, you would typically have to reach for a dedicated search database such as Elasticsearch.
Fortunately, MongoDB Atlas recently released Atlas Search indexes, which should solve your problem. You can index multiple (or all) fields, and it can also search substrings. It's basically a "search engine": just like Elasticsearch, it is built on the popular open-source search library Lucene. After you create an Atlas Search index you can query it with an aggregation pipeline using the $search stage.
In order to use this feature, though, you need MongoDB Atlas; as far as I know, this kind of search index can only be created there. Once you have Atlas set up, applying and using it is straightforward: go to your collection in the Atlas UI and create the search index with a few clicks. You can fine-tune it (check the docs), but the default settings are a reasonable starting point.
Using it in your backend is very simple (example from the docs):
db.articles.aggregate([
    { $match: { $text: { $search: "cake" } } },
    { $group: { _id: null, views: { $sum: "$views" } } }
])
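Note that the example above uses the $text operator of a regular text index; with an Atlas Search index the query goes through the $search aggregation stage instead. A minimal sketch, assuming a search index named "default" and a field named "plot" (both are assumptions, adjust to your own index and schema):
// Runs only against an Atlas cluster that has a search index on this collection.
db.articles.aggregate([
    { $search: { index: "default", text: { query: "cake", path: "plot" } } },
    { $limit: 10 }
])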
I'm using the airbnb sample set and it has a field that looks like:
"amenities": ["TV", "Cable TV", "Wifi"....
So I'm trying to do a case-INsensitive, wildcard search (on one or more values passed in).
Only thing I've found that works is:
{ amenities: { $in: [ /wi/ ] }}
Is that the best way?
So I ran it in Compass right after the dataset was imported (5,600 docs), and Explain says it took ~20ms on my machine and warned there was no index. I then created an index on the amenities field through the Compass UI, and the same search jumped up to ~100ms. I'm not sure why it's taking 5x as long with an index, or whether there is a better way to do this?
The way to run that query is:
{ amenities: /wi/i }
//better but not always useful
{ amenities: /wi/i }, { amenities:1, _id:0 }
The query already traverses the array on its own, and for it to be case insensitive the i flag must be in the regex options.
With a multikey index, the second query (the one with the projection) can't be a covered query; otherwise it would be blazing fast.
I've tested a similar search with and without an index, though. Execution time is reduced 10x (1500 ms down to 150 ms, in a huge collection), measured with Mongo Hacker.
As you report, executionTimeMillis is not that different in your case, but it is still smaller.
The reason you don't see a huge decrease in time is that the index stores each array entry separately. When it finds a match, it has to go back to the collection to fetch the whole document instead of answering from the index alone.
Probably indexes aren't very useful for arrays.
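You can see this in the planner output with explain("executionStats"); a sketch (listingsAndReviews is the collection name in the sample_airbnb dataset, adjust if yours differs):
// With the multikey index on amenities in place, compare how many index keys
// were examined vs. how many documents were fetched.
db.listingsAndReviews.find({ amenities: /wi/i }).explain("executionStats")
// executionStats.totalKeysExamined, totalDocsExamined and executionTimeMillis
// show why the indexed run isn't much faster: every candidate key still
// triggers a fetch of the full document.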
When querying with an unanchored regex, the query executor will have to scan every index key to see if there is a match.
You might find a collated index to be helpful.
Create an index with the appropriate collation, like:
(strength 1 and 2 are case-insensitive)
db.collection.createIndex({amenities:1},{collation:{locale:"en",strength:1}})
Then query using the same collation:
db.collection.find({amenities:"wifi"}).collation({locale:"en",strength:1})
The search will be case insensitive, and it can efficiently use the index.
When I create an index on a string-type field in MongoDB I get no significant speed boost from it. In fact, when I use the query:
db.movies.find({plot: /some text/}).explain("executionStats")
An index actually slows the query down by 30-50% in my database (~55k docs).
I know that I can use a "text" index, which works fine for me, but I was wondering why you would create a "normal" index on a string field at all.
An index on a string field will improve the performance of exact matches, like:
db.movies.find({name: "some movie"})
Indexes will also be used for find queries with a prefix expression:
db.movies.find({plot: /^begins with/})
Since it is not possible to find "blueberry" by searching for the word "blue" with a MongoDB full-text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full-text index, so that I can use them as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
    name: 'mysearch',
    prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
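A rough sketch of the keywords-collection step described above, using the aggregation pipeline rather than Map/Reduce (the documents collection and body field are assumptions; substitute your own names):
// Split each body into lowercased words, count their frequency, and write the
// result to a "keywords" collection that /data/keywords.json can be built from.
db.documents.aggregate([
    { $project: { words: { $split: [ { $toLower: "$body" }, " " ] } } },
    { $unwind: "$words" },
    { $group: { _id: "$words", count: { $sum: 1 } } },
    { $out: "keywords" }
])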
A simple workaround I am doing right now is to break the text into individual characters stored as a text-indexed array.
Then when you do the $text/$search query you simply break up the query into characters again.
Please note that this only works for short strings (say, length smaller than 32); otherwise the index building process takes really long, and insert performance drops significantly when adding new records.
You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words; they are stemmed. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
I don't know if this might be useful to new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make the search with $regex by creating the proper index, e.g.:
db.collection.find({ query: { $regex: /querywords/ } }).sort({ criteria: -1 }).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet committed to a database architecture and are here looking for a solution, go for Elasticsearch. It's a JSON-document-driven database, structurally similar to MongoDB. It has an "edge n-gram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches, and you can also search partial words.
Some Background: I'm planning to use MongoDB as the publishing frontend DB for a few of my websites. The actual data will be kept in a SQL Server database, and background jobs will populate the MongoDB instance at predefined intervals for read-only use, to boost website performance.
The Situation: I have a table 'x' that I translated into a Mongo collection; everything worked fine.
'x' has a column 'c' that was originally a NVARCHAR(MAX) in the source db and has multilingual text in it.
When I was searching by column 'c', mongo was doing fullscan on the collection.
So I tried doing an ensureIndex({ c: 1 }), which worked, but when I checked the MongoDB logs they showed that 90% of the data could not be indexed, with [Key Too Large To Index] warnings.
And thus it has indexed only 10% of the data and now only returns results from that 10%.
What are my alternatives?
Note: I was using this column for full-text searching in SQL Server; now I'm not sure if I should go ahead with Mongo or not.
Try to run your mongod process with this parameter:
sudo mongod --setParameter failIndexKeyTooLong=false
And then try again.
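If restarting isn't convenient, the same parameter can also be set at runtime from the shell on server versions that still support it (it was removed along with the index key size limit in later releases); a sketch:
// Disable the "Key Too Large To Index" failure at runtime (pre-4.2 servers only).
db.adminCommand({ setParameter: 1, failIndexKeyTooLong: false })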
If you need to search text inside a large string, you can use one of these:
keyword splitting
regular expression
The former has the downside that you need some "logic" to combine the keywords into a search; the latter heavily impacts performance.
If you really need full-text search, the best option is probably to use an external indexer like Solr or Lucene.
Since you can do some pre-processing, you could extract some keywords and put them in a field:
_keywords : [ "mongodb" , "full search" , "nosql" ]
and make an index on that.
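A minimal sketch of that approach (the articles collection name and the keyword values are placeholders; extracting the keywords would happen in your application code):
// Store the extracted keywords alongside the document and index them.
db.articles.insert({
    c: "long multilingual text ...",
    _keywords: [ "mongodb", "full search", "nosql" ]
})
db.articles.ensureIndex({ _keywords: 1 })
// Exact keyword lookups can then use the multikey index.
db.articles.find({ _keywords: "mongodb" })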
Don't use Mongo for full-text searching;
it's not designed for that. And yes, you will obviously get a "key too large" error when indexing long string values.
A better approach would be to use a full-text search server (Solr/Lucene or Sphinx) if your main concern is search.
Recent (2.4 and above) MongoDB builds provide a couple other options:
As the OP's stated desire is full-text search, the right approach would be to use a text index, which directly supports that use case.
For an exact-match index on long string values you can use a hashed index.
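A quick sketch of both options, assuming the collection is named x and the field c as in the question, and a 2.6+ server for the $text query operator:
// Text index: indexes individual terms from the string, so long values
// don't run into the index key size limit the way a plain index does.
db.x.ensureIndex({ c: "text" })
db.x.find({ $text: { $search: "some words" } })
// Hashed index: supports exact equality matches on long string values.
db.x.ensureIndex({ c: "hashed" })
db.x.find({ c: "the exact full value of c" })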