MongoDB - Difference between index on text field and text index? - mongodb

For a MongoDB field that contains strings (for example, state or province names), what (if any) difference is there between creating an index on a string-type field:
db.ensureIndex( { field: 1 } )
and creating a text index on that field:
db.ensureIndex( { field: "text" } )
Where, in both cases, field is of string type.
I'm looking for a way to do a case-insensitive search on a text field which would contain a single word (maybe more). Being new to Mongo, I'm having trouble distinguishing between using the above two index methods, and even something like a $regex search.

The two index options are very different.
When you create a regular index on a string field it indexes the
entire value in the string. Mostly useful for single word strings
(like a username for logins) where you can match exactly.
A text index on the other hand will tokenize and stem the content of
the field. So it will break the string into individual words or
tokens, and will further reduce them to their stems so that variants
of the same word will match ("talk" matching "talks", "talked" and
"talking" for example, as "talk" is a stem of all three). Mostly
useful for true text (sentences, paragraphs, etc).
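As a rough illustration of the difference (collection and field names below are hypothetical):
// Hypothetical "articles" collection with a string field "title"
db.articles.createIndex({ title: 1 })                       // regular index: indexes the whole string value
db.articles.find({ title: "MongoDB text search basics" })   // exact-match lookups can use it
db.articles.createIndex({ title: "text" })                  // text index: tokenizes and stems the field
db.articles.find({ $text: { $search: "searching" } })       // matches "search" in the title via stemming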
Text Search
Text search supports the search of string content in documents of a
collection. MongoDB provides the $text operator to perform text search
in queries and in aggregation pipelines.
The text search process:
tokenizes and stems the search term(s) during both the index creation and the text command execution.
assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
The $text operator can search for words and phrases. The query matches
on the complete stemmed words. For example, if a document field
contains the word blueberry, a search on the term blue will not match
the document. However, a search on either blueberry or blueberries
will match.
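For example, with a hypothetical fruits collection that has a text index on a name field and a document { name: "blueberry" }:
db.fruits.find({ $text: { $search: "blue" } })          // no match: the index stores the stem "blueberri", and "blue" is a different term
db.fruits.find({ $text: { $search: "blueberries" } })   // matches: stems to "blueberri", same as "blueberry"
// each match is assigned a relevance score that you can project and sort on
db.fruits.find(
  { $text: { $search: "blueberry" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })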
$regex searches can be used with regular indexes on string fields to
provide some pattern matching and wildcard search. It is not a terribly
effective use of indexes, but it will use them where it can:
If an index exists for the field, then MongoDB matches the regular
expression against the values in the index, which can be faster than a
collection scan. Further optimization can occur if the regular
expression is a “prefix expression”, which means that all potential
matches start with the same string. This allows MongoDB to construct a
“range” from that prefix and only match against those values from the
index that fall within that range.
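A minimal sketch of that prefix-expression optimization (the username field is an assumption):
db.users.createIndex({ username: 1 })
db.users.find({ username: { $regex: /^joh/ } })    // anchored, case-sensitive prefix: scans only a range of the index
db.users.find({ username: { $regex: /joh/ } })     // unanchored: must examine every index entry
db.users.find({ username: { $regex: /^joh/i } })   // case-insensitive: also cannot be turned into an index range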
http://docs.mongodb.org/manual/core/index-text/
http://docs.mongodb.org/manual/reference/operator/query/regex/

text indexes allow you to search for words inside texts. You can do the same using a regex on a non text-indexed text field, but it would be much slower.
Prior to MongoDB 2.6, text search operations had to be made with their own command, which was a big drawback because you couldn't combine it with other filters, nor treat the result as a common cursor. As of now, the text search is just another operator for the typical find method, and that's super nice.
So, why is a text index, and its subsequent searches, faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to English). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
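For example, since 2.6 you can combine $text with ordinary filters in a single find() and treat the result as a normal cursor (collection and field names here are assumptions):
db.posts.createIndex({ body: "text" })
db.posts.find({ $text: { $search: "mongodb index" }, status: "published" }).limit(10)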
Keep in mind that the text index will grow along with your collection, and it can use a lot of space. I learnt this the hard way when using capped collections. There's no way to cap text indexes.
A regular index on a text field, such as
db.ensureIndex( { field: 1 } )
will be useful only if you search for the whole text. It's used, for example, to look up alphanumeric hashes. It doesn't make any sense to apply this kind of index when storing text paragraphs, phrases, etc.
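A sketch of that whole-value use case (field name and hash value are just examples):
db.files.createIndex({ sha256: 1 })
// only an exact match on the full stored string benefits from this index
db.files.find({ sha256: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08" })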

Related

Difference between wildcard search and individual text search

Is there a difference between a wildcard search index like $** and text indexes that I create for each of the fields in the collection?
I do see a small difference in response time when I individually create text indexes. Using individual indexes returns a faster response. I am not able to post an example now, but will try to.
A wildcard text search will index every field that contains string data for each document in the collection (https://docs.mongodb.com/manual/core/index-text/#wildcard-text-indexes).
Because you are essentially increasing the number of fields indexed with a wild card text index, it would take longer to run compared to targeting specific fields for a text index.
Since you can only have one text index per collection (https://docs.mongodb.com/manual/core/index-text/#create-text-index), it's worth considering which fields you plan on querying against beforehand.
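In shell terms, the two approaches look roughly like this (collection and field names are examples):
// Option A: wildcard text index - every string field of every document is tokenized and indexed
db.articles.createIndex({ "$**": "text" })
// Option B (an alternative, since a collection can have only one text index):
// target only the fields you actually query, which keeps the index smaller
db.articles.createIndex({ title: "text", body: "text" })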

Can a $text search perform a partial match

I'm very confused by this behavior. It seems inconsistent and strange, especially since I've read that Mongo isn't supposed to support partial search terms in full text search. I'm using version 3.4.7 of Mongo DB Community Server. I'm doing these tests from the Mongo shell.
So, I have a Mongo DB collection with a text index assigned. I created the index like this:
db.submissions.createIndex({"$**":"text"})
There is a document in this collection that contains these two values:
"Craig"
"Dr. Bob".
My goal is to do a text search for a document that has multiple matching terms in it.
So, here are tests I've run, and their inconsistent output:
SINGLE TERM, COMPLETE
db.submissions.find({"$text":{"$search":"\"Craig\""}})
Result: Gets me the document with this value in it.
SINGLE TERM, PARTIAL
db.submissions.find({"$text":{"$search":"\"Crai\""}})
Result: Returns nothing, because this partial search term doesn't exactly match anything in the document.
MULTIPLE TERMS, COMPLETE
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bob\""}})
Result: Returns the document with both of these terms in it.
MULTIPLE TERMS, ONE PARTIAL
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bo\""}})
Result: Returns the document with both terms in it, despite the fact that one term is partial. There is nothing in the document that matches "Dr. Bo"
MULTIPLE TERMS, BOTH PARTIAL
db.submissions.find({"$text":{"$search":"\"Crai\" \"Dr. Bo\""}})
Result: Returns the document with both terms in it, despite the fact that both terms are partial and incomplete. There is nothing in the document that matches either "Crai" or "Dr. Bo".
Question
So, it all boils down to: why? Why is it that when I do a text search with a single partial term, nothing gets returned, but when I do a text search with two partial terms, I get the matching result? It just seems so strange and inconsistent.
MongoDB $text searches do not support partial matching. MongoDB allows text search queries on string content with support for case insensitivity, delimiters, stop words and stemming. And the terms in your search string are, by default, OR'ed.
Taking your (very useful :) examples one by one:
SINGLE TERM, PARTIAL
// returns nothing because there is no whole word with the value `Crai` in your
// text index and there is no whole word for which `Crai` is a recognised stem
db.submissions.find({"$text":{"$search":"\"Crai\""}})
MULTIPLE TERMS, COMPLETE
// returns the document because it contains all of these words
// note in the text index Dr. Bob is not a single entry since "." is a delimiter
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bob\""}})
MULTIPLE TERMS, ONE PARTIAL
// returns the document because it contains the whole word "Craig" and it
// contains the whole word "Dr"
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bo\""}})
MULTIPLE TERMS, BOTH PARTIAL
// returns the document because it contains the whole word "Dr"
db.submissions.find({"$text":{"$search":"\"Crai\" \"Dr. Bo\""}})
Bear in mind that the $search string is ...
A string of terms that MongoDB parses and uses to query the text index. MongoDB performs a logical OR search of the terms unless specified as a phrase.
So, if at least one term in your $search string matches then MongoDB matches that document.
To verify this behaviour, if you edit your document changing Dr. Bob to DrBob then the following queries will return no documents:
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bo\""}})
db.submissions.find({"$text":{"$search":"\"Crai\" \"Dr. Bo\""}})
These now return no matches because Dr is no longer a whole word in your text index because it is not followed by the . delimiter.
You can do partial searching with Mongoose using an external library called mongoose-fuzzy-search, where the search text is broken into various n-grams.
For more information, visit this link.
User.fuzzySearch('jo').sort({ age: -1 }).exec(function (err, users) {
  if (err) return console.error(err);   // handle the query error first
  console.log(users);                   // documents fuzzily matching 'jo'
});

MongoDB multiple type of index on same field

Can I have multiple type of index on same field? Will it affect performance?
Example :
db.users.createIndex({"username":"text"})
db.users.createIndex({"username":1})
Yes, you can have different types of indexes on a single field, e.g. text, 2dsphere, hashed.
You cannot, however, create the same index twice with different sparse or unique options.
Every write operation will then have to update the relevant index entry in each of these indexes.
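A quick way to see each index being picked (a sketch; explain() output details vary by server version):
db.users.find({ username: "alice" }).explain("queryPlanner")           // equality on the whole value can use { username: 1 }
db.users.find({ $text: { $search: "alice" } }).explain("queryPlanner") // a $text query can only use the text index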

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" using a MongoDB full text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full text index, so that I can use those words as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on (see the sketch after this list), and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
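As a rough sketch of that first step (collection and field names such as articles and body are assumptions, not part of the original answer):
db.articles.mapReduce(
  function () {
    // tokenize the text field and emit each lowercased word
    this.body.split(/\W+/).forEach(function (word) {
      if (word.length > 2) emit(word.toLowerCase(), 1);
    });
  },
  function (key, values) {
    return Array.sum(values);  // usage frequency per word
  },
  { out: "keywords" }
)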
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
name: 'mysearch',
prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
A simple workaround I am doing right now is to break the text into individual chars stored in a text-indexed array.
Then, when you do the $search query, you simply break up the query into chars again.
Please note that this only works for short strings (say, length smaller than 32); otherwise the index building process will take really long, and performance will drop significantly when inserting new records.
You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words, but are stemmed anyway. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search by $regex, by creating the proper index. E.g:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON-document-driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partially.

Data model built on Mongo: store IDs as one massive string or array of strings? Is Mongo faster at using regular expressions or looking inside arrays?

We could use help on structuring our Mongo database. We need to store country IDs then run queries to return documents containing matching countries. Assume the IDs are strings 6-10 chars long.
Two options:
1) Store the country IDs as one massive string separated by some delimiter
(e.g., /). Ex: "IDIDID1/IDIDID2/IDIDID3/IDIDID4/IDIDID5".
2) Store the IDs in an array.
Ex: ["IDIDID1", "IDIDID2", "IDIDID3", "IDIDID4", "IDIDID5"].
We want to optimize for queries like "Find all documents containing country ID IDIDID3."
For option 1, we plan to use a RegEx to query documents (e.g., /IDIDID3/).
For option 2, we will use the standard $in operator.
Which option yields better read performance?
Does using the string approach yield better performance because you can index strings (as opposed to Mongo's limitation that only one array field per compound index can be indexed)?
We're using MongoMapper.
From the MongoDB Manual:
$regex can only use an index efficiently when the regular expression
has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match.
Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings,
they have different performance characteristics.
All of these expressions use an index if an appropriate index exists;
however, /^a.*/, and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
So using an array and a multikey index makes more sense in terms of performance.
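A minimal sketch of the array approach (the countryIds field name is an assumption):
db.docs.createIndex({ countryIds: 1 })                            // each array element is indexed (multikey index)
db.docs.find({ countryIds: "IDIDID3" })                           // equality match against the elements, no $regex needed
db.docs.find({ countryIds: { $in: ["IDIDID3", "IDIDID5"] } })     // several IDs at once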