MongoDB full text search - matching words and exact phrases

I'm currently having some issues with the full text search functionality in MongoDB, specifically when trying to match exact phrases.
I'm testing out the functionality in the mongo shell, but ultimately I'll be using Spring Data MongoDB with Java.
I first tried running this command to search for the words "delay" and "late" and the phrase "on time":
db.mycollection.find( { $text: { $search: "delay late \"on time\"" } }).explain(true);
The resulting explain output told me:
"parsedTextQuery" : {
"terms" : [
"delay",
"late",
"time"
],
"negatedTerms" : [ ],
"phrases" : [
"on time"
],
"negatedPhrases" : [ ] },
The issue here is that I don't want to search for the word "time", but rather for the phrase "on time". I do want to search for "delay" and "late", and ideally I would rather not disable stemming for them.
I tried a few different permutations, e.g.
db.mycollection.find( { $text: { $search: "delay late \"'on time'\"" } }).explain(true);
db.mycollection.find( { $text: { $search: "delay late \"on\" \"time\"" } }).explain(true);
But I couldn't seem to get the right results, and I can't see anything obvious in the documentation about this.
For my purposes, should I use the full text search for individual words and the regex search functionality for phrases?
I'm currently working with MongoDB version 2.6.5. Thanks.

Did you try the text search to see if it didn't behave correctly? It works as expected for me on MongoDB 2.6.7:
> db.test.drop()
> db.test.insert({ "t" : "I'm on time, not late or delayed" })
> db.test.insert({ "t" : "I'm either late or delayed" })
> db.test.insert({ "t" : "Time flies like a banana" })
> db.test.ensureIndex({ "t" : "text" })
> db.test.find({ "$text" : { "$search" : "time late delay" } }, { "_id" : 0 })
{ "t" : "I'm on time, not late or delayed" }
{ "t" : "Time flies like a banana" }
{ "t" : "I'm either late or delayed" }
> db.test.find({ "$text" : { "$search" : "late delay" } }, { "_id" : 0 })
{ "t" : "I'm on time, not late or delayed" }
{ "t" : "I'm either late or delayed" }
> db.test.find({ "$text" : { "$search" : "late delay \"on time\"" } }, { "_id" : 0 })
{ "t" : "I'm on time, not late or delayed" }
Why is "time" in the terms array in the explain? Because if the phrase "on time" occurs in a document, the term "time" must occur in it as well. MongoDB uses the text index as far as it can to locate candidate documents for the phrase, and then checks those candidates to see which of them actually contain the full phrase and not just the terms in the phrase.
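To make the split between terms and phrases concrete, here is a minimal sketch of a helper that assembles a $search string from plain words and exact phrases. The name buildSearchString is hypothetical (it is not part of MongoDB or any driver), and the sketch assumes the phrases themselves contain no double quotes.

```javascript
// Hypothetical helper: builds the value for the $search field of $text.
// Plain words are left bare; phrases are wrapped in double quotes so that
// they are treated as exact phrases.
function buildSearchString(words, phrases) {
  const quoted = phrases.map(function (p) {
    if (p.indexOf('"') !== -1) {
      throw new Error('phrases must not contain double quotes: ' + p);
    }
    return '"' + p + '"';
  });
  return words.concat(quoted).join(' ');
}

const search = buildSearchString(['delay', 'late'], ['on time']);
console.log(search);

// In the mongo shell the string could then be used as:
//   db.mycollection.find({ $text: { $search: search } })
```

Building the string in code avoids the nested escaping (\"on time\") that is needed when the query is typed as a shell string literal.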

Related

Calculate relevant result on full text search in mongodb

I am trying to get the most relevant results from MongoDB. Let's say that I have this collection:
{ "text" : "mitsubishi lancer 2011"}
{ "text" : "mitsubishi lancer 2011"}
{ "text" : "mitsubishi lancer 2011 in good conditions"}
{ "text" : "lancer 2011"}
{ "text" : "mitsubishi lancer 2014"}
{ "text" : "lancer 2016"}
and I run this query:
db.post.find({$text: {$search: "mitsubishi lancer 2011"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})
I get this result:
{ "text" : "mitsubishi lancer 2011", "score" : 2 }
{ "text" : "mitsubishi lancer 2011", "score" : 2 }
{ "text" : "mitsubishi lancer 2011 in good conditions", "score" : 1.7999999999999998 }
{ "text" : "lancer 2011", "score" : 1.5 }
{ "text" : "mitsubishi lancer 2014", "score" : 1.3333333333333333 }
{ "text" : "lancer 2016", "score" : 0.75 }
How do I know that the first two contain all the text that I searched for?
How is the score calculated?
The scoring algorithm is internal to MongoDB and should probably be expected to change over time so the precise values shouldn't matter. You can attempt to understand what's going on by looking at the sources if you want (although I wouldn't recommend that).
The final score depends on the number of occurrences of your searched terms (or rather their word stems), the distances between the matches, the match quality (full match vs. partial), language settings and weights which you can configure. That's all pretty hefty stuff that cannot easily be documented. There is, however, a blog post that explains some aspects quite nicely: https://blog.codecentric.de/en/2013/01/text-search-mongodb-stemming/
Also, things get a bit clearer once you try out various queries using different combinations of search terms and indexed data.
Lastly, if you want to find out if there is a perfect match, the only way I can think of to make this work is something like this:
db.getCollection('test').aggregate(
{
// do the normal filtering query
$match: {
$text: {
$search: "mitsubishi lancer 2011"
}
}
}, {
// select what's relevant in the output and add an indicator "perfectmatch"
$project: {
"text": 1,
"score": {
$meta: "textScore"
},
"perfectmatch": {
$cond: [
// Checks for a perfect match against the exact full string. To match individual
// tokens you would need to tokenize the query and add further checks here.
{ $eq: [ "$text", "mitsubishi lancer 2011" ] },
true,
false
]
}
}
}, {
// if you want to have the results sorted by "best match first"
$sort: {
"score": -1
}
})
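As a rough illustration of the "tokenize your query" idea mentioned in the comment, the following sketch checks on the client side whether a document's text contains every token of the query. The helper names tokenize and containsAllTokens are hypothetical, and the tokenization is a naive assumption: it lowercases and splits on non-alphanumeric characters, and does not reproduce MongoDB's stemming or stop-word handling.

```javascript
// Naive tokenizer (an assumption for illustration only): lowercase the input
// and split on anything that is not a letter or digit.
function tokenize(s) {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
    return t.length > 0;
  });
}

// Returns true when every token of `query` also occurs in `text`.
function containsAllTokens(text, query) {
  const present = new Set(tokenize(text));
  return tokenize(query).every(function (t) {
    return present.has(t);
  });
}

const texts = [
  'mitsubishi lancer 2011',
  'mitsubishi lancer 2011 in good conditions',
  'lancer 2011',
  'mitsubishi lancer 2014'
];
texts.forEach(function (text) {
  console.log(text, '->', containsAllTokens(text, 'mitsubishi lancer 2011'));
});
```

Such a check could be applied to the documents returned by the $text query to flag those that contain all of the search terms, independently of the score.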

$and operator on multiple $text search in mongo

Is it possible to use the $and operator with multiple $text searches in MongoDB?
I have these documents in the tp collection of my db:
> db.tp.find()
{ "_id" : ObjectId("...."), "name" : "tp", "dict" : { "item1" : "random", "item2" : "some" } }
{ "_id" : ObjectId("...."), "name" : "tp", "dict" : { "item3" : "rom", "item4" : "tttt" } }
Then I do
> db.tp.createIndex({ "$**": "text" })
> db.tp.find({ $and: [{$text : { $search: "random" } }, {$text : { $search: "redruth" } }]})
And it fails with
Error: error: {
"waitedMS" : NumberLong(0),
"ok" : 0,
"errmsg" : "Too many text expressions",
"code" : 2
}
Text index search works for a single search, though. So is it not possible to combine multiple text searches with the $and operator? By the way, I am using the wildcard specifier $** for indexing because I want to search over the entire document.
Based on the MongoDB docs, you can express AND directly in the search string by putting each term or phrase in double quotes, separated by spaces. For example, to search for "ssl certificate" AND "authority key", the query should look like:
> db.tp.find({'$text': {'$search': '"ssl certificate" "authority key"'}})
A query can specify at most one $text expression. See:
https://docs.mongodb.com/manual/reference/operator/query/text/
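Following the answer above, a single $text expression can stand in for the failed $and. This is a minimal sketch, assuming that quoted entries in one $search string are ANDed as that answer describes; allOfTextFilter is a hypothetical helper name, and quoting a single word turns it into an exact phrase match rather than a stemmed term match.

```javascript
// Hypothetical helper: turns a list of required terms or phrases into ONE
// $text filter, since a query may contain at most one $text expression.
function allOfTextFilter(required) {
  const search = required
    .map(function (r) { return '"' + r + '"'; })
    .join(' ');
  return { $text: { $search: search } };
}

const filter = allOfTextFilter(['random', 'redruth']);
console.log(JSON.stringify(filter));

// In the mongo shell the filter could be passed straight to find:
//   db.tp.find(filter)
```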

Stop mongodb from ignoring special characters?

Model.find({ $text : {$search: "#text"} })
returns everything that includes "text", not only those documents with "#text". I've tried putting an \ before the #, to no avail. How do I stop mongodb from ignoring my special characters? Thanks.
Tomalak's description of how text indexing works is correct, but you can actually use a text index for an exact phrase match on a phrase with a special character:
> db.test.drop()
> db.test.insert({ "_id" : 0, "t" : "hey look at all this #text" })
> db.test.insert({ "_id" : 1, "t" : "text is the best" })
> db.test.ensureIndex({ "t" : "text" })
> db.test.count({ "$text" : { "$search" : "text" } })
2
> db.test.count({ "$text" : { "$search" : "#text" } })
2
> db.test.find({ "$text" : { "$search" : "\"#text\"" } })
{ "_id" : 0, "t" : "hey look at all this #text" }
Exact phrase matches are indicated by surrounding the phrase in double quotes, which need to be escaped in the shell like "\"#text\"".
Text indexes are larger than normal indexes, but if you are doing a lot of case-insensitive exact phrase matches then they can be a better option than a standard index because they will perform better. For example, on a field t with an index { "t" : 1 }, an exact match regex
> db.test.find({ "t" : /#text/ })
performs a full index scan. The analogous (but not equivalent) text query
> db.test.find({ "$text" : { "$search" : "\"#text\"" } })
will use the text index to locate documents containing the term "text", then scan all those documents to see if they contain the full phrase "#text".
Be careful because text indexes aren't case sensitive. Continuing the example above:
> db.test.insert({ "_id" : 2, "t" : "Never seen so much #TEXT" })
> db.test.find({ "t" : /#text/ })
{ "_id" : 0, "t" : "hey look at all this #text" }
> db.test.find({ "$text" : { "$search" : "\"#text\"" } })
{ "_id" : 0, "t" : "hey look at all this #text" }
{ "_id" : 2, "t" : "Never seen so much #TEXT" }
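The two steps described above (locate candidates by term, then check for the full phrase) can be imitated in plain JavaScript. This is a toy model under stated assumptions, not MongoDB's actual implementation: the terms function is a naive stand-in for the index's tokenizer, and phraseSearch is a hypothetical name.

```javascript
// Naive stand-in for index tokenization: lowercase, split on non-alphanumerics.
function terms(s) {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Toy model of a phrase search over documents of the form { _id, t }.
function phraseSearch(docs, phrase) {
  const phraseTerms = terms(phrase);
  const needle = phrase.toLowerCase();
  return docs
    // Step 1: candidates containing every term of the phrase ("index lookup").
    .filter(d => phraseTerms.every(t => terms(d.t).includes(t)))
    // Step 2: keep candidates whose text contains the full phrase, ignoring case.
    .filter(d => d.t.toLowerCase().includes(needle));
}

const docs = [
  { _id: 0, t: 'hey look at all this #text' },
  { _id: 1, t: 'text is the best' },
  { _id: 2, t: 'Never seen so much #TEXT' }
];

// Case-insensitive phrase match, analogous to the $text query above.
console.log(phraseSearch(docs, '#text').map(d => d._id));
// Case-sensitive regex match, analogous to { "t" : /#text/ }.
console.log(docs.filter(d => /#text/.test(d.t)).map(d => d._id));
```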

MongoDB - Logical OR when searching for words and phrases using full text search

I asked a related question previously, and as suggested by the poster there have created this new question as a follow up:
MongoDB full text search - matching words and exact phrases
I was having some problems with unexpected results when using the full text search functionality in MongoDB, specifically when searching for a mixture of words and phrases.
Using this helpful example provided by the poster in the previous question...
> db.test.drop()
> db.test.insert({ "t" : "I'm on time, not late or delayed" })
> db.test.insert({ "t" : "I'm either late or delayed" })
> db.test.insert({ "t" : "Time flies like a banana" })
> db.test.ensureIndex({ "t" : "text" })
> db.test.find({ "$text" : { "$search" : "time late delay" } }, { "_id" : 0 })
{ "t" : "I'm on time, not late or delayed" }
{ "t" : "Time flies like a banana" }
{ "t" : "I'm either late or delayed" }
> db.test.find({ "$text" : { "$search" : "late delay" } }, { "_id" : 0 })
{ "t" : "I'm on time, not late or delayed" }
{ "t" : "I'm either late or delayed" }
> db.test.find({ "$text" : { "$search" : "late delay \"on time\"" } }, { "_id" : 0 })
{ "t" : "I'm on time, not late or delayed" }
The first two queries behave as I would expect, the first searching for "time OR late OR delay" and the second for "late OR delay".
I now understand from reading this section of the documentation http://docs.mongodb.org/manual/reference/operator/query/text/#phrases that the third query, which includes a phrase will search for "late OR delay AND ("on time")".
My question is, is it possible to search for "late OR delay OR ("on time")" in one text query?
I combed the docs on text search, and I'm afraid I don't think this is possible as of MongoDB 2.6. MongoDB's text search support is simply not as complete as a bona fide full text search engine (e.g. Solr/things built with the Lucene text search library). Right now, there's no support for boolean operators in text queries, so you cannot change the meaning of "late delay \"on time\"" from "(late OR delay) AND (\"on time\")" to "late OR delay OR \"on time\"". There might be some workarounds involving storing an array of tokens instead of or in addition to the text, or synchronizing with a full text search engine like ElasticSearch, but I'd rather know a bit more about the use case for the query before recommending any solutions.
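One further workaround, not taken from the answer above, is to run the word query and the phrase query separately and merge the results on the client. The sketch below only shows the merge step; unionById is a hypothetical helper, the two arrays stand in for the results of two separate find calls, and any textScore ordering is lost in the merge.

```javascript
// Hypothetical helper: merges several result arrays, keeping the first
// occurrence of each _id so that documents matched by both queries
// appear only once.
function unionById(...resultSets) {
  const seen = new Set();
  const merged = [];
  for (const results of resultSets) {
    for (const doc of results) {
      const key = String(doc._id);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(doc);
      }
    }
  }
  return merged;
}

// Stand-ins for the results of two separate queries, e.g.
//   db.test.find({ $text: { $search: "late delay" } }).toArray()
//   db.test.find({ $text: { $search: "\"on time\"" } }).toArray()
const wordMatches = [
  { _id: 1, t: "I'm on time, not late or delayed" },
  { _id: 2, t: "I'm either late or delayed" }
];
const phraseMatches = [
  { _id: 1, t: "I'm on time, not late or delayed" },
  { _id: 3, t: 'The train was on time' }
];

console.log(unionById(wordMatches, phraseMatches).map(d => d._id));
```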

mongoDB prefix wildcard: fulltext-search ($text) find part with search-string

I have a MongoDB collection with a $text index and elements like this:
{
foo: "my super cool item"
}
{
foo: "your not so cool item"
}
If I search with
mycoll.find({ $text: { $search: "super"} })
I get the first item (correct).
But I also want to be able to search with "uper" and get the first item. If I try:
mycoll.find({ $text: { $search: "uper"} })
I don't get any results.
My question:
Is there a way to use $text so that it finds results containing only part of the search string? (e.g. like '%uper%' in MySQL)
Attention: I'm not asking for a regex-only search; I'm asking for a regex search within a $text search!
It's not possible to do this with the $text operator.
Text indexes are built from the terms contained in a string value or in an array of strings, and the search is based on those indexed terms.
You can group terms into a phrase, but you cannot match only part of a term.
See the $text operator reference and the text indexes description.
The best solution is to use both a text index and a regex.
The text index provides excellent speed, but it won't match as many documents as a regex.
The regex acts as a fallback in case the index doesn't return enough results.
db.mycoll.createIndex({ foo: 'text' });
db.mycoll.createIndex({ foo: 1 });
db.mycoll.find({
$or: [
{ $text: { $search: 'uper' } },
{ foo: { $regex: 'uper' } }
]
});
For even better performance (but different results, since the regex then only matches at the start of the string), use ^ inside the regex:
db.mycoll.find({
$or: [
{ $text: { $search: 'uper' } },
{ foo: { $regex: '^uper' } }
]
});
What you are trying to do in your second example is a wildcard search on field foo of collection mycoll (matching "super" from the fragment "uper"). The text search feature is not designed for this, and it is not possible with the $text operator: it matches whole (stemmed) terms and does not support wildcard matching within a token of the indexed field. However, you can perform a regex search instead, as others suggested. Here is my walkthrough:
>db.mycoll.find()
{ "_id" : ObjectId("53add9364dfbffa0471c6e8e"), "foo" : "my super cool item" }
{ "_id" : ObjectId("53add9674dfbffa0471c6e8f"), "foo" : "your not so cool item" }
> db.mycoll.find({ $text: { $search: "super"} })
{ "_id" : ObjectId("53add9364dfbffa0471c6e8e"), "foo" : "my super cool item" }
> db.mycoll.count({ $text: { $search: "uper"} })
0
The $text operator supports searching for a single word, for one or more words, or for a phrase. The kind of search you want is not supported.
The regex solution:
> db.mycoll.find({foo:/uper/})
{ "_id" : ObjectId("53add9364dfbffa0471c6e8e"), "foo" : "my super cool item" }
>
The answer to your final question: to do a MySQL-style %super% in MongoDB you would most likely have to do:
db.mycoll.find( { foo : /.*super.*/ } );
It should also work with /uper/, since an unanchored regex already matches anywhere in the string.
See http://docs.mongodb.org/manual/reference/operator/query/regex/ for details.
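When the search fragment comes from user input, it is worth escaping regex metacharacters before building the pattern. The following is a minimal sketch: escapeRegex and containsFilter are hypothetical helper names, while $regex and $options are the query operators from the documentation linked above.

```javascript
// Escapes characters that have a special meaning in a regular expression,
// so that the input is matched literally.
function escapeRegex(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds a case-insensitive "contains" filter for one
// field, comparable to LIKE '%input%' in MySQL.
function containsFilter(field, input) {
  const filter = {};
  filter[field] = { $regex: escapeRegex(input), $options: 'i' };
  return filter;
}

console.log(JSON.stringify(containsFilter('foo', 'uper')));

// In the mongo shell the filter could be used as:
//   db.mycoll.find(containsFilter('foo', 'uper'))
```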
Edit:
As per request in the comments:
The solution wasn't necessarily meant to actually give what the OP requested, but what he needed to solve the problem.
Since $regex searches don't work with text indices, a simple regex search over an indexed field should give the expected result, though not using the requested means.
Actually, it is pretty easy to do this:
db.collection.insert( {foo: "my super cool item"} )
db.collection.insert( {foo: "your not so cool item"})
db.collection.ensureIndex({ foo: 1 })
db.collection.find({'foo': /uper/})
gives us the expected result:
{ "_id" : ObjectId("557f3ba4c1664dadf9fcfe47"), "foo" : "my super cool item" }
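An added explain shows that the index was used (an IXSCAN stage feeding the FETCH), although, because the regex is not anchored, every key in the index still has to be examined: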
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.collection",
"indexFilterSet" : false,
"parsedQuery" : {
"foo" : /uper/
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"foo" : /uper/
},
"keyPattern" : {
"foo" : 1
},
"indexName" : "foo_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"foo" : [
"[\"\", {})",
"[/uper/, /uper/]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
// skipped
},
"ok" : 1
}
To make a long story short: no, you cannot reuse a $text index for this, but you can still run the query against a regular index. As described in "Implement auto-complete feature using MongoDB search", one could probably be even more efficient with a map/reduce approach that eliminates redundancy and unnecessary stop words from the indices, at the cost of no longer being real time.
As francadaval said, text index is searching by terms but if you combine regex and text-index you should be good.
mycoll.find({$or: [
{
$text: {
$search: "super"
}
},
{
'column-name': {
$regex: 'uper',
$options: 'i'
}
}
]})
Also, make sure that the column has a normal index in addition to the text index.
If you go with a regex alone, you can match "super cool" but not "super item" (the two words are not adjacent in the text); to cover both, combine $text and $regex in an $or query for the search term.
Make sure the field has both a text index and a normal index for this to work.
You could also have achieved it like this:
db.mycoll.find( {foo: { $regex : /uper/i } })
Here 'i' is an option that makes the search case-insensitive.