I am using Perl's KinoSearch module to index a bunch of text.
Some of the text has numeric fields associated with each word. For example, the word "Pizza" in the index may have an associated dollar field with a value like "5.50".
How can I write a query in KinoSearch that will find all words that have a dollar value greater than 5?
I'm not even sure if a full-text search engine can do this kind of thing. It seems more like a SQL query.
After a bunch of searching (heh, heh), I found this in the docs: RangeQuery
I may be able to make that work. But it appears the new required classes are not part of the standard release, yet.
Related
I want to have some sort of limited indexed full-text search. With FTS, Postgres will index all the words in the text, but I want it to track only a given set of words. For example, I have a database of tweets, and I want them to be indexed by special words that I give: awesome, terrible, etc.
In case anyone is interested in such a specific thing: I solved it by creating a custom dictionary (thanks Mark).
My findings I documented here: https://hackmd.io/#z889TbyuRlm0vFIqFl_AYQ/r1gKJQBZS
I'm using SnowballAnalyzer in Lucene.Net 3.0.3, and it works well for stem matches. I would like to also support exact text matches, so if a user searches for "jumping jacks", in quotes, it will only match documents which contain that exact phrase. But the index will contain only the word stems, "jump" and "jack". Is it possible to index and search the original text while still supporting stemming?
I solved this using PerFieldAnalyzerWrapper for indexing and searching. Add two fields, one using SnowballAnalyzer and one using StandardAnalyzer. For exact phrases, search the field you've indexed with StandardAnalyzer, and for the rest use the SnowballAnalyzer one.
For a MongoDB field that contains strings (for example, state or province names), what (if any) difference is there between creating an index on a string-type field:
db.ensureIndex( { field: 1 } )
and creating a text index on that field:
db.ensureIndex( { field: "text" } )
Where, in both cases, field is of string type.
I'm looking for a way to do a case-insensitive search on a text field which would contain a single word (maybe more). Being new to Mongo, I'm having trouble distinguishing between using the above two index methods, and even something like a $regex search.
The two index options are very different.
When you create a regular index on a string field, it indexes the
entire value of the string. This is mostly useful for single-word
strings (like a username for logins) where you can match exactly.
A text index on the other hand will tokenize and stem the content of
the field. So it will break the string into individual words or
tokens, and will further reduce them to their stems so that variants
of the same word will match ("talk" matching "talks", "talked" and
"talking" for example, as "talk" is a stem of all three). Mostly
useful for true text (sentences, paragraphs, etc).
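The tokenize-and-stem step described above can be sketched in plain JavaScript. The stemmer below is a deliberately crude suffix-stripper, just to show the idea; MongoDB actually uses Snowball stemmers, which are far more sophisticated.

```javascript
// Toy illustration of what a text index does at build time:
// split the field value into tokens, then reduce each token to a stem.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

function stem(word) {
  // crude suffix stripping, for illustration only
  return word.replace(/(ing|ed|s)$/, "");
}

const indexed = tokenize("He talks while talking about topics he talked over")
  .map(stem);
// "talks", "talking" and "talked" all collapse to the stem "talk",
// which is why a search for any one variant can match the others.
```

The same stemming is applied to the search terms at query time, so the lookup is stem-against-stem.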
Text Search
Text search supports the search of string content in documents of a
collection. MongoDB provides the $text operator to perform text search
in queries and in aggregation pipelines.
The text search process:
tokenizes and stems the search term(s) during both the index creation and the text command execution.
assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
The $text operator can search for words and phrases. The query matches
on the complete stemmed words. For example, if a document field
contains the word blueberry, a search on the term blue will not match
the document. However, a search on either blueberry or blueberries
will match.
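As a sketch of the blueberry example from the quote, assuming a hypothetical collection named articles whose body field holds free text:

```javascript
// Hypothetical collection "articles"; "body" holds free text.
db.articles.ensureIndex({ body: "text" })

// Matches documents containing "blueberry" ("blueberries" stems
// to the same token, so it matches too):
db.articles.find({ $text: { $search: "blueberries" } })

// Does NOT match documents containing only "blueberry",
// because "blue" is a different stem:
db.articles.find({ $text: { $search: "blue" } })
```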
$regex searches can be used with regular indexes on string fields to
provide some pattern matching and wildcard search. Not a terribly
efficient use of indexes, but it will use them where it can:
If an index exists for the field, then MongoDB matches the regular
expression against the values in the index, which can be faster than a
collection scan. Further optimization can occur if the regular
expression is a “prefix expression”, which means that all potential
matches start with the same string. This allows MongoDB to construct a
“range” from that prefix and only match against those values from the
index that fall within that range.
http://docs.mongodb.org/manual/core/index-text/
http://docs.mongodb.org/manual/reference/operator/query/regex/
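The "prefix expression" optimization quoted above can be sketched in plain JavaScript: an anchored regex like /^abc/ only needs the index keys in the half-open range ["abc", "abd"). This is a simplification; the real implementation also handles edge cases such as a last character at the top of the code range.

```javascript
// Sketch of deriving an index range from a regex prefix: every match
// of /^abc/ must start with "abc", so only keys in ["abc", "abd")
// need to be examined.
function prefixRange(prefix) {
  // increment the last character to get the exclusive upper bound
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  return { min: prefix, max: upper }; // half-open range [min, max)
}

function inRange(value, { min, max }) {
  return value >= min && value < max;
}

const r = prefixRange("abc"); // { min: "abc", max: "abd" }
// "abcdef" falls inside the range; "abd" and "abb" do not.
```

An unanchored regex like /abc/ gives no such prefix, which is why it can force a scan of every key in the index.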
text indexes allow you to search for words inside texts. You can do the same using a regex on a non text-indexed text field, but it would be much slower.
Prior to MongoDB 2.6, text search operations had to be made with their own command, which was a big drawback because you couldn't combine them with other filters, nor treat the result as a common cursor. As of now, text search is just another operator for the typical find method, and that's super nice.
So, why is a text index, and its subsequent searches, faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to English). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
Keep in mind that the text index will grow along with your collection, and it can use a lot of space. I learnt this the hard way when using capped collections. There's no way to cap text indexes.
A regular index on a text field, such as
db.ensureIndex( { field: 1 } )
will be useful only if you search for the whole value. It's used, for example, to look up alphanumeric hashes. It doesn't make any sense to apply this kind of index when storing text paragraphs, phrases, etc.
I am trying to use full text search feature of MongoDB and observing some unexpected behavior. The problem is related to "stemming" aspect of the text indexing feature. The way full text search is described in many articles online, if you have a string "big hunting dogs" in a document's field that is part of the text index, you should be able to search on "hunt" or "hunting" as well as on "dog" or "dogs". MongoDB should normalize or stem the text when indexing and also when searching. So in my example, I would expect it to save words "dog" and "hunt" in the index and search for a stemmed version of this words. If I search for "hunting", MongoDB should search for "hunt".
Well, this is not how it works for me. I am running MongoDB 2.4.8 on Linux with full text search enabled. If my record has the value "big hunting dogs", only searching for "big" will produce a result, while searches for "hunt" or "dog" produce nothing. It is as if the words that are not in their "normalized" form are not stored in the text index (or are stored in a way it cannot find them). Searches using the $regex operator work fine, that is, I am able to find the document by searching on a string like /hunting/ against the field in question.
I tried dropping and recreating the full text index - nothing changed. I can only find the documents containing the words on their "normal" form. Searching for words like "dogs" or "hunting" (or even "dog" or "hunt") produces no results.
Do I misunderstand or misuse the full text search operations or is there a bug in MongoDB?
After a fair amount of experimenting and scratching my head I discovered the reason for this behavior. It turned out that the documents in the collection in question had attribute 'language'. Apparently the presence and the value of that attribute made these documents non-searchable. (The value happened to be 'ENG'. It is possible that changing it to 'eng' would make this document searchable again. The field, however, served a completely different purpose). After I renamed the field to 'lang' I was able to find the document containing the word "dogs" by searching for "dog" or "dogs".
I wonder whether this is expected behavior of MongoDB - that the presence of language attribute in the document would affect the text search.
Michael,
The "language" field (if present) allows each document to override the
language in which the stemming of words is done. I think that because
you specified a language MongoDB didn't recognize ("ENG"), it was
unable to stem the words at all. As others pointed out, you can use the
language_override option to specify that MongoDB should be using some
other field for this purpose (say "lang") and not the default one ("language").
Below is a nice quote (about full-text indexing and searching) which
is exactly related to your issue. It is taken from the book
"MongoDB: The Definitive Guide, 2nd Edition":
Searching in Other Languages
When a document is inserted (or the index is first created), MongoDB looks at the
index's fields and stems each word, reducing it to an essential unit. However, different
languages stem words in different ways, so you must specify what language the index
or document is. Thus, text-type indexes allow a "default_language" option to be
specified, which defaults to "english" but can be set to a number of other languages
(see the online documentation for an up-to-date list).
For example, to create a French-language index, we could say:
> db.users.ensureIndex({"profil" : "text", "interets" : "text"}, {"default_language" : "french"})
Then French would be used for stemming, unless otherwise specified. You can, on a
per-document basis, specify another stemming language by having a "language" field
that describes the document’s language:
> db.users.insert({"username" : "swedishChef", "profile" : "Bork de bork", language : "swedish"})
What the book does not mention (at least this page of it doesn't) is that
one can use the language_override option to specify that MongoDB
should be using some other field for this purpose (say "lang") and
not the default one ("language").
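As a sketch, assuming a hypothetical collection named tweets, the language_override option looks like this when creating the index:

```javascript
// Hypothetical collection "tweets"; the per-document language now
// lives in "lang", leaving "language" free for application use.
db.tweets.ensureIndex(
    { text: "text" },
    { language_override: "lang" }
)

// This document will be stemmed with the Swedish stemmer:
db.tweets.insert({ text: "Bork de bork", lang: "swedish" })
```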
In http://docs.mongodb.org/manual/tutorial/specify-language-for-text-index/ take a look at the language_override option when setting up the index. It allows you to change the name of the field that should be used to define the language of the text search. That way you can leave the "language" property for your application's use, and call it something else (e.g. searchlang or something like that).
How to sort a Thai field using mongoDB as in the following SQL command?
SELECT * FROM employee ORDER BY CONVERT(name USING tis620)
Right now, it is not possible for MongoDB to sort by anything other than "Unicode Code Point". There is an issue in our issue tracker: https://jira.mongodb.org/browse/SERVER-1920 which tracks the inclusion of locale-based and case-insensitive sorting in MongoDB.
Actually, there is a way! (Though it is a hack.)
I know this is an older thread, but I think it would be useful to answer anyways.
You definitely do not want to do the sorting in your app, because that means you have to get all documents in the collection into memory to sort them and return the window that you want. If your collection is huge, then this is extremely inefficient. The database should be doing the sorting and returning the window to you.
But, MongoDB doesn't support locale-sensitive sorting, you say. How do you solve the problem? The magic is the concept of "sort keys".
Basically, let's say you had the regular English/Latin alphabet from "a" to "z". What you would do is create a sort key mapping from "a" to "01", from "b" to "02", and so on through "z" to "26". That is, map every letter to its number in the sort order for that language, and encode that number as a string. Then map the string you want to sort on to this type of sort key; for example, "abc" becomes "010203". Finally, add a property to your document holding the sort key for the original property, and suffix its name with the locale:
{
name: "abc",
name_en: "010203"
}
Now you can sort in the language "en" just by indexing the property "name_en" and using plain old code-point-based MongoDB sorting on it, for selectors and ranges, instead of on the "name" property.
Now, let's say you have another crazy language "xx" where the order of the alphabet is "acb" instead of "abc". (Yes, there are languages that mess with the order of the Latin alphabet in that fashion!) The sort key would be like this:
{
name: "abc",
name_en: "010203",
name_xx: "010302"
}
Now, all you have to do is create indexes on name_en and name_xx and use the regular MongoDB sort in order to sort correctly on those locales. Basically, the extra properties are proxies for sorting in different locales.
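A minimal sketch of building such sort keys, using the hypothetical locale "xx" from above. This naive version assumes single-character mappings and two-digit positions (so alphabets of up to 99 letters), and ignores characters outside the alphabet entirely; a real collator handles contractions, accents, and case.

```javascript
// Build a locale sort key from a custom alphabet order.
function makeSortKey(value, alphabet) {
  return [...value]
    .map(ch => {
      const pos = alphabet.indexOf(ch) + 1; // 1-based position in the alphabet
      return String(pos).padStart(2, "0");  // zero-pad so keys compare bytewise
    })
    .join("");
}

const doc = { name: "abc" };
doc.name_en = makeSortKey(doc.name, "abcdefghijklmnopqrstuvwxyz"); // "010203"
doc.name_xx = makeSortKey(doc.name, "acb");                        // "010302"
```

Plain string comparison on the keys now reproduces each locale's order: in "xx", "acb" sorts before "abc" because its key "010203" is smaller.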
So where do you get these mappings, you ask? After all, you're no globalization expert, right?
Well, if you're using Java, C, or C++, there are ready-made classes that do this mapping for you. In Java, use the standard Collator class, or use the icu4j Collator class. If you are using C/C++, use the C/C++ version of the ICU Collator functions/classes. For other languages, you are sort of out of luck unless you can find a library that does it already.
I know that both Java and ICU support the Thai locale and can do proper sorting in Thai. Just make sure all your strings are properly encoded in UTF-8.
Here are some links to help you find them:
The standard Java library Collator: http://docs.oracle.com/javase/7/docs/api/java/text/Collator.html#getCollationKey(java.lang.String)
The C++ Collator class: http://icu-project.org/apiref/icu4c/classicu_1_1Collator.html#ae0bc68d37c4a88d1cb731adaa5a85e95
You can also make different sort keys that allow you to sort case-insensitively per locale (yes, case mapping is locale-sensitive!), accent-insensitively, Unicode-variant-insensitively, or any combination of the above. The only problem is that you now have many properties paralleling each sortable property, and you have to keep them all in sync whenever you update the base "name" property. It is a pain in the you-know-what, but still, it is better than doing the sorting in your app or business logic layer.
Also be careful of cursors with ranges. In English, for example, we just ignore accents on characters. So, an "Ö" sorts the same way as "O" and it will appear in the range "M" to "Z". But, in Swedish, accented characters sort after "Z". So, if you do a range "M" - "Z", you will include a bunch of records starting with "Ö" that should be there in English, but not in Swedish.
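If you happen to be on Node.js rather than Java or C/C++, Intl.Collator gives you the locale-sensitive comparison (though, unlike Java's Collator, it does not expose collation keys you could store in a document). The Swedish "Ö" behavior described above is easy to see, assuming a Node build with full ICU data (the default since Node 13):

```javascript
// "Ö" sorts with "O" in English, but after "Z" in Swedish.
const en = new Intl.Collator("en");
const sv = new Intl.Collator("sv");

const letters = ["Ö", "Z", "O"];

letters.slice().sort(en.compare); // → ["O", "Ö", "Z"]  (Ö treated like O)
letters.slice().sort(sv.compare); // → ["O", "Z", "Ö"]  (Ö after Z)
```

So the same range query "M" to "Z" on a stored sort key would include "Ö" for an English key but exclude it for a Swedish one, exactly the pitfall described above.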
This also has implications in sharding if you split on a text property of a document. Be careful about what ranges go into which shard. It would be better to shard on things that are not locale-sensitive, like hashes.