sphinx get non-stemmed results - sphinx

When I query Sphinx, it first applies stemming to my input keywords and returns a set of results with the words that matched. The problem is that the returned keywords are also stemmed.
Is there a way to get the original search keyword back from Sphinx instead of the stemmed one? For example, suppose I run a batch query with the following words:
credit card
working
walked
and Sphinx finds a match for credit card. Sphinx returns "cred card", and I have to check manually (by comparing strings) which of the above keywords the document(s) matched. That could be very inefficient in my circumstances.
Any suggestions?
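One way to avoid comparing strings against every keyword is to build a lookup table keyed by the stemmed form once per batch, so each returned keyword resolves with a single dictionary access. The sketch below assumes you can obtain the stemmed form of each query (for instance through the BuildKeywords call in the Sphinx API, or by running the same stemmer client-side). Only "cred card" comes from the question; the other stemmed values are illustrative guesses, and build_lookup is a hypothetical helper, not part of Sphinx.

```python
from collections import defaultdict

# Stemmed form of each original keyword. Only "cred card" comes from the
# question; the other two values are assumed for illustration.
STEMMED = {
    "credit card": "cred card",
    "working": "work",
    "walked": "walk",
}


def build_lookup(stemmed_by_original):
    """Invert original -> stemmed into stemmed -> [originals]."""
    lookup = defaultdict(list)
    for original, stemmed in stemmed_by_original.items():
        lookup[stemmed].append(original)
    return dict(lookup)


lookup = build_lookup(STEMMED)
# Resolve a keyword returned by the search engine back to what was typed.
print(lookup.get("cred card", []))
```

The values are lists because several original keywords (say "walked" and "walking") can share one stem.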

Related

Autocomplete and text search memory issues in apostrophe-cms: need ideas

I’m having trouble using the text search and the autocomplete because I have a piece with more than 87k documents, some of them big (~3.4 MB of text).
I already:
Removed every field from the text index except title, searchBoost and seoDescription; these are the only fields copied to highSearchText, and lowSearchText is always set to an empty string.
Modified the standard text index to include the fields type, published and trash at the beginning. I also modified the queries to use equality conditions on these fields. The output of db.aposDocs.stats() shows:
type_1_published_1_trash_1_highSearchText_text_lowSearchText_text_title_text_searchBoost_text: 12201984 (~11 MB, fits nicely in memory)
Verified that this index is being used, both in the ‘toDistinct’ query and in the final ‘toArray’ query.
What I think is the biggest problem
The documents have many repeated words in the title, so if the user types a word present in 5k document titles, the server suffers.
Idea I'm testing
The MongoDB docs say that, to improve performance, the entire collection must fit in RAM (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet).
So I created a separate collection named “search” with just the fields highSearchText (string, indexed as text) and highSearchWords (array, also indexed), which results in a total size of ~19 MB.
By running the same operations as the standard Apostrophe autocomplete against this collection, I got similar results much faster.
I had to write event handlers to update the search collection automatically when a piece changes, but it seems to work so far.
Issues
I'm testing this search collection with the autocomplete. For the simple text search, I’m just limiting the sorted response to 50 results. Maybe I'll have to use the search collection there as well, because the search could still break.
Is there some easier approach I'm missing? Please, any ideas are welcome.
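The separate-collection idea can be modelled in plain Python to show the moving parts: a compact record per piece, a hook that rewrites the record when the piece changes, and a prefix lookup capped at a fixed number of results. This is an in-memory sketch, not Apostrophe or MongoDB code. SearchIndex, upsert, remove and autocomplete are hypothetical names, only the title is indexed for brevity, and only the field names highSearchText and highSearchWords come from the post.

```python
import re


class SearchIndex:
    """In-memory stand-in for the small "search" collection."""

    def __init__(self):
        self.records = {}  # piece id -> compact search record

    def upsert(self, piece_id, title):
        # Called whenever a piece is saved, so the copy never goes stale.
        text = title.lower()
        words = sorted(set(re.findall(r"\w+", text)))
        self.records[piece_id] = {
            "highSearchText": text,
            "highSearchWords": words,
        }

    def remove(self, piece_id):
        # Called when a piece is deleted or moved to the trash.
        self.records.pop(piece_id, None)

    def autocomplete(self, prefix, limit=50):
        # Collect distinct words starting with the prefix, then cap the list.
        prefix = prefix.lower()
        matches = {
            word
            for record in self.records.values()
            for word in record["highSearchWords"]
            if word.startswith(prefix)
        }
        return sorted(matches)[:limit]


index = SearchIndex()
index.upsert(1, "Annual report 2019")
index.upsert(2, "Annual budget")
index.upsert(2, "Quarterly budget")  # piece 2 was edited
print(index.autocomplete("an"))
print(index.autocomplete("b"))
```

Because the distinct-word set is deduplicated before it is capped, a word that appears in thousands of titles costs one entry rather than thousands.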

SnowballAnalyzer - Exact match searches

I'm using SnowballAnalyzer in Lucene.Net 3.0.3, and it works well for stem matches. I would like to also support exact text matches, so if a user searches for "jumping jacks", in quotes, it will only match documents which contain that exact phrase. But the index will contain only the word stems, "jump" and "jack". Is it possible to index and search the original text while still supporting stemming?
I solved this using PerFieldAnalyzerWrapper for indexing and searching. Add two fields, one using SnowballAnalyzer and one using StandardAnalyzer. For exact phrases, search the field you've indexed with StandardAnalyzer, and for the rest use the SnowballAnalyzer one.
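The two-field approach can be illustrated without Lucene. The sketch below is plain Python rather than Lucene.Net: it keeps an exact token list and a stemmed token list per document, sends quoted queries to the exact list as a phrase match, and sends everything else to the stemmed list. toy_stem is a deliberately crude stand-in for SnowballAnalyzer, and every name here is hypothetical.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(word):
    # Crude suffix stripping, only good enough for this example.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_document(text):
    # Two "fields" per document: original tokens and stemmed tokens.
    exact = tokenize(text)
    return {"exact": exact, "stemmed": [toy_stem(w) for w in exact]}


def contains_phrase(tokens, phrase):
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(docs, query):
    quoted = len(query) >= 2 and query[0] == '"' and query[-1] == '"'
    if quoted:
        # Exact phrase: consult only the unstemmed field.
        phrase = tokenize(query)
        return [i for i, d in enumerate(docs) if contains_phrase(d["exact"], phrase)]
    # Otherwise stem the query terms and consult the stemmed field.
    terms = [toy_stem(w) for w in tokenize(query)]
    return [i for i, d in enumerate(docs) if all(t in d["stemmed"] for t in terms)]


docs = [index_document(t) for t in ("He did jumping jacks", "Jack jumps high")]
print(search(docs, '"jumping jacks"'))
print(search(docs, "jumping jacks"))
```

The quoted query matches only the first document, while the unquoted one matches both because "jumps" and "Jack" reduce to the same stems.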

Stemming does not work properly for MongoDB text index

I am trying to use the full text search feature of MongoDB and am observing some unexpected behavior. The problem is related to the "stemming" aspect of the text indexing feature. The way full text search is described in many articles online, if you have a string "big hunting dogs" in a document's field that is part of the text index, you should be able to search on "hunt" or "hunting" as well as on "dog" or "dogs". MongoDB should normalize or stem the text when indexing and also when searching. So in my example, I would expect it to save the words "dog" and "hunt" in the index and search for a stemmed version of these words. If I search for "hunting", MongoDB should search for "hunt".
Well, this is not how it works for me. I am running MongoDB 2.4.8 on Linux with full text search enabled. If my record has the value "big hunting dogs", only searching for "big" produces the result, while searches for "hunt" or "dog" produce nothing. It is as if the words that are not in their "normalized" form are not stored in the text index (or are stored in a way that it cannot find them). Searches using the $regex operator work fine; that is, I am able to find the document by searching on a string like /hunting/ against the field in question.
I tried dropping and recreating the full text index, and nothing changed. I can only find the documents containing words in their "normal" form. Searching for words like "dogs" or "hunting" (or even "dog" or "hunt") produces no results.
Do I misunderstand or misuse the full text search operations or is there a bug in MongoDB?
After a fair amount of experimenting and head-scratching I discovered the reason for this behavior. It turned out that the documents in the collection in question had an attribute named 'language'. Apparently the presence and the value of that attribute made these documents non-searchable. (The value happened to be 'ENG'. It is possible that changing it to 'eng' would make the document searchable again. The field, however, served a completely different purpose.) After I renamed the field to 'lang' I was able to find the document containing the word "dogs" by searching for "dog" or "dogs".
I wonder whether this is expected behavior of MongoDB - that the presence of language attribute in the document would affect the text search.
Michael,
The "language" field (if present) allows each document to override the
language in which the stemming of words would be done. I think, as
you specified to MongoDB a language which it didn't recognize ("ENG"),
it was unable to stem the words at all. As others pointed out, you can use the
language_override option to specify that MongoDB should be using some
other field for this purpose (say "lang") and not the default one ("language").
Below is a nice quote (about full text indexing and searching) which
is exactly related to your issue. It is taken from this book.
"MongoDB: The Definitive Guide, 2nd Edition"
Searching in Other Languages
When a document is inserted (or the index is first created), MongoDB looks at the
indexes fields and stems each word, reducing it to an essential unit. However, different
languages stem words in different ways, so you must specify what language the index
or document is. Thus, text-type indexes allow a "default_language" option to be
specified, which defaults to "english" but can be set to a number of other languages
(see the online documentation for an up-to-date list).
For example, to create a French-language index, we could say:
> db.users.ensureIndex({"profil" : "text", "interets" : "text"}, {"default_language" : "french"})
Then French would be used for stemming, unless otherwise specified. You can, on a
per-document basis, specify another stemming language by having a "language" field
that describes the document’s language:
> db.users.insert({"username" : "swedishChef", "profile" : "Bork de bork", language : "swedish"})
What the book does not mention (at least this page of it doesn't) is that
one can use the language_override option to specify that MongoDB
should be using some other field for this purpose (say "lang") and
not the default one ("language").
In http://docs.mongodb.org/manual/tutorial/specify-language-for-text-index/ take a look at the language_override option when setting up the index. It allows you to change the name of the field that should be used to define the language of the text search. That way you can leave the "language" property for your application's use, and call it something else (e.g. searchlang or something like that).
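The interaction between default_language, the per-document language field, and language_override can be modelled in a few lines. This is a plain-Python model of the selection rule described above, not MongoDB code; stemming_language is a hypothetical helper and the sample document is made up to mirror the question.

```python
def stemming_language(doc, default_language="english", language_override="language"):
    """Return the language a text index would use to stem this document.

    The field named by language_override wins when present; otherwise the
    index-wide default_language applies.
    """
    return doc.get(language_override, default_language)


# Document whose "language" field is application data, not a stemmer name.
doc = {"text": "big hunting dogs", "language": "ENG"}

# Default settings: the application value is picked up as the stemming language.
print(stemming_language(doc))

# With the override pointed at "lang", the "language" field is ignored.
print(stemming_language(doc, language_override="lang"))
```

With the default settings the unrecognized value "ENG" is selected, which matches the symptom in the question; pointing the override at another field restores the index default.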

Thinking sphinx fuzzy search?

I am implementing sphinx search in my rails application.
I want to search with fuzzy matching on. It should tolerate spelling mistakes, e.g. if I enter the search query charact*a*ristics, it should search for charact*e*ristics.
How should I implement this?
Sphinx doesn't naturally allow for spelling mistakes - it doesn't care if the words are spelled correctly or not, it just indexes them and matches them.
There are two options around this: either use thinking-sphinx-raspell to catch spelling errors by users when they search, and offer them the choice to search again with an improved query (much like Google does); or use the soundex or metaphone morphologies so words are indexed in a way that accounts for how they sound. Search on this page for stemming and you'll find the relevant section. Also have a read of Sphinx's documentation on the matter.
I've no idea how reliable either option would be - personally, I'd opt for #1.
By default, Sphinx does not pay any attention to wildcard searching using an asterisk character. You can turn it on, though:
development:
  enable_star: true
# ... repeat for other environments
See http://pat.github.io/thinking-sphinx/advanced_config.html Wildcard/Star Syntax section.
Yes, Sphinx generally uses the extended match modes.
There are the following matching modes available:
SPH_MATCH_ALL, matches all query words (default mode);
SPH_MATCH_ANY, matches any of the query words;
SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;
SPH_MATCH_BOOLEAN, matches query as a boolean expression (see Section 5.2, “Boolean query syntax”);
SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see Section 5.3, “Extended query syntax”);
SPH_MATCH_EXTENDED2, an alias for SPH_MATCH_EXTENDED;
SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below. NB, any query terms will be ignored, such that filters, filter-ranges and grouping will still be applied, but no text-matching.
SPH_MATCH_EXTENDED2 was used during 0.9.8 and 0.9.9 development cycle, when the internal matching engine was being rewritten (for the sake of additional functionality and better performance). By 0.9.9-release, the older version was removed, and SPH_MATCH_EXTENDED and SPH_MATCH_EXTENDED2 are now just aliases.
enable_star
Enables star-syntax (or wildcard syntax) when searching through prefix/infix indexes. Optional, default is 0 (do not use wildcard syntax), for compatibility with 0.9.7. Known values are 0 and 1.
For example, assume that the index was built with infixes and that enable_star is 1. Searching should work as follows:
"abcdef" query will match only those documents that contain the exact "abcdef" word in them.
"abc*" query will match those documents that contain any words starting with "abc" (including the documents which contain the exact "abc" word only);
"*cde*" query will match those documents that contain any words which have "cde" characters in any part of the word (including the documents which contain the exact "cde" word only).
"*def" query will match those documents that contain any words ending with "def" (including the documents that contain the exact "def" word only).
Example:
enable_star = 1
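The four cases above can be checked with a small pattern matcher. The sketch uses Python's fnmatch module to mimic the star syntax on individual words; it only illustrates the matching rules and does not involve Sphinx. The word list is made up, and note that fnmatch also gives special meaning to ? and [ ], which the star syntax described here does not mention.

```python
from fnmatch import fnmatchcase

WORDS = ["abc", "abcdef", "xcdey", "cde", "def", "undef"]


def matching_words(pattern, words=WORDS):
    """Return the words a star-syntax pattern would match."""
    return [w for w in words if fnmatchcase(w, pattern)]


for pattern in ("abcdef", "abc*", "*cde*", "*def"):
    print(pattern, matching_words(pattern))
```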

Can Kinosearch do mathematical comparisons on numbers like "greater-than"?

I am using Perl's KinoSearch module to index a bunch of text.
Some of the text has numeric fields associated with each word. For example, the word "Pizza" in the index may have a dollar field value like "5.50" (dollars).
How can I write a query in KinoSearch that will find all words that have a dollar value greater than 5?
I'm not even sure if a full-text search engine can do this kind of thing. It seems more like a SQL query.
After a bunch of searching (heh, heh), I found this in the docs: RangeQuery
I may be able to make that work. But it appears the new required classes are not part of the standard release, yet.
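Range queries in full-text engines often compare field values as strings rather than as numbers. Assuming that holds for the RangeQuery mentioned above (an assumption on my part, not something confirmed for KinoSearch here), prices need a fixed-width, zero-padded encoding so that string order agrees with numeric order. The sketch is plain Python; encode_price and the sample data other than "Pizza" are hypothetical.

```python
def encode_price(value, width=8):
    """Format a non-negative price as a zero-padded string, e.g. 5.5 -> '00005.50'."""
    return f"{value:0{width}.2f}"


# Word -> dollar value, as in the "Pizza" example from the question.
PRICES = {"Pizza": 5.50, "Soda": 1.25, "Steak": 10.00}

# Unpadded strings sort incorrectly: "10.00" < "5.50" lexicographically.
assert "10.00" < "5.50"

encoded = {word: encode_price(price) for word, price in PRICES.items()}
lower = encode_price(5)

# A string-based "greater than 5" range now agrees with the numeric answer.
hits = sorted(word for word, term in encoded.items() if term > lower)
print(hits)
```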