Use exact search with OR operator inside Sphinx query - sphinx

Sphinx extended syntax offers several search options.
Exact search: "I love to eat" //will match exact phrase
OR search: (eat|sleep|dream)
Is it possible to mix them and build queries like this:
"I love to (eat|sleep|dream)"
I know it is possible to simplify this by splitting the OR condition into separate exact phrases, like:
"I love to eat" | "I love to sleep" | "I love to dream"
But I plan to use a lot of OR groups with a lot of options inside each, and expanding the query this way will produce a huge one.
So is it possible to use OR syntax inside exact match syntax in Sphinx?

No, it's not possible to use 'OR' within the phrase operator (the double quotes around words, which require the words to be adjacent); that is the proper name for what you call 'exact match'.
Alas, there isn't a single operator that combines 'strict order' and 'near' (i.e. there isn't a 'just before' operator). So you are forced to use both, with something like
("I love to" << (eat|sleep|dream)) NEAR/3 ("I love to" NEAR/1 (eat|sleep|dream))
Which is no simpler, and I would argue is more complicated and convoluted! The NEAR/3 in the middle is needed to make sure you are matching on the same sentence within the document (otherwise there are edge cases with false positives).
An off-the-wall idea, if you have a lot of these 'OR' lists: rather than implementing them in the query, use wordforms instead. The drawback is that you need to know them in advance (i.e. they are compiled into the index) and 'opting out' is more complicated.
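If falling back to the expanded form from the question (one quoted phrase per combination) turns out to be acceptable, the long query does not have to be written by hand. A minimal Python sketch, assuming each query is described as a list of slots where every slot is a list of alternatives; the function name expand_phrases is made up for illustration and no escaping of Sphinx special characters is attempted:

```python
from itertools import product


def expand_phrases(slots):
    """Expand slots of alternatives into an OR of quoted phrases.

    slots is a list of lists, e.g. [["I love to"], ["eat", "sleep"]].
    Hypothetical helper: it does not escape Sphinx special characters.
    """
    # product() yields one tuple per combination, in slot order.
    phrases = (" ".join(combo) for combo in product(*slots))
    return " | ".join(f'"{phrase}"' for phrase in phrases)


query = expand_phrases([["I love to"], ["eat", "sleep", "dream"]])
print(query)
```

The number of phrases is the product of the slot sizes, so several large OR groups in one phrase will still grow quickly.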

Related

Is there any way to replicate regex term extraction with a Sphinx query?

Using a simple regex:
Status: (.*?),(.*?)\s
I can easily extract "Updated" and "In-Progress" from
Status: In-Progress,Updated
see https://regex101.com/r/mV7gF5/1
I am trying to do something similar with Sphinx since it is much faster. Is there any way to do this with SphinxQL? I don't even mind if it requires post-processing, but I can't for the life of me figure out a SphinxQL query for it, since it seems far more literal.

Well, Sphinx could give you a list of documents containing the word 'Status', and even ones containing Status: .*,.* if you were to add : and , to charset_table.
But it can't do any sort of term extraction; you would need to post-process those documents (and probably execute the regular expression against them!). The closest would be CALL SNIPPETS, which sort of does text matching, but it doesn't have a regex syntax.
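The post-processing step could look roughly like this in Python, assuming the matching document texts have already been fetched (how they are fetched from Sphinx is left out here). The pattern is the one from the question with one change, which is my assumption: the trailing \s is replaced by (?:\s|$) so that a status at the very end of a string also matches. The name extract_status is hypothetical.

```python
import re

# Pattern from the question, except the final \s also accepts end of string.
STATUS_RE = re.compile(r"Status: (.*?),(.*?)(?:\s|$)")


def extract_status(text):
    """Return the two comma-separated status terms, or None if absent."""
    match = STATUS_RE.search(text)
    return match.groups() if match else None


# Stand-in for documents returned by a Sphinx search for 'Status'.
documents = ["Status: In-Progress,Updated", "no status line here"]
for doc in documents:
    print(extract_status(doc))
```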

Override a stemmed word on the fly in a query with Sphinx?

If I turn on stemming/lemmatizer in Sphinx, can I push a term to it "as needed" that does not utilize stemming? I know I can use wordforms to always exclude that word from stemming, e.g. Radiology > Radiology, but that results in never stemming the word. I'm looking for a way to avoid adding a wordform exception and instead say in the query, in essence, 'look exactly for "Radiology" and do not stem/lemmatize'. I have tried "Radiology" instead of Radiology to no avail.

Enable index_exact_words on the index: http://sphinxsearch.com/docs/current.html#conf-index-exact-words
:)
Then you can do
=Radiology
(in extended match mode)

Sphinx with metaphone and wildcard search

We are an anatomy platform and use Sphinx for our search. We want to make our search fuzzier and have started to use metaphone to correct spelling mistakes. For example, it finds phalanges even though the search word is falanges.
That's good, but we want more. We want the user to be able to type falange or even falang and still find phalanges. Any ideas how to accomplish this?
If you are interested you can check out our Sphinx config file here.
Thanks!

Well, you can enable both metaphone and min_prefix_len on an index at once. It will sort of work.
falange*
might then just work (to match phalanges).
The problem is that the 'stripped' letters may change the 'sound' of the word (because they change the pronunciation),
e.g. falange becomes FLNJ, but falang actually becomes FLNK, so they are no longer 'substrings' of one another (i.e. phalanges becomes FLNJS, which FLNK* won't match).
... To be honest, I don't know a good solution. You could perhaps get better results if you were to apply stemming BEFORE metaphone (so the endings that change the pronunciation of the words are removed).
Alas, Sphinx can't do this. If you enable stemming and metaphone together, only ONE of the processors will ever fire.
Two possible solutions: implement stemming outside of Sphinx (or maybe with regexp_filter; I'm not sure whether, say, a Porter stemmer can be implemented purely with regular expressions),
or modify Sphinx so that ALL morphology processors apply (rather than just the first one that changes the word).
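To illustrate the first option (stemming outside of Sphinx), here is a deliberately crude Python sketch. It is not a Porter stemmer: the suffix list and the min_stem threshold are assumptions chosen only to cover the words in this thread, and the same function would have to be applied both to the text before indexing and to the user's query.

```python
# Hypothetical suffix list, for illustration only (not a real stemmer).
SUFFIXES = ("es", "e", "s")


def strip_ending(word, min_stem=4):
    """Remove one trailing suffix if at least min_stem letters remain."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


for word in ("phalanges", "falange", "falang"):
    print(word, "->", strip_ending(word))
```

After this step the indexed word and both query spellings differ only in ph versus f, which is the kind of difference metaphone already absorbs in the falanges example from the question.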

Sphinx search configuration for words ending with apostrophes

I am trying to improve my Sphinx configuration and I have trouble with words ending with apostrophes.
For example, for the result Surfin' USA, searching with "Surfin USA" returns a match but "Surfing USA" doesn't return anything. How can I set up Sphinx to return a result in such a situation?

Hmm, that's an interesting one. I'm not sure Sphinx can deal with this automatically, because it has no way of knowing what the apostrophe is meant to represent. I suppose there are cases where it could be multiple things.
The only way I can think of would be to list them in exceptions; you can build a list of all the words you want to support:
Surfin' => Surfing
You have to use exceptions to be able to use the apostrophe.
You might want to add
Surfin => Surfing
too, so you can search without the apostrophe as well.
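If the list of such words is long, the mapping lines can be generated. A small Python sketch, assuming the only pattern of interest is a trailing in' standing for ing, and writing each line in the map-from => map-to form that the Sphinx exceptions file expects; the function name exception_lines is made up for illustration:

```python
def exception_lines(words):
    """Build mapping lines for words ending in "in'" (e.g. Surfin')."""
    lines = []
    for word in words:
        if not word.endswith("in'"):
            continue  # only the dropped-g pattern is handled here
        bare = word[:-1]   # Surfin' -> Surfin
        full = bare + "g"  # Surfin  -> Surfing
        lines.append(f"{word} => {full}")
        lines.append(f"{bare} => {full}")
    return lines


print("\n".join(exception_lines(["Surfin'", "Rockin'"])))
```

Note that this blindly assumes every such apostrophe stands for a g, which is exactly the guess Sphinx cannot make on its own, so the input list still needs to be curated by hand.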

Perl module for text comparison

Can anyone suggest a Perl module which can compare two strings and return a degree to which they match? I searched CPAN extensively, and although there are similar modules like String::Approx and Data::Compare, they are not what I am looking for. Suppose I have two strings: I love you, and I boht you. I want functionality which will compare these two strings, taking into account numerous parameters: the matching of words in correct order (love as the first word in one string should not "match" love as the 4th word in the 2nd string, even though both strings have that word), words not matching but spelt almost the same (like, say, love and loge), number of words, etc., and return an index, say a number from 0 to 1, representing the degree of similarity between the two strings. Is there any such Perl module?

There are many such modules. Often, though, you'll have to make use of them in some special way to account for your own assumptions. Most of the string comparison tools like this just implement some algorithm for comparing one string to another. Most assume that if you have specific policy decisions to make, you'll code them yourself.
Personally, I am not sure I'd recommend Text::Levenshtein because of bugs and lack of UTF-8 support. I don't have a better recommendation either, though.
However, these searches will reveal lots of potential modules you could look into and determine what works best for your purpose (based on the names of common algorithms for doing this sort of thing):
https://metacpan.org/search?q=levenshtein
https://metacpan.org/search?q=wagner+fischer
https://metacpan.org/search?q=edit+distance
If you're interested in spoken similarities, you can also look into phonetic comparisons:
https://metacpan.org/search?q=phonetic
https://metacpan.org/search?q=soundex
https://metacpan.org/search?q=metaphone
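To make the "code the policy yourself" point concrete, here is a sketch of one possible policy, written in Python rather than Perl purely for illustration. It compares words position by position (so love as word 1 never matches love as word 4), scores each pair by normalized edit distance, and divides by the longer word count so that extra words lower the score. All function names and the scoring rule are my own assumptions, not any CPAN module's behaviour.

```python
def levenshtein(a, b):
    """Edit distance between two strings (Wagner-Fischer, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def word_similarity(a, b):
    """1.0 for identical words, falling towards 0.0 as edits increase."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sentence_similarity(s1, s2):
    """Position-wise word similarity, normalized to the range 0..1."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 and not w2:
        return 1.0
    total = sum(word_similarity(x, y) for x, y in zip(w1, w2))
    return total / max(len(w1), len(w2))


print(sentence_similarity("I love you", "I loge you"))
```

The edit-distance part is what the Levenshtein and Wagner-Fischer searches above would provide; the word alignment and normalization are the policy decisions a module is unlikely to make for you.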