Morphology interfering with Wordforms in Sphinx - sphinx

With stem_en on, 'Children' matches 'Childrens' and vice versa, without any wordforms.
If I map Children to Term2 in wordforms, then ONLY Children maps to Term2, not Childrens. I assume that is because adding Children to wordforms tells Sphinx to exclude it from morphology.
Is there no way to tell Sphinx that I want the Children/Childrens stemming to still be used and that I also want to map Children to a non-morphology-related word (Term2)?

You need to use the ~ (tilde) prefix:
http://sphinxsearch.com/docs/current.html#conf-wordforms
Starting with version 2.1.1-beta, ... Finally, if a line starts with a tilde ("~") the wordform will be applied after morphology, instead of before.
That way the wordforms don't act as stemming exceptions, but are applied to the stemmed versions of the words.
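A minimal sketch of how this could look for the Children/Term2 case. The index name and file path are placeholders, and it assumes Sphinx 2.1.1-beta or later with stem_en, and that stem_en reduces both "children" and "childrens" to "children" (worth verifying against your own index):

```text
index myindex
{
    # source, path, etc. omitted
    morphology = stem_en
    wordforms  = /path/to/wordforms.txt
}
```

And in wordforms.txt, with the stemmed form on the left:

```text
~children > term2
```

Because the tilde line is applied after morphology, any word that stems to "children" should end up indexed as term2, rather than only the literal word "children".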

Related

Override a stemmed word on the fly in a query with Sphinx?

If I turn on stemming/lemmatizer in Sphinx, can I pass it a term "as needed" that should not be stemmed? I know I can use wordforms to always exclude a word from stemming, e.g. Radiology > Radiology, but that results in the word never being stemmed. I'm looking for a way to avoid adding a wordform exception and instead say in the query, in essence, 'look exactly for "Radiology" and do not stem/lemmatize'. I have tried "Radiology" (in quotes) instead of Radiology, to no avail.
Enable index_exact_words:
http://sphinxsearch.com/docs/current.html#conf-index-exact-words
Then you can search for
=Radiology
(in extended match mode)
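A minimal sketch of the setup, assuming an index named myindex (a placeholder) that already uses stem_en. Since index_exact_words changes what is stored in the index, the index needs to be rebuilt after enabling it:

```text
index myindex
{
    morphology        = stem_en
    index_exact_words = 1
}
```

A query using the exact-form operator might then look like this in SphinxQL, where MATCH() uses the extended syntax:

```text
SELECT * FROM myindex WHERE MATCH('=Radiology');
```

Without the = prefix, the same keyword goes through stemming as usual.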

Can you change priority between wordform and lemmatizer in Sphinx?

If I turn the lemmatizer on, then plurals all work, e.g.
Office=Offices
Dog=Dogs
However, if I add a wordform that is unrelated to plurals, like
100 > Hundred
then Hundred will not match Hundreds (I realize this is not a perfect example, so don't take it literally).
So the question is: is there any other type of wordform or process that will let you apply stemming first and the wordform second? In this case it would stem Hundreds to Hundred, so that 100 would match both Hundred and Hundreds.
See http://sphinxsearch.com/docs/current.html#conf-wordforms
There is a special syntax to use with morphology. Alternatively, put the stemmed form on the right side:
100 => Hundr
You need to apply the morphology to the right side manually.
There is some code here that might help with creating this style of wordforms:
http://sphinxsearch.com/forum/view.html?id=13907

How to get wordforms in sphinx?

How can I get all morphological forms of a word?
For example, if the search keyword is:
runner
the result should be:
run, running, etc.
You usually don't need that. Use a stemmer, which does the reverse: it removes the "ending" so that matching works, rather than trying to figure out all the possible endings.
https://en.wikipedia.org/wiki/Stemming
i.e. use morphology rather than wordforms.
http://sphinxsearch.com/docs/current.html#conf-morphology

Sphinx with metaphone and wildcard search

We are an anatomy platform and use Sphinx for our search. We want to make our search fuzzier and have started using metaphone to correct spelling mistakes. For example, it finds phalanges even though the search word is falanges.
That's good, but we want more. We want the user to be able to type falange or even falang and still find phalanges. Any ideas how to accomplish this?
If you are interested, you can check out our Sphinx config file here.
Thanks!
Well, you can enable both metaphone and min_prefix_len on an index at once. It will sort of work.
falange*
might then just work (to match phalanges).
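As a rough sketch, the combination could be configured like this. The index name and the prefix length of 3 are placeholder choices, and depending on the Sphinx version you may also need enable_star = 1 for the * syntax to be recognised:

```text
index anatomy
{
    # source, path, etc. omitted
    morphology     = metaphone
    min_prefix_len = 3
}
```

The index has to be rebuilt after changing these settings, and the query would then be falange* in extended match mode.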
The problem is that the 'stripped' letters may change the 'sound' of the word (because they change the pronunciation),
e.g. falange becomes FLNJ, but falang actually becomes FLNK, so they are no longer 'substrings' of one another (i.e. phalanges becomes FLNJS, which FLNK* won't match).
To be honest, I don't know a good solution. You could perhaps get better results if you were to apply stemming BEFORE metaphone, so that the endings that change the pronunciation of the words are removed.
Alas, Sphinx can't do this. If you enable stemming and metaphone together, only ONE of the processors will ever fire.
Two possible solutions: implement stemming outside of Sphinx (or maybe with regexp_filter; I'm not sure whether, say, a Porter stemmer can be implemented purely with regular expressions),
or modify Sphinx so that ALL morphology processors apply (rather than just the first one that changes the word).

Handling American / UK spelling plus plurals in Sphinx

We need to have all these terms match each other and are running into difficulty:
orthopaedic, orthopedic, orthopaedics, orthopedics
At the moment we are dealing with most other plurals using morphology stem_en.
This is our current wordforms entry for this group (the pair is duplicated in reverse or
else it only works one way):
orthopaedic > orthopedic
orthopedic > orthopaedic
orthopedics > orthopaedics
orthopaedics > orthopedics
However "orthopedics" does not then match "orthopaedic" and we can't add another entry
"orthopaedic > orthopedics" because "orthopaedic" is already present and will throw an
error when indexing.
Any advice would be greatly appreciated.
"the pair is duplicated in reverse or else it only works one way"
That's a bad idea! Putting it in both ways will lead to issues (like the one you've found, in fact): you are changing each word into the other, so they won't match properly.
You need only one direction. Sphinx takes the left word and actually stores the right one in the index, so searching for the left and the right becomes interchangeable. If you swap the words, they never get a chance to match.
The complication arises because wordforms act as 'stemming exceptions', i.e. a word in wordforms is NOT stemmed, which means many words won't match. So you need to
perform stemming manually on the wordforms list, and
list all variations in your wordforms file, mapped to the same common word.
Using your example above, that would be something like
orthopaedic > orthopedic
orthopedic > orthopedic
orthopedics > orthopedic
orthopaedics > orthopedic
If the common word itself were changed by the stemmer, you would have to use its stemmed form on the right, e.g.
bridge > bridg
bridges > bridg
bridging > bridg
etc
It vastly bloats your wordforms file, but it can be automated.