Custom 'stemming' in Sphinx with Wordforms? - sphinx

I've found stem_en and the lemmatizer to be either too limiting or too inclusive for my needs. Can I do custom stemming with wordforms? Either full words,
e.g. Proctology > Proctologist
but ideally stems,
ology > ologist

No, wordforms work on whole words, not on word endings.
You would have to either use regexp_filter or develop your own morphology processor (as C code).
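
To illustrate the regexp idea, here is a minimal Python sketch of suffix rewriting applied to text before it is indexed or searched. It only imitates what a regex-based filter could do; the rule list and the normalize function are hypothetical names made up for this example, and the real regexp_filter directive has its own syntax, so check the Sphinx manual before porting the pattern.

```python
import re

# Hypothetical rule set: each rule folds a family of endings onto one shared stem.
SUFFIX_RULES = [
    (re.compile(r"(\w+)olog(?:y|ist|ists|ies|ical)\b"), r"\1olog"),
]

def normalize(text):
    """Lowercase the text and rewrite known endings to a common stem."""
    text = text.lower()
    for pattern, replacement in SUFFIX_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("Proctology"))    # proctolog
print(normalize("proctologist"))  # proctolog
```

Because both forms collapse to the same string, applying the same rewriting to documents and queries makes them match each other.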

Simple `to_tsvector` configuration - postgres

How can I change the to_tsvector configuration to use a simple tokenization rule like:
lowercase
split by spaces only
Executing the following query:
SELECT to_tsvector('english', 'birthday=19770531 Name=John-Oliver Age=44 Code=AAA-345')
I get these lexemes:
'-345':9 '19770531':2 '44':6 'aaa':8 'age':5 'birthday':1 'code':7 'john':4 'name':3
The kind of searching I'm looking for is like:
(!birthday | birthday=19770531) & (code=AAA-345)
It means: get me all records that either have the text "birthday=19770531" or don't have "birthday" at all, and that also have text equal to "code=AAA-345". With the way the lexemes are being created, this is not possible. I was expecting something like this:
'birthday=19770531':1 'age=44':2 'code=aaa-345':4 'name=john-oliver':3
You would have to code a custom parser. This can only be done in C.
But you might be able to use the existing testing parser test_parser, it seems to do what you want. If not, it would at least be a good starting point.
The problem may be that this is in src/test/modules/, and I don't think it ships with most installation packaging. So it might take some effort to get it to install. It would depend on your OS, version, and package manager.
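
As a reference for what such a parser would need to produce, here is a small Python sketch of the requested rule (lowercase, split on whitespace only). It is not PostgreSQL code and says nothing about how test_parser is implemented; simple_lexemes is a hypothetical helper that only shows the expected tokens and positions.

```python
def simple_lexemes(text):
    """Lowercase the text, split on whitespace only, and record 1-based positions."""
    lexemes = {}
    for position, token in enumerate(text.lower().split(), start=1):
        lexemes.setdefault(token, []).append(position)
    return lexemes

sample = "birthday=19770531 Name=John-Oliver Age=44 Code=AAA-345"
for token, positions in simple_lexemes(sample).items():
    print(token, positions)
```

With this rule each key=value pair survives as one lexeme, which is what a query on code=aaa-345 needs.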

Use wordforms or regexp in Sphinx to force a multiword term to be a "word"

Is there a way to "force" Sphinx to index a term such as iphone 5 as a single term? For various reasons I can't search for it as "iphone 5" or iphone NEAR/1 5; I need to search for it as iphone 5. Naturally, the way Sphinx works, this means it searches for both iphone and 5 anywhere in the document, when I want it to search for the exact term iphone 5. Can I somehow index iphone 5 as a single term to make this happen?
I still need to be able to apply wordforms/regexp and other mapping to the term e.g.
iphone 5>iphone5
This way, if someone searches on iphone5 it will find iphone 5 and vice versa. The issue is that if a search is done on iphone 5, it will find iphone5 but it will also find Selling 5 iphone 6Gs, whereas if I search on "iphone 5" it will not find iphone5. So my goal is to make iphone 5 a term that does not need "" to be treated as a phrase, without being forced to search as an exact phrase, which would break any additional wordform/regexp matching.
Do you control the configuration of the index?
If so you can configure the index to be created with the index_exact_words option.
From the documentation (http://sphinxsearch.com/docs/current.html#conf-index-exact-words) :
12.2.42. index_exact_words
Whether to index the original keywords along with the stemmed/remapped versions. Optional, default is 0 (do not index). Introduced in version 0.9.9-rc1.
When enabled, index_exact_words forces indexer to put the raw keywords in the index along with the stemmed versions. That, in turn, enables exact form operator in the query language to work. This impacts the index size and the indexing time. However, searching performance is not impacted at all.
Example:
index_exact_words = 1
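
Separately from index_exact_words, the mapping described in the question (iphone 5>iphone5) can be pictured as a normalization step applied to both documents and queries. The Python sketch below is only an illustration of that idea outside Sphinx; MULTIWORD_TERMS and collapse_terms are hypothetical names, not Sphinx settings.

```python
import re

# Hypothetical mapping of multiword terms to the single token they should become.
MULTIWORD_TERMS = {"iphone 5": "iphone5"}

def collapse_terms(text):
    """Lowercase the text and join each known multiword term into one token."""
    text = text.lower()
    for phrase, token in MULTIWORD_TERMS.items():
        # Allow any run of whitespace between the words; require word boundaries outside.
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in phrase.split()) + r"\b"
        text = re.sub(pattern, token, text)
    return text

print(collapse_terms("New iPhone 5 case"))     # new iphone5 case
print(collapse_terms("Selling 5 iphone 6Gs"))  # selling 5 iphone 6gs
```

Only the adjacent, ordered pair is collapsed, so the "Selling 5 iphone 6Gs" case no longer shares a token with a search for iphone 5.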

What is the best way to get the short version of an article?

What do you think? What's the best way to get the short version of an article from the administrator (in a CMS)?
I have 3 possibilities in mind:
1) Create two textareas (WYSIWYG), one for the short and one for the full version of the article
2) The short version will be (for example) the first 100 words of the article
3) Use a separator (like <hr>); the short version will be the text from the beginning up to the separator (the full version runs to the end)
A possibility would be to mix your first two ideas (I don't quite like the third one, where the content cannot live without the excerpt):
Use an additional textarea for the excerpt,
and if the user doesn't input anything in it, just use the first X words / sentences of the content.
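
A minimal Python sketch of that fallback, assuming the content is already plain text (stripping WYSIWYG markup is left out). The function name, the max_words default and the trailing "..." are choices made for this example only.

```python
def article_excerpt(content, excerpt="", max_words=100):
    """Return the hand-written excerpt if present, else the first max_words words."""
    if excerpt.strip():
        return excerpt.strip()
    words = content.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so readers can tell the text continues.
    return " ".join(words[:max_words]) + "..."

print(article_excerpt("one two three four five", max_words=3))  # one two three...
print(article_excerpt("one two three", excerpt="A summary."))   # A summary.
```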

How can I make Sphinx ignore some characters?

I'm making a PHP website with MySQL backend and Sphinx as a search engine. Say, I have an item with the designer "Ray-Ban" and I need to get it as a result when the user types "ray ban" or "rayban". Should there be an exclusion list somewhere?
The standard way to do this is the charset_table option. charset_table defines which characters are treated as part of a word; everything else acts as a separator.
For example, with this charset_table
index YOUR_INDEX_NAME
{
charset_table = 0..9, A..Z->a..z, _, a..z
}
text such as
My best friend is Hoo-foo but not Pe_ter.!!! That's all.
is parsed as these tokens
my best friend is hoo foo but not pe_ter that s all
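
The same splitting can be reproduced in a few lines of Python. This is only a model of the charset_table shown above (digits, letters folded to lowercase, underscore), not Sphinx's tokenizer; tokenize is a name chosen for the example.

```python
import re

def tokenize(text):
    """Fold A..Z to a..z, then split on every character outside 0..9, a..z and _."""
    return re.findall(r"[0-9a-z_]+", text.lower())

print(" ".join(tokenize("My best friend is Hoo-foo but not Pe_ter.!!! That's all.")))
# my best friend is hoo foo but not pe_ter that s all
```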
Your best bet is probably the exceptions file - although that means you'll need to know every case where you want two different words/phrases to be treated the same.
As of version 0.9.8 there is an exclusion list option available per index named ignore_chars.
e.g.
index YOUR_INDEX {
charset_type = utf-8
ignore_chars = -
}
More information available on the Sphinx website: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-ignore-chars
Side note: they show using U+AD to remove soft-hyphens in their example. For some reason this didn't work for me, but the example I gave above worked fine.
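
To see the effect of ignoring a character rather than splitting on it, here is a small Python model. It assumes ignored characters are dropped before tokenization and that the remaining text is split on anything outside 0..9, a..z and _; tokenize_ignoring and IGNORE_CHARS are hypothetical names for this sketch.

```python
import re

IGNORE_CHARS = "-"  # mirrors ignore_chars = - from the config above

def tokenize_ignoring(text, ignore_chars=IGNORE_CHARS):
    """Drop ignored characters, then tokenize on 0..9, a..z and _ (case-folded)."""
    for ch in ignore_chars:
        text = text.replace(ch, "")
    return re.findall(r"[0-9a-z_]+", text.lower())

print(tokenize_ignoring("Ray-Ban"))  # ['rayban']
print(tokenize_ignoring("rayban"))   # ['rayban']
print(tokenize_ignoring("ray ban"))  # ['ray', 'ban']
```

In this model "Ray-Ban" and "rayban" end up as the same token, while "ray ban" still yields two tokens, so that spelling would need separate handling.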

Double-metaphone errors

I'm using Lawrence Philips' Double-Metaphone algorithm with great success, but I have found the odd "unexpected result" for some combinations.
Does anyone else have additions or changes to the algorithm they wouldn't mind sharing, or just the combinations they've found that do not work as expected?
e.g. I had issues between:
Peashill and Bushley. (both match with PXL)
Rockliffe and Rockcliffe (RKLF and RKKL)
All Soundex, Metaphone and variant schemes are occasionally going to give results that aren't identical to what you expect. This is unavoidable - they can be regarded as more or less simple hash algorithms with special information preserving properties, and will sometimes produce collisions when you'd rather they didn't, and will sometimes produce differences when you'd rather they didn't.
One possible way of improving things is using 'synonym rings'. This basically produces lists of words that should be regarded as synonyms, independent of the spelling. I encountered them in the context of name matching. For example, variants on Chaudri included:
CHAUDARY
CHAUDERI
CHAUDERY
CHAUDHARY
CHAUDHERI
CHAUDHERY
CHAUDHRI
CHAUDHRY
CHAUDHURI
CHAUDHURY
CHAUDHY
CHAUDREY
CHAUDRI
CHAUDRY
CHAUDURI
CHAWDHARY
CHAWDHRY
CHAWDHURY
CHDRY
CHODARY
CHODHARI
CHODHOURY
CHODHRY
CHODREY
CHODRY
CHODURY
CHOUDARI
CHOUDARY
CHOUDERY
CHOUDHARI
CHOUDHARY
CHOUDHERY
CHOUDHOURY
CHOUDHRI
CHOUDHRY
CHOUDHURI
CHOUDHURY
CHOUDREY
CHOUDRI
CHOUDRY
CHOUDURY
CHOUWDHRY
CHOWDARI
CHOWDARY
CHOWDHARY
CHOWDHERY
CHOWDHRI
CHOWDHRY
CHOWDHURI
CHOWDHURRYY
CHOWDHURY
CHOWDORY
CHOWDRAY
CHOWDREY
CHOWDRI
CHOWDRURY
CHOWDRY
CHOWDURI
CHOWDURY
CHUDARY
CHUDHRY
CHUDORY
COWDHURY
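
A synonym ring can be sketched as a lookup table consulted before (or instead of) the phonetic key. The Python below is a minimal illustration using a handful of the variants listed above; the data layout and the same_ring helper are assumptions for this example, not part of any Metaphone implementation.

```python
# Each ring is a set of spellings to be treated as the same name.
SYNONYM_RINGS = [
    {"CHAUDRI", "CHAUDHRY", "CHOUDHURY", "CHOWDHURY", "CHOWDRY"},
]

# Reverse index: spelling -> position of the ring it belongs to.
RING_ID = {name: i for i, ring in enumerate(SYNONYM_RINGS) for name in ring}

def same_ring(a, b):
    """True if both names belong to the same synonym ring (case-insensitive)."""
    ring_a = RING_ID.get(a.upper())
    return ring_a is not None and ring_a == RING_ID.get(b.upper())

print(same_ring("Chaudhry", "Chowdhury"))  # True
print(same_ring("Chaudri", "Smith"))       # False
```

A matcher could check the ring first and fall back to the phonetic comparison only when neither name is in a ring.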
Regular Metaphone returns different keys for Peashill and Bushley:
Peashill PXL
Bushley BXL