Sphinx search: multi-term wordforms not indexed correctly

I'm having an issue with specific entries in my wordforms file that are not being
interpreted as expected.
Here are a couple of examples:
1/48 > forty-eighth
1/96 > ninety-sixth
As you can see, these entries contain both slashes and hyphens, which may be related to
my issue.
For some reason, Sphinx doesn't correctly equate each fraction to the spelled out
version. Search results for "1/48" are not the same as for "forty-eighth", as they should
be. In other words, the mapping between these equivalent forms is not working.
In my Sphinx config, I have the forward slash (/) set as a blend character, so I assume
that the fraction is being recognized properly.
In support of that belief, the following wordforms entry does work correctly:
1/4 > fourth
Does anyone have any idea why my multi-term synonyms would not be working as expected?
I have tried replacing the hyphen with a space, but this doesn't change the result at
all. Would it help to change the order of the terms (i.e., on which side of the ">" they
should be placed)?
Thank you very much for any help.

When dealing with special characters in Sphinx, it is always good to keep the following in mind:
By default, the Sphinx tokenizer handles unknown characters as whitespace
https://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
That has given me weird results too when using wordforms.
I would suggest you add the hyphen to charset_table so that ninety-sixth becomes one word. ignore_chars is also an option, but then you will be searching for ninetysixth instead.
Much depends on the rest of your dataset and use cases, of course.
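A sketch of the relevant settings (the index name and the wordforms path are placeholders of mine, and the exact charset_table contents depend on your data):
index products
{
    # U+2D adds the hyphen as a regular character, so forty-eighth stays one token
    charset_table = 0..9, a..z, A..Z->a..z, U+2D
    # U+2F is the forward slash, already a blend char per the question
    blend_chars   = U+2F
    wordforms     = /path/to/wordforms.txt
}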

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that:
- are more than one word
- contain title case (at least one capitalized word after the first one)
- but exclude specific proper nouns that don't get checked for title case
- and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free according to what https://www.regextester.com/ keeps telling me, so I'm really hoping for help from this community.
What I've tried (piecemeal, but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be to split the input via [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), filter each resulting word on whether its first character is uppercase (e.g. with Python's str.isupper), and finally filter against a dictionary of proper nouns.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is practically impossible, and using isupper also works with non-Latin letters (e.g. when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering against a dictionary is a much more straightforward approach and much easier to maintain (I would recommend a set or another data type suited for fast lookups).
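A minimal sketch of that approach in Python (should_flag and the PROPER_NOUNS set are placeholder names of mine):
import re

# Hypothetical allow-list of proper nouns that never trigger a flag.
PROPER_NOUNS = {"TikTok"}

def should_flag(text):
    # Drop {...} parameters so {FIDO} and {Fifi} are never considered.
    text = re.sub(r"\{[^}]*\}", " ", text)
    # Split on whitespace and non-word characters, discarding empty strings.
    words = [w for w in re.split(r"[\s\W]+", text) if w]
    if len(words) < 2:  # only strings with more than one word qualify
        return False
    # Flag if any word after the first starts uppercase and is not allow-listed.
    return any(w[0].isupper() and w not in PROPER_NOUNS for w in words[1:])

print(should_flag("Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))  # True
print(should_flag("Don't post that video on TikTok."))  # False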
Maybe if you can define your use case more clearly, we can work out a pure regex solution...

Parsing commas in Sphinx

I have a field which can have multiple commas which are actually critical to some regex pattern matching.
Commas, however, do not index, and adding them to the charset breaks it (for a number of technical reasons related to how Sphinx searches/indexes).
I cannot replace the character prior to indexing (e.g. with a token like COMMA) to give myself an anchor for the pattern, and I can't properly extract the pattern without one.
My only thought is to use exceptions to map ,=>COMMA (this won't process large text fields, so that's not a huge issue). I'm curious whether this will work, and what the pipeline is, i.e. what it could possibly break that I'm not considering. AFAIK exceptions happen first and do not obey the charset, so this might in fact work. I get that I can test it to see whether it does, but again I am more concerned with what this might break, given my rudimentary knowledge of the Sphinx indexing pipeline.
Just use U+2C to add comma to your charset_table, e.g.
charset_table=a..z,A..Z,0..9,U+2C
You might also want to add it to blend_chars instead, to treat a comma as both a word separator and part of the token.
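A sketch of both options (only one of the two would be active in a real config; the rest of the charset_table is illustrative):
# Option 1: comma becomes a regular character, so "a,b" indexes as one token
charset_table = 0..9, a..z, A..Z->a..z, U+2C
# Option 2: comma becomes a blend char, so "a,b" indexes both whole and as "a" and "b"
blend_chars = U+2C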

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to apple (red (i.e., not ending with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems that even if I set () as separator characters to index, that only affects the direct attributes that are searchable, not the tags. Is this correct? Can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
The reason it is currently failing is because of the syntax of tagFilters. When you pass a string, it tries to parse it using a special syntax, documented here, where commas mean "AND" and parentheses delimit an "OR" group.
By the way, tagFilters is now deprecated in favor of a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'
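For reference, a minimal sketch of both calls using the Python client (credentials and the index name are placeholders, and this assumes the v4-style algoliasearch API; the JS calls are analogous):
from algoliasearch.search_client import SearchClient

client = SearchClient.create("APP_ID", "SEARCH_API_KEY")  # placeholder credentials
index = client.init_index("Fruit")

# Deprecated: tagFilters as an array, so the parentheses are taken literally
res = index.search("", {"tagFilters": ["apple (red)"]})

# Preferred: the filters parameter, with the tag value quoted
res = index.search("", {"filters": '_tags:"apple (red)"'})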

SOLR Dropping Emoji Miscellaneous characters

It looks like SOLR is considering what should be valid Unicode characters as invalid, and dropping them.
I "proved" this by turning on query debug to see what the parser was doing with my query. Here's an example:
Query = 'ァ☀' (\u30a1\u2600)
Here's what SOLR did with it:
'debug':{
'rawquerystring':u'\u30a1\u2600',
'querystring':u'\u30a1\u2600',
'parsedquery':u'(+DisjunctionMaxQuery((text:\u30a1)))/no_coord',
'parsedquery_toString':u'+(text:\u30a1)',
As you can see, it was OK with 'ァ', but it ATE the "Black Sun" character.
I haven't tried ALL of the Block, but I've confirmed it also doesn't like ⛿ (\u26ff) and ♖ (\u2656).
I'm using SOLR with Jetty, so the various Tomcat issues with character encoding shouldn't apply.
This very likely has more to do with the analyzer. I don't see anything specifying the treatment of exactly those sorts of characters, but they are probably being treated much like punctuation by the StandardAnalyzer (or whatever analyzer you may be using), and so will not be present in the final query. StandardAnalyzer implements the rules set forth in UAX-29, Unicode Text Segmentation, in order to separate input into tokens.
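A rough way to see the effect in Python (\w only mimics, not reproduces, the UAX-29 word rules):
import re

# \w matches letters, digits, and underscore under Unicode rules; symbols such as
# U+2600 (the Black Sun) fall outside it, much as a UAX-29 word tokenizer drops them.
print(re.findall(r"\w+", "\u30a1\u2600"))  # ['ァ'] (the sun character is gone)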

sphinx dash in author names causing problems when searching

I've read all the posts about dashes and tried pretty much everything mentioned in them, yet cannot figure out a strange problem I'm having.
For example, I have an author name like this:
Arturo Pérez-Reverte
A search for 'pérez-reverte' will not turn up anything, nor will 'pérez\-reverte', so escaping the dash is not the issue.
But a search for 'spider-man' will return hits, proving that the dash seems to be working.
However, a search for 'perez reverte' also finds a hit because it searches each word separately and finds the 'reverte' in 'perez-reverte' (but doesn't seem to find the 'perez').
A search for either 'pérez' or 'perez' finds the same number of documents, suggesting that the accent is not an issue (I do have a charset_table which accounts for accented characters).
So I'm very confused as to what's happening here. If it isn't the accent and it isn't the dash, what could it be?
I don't have any ignore_chars set, I'm using UTF-8 and have a charset_table to treat accented characters as regular characters.
The only difference between these two terms is that one of them is a title (spider-man) and the other an author, but they are both part of the same Sphinx index declaration, so I don't see that as an issue in any way.
Any help would be greatly appreciated.
After much fighting with it, I found out that even though my database is all UTF-8 with the proper collation, I needed to add this to sphinx.conf for everything to work properly:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER SET utf8
After doing that, and having the proper charset_table, everything seems to be working fine.
Hope this helps someone else.
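For anyone landing on the same problem, a sketch of where those lines sit in sphinx.conf (source/index names, paths, and the query are placeholders of mine):
source authors_src
{
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET CHARACTER SET utf8
    sql_query     = SELECT id, author, title FROM books
}

index authors
{
    source        = authors_src
    path          = /var/data/authors
    # keep é (U+E9) as a letter and fold É (U+C9) onto it
    charset_table = 0..9, a..z, A..Z->a..z, U+E9, U+C9->U+E9
}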