Ligatures and umlauts in Sphinx search

I have a list of names with ligatures and umlauts (using Sphinx). Searching for "Æther" gives me a result, but I also want to be able to search for these names with the ligatures replaced, for example "Aether".
Can Sphinx do it automatically?

Umlauts can be dealt with directly by charset_table:
http://sphinxsearch.com/docs/current.html#conf-charset-table
Alas, there is no easy way to just tell Sphinx to index everything; you need an explicit charset_table set up for your own requirements. This forum thread is perhaps the best starting point:
http://sphinxsearch.com/forum/view.html?id=9312
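For example, a minimal charset_table fragment that folds the German umlauts to their ASCII base letters might look like this (a sketch only; the full table, and whether you fold or keep the accented forms, depends on your data):

# fold Ä/ä -> a, Ö/ö -> o, Ü/ü -> u in addition to the usual defaults
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+C4->a, U+E4->a, \
    U+D6->o, U+F6->o, \
    U+DC->u, U+FC->u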
Ligatures are trickier because they are not a one-to-one mapping. I think regexp_filter would be the best way to deal with these:
http://sphinxsearch.com/docs/current.html#conf-regexp-filter
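For example (a sketch; regexp_filter needs Sphinx built with RE2 support, and the filters run before tokenization, so they apply to both documents and queries):

# expand common Latin ligatures to their two-letter equivalents
regexp_filter = Æ => Ae
regexp_filter = æ => ae
regexp_filter = Œ => Oe
regexp_filter = œ => oe
regexp_filter = ß => ss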

Related

How to automatically translate between traditional and simplified Chinese characters with Unicode?

Is there a way to translate programmatically from traditional to simplified Chinese characters? If so, how do you do it, does unicode offer a way? If not, why doesn't there exist a database with the mapping, is it not one-to-one? I know you can find a mirror image glyph from another glyph in Unicode, but can you find the simplified glyph from a traditional one?
It is indeed not one to one. My favorite example to explain this quickly is this:
Take the character for face, 面. So far so good, it's the same in Traditional and Simplified Chinese. However, 面 is also the simplified version of 麵, noodle (where the 面 part on the right is the phonetic part). So if you have 面, you have no way of knowing which is which.

Sphinx search: multi-term wordforms not indexed correctly

I'm having an issue with specific entries in my wordforms file that are not being
interpreted as expected.
Here are a couple of examples:
1/48 > forty-eighth
1/96 > ninety-sixth
As you can see, these entries contain both slashes and hyphens, which may be related to
my issue.
For some reason, Sphinx doesn't correctly equate each fraction to the spelled out
version. Search results for "1/48" are not the same as for "forty-eighth", as they should
be. In other words, the mapping between these equivalent forms is not working.
In my Sphinx config, I have the forward slash (/) set as a blend character, so I assume
that the fraction is being recognized properly.
In support of that belief, the following wordforms entry does work correctly:
1/4 > fourth
Does anyone have any idea why my multi-term synonyms would not be working as expected?
I have tried replacing the hyphen with a space, but this doesn't change the result at
all. Would it help to change the order of the terms (i.e., on which side of the ">" they
should be placed)?
Thank you very much for any help.
When dealing with special characters in Sphinx it is always good to keep the following in mind:
By default, the Sphinx tokenizer handles unknown characters as whitespace
https://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
That has given me weird results too when using wordforms.
I would suggest you add the hyphen to charset_table so that "ninety-sixth" becomes one word (see the sketch below). ignore_chars is also an option, but then you will be searching for "ninetysixth" instead.
Much depends on the rest of your dataset and use cases, of course.
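As a rough sketch (the exact values depend on the rest of your charset_table, and the wordforms path is just a placeholder):

# keep '-' (U+2D) as a word character so "forty-eighth" stays a single token
charset_table = 0..9, A..Z->a..z, _, a..z, U+2D
# keep '/' (U+2F) as a blend character so "1/48" is indexed both whole and as "1" "48"
blend_chars = U+2F
# the wordforms file can then map the fraction to the single hyphenated token
wordforms = /path/to/wordforms.txt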

Searching for hash tag via thinking sphinx

Is it possible to search for hash tags via thinking_sphinx? I can't find a solution.
I need to find all titles containing a hash tag: "title#text", "#text", etc.
You'll want to make sure Sphinx is indexing the hash character - which is done via the charset_table setting. Thinking Sphinx finds this value in config/sphinx.yml (create it if you haven't already), which is set up via environments, much like config/database.yml.
development:
  charset_table: "0..9, A..Z->a..z, _, a..z, \#, U+410..U+42F->U+430..U+44F, U+430..U+44F"
All other characters and character ranges listed are Sphinx's default set for UTF-8, which is what Thinking Sphinx uses by default.

sphinx dash in author names causing problems when searching

I've read all the posts about dashes and tried pretty much everything mentioned in them, yet cannot figure out a strange problem I'm having.
For example, I have an author name like this:
Arturo Pérez-Reverte
A search for 'pérez-reverte' will not turn up anything, nor will 'pérez\-reverte', so escaping the dash is not the issue.
But a search for 'spider-man' will return hits, proving that the dash seems to be working.
However, a search for 'perez reverte' also finds a hit because it searches each word separately and finds the 'reverte' in 'perez-reverte' (but doesn't seem to find the 'perez').
A search for either 'pérez' or 'perez' finds the same number of documents, suggesting that the accent is not an issue (I do have a charset_table which accounts for accented characters).
So I'm very confused as to what's happening here. If it isn't the accent and it isn't the dash, what could it be?
I don't have any ignore_chars set, I'm using UTF-8 and have a charset_table to treat accented characters as regular characters.
The only difference between these two terms is that one of them is a title (spider-man) and the other an author, but they are both part of the same Sphinx index declaration, so I don't see that as an issue in any way.
Any help would be greatly appreciated.
After much fighting with it, I found out that even though my database is all UTF-8 with the proper collation I needed to add this in sphinx.conf for everything to work properly:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER SET utf8
After doing that, and having the proper charset_table, everything seems to be working fine.
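For reference, this is roughly where those directives sit in sphinx.conf (the source name, columns and query below are made up):

source books
{
    type            = mysql
    # ... connection settings (sql_host, sql_user, etc.) omitted ...
    # force the MySQL connection to UTF-8 before the indexing query runs
    sql_query_pre   = SET NAMES utf8
    sql_query_pre   = SET CHARACTER SET utf8
    sql_query       = SELECT id, title, author FROM books
}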
Hope this helps someone else.

Which Unicode characters are allowed in IDN host labels?

I’m currently working on a “proper” URI validator, and currently it all comes down to hostname validation; the rest isn’t that tricky.
I’m stuck on IDN hostname labels (i.e., containing Unicode; possible punycode encoded strings have been decoded at this point).
My first idea was basically one regex for TLDs which don’t support IDNs and one for those which do. This could perhaps be based on Mozilla’s list of IDN-enabled TLDs. Respectively,
^[a-zA-Z0-9\-]+$ and ^[a-zA-Z0-9\-\p{L}]+$. However, this is not an ideal situation, since every IDN registrar can decide which characters to allow.
What I’m looking for is a proper, consistent, up to date data table of the Unicode characters allowed in various TLDs. It’s beginning to look like I have to find all the data myself at Russian and Chinese registry sites (which is quite difficult).
So before I go trying to gather all this data myself, I wondered whether such a list already exists. Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)
IANA maintains a list of all of the codepoints and their status at https://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties
All of the ones marked PVALID are safe to use. The ones marked CONTEXTO or CONTEXTJ have more rules to follow. Read RFC5892 (IDNA) and RFC6452 (changing the status of a couple of characters) for all of the gory details.
Can't you convert all Unicode domains to punycode and validate that? Since DNS doesn't support real UTF-8 characters anyway, this might be the best solution.
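As a sketch of that approach in Python, using the third-party idna package (which implements the IDNA 2008 tables, including the PVALID/CONTEXTO/CONTEXTJ statuses mentioned above); the hostnames are just examples:

# pip install idna
import idna

def validate_hostname(hostname: str) -> str:
    # idna.encode() checks every label against the IDNA 2008 code point
    # tables and raises idna.IDNAError if anything is disallowed,
    # then applies the punycode (xn--) encoding
    return idna.encode(hostname).decode("ascii")

print(validate_hostname("bücher.example"))   # xn--bcher-kva.example
try:
    validate_hostname("exa_mple.test")       # '_' is DISALLOWED in IDNA 2008
except idna.IDNAError as exc:
    print("rejected:", exc)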