The German language uses the diacritical characters ä, ö, ü. For international use they are transliterated into ae, oe, ue (not a, o, u). This means that Müller appears as Mueller on his ID document. This is what we get when we read the document with (for example) a passport reader, and this is what we save to the database table.
In the next step we search for the records. We do it in two ways:
by entering search data with a passport reader (no problem here)
by entering search data manually
With manual entry there is a small problem, because the user may enter the data the international way ('Mueller') or the popular way ('Müller').
This problem can be solved by using the Postgres extension unaccent and modifying the unaccent.rules file, so that whether the user enters 'Mueller' or 'Müller', we search the database for Mueller.
So far so good...
BUT
in the same table we also have names of other origins, for example Turkish ones. Turks transliterate their umlauts (ä, ö, ü) directly into a, o, u, and that is how they are written on the documents, so Müller would be Muller on a Turkish document. This causes a problem because (as described before) we search with the German unaccent.rules and don't find the people we are searching for.
Long story, but finally the question...
...does anybody have any idea how to handle this?
Is there any way to have two unaccent.rules files and use them with OR? For example:
Select * from table
where last_name = unaccent('Müller' (use German rules))
or last_name = unaccent('Müller' (use Turkish rules))
(I know the above does not work, but maybe there is something similar we could use.)
regards
M
The solution should be simple. Define your German unaccent dictionary (I'll call it entumlauten), then query like
SELECT ...,
       last_name = unaccent('unaccent', 'Müller') AS might_be_turkish,
       last_name = unaccent('entumlauten', 'Müller') AS might_be_german
FROM tab
WHERE last_name IN (unaccent('unaccent', 'Müller'),
                    unaccent('entumlauten', 'Müller'));
IN (or = ANY) will perform better than OR, because it can use an index scan. The additional columns in the SELECT list tell you which condition was matched.
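For reference, a sketch of how that second dictionary could be defined; the name entumlauten is an assumption, and the rules file entumlauten.rules (with lines like "ü ue") must be installed in $SHAREDIR/tsearch_data:

CREATE EXTENSION IF NOT EXISTS unaccent;

-- German-style dictionary based on the unaccent template
CREATE TEXT SEARCH DICTIONARY entumlauten (
    TEMPLATE = unaccent,
    RULES    = 'entumlauten'   -- reads entumlauten.rules from $SHAREDIR/tsearch_data
);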
Use the soundex() function. This is suitable only for creating candidate lists from which a human user picks the wanted name. You should probably strip all diacritics (the Turkish way) before using it.
It also handles similar-sounding letters, like C, S and Z or D and T, so Schmidt would match Smith and Jönssen would match Johnson.
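A minimal sketch of how that could look, reusing the assumed tab/last_name names from above; soundex() ships with the fuzzystrmatch extension:

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides soundex()

-- candidate list for a human to pick from; strip diacritics first,
-- since soundex() only considers ASCII letters
SELECT last_name
FROM tab
WHERE soundex(unaccent(last_name)) = soundex('Muller');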
The text data type in a PostgreSQL database (encoding UTF-8) can contain any UTF-8 character. These include a number of control characters (https://en.wikipedia.org/wiki/Unicode_control_characters).
While I agree there are cases where control characters are needed, there is little (if any) use for them in normal attributes like a person's name, a document number, etc. In fact, allowing such characters to be stored in the DB can lead to nasty problems, as the characters are not visible and the value of the attribute is not what it seems to be to the end user.
As the problem seems to be very general, is there a way to prevent control chars in text fields? Maybe there is a special text type (like citext for case-insensitive text)? Or should this behaviour be realized as a domain? Are there any other options? All I could find is people talking about finding these characters using regexes.
I could not find any general recommendations to solving the problem so maybe I'm missing something obvious here.
The exact answer will depend on what you consider printable.
However, a domain is the way to go. If you want to go with what your database collation considers a printable character, use a domain like this:
CREATE DOMAIN printable_text AS text CHECK (VALUE !~ '[^[:print:]]');
SELECT 'a'::printable_text;
printable_text
════════════════
a
(1 row)
SELECT E'\u0007'::printable_text; -- bell character (ASCII 7)
ERROR: value for domain printable_text violates check constraint "printable_text_check"
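The domain can then be used anywhere plain text could be, for example in a (hypothetical) person table; inserting a value with a control character fails the check:

CREATE TABLE person (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    last_name printable_text NOT NULL   -- text, minus control characters
);

INSERT INTO person (last_name) VALUES (E'Smith\u0007');  -- rejected by the CHECK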
I'm trying to use Postgres' full text search, but I'm struggling to get certain query phrases working properly when stemming is involved.
strawberries matches strawberry
fruity does not match fruit
From what I've read, these stemming algorithms are internal to Postgres and can't easily be modified. Does anyone know if the -y suffix can be stemmed properly?
This is too long for a comment.
I assume you are at least familiar with the documentation on the subject. I think the simplest method would be to create a synonym dictionary with the pairs that are equivalent.
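A minimal sketch of what that could look like, assuming a synonym file fruity.syn in $SHAREDIR/tsearch_data containing the line "fruity fruit":

-- copy the built-in configuration instead of modifying 'english' itself
CREATE TEXT SEARCH CONFIGURATION english_syn (COPY = english);

-- synonym dictionary; SYNONYMS = fruity refers to fruity.syn
CREATE TEXT SEARCH DICTIONARY fruity_syn (
    TEMPLATE = synonym,
    SYNONYMS = fruity
);

-- consult the synonym list before the stemmer
ALTER TEXT SEARCH CONFIGURATION english_syn
    ALTER MAPPING FOR asciiword WITH fruity_syn, english_stem;

-- 'fruity' is now rewritten to 'fruit', so the two match
SELECT to_tsvector('english_syn', 'fruity') @@ to_tsquery('english_syn', 'fruit');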
You need to be careful. There are lots of words in English where you cannot remove the "y":
lay <> la
Gaily <> Gail (woman's name)
Daily <> Dail (Irish parliament)
foxy <> fox
analogy <> analog
And this doesn't include the zillions of words where removing the "y" creates a non-word (a large class are -ly words; another, -way words).
You will need to create these manually yourself.
I am not intimately familiar with Postgres's dictionaries. But you should be able to accomplish what you want.
I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaereses) which I need to update to their Unicode-normalized form, e.g. the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301).
I have found a solution from an old Postgres nabble mail list from 2009, using pl/python,
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# plpythonu is Python 2: the argument arrives as a byte string,
# so decode it to unicode before normalizing
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
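Side note: plpythonu is Python 2, which current PostgreSQL versions no longer ship. Under plpython3u the argument already arrives as str, so a sketch of the same function would be:

create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# Python 3: the argument is already a str, nothing to decode
return unicodedata.normalize('NFC', str)
$$ LANGUAGE plpython3u;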
EDIT: As Craig has pointed out (and this was one of the things I tried):
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The result of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it will only work if your browser or terminal doesn't normalize UTF-8.)
If you're using Firefox you might not see the above correctly; Chrome renders it correctly. Here's what you should see if your browser handles decomposed Unicode correctly: [screenshot of the correctly rendered decomposed string omitted]
PostgreSQL 13 introduced the string function normalize(text [, form]) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
without | with_norm
---------+-----------
f | t
(1 row)
Note that I expect this to miss any indexes, so using it blindly in a hot production query is a recipe for disaster.
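If that matters, an expression index is one way around it; normalize() is immutable, so it can be indexed. A sketch with made-up files/filename names:

CREATE INDEX files_filename_norm_idx ON files ((normalize(filename)));

SELECT * FROM files WHERE normalize(filename) = normalize('päivää');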
Great news for us who have naively stored NFD filenames from Mac users in our databases.
Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312) which contains only simplified characters. Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt), but I'm not sure if it's accurate or not.
If that table isn't correct, maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as the Unicode equivalents: http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
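For example, in PostgreSQL that extraction could be sketched like this, assuming the dictionary file has been loaded line by line into a one-column staging table (e.g. with \copy in psql):

-- one raw CC-CEDICT line per row
CREATE TEMP TABLE cedict_raw (line text);
-- \copy cedict_raw FROM 'cedict_ts.u8'

-- the simplified form is the second space-separated field;
-- split it into single characters and deduplicate
SELECT DISTINCT ch AS simplified_char
FROM cedict_raw,
     regexp_split_to_table(split_part(line, ' ', 2), '') AS ch
WHERE line NOT LIKE '#%';   -- skip comment lines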
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
> string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
> string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to be valid traditional Chinese as well, and the string contains no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, and that these are UTF-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it hasn't come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regexes: one to check that it is Chinese, and one to check that it is not simplified Chinese.
According to Wikipedia, whether a character renders as simplified Chinese, traditional Chinese, kanji, or another form is in many cases left up to the font. So while you could have a selection of simplified Chinese code points, such a list would not be at all complete, since many characters are no longer distinct.
I don't believe there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.
I have a tableview (linked to a database) and a search bar. When I type something in the search bar, I do a quick search in the database and display the results as I type.
The query looks like this:
SELECT * FROM MyTable WHERE name LIKE '%NAME%'
Everything works fine as long as I use only ASCII characters. What I want is to type ASCII characters and match their equivalents with diacritics. For instance, if I type "Alizee" I would expect it to match "Alizée".
Is there a way to make the query locale-insensitive? I've read about the COLLATE option in SQL, but it seems to be of no use with SQLite. I've also read that iPhone SDK 3.0 has "Localized collation", but I was unable to find any documentation about what this means...
Thank you.
There are a few options for solving this:
1. Replacing all accented chars in the query before executing it, e.g.
   "Psychédélices" => "Psychedelices"
   "À contre-courant" => "A contre-courant"
   "Tempête" => "Tempete"
   etc., but this only works for the input, so you must not have accented chars in the database itself. A simple solution, but far from perfect (a variation is sketched after this list).
2. Using a 3rd party library, namely ICU (links below). Not sure if it's the best choice for iPhone though.
3. Writing one or more custom C functions that will do the comparison. More in the links below.
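For option 1's general idea, a crude sketch directly in SQLite SQL: fold a few accented characters in the stored value at query time with nested replace() calls. Only three characters are shown; a real version would need every accented character you care about, and it prevents index use:

-- fold a few accents in the stored value, then do the usual LIKE
SELECT *
FROM MyTable
WHERE replace(replace(replace(name, 'é', 'e'), 'è', 'e'), 'ê', 'e')
      LIKE '%Alizee%';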
A few posts here on Stack Overflow that discuss the various options:
How to sort text in sqlite3 with specified locale?
Case-insensitive UTF-8 string collation for SQLite (C/C++)
How to implement the accent/diacritic insensitive search in Sqlite?
Also a couple of external links:
SQLite and native UNICODE LIKE support in C/C++
sqlite case and accent insensitive searches
I'm not sure about SQL, but I think you can definitely use the NSDiacriticInsensitivePredicateOption to compare in-memory NSStrings.
An example would be an NSArray full of the strings you're searching over. You could just iterate over the array comparing strings using the NSDiacriticInsensitivePredicateOption as your comparison option and displaying the successful matches.