Detecting special chars in Postgres

I have usernames in my Postgres 9 DB such as:
Ron R ty ♥☆♡★Green Eyes♥☆♡★
Sωℯℯт۞Angel 2 ᾧ➍ᾧ ty Լù☪ƖƒεƦ
The DB is encoded in UTF-8.
Is there a way, in SQL, to detect the presence of these special characters outside the standard Roman alphabet?
I tried using convert, documented here: http://www.postgresql.org/docs/9.1/static/functions-string.html, but only got errors.

Try matching on a regexp character range based on Unicode code points:
WHERE uname ~ '[\x80-\xffff]';
Or, if you want to be stricter, you can flag anything non-alphanumeric:
WHERE uname ~ '[^[:alnum:]]';
Other character classes are available too; see the docs for details.
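As a minimal sketch, assuming a hypothetical users table with a uname column:
SELECT uname
FROM users
WHERE uname ~ '[\x80-\xffff]';
Note that \xffff stops at the end of the Basic Multilingual Plane, so characters above U+FFFF (most emoji, for instance) would need a wider range.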

Related

Could i use special characters to split aggregation? Postgresql

Can I, without worrying, use special characters to split up the words?
For example:
SELECT STRING_AGG(users.name, '💪')
This will work fine as long as you are using characters supported by your database encoding (check the server_encoding parameter).
I hope and expect that you are using the only sensible encoding here, UTF8. If so, you don't have to worry.
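A quick check, as a sketch (the users table is taken from the question):
SHOW server_encoding;  -- should report UTF8
SELECT STRING_AGG(users.name, '💪') FROM users;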

Tesseract (Swedish language) can't recognize special characters like #, § etc

I use Tesseract version 4 (Swedish).
When interpreting images, it can never recognize some special characters like #, §, etc.
For example, if I have xxx#gmail.com, it interprets it as xxxdgmail.com.
How can I solve this problem?
Try combining with the eng language -- or another one that supports your special characters -- such as -l swe+eng or -l eng+swe.
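For example, a sketch of the invocation (input.png and out are placeholder names):
tesseract input.png out -l swe+eng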

elixir Bson decoder failing on utf8 > 16#FF

I am reading from MongoDB and using Bson.decoder(data). Along the way, the data becomes a list of tuples that includes {"unitˊs", 1}. String.to_atom("unitˊs") fails, apparently because the 5th char is MODIFIER LETTER ACUTE ACCENT (U+02CA):
** (ArgumentError) argument error
:erlang.binary_to_atom("unitˊs", :utf8)
and http://erlang.org/doc/man/erlang.html#binary_to_atom-2 notes:
binary_to_atom(Binary, utf8) will fail if the binary contains Unicode characters greater than 16#FF.
Are there any suggested workarounds?
There isn't any workaround until Erlang/OTP 18, which will support the full Unicode range for atoms. So the best option right now is not to convert it to an atom at all.
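As a sketch of that fallback in Elixir (safe_key is a hypothetical helper, not part of any library):
defmodule KeyHelper do
  # Try to atomize the key; fall back to the raw binary when the
  # runtime rejects code points above 16#FF (pre-OTP 18 behaviour).
  def safe_key(key) when is_binary(key) do
    String.to_atom(key)
  rescue
    ArgumentError -> key
  end
end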

Unicode normalization in Postgres

I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaeresis) which I need to update to their Unicode-normalized form, e.g. the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301).
I have found a solution on an old Postgres nabble mailing list from 2009, using PL/Python:
create or replace function unicode_normalize(str text) returns text as $$
# plpythonu runs Python 2: the text argument arrives as a byte string,
# so decode it to unicode before normalizing to NFC
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
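A quick sanity check of the function (it should return the precomposed form):
SELECT unicode_normalize(E'\u0061\u0301');  -- expect 'á' (U+00E1)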
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing UTF-8 into pre-composed form before performing encoding conversions. The results of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404#2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
 ?column? 
----------
 f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal doesn't normalize the UTF-8; at the time of writing, Firefox normalized decomposed Unicode on display while Chrome rendered it as-is.)
PostgreSQL 13 introduced the string function normalize(text [, form]) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
 without | with_norm
---------+-----------
 f       | t
(1 row)
Note that I expect this to miss any indexes, so using it blindly in a hot production query is likely to be a recipe for disaster.
Great news for those of us who have naively stored NFD filenames from Mac users in our databases.
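For the original bulk-update use case, a minimal sketch on PostgreSQL 13+ (places and name are hypothetical table and column names):
UPDATE places
SET name = normalize(name, NFC)
WHERE name IS NOT NFC NORMALIZED;
The IS NOT NFC NORMALIZED predicate (also new in PostgreSQL 13) limits the rewrite to rows that actually need it.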

Converting accented characters in PostgreSQL?

Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively.
The closest thing I could find is the translate function, given the example in the comments section found here.
Some commonly used accented characters can be searched using the following function:
translate(search_terms,
'\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215\303\216\303\217\303\221\303\222\303\223\303\224\303\225\303\226\303\230\303\231\303\232\303\233\303\234\303\235\303\237\303\240\303\241\303\242\303\243\303\244\303\245\303\246\303\247\303\250\303\251\303\252\303\253\303\254\303\255\303\256\303\257\303\261\303\262\303\263\303\264\303\265\303\266\303\270\303\271\303\272\303\273\303\274\303\275\303\277','AAAAAAACEEEEIIIINOOOOOOUUUUYSaaaaaaaceeeeiiiinoooooouuuuyy')
Are you doing this just for indexing/sorting? If so, you could use this PostgreSQL extension, which provides proper Unicode collation. The same group has a PostgreSQL extension for doing normalization.
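On more recent PostgreSQL versions, the contrib unaccent extension covers exactly this stripping job; a minimal sketch:
CREATE EXTENSION IF NOT EXISTS unaccent;
SELECT unaccent('åø');  -- returns 'ao'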