I'm using unaccent in Postgres, but it cannot convert special characters like:
ù : ù
However, it works fine for ù: ù
The two characters mean the same but have different code points; the first one is the character u followed by the combining grave accent ̀ .
How can I solve this problem?
Thank you so much.
Your problem is Unicode normalization, which PostgreSQL unfortunately does not do, and it is not so simple to implement on your own.
But because you only want to remove diacritical marks, it is enough to strip the code points (before or after calling the unaccent() function) that are Unicode combining characters:
select regexp_replace(
  'ùù',
  -- character class covering the Unicode combining-mark blocks
  '[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]',
  '',
  'g'
)
should do the trick.
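If you also need to fold precomposed characters like ù (U+00F9), here is a minimal sketch combining the two steps, assuming the unaccent extension is installed:
create extension if not exists unaccent;

select unaccent(regexp_replace(
  'ùù',
  '[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]',
  '',
  'g'
));  -- returns 'uu': the combining mark is stripped, unaccent folds the rest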
I am trying to find rows in a PostgreSQL table where a specific column contains special characters, excluding the following:
#^$.!\-#+'~_
Any help appreciated.
Hi, I think I figured it out. I found a solution that worked for me using POSIX regular expressions.
SELECT *
FROM TABLE_NAME
WHERE fieldName ~ '[^A-Za-z0-9#^\\$.!\-#+~_]'
The regular expression matches any character that is not in A-Z, a-z, or 0-9 and is also not one of your whitelisted characters ^$.!-#+~_. Notice that in the regex I had to escape the backslash and the hyphen, because they have a special meaning in regex. Maybe start by evaluating the proposed regex online with a few examples, e.g. here: https://regex101.com
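For a quick sanity check, here is a small self-contained query with made-up sample values:
select txt, txt ~ '[^A-Za-z0-9#^\\$.!\-#+~_]' as has_special
from (values ('abc#1'), ('abc%1'), ('plain_text+1')) as t(txt);
-- has_special is true only for 'abc%1', because % is not whitelisted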
When I use the fuzzystrmatch levenshtein function with diacritic characters it returns a wrong / multibyte-ignorant result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
(Note: the first character is an 'a' with a diacritic below; it is not rendered properly after I copied it here.)
The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns that:
At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
But as it does not name the levenshtein function, I was wondering if there is a multibyte aware version of levenshtein.
I know that I could use unaccent function as a workaround but I need to keep the diacritics.
Note: This solution was suggested by @Nick Barnes in his answer to a related question.
The 'a' with a diacritic is a character sequence, i.e. a combination of a and a combining character (the diacritic ̨): E'a\u0328'
There is an equivalent precomposed character ą: E'\u0105'
A solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character before comparing them.
Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
create extension plpythonu;

create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# plpythonu runs Python 2: text arrives as a byte string, so decode it first,
# then compose combining sequences into precomposed characters (NFC)
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;
Now, as the character sequence E'a\u0328' is mapped onto the equivalent precomposed character E'\u0105' by using unicode_normalize, the levenshtein distance is correct:
select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein
-------------
1
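As an aside: PostgreSQL 13 and later ship a built-in normalize() function (select normalize(E'a\u0328', NFC)), and on installations that only provide Python 3, a sketch of the equivalent function using plpython3u, where text values already arrive decoded:
create extension plpython3u;

create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# Python 3: PL/Python passes text as str, so no decoding step is needed
return unicodedata.normalize('NFC', str)
$$ language plpython3u;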
I have a table that has strings with non-UTF-8 characters, like �. I need to change them so that they get back all their accents and other Latin characters, e.g. cap� should become capó. The field is a VARCHAR.
So far, I have tried:
SELECT "Column Name", regexp_replace("Column Name", '[^\w]+','') FROM table
And:
CONVERT("Column Name", 'UTF8', 'LATIN1')
but neither works at all.
For instance, the error I get is: "Regexp encountered an invalid UTF-8 character (...)"
I have seen other solutions, but I can't use them because I cannot change the table, as I am not an administrator.
Is there any way to achieve this?
If the database encoding is UTF8, then all your strings will contain only UTF8 characters. They just happen to be different characters than you want.
First, you have to find out what characters are in the strings. In the case you show, � is Unicode codepoint FFFD (in hexadecimal).
So you could use the replace function in PostgreSQL to replace it with ó (Unicode code point F3) like this:
SELECT replace(mycol, E'\uFFFD', E'\u00f3') FROM mytab;
This uses PostgreSQL's Unicode character literal syntax; don't forget to prefix any string containing escapes with E for the extended string literal syntax.
Odds are that the character is not really �, because that is the Unicode “REPLACEMENT CHARACTER”, often used to stand in for characters that are not representable.
In that case, use psql and run a query like this to display the hexadecimal UTF-8 contents of your fields:
SELECT mycol::bytea FROM mytab WHERE id = 12345;
From the UTF-8 encoding of the character you can deduce what character it really is and use that in your call to replace.
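For example, if the hex dump shows the byte pair c3 b3, that is the UTF-8 encoding of U+00F3 (ó), which you can verify directly:
SELECT convert_from('\xc3b3'::bytea, 'UTF8');  -- returns 'ó'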
If you have several characters, you will need several calls to replace to translate them all.
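A sketch of such a chain; the second mapping here is purely hypothetical, so substitute whatever characters your bytea inspection actually turns up:
SELECT replace(
         replace(mycol, E'\uFFFD', E'\u00F3'),  -- � -> ó
         E'\u00BF', E'\u00E9')                  -- ¿ -> é (hypothetical second mapping)
FROM mytab;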
I have a μ character in my MySQL database, and I would like to replace it with the following code:
UPDATE produkte.produktliste SET Kurzbeschreibung
=REPLACE(REPLACE(Kurzbeschreibung, 'μ', ''), 'μ', '');
The code works, but I can't save the command to a script, because the μ character is not ANSI. I had a similar problem in the past and solved it with CHAR():
UPDATE produkte.produktliste SET Link
= REPLACE(REPLACE(Link, CHAR(160), ''), CHAR(160), '');
But which number (like 160) does μ have? I have tried many values (181, 924, 956), but nothing works. Does anyone have an idea why my statement does not work? Thanks in advance.
The numbers represent the Unicode code point of the character.
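CHAR() on its own builds a string of raw bytes, which is why single code-point numbers like 181 or 956 never match a UTF-8 μ. One approach, sketched here assuming a utf8mb4 column, is to pass the UTF-8 byte sequence and name the encoding with USING. Note that there are two look-alike characters, U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU, which is presumably why the original statement calls REPLACE twice:
UPDATE produkte.produktliste
SET Kurzbeschreibung = REPLACE(
    REPLACE(Kurzbeschreibung, CHAR(0xC2B5 USING utf8mb4), ''),  -- µ U+00B5, UTF-8 bytes C2 B5
    CHAR(0xCEBC USING utf8mb4), '');                            -- μ U+03BC, UTF-8 bytes CE BC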
Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively.
The closest thing I could find is the translate function, given the example in the comments section found here.
Some commonly used accented characters can be searched using the following function:
translate(search_terms,
  '\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215\303\216\303\217\303\221\303\222\303\223\303\224\303\225\303\226\303\230\303\231\303\232\303\233\303\234\303\235\303\237\303\240\303\241\303\242\303\243\303\244\303\245\303\246\303\247\303\250\303\251\303\252\303\253\303\254\303\255\303\256\303\257\303\261\303\262\303\263\303\264\303\265\303\266\303\270\303\271\303\272\303\273\303\274\303\275\303\277',
  'AAAAAAACEEEEIIIINOOOOOOUUUUYSaaaaaaaceeeeiiiinoooooouuuuyy')
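The octal escapes are simply the UTF-8 bytes of the accented letters. With literal characters the same idea is easier to read; an abbreviated sketch covering just the question's examples:
select translate('Århus, Ørsted', 'åøÅØ', 'aoAO');  -- returns 'Arhus, Orsted'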
Are you doing this just for indexing/sorting? If so, you could use this postgresql extension, which provides proper Unicode collation. The same group has a postgresql extension for doing normalization.