Are there any well-known PL functions/libraries for extending a PostgreSQL (9.4.1) database with URL encoding (also known as percent encoding) capabilities?
Here's an example of the intended functionality:
Input string: International donor day: give blood for a good cause!
Output string: International%20donor%20day%3A%20give%20blood%20for%20a%20good%20cause%21
I guess an alternative would be to roll my own implementation, since AFAIK there is currently no built-in way of doing this.
This is trivial to do in an external PL, e.g.:
CREATE LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION urlescape(original text) RETURNS text LANGUAGE plpythonu AS $$
import urllib
return urllib.quote(original)
$$
IMMUTABLE STRICT;
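If installing an external PL is not an option, the same thing can be done in plain SQL. The following is a sketch of my own (the function name urlencode and the RFC 3986 unreserved character set are my assumptions, not part of the answer above); it needs Postgres 9.4+ for WITH ORDINALITY:
CREATE OR REPLACE FUNCTION urlencode(input text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT coalesce(string_agg(
         CASE WHEN ch ~ '[A-Za-z0-9_.~-]'   -- RFC 3986 unreserved characters pass through
              THEN ch
              ELSE upper(regexp_replace(encode(convert_to(ch, 'UTF8'), 'hex'),
                                        '(..)', E'%\\1', 'g'))   -- hex bytes -> %XX
         END, '' ORDER BY ord), '')
FROM   unnest(string_to_array(input, NULL)) WITH ORDINALITY AS t(ch, ord);
$func$;
SELECT urlencode('International donor day: give blood for a good cause!');
should then produce the output shown in the question.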
CREATE OR REPLACE FUNCTION normalize(input text, separator text DEFAULT '')
RETURNS text AS $$
BEGIN
RETURN translate(lower(public.f_unaccent(input)), ' '',:-`´‘’_' , separator);
END
$$ LANGUAGE 'plpgsql' IMMUTABLE;
When I execute it I get the following error. I tried dos2unix, but it didn't help:
ERROR: syntax error at or near "("
LINE 1: CREATE OR REPLACE FUNCTION normalize(input text, separator t...
As @Adrian commented, normalize is a reserved word in standard SQL. But it used to be allowed as a function name anyway, until Postgres 13 added a system function of the same name. The release notes:
Add SQL functions NORMALIZE() to normalize Unicode strings, and IS NORMALIZED to check for normalization (Peter Eisentraut)
"normalize" changed its status to:
non-reserved (cannot be function or type)
While we're at it, I suggest:
CREATE OR REPLACE FUNCTION f_normalize (input text, separator text DEFAULT '')
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
SELECT lower(public.f_unaccent(translate(input, $$',:-`´‘’_$$, separator)));
END;
Most importantly, make it PARALLEL SAFE (because it is) or you may regret it. See:
When to mark functions as PARALLEL RESTRICTED vs PARALLEL SAFE?
And STRICT, since all functions used are strict themselves - assuming that holds for your f_unaccent().
BEGIN ATOMIC requires Postgres 14 or later. (Else make it a conventional SQL function.) See:
What does BEGIN ATOMIC mean in a PostgreSQL SQL function / procedure?
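For Postgres 13 or older, that conventional SQL function could look like this - a sketch with the same body, just wrapped in a dollar-quoted string instead of BEGIN ATOMIC:
CREATE OR REPLACE FUNCTION f_normalize(input text, separator text DEFAULT '')
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT lower(public.f_unaccent(translate(input, $$',:-`´‘’_$$, separator)));
$func$;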
Also, since translate() is the cheapest operation, I would apply that first for a tiny overall gain.
Finally, if your f_unaccent() function looks something like this, you might just fold the additional operations into a single function instead of creating another wrapper.
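Assuming your f_unaccent() is the common IMMUTABLE SQL wrapper around unaccent() with an explicit dictionary (an assumption on my part), the folded-in version might look like this sketch (again Postgres 14+ for BEGIN ATOMIC):
CREATE OR REPLACE FUNCTION f_normalize(input text, separator text DEFAULT '')
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
SELECT lower(public.unaccent('public.unaccent'::regdictionary,
                             translate(input, $$',:-`´‘’_$$, separator)));
END;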
When I use the fuzzystrmatch levenshtein function with diacritic characters it returns a wrong / multibyte-ignorant result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
(Note: the first character is an 'a' with a diacritic below; it is not rendered properly after I copied it here.)
The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns that:
At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
But as it does not name the levenshtein function, I was wondering if there is a multibyte-aware version of levenshtein.
I know that I could use the unaccent function as a workaround, but I need to keep the diacritics.
Note: This solution was suggested by @Nick Barnes in his answer to a related question.
The 'a' with a diacritic is a character sequence, i.e. a combination of a and a combining character, the diacritic ̨ : E'a\u0328'
There is an equivalent precomposed character ą: E'\u0105'
A solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character before comparing them.
Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
create extension plpythonu;
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;
Now, as the character sequence E'a\u0328' is mapped onto the equivalent precomposed character E'\u0105' by using unicode_normalize, the levenshtein distance is correct:
select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein
-------------
1
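(Side note: on Postgres 13 or later, with a UTF8 server encoding, the built-in normalize() function covered further below removes the need for the PL/Python helper entirely. A minimal sketch:)
-- Postgres 13+: built-in NFC normalisation, no PL/Python required
select levenshtein(normalize(E'a\u0328', NFC), 'x');  -- returns 1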
I am trying to convert Danish national characters to Unicode. Is there a function in PL/SQL, or a parameter to a PL/SQL function, that can help me? I tried select convert('Æ, æ:,Ø, ø:,Å, å:','AL32UTF8') from dual; but it doesn't help. As a workaround I used something like this in my code:
w_temp := replace('Æ, æ:,Ø, ø:,Å, å:','å','\u00E5');
w_temp := replace(w_temp,'Å','\u00C5');
w_temp := replace(w_temp,'æ','\u00E6');
w_temp := replace(w_temp,'Æ','\u00C6');
w_temp := replace(w_temp,'ø','\u00F8');
w_temp := replace(w_temp,'Ø','\u00D8');
but this method is pure monkey work, and my code is not prepared for any other national characters. Do you have any suggestions?
The CONVERT() function can be used as follows: CONVERT('fioajfiohawiofh', <DESTINATION_CHARSET>, <ORIGIN_CHARSET>) - note that the destination character set comes first.
I don't know your charset, but you can try finding a useful one using this SELECT:
SELECT
CONVERT('Æ, æ:,Ø, ø:,Å, å:',cs.value,'UTF8') AS conv
,cs.value
,cs.isdeprecated
FROM
V$NLS_VALID_VALUES cs
WHERE
cs.parameter = 'CHARACTERSET'
;
I'm not sure what the big picture is, but assuming that you currently have your data in a database with a single character set that supports your diacritics, I would rather use a completely different approach:
export the needed data from your database using the existing character set
recreate the database with a unicode character set
(very likely) change the definitions and install all database objects with CHAR instead of BYTE semantics (see the sketch after this list)
import all the data into the new database
Clearly there are a lot of details to sort out, but having Oracle properly convert the character sets during import seems to be the only reasonable way to go.
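To illustrate the CHAR-vs-BYTE point, a small sketch with a hypothetical table (not from the original post): in an AL32UTF8 database, length limits should count characters rather than bytes, or multibyte Danish characters will overflow columns that used to fit.
-- Session-wide default, so plain VARCHAR2(n) means n characters:
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
-- Or spelled out per column:
CREATE TABLE places (
  name VARCHAR2(100 CHAR)   -- 100 characters, regardless of byte length
);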
My Postgres database encodes everything as UTF-8, but in a query when selecting a column I want to know if it can be encoded as Latin. I've no need to actually encode it as Latin but I need to know if can be encoded as Latin.
By Latin I mean what other people generally mean by Latin, i.e. characters recognisable to Western European speakers,
e.g.:
SELECT val
FROM
TABLE1
WHERE IS_LATIN(Val);
Solution
I used the answer posted below. First I tried the Python function, but it failed because I don't have that language installed. Then I tried the PL/pgSQL function, and that failed because of a missing RETURN statement, but I fixed it as follows and it now works okay:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
RETURN true;
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
Well, you need to be more specific about "latin".
Assuming you mean ISO-8859-15, typical for Western Europe:
regress=> SELECT convert_to('a€bcáéíöâ', 'iso-8859-15');
convert_to
----------------------
\x61a46263e1e9edf6e2
(1 row)
Beware, people often use iso-8859-1, but it doesn't support €.
However, you'll run into issues with currency symbols and other things that might typically appear in modern western european text. For example, ₽ isn't part of ISO-8859-15. Nor is ฿, ₡, ₹, and other major currencies. (Oddly, ¥ is in ISO-8859-15).
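A quick way to see the difference between the two encodings (error text approximate, quoted from memory):
SELECT convert_to('€', 'iso-8859-15');  -- works: returns \xa4
SELECT convert_to('€', 'iso-8859-1');   -- ERROR: ... has no equivalent in "LATIN1"
SELECT convert_to('₽', 'iso-8859-15');  -- ERROR: the rouble sign is in neither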
If you want to test without an error you'll need to either use PL/Python or similar, or use PL/PgSQL and trap the exception.
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
regress=> SELECT is_latin('฿');
is_latin
----------
f
(1 row)
That creates a savepoint on every call, though, which can get expensive. So perhaps PL/Python is better. This one makes an assumption about the server_encoding (assuming it is utf-8) which isn't wise, so it should really check that properly. Anyway:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpythonu
IMMUTABLE
AS $$
try:
    input.decode("utf-8").encode("iso-8859-1")
    return True
except UnicodeEncodeError:
    return False
$$;
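As a sketch of the "check it properly" part mentioned above - my own addition, not part of the original answer - the function could verify the server encoding before relying on it (using iso-8859-15 here, in line with the earlier discussion):
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpythonu
IMMUTABLE
AS $$
# Guard against the utf-8 assumption discussed above
enc = plpy.execute("SELECT current_setting('server_encoding') AS enc")[0]["enc"]
if enc != "UTF8":
    plpy.error("is_latin() assumes a UTF8 server encoding, not %s" % enc)
try:
    input.decode("utf-8").encode("iso-8859-15")
    return True
except UnicodeEncodeError:
    return False
$$;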
Another option is to create a regular expression with a charset that matches all the chars you want to permit, but I suspect that'll be slow and ugly. Incomplete example:
SELECT 'ab฿cdé' ~ '^[a-zA-Z0-9.áéíóúÁÉÍÓÚàè]*$'
... where you'd probably use the iso-8859-15 encoding table to produce the character list.
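If you do go the regex route, one way to produce that character list is to let Postgres enumerate the ISO-8859-15 code page itself rather than typing it out by hand. A sketch (skips ASCII control characters and the C1 range):
SELECT string_agg(convert_from(set_byte('\x00'::bytea, 0, b), 'iso-8859-15'),
                  '' ORDER BY b) AS latin9_chars
FROM   generate_series(32, 255) AS b
WHERE  b < 127 OR b >= 160;   -- printable ASCII plus the Latin-9 high half
You would still have to escape regex metacharacters such as ], \ and - before dropping the result into a bracket expression.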
I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaereses) which I need to update to their Unicode normalized form, e.g. the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301).
I have found a solution on an old Postgres Nabble mailing list from 2009, using PL/Python:
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The result of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal don't normalize utf-8).
If you're using Firefox you might not see the above correctly; Chrome renders it correctly. (The original answer included a screenshot of what you should see if your browser handles decomposed Unicode correctly.)
PostgreSQL 13 introduced the string function normalize(text [, form ]) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
without | with_norm
---------+-----------
f | t
(1 row)
Note that I expect this to bypass any indexes, so using it blindly in a hot production query is a recipe for disaster.
Great news for us who have naively stored NFD filenames from Mac users in our databases.
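Regarding the index concern: an expression index on the normalized value lets such queries use an index again. A sketch, with made-up table and column names:
-- Hypothetical table of filenames stored in mixed NFC/NFD form
CREATE INDEX files_name_nfc_idx ON files ((normalize(name, NFC)));
-- The query must use the same expression for the planner to match the index
SELECT *
FROM   files
WHERE  normalize(name, NFC) = normalize('päivää', NFC);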