CREATE OR REPLACE FUNCTION normalize(input text, separator text DEFAULT '')
RETURNS text AS $$
BEGIN
RETURN translate(lower(public.f_unaccent(input)), ' '',:-`´‘’_' , separator);
END
$$ LANGUAGE 'plpgsql' IMMUTABLE;
When I execute this I get the following error. I tried dos2unix, but it didn't help:
ERROR: syntax error at or near "("
LINE 1: CREATE OR REPLACE FUNCTION normalize(input text, separator t...
As @Adrian commented, normalize is a reserved word in standard SQL. But it was allowed as a function name anyway until Postgres 13, where a system function of the same name was added. The release notes:
Add SQL functions NORMALIZE() to normalize Unicode strings, and IS NORMALIZED to check for normalization (Peter Eisentraut)
"normalize" changed its status to:
non-reserved (cannot be function or type)
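"Cannot be function" is exactly why the unquoted CREATE FUNCTION normalize(...) now fails. For illustration, here is the built-in that owns the name since Postgres 13 (a quick sketch, Postgres 13+ only):
SELECT normalize(U&'\0061\0328', NFC) = U&'\0105';  -- true: 'a' + combining ogonek composes to 'ą'
SELECT U&'\0061\0328' IS NFC NORMALIZED;            -- false: the sequence is not yet composed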
While at it, I suggest:
CREATE OR REPLACE FUNCTION f_normalize (input text, separator text DEFAULT '')
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
SELECT lower(public.f_unaccent(translate(input, $$',:-`´‘’_$$, separator)));
END;
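For example (assuming an f_unaccent() along the usual lines; results are illustrative only). Note that translate() pairs characters positionally: with a one-character separator, only the first character of the set (the single quote) is replaced and the rest are deleted:
SELECT f_normalize('Érik''s Café');       -- 'eriks cafe'
SELECT f_normalize('Érik''s Café', '_');  -- 'erik_s cafe'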
Most importantly, make it PARALLEL SAFE (because it is) or you may regret it. See:
When to mark functions as PARALLEL RESTRICTED vs PARALLEL SAFE?
And STRICT, since all functions used are strict themselves - assuming that holds for your f_unaccent().
BEGIN ATOMIC requires Postgres 14 or later. (Else make it a conventional SQL function.) See:
What does BEGIN ATOMIC mean in a PostgreSQL SQL function / procedure?
Also, since translate() is the cheapest operation, I would apply that first for a tiny overall gain.
Finally, if your f_unaccent() function is something like this, you might just fold the additional operations into a single function instead of creating another wrapper.
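For instance, if f_unaccent() is the typical IMMUTABLE wrapper around unaccent() with a pinned dictionary (an assumption - adapt to your actual function), the folded version might look like this sketch:
CREATE OR REPLACE FUNCTION f_normalize(input text, separator text DEFAULT '')
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
  -- call unaccent() with an explicit dictionary, as the usual wrapper does
  SELECT lower(public.unaccent('public.unaccent'::regdictionary,
                               translate(input, $$',:-`´‘’_$$, separator)));
END;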
Related
Is there any alternative for the FORMATMESSAGE() function in SQL?
Here is my scenario: I have a VARCHAR(2000) variable that takes a string and is formatted by FORMATMESSAGE(). Now the length of the variable has been changed to VARCHAR(8000), and here I cannot use FORMATMESSAGE(), because that function accepts at most 2,047 characters; if the message contains 2,048 or more characters, only the first 2,044 are displayed.
I'm planning to create a function with similar logic, but I'm curious to know if there is any alternative, or any other function that provides similar functionality.
Note: I cannot split the variable into many and use it.
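Since FORMATMESSAGE() is T-SQL, one hypothetical direction is to substitute the %s placeholders yourself with STUFF() and CHARINDEX(), which has no output cap below varchar(8000). A sketch only, assuming the format string uses nothing but %s markers (variable names are illustrative):
DECLARE @msg varchar(8000) = 'Order %s failed for customer %s';
DECLARE @p1 varchar(100) = '42', @p2 varchar(100) = 'ACME';

-- splice each parameter into the position of the next %s marker
SET @msg = STUFF(@msg, CHARINDEX('%s', @msg), 2, @p1);
SET @msg = STUFF(@msg, CHARINDEX('%s', @msg), 2, @p2);

SELECT @msg;  -- Order 42 failed for customer ACME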
When I use the fuzzystrmatch levenshtein function with diacritic characters it returns a wrong / multibyte-ignorant result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
(Note: the first character is an 'a' with a diacritic below; it was not rendered properly after I copied it here.)
The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns that:
At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
But as it does not name the levenshtein function, I was wondering if there is a multibyte aware version of levenshtein.
I know that I could use unaccent function as a workaround but I need to keep the diacritics.
Note: This solution was suggested by @Nick Barnes in his answer to a related question.
The 'a' with a diacritic is a character sequence, i.e. a combination of a and a combining character, the diacritic ̨ : E'a\u0328'
There is an equivalent precomposed character ą: E'\u0105'
A solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character before comparing them.
Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
create extension plpythonu;
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# compose combining sequences into precomposed (NFC) characters
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;
Now, as the character sequence E'a\u0328' is mapped onto the equivalent precomposed character E'\u0105' by using unicode_normalize, the levenshtein distance is correct:
select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein
-------------
1
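Side note: on Postgres 13 or later, the built-in normalize() mentioned in the first answer above makes the PL/Python detour unnecessary (a sketch):
SELECT levenshtein(normalize(E'a\u0328', NFC), 'x');  -- 1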
Are there any well-known PL functions/libraries for extending a PostgreSQL (9.4.1) database with URL encoding (also known as percent encoding) capabilities?
Here's an example of the intended functionality:
Input string: International donor day: give blood for a good cause!
Output string: International%20donor%20day%3A%20give%20blood%20for%20a%20good%20cause%21
I guess an alternative would be to roll my own implementation, since AFAIK there is currently no built-in way of doing this.
This is trivial to do in an external PL, e.g.:
CREATE LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION urlescape(original text)
RETURNS text LANGUAGE plpythonu AS $$
import urllib
# letters, digits, '_.-' and '/' are left as-is; everything else is percent-encoded
return urllib.quote(original)
$$ IMMUTABLE STRICT;
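With that in place, the example from the question behaves as intended (output as stated there):
SELECT urlescape('International donor day: give blood for a good cause!');
-- International%20donor%20day%3A%20give%20blood%20for%20a%20good%20cause%21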
My Postgres database encodes everything as UTF-8, but in a query when selecting a column I want to know if it can be encoded as Latin. I've no need to actually encode it as Latin but I need to know if can be encoded as Latin.
By Latin I mean what other people generally mean by Latin, i.e. characters that are recognisable to Western European speakers, e.g.:
SELECT val
FROM   table1
WHERE  is_latin(val);
Solution
I used the answer posted below. First I tried the Python function, but it failed because I don't have that language installed. Then I tried the PL/pgSQL function, and that failed because of a missing RETURN statement, but I fixed it as follows and it now works okay:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
RETURN true;
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
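A quick sanity check, using the strings that come up further down:
SELECT is_latin('a€bcáéíöâ');  -- t  (everything fits in ISO-8859-15)
SELECT is_latin('฿');          -- f  (Thai baht sign does not)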
Well, you need to be more specific about "latin".
Assuming you mean ISO-8859-15, typical for Western Europe:
regress=> SELECT convert_to('a€bcáéíöâ', 'iso-8859-15');
convert_to
----------------------
\x61a46263e1e9edf6e2
(1 row)
Beware, people often use iso-8859-1, but it doesn't support €.
However, you'll run into issues with currency symbols and other things that might typically appear in modern western european text. For example, ₽ isn't part of ISO-8859-15. Nor is ฿, ₡, ₹, and other major currencies. (Oddly, ¥ is in ISO-8859-15).
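For example, a character outside the target charset makes convert_to() raise an untranslatable_character error (message approximate):
SELECT convert_to('₽', 'iso-8859-15');
-- ERROR:  character with byte sequence 0xe2 0x82 0xbd in encoding "UTF8"
--         has no equivalent in encoding "LATIN9"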
If you want to test without an error you'll need to either use PL/Python or similar, or use PL/PgSQL and trap the exception.
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
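-- note: as posted, "RETURN true;" is missing here (see the corrected version above)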
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
regress=> SELECT is_latin('฿');
is_latin
----------
f
(1 row)
That creates a savepoint on every call, though, which can get expensive. So perhaps PL/Python is better. This one assumes the server_encoding is utf-8, which isn't wise, so it should really check that properly. Anyway:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpythonu
IMMUTABLE
AS $$
try:
    input.decode("utf-8").encode("iso-8859-15")
    return True
except UnicodeEncodeError:
    return False
$$;
Another option is to create a regular expression with a charset that matches all the chars you want to permit, but I suspect that'll be slow and ugly. Incomplete example:
SELECT 'ab฿cdé' ~ '^[a-zA-Z0-9.áéíóúÁÉÍÓÚàè]*$'
... where you'd probably use the iso-8859-15 encoding table to produce the character list.
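A rough sketch of that idea (approximate only: this class still admits ¤ ¦ ¨ ´ ¸ ¼ ½ ¾, which ISO-8859-15 dropped from Latin-1):
SELECT 'ab฿cdé'    ~ '^[\u0020-\u007e\u00a0-\u00ffŠšŽžŒœŸ€]*$';  -- false
SELECT 'a€bcáéíöâ' ~ '^[\u0020-\u007e\u00a0-\u00ffŠšŽžŒœŸ€]*$';  -- true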
I want to find rows where a text column begins with a user-given string, e.g. SELECT * FROM users WHERE name LIKE 'rob%', but "rob" is unvalidated user input. If the user writes a string containing a special pattern character like "rob_", it will match both "robert42" and "rob_the_man". I need to be sure that the string is matched literally. Do I need to handle the escaping at the application level, or is there a more beautiful way?
I'm using PostgreSQL 9.1 and go-pgsql for Go.
The _ and % characters have to be quoted to be matched literally in a LIKE statement, there's no way around it. The choice is about doing it client-side, or server-side (typically by using the SQL replace(), see below). Also to get it 100% right in the general case, there are a few things to consider.
By default, the quote character to use before _ or % is the backslash (\), but it can be changed with an ESCAPE clause immediately following the LIKE clause.
In any case, the quote character has to be repeated twice in the pattern to be matched literally as one character.
Example: ... WHERE field like 'john^%node1^^node2.uucp#%' ESCAPE '^' would match john%node1^node2.uucp# followed by anything.
There's a problem with the default choice of backslash: it's already used for other purposes when standard_conforming_strings is OFF (PG 9.1 has it ON by default, but previous versions being still in wide use, this is a point to consider).
Also, if the quoting for the LIKE wildcards is done client-side in a user-input injection scenario, it comes in addition to the normal string-quoting already necessary on user input.
A glance at a go-pgsql example shows that it uses $N-style placeholders for variables... So here's an attempt to write it in a somewhat generic way: it works with standard_conforming_strings both ON and OFF, uses server-side replacement of [%_], an alternative quote character, quoting of the quote character, and avoids SQL injection:
db.Query("SELECT * from USERS where name like replace(replace(replace($1,'^','^^'),'%','^%'),'_','^_') ||'%' ESCAPE '^'",
variable_user_input);
To escape the underscore and the percent so they are matched literally in a LIKE pattern, use the escape character (a single backslash in the literal, since standard_conforming_strings is ON by default in 9.1):
SELECT * FROM users WHERE name LIKE replace(replace(user_input, '_', '\_'), '%', '\%') || '%';
As far as I can tell, the only special characters with the LIKE operator are percent and underscore, and these can easily be escaped manually using backslash. It's not very beautiful but it works.
SELECT * FROM users WHERE name LIKE
regexp_replace('rob', '(%|_)', '\\\1', 'g') || '%';
I find it strange that no such function ships with PostgreSQL. Who wants their users to write their own patterns?
The best answer is that you shouldn't be interpolating user input into your SQL at all. Even escaping the SQL is still dangerous.
The following, which uses Go's database/sql library, illustrates a much safer way. Substitute the Prepare and Query calls with whatever your Go PostgreSQL library's equivalents are.
// The placeholder tells the database server that we will provide
// the LIKE parameter later, in the Query call.
sql := "SELECT * FROM users WHERE name LIKE ?"
// No escaping needed against SQL injection, since this is never
// interpolated into the SQL string. (Note that % and _ inside the
// user input would still act as LIKE wildcards, though.)
value := user_input + "%"
// Prepare the completely safe SQL string.
stmt, err := db.Prepare(sql)
// Now execute that SQL with a value for every placeholder.
rows, err := stmt.Query(value)
The benefit of this is that user input can be used safely without fear of it injecting SQL into the statements you run. You also get to reuse the prepared statement for multiple queries, which can be more efficient in certain cases.