PL/SQL: convert national characters to Unicode

I am trying to convert Danish national characters to Unicode escape sequences. Is there a function in PL/SQL, or a parameter to a PL/SQL function, that can help me? I tried select convert('Æ, æ:,Ø, ø:,Å, å:','AL32UTF8') from dual; but it doesn't help. As a workaround I used something like this in my code:
w_temp := replace('Æ, æ:,Ø, ø:,Å, å:','å','\u00E5');
w_temp := replace(w_temp,'Å','\u00C5');
w_temp := replace(w_temp,'æ','\u00E6');
w_temp := replace(w_temp,'Æ','\u00C6');
w_temp := replace(w_temp,'ø','\u00F8');
w_temp := replace(w_temp,'Ø','\u00D8');
but this method is monkey work, and my code is not prepared for any other national characters. Do you have any suggestions?
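A less manual alternative (a sketch, not from the original thread): Oracle's built-in ASCIISTR() converts every non-ASCII character to a \XXXX escape, and REGEXP_REPLACE can reshape that into the \uXXXX form produced by hand above:
SELECT REGEXP_REPLACE(
         ASCIISTR('Æ, æ:,Ø, ø:,Å, å:'),   -- yields e.g. '\00C6, \00E6:,...'
         '\\([0-9A-F]{4})',               -- a backslash plus four hex digits
         '\\u\1'                          -- rewrite as \uXXXX
       ) AS unicode_escaped
FROM dual;
This covers any national character, not just the six Danish ones. One caveat: ASCIISTR passes a literal backslash in the input through unchanged, so input that may contain backslashes needs extra care.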

The CONVERT() function can be used as follows: CONVERT('fioajfiohawiofh', <DESTINATION_CHARSET>, <SOURCE_CHARSET>). Note that the destination character set comes before the source.
I don't know your character set, but you can try to find a useful one with this SELECT:
SELECT
  CONVERT('Æ, æ:,Ø, ø:,Å, å:', 'UTF8', cs.value) AS conv
 ,cs.value
 ,cs.isdeprecated
FROM
  V$NLS_VALID_VALUES cs
WHERE
  cs.parameter = 'CHARACTERSET'
;

I'm not sure what the big picture is, but assuming that you currently have your data in a database whose (non-Unicode) character set supports your diacritics, I would rather use a completely different approach:
export the needed data from your database using the existing character set
recreate the database with a unicode character set
(very likely) change the definitions and install all database objects with CHAR instead of BYTE length semantics
import all the data into the new database
Clearly there are a lot of details to sort out, but letting Oracle convert the character sets properly during the import seems to be the only reasonable way to go.
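To illustrate the CHAR-versus-BYTE point (a hedged sketch; the table and column names are made up): with BYTE semantics a VARCHAR2(10) holds 10 bytes, which multi-byte UTF-8 characters exhaust early, while CHAR semantics counts characters:
-- session-wide default, so objects created afterwards use CHAR semantics
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
-- or explicitly per column
CREATE TABLE places (
  place_name VARCHAR2(100 CHAR)   -- 100 characters, not 100 bytes
);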

Related

How can I use tsvector on a string with numbers?

I would like to use a postgres tsquery on a column that has strings that all contain numbers, like this:
FRUIT-239476234
If I try to make a tsquery out of this:
select to_tsquery('FRUIT-239476234');
What I get is:
'fruit' & '-239476234'
I want to be able to search by just the numeric portion of this value like so:
239476234
It seems that it is unable to match this because it is interpreting my hyphen as a "negative sign" and doesn't think 239476234 matches -239476234. How can I tell postgres to treat all of my characters as text and not try to be smart about numbers and hyphens?
An answer from the future: once version 13 of PostgreSQL is released, you will be able to use the dict_int module to do this.
create extension dict_int ;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 100, ABSVAL=true);
ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR int WITH intdict;
select to_tsquery('FRUIT-239476234');
to_tsquery
-----------------------
'fruit' & '239476234'
But you would probably be better off creating your own TEXT SEARCH DICTIONARY as well as copying the 'english' CONFIGURATION and modifying the copy, rather than modifying the default ones in place. Otherwise you run the risk that an upgrade will silently lose your changes.
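A sketch of that safer route, assuming dict_int installs its template as intdict_template (as the PostgreSQL docs describe) and using made-up names my_intdict and my_english:
CREATE TEXT SEARCH DICTIONARY my_intdict (
  TEMPLATE = intdict_template,   -- template provided by the dict_int extension
  MAXLEN = 100, ABSVAL = true
);
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION my_english
  ALTER MAPPING FOR int WITH my_intdict;
SELECT to_tsquery('my_english', 'FRUIT-239476234');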
If you don't want to wait for v13, you could back-patch this change and compile into your own version of the extension for a prior server.
This is done by the text search parser, which is not configurable (short of writing your own parser in C, which is supported).
The simplest solution is to pre-process all search strings by replacing - with a space.
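A minimal sketch of that pre-processing done in SQL itself; plainto_tsquery then supplies the & operators that to_tsquery would otherwise require between tokens:
SELECT plainto_tsquery('english', replace('FRUIT-239476234', '-', ' '));
-- 'fruit' & '239476234'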

How to identify if a value in column can be encoded to Latin in Postgres

My Postgres database encodes everything as UTF-8, but when selecting a column in a query I want to know whether it can be encoded as Latin. I have no need to actually encode it as Latin; I just need to know whether it could be.
By Latin I mean what people generally mean by Latin, i.e. characters that are recognisable to Western European readers.
e.g.
SELECT val
FROM
TABLE1
WHERE IS_LATIN(Val);
Solution
I used the answer posted below. First I tried the Python function, but it failed because I don't have that language installed. Then I tried the PL/pgSQL function, and that failed because of a missing RETURN statement, but I fixed it as follows and it now works okay:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
RETURN true;
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
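With that in place, the query sketched in the question works as hoped:
SELECT val
FROM table1
WHERE is_latin(val);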
Well, you need to be more specific about "latin".
Assuming you mean ISO-8859-15, typical for Western Europe:
regress=> SELECT convert_to('a€bcáéíöâ', 'iso-8859-15');
convert_to
----------------------
\x61a46263e1e9edf6e2
(1 row)
Beware, people often use iso-8859-1, but it doesn't support €.
However, you'll run into issues with currency symbols and other things that typically appear in modern Western European text. For example, ₽ isn't part of ISO-8859-15. Nor are ฿, ₡, ₹, and other major currency symbols. (Oddly, ¥ is in ISO-8859-15.)
If you want to test without an error you'll need to either use PL/Python or similar, or use PL/PgSQL and trap the exception.
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
regress=> SELECT is_latin('฿');
is_latin
----------
f
(1 row)
That creates a savepoint on every call, though, which can get expensive. So perhaps PL/Python is better. This one makes an assumption about the server_encoding (assuming it is utf-8) which isn't wise, so it should really check that properly. Anyway:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpythonu
IMMUTABLE
AS $$
try:
    input.decode("utf-8").encode("iso-8859-15")
    return True
except UnicodeEncodeError:
    return False
$$;
Another option is to create a regular expression with a charset that matches all the chars you want to permit, but I suspect that'll be slow and ugly. Incomplete example:
SELECT 'ab฿cdé' ~ '^[a-zA-Z0-9.áéíóúÁÉÍÓÚàè]*$'
... where you'd probably use the iso-8859-15 encoding table to produce the character list.
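For completeness, a hedged sketch of that regex approach using code-point ranges instead of listing every character; note this approximates Latin-1 rather than ISO-8859-15 exactly, since the two differ in a handful of positions (€ among them):
SELECT 'ab฿cdé' ~ '^[\u0020-\u007e\u00a0-\u00ff]*$';  -- false: ฿ is outside the range
SELECT 'abcdé' ~ '^[\u0020-\u007e\u00a0-\u00ff]*$';   -- true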

Unicode normalization in Postgres

I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaeresis) which I need to update to their Unicode-normalized form, e.g. the precomposed code point U+00E1 for á instead of the decomposed sequence U+0061 U+0301.
I found a solution on an old Postgres Nabble mailing-list thread from 2009, using PL/Python:
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The result of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy and paste, and it'll only work if your browser or terminal doesn't normalize UTF-8.)
If you're using Firefox you might not see the above correctly; Chrome renders it correctly. (The original answer included a screenshot showing how the comparison should render when decomposed Unicode is handled correctly.)
PostgreSQL 13 introduced the string function normalize(text [, form]) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm;
 without | with_norm
---------+-----------
 f       | t
(1 row)
Note that I expect this to miss any indexes, so using it blindly in a hot production query is a recipe for disaster.
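One way around that (a sketch; places and name are hypothetical names): normalize the stored data once, so queries compare already-normalized values and plain indexes keep working:
UPDATE places
SET name = normalize(name, NFC)
WHERE name IS NOT NFC NORMALIZED;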
Great news for us who have naively stored NFD filenames from Mac users in our databases.

Replace characters with multi-character strings

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written as ae instead of ä, so I can't simply translate one character to another.
Is there a more elegant way to do that? Currently it looks like this (not complete yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
While trying different commands I ran into another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
I tried it with U&'\0220', but it didn't replace anything. Only ü (the literal lowercase ü) was replaced correctly. It has something to do with Unicode, but how do I solve this issue?
Kind regards from Germany. :)
Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
And read these related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using PuTTY, for instance, one has to make sure the terminal agrees with the rest: Change Settings... -> Window -> Translation -> Remote character set = UTF-8.
As for your first question, you already have the best solution. A couple of umlauts are best replaced with a string of replace() statements.
As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string
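To make the distinction concrete (a small sketch): translate() handles only one-to-one character swaps, while one-to-many substitutions need chained replace() calls:
-- one-to-one: translate() maps each character to exactly one replacement
SELECT translate('Grüße aus Köln', 'üö', 'uo');  -- 'Gruße aus Koln'
-- one-to-many: umlauts and ß expand to two letters, so chain replace()
SELECT replace(replace(replace('Grüße aus Köln', 'ü', 'ue'), 'ö', 'oe'), 'ß', 'ss');
-- 'Gruesse aus Koeln'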
Among other reasons, I decided to write the replacement in Python. As Erwin wrote above, it seems there is no better solution than combining replace() commands.
It is pretty simple in general; no explicit encoding handling even had to be used. My "final" solution now looks like this:
ger_UE="Ü"
ger_AE="Ä"
ger_OE="Ö"
ger_SS="ß"
dk_AA="Å"
dk_OE="Ø"
dk_AE="Æ"
cur.execute("""Select addr, REPLACE (REPLACE (REPLACE( REPLACE (REPLACE (REPLACE (REPLACE(addr, '%s','UE'),'%s','OE'),'%s','AE'),'%s','SS'),'%s','AA'),'%s','OE'),'%s','AE')
from search WHERE x = '1';"""%(ger_UE,ger_OE,ger_AE,ger_SS,dk_AA,dk_OE,dk_AE))
I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.

How to escape string while matching pattern in PostgreSQL

I want to find rows where a text column begins with a user-given string, e.g. SELECT * FROM users WHERE name LIKE 'rob%', but "rob" is unvalidated user input. If the user writes a string containing a special pattern character like "rob_", it will match both "robert42" and "rob_the_man". I need to be sure that the string is matched literally; how would I do that? Do I need to handle the escaping at the application level, or is there a more elegant way?
I'm using PostgreSQL 9.1 and go-pgsql for Go.
The _ and % characters have to be quoted to be matched literally in a LIKE expression; there's no way around it. The choice is between doing it client-side or server-side (typically by using SQL's replace(), see below). And to get it 100% right in the general case, there are a few things to consider.
By default, the quote character to use before _ or % is the backslash (\), but it can be changed with an ESCAPE clause immediately following the LIKE clause.
In any case, the quote character has to be repeated twice in the pattern to be matched literally as one character.
Example: ... WHERE field like 'john^%node1^^node2.uucp#%' ESCAPE '^' would match john%node1^node2.uucp# followed by anything.
There's a problem with the default choice of backslash: it's already used for other purposes when standard_conforming_strings is OFF (PG 9.1 has it ON by default, but previous versions being still in wide use, this is a point to consider).
Also, if the quoting for the LIKE wildcard is done client-side in a user-input injection scenario, it comes in addition to the normal string-quoting already necessary on user input.
A glance at a go-pgsql example tells that it uses $N-style placeholders for variables... So here's an attempt to write it in a somewhat generic way: it works with standard_conforming_strings both ON and OFF, uses server-side replacement of [%_], an alternative quote character, quoting of the quote character, and avoids SQL injection:
db.Query("SELECT * from USERS where name like replace(replace(replace($1,'^','^^'),'%','^%'),'_','^_') ||'%' ESCAPE '^'",
variable_user_input);
To escape the underscore and the percent sign so they can be used in a LIKE pattern, use the escape character:
SELECT * FROM users WHERE name LIKE replace(replace(user_input, '_', '\\_'), '%', '\\%');
As far as I can tell, the only special characters with the LIKE operator are percent and underscore, and these can easily be escaped manually using backslash. It's not very beautiful but it works.
SELECT * FROM users WHERE name LIKE
regexp_replace('rob', '(%|_)', '\\\1', 'g') || '%';
I find it strange that no such function ships with PostgreSQL. Who wants their users to write their own patterns?
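Rolling your own is straightforward, though. A sketch (the function name quote_like is made up; assumes standard_conforming_strings is on): escape the escape character itself first, then the two wildcards:
CREATE OR REPLACE FUNCTION quote_like(text)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
  SELECT replace(replace(replace($1, '\', '\\'), '%', '\%'), '_', '\_');
$$;
-- usage: literal prefix search
SELECT * FROM users WHERE name LIKE quote_like('rob_') || '%';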
The best answer is that you shouldn't be interpolating user input into your SQL at all. Even escaping the SQL is still dangerous.
The following, which uses Go's database/sql library, illustrates a much safer way. Substitute the Prepare and Exec calls with your Go PostgreSQL library's equivalents.
// The question mark tells the database server that we will provide
// the LIKE parameter later in the Exec call
sql := "SELECT * FROM users where name LIKE ?"
// No need to escape for SQL injection, since this value is never interpolated
// into the SQL string. (LIKE wildcards in user_input would still need escaping
// for a literal match, as discussed above.)
value := user_input + "%"
// prepare the completely safe sql string.
stmt, err := db.Prepare(sql)
// Now execute that SQL with a value for every occurrence of the question mark.
result, err := stmt.Exec(value)
The benefits of this are that user input can safely be used without fear of it injecting sql into the statements you run. You also get the benefit of reusing the prepared sql for multiple queries which can be more efficient in certain cases.
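For reference, the same idea expressed in plain SQL, as a sketch combining a prepared statement with the server-side wildcard escaping shown earlier:
PREPARE prefix_search(text) AS
  SELECT * FROM users
  WHERE name LIKE replace(replace(replace($1, '\', '\\'), '%', '\%'), '_', '\_') || '%';
EXECUTE prefix_search('rob_');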