In PostgreSQL 11.4, I noticed that the lower() and upper() functions do not work with some Turkish characters, e.g. 'İ' and 'ı'.
select lower('İ'); -- returns 'İ' instead of 'i'
select upper('ı'); -- returns 'ı' instead of 'I'
Although the functions work properly with the other Turkish characters, i.e. "ç, ö, ş, ğ, ü", there seems to be no proper workaround apart from replace() or stored procedures. So I am wondering whether this is a bug that will be fixed in the next version of PostgreSQL. If not, is there a smarter workaround for this problem, at least for now?
Note: I use Windows 10, but I am not sure if the problem is directly related to Windows.
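Until this is addressed upstream, one workaround sketch (sample strings made up by me) is to map the two problem characters by hand with translate(), which swaps characters one for one, and let lower()/upper() handle the rest:
select lower(translate('İSTANBUL ILICA', 'İI', 'iı')); -- 'istanbul ılıca'
select upper(translate('ıspanak izmir', 'ıi', 'Iİ')); -- 'ISPANAK İZMİR'
This works because 'İ' and 'ı' pass through lower()/upper() unchanged. If the server was built with ICU support (PostgreSQL 10+), a Turkish ICU collation may also give correct case mapping, e.g. lower('İ' COLLATE "tr-x-icu"), but whether that collation is available depends on the installation.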
Related
I have a GNU C++ project that uses the PostgreSQL API and for some reason, it strips spaces from the result of a certain query. Other environments (psql and pgAdmin) don't. The query is:
SELECT string_agg(my_varchar, ', ') FROM my_table;
Notice the space after the comma in the delimiter. Instead of 1046976, 1046977 being returned by PQgetvalue(), I get 1046976,1046977. Just for kicks, I tried changing the delimiter to silly things like string_agg(my_varchar, ',:) ' and string_agg(my_varchar, ', :)'. It doesn't strip the space if the space is in the middle of the delimiter.
Again, I don't have this problem if I do the same queries in db browsers like psql and pgAdmin; they don't strip the space in any of those queries.
Yes, I considered the possibility that the engine might be confused because the columns being aggregated are varchars while the data are 7-bit integers. I changed the query to something that is truly a varchar, but the spaces were still stripped.
Looking at https://www.postgresql.org/docs/9.4/static/functions-aggregate.html, I see that string_agg() expects its arguments to be text or bytea. Well, I never got an error, but to be sure, I tried string_agg(my_varchar::text, ', '::text). It didn't make a difference.
I don't know a great deal about this API, but it doesn't appear to connect to the db with any options, so I don't think there's much to say about the configuration.
I'm running this in GNU C++ v4.9.2 on Debian 8.10. The PostgreSQL engine and API are 9.4.
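One way to narrow this down (a sketch, using the names from the question) is to make the whitespace measurable on the SQL side, so the same query run through libpq reports a length that can't be silently trimmed:
SELECT quote_literal(string_agg(my_varchar, ', ')) AS agg_quoted,
       length(string_agg(my_varchar, ', ')) AS agg_len
FROM my_table;
If agg_len counts the space but the string printed from PQgetvalue() lacks it, the stripping happens on the C++ side after the result is fetched.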
I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaeresis) which I need to update to their Unicode-normalized form, e.g. the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301).
I found a solution on an old Postgres Nabble mailing list from 2009, using PL/Python:
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# Python 2 (plpythonu): the text argument arrives as a byte string, so decode first
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
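For what it's worth, on Python 3 builds of PL/Python the text argument already arrives as a Unicode string, so the decode step goes away; a minimal sketch, assuming plpython3u is installed:
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# Python 3: the argument is already a str, no decode needed
return unicodedata.normalize('NFC', str)
$$ LANGUAGE plpython3u;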
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The result of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it will only work if your browser or terminal doesn't normalize utf-8.)
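To check which form a given string is in without relying on copy & paste, you can inspect the raw UTF-8 bytes; the U& escapes below pin down the exact code points:
regress=> SELECT s, encode(convert_to(s, 'UTF8'), 'hex') AS bytes
regress->   FROM (VALUES (U&'\00E1'), (U&'\0061\0301')) t(s);
 s | bytes
---+--------
 á | c3a1
 á | 61cc81
(2 rows)
The precomposed form is the two bytes c3a1; the decomposed form carries the combining acute as the trailing cc81, the same byte pair that appears in the LATIN1 conversion error above.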
If you're using Firefox you might not see the above correctly; Chrome renders it correctly.
PostgreSQL 13 introduced the string function normalize ( text [, form ] ) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
without | with_norm
---------+-----------
f | t
(1 row)
Note that I expect this to bypass any indexes; using it blindly in a hot production query is therefore a recipe for disaster.
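If a production query does need this, an expression index over normalize() can restore index use, since normalize() is marked immutable; a sketch with hypothetical table and column names:
create index mytable_name_norm_idx on mytable (normalize(name));
select * from mytable where normalize(name) = normalize('päivää');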
Great news for us who have naively stored NFD filenames from Mac users in our databases.
I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written as ae instead of ä, so I can't simply translate one character into another.
Is there a more elegant way to do that? Currently it looks like this (not complete yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
While trying different commands I ran into another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
I tried it with U&'\0220', but it didn't replace anything. Only by using ü (for lowercase ü) was it replaced correctly. It has to do something with Unicode, but how can I solve this issue?
Kind regards from Germany. :)
Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
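If it does not match what your terminal actually sends, you can pin it for the session and address the character by code point instead of typing it (a sketch; U+00DC is the code point for Ü):
SET client_encoding = 'UTF8'; -- must match the terminal's actual encoding
SELECT replace('Übel', U&'\00DC', 'UE'); -- returns 'UEbel'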
And read these related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using PuTTY, for instance, one has to make sure the terminal agrees with the rest: Change Settings... -> Window -> Translation -> Remote character set = UTF-8.
As for your first question, you already have the best solution: a couple of umlauts are best replaced with a chain of replace() calls.
As you seem to know already, single-character replacements are more efficient with (a single) translate() call.
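Putting both together with the table from the question (the single-character mappings are made-up examples): one translate() pass for all one-to-one substitutions, nested replace() calls only for the one-to-many cases:
SELECT addr,
       replace(replace(replace(
           translate(addr, 'éèë', 'eee'),  -- 1:1 mappings in a single pass
           'ü', 'ue'), 'ä', 'ae'), 'ß', 'ss')  -- 1:n mappings
FROM   search;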
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string
Among other reasons, I decided to write the replacement in Python. As Erwin wrote before, there seems to be no better solution than combining replace() calls.
All in all it was pretty simple; no explicit encoding handling was even needed. My "final" solution now looks like this:
# characters to fold into their ASCII digraphs
ger_UE = "Ü"
ger_AE = "Ä"
ger_OE = "Ö"
ger_SS = "ß"
dk_AA = "Å"
dk_OE = "Ø"
dk_AE = "Æ"

# let the driver bind the values instead of %-formatting them into the SQL,
# which avoids quoting/injection problems
cur.execute("""SELECT addr,
                      REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(addr,
                          %s, 'UE'), %s, 'OE'), %s, 'AE'), %s, 'SS'),
                          %s, 'AA'), %s, 'OE'), %s, 'AE')
               FROM search WHERE x = '1';""",
            (ger_UE, ger_OE, ger_AE, ger_SS, dk_AA, dk_OE, dk_AE))
I am now curious about the speed once it hits the large table. If anyone would like to add annotations, they are very welcome.
When I run:
COPY con (date,kgs)
FROM 'H:Sir\\data\\reporting\\hi.rpt'
WITH DELIMITER ','
CSV HEADER
date AS 'Datum/Uhrzeit'
kgs AS 'Summe'
I get the error:
WARNING: nonstandard use of \\ in a string literal
LINE 2: FROM 'H:Sudhir\\Conair data\\TBreporting\\hi.txt'
^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
I've been having this problem for quite a while. Help?
It's not an error, it's just a warning. It has nothing to do with the file content, it's related to a PostgreSQL setting and the COPY command syntax you're using.
You're using PostgreSQL after 8.1 with standard_conforming_strings turned off - either a version before 9.1 (where it defaulted to off) or a newer version with it turned off manually.
This causes backslashes in strings, like bob\ted, to be interpreted as escapes, so that string becomes bob<tab>ed with a literal tab, as \t is the escape for a tab.
Interpreting strings like this is contrary to the SQL standard, which doesn't have C-style backslash escapes. Years ago the PostgreSQL team decided to switch to the SQL-standard way of doing things. For backward compatibility reasons it was done in two stages:
First, add the standard_conforming_strings option to use the SQL-standard interpretation of strings, but have it default to off; issue warnings when the non-standard PostgreSQL string interpretation is used; and add the E'string' syntax so applications can explicitly request escape processing in strings.
Then, a few releases later, turn standard_conforming_strings on by default, once people had updated and fixed the warnings their applications produced. Supposedly.
The escape for \ is \\. So "doubling" the backslashes like you (or the tool you're using) have done is correct. PostgreSQL shows a warning because it can't know whether by H:Sir\\data\\reporting\\hi.rpt you meant literally H:Sir\\data\\reporting\\hi.rpt (as the SQL spec says) or H:Sir\data\reporting\hi.rpt (as PostgreSQL used to interpret it, against the standard).
Thus there's nothing wrong with your query. If you want to get rid of the warning, either turn standard_conforming_strings on, or add an explicit E'' prefix to your string.
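Concretely, either of these forms should run without the warning (path reused from the question; the option syntax assumes PostgreSQL 9.0 or later):
-- escape-string syntax: \\ explicitly collapses to a single backslash
COPY con (date, kgs) FROM E'H:Sir\\data\\reporting\\hi.rpt' WITH (FORMAT csv, HEADER);
-- standard-conforming strings: backslashes are literal, so write them once
SET standard_conforming_strings = on;
COPY con (date, kgs) FROM 'H:Sir\data\reporting\hi.rpt' WITH (FORMAT csv, HEADER);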
I have a PostgreSQL 8.4 database that is created with the da_DK.utf8 locale.
dbname=> show lc_collate;
lc_collate
------------
da_DK.utf8
(1 row)
When I select from a table and order by a character varying column, I get what I consider strange behaviour: when ordering the result, PostgreSQL ignores dashes that prefix the value, e.g.:
select name from mytable order by name asc;
May return something like
name
----------------
Ad...
Ae...
Ag...
- Ak....
At....
The dash prefix seems to be ignored.
I can fix this issue by converting the column to latin1 when ordering:
select name from mytable order by convert_to(name, 'latin1') asc;
Then I get the expected result:
name
----------------
- Ak....
Ad...
Ae...
Ag...
At....
Why does the dash prefix get ignored by default? Can that behavior be changed?
This is because the da_DK.utf8 locale defines it this way. Linux locale-aware utilities, for example sort, work like this as well.
Your convert_to(name, 'latin1') will break if it encounters a character which is not in the Latin-1 character set, for example €, so it isn't a good workaround.
You can use order by convert_to(name, 'SQL_ASCII'), which ignores the locale-defined sort and simply uses byte values.
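On PostgreSQL 9.1 and later (not the 8.4 in the question), a per-expression collation does the same thing more cleanly; "C" sorts by character code, so the dash-prefixed names come first:
select name from mytable order by name collate "C" asc;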
Ugly hack edit:
order by
(
ascii(name) between ascii('a') and ascii('z')
or ascii(name) between ascii('A') and ascii('Z')
or ascii(name)>127
),
name;
This sorts anything that starts with an ASCII non-letter first. It is very ugly, because sorting deeper into the string will still behave strangely, but it may be good enough for you.
A workaround that will work in my specific case is to replace dashes with exclamation points. I happen to know that I will never get exclamation points, and they sort before any letters or digits.
select name from mytable order by translate(name, '-', '!') asc
It will certainly affect performance, so I may look into creating a special column just for sorting, but I really don't like that either...
I don't know the ordering rules for Danish, but for Polish, special characters like spaces and dashes are not "counted" in sorting in most dictionaries. Some good sort routines do the same and ignore such special characters. Probably Danish has a similar rule, and this rule is implemented by the Ubuntu locale-aware sort function.