multi byte character issue in Redshift - postgresql

I am unable to convert multibyte characters in Redshift.
create table temp2 (city varchar);
insert into temp2 values('г. красноярск'); // lower value
insert into temp2 values('Г. Красноярск'); //upper value
select * from temp2 where city ilike 'Г. Красноярск'
city
-------------
Г. Красноярск
I tried like below, UTF-8 characters are converting into lower.
select lower('Г. Красноярск')
lower
-------------
г. красноярск
In vertica it is working fine with using lowerb() function.

Internally the LIKE and ILIKE operators use PostgreSQL's regular expression support.
Support for proper handling of utf-8 multibyte chars in regular expressions was added in PostgreSQL 9.2. Redshift is based on PostgreSQL 8.2 (?) and it looks like they haven't backported that support into their forked version.
See Postgresql regex to match uppercase, Unicode-aware
You can work around this, with limitations, by using LIKE lower('Г. Красноярск') instead. An expression index may be useful.

Related

How would I diagnose what error seems to lead to non-functional underscore wildcard queries in Postgresql 15?

I am working through a quick refresher ('SQL Handbook' by Flavio Copes), and any LIKE or ILIKE query I use with the underscore wildcard returns no results.
The table is created as such:
CREATE TABLE people (
names CHAR(20)
);
INSERT INTO people VALUES ('Joe'), ('John'), ('Johanna'), ('Zoe');
Given this table, I use the following query:
SELECT * FROM people WHERE names LIKE '_oe';
I expect it to return
names
1
Joe
2
Zoe
Instead, it returns
names
The install is PostgreSQL 15 (x64), pgAdmin 4, and PostGIS v3.3.1
Using char(20) means all strings are exactly 20 chars long, being padded with spaces out to that length. The spaces make it not match the pattern, as there is nothing in the pattern to accommodate spaces at the end.
If you make the pattern be '_oe%' it would work. Or better yet, don't use char(20).

Alphanumeric sorting without any pattern on the strings [duplicate]

I've got a Postgres ORDER BY issue with the following table:
em_code name
EM001 AAA
EM999 BBB
EM1000 CCC
To insert a new record to the table,
I select the last record with SELECT * FROM employees ORDER BY em_code DESC
Strip alphabets from em_code usiging reg exp and store in ec_alpha
Cast the remating part to integer ec_num
Increment by one ec_num++
Pad with sufficient zeors and prefix ec_alpha again
When em_code reaches EM1000, the above algorithm fails.
First step will return EM999 instead EM1000 and it will again generate EM1000 as new em_code, breaking the unique key constraint.
Any idea how to select EM1000?
Since Postgres 9.6, it is possible to specify a collation which will sort columns with numbers naturally.
https://www.postgresql.org/docs/10/collation.html
-- First create a collation with numeric sorting
CREATE COLLATION numeric (provider = icu, locale = 'en#colNumeric=yes');
-- Alter table to use the collation
ALTER TABLE "employees" ALTER COLUMN "em_code" type TEXT COLLATE numeric;
Now just query as you would otherwise.
SELECT * FROM employees ORDER BY em_code
On my data, I get results in this order (note that it also sorts foreign numerals):
Value
0
0001
001
1
06
6
13
۱۳
14
One approach you can take is to create a naturalsort function for this. Here's an example, written by Postgres legend RhodiumToad.
create or replace function naturalsort(text)
returns bytea language sql immutable strict as $f$
select string_agg(convert_to(coalesce(r[2], length(length(r[1])::text) || length(r[1])::text || r[1]), 'SQL_ASCII'),'\x00')
from regexp_matches($1, '0*([0-9]+)|([^0-9]+)', 'g') r;
$f$;
Source: http://www.rhodiumtoad.org.uk/junk/naturalsort.sql
To use it simply call the function in your order by:
SELECT * FROM employees ORDER BY naturalsort(em_code) DESC
The reason is that the string sorts alphabetically (instead of numerically like you would want it) and 1 sorts before 9.
You could solve it like this:
SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;
It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
Answer to question in comment
To strip any and all non-digits from a string:
SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;
\D is the regular expression class-shorthand for "non-digits".
'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.
After replacing every non-digit with the empty string, only digits remain.
This always comes up in questions and in my own development and I finally tired of tricky ways of doing this. I finally broke down and implemented it as a PostgreSQL extension:
https://github.com/Bjond/pg_natural_sort_order
It's free to use, MIT license.
Basically it just normalizes the numerics (zero pre-pending numerics) within strings such that you can create an index column for full-speed sorting au naturel. The readme explains.
The advantage is you can have a trigger do the work and not your application code. It will be calculated at machine-speed on the PostgreSQL server and migrations adding columns become simple and fast.
you can use just this line
"ORDER BY length(substring(em_code FROM '[0-9]+')), em_code"
I wrote about this in detail in this related question:
Humanized or natural number sorting of mixed word-and-number strings
(I'm posting this answer as a useful cross-reference only, so it's community wiki).
I came up with something slightly different.
The basic idea is to create an array of tuples (integer, string) and then order by these. The magic number 2147483647 is int32_max, used so that strings are sorted after numbers.
ORDER BY ARRAY(
SELECT ROW(
CAST(COALESCE(NULLIF(match[1], ''), '2147483647') AS INTEGER),
match[2]
)
FROM REGEXP_MATCHES(col_to_sort_by, '(\d*)|(\D*)', 'g')
AS match
)
I thought about another way of doing this that uses less db storage than padding and saves time than calculating on the fly.
https://stackoverflow.com/a/47522040/935122
I've also put it on GitHub
https://github.com/ccsalway/dbNaturalSort
The following solution is a combination of various ideas presented in another question, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

how to convert oracle to_single_byte() in postgreSQL

Oracle
select regexp_replace(TO_SINGLE_BYTE('A'),'[^ a-zA-Z]','!!',1,0,'im') from dual;
--> Result: A
Postgres
select regexp_replace('A','[^ a-zA-Z]',' ','ig');
--> Result : !!
Question: how to get a same result between oracle and Postgres? I'd like get an result as like Oracle.
I tried to search the way to convert this function into Postgres. But the answer is there is no function in Postgres.
That letter is Unicode code point U+FF21 (Fullwidth Latin Capital Letter A) and does not match your pattern.
Use the unaccent function to convert the character to a regular A:
CREATE EXTENSION unaccent;
SELECT ascii(unaccent('A'));
ascii
-------
65
(1 row)

Data Migration from DB2 to PostgreSQL using AWS DMS - Varchar fields are showing trailing spaces

We are migrating DB2 data to PostgreSQL 11.x using AWS DMS, we have varchar fields in db2 with trailing spaces and without any TRIM these fields working fine when we are using these fields in a WHERE clause. I think DB2 internally trimming them as these are varchar fields. But after moving to PostgreSQL these fields are not working without TRIM and also some times these giving unexpected results even if you use TRIM. below is the detailed problem.
Source: DB2 - RECIP_NUM -- VARCHAR(10) -- 'ST001 '
select RECIP_NUMBER, SERV_TYPE, LENGTH(SERV_TYPE) AS before_trim_COL_LENGTH, LENGTH(trim(SERV_TYPE)) AS after_trim_COL_LENGTH
from serv_type rst
WHERE SERV_TYPE = 'ST001' -- THIS WORKS FINE WITHOUT TRIM
Output:Output of DB2
Target: PGSQL -- RECIP_NUM -- VARCHAR(10) -- 'ST001 '
select RECIP_NUMBER, SERV_TYPE, LENGTH(SERV_TYPE) AS COL_LENGTH
from serv_type rst
WHERE trim(SERV_TYPE) = 'ST001' -- THIS IS NOT GIVING ANY OUTPUT WITHOUT TRIM
Output: Output of PostgreSQL
Is there any way we can tell PostgreSQL to ignore the trailing spaces of a VARCHAR Column?
Postgres doesn't follow the SQL standard, which requires the shorter string be padded, when comparing VARCHAR or TEXT strings; it only pads the CHAR strings. Therefore, you can use ...WHERE SERV_TYPE::char = 'ST001'::char to simulate the Db2 behaviour. Note though that this will preclude the use of index on SERV_TYPE, same as when using trim(SERV_TYPE).

How to set thousand separator for PostgreSQL?

I want to format long numbers using thousand separator. It can be done using to_char function just like:
SELECT TO_CHAR(76543210.98, '999G999G990D00')
But when my PostgreSQL server with UTF-8 encoding is on Polish version of Windows such SELECT ends with:
ERROR: invalid byte sequence for encoding "UTF8": 0xa0
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
In to_char pattern G is described as: group separator (uses locale).
This SELECT works without error when server is running on Linux with Polish locale.
As a workaround I use space instead of G in format string, but I think there should be way to set thousand separator just like in Oracle:
ALTER SESSION SET NLS_NUMERIC_CHARACTERS=', ';
Is such setting available for PostgreSQL?
If you use psql, you can execute this:
\pset numericlocale
Example:
test=# create temporary table a (a numeric(20,10));
CREATE TABLE
test=# insert into a select random() * 1000000 from generate_series(1,3);
INSERT 0 3
test=# select * from a;
a
-------------------
287421.6944910590
140297.9311533270
887215.3805568810
(3 rows)
test=# \pset numericlocale
Showing locale-adjusted numeric output.
test=# select * from a;
a
--------------------
287.421,6944910590
140.297,9311533270
887.215,3805568810
(3 rows)
I'm pretty sure the error message is literally true: 0xa0 isn't a valid UTF-8 character.
My home server is running PostgreSQL on Windows XP, SP3. I can do this in psql.
sandbox=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)
sandbox=# show lc_numeric;
lc_numeric
---------------
polish_poland
(1 row)
sandbox=# SELECT TO_CHAR(76543210.98, '999G999G990D00');
to_char
-----------------
76 543 210,98
(1 row)
I don't get an error message, but I get garbage for the separator. Could this be a code page issue?
As a workaround I use space instead of
G in format string
Let's think about this. If you use a space, then on a web page the value might split at the end of a line or at the boundary of a table cell. I'd think a nonbreaking space might be a better choice.
And, in Unicode, a nonbreaking space is 0xa0. In Unicode, not in UTF8. (That is, 0xa0 can't be the first byte of a UTF8 character. See UTF-8 Bit Distribution.)
Another possibility is that your client is expecting one byte order, and the server is giving it a different byte order. Since the numbers are single-byte characters, the byte order wouldn't matter until, well, it mattered. If the client is expecting a big endian MB character, and it got a little endian MB character beginning with 0xa0, I'd expect it to die with the error message you saw. I'm not sure I have a way to test this before I go to work today.