I want to format long numbers with a thousands separator. It can be done with the to_char function, like this:
SELECT TO_CHAR(76543210.98, '999G999G990D00')
But when my PostgreSQL server with UTF-8 encoding runs on a Polish version of Windows, such a SELECT ends with:
ERROR: invalid byte sequence for encoding "UTF8": 0xa0
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
In the to_char pattern, G is described as: group separator (uses locale).
This SELECT works without error when the server is running on Linux with a Polish locale.
As a workaround I use a space instead of G in the format string, but I think there should be a way to set the thousands separator, just like in Oracle:
ALTER SESSION SET NLS_NUMERIC_CHARACTERS=', ';
Is such a setting available in PostgreSQL?
If you use psql, you can execute this:
\pset numericlocale
Example:
test=# create temporary table a (a numeric(20,10));
CREATE TABLE
test=# insert into a select random() * 1000000 from generate_series(1,3);
INSERT 0 3
test=# select * from a;
a
-------------------
287421.6944910590
140297.9311533270
887215.3805568810
(3 rows)
test=# \pset numericlocale
Showing locale-adjusted numeric output.
test=# select * from a;
a
--------------------
287.421,6944910590
140.297,9311533270
887.215,3805568810
(3 rows)
I'm pretty sure the error message is literally true: the byte 0xa0 isn't valid UTF-8.
My home server is running PostgreSQL on Windows XP, SP3. I can do this in psql.
sandbox=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)
sandbox=# show lc_numeric;
lc_numeric
---------------
polish_poland
(1 row)
sandbox=# SELECT TO_CHAR(76543210.98, '999G999G990D00');
to_char
-----------------
76 543 210,98
(1 row)
I don't get an error message, but I get garbage for the separator. Could this be a code page issue?
As a workaround I use space instead of G in format string
Let's think about this. If you use a space, then on a web page the value might split at the end of a line or at the boundary of a table cell. I'd think a nonbreaking space might be a better choice.
And in Unicode, the nonbreaking space is code point U+00A0. That's its code point, not its UTF-8 encoding: UTF-8 encodes it as the two bytes 0xC2 0xA0, and a bare 0xA0 can't be the first byte of a UTF-8 character. (See UTF-8 Bit Distribution.)
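If you go the nonbreaking-space route, here is a minimal sketch of that idea, reusing the question's space-separated format string and swapping the spaces for U+00A0 afterwards (assuming a UTF-8 database with standard_conforming_strings on):
-- Format with ordinary spaces (the question's workaround), then replace them
-- with nonbreaking spaces (U+00A0) so the value won't break at a line or cell boundary.
-- Note: the leading sign space, if any, is replaced too; trim first if that matters.
SELECT replace(
         to_char(76543210.98, '999 999 990D00'),
         ' ',
         U&'\00A0'
       );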
Another possibility is that your client is expecting one byte order, and the server is giving it a different byte order. Since the numbers are single-byte characters, the byte order wouldn't matter until, well, it mattered. If the client is expecting a big endian MB character, and it got a little endian MB character beginning with 0xa0, I'd expect it to die with the error message you saw. I'm not sure I have a way to test this before I go to work today.
I've been having some problems trying to save a word in a column limited to varchar(9).
create database big_text
LOCALE 'en_US.utf8'
ENCODING UTF8
create table big_text(
description VARCHAR(9) not null
)
-- OK
insert into big_text (description) values ('sintético')
-- I get the error here
insert into big_text (description) values ('sintético')
I already know that the problem is that one word uses 'é' -> Latin Small Letter E with Acute (in this case there is only 1 code point) and the other word uses 'é' -> Latin Small Letter E + Combining Acute Accent (in this case there are 2 code points).
How can I store the same word in both representations in a column limited to varchar(9)? Is there some configuration so that the database understands both forms? I thought that the database being UTF8 would be enough, but it is not.
I would appreciate any explanation that could help me understand where I am wrong. Thank you!
edit: Actually, I would like to know if there is any way for Postgres to normalize automatically for me.
A possible workaround: use a CHECK constraint on the normalized character length.
show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
create table big_text(
description VARCHAR not null CHECK (length(normalize(description)) <= 9)
)
-- Note shortened string. Explanation below.
select 'sintético'::varchar(9);
varchar
----------
sintétic
insert into big_text values ('sintético');
INSERT 0 1
select description, length(description) from big_text;
description | length
-------------+--------
sintético | 10
insert into big_text values ('sintético test');
ERROR: new row for relation "big_text" violates check constraint "big_text_description_check"
DETAIL: Failing row contains (sintético test).
From the documentation here, Character Types, comes the explanation for the string truncation vs. the error you got when inserting:
An attempt to store a longer string into a column of these types will result in an error, unless the excess characters are all spaces, in which case the string will be truncated to the maximum length. (This somewhat bizarre exception is required by the SQL standard.)
If one explicitly casts a value to character varying(n) or character(n), then an over-length value will be truncated to n characters without raising an error. (This too is required by the SQL standard.)
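Regarding the edit about automatic normalization: PostgreSQL will not normalize Unicode on its own, but from version 13 on the normalize() function used in the CHECK constraint above can also be applied explicitly when storing, e.g. in the INSERT or in a trigger. A minimal sketch, with the two representations spelled out via U& escapes so they are unambiguous:
-- The decomposed spelling ('e' + combining acute accent) is 10 code points;
-- NFC-normalizing it collapses the pair into a single 'é', giving 9.
SELECT length(U&'sint\0065\0301tico')                 AS decomposed_len,  -- 10
       length(normalize(U&'sint\0065\0301tico', NFC)) AS nfc_len;         -- 9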
The dump function in Oracle displays the internal representation of data:
DUMP returns a VARCHAR2 value containing the data type code, length in bytes, and internal representation of expr
For example:
SELECT DUMP(cast(1 as number ))
2 FROM DUAL;
DUMP(CAST(1ASNUMBER))
--------------------------------------------------------------------------------
Typ=2 Len=2: 193,2
SQL> SELECT DUMP(cast(1.000001 as number ))
2 FROM DUAL;
DUMP(CAST(1.000001ASNUMBER))
--------------------------------------------------------------------------------
Typ=2 Len=5: 193,2,1,1,2
It shows that the first 1 uses 2 bytes of storage and the second example uses 5 bytes.
I suppose the closest function in PostgreSQL is pg_typeof, but it returns only the type name, without any information about byte usage:
SELECT pg_typeof(33);
pg_typeof
-----------
integer
(1 row)
Does anybody know if there is an equivalent function in PostgreSQL?
I don't speak PostgreSQL.
However, the Oracle functionality page says that there's Orafce, which
implements in Postgres some of the functions from the Oracle database that are missing (or behaving differently)
Furthermore, it mentions the dump function:
dump (anyexpr [, int]): Returns a text value that includes the datatype code, the length in bytes, and the internal representation of the expression
One of the examples looks like this:
postgres=# select pg_catalog.dump('Pavel Stehule',10);
dump
-------------------------------------------------------------------------
Typ=25 Len=17: 68,0,0,0,80,97,118,101,108,32,83,116,101,104,117,108,101
(1 row)
To me, it looks like Oracle's dump:
SQL> select dump('Pavel Stehule') result from dual;
RESULT
--------------------------------------------------------------
Typ=96 Len=13: 80,97,118,101,108,32,83,116,101,104,117,108,101
SQL>
I presume you'll have to visit GitHub and install the package to see whether you can use it or not.
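If you do, the usual extension workflow would presumably apply once the package is built and installed on the server; note that the schema the function ends up in may vary between Orafce versions (the example above calls it as pg_catalog.dump):
-- Assuming the package registers an extension named "orafce": enable it in the
-- current database, then dump() should be callable as in the example above.
CREATE EXTENSION IF NOT EXISTS orafce;
SELECT pg_catalog.dump('Pavel Stehule', 10);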
It is not a complete equivalent, but if you want to figure out the byte values used to encode a string in PostgreSQL, you can simply cast the value to bytea, which will give you the bytes in hexadecimal:
SELECT CAST ('schön' AS bytea);
This will work for strings, but not for numbers.
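For example, in a database with a UTF-8 server encoding (assumed here) and the default bytea_output = 'hex', the query above shows the two bytes that encode 'ö':
SELECT CAST ('schön' AS bytea);
-- \x736368c3b66e : 's' 'c' 'h' are one byte each, 'ö' is the two bytes c3 b6, then 'n'.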
I have a database with one column of the type nvarchar. If I write
INSERT INTO table VALUES ("玄真")
It shows ¿¿ in the table. What should I do?
I'm using SQL Developer.
Use single quotes, rather than double quotes, to create a text literal; and for an NVARCHAR2/NCHAR text literal you need to prefix it with N:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE table_name ( value NVARCHAR2(20) );
INSERT INTO table_name VALUES (N'玄真');
Query 1:
SELECT * FROM table_name
Results:
| VALUE |
|-------|
| 玄真 |
First, using NVARCHAR might not even be necessary.
The 'N' character data types are for storing data that doesn't 'fit' in the database's defined character set. There's an auxiliary character set, defined as the NCHAR character set. It's kind of a band-aid: once you create a database, it can be difficult to change its character set. Moral of this story: take great care in defining the character set when creating your database, and do not just accept the defaults.
Here's a scenario (LiveSQL) where we're storing a Chinese string in both NVARCHAR and VARCHAR2.
CREATE TABLE SO_CHINESE ( value1 NVARCHAR2(20), value2 varchar2(20 char));
INSERT INTO SO_CHINESE VALUES (N'玄真', '我很高興谷歌翻譯。' )
select * from SO_CHINESE;
Note that both character sets are in the Unicode family. Note also that I told my VARCHAR2 column to hold 20 characters (20 CHAR). That's because some characters may require up to 4 bytes of storage; a plain (20) definition, which counts bytes, would give you room for only 5 of those characters.
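As a sketch of that byte-vs-character point (hypothetical table name, assuming the default NLS_LENGTH_SEMANTICS = BYTE and an AL32UTF8 character set):
-- VARCHAR2(20) means 20 bytes here, while VARCHAR2(20 CHAR) always holds
-- 20 characters, however many bytes each character needs.
CREATE TABLE length_semantics_demo (
  twenty_bytes VARCHAR2(20),
  twenty_chars VARCHAR2(20 CHAR)
);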
Let's look at the same scenario using SQL Developer and my local database.
And to confirm the character sets:
SQL> clear screen
SQL> set echo on
SQL> set sqlformat ansiconsole
SQL> select *
2 from database_properties
3 where PROPERTY_NAME in
4 ('NLS_CHARACTERSET',
5 'NLS_NCHAR_CHARACTERSET');
PROPERTY_NAME            PROPERTY_VALUE   DESCRIPTION
NLS_NCHAR_CHARACTERSET   AL16UTF16        NCHAR Character set
NLS_CHARACTERSET         AL32UTF8         Character set
First of all, you should establish a Chinese character encoding (collation) on your database, for example:
UTF-8, Chinese_Hong_Kong_Stroke_90_BIN, Chinese_PRC_90_BIN, Chinese_Simplified_Pinyin_100_BIN ...
I'll show you an example with SQL Server 2008 (Management Studio), which includes all of these collations; however, you can find similar character encodings in other databases (MySQL, SQLite, MongoDB, MariaDB...).
Create the database with Chinese_PRC_90_BIN, but you can choose another collation:
Select a Page (Left Header) Options > Collation > Choose the Collation
Create a Table with the same Collation:
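If you prefer a script over the Management Studio dialogs, something along these lines should work in T-SQL as well (the column name is hypothetical):
-- Create the table with an explicit per-column collation.
CREATE TABLE ChineseTable (
    value NVARCHAR(20) COLLATE Chinese_PRC_90_BIN
);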
Execute the Insert Statement
INSERT INTO ChineseTable VALUES ('玄真');
I am unable to do case-insensitive matching of multibyte characters in Redshift.
create table temp2 (city varchar);
insert into temp2 values('г. красноярск'); -- lower-case value
insert into temp2 values('Г. Красноярск'); -- upper-case value
select * from temp2 where city ilike 'Г. Красноярск'
city
-------------
Г. Красноярск
I tried the following; lower() does convert the UTF-8 characters to lower case:
select lower('Г. Красноярск')
lower
-------------
г. красноярск
In Vertica this works fine using the LOWERB() function.
Internally the LIKE and ILIKE operators use PostgreSQL's regular expression support.
Support for proper handling of UTF-8 multibyte characters in regular expressions was added in PostgreSQL 9.2. Redshift is based on PostgreSQL 8.2 (?), and it looks like they haven't backported that support into their fork.
See Postgresql regex to match uppercase, Unicode-aware
You can work around this, with limitations, by using LIKE lower('Г. Красноярск') instead. An expression index may be useful.
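A sketch of that workaround against the question's temp2 table, lower-casing both sides so the plain (case-sensitive) LIKE matches either row:
-- lower() handles the multibyte Cyrillic correctly (as shown in the question),
-- so comparing lower-cased values sidesteps the case-insensitivity problem in ILIKE.
SELECT city
FROM temp2
WHERE lower(city) LIKE lower('Г. Красноярск');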
I got an answer for how to check for one specific BOM in a PostgreSQL text column. What I would really like is something more general, i.e. something like:
select decode(replace(textColumn, '\\', '\\\\'), 'escape') from tableXY;
The result of a UTF8 BOM is:
\357\273\277
This is bytea in octal escape format; it can be converted by switching the bytea output format in pgAdmin:
update pg_settings set setting = 'hex' WHERE name = 'bytea_output';
select '\357\273\277'::bytea
The result is:
\xefbbbf
What I would like is to get this result with a single query, e.g.:
update pg_settings set setting = 'hex' WHERE name = 'bytea_output';
select decode(replace(textColumn, '\\', '\\\\'), 'escape') from tableXY;
But that doesn't work. The result is empty, probably because the decode cannot handle hex output.
If the final purpose is to get the hexadecimal representation of all the bytes that constitute the strings in textColumn, this can be done with:
SELECT encode(convert_to(textColumn, 'UTF-8'), 'hex') from tableXY;
It does not depend on bytea_output. BTW, this setting plays a role only at the final stage of a query, when a result column is of type bytea and has to be returned in text format to the client (which is the most common case, and what pgAdmin does). It's a matter of representation; the actual values represented (the series of bytes) are identical.
In the query above, the result is of type text, so this is irrelevant anyway.
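To illustrate that point, here is the same three-byte BOM displayed under both output formats (using a session-level SET rather than the pg_settings update shown above):
SET bytea_output = 'escape';
SELECT '\xefbbbf'::bytea;   -- displayed as \357\273\277
SET bytea_output = 'hex';
SELECT '\xefbbbf'::bytea;   -- displayed as \xefbbbf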
I think that your query with decode(..., 'escape') can't work because the argument is supposed to be encoded in escape format and it's not; per the comments, the column contains normal XML strings.
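Building on that, a sketch of how one might flag rows that start with a UTF-8 BOM, using the question's table and column names:
-- Rows whose text begins with the BOM bytes EF BB BF
-- (encode() produces lowercase hex, so the pattern is lowercase too).
SELECT *
FROM tableXY
WHERE encode(convert_to(textColumn, 'UTF-8'), 'hex') LIKE 'efbbbf%';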
With the great help of Daniel-Vérité, I now use this general query to check for all kinds of BOM or Unicode character problems:
select encode(textColumn::bytea, 'hex'), * from tableXY;
I had a problem with pgAdmin and very long columns, which showed no result, so I used this query for pgAdmin:
select encode(substr(textColumn,1,100)::bytea, 'hex'), * from tableXY;
Thanks Daniel!