How to store word "é" in postgres using limited varchar - postgresql

I've been having some problems trying to save a word in a column limited to varchar(9).
create database big_text
LOCALE 'en_US.utf8'
ENCODING UTF8
create table big_text(
description VARCHAR(9) not null
)
-- OK
insert into big_text (description) values ('sintético')
-- I got an error here
insert into big_text (description) values ('sintético')
I already know that the problem is that one word uses 'é' -> Latin Small Letter E with Acute (a single code point) and the other word uses 'é' -> Latin Small Letter E + Combining Acute Accent (two code points).
How can I store the same word using both representations in a limited varchar(9)? Is there some configuration that makes the database able to understand both forms? I thought that the database being UTF8 would be enough, but apparently it is not.
I would appreciate any explanation that could help me understand where I am wrong. Thank you!
edit: Actually I would like to know if there is any way for postgres to automatically normalize the string for me.
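To make the difference explicit, the two spellings can be written with PostgreSQL's Unicode escape syntax and compared with normalize() (available from PostgreSQL 13); a minimal sketch, assuming a UTF8 server encoding:
-- \00E9 is the precomposed é; \0301 after the plain e is the combining acute accent
select length(U&'sint\00E9tico')  as precomposed_len,   -- 9
       length(U&'sinte\0301tico') as decomposed_len,    -- 10
       normalize(U&'sinte\0301tico', NFC) = U&'sint\00E9tico' as same_after_nfc;  -- true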

A possible workaround: use a CHECK constraint to enforce the character length.
show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
create table big_text(
description VARCHAR not null CHECK (length(normalize(description)) <= 9)
)
-- Note shortened string. Explanation below.
select 'sintético'::varchar(9);
varchar
----------
sintétic
insert into big_text values ('sintético');
INSERT 0 1
select description, length(description) from big_text;
description | length
-------------+--------
sintético | 10
insert into big_text values ('sintético test');
ERROR: new row for relation "big_text" violates check constraint "big_text_description_check"
DETAIL: Failing row contains (sintético test).
From the Character Types documentation, here is the explanation for the string truncation vs. the error you got when inserting:
An attempt to store a longer string into a column of these types will result in an error, unless the excess characters are all spaces, in which case the string will be truncated to the maximum length. (This somewhat bizarre exception is required by the SQL standard.)
If one explicitly casts a value to character varying(n) or character(n), then an over-length value will be truncated to n characters without raising an error. (This too is required by the SQL standard.)
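As for the edit about automatic normalization: I'm not aware of a setting that makes Postgres normalize strings by itself, but on PostgreSQL 13+ a BEFORE trigger can apply normalize() on the way in; a sketch (the function and trigger names are just placeholders):
create or replace function normalize_description() returns trigger as $$
begin
  -- store the NFC form, so both spellings end up as the 9-character version
  new.description := normalize(new.description, NFC);
  return new;
end;
$$ language plpgsql;

create trigger big_text_normalize
before insert or update on big_text
for each row execute function normalize_description();
Note that this pairs with the unconstrained VARCHAR plus CHECK table above rather than a plain varchar(9) column, because the varchar(9) length check fires before the trigger gets a chance to normalize the value.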

Related

Postgresql ORDER BY not working as expected

Let's try this simple example to represent the problem I'm facing.
Assume this table:
CREATE TABLE testing1
(
id serial NOT NULL,
word text,
CONSTRAINT testing1_pkey PRIMARY KEY (id)
);
and that data:
insert into testing1 (word) values ('Heliod, God');
insert into testing1 (word) values ('Heliod''s Inter');
insert into testing1 (word) values ('Heliod''s Pilg');
insert into testing1 (word) values ('Heliod, Sun');
Then I want to run this query to get the results ordered by the word column:
SELECT
id, word
FROM testing1
WHERE UPPER(word::text) LIKE UPPER('heliod%')
ORDER BY word asc;
But the output is not ordered the way I expect. I would expect the rows in this order, using their ids: 2, 3, 1, 4 (or, using the word values: Heliod's Inter, Heliod's Pilg, Heliod, God, Heliod, Sun), yet that is not the order I get back.
I thought that maybe something was confusing postgresql because of the WHERE criteria I used, but the same ordering happens if I simply ORDER BY over all the rows without the WHERE clause.
Am I missing something here? I couldn't find anything in the docs about ordering values that contain quotes (I suspect that the quotes cause that behaviour because of their special meaning in postgresql, but I may be wrong).
I am using UTF-8 encoding for my database (not sure if it matters though) and this issue is happening on Postgresql version 12.7.
The output of
show lc_ctype;
is
"en_GB.UTF-8"
and the output of
show lc_collate;
is
"en_GB.UTF-8"
That is the correct way to order the rows under en_GB.UTF-8 (and en_US.UTF-8 behaves the same way). These locales do 'weird' (to someone used to ASCII) things with punctuation and whitespace, skipping them on a first pass and considering them only to break otherwise tied values.
If you don't want those rules, maybe use the C collation instead.
Indeed, I've tried jjanes's suggestion to use the C collation and the output is the one I would expect:
SELECT
id, word
FROM testing1
ORDER BY word collate "C" ;
How weird, I have been using postgresql for some years now and I never noticed that behaviour.
Relevant section from the docs:
23.2.2.1. Standard Collations
On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system
support. The default collation selects the LC_COLLATE and LC_CTYPE values
specified at database creation time. The C and POSIX collations both
specify “traditional C” behavior, in which only the ASCII letters “A”
through “Z” are treated as letters, and sorting is done strictly by
character code byte values.
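If you always want that ordering for the column, the collation can also be declared on the column itself, so a plain ORDER BY uses it without an explicit COLLATE clause; a sketch using the same table definition:
CREATE TABLE testing1
(
id serial NOT NULL,
word text COLLATE "C",
CONSTRAINT testing1_pkey PRIMARY KEY (id)
);

-- now sorts byte-wise by default
SELECT id, word FROM testing1 ORDER BY word;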

Postgres is adding a space at the beginning and end of all fields

SLES 12 SP3
Postgres 10.8
I have duplicated a table to migrate data from a DB2 instance. The fields are all of type CHAR, VARCHAR, or TIMESTAMP. I originally tried to use \COPY to pull the data in from a pipe delimited file. But, it put a space at the beginning and end of all of the fields, even if this caused the field to be longer than it is defined. I found a claim online that this was a known issue with \COPY. At that point, I dropped the table, used sed and some other tools to convert the pipe delimited data into an SQL INSERT statement. I again had a leading and trailing space in every field.
There are a lot of columns but as an example of what I have follows:
FLD1 CHAR(6) PRIMARY KEY
FLD2 VARCHAR(8)
FLD3 TIMESTAMP
I am using the short form of INSERT.
INSERT INTO my_table VALUES
('123456', '12345678', '2021-01-01 12:34:56');
But when I do a SELECT, I get (note the leading and trailing spaces):
123456 | 12345678 | 2021-01-01 12:34:56 |
I would point out that the first two fields are now longer than they are defined by 2 characters.
Does anyone know how I might fix this?
The -A argument to psql gives me the desired result.
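The spaces are not in the stored data at all; they are the column padding added by psql's default aligned output, which the -A (unaligned) switch turns off. A quick way to confirm that only the inserted characters are stored (a sketch against the example columns):
select quote_literal(fld1) as fld1_literal, length(fld1), length(fld2) from my_table;
-- expect '123456' | 6 | 8 if no spaces were actually stored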

SQL Command to insert Chinese Letters

I have a database with one column of the type nvarchar. If I write
INSERT INTO table VALUES ("玄真")
It shows ¿¿ in the table. What should I do?
I'm using SQL Developer.
Use single quotes, rather than double quotes, to create a text literal, and for an NVARCHAR2/NCHAR text literal you need to prefix it with N.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE table_name ( value NVARCHAR2(20) );
INSERT INTO table_name VALUES (N'玄真');
Query 1:
SELECT * FROM table_name
Results:
| VALUE |
|-------|
| 玄真 |
First, using NVARCHAR might not even be necessary.
The 'N' character data types are for storing data that doesn't 'fit' in the database's defined character set. There's an auxiliary character set defined as the NCHAR character set. It's kind of a band-aid: once you create a database, it can be difficult to change its character set. Moral of this story: take great care in defining the character set when creating your database and do not just accept the defaults.
Here's a scenario (LiveSQL) where we're storing a Chinese string in both NVARCHAR and VARCHAR2.
CREATE TABLE SO_CHINESE ( value1 NVARCHAR2(20), value2 varchar2(20 char));
INSERT INTO SO_CHINESE VALUES (N'玄真', '我很高興谷歌翻譯。');
select * from SO_CHINESE;
Note that both character sets are in the Unicode family. Note also that I told my VARCHAR2 column to hold 20 characters (20 char). That's because some characters may require up to 4 bytes to be stored, so a plain definition of (20), which typically uses byte semantics, would only give you room to store 5 of those characters.
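A quick way to see the character vs. byte distinction on that table (a sketch; LENGTH counts characters and LENGTHB counts bytes):
select value2,
       length(value2)  as char_count,  -- 9 characters
       lengthb(value2) as byte_count   -- 27 bytes in AL32UTF8 (3 bytes per character here)
from SO_CHINESE;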
Let's look at the same scenario using SQL Developer and my local database.
And to confirm the character sets:
SQL> clear screen
SQL> set echo on
SQL> set sqlformat ansiconsole
SQL> select *
2 from database_properties
3 where PROPERTY_NAME in
4 ('NLS_CHARACTERSET',
5 'NLS_NCHAR_CHARACTERSET');
PROPERTY_NAME            PROPERTY_VALUE    DESCRIPTION
NLS_NCHAR_CHARACTERSET   AL16UTF16         NCHAR Character set
NLS_CHARACTERSET         AL32UTF8          Character set
First of all, you should establish the Chinese character encoding on your database, for example
UTF-8, Chinese_Hong_Kong_Stroke_90_BIN, Chinese_PRC_90_BIN, Chinese_Simplified_Pinyin_100_BIN ...
I'll show you an example with SQL Server 2008 (Management Studio), which incorporates all of these collations; however, you can find the same character encodings in other databases (MySQL, SQLite, MongoDB, MariaDB...).
Create the database with Chinese_PRC_90_BIN, but you can choose another collation:
Select a Page (Left Header) Options > Collation > Choose the Collation
Create a Table with the same Collation:
Execute the Insert Statement
INSERT INTO ChineseTable VALUES ('玄真');
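For reference, roughly the same setup can be scripted instead of clicking through the Management Studio dialogs; a sketch (the database name is just a placeholder, and an NVARCHAR column with an N'...' literal works regardless of the collation you pick):
CREATE DATABASE ChineseDb COLLATE Chinese_PRC_90_BIN;
GO
USE ChineseDb;
GO
CREATE TABLE ChineseTable (value NVARCHAR(20));
INSERT INTO ChineseTable VALUES (N'玄真');
SELECT value FROM ChineseTable;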

How does Redshift treat guillemets?

I am trying to run a CSV import using the COPY command for some data that includes a guillemet (»). Redshift complains that the column value is too long for the varchar column I have defined. The error in the "Loads" tab in the Redshift GUI displays this character as two dots: .. - had it been treated as one, it would have fit in the varchar column. It's not clear whether there is some sort of conversion error occurring or if there is a display issue.
When trying to do plain INSERTs I run into strange behavior as well:
dev=# create table test (name varchar(3));
CREATE TABLE
dev=# insert into test values ('bla');
INSERT 0 1
3 characters treated as 4?
dev=# insert into test values ('bl»');
ERROR: value too long for type character varying(3)
dev=# insert into test values ('b»');
INSERT 0 1
Why does char_length return 2?
dev=# select char_length(name), name from test;
char_length | name
-------------+------
2 | b»
I've checked the client encoding and database encodings and those all seem to be UTF8/UNICODE.
You need to increase the length of your varchar field. Multibyte characters use more than one byte, and the length in the definition of a varchar field in Redshift is byte-based, so your special character may be taking more than one byte. If it still doesn't work, refer to the Redshift doc page below:
http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html
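In Redshift the n in VARCHAR(n) is a byte count, and » takes 2 bytes in UTF-8, so 'bl»' needs 4 bytes. A sketch of checking and sizing by bytes (OCTET_LENGTH returns bytes, CHAR_LENGTH returns characters):
drop table if exists test;
create table test (name varchar(4));  -- 4 bytes: 'b', 'l', plus the 2-byte »
insert into test values ('bl»');
select name, char_length(name) as chars, octet_length(name) as bytes from test;
-- bl» | 3 | 4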

AnsiString being truncated with plenty of space

I'm inserting a row with a JOBCODE field defined as varchar(50). When the string for that field is greater than 20 characters I get an error from SQL Server warning that the string would be truncated.
I suspect this may have to do with Unicode wide characters, but even then I thought 25 characters would pass.
Has anyone seen something like this before? What am I missing?
I think there is something else at fault here.
VARCHAR(50) should be 50 characters, irrespective of the encoding. As an example:
CREATE TABLE AnsiString
(
JobCode VARCHAR(20), -- ANSI with codepage
JobCodeUnicode NVARCHAR(20) -- Unicode
)
Inserting 20 Unicode characters into both columns:
INSERT INTO AnsiString(JobCode, JobCodeUnicode) VALUES ('葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0',
N'葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0')
select * from ansistring
Returns
?2?4?6?8?0?2?4?6?8?0 葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0
As expected, ? is inserted for characters which weren't mapped into ANSI, but either way, we can still insert 20 characters.
Do you possibly have a trigger on the table? Could it be another column entirely? Could your data access layer somehow be expanding your unicode string to something else (e.g. byte[])?
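One quick way to rule out a definition mismatch or a stray trigger is to look at the catalog views; a sketch (the table name is a placeholder; note that max_length is reported in bytes, so an NVARCHAR(50) column shows 100):
SELECT c.name, t.name AS type_name, c.max_length
FROM sys.columns c
JOIN sys.types t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.YourJobTable');

SELECT name FROM sys.triggers
WHERE parent_id = OBJECT_ID('dbo.YourJobTable');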