How to convert character set to unicode in db2 query

Server: IBM i Series AS/400 running DB2
Client: Linux using unixodbc
I have a table in a DB2 database with a column of data using CCSID 836 (Simplified Chinese EBCDIC). I want to get results in UTF-16 so they work on other systems, but I'm having a hard time finding the right way to convert.
When I try:
SELECT CAST(MYCOLNAME AS VARCHAR(100) CCSID 13491) FROM MY.TABLE
I get the error:
SQL State: 22522
Vendor Code: -189
Message: [SQL0189] Coded Character Set Identifier 13491 not valid.
Cause . . . . . : Coded Character Set Identifier (CCSID) 13491 is not valid for one of the following reasons:
-- The CCSID is not EBCDIC.
-- The CCSID is not supported by the system.
-- The CCSID is not valid for the data type.
-- If the CCSID is specified for graphic data, then the CCSID must be a DBCS CCSID.
-- If the CCSID is specified for UCS-2 or UTF-16 data, then the CCSID must be a UCS-2 or UTF-16 CCSID.
-- If the CCSID is specified for XML data, then the CCSID must be SBCS or Unicode. It must not be DBCS or 65535.
How can I convert the data from CCSID 836 into UTF-16? I've been equally unsuccessful with UNICODE_STR().

I can't explain why, but here's what works:
SELECT CAST(MYCOLNAME AS VARCHAR(100) CCSID 935) FROM MY.TABLE
The native CCSID for the column in question is 836, which seems very similar to 935, so I don't understand the difference. But 935 works for me.
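If you need UTF-16 specifically rather than the mixed EBCDIC CCSID, a variation that might work is to cast to a graphic type instead of VARCHAR, since UTF-16 is a graphic encoding on DB2 for i. This is only a sketch and assumes your release accepts CCSID 1200 (UTF-16) in the cast:
SELECT CAST(MYCOLNAME AS VARGRAPHIC(100) CCSID 1200) FROM MY.TABLE
If that is rejected, CCSID 13488 (UCS-2) is the other Unicode graphic CCSID worth trying.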

Related

How to store word "é" in postgres using limited varchar

I've been having some problems trying to save a word into a column limited to varchar(9).
create database big_text
LOCALE 'en_US.utf8'
ENCODING UTF8
create table big_text(
description VARCHAR(9) not null
)
-- OK
insert into big_text (description) values ('sintético')
-- I got an error here
insert into big_text (description) values ('sintético')
I already know that the problem is that one word uses 'é' -> Latin Small Letter E with Acute (a single codepoint), while the other uses 'é' -> Latin Small Letter E followed by Combining Acute Accent (two codepoints).
How can I store the same word in both representations in a limited varchar(9)? Is there some configuration that lets the database understand both forms? I thought the database being UTF8 would be enough, but apparently it isn't.
I'd appreciate any explanation that could help me understand where I'm wrong. Thank you!
Edit: Actually, I would like to know if there is any way for Postgres to normalize automatically for me.
A possible workaround is to use a CHECK constraint to enforce the character-length limit.
show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
create table big_text(
description VARCHAR not null CHECK (length(normalize(description)) <= 9)
)
-- Note shortened string. Explanation below.
select 'sintético'::varchar(9);
varchar
----------
sintétic
insert into big_text values ('sintético');
INSERT 0 1
select description, length(description) from big_text;
description | length
-------------+--------
sintético | 10
insert into big_text values ('sintético test');
ERROR: new row for relation "big_text" violates check constraint "big_text_description_check"
DETAIL: Failing row contains (sintético test).
From the Character Types documentation, here is the explanation for the string truncation versus the error you got when inserting:
An attempt to store a longer string into a column of these types will result in an error, unless the excess characters are all spaces, in which case the string will be truncated to the maximum length. (This somewhat bizarre exception is required by the SQL standard.)
If one explicitly casts a value to character varying(n) or character(n), then an over-length value will be truncated to n characters without raising an error. (This too is required by the SQL standard.)
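If you want the database to normalize for you (per the edit in the question), one option is a trigger that rewrites the value to NFC before it is stored. This is only a sketch, assuming PostgreSQL 13+ where normalize() is available; the function and trigger names are placeholders:
CREATE OR REPLACE FUNCTION nfc_description() RETURNS trigger AS $$
BEGIN
    -- store the precomposed (NFC) form so é counts as one codepoint
    NEW.description := normalize(NEW.description, NFC);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_text_nfc
BEFORE INSERT OR UPDATE ON big_text
FOR EACH ROW EXECUTE FUNCTION nfc_description();
Combined with the unconstrained VARCHAR plus CHECK shown above, both spellings of 'sintético' would then be stored in the 9-codepoint NFC form, since BEFORE ROW triggers run before constraints are checked.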

Issues related to CCSID on DB2 while trying to export a table

I have been trying to export a DB2 table from one external source to another. So far, I have noticed an error when I tried to import it into the target table.
I was using this on a shell script:
myDate=$(date +"%Y%m%d%I%M")
myPath=/tmp/test
myDataSource="external"
myTableName=tableSchema.sourceTable
db2 +w -vm "export to $myPath/$myDataSource.$myTableName.$myDate.ixf of ixf
messages $myPath/$myDataSource.$myTableName.$myDate.ixf.xmsg
SELECT * from $myTableName"
After further investigation, it seemed like there was a weird character "▒" being inserted. I checked the source and there wasn't any weird character. So, after checking in detail, I issued this command:
select *
from SYSIBM.SYSCOLUMNS
where
tbcreator = 'SOURCESCHEMA'
and tbname = 'SOURCETABLE'
for fetch only with ur;
Which showed that the CCSID on the source columns is 37. I did the same on the target schema and the CCSID for the columns is 1208.
When I tried to export the table again, forcing it to convert to CCSID 1208 by adding modified by codepage=1208:
db2 +w -vm "export to $myPath/$myDataSource.$myTableName.$myDate.ixf of ixf
modified by codepage=1208
messages $myPath/$myDataSource.$myTableName.$myDate.ixf.xmsg
SELECT * from $myTableName"
This causes the script to work, but I get this warning:
SQL3132W The character data in column "COLUMN" will be truncated to size "4".
Said column is the same size on the source and the target, but it seems that due to the CCSID difference I will need to change the size on the target (I can't change anything on the source, and changing the CCSID on the target would break things there). So, my questions are:
How do I calculate the size needed for each varchar/char column depending on the encoding? For example, will a varchar(4) with CCSID 37 need a varchar(5) to hold the same value in CCSID 1208?
Will it be possible to do something like:
SELECT
CAST(COLUMN as VARCHAR(12) CCSID 1208) --and for all columns
from tableSchema.sourceTable;
So I don't lose any part of those strings.
What about the numbers?
Thanks!
In the end, I ended up using this:
SELECT
CAST(COLUMN AS VARCHAR(12) CCSID 1208) as COLUMN,
CAST(COLUMN2 AS VARCHAR(12) CCSID 1208) as COLUMN2,
(...),
CAST(COLUMNM AS VARCHAR(12) CCSID 1208) as COLUMNM
FROM tableSchema.sourceTable;
It converted across code pages and no truncation warning was raised.
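To answer the sizing question more generally: one way to see how wide each target column needs to be is to measure the longest value after conversion to CCSID 1208. A sketch, assuming CHARACTER_LENGTH with the OCTETS unit is available on your platform (COLUMN stands in for each real column name, and the VARCHAR length should be comfortably larger than the source column):
SELECT MAX(CHARACTER_LENGTH(CAST(COLUMN AS VARCHAR(100) CCSID 1208), OCTETS)) AS max_utf8_bytes
FROM tableSchema.sourceTable;
As a rough upper bound, the characters in CCSID 37 fall in the Latin-1 range, and those need at most 2 bytes each in UTF-8, so doubling the source byte length is always safe for this particular source code page.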

IBM DB2 9.7 VARCHAR([N]), does [N] stand for characters or bytes in UTF-8

We use UTF-8 encoding in our IBM DB2 9.7 LUW database. Even though I did a lot of searching I could not find a definite answer to this question. If I define a table column to be VARCHAR(100), does it mean 100 characters or 100 bytes?
As per the online IBM docs, it's in bytes:
VARCHAR(integer), or CHARACTER VARYING(integer), or CHAR VARYING(integer)
For a varying-length character string of maximum length integer bytes, which may range from 1 to 32,672.
There's further information on this page, where you can see that
SELECT CHARACTER_LENGTH (NAME, OCTETS) FROM T1 WHERE NAME = 'Jürgen'
gives you 7 because ü is encoded as x'c3bc'.
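For completeness, you can ask for both units on the same value. A small sketch against the same T1 example, assuming DB2 9.7 LUW where the CODEUNITS32 unit is supported:
SELECT CHARACTER_LENGTH(NAME, CODEUNITS32) AS num_chars,
       CHARACTER_LENGTH(NAME, OCTETS) AS num_bytes
FROM T1 WHERE NAME = 'Jürgen'
For 'Jürgen' this returns 6 and 7: six characters, seven bytes, because ü takes two bytes in UTF-8.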

Oracle 10g column width (byte size) for UTF-8 character set / encoding

I'm still a bit of a stranger when it comes to character sets and encodings, so I apologize in advance for any misconceptions.
I'm using Oracle 10g as a DBMS for my web application. My database is configured to UTF-8.
Database information:
SQL> SELECT * FROM NLS_DATABASE_PARAMETERS
PARAMETER VALUE
------------------------------ --------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET UTF8
NLS_CALENDAR GREGORIAN
NLS_DATE_FORMAT DD-MON-RR
NLS_DATE_LANGUAGE AMERICAN
NLS_SORT BINARY
NLS_TIME_FORMAT HH.MI.SSXFF AM
NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR
NLS_DUAL_CURRENCY $
NLS_COMP BINARY
NLS_LENGTH_SEMANTICS BYTE
NLS_NCHAR_CONV_EXCP FALSE
NLS_NCHAR_CHARACTERSET UTF8
NLS_RDBMS_VERSION 10.2.0.3.0
I have a database table that contains a column with a width limit of 1500.
Table details:
COLUMN_NAME DATATYPE
------------------------------ --------------------------------
TEST_COLUMN VARCHAR2(1500 BYTE)
Initially, being the noobie I am, I thought that the 1500 limit set on the column was a number of characters, but I found out later that it is actually a number of bytes.
What I am aiming for is to limit the number of characters to 1500; setting VARCHAR2(1500) would only achieve that for a single-byte encoding.
So, because I am using UTF-8, which is a multi-byte encoding, what would be the correct column width to limit it to 1500 multi-byte characters?
When you create the column you need to specify character length semantics, like this:
test_column varchar2(1500 char)
You can set a default through NLS_LENGTH_SEMANTICS. It's possible that your scripts may have worked correctly in a different environment because of the server or session parameters. But it is probably better to explicitly set each column.
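A sketch of both options; my_table here is just a placeholder, and note that in 10g a single VARCHAR2 value is still capped at 4000 bytes no matter which semantics you choose:
-- per column, for an existing table:
ALTER TABLE my_table MODIFY (test_column VARCHAR2(1500 CHAR));

-- or per session, so new DDL defaults to character semantics:
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
CREATE TABLE my_table_char (
    test_column VARCHAR2(1500)  -- 1500 characters, still at most 4000 bytes in 10g
);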

AnsiString being truncated with plenty of space

I'm inserting a row with a JOBCODE field defined as varchar(50). When the string for that field is greater than 20 characters I get an error from SQL Server warning that the string would be truncated.
I suspect this may have to do with Unicode wide characters, but even then I thought 25 characters would pass.
Has anyone seen something like this before? What am I missing?
I think there is something else at fault here.
VARCHAR(50) should hold 50 characters, irrespective of the encoding. As an example:
CREATE TABLE AnsiString
(
JobCode VARCHAR(20), -- ANSI with codepage
JobCodeUnicode NVARCHAR(20) -- Unicode
)
Inserting 20 Unicode characters into both columns:
INSERT INTO AnsiString(JobCode, JobCodeUnicode) VALUES ('葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0',
N'葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0')
select * from ansistring
Returns
?2?4?6?8?0?2?4?6?8?0 葉2葉4葉6葉8葉0葉2葉4葉6葉8叶0
As expected, ? is inserted for characters which weren't mapped into ANSI, but either way, we can still insert 20 characters.
Do you possibly have a trigger on the table? Could it be another column entirely? Could your data access layer somehow be expanding your unicode string to something else (e.g. byte[])?
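To rule out the trigger possibility quickly, something like this should list any triggers on the table (dbo.YourTable is a placeholder):
SELECT t.name AS trigger_name, t.is_disabled
FROM sys.triggers AS t
WHERE t.parent_id = OBJECT_ID('dbo.YourTable');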