What will be the change in application behavior after altering a table to CODEUNITS32 to support Unicode? (DB2)

We are in the process of migrating some tables from an AS400 database to DB2 LUW (v11.1).
While migrating we found some special characters (€) in CHAR columns of the source database (AS400), and these lead to errors unless we alter the target table columns to use CODEUNITS32. The DB2 LUW database is configured with UTF-8 encoding.
We want to understand what the behavior of the application would be after changing the CHAR columns to CODEUNITS32. Do we need to update any configuration at the application level (C and Java applications) to handle both character encodings?
After changing to CODEUNITS32:
- Will my C application be able to compile and handle the change in character size from 8 bits per character (UTF-8) to 4 bytes per character (CODEUNITS32)?
- Will my Java application be able to handle the change in character size from 8 bits per character (UTF-8) to 4 bytes per character (CODEUNITS32)?
We did some pilot testing by manually inserting special characters into the table after changing the column definition from CHAR to CODEUNITS32, and the testing was successful.

Using a string units specification of CODEUNITS32 for a column does not change the encoding of the column; the data is still stored in UTF-8 for CHAR/VARCHAR columns.
It increases the physical length (CHAR) or maximum length (VARCHAR) of the column by a factor of 4.
It also enables "character semantics" in some functions such as SUBSTR(), so that they work on characters, not bytes, when processing CODEUNITS32 columns. (SUBSTRING() always uses character semantics, unless it is processing a FOR BIT DATA column.)
So a CHAR(4), which is the same as CHAR(4 OCTETS), is 4 bytes long and can hold at most 4 characters if they are all single-byte in UTF-8. Since € is 3 bytes long in UTF-8, such a column could hold, say, €4 but not €42.
A CHAR(4 CODEUNITS32) is 16 bytes long and is allowed to hold up to 4 characters. It could hold €€€€ but not €2345.
It is worth considering avoiding CHAR(x CODEUNITS32) and preferring VARCHAR(x CODEUNITS32). UTF-8 does not really play well with fixed-width data types: the more common UTF-8 characters are 1 or 2 bytes long, so typically a CHAR(x CODEUNITS32) column will be more than 50% space padding.
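To make the length and semantics difference concrete, here is a small hypothetical sketch for a DB2 LUW Unicode (UTF-8) database. The table and column names are invented, and the commented-out ALTER only indicates the kind of change discussed in the question (it may leave the table in reorg-pending state).

-- Hypothetical demo table in a UTF-8 DB2 LUW database
CREATE TABLE unit_demo (
    c_octets CHAR(4 OCTETS),        -- 4 bytes:  holds '€4' (3 + 1 bytes) but not '€42'
    c_cu32   CHAR(4 CODEUNITS32),   -- 16 bytes: holds '€€€€' but not '€2345'
    v_cu32   VARCHAR(4 CODEUNITS32) -- up to 16 bytes, no blank padding
);

INSERT INTO unit_demo VALUES ('€4', '€€€€', '€€');

-- With a CODEUNITS32 column, SUBSTR counts characters, not bytes,
-- so this returns '€€' rather than a fragment of the UTF-8 encoding.
SELECT SUBSTR(c_cu32, 1, 2) FROM unit_demo;

-- The kind of migration change discussed in the question (may require a REORG):
-- ALTER TABLE unit_demo ALTER COLUMN c_octets SET DATA TYPE CHAR(4 CODEUNITS32);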
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008470.html
CODEUNITS32
Indicates that the units for the length attribute are Unicode UTF-32 code units, which approximate counting in characters. This unit of length does not affect the underlying code page of the data type. The actual length of a data value is determined by counting the UTF-32 code units as if the data were converted to UTF-32. A string unit of CODEUNITS32 can be used only in a Unicode database. CODEUNITS32 can be explicitly specified or determined based on an environment setting.
Also, out of interest, GRAPHIC/VARGRAPHIC columns are stored in UTF-16 and default to CODEUNITS16, but can also use CODEUNITS32.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008471.html
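As a small hypothetical illustration (column names made up), both string units can be spelled out for graphic types in a Unicode database:

CREATE TABLE graphic_demo (
    g16 GRAPHIC(4),                 -- same as GRAPHIC(4 CODEUNITS16), stored as UTF-16
    g32 GRAPHIC(4 CODEUNITS32),     -- length attribute counted in UTF-32 code units
    vg  VARGRAPHIC(10 CODEUNITS32)
);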

Related

valid characters in Postgres Ltree label in utf8 charset

The documentation of Postgres ltree says:
A label is a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed). Labels must be less than 256 bytes long.
However, it does not say which characters are valid in a Postgres ltree label if we set the locale to 'en_US.UTF-8'. So, can a dash (hyphen) be used in an ltree label?
Sorry for not updating the answer.
Yes, I finally figured out that it was one of our DBAs who changed the source of ltree and recompiled it to support the dash (-) character. We have a single table with more than 6B records.

Comparison of PostgreSQL text types

I'm migrating from MySQL to PostgreSQL because Oracle. There is a great MySQL text type reference; here is the relevant information for MySQL...
CHAR( ) A fixed section from 0 to 255 characters long.
VARCHAR( ) A variable section from 0 to 255 characters long.
TINYTEXT A string with a maximum length of 255 characters.
TEXT A string with a maximum length of 65535 characters.
BLOB A string with a maximum length of 65535 characters.
MEDIUMTEXT A string with a maximum length of 16777215 characters.
MEDIUMBLOB A string with a maximum length of 16777215 characters.
LONGTEXT A string with a maximum length of 4294967295 characters.
LONGBLOB A string with a maximum length of 4294967295 characters.
PostgreSQL seems a bit different, there is a text type looking through phppgAdmin, not sure what else there is and I'm not finding any good comparison tables.
What are all the available text types in PostgreSQL?
PostgreSQL has more advanced types but doesn't need the distinction between text sizes.
There are 3 string types in PostgreSQL and a binary type:
text
Just a text object with a non-specified size. You can put anything in here and it will be stored. Size doesn't matter.
varchar(n) / character varying(n)
Basically text with a size check; there is virtually no performance difference here (other than checking the size while inserting).
char(n) / character(n)
Just a text where all the extra characters will be padded with space characters so you always get n characters back.
bytea
The blob type you've mentioned is a totally different type altogether. You could replace it with the bytea type: http://www.postgresql.org/docs/9.3/static/datatype-binary.html
Source: http://www.postgresql.org/docs/9.3/static/datatype-character.html
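A minimal sketch of those four types in DDL (the table and column names are made up for illustration):

CREATE TABLE text_demo (
    unbounded text,         -- no declared size limit
    bounded   varchar(10),  -- rejects values longer than 10 characters
    padded    char(10),     -- blank-padded so reads come back 10 characters wide
    raw_bytes bytea         -- binary data, the closest replacement for BLOB
);

INSERT INTO text_demo VALUES ('any length at all', 'ten chars!', 'short', '\xdeadbeef');

-- This would fail with "value too long for type character varying(10)":
-- INSERT INTO text_demo (bounded) VALUES ('eleven chars');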

is varchar(n) saving memory space in postgresql? [closed]

If I define a varchar(25) column and my string is shorter than 25 characters (e.g. 12 chars), SQL shows the character vector with a length of 12 and no trailing spaces are added (unlike character, which pads).
My question is the following: beyond the fact that SQL displays just the string as it was inserted into the field (or truncated if longer than the maximum length), how does PostgreSQL store such a data type?
Is it padding with extra bytes as :
twelvecharxx............. (length : 25)
or just storing 12 bytes?
I assume this could be more complex internally. I just need to know whether the optional maximum-length argument is a safeguard to disallow storing large strings, or just a performance matter (i.e., whether all subsequently stored strings are expected to be at most 25 characters long).
SQL defines two primary character types: character varying(n) and
character(n), where n is a positive integer. Both of these types can
store strings up to n characters in length. An attempt to store a
longer string into a column of these types will result in an error,
unless the excess characters are all spaces, in which case the string
will be truncated to the maximum length. (This somewhat bizarre
exception is required by the SQL standard.) If the string to be stored
is shorter than the declared length, values of type character will be
space-padded; values of type character varying will simply store the
shorter string.
and
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes overhead instead of 1.
finally
Tip: There are no performance differences between these three types,
apart from increased storage size when using the blank-padded type,
and a few extra cycles to check the length when storing into a
length-constrained column. While character(n) has performance
advantages in some other database systems, it has no such advantages
in PostgreSQL.
from here
No padding is stored for varchar, according to the documentation. For character it is, however. There's also some overhead in storing the length of the string.
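If you want to check this yourself, pg_column_size() reports the on-disk size of a value. A minimal sketch (the table name is invented, and the byte counts in the comments are what you would typically see, since short values get a 1-byte length header):

CREATE TABLE len_demo (v varchar(25), c char(25));
INSERT INTO len_demo VALUES ('twelvechars.', 'twelvechars.');

SELECT pg_column_size(v) AS varchar_bytes,  -- typically 13: 12 data bytes + 1-byte header, no padding
       pg_column_size(c) AS char_bytes      -- typically 26: 25 blank-padded bytes + 1-byte header
FROM len_demo;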

SQL Server 2008 R2 : detect encoding in nvarchar field

I have a string table with 1,000,000-plus rows that has some garbage inside due to encoding errors.
The garbage is minimal, but needs to be found.
The column in question is a NVARCHAR column that normally contains text in one of 11 languages.
All of the text should be Unicode (UTF-8 when we process it on the application side).
The corrupt values contain ? characters and/or a very limited, unusual glyph set; by eye they can very easily be seen not to be valid language. It is likely that these values have been encoded back and forth into total garbage.
So in the name of speed, is there anything I can do on SQL Server to detect bad encoding / string garbage?
Thanks.
EDIT to add garbage example:
This was Russian или Ð˜Ð¼Ñ Ð£Ñ‡Ð°Ñтника
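Building on the mojibake example above, one possible heuristic is to search for characters that typically show up when UTF-8 bytes are misread as a single-byte code page (such as Ã, Ð, Ñ, Â). The table and column names below are invented; this only flags candidates for review and is not a reliable detector of every kind of corruption.

SELECT id, suspect_text
FROM   dbo.strings_table
WHERE  suspect_text LIKE N'%[ÃÐÑÂ]%' COLLATE Latin1_General_BIN;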

Does not using NULL in PostgreSQL still use a NULL bitmap in the header?

Apparently PostgreSQL stores a couple of values in the header of each database row.
If I don't use NULL values in that table - is the null bitmap still there?
Does defining the columns with NOT NULL make any difference?
It's actually more complex than that.
The null bitmap needs one bit per column in the row, rounded up to full bytes. It is only there if the actual row includes at least one NULL value and is fully allocated in that case. NOT NULL constraints do not directly affect that. (Of course, if all fields of your table are NOT NULL, there can never be a null bitmap.)
The "heap tuple header" (per row) is 23 bytes long. Actual data starts at a multiple of MAXALIGN (Maximum data alignment) after that, which is typically 8 bytes on 64-bit OS (4 bytes on 32-bit OS). Run the following command from your PostgreSQL binary dir as root to get a definitive answer:
./pg_controldata /path/to/my/dbcluster
On a typical Debian-based installation of Postgres 12 that would be:
sudo /usr/lib/postgresql/12/bin/pg_controldata /var/lib/postgresql/12/main
Either way, there is one free byte between the header and the aligned start of the data, which the null bitmap can utilize. As long as your table has 8 columns or fewer, NULL storage is effectively free (as far as disk space is concerned).
After that, another MAXALIGN (typically 8 bytes) is allocated for the null bitmap to cover another (typically) 64 fields. Etc.
This is valid for at least versions 8.4 - 12 and most likely won't change.
The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, t_natts bits altogether). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null.
http://www.postgresql.org/docs/9.0/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
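To see this on a live database, here is a minimal sketch using the pageinspect extension (requires superuser; the table is invented for the demo). t_bits comes back NULL for the first row because it contains no NULLs and therefore no null bitmap, while the second row shows the per-column bitmap.

CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE TABLE nulltest (a int, b int, c text);
INSERT INTO nulltest VALUES (1, 2, 'x'), (3, NULL, NULL);

-- t_infomask & 1 is the HEAP_HASNULL bit described in the quote above
SELECT t_infomask & 1 AS heap_hasnull, t_bits
FROM heap_page_items(get_raw_page('nulltest', 0));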
So for every 8 columns you use one byte of extra storage, and for roughly every million rows that would take up one megabyte of storage. That does not really seem that important. I would define the tables how they need to be defined and not worry about null headers.