DB2 VARCHAR Unicode data storage

We are currently using VARCHAR for storing text data in DB2, but we are hitting the problem that the declared VARCHAR length is not the same as the length of the text: in DB2 the VARCHAR length is a UTF-8 byte length, which varies depending on the stored text. For example, some texts contain characters from different languages, and because of that a text of 500 characters may not fit into VARCHAR(500).
We are now planning to migrate to VARGRAPHIC. I need to know what the limitations of using VARGRAPHIC for storing Unicode text data in DB2 are.
Are there any problems with using VARGRAPHIC?

DB2 doesn't check that the data is in fact a double-byte string; it simply assumes it is. Usually the drivers will do the proper conversions for you, but you might one day bump into a bug. It is unlikely, though.
If you use federated databases, VARGRAPHIC support in queries might fail completely. Overall, the number of bug reports for the VARGRAPHIC data type is somewhat high; support for it probably isn't as well tested and tried as for other data types.
In a Unicode database (i.e. UTF-8 is required), VARGRAPHIC uses big-endian UCS-2, meaning your space requirements for those columns double. VARGRAPHIC is a DB2-proprietary data type; if you migrate off DB2 some day you will have to do an extra conversion.
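For illustration, a minimal DDL sketch (the table and column names are hypothetical) of the difference between the two length limits in a Unicode DB2 database:
-- VARCHAR(n) limits the UTF-8 byte length: 500 non-ASCII characters may not fit.
CREATE TABLE note_varchar (
    id   INTEGER NOT NULL PRIMARY KEY,
    body VARCHAR(500)
);
-- VARGRAPHIC(n) limits the length in double-byte (UTF-16) code units:
-- any 500 characters from the Basic Multilingual Plane fit, at roughly
-- double the storage cost for ASCII-heavy text.
CREATE TABLE note_vargraphic (
    id   INTEGER NOT NULL PRIMARY KEY,
    body VARGRAPHIC(500)
);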

Related

PostgreSQL bytea network traffic is double the expected value

I'm investigating a bandwidth problem and stumbled on an issue with retrieving a bytea value. I tested this with PostgreSQL 10 and 14, the respective psql clients and the psycopg2 client library.
The issue is that if the size of a bytea value is e.g. 10 MB (which I can confirm by doing select length(value) from table where id=1), and I do select value from table where id=1, then the amount of data transferred over the socket is about 20 MB. Note that the value in the database is pre-compressed (so high entropy), and the table is set not to compress the bytea value to avoid double work.
I can't find any obvious encoding issue since it's all just bytes. I can understand that the psql CLI command may negotiate some encoding so it can print the result, but psycopg2 definitely doesn't do that, and I experience the same behaviour.
I tested the same scenario with a text field, and that nearly worked as expected. I started with a copy-paste of lorem ipsum and it transferred the correct amount of data, but when I changed the text to random extended-ASCII values (higher entropy again), it transferred more data than it should have. I have compression disabled for all my columns, so I don't understand why that would happen.
Any ideas as to why this would happen?
That is normal. By default, values are transferred as strings, so a bytea is rendered as hexadecimal digits, which doubles its size.
As a workaround, you could transfer such data in binary mode. The frontend-backend protocol and the C library offer support for that, but it will depend on your client API whether you can make use of that or not.
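You can see the effect from psql with a small demonstration query (the literal is arbitrary): octet_length() gives the raw byte length, while casting the value to text yields the default hex representation, which is a little over twice as long.
-- 4 raw bytes vs. 10 characters ('\xdeadbeef') in the text form sent over the wire
SELECT octet_length('\xdeadbeef'::bytea)   AS raw_bytes,
       length(('\xdeadbeef'::bytea)::text) AS text_chars;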

Which data type to use for storing a Firebase UID in PostgreSQL

I want to store the UIDs generated by Firebase Auth in a Postgres database. As it is not a valid UUID, I am not sure which data type to choose. Mainly, I am not sure whether I should use a char or a varchar.
I would say use varchar to allow for the UID format changing over time. From the Postgres end there is really no difference; see this tip from the documentation:
Tip
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
Firebase Authentication UIDs are just strings. The strings don't contain any data - they are just random. A varchar seems appropriate.
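For example, a sketch with hypothetical table and column names (the length limit is only a guess, since Firebase treats the UID as an opaque string):
CREATE TABLE app_user (
    firebase_uid varchar(128) PRIMARY KEY,           -- opaque string ID from Firebase Auth
    created_at   timestamptz  NOT NULL DEFAULT now()
);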

Can't find a varchar chart of acceptable characters

Does anyone know of a simple chart or list that would show all acceptable varchar characters? I cannot seem to find this in my googling.
What codepage? Collation? Varchar stores characters assuming a specific codepage. Only the lower 128 characters (the ASCII subset) are standard; higher code points vary by codepage.
The codepage used matches the collation of the column, whose default is inherited from the table, database, and server. All of these defaults can be overridden.
In short, there IS no "simple chart". You'll have to check the character chart for the specific codepage, e.g. using the "Character Map" utility in Windows.
It's far, far better to use Unicode and nvarchar when storing to the database. If you store text data with the wrong codepage you can easily end up with mangled and unrecoverable data. The only way to ensure the correct codepage is used is to enforce it all the way from the client (i.e. the desktop app) to the application server, down to the database.
Even if your client/application server uses Unicode, a difference in the locale between the server and the database can result in faulty codepage conversions and mangled data.
On the other hand, when you use Unicode no conversions are needed or made.
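A short T-SQL sketch of both points (the collation name and sample text are only illustrative, and the mangling assumes a CP-1252 database collation):
-- Which code page does a given collation map to?
SELECT COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS code_page;  -- 1252
-- Greek text assigned to a varchar is converted to that code page and mangled;
-- the nvarchar copy stays intact.
DECLARE @v varchar(20)  = N'Ωμέγα';
DECLARE @n nvarchar(20) = N'Ωμέγα';
SELECT @v AS mangled, @n AS intact;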

Specifying ASCII columns in a UTF-8 PostgreSQL database

I have a PostgreSQL database with UTF8 encoding and LC_* en_US.UTF8. The database stores text columns in many different languages.
On some columns however, I am 100% sure there will never be any special characters, i.e. ISO country & currency codes.
I've tried doing something like:
"countryCode" char(3) CHARACTER SET "C" NOT NULL
and
"countryCode" char(3) CHARACTER SET "SQL_ASCII" NOT NULL
but these come back with the errors
ERROR: type "pg_catalog.bpchar_C" does not exist
ERROR: type "pg_catalog.bpchar_SQL_ASCII" does not exist
What am I doing wrong?
More importantly, should I even bother with this? I'm coming from a MySQL background, where doing this was a performance and space enhancement; is this also the case with PostgreSQL?
TIA
Honestly, I do not see the purpose of such settings, as:
as #JoachimSauer mentions, the ASCII subset of UTF-8 occupies exactly one byte per character, since keeping ASCII unchanged was the main point of inventing UTF-8. Therefore I see no size benefit;
all software capable of processing strings in different encodings uses a common internal encoding, which is UTF-8 by default for PostgreSQL nowadays. When textual data comes in to the processing stage, the database converts it into the internal encoding if the encodings do not match. Therefore, if you specify some columns as being non-UTF-8, this leads to extra processing of the data, so you will lose some cycles (I don't think it would be a noticeable performance hit, though).
Given there's no space benefit and there is a potential performance hit, I think it is better to leave things as they are, i.e. keep all columns in the database's default encoding.
I think it is for the same reasons that PostgreSQL does not allow specifying encodings for individual objects within the database. The character set and locale are set at the database level.
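If the real goal is simply "plain-ASCII country codes", a CHECK constraint or a domain expresses that in PostgreSQL without touching encodings; a sketch with illustrative names:
CREATE DOMAIN country_code AS char(3)
    CHECK (VALUE ~ '^[A-Z]{3}$');                    -- uppercase ASCII letters only
CREATE TABLE price (
    "countryCode" country_code  NOT NULL,
    amount        numeric(12,2) NOT NULL
);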

Does a varchar field's declared size have any impact in PostgreSQL?

Is VARCHAR(100) any better than VARCHAR(500) from a performance point of view? What about disk usage?
Talking about PostgreSQL today, not some database some time in history.
They are identical.
From the PostgreSQL documentation:
http://www.postgresql.org/docs/8.3/static/datatype-character.html
Tip: There are no performance differences between these three types, apart from increased storage size when using the blank-padded type, and a few extra cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead.
Here they are talking about the differences between char(n), varchar(n) and text (= varchar(1G)). The official story is that there is no difference between varchar(100) and text (very large varchar).
There is no difference between varchar(m) and varchar(n).
http://archives.postgresql.org/pgsql-admin/2008-07/msg00073.php
There is a difference between varchar(n) and text, though: varchar(n) has a built-in constraint which must be checked, and it is actually a little slower.
http://archives.postgresql.org/pgsql-general/2009-04/msg00945.php
TEXT /is/ the same as VARCHAR without an explicit length. The text
"The storage requirement for a short
string (up to 126 bytes) is 1 byte
plus the actual string, which includes
the space padding in the case of
character. Longer strings have 4 bytes
overhead instead of 1. Long strings
are compressed by the system
automatically, so the physical
requirement on disk might be less.
Very long values are also stored in
background tables so that they do not
interfere with rapid access to shorter
column values. In any case, the
longest possible character string that
can be stored is about 1 GB."
refers to both VARCHAR and TEXT (since VARCHAR(n) is just a limited version of TEXT). Limiting your VARCHARS artificially has no real storage or performance benefits (the overhead is based on the actual length of the string, not the length of the underlying varchar), except possibly for comparisons against wildcards and regexes (but at the level where that starts to matter, you should probably be looking at something like PostgreSQL's full-text indexing support).
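A quick sanity check of that last point (just a demonstration query): pg_column_size() reports the stored size of a value, and the declared varchar limit plays no part in it.
SELECT pg_column_size('abc'::varchar(100)) AS vc100,
       pg_column_size('abc'::varchar(500)) AS vc500,
       pg_column_size('abc'::text)         AS txt;
-- All three report the same size; only the actual string length matters.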