I have a PostgreSQL database with UTF8 encoding and LC_* en_US.UTF8. The database stores text columns in many different languages.
On some columns however, I am 100% sure there will never be any special characters, i.e. ISO country & currency codes.
I've tried doing something like:
"countryCode" char(3) CHARACTER SET "C" NOT NULL
and
"countryCode" char(3) CHARACTER SET "SQL_ASCII" NOT NULL
but this comes back with the error
ERROR: type "pg_catalog.bpchar_C" does not exist
ERROR: type "pg_catalog.bpchar_SQL_ASCII" does not exist
What am I doing wrong?
More importantly, should I even bother with this? I'm coming from a MySQL background where doing this was a performance and space enhancement, is this also the case with PostgreSQL?
TIA
Honestly, I do not see the purpose of such settings, as:
as #JoachimSauer mentions, ASCII subset in the UTF-8 encoding will occupy exactly the same number of bytes, as that was the main point of inventing UTF-8: keep ASCII unchanged. Therefore I see no size benefits;
all software that is capable of processing strings in different encoding will use a common internal encoding, which is UTF-8 by default for PostgreSQL nowadays. When some textual data comes in to the processing stage, database will convert it into the internal encoding if encodings do not match. Therefore, if you specify some columns as being non-UTF8, this will lead to the extra processing of the data, thus you will loose some cycles (don't think it will be notable performance hit though).
Given there's no space benefits and there's a potential performance hit, I think it is better to leave things as they are, i.e. keep all columns in the database's default encoding.
I think for the same arguments PostgreSQL do not allow to specify encodings for individual objects within the database. Character Set and Locale are set on the per-database level.
Related
I have gone through the official postgres documentation to know about the LC_COLLATE and LC_TYPE. But, still I don't understand it correctly.
Can anyone help me in understanding these concepts and impact of these, especially when we are trying to load data which is at oracle of encoding WE8ISO8859P15 and at postgres encoding is as utf-8 and collation/ctype is en_US.UTF-8.
Thanks in advance
This is part of the “locale”, the national language support, which is different from the encoding (but the locale has to belong to the encoding).
LC_CTYPE determines which characters are letters, numbers, space characters, punctuation etc. Different languages have different ideas about that.
LC_COLLATE determines how strings are compared and sorted.
The first has little impact on the behavior of PostgreSQL, but the second is very relevant: it determines how b-tree indexes on string columns are ordered (which is why it cannot be changed after a database has been created) and how ORDER BY sorts strings by default (which is directly user-visible).
How exactly is one meant to seamlessly support all languages stored within postgres's utf8 character set? We seem to be required to specify a single language-specific collation along with the character set, such as en_US.utf8. If I'm not mistaken, we don't have the ability to store both English (en_US) and Chinese (zh_CN) in the same utf8 column, while maintaining any kind of meaningful collation behavior. If I define a column as en_US.utf8, how is it supposed to handle values containing Chinese (zh_CN) characters / byte sequences? The reality is that a single column value can contain multiple languages (ex: "Hello and 晚安"), and simply cannot be collated according to a single language.
Yes, I can physically store any character sequences; but what is the defined behavior for ordering on a en_US.utf8 column that contains English, German, Chinese, Japanese and Korean strings?
I understand that mysql's utf8mb4_unicode_ci collation isn't perfect, and that it is not following any set standard for how to collate the entire unicode set. I can already hear the anti-mysql crowd sighing about how mysql's language-agnostic collations are arbitrary, semantically meaningless, or even purely invalid. But the fact is, it works well enough, and fulfills the expectation that utf8 = multi-language unicode support.
Is postgres just being extremely stubborn with the fact that it's semantically incorrect to collate across the unicode spectrum? I know the developers are very strict when it comes to "doing things according to spec", but this inability to juggle multiple languages is frustrating to say the least. Am I missing something that solves the multi-language problem, or is the official stance that a single utf8 column can handle any language, but only one language at a time?
You are right that there will never be a perfect way to collate strings across languages.
PostgreSQL has decided not to create its own collations but to use those provided by the operating system. The idea behind this is to avoid re-inventing the wheel and to reduce maintenance effort.
So the traditional PostgreSQL answer to your question would be: if you want a string collation that works reasonably well for strings in different languages, complain to your operating system vendor or pick an operating system that provides such a collation.
However, this approach has drawbacks that the PostgreSQL community is aware of:
Few – if any – people decide on an operating system based on the collation support it provides.
PostgreSQL's sorting behaviour depends on the underlying operating system, which leads to frequent questions by confused users on the mailing lists.
With some operating systems collation behaviour can change during an operating system upgrade, leading to corrupt database indexes (see for example this thread).
It may well be that PostgreSQL changes its approach; there have been repeated efforts to use ICU libraries instead of operating system collations (see for example this recent thread), which would mitigate some of these problems.
Does anyone know of a simple chart or list that would show all acceptable varchar characters? I cannot seem to find this in my googling.
What codepage? Collation? Varchar stores characters assuming a specific codepage. Only the lower 127 characters (the ASCII subset) is standard. Higher characters vary by codepage.
The default codepage used matches the collation of the column, whose defaults are inherited from the table,database,server. All of the defaults can be overriden.
In sort, there IS no "simple chart". You'll have to check the character chart for the specific codepage, eg. using the "Character Map" utility in Windows.
It's far, far better to use Unicode and nvarchar when storing to the database. If you store text data from the wrong codepage you can easily end up with mangled and unrecoverable data. The only way to ensure the correct codepage is used, is to enforce it all the way from the client (ie the desktop app) to the application server, down to the database.
Even if your client/application server uses Unicode, a difference in the locale between the server and the database can result in faulty codepage conversions and mangled data.
On the other hand, when you use Unicode no conversions are needed or made.
using jdbc (jt400) to insert data into an as400 table.
db table code page is 424. Host Code Page 424
the ebcdic 424 code page does not support many of the characters that may come from the client.
for example the sign → (Ascii 26 Hex 1A)
the result is an incorrect translation.
is there any built-in way in the toolbox to remove any of the unsupported characters?
You could try to create a logical file over your ccsid424 physical file with a different codepage. It is possible on the as/400 to create logical files with different codepages for individual columns, by adding the keyword CCSID(<num>). You can even set it to an unicode charset, e.g. CCSID(1200) for UTF-16. Of course your physical file will still only be able to store chars that are in the 424 codepage, and those will be replaced by some invalid character char, but the translation might be better that way.
There is no way to store chars, that are not in codepage 424 in a column with that codepage directly (the only way I can think of is encoding them somehow with multiple chars, but that is most likely not what you want to do, since it will bring more problems than it "solves").
If you have control over that system, and it is possible to do some bigger changes, you could do it the other way around: create a new unicode version of that physical file with a different name (I'd propose CCSID(1200), that's as close as you get to UTF-16 on as/400 afaik, and UTF-8 is not supported by all parts of the system in my experience. IBM does recommend 1200 for unicode). Than transfer all data from your old file to the new one, delete the old one (before that, backup it!), and than create a logical file over the new physical, with the name of the old physical file. In that logical file change all ccsid-bearing columns from 1200 to 424. That way, existing programs can still work on the data. Of course there will be invalid chars in the logical file now, once you insert data that is not in a subset of ccsid 424; so you will most likely have to take a look at all programs that use the new logical file.
We are currently using VARCHAR for storing text data in DB2 however we are hitting the problem that length of VARCHAR specified is not the same as length of text because in DB2 VARCHAR length specified is UTF-8 data length which can vary depending on stored text data. For example some texts contain characters from different languages and because of it some texts with 500 characters can't be saved in VARCHAR(500) and etc.
Now we are planning to migrate to VARGRAPHIC. I need to know what are limitations of using VARGRAPHIC for storing unicode text data in DB2.
Are there any problems with using VARGRAPHIC?
DB2 doesn't check that the data is in fact double-byte String, but it assumes it must be. Usually the drivers will do proper conversions for you but you might one day bump into some bug. It is unlikely though.
If you use federated databases Vargraphic support in queries might fail completely. In overall the amount of bug reports for vargraphic data types is somewhat high. Support for it isn't probably as well tested and tried as for other data types.
Vargraphic will with unicode database (ie. UTF-8 is requirement) use big-endian UCS-2, meaning your space requirements for those columns double. Vargraphic is DB2 properietary data type. If you migrate off DB2 some day you will have to do an extra conversion.