Is the set of distinct graphemes infinite? - unicode

Is there any limit to the number of distinct graphemes that can be represented with a Unicode encoding such as UTF-8? Does, for example, the Unicode standard restrict the number of consecutive combining characters?

The set of possible combinations of a character and combining marks after it is infinite (though only countably infinite ☺). The Unicode Standard says explicitly in clause 2.1 (in chapter 2): “All combining characters can be applied to any base character and can, in principle, be used
with any script.” A combination of a letter and a diacritic can be used as a base character for another diacritic, and so on.
At a higher protocol level, such as in a data format specification, you can of course impose limits, for example on the number of consecutive combining marks. The Unicode Standard itself, however, sets no such restriction.
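As a small illustration (a PostgreSQL sketch, assuming a UTF-8 database; any environment that exposes code points would do), you can stack as many combining marks on a base letter as you like, and the code-point count simply keeps growing while the result remains a single grapheme:
-- 'e' followed by five U+0301 COMBINING ACUTE ACCENT marks: one grapheme, six code points
SELECT length('e' || repeat(U&'\0301', 5));   -- returns 6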

Related

What will be the change in Application behavior after altering table with CODEUNITS32 for supporting unicode behavior?

We are migrating some tables from an AS400 database to DB2 LUW (V11.1).
While migrating we found a special character (€) in the source database (AS400), in a CHAR column, and this leads to an error unless we alter the table column to CODEUNITS32; the DB2 LUW database is configured with UTF-8 byte encoding.
We want to understand what the application's behavior would be after changing the CHAR column to CODEUNITS32. Do we need to update any configuration at the application level (C and Java applications) to handle both character encodings?
After changing to CODEUNITS32:
- Will my C application still compile and handle the change from 8 bits per code unit (UTF-8) to 4 bytes per character (CODEUNITS32)?
- Will my Java application handle the change from 8 bits per code unit (UTF-8) to 4 bytes per character (CODEUNITS32)?
We did some pilot testing by manually inserting the special character into the table after changing the column definition from CHAR to CODEUNITS32, and the testing was successful.
Using a string units specification of CODEUNITS32 for a column does not change the encoding of the column; the data is still stored in UTF-8 for CHAR/VARCHAR columns.
It alters the physical length (CHAR) or maximum length (VARCHAR) of the column by a factor of 4.
It also enables "character semantics" in some functions such as SUBSTR(), so that they work on characters, not bytes, when processing CODEUNITS32 columns. (SUBSTRING() always uses character semantics unless it is processing a FOR BIT DATA column.)
So CHAR(4), which is CHAR(4 OCTETS), is 4 bytes long and can hold at most 4 characters if they are all single-byte in UTF-8. Since € is 3 bytes long in UTF-8, such a column could hold, say, €4 but not €42.
A CHAR(4 CODEUNITS32) is 16 bytes long and is allowed to hold up to 4 characters. It could hold €€€€ but not €2345.
It is worth considering avoiding CHAR(x CODEUNITS32) and preferring VARCHAR(x CODEUNITS32). UTF-8 does not really play well with fixed-width data types. The more common UTF-8 characters are 1 or 2 bytes long, so typically a CHAR(x CODEUNITS32) column will be more than 50% space padding.
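To make the difference concrete, here is a minimal sketch (hypothetical table name; worth verifying against your own DB2 11.1 instance):
-- CHAR(4 OCTETS) counts bytes; CHAR(4 CODEUNITS32) counts characters
CREATE TABLE t (
    c_octets CHAR(4 OCTETS),      -- 4 bytes: '€4' fits (3+1 bytes), '€42' does not
    c_cu32   CHAR(4 CODEUNITS32)  -- 16 bytes: up to 4 characters, e.g. '€€€€'
);
-- On the CODEUNITS32 column, SUBSTR() counts characters rather than bytes
SELECT SUBSTR(c_cu32, 1, 2) FROM t;   -- first two characters, not first two bytes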
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008470.html
CODEUNITS32
Indicates that the units for the length attribute are Unicode UTF-32 code units which approximate counting in characters. This unit of length does not affect the underlying code page of the data type. The actual length of a data value is determined by counting the UTF-32 code units as if the data was converted to UTF-32. A string unit of CODEUNITS32 can be used only in a Unicode database. CODEUNITS32 can be explicitly specified or determined based on an environment setting.
Also, out of interest, GRAPHIC/VARGRAPHIC columns are stored in UTF-16 and default to CODEUNITS16, but can also use CODEUNITS32.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0008471.html

Valid characters in Postgres Ltree label in UTF-8 charset

The documentation of Postgres Ltree says:
A label is a sequence of alphanumeric characters and underscores (for example, in C locale the characters A-Za-z0-9_ are allowed). Labels must be less than 256 bytes long.
However, it does not say which characters are valid if we set the locale to 'en_US.UTF-8'. So, can a dash (hyphen) be used in the label of an Ltree?
Sorry for not updating the answer.
Yes, I finally figured out that one of our DBAs changed the source of ltree and recompiled it to support the dash (-) character. We have a single table with more than 6B records.
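For reference, the stock behaviour can be checked quickly; a sketch against an unmodified ltree build of the version whose documentation is quoted above (later PostgreSQL releases may accept more characters):
CREATE EXTENSION IF NOT EXISTS ltree;
SELECT 'Top.Science.Astronomy'::ltree;  -- accepted: labels use only A-Za-z0-9_
SELECT 'Top.Sci-Fi'::ltree;             -- rejected with an ltree syntax error at the '-'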

How do I format a number of arbitrary length?

If I have data that includes a numeric column with values into the billions (e.g. 63254830038), and I want to format the number as a US Dollar amount (e.g. $63,254,830,038), I know I can use:
SELECT numeric_column, to_char(numeric_column, '$999G999G999G999') from table
to format the values, but to do so reliably I either have to include an unnecessarily long text string ('$999G999G999G999') or know the maximum number of possible digits. Is there a way to say, broadly, "group numbers with a comma" instead of explicitly saying "group the hundreds, group the thousands, Oh! and please group the millions"?
You just need to cast the integer to the money type.
E.g.:
tests=> select cast(63254830038 as money);
Or alternative syntax:
tests=> select 63254830038::money;
And the output (I'm from Poland, so the money type picks up my locale and uses the corresponding currency symbol):
money
----------------------
63.254.830.038,00 zł
Monetary Types documentation.
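If you specifically want US-dollar output, the money type follows the lc_monetary setting, which can be changed per session; a small sketch (the exact locale name available depends on the operating system):
SET lc_monetary = 'en_US.UTF-8';   -- locale name varies by platform
SELECT 63254830038::money;         -- $63,254,830,038.00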
You can try something like this (works in SQL Server, not sure about PostgreSQL):
select convert(varchar,cast('63254830038' as money),1)
You could do things the hard way using regular expressions: convert the number into a string, reverse it, use regexp_replace to insert commas between pairs of 3 digits, and then reverse it again:
select '$' || reverse(regexp_replace(
reverse(numeric_column::varchar),
E'(\\d\\d\\d)(?=\\d)', '\1,', 'g'))
Explanation
The pattern (the second argument to regexp_replace) contains two parts:
(\\d\\d\\d) means 3 digits, which are captured
(?=\\d) is a positive lookahead constraint of a single digit, meaning the match only counts if there is a digit following it. (That is, this digit is checked to exist, but it does not count as part of the match.)
The replacement (the third argument) is the 3 captured digits, plus a comma.
The final argument 'g' is a flag indicating that it should match and replace as many times as possible.
For more information on regular expressions in PostgreSQL, see the documentation.
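As a rough worked example with the value from the question (intermediate results written out by hand, so worth re-checking on a live server):
select '$' || reverse(regexp_replace(
       reverse(63254830038::varchar),       -- '83003845236'
       E'(\\d\\d\\d)(?=\\d)', '\1,', 'g'))  -- '830,038,452,36'
-- result: '$63,254,830,038'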

SQL Server 2008 R2 : detect encoding in nvarchar field

I have a string table with more than 1,000,000 rows that has some garbage inside due to encoding errors.
The garbage is minimal, but needs to be found.
The column in question is an NVARCHAR column that normally contains text in one of 11 languages.
All of the text should be Unicode (UTF-8 when we process it application-side).
The corrupt values contain ? characters and/or a very limited, unusual glyph set; by eye they can very easily be seen not to be valid language. It is likely that these values have been encoded back and forth into total garbage.
So in the name of speed, is there anything I can do on SQL Server to detect bad encoding / string garbage?
Thanks.
EDIT to add garbage example:
This was Russian или Ð˜Ð¼Ñ Ð£Ñ‡Ð°Ñтника
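There is no built-in mojibake detector, but given that the bad values are dominated by '?' replacement characters and by the Latin-1 shadows of UTF-8 lead bytes (Ã, Ð, Ñ, Â and the like, as in the example above), a crude first pass could simply search for those; a sketch with made-up table and column names:
SELECT Id, TextValue
FROM dbo.Translations
WHERE TextValue COLLATE Latin1_General_BIN LIKE N'%[ÃÐÑÂ]%'   -- typical double-encoding artifacts
   OR TextValue LIKE N'%?%';                                  -- conversion failures (will also match genuine question marks)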

Column type for ZipCode in PostgreSQL database?

What is the correct column type for holding ZipCode values in PostgreSQL database?
I strongly disagree with the advice presented here.
The accepted answer accepts things that aren't digits.
The question is about Zip Codes, not postal codes.
If we assume the post is wrong and means international postal codes, there are characters that appear in international postal codes that don't appear in that list, and many international - and also the US domestic - postal codes can be over ten characters.
If we actually answer the question they asked, about zip codes, then there should be no accommodation for anything but digits (and arguably the hyphen).
US zip codes can be up to 11 digits long (13 characters counting the two dashes) - there is a zip, a zip+4, and a zip+6 (which programmers would call zip+4+2) notation; the last is used by skyscrapers, universities, et cetera.
US zip codes are always non-negative integers, and therefore should not be stored as text data, which is subject to non-canonical representation problems (ask anyone who's done a system about that time they found out that their zip 00203 didn't match the zip 203 they accidentally got from constantly and unnecessarily parsing string representations).
If you pretend you're actually tracking international post codes, the short, length-limited text fields suggested here don't even begin to do the job. The word "China" comes to mind.
My opinion:
Decide whether you're actually handling US postal codes or international
If you're handling US postal codes, track them as unsigned integers, and left-pad them with zeros when representing them as text (see the sketch after this list). (Think Unix timestamps and local TZ representations if you need to understand why this will be simpler in the long run.)
If you're handling international post codes, store them in an unbounded unicode string, tie them to the country they represent, and validate country by country with check constraints. This problem is far more difficult than it sounds up front. International addresses are some of the least standardized things on Earth. Wait'll you find out how Japanese house numbers work, or why the British postal 6-code has the gaps it has.
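A minimal sketch of the integer-plus-padding approach for US ZIP codes (table and column names are made up for illustration):
CREATE TABLE us_address (
    id   serial PRIMARY KEY,
    zip5 integer NOT NULL CHECK (zip5 BETWEEN 0 AND 99999)
);
-- Restore the leading zeros only when rendering as text
SELECT lpad(zip5::text, 5, '0') AS zip FROM us_address;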
It is something like xxxxx-xxxx, so varchar(10) is recommended.
If you want to check the syntax of the values in the database, you could create a domain type for zip codes.
CREATE DOMAIN zipcode varchar(10)
CONSTRAINT valid_zipcode
CHECK (VALUE ~ '[A-Z0-9-]+'); -- or a better regular expression
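A column can then use the domain directly, and inserts that fail the check are rejected (a small usage sketch):
CREATE TABLE addresses (zip zipcode);
INSERT INTO addresses VALUES ('12345-6789');  -- accepted
INSERT INTO addresses VALUES ('!!!');         -- rejected by valid_zipcode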
You could have a look at this site, which proposes this regex:
(^\d{5}(-\d{4})?$)|(^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1} *\d{1}[A-Z]{1}\d{1}$)
But you should check it works for the PostgreSQL regex syntax.
It depends on what kind of zip you want. If you're sure you will only need to store the standard 5 digits, then using an int will be the most space-saving.
However, if you need the 5+4 extended form, then a 10-character field is best. I personally suggest that, as it makes things easier in the future if you end up needing to store international postal codes; 10 characters covers just about every possible postal code format I've come across.