Unicode character default collation table - postgresql

I don't know which site this question belongs exactly, so posting it here.
I use Postgresql 9.2 on RHEL 6.4 and observe the following:
select foo
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo
order by foo collate "kk_KZ.utf8"
gives
а
ә
б
в
г
д
е
ж
BUT
select foo
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo
order by foo collate "en_US.utf8"
gives
а
б
в
г
д
е
ә -- misplaced
ж
Further, I found that there is the Default Unicode Collation Element Table [1], which lists the character in question (04D9 ; [.199D.0020.0002.04D9] # CYRILLIC SMALL LETTER SCHWA) in proper order.
I understand that it is silly to expect the cyrillic characters be handled properly by "en_US.utf8" locale, but what is the correct behavior by Unicode or any other relevant standards in cases, where a character does not normally belong to language/locale used for collation?
[1] http://www.unicode.org/Public/UCA/latest/allkeys.txt

It's not misplaced. It might be to you, but it's not to me. :-) In all seriousness, there is no correct behavior by Unicode; there simply cannot be. A character set is a mapping; the collation is a locale-specific set of rules to sort the characters in that set -- and even within the same locale there can be multiple collations.
The ICU docs has colorful examples of how thorny this kind of stuff gets, in case you're curious. Quoting extensively:
http://userguide.icu-project.org/collation
[H]ere are some of the ways languages vary in ordering strings:
The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".
Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".
Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.
A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".
Thai requires that the order of certain letters be reversed.
French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".
Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.
Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.
Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.

Postgresql uses the locales provided by the operating system. In your setup, locales are provided by glibc. Glibc uses a heavily modified version of an "ancient" version of ISO 14651 (see glibc Bug 14095 - Review / update collation data from Unicode / ISO 14651 for information on current pains in trying to update glibc locale data).
As of glibc 2.28, to be released on 2018-08-01, glibc will use data from ISO 14651:2016 (which is synchronized to Unicode 9), and will give the order the OP expects for en_US.
ISO 14651 is Method for comparing character strings and description of the common template tailorable ordering and it is similar to the UCA, with some differences. The CTT (Common Template Table) is the ISO14651 equivalent of the DUCET, and they are aligned.
The first time CYRILLIC SMALL LETTER SCHWA appeared in a collation table in glibc was for the az_AZ locale (Azerbaijani), where it is ordered after CYRILLIC SMALL LETTER IE. This corresponds to:
commit fcababc4e18fee81940dab20f7c40b1e1fb67209
Author: Ulrich Drepper <drepper#redhat.com>
Date: Fri Aug 3 08:42:28 2001 +0000
Update.
2001-08-03 Ulrich Drepper <drepper#redhat.com>
* locale/iso-639.def: Add Tigrinya.
From there, that ordering was eventually moved to the file iso14651_t1 as per Bug 672 - Include iso14651_t1 in collation rules, which was an effort to simplify glibc locale data. This corresponds to:
commit 5d2489928c0040d2a71dd0e63c801f2cf98e7efc
Author: Ulrich Drepper <drepper#redhat.com>
Date: Sun Feb 18 04:34:28 2007 +0000
[BZ #672]
2005-01-16 Denis Barbier <barbier#linuxfr.org>
[BZ #672]
* locales/ca_ES: Replace current collation rules by including
iso14651_t1 and adding extra rules if needed. There should be
no noticeable changes in sorted text. only ligatures and
ignoreable characters have modified weights.
* locales/da_DK: Likewise.
* locales/en_CA: Likewise.
* locales/es_US: Likewise.
* locales/fi_FI: Likewise.
* locales/nb_NO: Likewise.
[BZ #672]
* locales/iso14651_t1: Simplified. Extended.
Most locales in glibc start from iso14651_t1, and tailor it, which is what you are seeing with en_US.
While glibc based its default ordering in Azerbaijani, the DUCET instead bases it on the ordering for Kazakh and Tatar, which is where the difference comes from.

The Unicode Collation Algorithm allows any tailorings to be made to the DUCET.
There isn't a "correct" behaviour. There are various behaviours one could expect, and the most appropriate depends on the context, the audience. Sometimes any behaviour could be correct, since there isn't really a reason to force any order of cyrillic betters in an American English collation.
The Common Locale Data Repository provides locale-specific tailorings to the DUCET. The CLDR uses LDML (Locale Data Markup Language) to specify the tailorings, and the syntax is given by the Unicode Technical Specification #35, part 5.
The latest version of the data provided by the CLDR for en_US has no tailorings: it uses a modified version of the DUCET (as stated in UTS#35 under "Root collation"). It lists the cyrillic schwa after the cyrillic A, i.e., the order you were expecting.
There is also data for an en_US_POSIX locale, and that one includes some modifications, but none changes anything that isn't in ASCII.
It appears the en_US locale installed in your system uses a tailoring that puts the schwa next to E probably because of their similar form. It could be argued that would cause fewer surprises to an American English audience than sorting the schwa after A: ask people what that is and see how many will just tell you it is an "upside-down E". It isn't right or wrong, but if you ask me, it seems more appropriate than the collation found in the CLDR.

Related

Strategy for defining Unicode Ranges by Culture

I am new to Unicode have been given the requirement to look at some translated text, iterate over all of the characters of that translation and determine if all the characters are valid for the target culture (language and location).
For example, if I am translating a document from English to Greek, I want to detect if there are any English/ASCII "A"s in the Greek translation and report that as an error. This may likely be the case from corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there any existing strategy for developing this kind of grouping? I see that there is some grouping of characters at (http://www.unicode.org/charts/). But it seems that this is not quite what I am looking for at first glance.
Does any thing exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: “main/standard”, “auxiliary”, “index”, and “punctuation”.
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data at CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that in normal Greek texts, Latin letters are not uncommon, especially in technical areas. For example, the international symbol for the ampere is “A” as a Latin letter; the symbol for the kilogram is “kg”, in Latin letters, even though the word for it is written Greek letters in Greek.
Thus, no matter how you run the analysis, the occurrence of Latin “A” in Greek text could be flagged as potentially suspicious, but not an error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski

XML and Unicode specifications: what’s a legal character?

My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.
XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.
Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Question: What is/was a legal Unicode character?
The Unicode Glossary defines it thus:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)
NUL is a codepoint, and it falls under the definition of "abstract character" so it is a character by sense 2 above.
Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?
NUL has been a control character from early versions.
Appendix D contains a list of changes.
It says in table D.2 that there have been 65 control characters from Version 1 through Version 3 without change.
Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls 65 65 65 65 65
And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.
The use of the word “character” is intentionally fuzzy in the Unicode standard, but mostly it is used in a technical sense: a code point designated as an assigned character code point. This does not completely coincide with the intuitive concept of character. For example, the intuitive character that consists of letter i with macron and grave accent does not exist as a code point; in Unicode, it can only be represented as a sequence of two or three code points. As another example, the so-called control characters are not characters in the intuitive sense.
When other standards and specifications refer to “Unicode characters,” they refer to code points designated as assigned character code points. The set of Unicode characters varies by Unicode standard version, since new code points are assigned. Technically, the UnicodeData.txt file (at ftp://ftp.unicode.org/Public/UNIDATA/) indicates which code points are characters.
U+0000, conventionally denoted by NUL, has been a Unicode character since the beginning.
The XML specifications are inexact in many ways as regards to characters, as you have observed. But the essential definition is the BNF production for “Char” and the statement “XML processors MUST accept any character in the range specified for Char.” This means that in XML specifications, the concept of character is broader than Unicode character. The ranges in the production contain unassigned code points, actually a huge number of them.
The comment to the “Char” production in XML specifications is best ignored. It is very confusing and even incorrect. The “Char” production simply refers to a set of Unicode code points (different sets in different versions of XML). The set includes code points that you should never use in character data, as well as code points that should be avoided for various reasons. But such rules are at a level different from the formal rules of XML and requirements on XML implementations.
When selecting or writing a routine for checking character data, it depends on the application and purpose what should be accepted and what should be done with code points that fail the test. Even surrogate code points might be processed in some way instead of being just discarded; they may well appear due to confusions with encodings (or e.g. when a Java string has been naively taken as a string of Unicode characters – it is as such just a sequence of 16-bit code units).
I would ignore the verbage and just focus on the definitions:
XML 1.0:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
XML 1.1:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
It sounds stupid because it is stupid. The First Edition of XML (1998) read "the legal graphic characters of Unicode." For whatever reason, the word "graphic" was removed from the Second Edition of 2000, perhaps because it is inaccurate: XML allows many characters that are not graphic characters.
The definition in the Char production is indeed the right place to look.

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents :
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regex's, one to check it is Chinese, but is not simplified Chinese
According to wikipedia simplified Chinese v. traditional, kanji, or other formats is left up to the font rendering in many cases. So while you could have a selection of simplified Chinese codepoints, this list would not be at all complete since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument that the characters regardless of their shape have the same meaning and therefore should be represented by the same code. Well, it's not meaningless to me because I am analyzing individual characters which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduct that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduct that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.