Since Java code can run on any Java VM, how can I determine programmatically which Unicode version is supported?
The easiest but worst way I can think of would be to pick a code point that’s new to each Unicode release and check its Character properties. Or you could check its General Category with a regex. Here are some selected code points:
Unicode 6.0.0:
Ꞡ U+A7A0 GC=Lu SC=Latin LATIN CAPITAL LETTER G WITH OBLIQUE STROKE
₹ U+20B9 GC=Sc SC=Common INDIAN RUPEE SIGN
ₜ U+209C GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER T
Unicode 5.2:
Ɒ U+2C70 GC=Lu SC=Latin LATIN CAPITAL LETTER TURNED ALPHA
⅐ U+2150 GC=No SC=Common VULGAR FRACTION ONE SEVENTH
⸱ U+2E31 GC=Po SC=Common WORD SEPARATOR MIDDLE DOT
Unicode 5.1:
ꝺ U+A77A GC=Ll SC=Latin LATIN SMALL LETTER INSULAR D
Ᵹ U+A77D GC=Lu SC=Latin LATIN CAPITAL LETTER INSULAR G
⚼ U+26BC GC=So SC=Common SESQUIQUADRATE
Unicode 5.0:
Ⱶ U+2C75 GC=Lu SC=Latin LATIN CAPITAL LETTER HALF H
ɂ U+0242 GC=Ll SC=Latin LATIN SMALL LETTER GLOTTAL STOP
⬔ U+2B14 GC=So SC=Common SQUARE WITH UPPER RIGHT DIAGONAL HALF BLACK
I've included the general category and the script property, although you can only inspect the script in JDK7, the first Java release that supports that.
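In Java, both properties can be read straight off the Character class. What follows is only a minimal sketch, assuming JDK 7 or later for Character.UnicodeScript; the class name CodePointProps and the method describe are made up for illustration. On a runtime whose Unicode tables predate a code point, getType reports Character.UNASSIGNED (0) for it.

```java
public class CodePointProps {
    /** Formats the general-category constant and the script of one code point. */
    static String describe(int codePoint) {
        int gc = Character.getType(codePoint); // int constant, e.g. Character.UPPERCASE_LETTER
        Character.UnicodeScript sc = Character.UnicodeScript.of(codePoint); // JDK 7+
        return String.format("U+%04X GC=%d SC=%s", codePoint, gc, sc);
    }

    public static void main(String[] args) {
        // The three Unicode 6.0.0 code points from the list above.
        int[] samples = { 0xA7A0, 0x20B9, 0x209C };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

The general category is printed as the raw int constant from Character, so compare it against names such as Character.CURRENCY_SYMBOL rather than reading the number directly.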
I found those code points by running commands like this from the command line:
% unichars -gs '\p{Age=5.1}'
% unichars -gs '\p{Lu}' '\p{Age=5.0}'
That’s the unichars program. It can only find properties that exist in the version of the Unicode Character Database supported by the Perl you’re running.
I also like my output sorted, so I tend to run
% unichars -gs '\p{Alphabetic}' '\p{Age=6.0}' | ucsort | less -r
where that’s the ucsort program, which sorts text according to the Unicode Collation Algorithm.
However, in Perl, unlike in Java, this is easy to find out. For example, if you run this from the command line (yes, there’s a programmer API, too), you find:
$ corelist -a Unicode
v5.6.2 3.0.1
v5.8.0 3.2.0
v5.8.1 4.0.0
v5.8.8 4.1.0
v5.10.0 5.0.0
v5.10.1 5.1.0
v5.12.0 5.2.0
v5.14.0 6.0.0
That shows that Perl version 5.14.0 was the first one to support Unicode 6.0.0. For Java, I believe there is no API that gives you this information directly, so you’ll have to hardcode a table mapping Java versions and Unicode versions, or else use the empirical method of testing code points for properties. By empirically, I mean the equivalent of this sort of thing:
% ruby -le 'print "\u2C75" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
pass 5.2
% ruby -le 'print "\uA7A0" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
fail 6.0
% ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]
% perl -le 'print "\x{2C75}" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
pass 5.2
% perl -le 'print "\x{A7A0}" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
pass 6.0
% perl -v
This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level
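A rough Java equivalent of those one-liners, offered as a sketch rather than a definitive test: it asks the regex engine whether a code point matches \p{Lu}. The class name LuProbe and the method isLu are hypothetical. On a runtime whose Unicode tables predate the code point, the match simply fails.

```java
import java.util.regex.Pattern;

public class LuProbe {
    private static final Pattern UPPERCASE_LETTER = Pattern.compile("\\p{Lu}");

    /** True if this runtime classifies the code point as GC=Lu. */
    static boolean isLu(int codePoint) {
        // Build the string from the code point so that non-BMP values work too.
        String s = new String(Character.toChars(codePoint));
        return UPPERCASE_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The same two code points probed in the transcript above.
        System.out.println(isLu(0x2C75) ? "U+2C75 is Lu" : "U+2C75 is not Lu");
        System.out.println(isLu(0xA7A0) ? "U+A7A0 is Lu" : "U+A7A0 is not Lu");
    }
}
```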
To find out the age of a particular code point, run uniprops -a on it like this:
% uniprops -a 10424
U+10424 ‹𐐤› \N{DESERET CAPITAL LETTER EN}
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None Script=Deseret East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Dsrt Script=Dsrt Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
All my Unicode tools are available in the Unicode::Tussle bundle, including unichars, uninames, uniquote, ucsort, and many more.
Java 1.7 Improvements
JDK7 goes a long way toward making a few Unicode things easier. I talk about that a bit at the end of my OSCON Unicode Support Shootout talk. I had thought of putting together a table of which languages support which versions of Unicode in which versions of those languages, but ended up scrapping that and telling people to just get the latest version of each language. For example, I know that Unicode 6.0.0 is supported by Java 1.7, Perl 5.14, and Python 3.2 (Python 2.7 stops at Unicode 5.2).
JDK7 contains updates for classes Character, String, and Pattern in support of Unicode 6.0.0. This includes support for Unicode script properties, and several enhancements to Pattern to allow it to meet Level 1 support requirements for Unicode UTS#18 Regular Expressions. These include
The isUpperCase and isLowerCase methods now correctly correspond to the Unicode uppercase and lowercase properties; previously they applied only to letters, which isn’t right, because it misses Other_Uppercase and Other_Lowercase code points, respectively. For example, here are some lowercase code points that are not GC=Ll (lowercase letters); selected samples only:
% unichars -gs '\p{lowercase}' '\P{LL}'
◌ͅ U+0345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI
ͺ U+037A GC=Lm SC=Greek GREEK YPOGEGRAMMENI
ˢ U+02E2 GC=Lm SC=Latin MODIFIER LETTER SMALL S
ˣ U+02E3 GC=Lm SC=Latin MODIFIER LETTER SMALL X
ᴬ U+1D2C GC=Lm SC=Latin MODIFIER LETTER CAPITAL A
ᴮ U+1D2E GC=Lm SC=Latin MODIFIER LETTER CAPITAL B
ᵂ U+1D42 GC=Lm SC=Latin MODIFIER LETTER CAPITAL W
ᵃ U+1D43 GC=Lm SC=Latin MODIFIER LETTER SMALL A
ᵇ U+1D47 GC=Lm SC=Latin MODIFIER LETTER SMALL B
ₐ U+2090 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER A
ₑ U+2091 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER E
ⅰ U+2170 GC=Nl SC=Latin SMALL ROMAN NUMERAL ONE
ⅱ U+2171 GC=Nl SC=Latin SMALL ROMAN NUMERAL TWO
ⅲ U+2172 GC=Nl SC=Latin SMALL ROMAN NUMERAL THREE
ⓐ U+24D0 GC=So SC=Common CIRCLED LATIN SMALL LETTER A
ⓑ U+24D1 GC=So SC=Common CIRCLED LATIN SMALL LETTER B
ⓒ U+24D2 GC=So SC=Common CIRCLED LATIN SMALL LETTER C
The alphabetic tests are now correct in that they use Other_Alphabetic. They did this wrong prior to 1.7, which is a problem.
The \x{HHHHH} pattern escape so you can meet RL1.1; this lets you rewrite [𝒜-𝒵] (which fails due to The UTF‐16 Curse) as [\x{1D49C}-\x{1D4B5}]. JDK7 is the first Java release that fully/correctly supports non-BMP characters in this regard. Amazing but true.
More properties for RL1.2, of which the script property is by far the most important. This lets you write \p{script=Greek} for example, abbreviated as \p{IsGreek}.
The new UNICODE_CHARACTER_CLASSES pattern compilation flag and corresponding pattern‐embeddable flag "(?U)" to meet RL1.2a on compatibility properties.
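The three regex features in that list can be tried out with a few lines of Java. This is a hedged sketch, assuming JDK 7 or later; the class name Jdk7PatternDemo and its helper methods are invented for the example.

```java
import java.util.regex.Pattern;

public class Jdk7PatternDemo {
    /** RL1.1: a character class written with code point escapes instead of raw surrogates. */
    static boolean isMathScriptCapital(String s) {
        return Pattern.matches("[\\x{1D49C}-\\x{1D4B5}]", s);
    }

    /** RL1.2: match on the script property. */
    static boolean isAllGreek(String s) {
        return Pattern.matches("\\p{script=Greek}+", s);
    }

    /** RL1.2a: with the flag set, the word class follows Unicode rather than ASCII. */
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASSES).matcher(s).matches();
    }

    public static void main(String[] args) {
        String scriptA = new String(Character.toChars(0x1D49C)); // MATHEMATICAL SCRIPT CAPITAL A
        System.out.println(isMathScriptCapital(scriptA));
        System.out.println(isAllGreek("\u03B1\u03B2\u03B3"));
        System.out.println(isUnicodeWord("\u00E9t\u00E9"));
        // Without the flag, the word class is ASCII-only, so the accented word is rejected.
        System.out.println(Pattern.matches("\\w+", "\u00E9t\u00E9"));
    }
}
```

The same flag can be embedded in the pattern itself as (?U) when you cannot control how the pattern is compiled.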
I can certainly see why you want to make sure you’re running a Java with Unicode 6.0.0 support, since that comes with all those other benefits, too.
This is not trivial if you are looking for a class to make this information available to you.
Typically, versions of Unicode supported by Java change from one major specification to another, and this information is documented in the Character class of the Java API documentation (which is derived from the Java Language specification). You cannot however rely on the Java language specification, as each major version of Java need not have its own version of the Java Language Specification.
Therefore, you ought to translate between the Java version supported by the JVM and the corresponding Unicode version, like this:
static String unicodeVersion() {
    String specVersion = System.getProperty("java.specification.version");
    if (specVersion.equals("1.7"))
        return "6.0";
    else if (specVersion.equals("1.6"))
        return "4.0";
    else if (specVersion.equals("1.5"))
        return "4.0";
    else if (specVersion.equals("1.4"))
        return "3.0";
    // ... and so on for the other versions
    return null; // specification version not covered above
}
The details of the supported versions can be obtained from the Java Language Specification. Quoting JSR 901, the language specification for Java 7:
The Java SE platform tracks the Unicode specification as it evolves.
The precise version of Unicode used by a given release is specified in
the documentation of the class Character.
Versions of the Java
programming language prior to 1.1 used Unicode version 1.1.5. Upgrades
to newer versions of the Unicode Standard occurred in JDK 1.1 (to
Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode
3.0), and Java SE 5.0 (to Unicode 4.0).
I don't think it's available via a public API. But this doesn't change very often, so you can get the specification version:
System.getProperties().getProperty("java.specification.version")
and, based on that, work out the Unicode version.
java 1.0 -> Unicode 1.1
java 1.1 -> Unicode 2.0
java 1.2 -> Unicode 2.0
java 1.3 -> Unicode 2.0
java 1.4 -> Unicode 3.0
java 1.5 -> Unicode 4.0
java 1.6 -> Unicode 4.0
java 1.7 -> Unicode 6.0
To verify it, you can see the JavaDoc for Character class.
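The table above can be kept as a lookup map. This is a minimal sketch, not an official API: the class name UnicodeBySpecVersion and the method unicodeVersionFor are hypothetical, and the rows are copied from the table, so newer Java releases need rows added by hand.

```java
import java.util.HashMap;
import java.util.Map;

public class UnicodeBySpecVersion {
    private static final Map<String, String> UNICODE_BY_SPEC = new HashMap<String, String>();
    static {
        // Rows copied from the table above.
        UNICODE_BY_SPEC.put("1.0", "1.1");
        UNICODE_BY_SPEC.put("1.1", "2.0");
        UNICODE_BY_SPEC.put("1.2", "2.0");
        UNICODE_BY_SPEC.put("1.3", "2.0");
        UNICODE_BY_SPEC.put("1.4", "3.0");
        UNICODE_BY_SPEC.put("1.5", "4.0");
        UNICODE_BY_SPEC.put("1.6", "4.0");
        UNICODE_BY_SPEC.put("1.7", "6.0");
    }

    /** Returns the Unicode version for a specification version, or null if the table has no row. */
    static String unicodeVersionFor(String specVersion) {
        return UNICODE_BY_SPEC.get(specVersion);
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        String unicode = unicodeVersionFor(spec);
        System.out.println("Java " + spec + " -> Unicode "
                + (unicode == null ? "unknown (extend the table)" : unicode));
    }
}
```

Returning null for an unknown version keeps the caller honest: a runtime newer than the table is reported as unknown instead of being silently mapped to an old Unicode version.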
The Unicode version is defined in the Java Language Specification §3.1. Unicode 4.0 has been supported since J2SE 5.0.
To quote:
Versions of the Java programming language prior to JDK 1.1 used Unicode 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode 3.0), Java SE 5.0 (to Unicode 4.0), Java SE 7 (to Unicode 6.0), Java SE 8 (to Unicode 6.2), Java SE 9 (to Unicode 8.0), Java SE 11 (to Unicode 10.0), Java SE 12 (to Unicode 11.0), and Java SE 13 (to Unicode 12.1).
Here's a method I use, which should be compatible with all versions of Java >= 1.1. It's future-proofed only up to Unicode 15.0 (scheduled for release in September 2022), but is easily extended by referring to the Unicode "DerivedAge.txt" file (see the URL in the code comments).
As far back as I've tested, it agrees with the table Michał Šrajer compiled, and it correctly determines Java 8 supports Unicode 6.2, Java 9 supports Unicode 8.0, Java 13 supports Unicode 12.1, and Java 16 supports Unicode 13.0.
/**
 * Gets the <a href="https://www.unicode.org/versions/enumeratedversions.html">Unicode
 * version</a> supported by the current Java runtime. The version is returned as an {@code int}
 * storing the major and minor version numbers in low-order octets 1 and 0, respectively.
 * It can be converted to dotted-decimal by code such as {@code (version >> 8) + "." +
 * (version & 0xFF)}, and {@code System.out.printf("Unicode version %d.%d%n", version >>
 * 8, version & 0xFF)}.
 * <p>
 * As of 2022-05-01, the most recent Unicode derived age data stops at version 15.0.0d2.
 * Therefore, if this method returns {@code 0xF00}, the Unicode version is 15.0 <i>or
 * greater</i>. Prior versions are identified unambiguously.
 * <p>
 * This method is compatible with Java versions >= 1.1.
 *
 * @return Unicode version number {@code int}, storing the major and minor versions in,
 * respectively, low-order octets 1 and 0. Thus, version 19.2.5 is {@code 0x1302}
 * (the "update" number, 5, is omitted, because updates cannot add code-points).
 */
public static int getUnicodeVersion() {
    /* Version identification is a descending search for "Character.getType"
       recognition of a new code-point unique to each version. (See
       <https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt>.)
       Major and minor versions ("A.B" in version "A.B.C") are identified,
       but not "update" numbers ("C" in prior example), consistent with
       "Unicode Standard Annex #44, Unicode Character Database", revision
       28 (Unicode 14.0.0), section 5.14, which states:
       "Formally, the Age property is a catalog property whose enumerated
       values correspond to a list of tuples consisting of a major version
       integer and a minor version integer. The major version is a positive
       integer constrained to the range 1..255. The minor version is a non-
       negative integer constrained to the range 0..255. These range limit-
       ations are specified so that implementations can be guaranteed that
       all valid, assigned Age values can be represented in a sequence of
       two unsigned bytes. A third value corresponding to the Unicode update
       version is not required, because new characters are never assigned in
       update versions of the standard."
       Source: <https://www.unicode.org/reports/tr44/#Character_Age>.
    */
    // Preliminary Unicode 15.0 data from
    // <https://www.unicode.org/Public/15.0.0/ucd/DerivedAge-15.0.0d2.txt>.
    if (Character.getType('\u0CF3') != Character.UNASSIGNED)
        return 0xF00; // 15.0, release scheduled for September 2022.
    if (Character.getType('\u061D') != Character.UNASSIGNED)
        return 0xE00; // 14.0, September 2021.
    if (Character.getType('\u08Be') != Character.UNASSIGNED)
        return 0xD00; // 13.0, March 2020.
    if (Character.getType('\u32FF') != Character.UNASSIGNED)
        return 0xC01; // 12.1, May 2019.
    if (Character.getType('\u0C77') != Character.UNASSIGNED)
        return 0xC00; // 12.0, March 2019.
    if (Character.getType('\u0560') != Character.UNASSIGNED)
        return 0xB00; // 11.0, June 2018.
    if (Character.getType('\u0860') != Character.UNASSIGNED)
        return 0xA00; // 10.0, June 2017.
    if (Character.getType('\u08b6') != Character.UNASSIGNED)
        return 0x900; // 9.0, June 2016.
    if (Character.getType('\u08b3') != Character.UNASSIGNED)
        return 0x800; // 8.0, June 2015.
    if (Character.getType('\u037f') != Character.UNASSIGNED)
        return 0x700; // 7.0, June 2014.
    if (Character.getType('\u061c') != Character.UNASSIGNED)
        return 0x603; // 6.3, September 2013.
    if (Character.getType('\u20ba') != Character.UNASSIGNED)
        return 0x602; // 6.2, September 2012.
    if (Character.getType('\u058f') != Character.UNASSIGNED)
        return 0x601; // 6.1, January 2012.
    if (Character.getType('\u0526') != Character.UNASSIGNED)
        return 0x600; // 6.0, October 2010.
    if (Character.getType('\u0524') != Character.UNASSIGNED)
        return 0x502; // 5.2, October 2009.
    if (Character.getType('\u0370') != Character.UNASSIGNED)
        return 0x501; // 5.1, March 2008.
    if (Character.getType('\u0242') != Character.UNASSIGNED)
        return 0x500; // 5.0, July 2006.
    if (Character.getType('\u0237') != Character.UNASSIGNED)
        return 0x401; // 4.1, March 2005.
    if (Character.getType('\u0221') != Character.UNASSIGNED)
        return 0x400; // 4.0, April 2003.
    if (Character.getType('\u0220') != Character.UNASSIGNED)
        return 0x302; // 3.2, March 2002.
    if (Character.getType('\u03f4') != Character.UNASSIGNED)
        return 0x301; // 3.1, March 2001.
    if (Character.getType('\u01f6') != Character.UNASSIGNED)
        return 0x300; // 3.0, September 1999.
    if (Character.getType('\u20ac') != Character.UNASSIGNED)
        return 0x201; // 2.1, May 1998.
    if (Character.getType('\u0591') != Character.UNASSIGNED)
        return 0x200; // 2.0, July 1996.
    if (Character.getType('\u0000') != Character.UNASSIGNED)
        return 0x101; // 1.1, June 1993.
    return 0x100; // 1.0
}
The code for detecting Unicode versions prior to 2.0 will never be reached (given the Java 1.1 or greater requirement), and is present merely for the sake of completeness.
Since the supported Unicode version is defined by the Java version, you could use that information and infer the Unicode version from what System.getProperty("java.version") returns.
I assume you want to support only specific Unicode versions, or at least some minimum. I'm no Unicode expert, but since the versions seem to be backward compatible, you might require the Unicode version to be at least 4.0, which means the Java version would have to be at least 5.0.
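A minimum-version check along those lines might look like the sketch below. It reads java.specification.version rather than java.version, which is a substitution on my part: the specification version has a simpler shape ("1.5", "1.8", "9", "17") and is easier to parse. The class name MinimumJava and both method names are hypothetical.

```java
public class MinimumJava {
    /** Turns a specification version such as "1.5", "1.8", "9" or "17" into its feature number. */
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        return Integer.parseInt(v);
    }

    /** Java 5 was the first release on Unicode 4.0, so anything from 5 upward qualifies. */
    static boolean hasAtLeastUnicode4(String specVersion) {
        return featureVersion(specVersion) >= 5;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Unicode >= 4.0: " + hasAtLeastUnicode4(spec));
    }
}
```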
Related
I'm running into a tricky issue with the character ė (small e with one dot above it). I'm specifically using FPDF to generate PDF files in PHP and it won't support the ė character.
I noticed on Wikipedia the ISO hex for ė is the same as ë. Both are EB.
https://en.wikipedia.org/wiki/Ė
https://en.wikipedia.org/wiki/%C3%8B
Why are ė and ë considered the same character in ISO?
You have a few things mixed up.
ISO is a standards organization, and it publishes many standards. Unicode has a parallel ISO standard (ISO 10646), and there have been other ISO standards for text.
The one you are looking at is ISO 8859, which consists of several parts: https://en.wikipedia.org/wiki/ISO/IEC_8859
It is an 8-bit character encoding, so each part has a very limited character set (256 code values, minus those reserved for control codes). That is why there are many different parts, and you choose the one that best fits your country or language. You might choose Latin-1 for Western European languages, or better Latin-9 (part 15), which includes the "new" euro currency symbol.
In your example, the meaning of the code EB is language specific. In part 13 (Latin-7) it is ė (Baltic), but in parts 1, 2, 3, 4, 9, 10, 14, 15, and 16 it is ë. As you can see, that variant is used in many more languages, so it is available in most of the ISO 8859 parts. The page linked above also has a table showing every variant for each code value.
The main problem now is detecting the original encoding. That can be very hard for people who cannot assess the language, and hence the spelling, of a text. For new text, it is better to use Unicode, which gives every character a single code point (and real text in a legacy encoding rarely happens to form valid Unicode byte patterns, which makes Unicode text easy to recognize).
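The EB ambiguity is easy to reproduce. Here is a small Java sketch (Java only for illustration, since the question is about PHP) that decodes the same byte under two ISO 8859 parts. It assumes the runtime ships the ISO-8859-13 charset, which common JDKs do but which is not among the charsets Java guarantees; the class and method names are made up.

```java
import java.nio.charset.Charset;

public class SameByteDifferentLetter {
    /** Decodes a single byte under the named charset and returns the code point as U+XXXX. */
    static String decodeAs(int byteValue, String charsetName) {
        byte[] bytes = { (byte) byteValue };
        String s = new String(bytes, Charset.forName(charsetName));
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1:  " + decodeAs(0xEB, "ISO-8859-1"));  // e with diaeresis
        System.out.println("ISO-8859-13: " + decodeAs(0xEB, "ISO-8859-13")); // e with dot above
    }
}
```

The byte never changes; only the table used to interpret it does, which is why the encoding has to be known (or guessed) before the text can be read correctly.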
For reference, I'm using Prolog v7.4.2 on Windows 10, 64-bit
Entering the following code in the REPL:
write("\U0001D7F6"). % Mathematical Monospace Digit Zero
gives me this error in the output:
ERROR: Syntax error: Illegal character code
ERROR: write("
ERROR: ** here **
ERROR: \U0001D7F6") .
I know for a fact that U+1D7F6 is a valid Unicode character, so what's up?
SWI-Prolog internally uses C wchar_t to represent Unicode characters. On Windows these are 16 bit and intended to hold UTF-16 encoded strings. SWI-Prolog however uses wchar_t to get nice arrays of code points and thus effectively only supports UCS-2 on Windows (code points u0000..uffff).
On non-Windows systems, wchar_t is usually 32 bits and thus the complete Unicode range is supported.
It is not a trivial thing to fix: handling wchar_t as UTF-16 loses the nice property that each element of the array is exactly one code point, and using our own 32-bit type means we cannot use the C library wide character functions and have to reimplement them in SWI-Prolog. This is not only work; replacing them with pure C versions also loses the optimization typically present in modern C runtime libraries.
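To see why a 16-bit unit cannot hold U+1D7F6, here is a sketch in Java, whose char type is also 16 bits wide, so the same arithmetic applies to a 16-bit wchar_t. It is an illustration only, not SWI-Prolog code; the class name SixteenBitUnits and the method utf16Units are invented.

```java
public class SixteenBitUnits {
    /** Returns the UTF-16 code units for a code point, formatted as hex. */
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x0041));  // fits in one 16-bit unit
        System.out.println(utf16Units(0x1D7F6)); // needs a surrogate pair
    }
}
```

An array of 16-bit elements that is treated as one code point per element therefore has no slot for this character, which matches the "Illegal character code" error on Windows.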
The ISO core standard syntax for char codes looks different. The following works in SICStus Prolog, Jekejeke Prolog, SWI-Prolog, etc., and is thus more portable:
Using SWI-Prolog on a Mac:
Welcome to SWI-Prolog (threaded, 64 bits, version 7.5.8)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
?- set_prolog_flag(double_quotes, codes).
true.
?- X = "\x1D7F6\".
X = [120822].
?- write('\x1D7F6\'), nl.
𝟶
And Jekejeke Prolog on a Mac:
Jekejeke Prolog 2, Runtime Library 1.2.2
(c) 1985-2017, XLOG Technologies GmbH, Switzerland
?- X = "\x1D7F6\".
X = [120822]
?- write('\x1D7F6\'), nl.
𝟶
The underlying syntax is found in the ISO core standard at section 6.4.2.1 hexadecimal escape sequence. It reads as follows and is shorter than the U-syntax:
hex_esc_seq --> "\x" hex_digit { hex_digit } "\".
For comparison, I get:
?- write('\U0001D7F6').
𝟶
What is your environment and what do the flags say?
For example:
$ set | grep LANG
LANG=en_US.UTF-8
and also:
?- current_prolog_flag(encoding, F).
F = utf8.
What is the difference between different text file encoding for my Android project, such as:
UTF-8
UTF-16BE
UTF-16LE
UTF-16
ISO-8859-1
US-ASCII
For example, for displaying Korean, I know I should use UTF-8. But when should I use the others?
For character encodings and the differences between them, see http://en.wikipedia.org/wiki/Character_encoding.
Usually UTF-8 works fine across platforms and for multiple languages. http://en.wikipedia.org/wiki/UTF-8
But Korean versions of Windows also use Unified Hangul Code:
Unified Hangul Code (UHC) extends Wansung Code by adding the missing
8,822 Hangul characters, and is designed for smooth migration to
Unicode Version 2.0. All Wansung code points map directly to the same
UHC code points (but not vice versa). UHC also provides round trip
mapping with Unicode Version 2.0. UHC is used in Korean versions of
Windows 95 and Windows NT.
On Linux there is the iconv command (and libiconv for C programs) for converting between encodings. Run
iconv -l
to list all the encodings that iconv supports.
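One practical way to compare the encodings listed in the question is to encode the same Korean text with each and look at the result. This is a minimal Java sketch using java.nio.charset.StandardCharsets (Java 7+); the class name EncodingSizes and the method encodedLength are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    /** Number of bytes the text occupies in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String hangul = "\uD55C\uAE00"; // two Hangul syllables
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE,
            StandardCharsets.UTF_16, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + encodedLength(hangul, cs) + " bytes");
        }
    }
}
```

UTF-16 comes out two bytes longer than UTF-16BE or UTF-16LE because Java's encoder prepends a byte order mark. ISO-8859-1 and US-ASCII cannot represent Hangul at all: getBytes substitutes one '?' byte per character, so the text is lost, which is why they are only suitable for Western European or plain English text.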
My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.
XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.
Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Question: What is/was a legal Unicode character?
The Unicode Glossary defines it thus:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)
NUL is a codepoint, and it falls under the definition of "abstract character" so it is a character by sense 2 above.
Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?
NUL has been a control character from early versions.
Appendix D contains a list of changes.
It says in Table D-2 that there have been 65 control characters from Version 1 through Version 3 without change.
Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls 65 65 65 65 65
And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.
The use of the word “character” is intentionally fuzzy in the Unicode standard, but mostly it is used in a technical sense: a code point designated as an assigned character code point. This does not completely coincide with the intuitive concept of character. For example, the intuitive character that consists of letter i with macron and grave accent does not exist as a code point; in Unicode, it can only be represented as a sequence of two or three code points. As another example, the so-called control characters are not characters in the intuitive sense.
When other standards and specifications refer to “Unicode characters,” they refer to code points designated as assigned character code points. The set of Unicode characters varies by Unicode standard version, since new code points are assigned. Technically, the UnicodeData.txt file (at ftp://ftp.unicode.org/Public/UNIDATA/) indicates which code points are characters.
U+0000, conventionally denoted by NUL, has been a Unicode character since the beginning.
The XML specifications are inexact in many ways with regard to characters, as you have observed. But the essential definition is the BNF production for “Char” and the statement “XML processors MUST accept any character in the range specified for Char.” This means that in XML specifications, the concept of character is broader than Unicode character. The ranges in the production contain unassigned code points, actually a huge number of them.
The comment to the “Char” production in XML specifications is best ignored. It is very confusing and even incorrect. The “Char” production simply refers to a set of Unicode code points (different sets in different versions of XML). The set includes code points that you should never use in character data, as well as code points that should be avoided for various reasons. But such rules are at a level different from the formal rules of XML and requirements on XML implementations.
When selecting or writing a routine for checking character data, it depends on the application and purpose what should be accepted and what should be done with code points that fail the test. Even surrogate code points might be processed in some way instead of being just discarded; they may well appear due to confusions with encodings (or e.g. when a Java string has been naively taken as a string of Unicode characters – it is as such just a sequence of 16-bit code units).
I would ignore the verbiage and just focus on the definitions:
XML 1.0:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
XML 1.1:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
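The two Char productions translate directly into range checks. A minimal Java sketch follows; the class XmlChars and its method names are invented here and are not taken from JDOM or any other XML library. It checks the Char production only and ignores the "discouraged" ranges.

```java
public class XmlChars {
    /** XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1 Char production: everything except NUL, surrogates, U+FFFE and U+FFFF. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Walks the string by code point, so an unpaired surrogate is seen and rejected. */
    static boolean isValidXml10(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10("tab\there"));   // tab is allowed in XML 1.0
        System.out.println(isValidXml10("bell\u0007"));  // other C0 controls are not
    }
}
```

Iterating by code point rather than by char matters: a well-formed surrogate pair is read as one supplementary code point and accepted, while a lone surrogate comes back as a value in D800..DFFF and fails the check.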
It sounds stupid because it is stupid. The First Edition of XML (1998) read "the legal graphic characters of Unicode." For whatever reason, the word "graphic" was removed from the Second Edition of 2000, perhaps because it is inaccurate: XML allows many characters that are not graphic characters.
The definition in the Char production is indeed the right place to look.
Which widely used programming languages were designed ground-up with Unicode support?
A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?
Java was probably the first popular language to have ground-up Unicode support.
Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.
There were many breaking changes in Python 3, among them the switch to Unicode for all text.
So Python wasn't designed ground-up for Unicode, but Python 3 was.
I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc) unicode aware - but unicode is fully supported in source:
using משליט = System.Object;
using תוצאה = System.Int32;
public class שלום : משליט {
public תוצאה בית() {
int אלף = 0;
for (int λ = 0; λ < 20; λ++) אלף+=λ;
return אלף;
}
}
Google's Go programming language supports Unicode and works with UTF-8.
It really is difficult to design Unicode support for the future, in a programming language right from the beginning.
Java is one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.
Early implementations of the JLS could claim Unicode support, primarily because Unicode itself supported 65536 characters (v1.0 of Java supported Unicode 1.1, and Java v1.4 supported Unicode 3.0), which was compatible with the 16-bit storage space taken up by characters. That changed with Unicode 3.1. It's an evolving standard, usually with more characters getting added in each release. The characters added in 3.1 beyond the 16-bit range are called supplementary characters. Support for supplementary characters was added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
Therefore, don't be surprised if different programming languages implement Unicode support differently.
On the other hand, PHP(!!) and Ruby did not have Unicode support built into them during inception.
PS: Java 7 updates the supported Unicode version again, to Unicode 6.0.
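The effect of supplementary characters on Java strings is visible in a couple of lines. This is a small sketch, assuming Java 5 or later for the code point methods that came with JSR-204; the class name SupplementaryDemo and its helpers are hypothetical.

```java
public class SupplementaryDemo {
    /** Number of 16-bit char units in the string. */
    static int units(String s) {
        return s.length();
    }

    /** Number of Unicode code points, using the API added in Java 5. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D49C lies outside the 16-bit range, so it is stored as a surrogate pair.
        String s = "A" + new String(Character.toChars(0x1D49C));
        System.out.println(units(s) + " chars, " + codePoints(s) + " code points");
    }
}
```

Code that indexes strings with charAt and length still works on such text, but it counts and slices 16-bit units, not characters.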
Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.
The Plan9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
Other languages that support non-ASCII identifiers include current PHP.
Perl 6 has complete unicode support from scratch.
(With the Rakudo Perl 6 compiler being the first implementation)
General overview
Unicode operators
Strings, Regular expressions and grammars all operate based on graphemes, even for those codepoint combination for which there is no composed representation (a composed representation artificial codepoint is generated on the fly for those cases).
A special encoding exists to handle data of unknown encoding "utf8-c8": this assumes utf-8 when possible, but creates artificial codepoints for unencodable sequences, allowing them to roundtrip if necessary.
Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html
A feature that was included in a language when it was first designed is not always the best.
Languages have changed over time and many have become bloated with extra features, while not necessarily keeping the features they first included up to date.
So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.
With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.
Java uses characters from the Unicode character set.
Java and the .NET languages