The implementations of the "Unicode Line Breaking Algorithm" (UAX#14) - unicode

I've been searching around to find a latest implementation of the Unicode Line Breaking Algorithm (UAX#14) to port it for my needs.
I've found an old, but seemingly normative sample implementation of the algorithm, in which a "Pair Table-Based Implementation" is utilized. The implementation cited the corresponding section of the old document that is deleted starting from Unicode 10.0. So,
Why the "§7: Pair Table-Based Implementation" is deleted and what are the alternatives?
Are there any public and full-implementations to the latest version of the algorithm, or at-least to the last version of the algorithm in which the "Pair Table-Based Implementation" was still around (in Unicode 9.0)?

The latest official line breakers implementations for #14 and #29 are included in the Rust crate icu_segmenter maintained by the Unicode organization. See the API docs at https://unicode-org.github.io/icu4x-docs/doc/icu_segmenter/

Here is one in JavaScript: linebreak

Related

Have there been any "breaking changes" in the history of Unicode?

As Unicode versions progress, has there ever been a breaking change? For example, has it ever occurred that a symbol's code point has been re-mapped, be it so that this symbol appears together with the ones it relates to (think of a character set for a language that at some point gained a new letter)?
May Unicode "change" these things at all, or is there a guarantee that these mappings are constant forever?
If there weren't any code point re-mappings, have there been other breaking changes?
Yes, there have been many breaking changes. One interesting story related by Michael Kaplan (late Microsoft internationalization expert) in an archived version of "Every character has a story #5", quotes Ken Whistler:
Hundreds -- maybe thousands -- of Unicode 1.0 character names were changed in 1993 for Unicode 1.1 as part of the merger between the repertoires of Unicode and ISO/IEC 10646-1:1993. (The Great Compromise) The gory details of all the changes can be found in UTR #4, The Unicode Standard, Version 1.1. It was after that point (which was very painful for some people) that we put in place the never change a character name rule.
In another post ("Stability of the Unicode Character Database", archived), Kaplan quotes a discussion of changes to character categories, with this quote also by Ken Whistler:
The significant point of instability in General Category assignments was in establishing Unicode 2.0 data files (now more than 8 years in the past).
There was a significant hiccup for Unicode 3.0, at the point when it became clear that normalization stability was going to be a major issue, and when the data was culled for consistency under canonical and compatibility equivalence.
Since that time, the UTC has been very conservative, indeed, in approving any General Category change for an existing character. The types of changes have been limited to:
Clarification regarding obscure characters for which insufficient information was available earlier.
Establishment of further data consistency constraints (this impacted some numeric categories, and also explains the change for the Katakana middle dot)
Implementation issues with a few format characters (ZWSP, Arabic end of ayah, Mongolian free variation selectors)
There were many changes early in Unicode, and fewer "breaking changes" as time went on. Unicode has an official Stability Policy, describing what changes are no longer allowed, and the "applicable version" at which they instituted each policy (and thus finished making that kind of change). I have to expect that each of those policies may tell a story of changes being made, causing all sorts of trouble for people who were relying on the prior behavior and needing to update existing data in some way, or at least people knew that more pain would come in the future if they didn't fix those particular aspects of Unicode at that time. It really makes some amount of sense that as Unicode got adopted problems with particular aspects were found and fixed, and now that Unicode is ubiquitous there is less need to make breaking changes and more need to keep compatibility with the existing data that's out there.
To answer your specific question about code point placement, let me quote the Encoding Stability Policy:
Encoding Stability
Applicable Version: Unicode 2.0+
Once a character is encoded, it will not be moved or removed.
This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.
Note: Ordering of characters is handled via collation, not by moving characters to different code points. For more information, see Unicode Technical Standard #10, Unicode Collation Algorithm, and the Unicode FAQ.
In general, I expect that you can rely on the Unicode Consortium to keep the promises that they've now made in their Stability Policy, though you may need to be aware of the changes made before each policy existed if you have data that predates the adoption of that applicable version of Unicode by the software that created it. And data not explicitly called out as now being "stable" can of course be changed in any future version.

Unicode case folding to upper case

I'm trying to implement a library for reading Microsoft CFB (Compound File Binary) Format files, according to the official specification of that format. The specification is available from this site.
In a nutshell - some of the structures of the file are stored in a red-black tree. I've got a problem with the comparison predicate used for storing these structures in that tree. The specification says that, if the names (the strings are stored as UTF-16, the standard in Windows APIs) of these structures are different, it is necessary to iterate through every UTF-16 code point and :
(...) convert to upper-case with the Unicode Default Case Conversion
Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
The <2> reference says that :
or Windows XP and Windows Server 2003: The compound file implementation
conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding
(http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
However, when I looked up the referenced case folding file, and read the UTR #21 "Case Mapping" referenced there, I realized that the case folding is defined as an operation that bears much more resemblance to lower-casing, rather than upper-casing.
By using CaseFolding-4.txt, we can obtain the case folding mappings of upper-case letters to lower-case ones. The mapping is always 1-to-1, since full foldings (those that expand to multiple characters) aren't needed here. However, the reverse mapping of lower-case letters to upper-case ones isn't straightforward anymore. For example,
0392; C; 03B2; # GREEK CAPITAL LETTER BETA
03D0; C; 03B2; # GREEK BETA SYMBOL
Thus, we have no way of knowing whether 03B2 should be converted to 0392 or 03D0. Does the standard define something like folding to upper-case? Maybe I should use case folding, and then convert to upper-case? Or have I understood the specification completely wrong?
Summary: The wording used by Microsoft is...confusing to say the least. It appears that simple upper case mapping should be done, though I can't be certain.
Background
Part of the confusion might be the difference between case folding and case mapping. Case mapping maps every character to a designated case. Case folding, while it is based on lower-casing, is defined to result in case-less characters (UTR #21 §1.3).
Now there are two variants of case mapping and case folding, simple and full. Unlike the simple transformation, The full one can change string length, and as you rightly point out is not needed here. The specification specifically mentions simple, and is probably the only clear thing in this answer. I do feel the need to mention for future reference that the the current Unicode Standard (6.3.0) mentions that the default case transformation is the full one, though the version Microsoft references (3.1.1) does not appear to make this distinction.
Spec Analysis
(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
To me this quote appears to suggest they want upper case, and simply made an error by saying case folding instead of case mapping. But then comes that reference you quoted:
For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
They actually mention the case folding data file! At this point, I'm not sure what to think. My main line of thought is that Microsoft wants case folding though erroneously thought that it was based on upper casing rather than lower casing. This is even a stretch though, but its the closest I've been able to come to reconciling this possible contradiction, and I hope there's a better explanation.
I've found in section 2.6.1 the following which supports some form of upper-casing:
[...] the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.
Note that they do in fact use the term mapping here.
The exception list
Taking a look at the exception list for the mentioned Windows XP and Windows Server 2003, most entries are subtractions, suggesting code points Microsoft wants to keep distinct. However, in the table, the code points are actually listed in reverse order to the Unicode case folding data file.
One interpretation of this is that it's just a display quirk. This idea is shot down by the last row where they subtract the case transformation 0x03C2 -> 0x03C2. This transformation does not exist in the data file since the transformation 0x03C2 -> 0x03C3 does (an unlisted case transformation is considered to transform to itself).
Another interpretation is that they do in fact erroneously believe that its the reverse mapping that's the correct one. As you mentioned though, this runs into trouble, as the reverse mapping is not always straightforward. Otherwise, this interpretation would be fine.
A third interpretation is to consider their reference to the Unicode case folding data file wrong. This of course makes me feel uneasy, but if they actually did mean case mapping originally, they might have just provided the link as a quick reference point. The exception list they mention does have column headers such as "Lowercase UTF-16 code point", but we know that case folding is in fact case-less.
As an aside, I did look at the exception list for the later operating systems, hoping to gain some more insight. I found more confusion. In particular the addition of 0x03C3 -> 0x03A3 troubles me. Since the exception list and the Unicode file list their code points in the opposite order, it appears that the transformation is already in the data file and doesn't need to be added. This part of the specification does not want to be understood!
Conclusion
If you've read all of the above, you'll probably guess that this conclusion is going to be less than ideal. Clearly at one or more points, the specification is in error, but it's hard to tell where. Really there are three possibilities depending on your interpretation as to what kind of case transformation needs to be done.
Simple upper case mapping
Simple case folding, followed by simple upper case mapping
Simple case folding
To me it seems like Microsoft does in fact want upper casing. From there I believe that the case folding reference is an error, and as such my guess is they just want simple upper case mapping.
I highly doubt it's the last simple case folding only option though. Both of the other options would give very similar results with only a small amount of code points possibly giving different results.
It seems like the only way to know for sure would be to either contact Microsoft, or painstakingly look at binaries to see which method is followed.
In 3.13 Default Case Algorithms (p. 115) of The Unicode Standard
Version 6.2 – Core Specification the text refers to UnicodeData.txt. This contains:
03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392
03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;GREEK SMALL LETTER CURLED BETA;;0392;;0392
which indicates that the Greek small letter Beta should indeed map to the Greek Beta symbol, and as an aside indicates that the two symbols have some level of compatibility. It also contains the remainder of the bidirectional case conversion you are looking for. You may also need to look at SpecialCasing.txt for boundary cases.

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and very much like to know from your expertise whether I've understood these concepts correct. So far here is the dumbed down version(for my understanding) of what I've gathered from web:
Character Encodings -> Set of rules that tell the OS how to store characters. Eg., ISO8859-1,MSWIN1252,UTF-8,UCS-2,UTF-16. These rules are also called Code Pages/Character Sets which maps individual characters to numbers. Apparently unicode handles this a bit differently than others. ie., instead of a direct mapping from a number(code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.[ http://www.joelonsoftware.com/articles/Unicode.html ]
Fonts -> These are implementation of character encodings. They are files of different formats (True Type,Open Type,Post Script) that contain mapping for each character in an encoding to number.
Glyphs -> These are visual representation of characters stored in the font files.
And based on the above understanding I have the below questions,
1)For the OS to understand an encoding, should it be installed separately?. Or installing a font that supports an encoding would suffice?. Is it okay to use the analogy of a protocol say TCP used in a network to an encoding as it is just a set of rules. (which ofcourse begs the question, how does the OS understands these network protocols when I do not install them :-p)
2)Will a font always have the complete implementation of a code page or just part of it?. Is there a tool that I can use to see each character in a font(.TTF file?)[Windows font viewer shows how a style of the font looks like but doesn't give information regarding the list of characters in the font file]
3)Does a font file support multiple encodings?. Is there a way to know which encoding(s) a font supports?
I apologize for asking too many questions, but I had these in my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with book that will save you great deal of time (you won't become i18n expert overnight, you know):
Developing International Software - it might be 8 years old but this is still worth every cent you're going to spend on it. Maybe the programming examples regard to Windows (C++ and .Net) but the i18n and L10n knowledge is really there. A colleague of mine said once that it saved him about 2 years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up to date resources on that topic (that is publicly available ones). The only current resource I know is Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably need also something related to i18n in js. Unfortunately, the support is rather poor but there are few libraries which help dealing with the problem. The most notable examples would be Dojo Toolkit and Globalize.
The prior is a bit heavy, although supports many aspects of i18n, the latter is lightweight but unfortunately many stuff is missing. If you choose to use Globalize, you might be interested in the latest Jukka's book:
Going Global with JavaScript & Globalize.js - I read this and as far I can tell, it is great. It doesn't cover the topics you were originally asking for but it is still worth reading, even for hands-on examples of how to use Globalize.
Apparently unicode handles this a bit differently than others. ie.,
instead of a direct mapping from a number(code point) to a glyph, it
maps the code point to an abstract "character" which might be
represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
1) For the OS to understand an encoding, should it be installed
separately?.
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
2)Will a font always have the complete implementation of a code page or just part of it?.
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes to lists of characters. For example, in C# reading a UTF-8 string will automatically make it UTF-16 if you tell it to treat it as a string.
No matter what encoding you use, C# will simply use UTF-16 internally and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and shows the glyphs.
I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine.
Bonus answer: On "how does the OS understand network protocols it doesn't know?": again it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (which application) it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Which programming languages were designed with Unicode support from the beginning?

Which widely used programming languages were designed ground-up with Unicode support?
A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?
Java was probably the first popular language to have ground-up Unicode support.
Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.
There were many breaking changes in Python 3, among them the switch to Unicode for all text.
So Python wasn't designed ground-up for Unicode, but Python 3 was.
I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc) unicode aware - but unicode is fully supported in source:
using משליט = System.Object;
using תוצאה = System.Int32;
public class שלום : משליט {
public תוצאה בית() {
int אלף = 0;
for (int λ = 0; λ < 20; λ++) אלף+=λ;
return אלף;
}
}
Google's Go programming language supports Unicode and works with UTF-8.
It really is difficult to design Unicode support for the future, in a programming language right from the beginning.
Java is one one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to, when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.
Early implementations of the JLS could claim Unicode support, primarily because Unicode itself supported 65536 characters (v1.0 of Java supported Unicode 1.1, and Java v1.4 supported Unicode 3.0) which was compatible with the 16-bit storage space taken up by characters. That changed with Unicode 3.1 - its an evolving standard, usually with more characters getting added in each release. The characters added later in 3.1 were called supplementary characters. Support for supplementary characters were added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
Therefore, don't be surprised if different programming languages implement Unicode support differently.
On the other hand, PHP(!!) and Ruby did not have Unicode support built into them during inception.
PS: Support for v5.1 of Unicode is to be made in Java 7.
Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.
The Plan9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
Other languages that support non-ASCII identifiers include current PHP.
Perl 6 has complete unicode support from scratch.
(With the Rakudo Perl 6 compiler being the first implementation)
General overview
Unicode operators
Strings, Regular expressions and grammars all operate based on graphemes, even for those codepoint combination for which there is no composed representation (a composed representation artificial codepoint is generated on the fly for those cases).
A special encoding exists to handle data of unknown encoding "utf8-c8": this assumes utf-8 when possible, but creates artificial codepoints for unencodable sequences, allowing them to roundtrip if necessary.
Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html
Sometimes, a feature that was included in a language when it was first designed is not always the best.
Languages have changed over time and many have become bloated with extra features, while not necessarily keeping up-to-date with the features it first included.
So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.
With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.
Java uses characters from the Unicode character set.
java and .net languages

Best Resource for Character Encodings

I'm searching for a document (not printed) that explains in details, but still simply, the subject of character encoding.
A great overview from the Programmer's perspective is:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html
Have you tried Wikipedia's Character encoding page and it's links ?
This perhaps?
http://www.unicode.org/versions/Unicode5.1.0/
See section 2 onwards of this document http://ahds.ac.uk/creating/guides/linguistic-corpora/chapter4.htm, it has an interesting history of character encoding methods.
Wikipedia is actually as good a source as any to begin with:-
http://en.wikipedia.org/wiki/Character_encoding>Character-Encodings. As well as the more familiar ASCII, UTF-8 etc. they have good information on older schemes like fieldata and the various incarnations of EBCDIC.
For in depth info on utf-8 and unicode you cannot do any better than:-
http://www.unicode.org>Unicode.org
Various manufacturs sites such as Microsoft and IBM have lots of code page info but it tends to relate to thier own hardware/software products.
There is a French book about this called Fontes et codages by Yannis Haralambous, an O'Reilly book, I'm pretty sure it is or will be translated. Indeed, it is:
Fonts and Encodings.
A short explanation of the basic concepts: http://www.mihai-nita.net/article.php?artID=20060806a
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is a spirituall successor to the Article on joelonsoftware.com (linked to by lkessler).
It is just as good an introduction but is a bit better on the technical details.