Unicode NFC quick check and supplemental characters example in UAX#15 Section 9

If I look in UAX#15 Section 9, there is sample code to check for normalization. That code uses the NFC_QC property and checks CCC ordering, as expected. It looks great, except this line puzzles me: if (Character.isSupplementaryCodePoint(ch)) ++i;. It seems to be saying that if a character is supplemental (i.e. >= 0x10000), then I can just assume the next character passes the quick check without bothering to check the NFC_QC property or CCC ordering on it.
Theoretically, I can have, say, a starting code point, followed by a supplemental code point with CCC>0, followed by a third code point that either has CCC>0 but lower than that of the second code point, or has NFC_QC==NO, and it will STILL pass the NFC quick check, even though that would seem to not be in NFC form. There are a bunch of supplemental code points with CCC=7, 9, 216, 220, 230, so it seems like there are a lot of possibilities to hit this case. I guess this can work if we can assume that it will always be the case, throughout future versions of Unicode, that all supplemental characters with CCC>0 will also have NFC_QC==No.
Is this sample code correct? If so, why is this supplemental check valid? Are there cases that would produce incorrect results if that supplemental check were removed?
Here is the code snippet copied directly from that link.
public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO; }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

The sample code is correct,1 but the part that concerns you has little to do with Unicode normalization. No characters in the string are actually skipped; it’s just that Java makes iterating over a string’s characters somewhat awkward.
The extra increment is a workaround for a historical wart in Java (which it happens to share with JavaScript and Windows, two other early adopters of Unicode): Java Strings are arrays of Java chars, but Java chars are not Unicode (abstract) characters or (concrete numeric) code points, they are 16-bit UTF-16 code units. This means that every character with code point C < 1 0000h takes up one position in a Java String, containing C, but every character with code point C ≥ 1 0000h takes two, as specified by UTF-16: the high or leading surrogate D800h + (C − 1 0000h) div 400h and the low or trailing surrogate DC00h + (C − 1 0000h) mod 400h (no Unicode characters are or ever will be assigned code points in the range [D800h, DFFFh], so the two cases are unambiguously distinguishable).
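To make that arithmetic concrete, here is a small sketch (mine, not part of the UAX #15 sample) that computes the surrogate pair for U+1F4AB by hand and checks the result against the library helpers:
// Illustration only: compute the UTF-16 surrogate pair for U+1F4AB by hand
// and compare it with java.lang.Character's helpers.
public class SurrogateArithmetic {
    public static void main(String[] args) {
        int c = 0x1F4AB;                                          // a supplementary code point
        char high = (char) (0xD800 + ((c - 0x10000) >> 10));      // D800h + (C - 1 0000h) div 400h
        char low  = (char) (0xDC00 + ((c - 0x10000) & 0x3FF));    // DC00h + (C - 1 0000h) mod 400h
        System.out.printf("%04X %04X%n", (int) high, (int) low);  // D83D DCAB
        System.out.println(high == Character.highSurrogate(c));   // true
        System.out.println(low == Character.lowSurrogate(c));     // true
    }
}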
Because Unicode normalization operates in terms of a sequence of Unicode characters and cares little for the particulars of UTF-16, the sample code calls String.codePointAt(i) to decode the code point that occupies either position i or the two positions i and i+1 in the provided string, processes it, and uses Character.isSupplementaryCodePoint to figure out whether it should advance one or two positions. The way the loop is written treats the “supplementary” two-unit case like an unwanted stepchild, but that’s the accepted Java way of treating them.
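For comparison, the same traversal can be written with the step made explicit via Character.charCount, which avoids the conditional ++i; this is only an illustrative sketch, not the UAX #15 sample:
// Illustration only: iterate a String by code point, advancing one or two
// char positions depending on whether the code point is supplementary.
public class CodePointLoop {
    public static void main(String[] args) {
        String source = "e\u0301A\uD83D\uDCABz";       // e + combining acute, A, U+1F4AB, z
        for (int i = 0; i < source.length(); ) {
            int ch = source.codePointAt(i);
            System.out.printf("U+%04X at char index %d%n", ch, i);
            i += Character.charCount(ch);              // 1 for BMP, 2 for supplementary
        }
    }
}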
1 Well, correct up to a small spelling error: codepointAt should be codePointAt.

Related

Case-insensitive string comparison

Is there any difference between the following ways of comparing strings without case in Swift?
let equal = str1.lowercased() == str2.lowercased() // or uppercased()
vs:
let equal = str1.caseInsensitiveCompare(str2) == .orderedSame
Is there any case in any language where one returns an incorrect result? I'm more interested in Unicode correctness than performance.
The caseInsensitiveCompare approach can be far more efficient (though I'd be shocked if it were observable in normal everyday usage). And, IMHO, it's more intuitive regarding the intent.
Regarding "Unicode correctness", I guess it depends upon what you mean. For example, comparing "Straße" to "strasse", caseInsensitiveCompare will say they're the same, whereas the lowercased approach will not (though uppercased will).
But if you compare "\u{E9}" to "\u{65}\u{301}" in Swift 4 (see the Unicode correctness discussion in WWDC 2017's What's New in Swift), both approaches correctly recognize that those are é and will say they're the same, even though the two strings have different numbers of Unicode scalars.
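Since the question asks about "any case in any language": the same simple-versus-full case-mapping split shows up in Java, for what it's worth. A minimal sketch (Java rather than Swift, and purely illustrative):
// Illustration only: simple (per-char) vs. full case mapping in Java.
import java.util.Locale;

public class CaseCompareDemo {
    public static void main(String[] args) {
        String a = "Straße", b = "strasse";
        // equalsIgnoreCase uses simple one-to-one mappings (plus a length check), so this prints false.
        System.out.println(a.equalsIgnoreCase(b));
        // String.toUpperCase applies the full mapping ß -> SS, so this prints true.
        System.out.println(a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT)));
        // toLowerCase leaves ß unchanged, so this prints false.
        System.out.println(a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT)));
    }
}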
Both do the same thing: lowercased() or uppercased() won't affect special Unicode characters, so the end result will always match the other string when you compare.
These methods support all types of special characters, including emoji.
The same goes for caseInsensitiveCompare; it too will ignore special characters, symbols, etc.

Null pointer Exception in the case of reading in emoticons

I have a text file that looks like this:
shooting-stars 💫 "are cool"
I have a lexical analyzer that uses FileInputStream to read the characters one at a time, passing those characters to a switch statement that returns the corresponding lexeme.
In this case, 💫 represents assignment so this case passes:
case 'ð' :
    return new Lexeme("ASSIGN");
For some reason, the file reader stops at that point, returning a null pointer even though it has yet to process the string (or whatever is after the 💫). Any time it reads in an emoticon, it does this. If there are no emoticons, it gets to the end of the file. Any ideas?
I suspect the problem is that the character 💫 (Unicode code point U+1F4AB) is outside the range of characters that Java represents internally as single char values. Instead, Java represents characters above U+FFFF as two characters known as surrogate pairs, in this case U+D83D followed by U+DCAB. (See this thread for more info and some links.)
It's hard to know exactly what's going on with the little bit of code that you presented, but my guess is that you are not handling this situation correctly. You will need to adjust your processing logic to deal with your emoticons arriving in two pieces.
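One possible shape for that adjustment, as a sketch only (it assumes the input file is UTF-8; the file name, the println calls, and the dispatch on the code point are placeholders rather than the asker's code):
// Hedged sketch: read a UTF-8 file and work with whole code points,
// recombining surrogate pairs so U+1F4AB arrives as a single int value.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CodePointReaderSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("program.txt"),  // placeholder name
                                                         StandardCharsets.UTF_8)) {
            int unit;
            while ((unit = in.read()) != -1) {
                int cp = unit;
                if (Character.isHighSurrogate((char) unit)) {
                    int next = in.read();                      // trailing half of the surrogate pair
                    if (next != -1 && Character.isLowSurrogate((char) next)) {
                        cp = Character.toCodePoint((char) unit, (char) next);
                    }
                }
                if (cp == 0x1F4AB) {                           // 💫
                    System.out.println("ASSIGN");              // stand-in for new Lexeme("ASSIGN")
                } else {
                    System.out.println("code point U+" + Integer.toHexString(cp).toUpperCase());
                }
            }
        }
    }
}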

How to search for any unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich text entries; these entries have some symbols in them, such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-number symbols are not considered "standard UTF-8" by TexMaker's import function (and in any case probably won't be processed by XeLaTeX or other converters). I can't simply use the UnicodeString functions in DXL because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. below some numeric code value. For example, charOf(8217) should create a right-curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I did copy the curly quote from the DOORS module and verified via an online Unicode analyzer that it was definitely Unicode decimal value 8217.
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with, e.g., \textquoteright in the output stream.
My overall setup works for lower-numbered chars, since this works (c is a single character pulled from a string):
thedeg = charOf(176)
if( thedeg == c )
{
    temp += "$\\degree$"
}
Got some help from DXL coding experts over at the IBM forums.
Quoting the important stuff (there are some useful code snippets there as well):
Hey, you are right it seems intOf(char) and charOf(int) both do some
modulo 256 and therefore cut anything above that off. Try:
int i=8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.

UTF-16 string: how to process characters over U+10000? [duplicate]

This question already has answers here:
How does Java store UTF-16 characters in its 16-bit char type?
As we know, UTF-16 is variable-length when there is a character over U+10000.
However, .NET, Java and Windows WCHAR UTF-16 strings are treated as if they were fixed-length... What happens if I use characters over U+10000?
And if they do handle characters over U+10000, how do they process them? For example, in .NET and Java a char is 16 bits, so one char cannot hold a character over U+10000.
(.NET, Java and Windows are just examples; I'm talking about how to process characters over U+10000 in general. But I think I'd rather know how they actually process them, for my understanding.)
Thanks to #dystroy, I now know how they process them. But there is one problem: if a string uses UTF-16 surrogates, a random-access operation such as str[3] is an O(N) algorithm, because any character can be 4 bytes or 2 bytes! How is this problem treated?
I answered the first part of the question in this QA: basically, some characters simply are spread over more than one Java char.
To answer the second part, related to random access to Unicode code points such as str[3], there is more than one method:
charAt is careless and only handles chars, in a fast and obvious way
codePointAt returns a 32-bit int (but needs a char index)
codePointCount counts code points
And yes, counting the code points is costly and basically O(N). Here's how it's done in Java:
static int codePointCountImpl(char[] a, int offset, int count) {
    int endIndex = offset + count;
    int n = 0;
    for (int i = offset; i < endIndex; ) {
        n++;
        if (isHighSurrogate(a[i++])) {
            if (i < endIndex && isLowSurrogate(a[i])) {
                i++;
            }
        }
    }
    return n;
}
UTF-16 is a bad format to deal with code points, especially if you leave the BMP. Most programs simply don't handle code points, which is the reason this format is usable. Most String operations are fast because they don't deal with code points: all the standard APIs take char indexes as arguments, without worrying about what kind of code points lie behind them.
Usually this problem is not treated at all. Many languages and libraries that use UTF-8 or UTF-16 do substrings or indexing by accessing code units, not code points. That is, str[3] will just return the surrogate code unit in that case. Of course access is constant-time then, but for anything outside the BMP (or ASCII) you have to be careful what you do.
If you're lucky there are methods to access code points, e.g. String.codePointAt in Java. And in that case you have to scan the string from the start to determine code point boundaries.
Generally, even accessing code points doesn't gain you very much, though; it only helps at the library level. Strings are often used eventually to interact with the user, and in that case graphemes or visual string length become more important than code points. And you have even more processing to do in that case.
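To make the char-index versus code-point distinction concrete, here is a small illustrative sketch using U+1F4AB as the supplementary character:
// Illustration only: char-based vs. code-point-based access on a string
// containing a supplementary character ("a" + U+1F4AB + "b": 4 chars, 3 code points).
public class CodePointAccessDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDCABb";
        System.out.println(s.length());                           // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));      // 3 code points
        System.out.println((int) s.charAt(1));                    // 55357 (0xD83D), a lone high surrogate
        int i = s.offsetByCodePoints(0, 2);                       // scan to the 3rd code point: char index 3
        System.out.println(Character.toChars(s.codePointAt(i)));  // b
    }
}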

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, which I then use to calculate the difference between a given text and known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like to have it working with multibyte Unicode support. This is where it's doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a missing key from a map returns the zero value (which for int is 0), which means the unknown-key check in your code is redundant.
func Parse(text string, n int) map[string]int {
    // chars holds the last n runes twice over (a circular buffer mirrored at
    // offset n), so chars[k:k+n] is always a contiguous window of the latest n runes.
    chars := make([]rune, 2*n)
    table := make(map[string]int)
    k := 0
    for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
        chars[n+k] = chars[k]
        k = (k + 1) % n
        table[string(chars[k:k+n])]++
    }
    return table
}