UTF-16 strings: how do you process code points above U+10000? [duplicate]

This question already has answers here:
How does Java store UTF-16 characters in its 16-bit char type?
As we know, UTF-16 is variable-length whenever a character is above U+10000.
However, .NET, Java and Windows WCHAR strings treat UTF-16 as if it were fixed-length... What happens if I use characters above U+10000?
And if these platforms do handle characters above U+10000, how do they process them? For example, in .NET and Java a char is 16 bits, so a single char cannot hold a code point above U+10000.
(.NET, Java and Windows are just examples; I'm really asking how code points above U+10000 are processed in general, but knowing how these platforms do it would help my understanding.)
Thanks to dystroy I now know how they are processed. But one problem remains: if a string contains UTF-16 surrogates, a random-access operation such as str[3] becomes O(N), because any character can be either 2 or 4 bytes! How is this problem handled?

I answered the first part of the question in this Q&A: basically, some characters are simply spread over more than one Java char.
To answer the second part, about random access to Unicode code points such as str[3], there is more than one method (a short example follows the list):
charAt ignores the issue and simply handles chars in a fast and obvious way
codePointAt returns a 32-bit int (but takes a char index)
codePointCount counts code points
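To make the difference concrete, here is a small sketch (this example is mine, not part of the original answer) using a string that contains U+1F600, a code point outside the BMP:

public class CodePointDemo {
    public static void main(String[] args) {
        // "a", then U+1F600 (a surrogate pair in UTF-16), then "b"
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                            // 4 code units, although only 3 characters
        System.out.println(Integer.toHexString(s.charAt(1)));      // d83d: just the high surrogate
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600: the full code point
        System.out.println(s.codePointCount(0, s.length()));       // 3 code points
    }
}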
And yes, counting the code points is costly and basically O(N). Here's how it's done in Java:
static int codePointCountImpl(char[] a, int offset, int count) {
    int endIndex = offset + count;
    int n = 0;
    for (int i = offset; i < endIndex; ) {
        n++;
        if (isHighSurrogate(a[i++])) {
            if (i < endIndex && isLowSurrogate(a[i])) {
                i++;
            }
        }
    }
    return n;
}
UTF-16 is a bad format for dealing with code points, especially if you leave the BMP. Most programs simply don't handle code points, which is the reason the format is usable at all. Most String operations are fast precisely because they don't deal with code points: the standard APIs take char indexes as arguments and don't worry about which code points lie behind them.
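If you really do need the i-th code point rather than the i-th char, the standard API can do it, but only by scanning. A small sketch (the helper name is mine):

static int nthCodePoint(String s, int n) {
    // offsetByCodePoints walks the code units from the start, so this is O(n), not O(1)
    int charIndex = s.offsetByCodePoints(0, n);
    return s.codePointAt(charIndex);
}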

Usually this problem is not treated at all. Many languages and libraries that use UTF-8 or UTF-16 implement substrings and indexing in terms of code units, not code points. That is, str[3] will just return a surrogate code unit in that case. Access is constant-time then, of course, but for anything outside the BMP (or, for UTF-8, outside ASCII) you have to be careful about what you do.
If you're lucky there are methods for accessing code points, e.g. String.codePointAt in Java. In that case you have to scan the string from the start to determine the code point boundaries.
Generally, though, even accessing code points doesn't gain you very much beyond the library level. Strings are often ultimately used to interact with the user, and then graphemes or the visual string length matter more than code points, which means even more processing to do.
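As a sketch of that last point (my own example, using the JDK's BreakIterator, whose grapheme rules vary between Java versions), counting user-perceived characters looks roughly like this:

import java.text.BreakIterator;

static int graphemeCount(String s) {
    // Iterates grapheme-cluster boundaries rather than chars or code points
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(s);
    int count = 0;
    while (it.next() != BreakIterator.DONE) {
        count++;
    }
    return count;
}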

Related

Unicode NFC quick check and supplemental characters example in UAX#15 section 9

If I look in UAX#15 Section 9, there is sample code to check for normalization. That code uses the NFC_QC property and checks CCC ordering, as expected. It looks great, except that this line puzzles me: if (Character.isSupplementaryCodePoint(ch)) ++i;. It seems to be saying that if a character is supplemental (i.e. >= 0x10000), then I can just assume the next character passes the quick check without bothering to check the NFC_QC property or CCC ordering on it.
Theoretically, I could have, say, a starter code point, followed by a supplemental code point with CCC > 0, followed by a third code point whose CCC is > 0 but lower than that of the second one (or whose NFC_QC is No), and it would STILL pass the NFC quick check, even though that would seem not to be in NFC form. There are a bunch of supplemental code points with CCC = 7, 9, 216, 220 or 230, so it seems like there are a lot of possibilities to hit this case. I guess this can work if we can assume that it will always be the case, throughout future versions of Unicode, that all supplemental characters with CCC > 0 also have NFC_QC == No.
Is this sample code correct? If so, why is this supplemental check valid? Are there cases that would produce incorrect results if that supplemental check were removed?
Here is the code snippet copied directly from that link.
public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}
The sample code is correct,1 but the part that concerns you has little to do with Unicode normalization. No characters in the string are actually skipped, it’s just that Java makes iterating over a string’s characters somewhat awkward.
The extra increment is a workaround for a historical wart in Java (which it happens to share with JavaScript and Windows, two other early adopters of Unicode): Java Strings are arrays of Java chars, but Java chars are not Unicode (abstract) characters or (concrete numeric) code points, they are 16-bit UTF-16 code units. This means that every character with code point C < 1 0000h takes up one position in a Java String, containing C, but every character with code point C ≥ 1 0000h takes two, as specified by UTF-16: the high or leading surrogate D800h + (C − 1 0000h) div 400h and the low or trailing surrogate DC00h + (C − 1 0000h) mod 400h (no Unicode characters are or ever will be assigned code points in the range [D800h, DFFFh], so the two cases are unambiguously distinguishable).
Because Unicode normalization operates in terms of a sequence of Unicode characters and cares little for the particulars of UTF-16, the sample code calls String.codePointAt(i) to decode the code point that occupies either position i or the two positions i and i+1 in the provided string, processes it, and uses Character.isSupplementaryCodePoint to figure out whether it should advance one or two positions. The way the loop is written treats the “supplementary” two-unit case like an unwanted stepchild, but that’s the accepted Java way of treating them.
1 Well, correct up to a small spelling error: codepointAt should be codePointAt.
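To illustrate the surrogate arithmetic described above (this snippet is mine, not part of UAX#15 or the answer), here is U+1D11E encoded by hand and checked against the library:

int c = 0x1D11E; // MUSICAL SYMBOL G CLEF, a supplementary code point
char high = (char) (0xD800 + ((c - 0x10000) >> 10));   // leading surrogate: D834
char low  = (char) (0xDC00 + ((c - 0x10000) & 0x3FF)); // trailing surrogate: DD1E
assert high == Character.highSurrogate(c);
assert low == Character.lowSurrogate(c);
String s = new String(new char[] { high, low });        // length() == 2, but only one code point
assert s.codePointAt(0) == c;
assert Character.isSupplementaryCodePoint(s.codePointAt(0));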

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, from which I then calculate the difference between a given text and known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few multi-byte characters in it. Due to the way I am building the ngram sequence, and because these specific runes are encoded as two bytes, two ngrams appear in which the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a missing key from a map returns the zero value (which for int is 0), so the unknown-key check in your code is redundant.
// Parse builds the ngram table over runes rather than bytes. It keeps a
// circular buffer of the last n runes; the second half of chars mirrors the
// first so that the current window is always a contiguous slice.
// (Requires: import "strings".)
func Parse(text string, n int) map[string]int {
	chars := make([]rune, 2*n)
	table := make(map[string]int)
	k := 0
	for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
		chars[n+k] = chars[k]
		k = (k + 1) % n
		table[string(chars[k:k+n])]++
	}
	return table
}

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'.
The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case insensitive, but able to handle non-ascii letters.
Obviously, the hash also should scatter well.
There are some ideas that I had thought of:
A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.
B) Just using a regular string hash: hash = prime * hash + codepoint over the whole string.
I have the feeling that the fact that the path consists of "segments" (the folder names and the final file name) could be leveraged.
To sum up the needs:
1) 64bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:
ui64 res = 10000019;
for (i = 0; i < len; i += 2)
{
    ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
    res = res * 8191 + merge; // unchecked arithmetic
}
return res;
I'm assuming that reading path[i + 1] is safe because, if len is odd, the last iteration will just pick up the terminating U+0000.
I wouldn't try to exploit the gaps in the value space caused by UTF-16 surrogates, by lower-case and title-case characters, or by characters that are invalid in paths, because these values are not distributed in a way that can be exploited quickly. Subtracting 32 (all code units below U+0020 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
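For what it's worth, a direct Java rendition of the pseudocode above might look like this (the method name is mine; the constants are the ones from the pseudocode, and Character.toUpperCase handles non-ASCII letters):

static long pathHash(char[] path, int len) {
    long res = 10000019L;
    for (int i = 0; i < len; i += 2) {
        // If len is odd, the last pair falls back to U+0000 instead of reading past the array
        char second = (i + 1 < len) ? path[i + 1] : '\u0000';
        long merge = (long) Character.toUpperCase(path[i]) * 65536
                   + Character.toUpperCase(second);
        res = res * 8191 + merge; // overflow wraps, which is the intended "unchecked arithmetic"
    }
    return res;
}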
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use part of such a hash, e.g. MD5 computed over your path after converting it to lower case so that the hash is effectively case-insensitive (this requires a lower-casing method that knows how to convert all the non-ASCII UTF-16 characters that may occur in the file system).
Cryptographically secure hashes have the benefit of a quite even distribution no matter which part of the output you take, because they are designed to be non-predictable, i.e. each part of the hash ideally depends on the entire hashed data as much as any other part does.
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it takes a rather specific situation for that to make a measurable difference.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in many places, including in RFC 1320 itself. You may also find open-source MD4 implementations in C and Java in sphlib.
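As a sketch of the truncation approach (my own example; it uses MD5, which ships with the standard JDK, instead of MD4, and the simplistic lower-casing discussed above rather than real Windows case folding):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

static long pathHash64(String path) throws NoSuchAlgorithmException {
    // Simplistic case folding; Windows' own case-insensitivity is more involved
    byte[] bytes = path.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_16LE);
    byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
    // Keep the first 64 bits of the 128-bit digest
    return ByteBuffer.wrap(digest).getLong();
}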
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file, then call .GetHashCode() on the path, like this:
Hash = fullPath.GetHashCode();
or
int getHashCode(string uri)
{
    if (uri == null) throw new ArgumentNullException(nameof(uri));
    FileInfo fileInfo = new FileInfo(uri);
    return fileInfo.FullName.GetHashCode();
}
Although this is only a 32-bit code, you can duplicate it or append another hash code based on some other characteristic of the file.

How do I remove the warning "large integer implicitly truncated" for sqlite/unicode support?

I use the solution from http://ioannis.mpsounds.net/2007/12/19/sqlite-native-unicode-like-support/ for my POS app for the iPhone, and it works great.
However, as said in the comments:
For instance, sqlite_unicode.c line 1861 contains integral constants greater than 0xffff but are declared as unsigned short. I wonder how I should cope with that.
I'm fixing all the warnings in my project and this is the last one. The code is this:
static unsigned short unicode_unacc_data198[] = { 0x8B8A, 0x8D08, 0x8F38, 0x9072, 0x9199, 0x9276, 0x967C, 0x96E3, 0x9756, 0x97DB, 0x97FF, 0x980B, 0x983B, 0x9B12, 0x9F9C, 0x2284A, 0x22844, 0x233D5, 0x3B9D, 0x4018, 0x4039, 0x25249, 0x25CD0, 0x27ED3, 0x9F43, 0x9F8E, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF };
I don't know about this hex stuff, so what should I do? I don't get an error, but I don't know whether this could cause one in the future...
Yes, 0x2284A is indeed larger than 0xFFFF, which is the largest a 16-bit unsigned integer can contain.(*)
This is a lookup table for mapping characters with diacritical marks to basic unaccented characters. For some reason, a few mappings are defined that point to characters outside the ‘Basic Multilingual Plane’ of Unicode characters that fit in 16 bits.
U+2284A and the others above are highly obscure extended Chinese characters. I'm not sure why a character in the BMP would refer to such a character as its base unaccented version. Maybe it's an error in the source data used to generate the tables, or maybe it's just another weird quirk of the Chinese writing system. Either way, it's extremely unlikely you'll ever need that mapping. So just change all the five-digit hex codes in this array to be 0xFFFF instead (which seems to be what the code is using to signify ‘no mapping’).
(*: in theory a short could be more than 16 bits, but in reality it isn't going to be. If it were it looks like this code would totally fall over anyway, as it's freely mixing short with u16 pointers.)

What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.
The types are given on the website as:
char;  // unsigned 8 bit UTF-8
wchar; // unsigned 16 bit UTF-16
dchar; // unsigned 32 bit UTF-32
Since we know that most of the Unicode Transformation (UTF) Format encodings represent characters with a variable bit-width, does this mean that a char in D can only contain the values that will fit in 8 bits, or does it expand in the machine's physical memory when you give it double byte characters? Perhaps there is some other possibility, like automatic casting into the next most appropriate type as you overload the variable?
Let's say, for example, I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or can it deal with Unicode characters more 'correctly', like C# does? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?
I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.
A single char or wchar represents a UTF code unit. This means that, on its own, a char can either hold an ASCII symbol (0-127) or be part of a UTF-8 sequence representing a Unicode character (code point). Only the dchar type can hold any entire Unicode character, because there are more than 65536 code points in Unicode.
Casting between the string types (string, wstring and dstring, which are simply dynamic arrays of the character types) will not automatically convert their contents to the corresponding UTF representation. To do that, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.
Further reading:
the String handling section in Wikipedia's entry on D
Text in D, by Daniel Keep