How does the handling of combining characters in the Unicode Collation Algorithm work?

I maintain an open-source, pure-Python implementation of the Unicode Collation Algorithm called pyuca.
While it meets my needs in sorting Ancient Greek text (and seems to meet the needs of many other people), I'm looking to improve its coverage of rarer cases by getting it to the point where it passes the entire suite of official conformance tests.
However, 1,869 of the tests (just over 1%) fail. The first failure is at 0332 0334, which the test files suggest should get the sort key | 004A 0021 | 0002 0002 |.
pyuca, however, forms the sort key | 0021 004A | 0002 0002 |.
At first I thought this might be due to lack of support for non-starter characters (S2.1.1 through S2.1.3 of the algorithm in the latest spec). However, my subsequent implementation of this part did nothing to change the sort key, and manually working through the algorithm on paper also fails to trigger that section, which has me wondering if I'm just missing something.
The relevant steps in the algorithm are:
S2.1.1 If there are any non-starters following S, process each non-starter C.
S2.1.2 If C is not blocked from S, find if S + C has a match in the table.
S2.1.3 If there is a match, replace S by S + C, and remove C.
The key phrase is "If there is a match". In the failing test mentioned above, there is no match for 0332 0334, so this part of the algorithm cannot explain why the sort key should be in a different order from the one my implementation produces.
Can anyone explain what part of the UCA would form a sort key like the test file suggests?

Does it work better if you shove the string into Normalization Form D first? (Step 1.)
This is an utter wild guess based on the fact that 0332 0334 is not in NFD. I haven't tried to work through the algorithm at all.
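A quick check with Python's unicodedata module backs the guess up: U+0334 has a lower canonical combining class than U+0332, so the canonical ordering step of NFD swaps the two marks (and with them, the two collation elements):
import unicodedata

print(unicodedata.combining("\u0332"))   # 220 -- COMBINING LOW LINE
print(unicodedata.combining("\u0334"))   # 1   -- COMBINING TILDE OVERLAY
nfd = unicodedata.normalize("NFD", "\u0332\u0334")
print([hex(ord(c)) for c in nfd])        # ['0x334', '0x332'] -- order swapped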

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me, so I'm really hoping for help from this community.
What I've tried (piecemeal, but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), filtering each resulting word on whether its first character is uppercase (e.g. with Python's str.isupper), and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is kinda impossible, and using isupper also works with non-Latin letters (e.g. when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering against a dictionary is a much more straightforward approach and much easier to maintain (I would recommend using a set or another data type suited for fast lookups).
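Putting those pieces together, a minimal Python sketch might look like this (the PROPER_NOUNS set and the {parameter}-stripping regex are assumptions based on your examples, not a fixed recipe):
import re

# Hypothetical whitelist -- replace with your own proper nouns.
PROPER_NOUNS = {"TikTok"}

def flag_title_case(text):
    text = re.sub(r"\{[^}]*\}", " ", text)            # disregard {parameters}
    words = [w for w in re.split(r"[\s\W]+", text) if w]
    if len(words) < 2:                                # need more than one word
        return False
    # capitalized words after the first, excluding whitelisted proper nouns
    return any(w[0].isupper() and w not in PROPER_NOUNS for w in words[1:])

print(flag_title_case("Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))  # True
print(flag_title_case("Don't post that video on TikTok."))  # False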
Maybe if you can define your use case more clearly we can work out a pure regex solution...

Why doesn't ICU4J match UTF-8 sort order?

I am having a hard time understanding unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1 I get a return value of -1 indicating that _ comes before #.
However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
Now, the page you linked to shows everything in numerical Code Point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific circumstance, other circumstances would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see if there are any locale-specific overrides.
Converting Mark Ransom's comments into an answer:
The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
In particular, _ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the collation numbers for _ are lower than the numbers for #.
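For contrast, the binary (code point) order that the linked chart shows is easy to check directly, e.g. in Python, where plain string comparison is by code point:
print(hex(ord("#")), hex(ord("_")))   # 0x23 0x5f -- '#' precedes '_' by code point
print(sorted(["_", "#"]))             # ['#', '_'] -- binary order
# The DUCET primary weights quoted above (020B for '_', 0391 for '#') reverse
# this, which is exactly why the collator returns -1 for compare("_", "#").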

Unicode case folding to upper case

I'm trying to implement a library for reading Microsoft CFB (Compound File Binary) Format files, according to the official specification of that format. The specification is available from this site.
In a nutshell: some of the structures of the file are stored in a red-black tree. I've got a problem with the comparison predicate used for storing these structures in that tree. The specification says that, if the names (the strings are stored as UTF-16, the standard in Windows APIs) of these structures are different, it is necessary to iterate through every UTF-16 code point and:
(...) convert to upper-case with the Unicode Default Case Conversion
Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
The <2> reference says that :
For Windows XP and Windows Server 2003: The compound file implementation
conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding
(http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
However, when I looked up the referenced case folding file and read the UTR #21 "Case Mapping" referenced there, I realized that case folding is defined as an operation that bears much more resemblance to lower-casing than to upper-casing.
By using CaseFolding-4.txt, we can obtain the case folding mappings of upper-case letters to lower-case ones. The mapping is always 1-to-1, since full foldings (those that expand to multiple characters) aren't needed here. However, the reverse mapping of lower-case letters to upper-case ones isn't straightforward anymore. For example,
0392; C; 03B2; # GREEK CAPITAL LETTER BETA
03D0; C; 03B2; # GREEK BETA SYMBOL
Thus, we have no way of knowing whether 03B2 should be converted to 0392 or 03D0. Does the standard define something like folding to upper-case? Maybe I should use case folding, and then convert to upper-case? Or have I understood the specification completely wrong?
Summary: The wording used by Microsoft is...confusing to say the least. It appears that simple upper case mapping should be done, though I can't be certain.
Background
Part of the confusion might be the difference between case folding and case mapping. Case mapping maps every character to a designated case. Case folding, while it is based on lower-casing, is defined to result in case-less characters (UTR #21 §1.3).
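Python happens to expose both operations, which makes the distinction easy to see; note that str.casefold implements the full folding, though for the beta characters in question it coincides with the simple one:
print("\u0392".casefold())   # 'β' (03B2) -- GREEK CAPITAL LETTER BETA folds down
print("\u03d0".casefold())   # 'β' (03B2) -- GREEK BETA SYMBOL folds to the same thing
print("\u03b2".upper())      # 'Β' (0392) -- case mapping picks one designated uppercase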
Now there are two variants of case mapping and case folding: simple and full. Unlike the simple transformation, the full one can change string length, and as you rightly point out it is not needed here. The specification specifically says simple, which is probably the only clear thing in this answer. I do feel the need to mention for future reference that the current Unicode Standard (6.3.0) says the default case transformation is the full one, though the version Microsoft references (3.1.1) does not appear to make this distinction.
Spec Analysis
(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
To me this quote appears to suggest they want upper case, and simply made an error by saying case folding instead of case mapping. But then comes that reference you quoted:
For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
They actually mention the case folding data file! At this point, I'm not sure what to think. My main line of thought is that Microsoft wants case folding, though they erroneously thought it was based on upper-casing rather than lower-casing. Even this is a stretch, but it's the closest I've been able to come to reconciling this possible contradiction, and I hope there's a better explanation.
I've found in section 2.6.1 the following which supports some form of upper-casing:
[...] the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.
Note that they do in fact use the term mapping here.
The exception list
Taking a look at the exception list for the mentioned Windows XP and Windows Server 2003, most entries are subtractions, suggesting code points Microsoft wants to keep distinct. However, in the table, the code points are actually listed in reverse order to the Unicode case folding data file.
One interpretation of this is that it's just a display quirk. This idea is shot down by the last row where they subtract the case transformation 0x03C2 -> 0x03C2. This transformation does not exist in the data file since the transformation 0x03C2 -> 0x03C3 does (an unlisted case transformation is considered to transform to itself).
Another interpretation is that they do in fact erroneously believe that it's the reverse mapping that's the correct one. As you mentioned, though, this runs into trouble, since the reverse mapping is not always straightforward. Otherwise, this interpretation would be fine.
A third interpretation is to consider their reference to the Unicode case folding data file wrong. This of course makes me feel uneasy, but if they actually did mean case mapping originally, they might have just provided the link as a quick reference point. The exception list they mention does have column headers such as "Lowercase UTF-16 code point", but we know that case folding is in fact case-less.
As an aside, I did look at the exception list for the later operating systems, hoping to gain some more insight. I found more confusion. In particular the addition of 0x03C3 -> 0x03A3 troubles me. Since the exception list and the Unicode file list their code points in the opposite order, it appears that the transformation is already in the data file and doesn't need to be added. This part of the specification does not want to be understood!
Conclusion
If you've read all of the above, you'll probably guess that this conclusion is going to be less than ideal. Clearly at one or more points, the specification is in error, but it's hard to tell where. Really there are three possibilities depending on your interpretation as to what kind of case transformation needs to be done.
Simple upper case mapping
Simple case folding, followed by simple upper case mapping
Simple case folding
To me it seems like Microsoft does in fact want upper casing. From there I believe that the case folding reference is an error, and as such my guess is they just want simple upper case mapping.
I highly doubt the last option (simple case folding only), though. The other two options would give very similar results, with only a small number of code points possibly giving different results.
It seems like the only way to know for sure would be to either contact Microsoft, or painstakingly look at binaries to see which method is followed.
In 3.13 Default Case Algorithms (p. 115) of The Unicode Standard
Version 6.2 – Core Specification the text refers to UnicodeData.txt. This contains:
03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392
03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;GREEK SMALL LETTER CURLED BETA;;0392;;0392
which indicates that both the Greek small letter beta and the Greek beta symbol have the Greek capital letter beta (0392) as their simple uppercase mapping, and as an aside indicates that the two characters are compatibility equivalents. It also contains the remainder of the bidirectional case conversion you are looking for. You may also need to look at SpecialCasing.txt for boundary cases.
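The trailing 0392 fields in those records are the simple uppercase (and titlecase) mappings, and Python's str.upper agrees here (it applies the full mapping, which for these two characters coincides with the simple one):
print("\u03b2".upper() == "\u0392")   # True -- small beta uppercases to capital beta
print("\u03d0".upper() == "\u0392")   # True -- beta symbol also uppercases to capital beta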

Doing away with encoding in a genetic algorithm implementation

I was wondering if encoding in a genetic algorithm is really necessary. Let's say I have a program that is supposed to implement a GA to guess a word a user inputs.
I don't see the point in having the chromosomes as binary strings; I would rather have them as plain strings of letters, and mutate and crossbreed the strings accordingly.
Is such an approach unorthodox? Will it really affect the outcome, or does it violate the definition of a genetic algorithm?
I do understand that different types of encoding are possible. However, that isn't what I am concerned about. Please keep your answer specific to the program objective of guessing a string that is similar to the one inputted by the user.
THIS IS NOT A QUESTION ABOUT CHOICE OF ENCODING, BUT WHETHER I CAN DO AWAY WITH THE WHOLE ENCODING SCENARIO RELEVANT TO THIS QUESTION OBJECTIVE.
Though unorthodox, your approach would be perfectly valid. The crossover and mutation functionality may have to be tweaked, however. There are in fact numerous such non-standard implementations (of encodings) today, including alphabetic, alphanumeric, decimal, etc.
As for your specific case: if you do not encode an alphabetic chromosome, that is the same as encoding it alphabetically with an identity map. For an alphabetic encoding, the normal crossover functionality remains valid, though mutation may have to be changed so as to generate a random letter at the mutation site, if any.
Binary encoding in GAs is generally followed only due to the simplicity and speed of the operations involved. In your case, for example, a string/character comparison generally takes longer than the integer/boolean alternative.
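Concretely, a minimal sketch of such a direct string-encoded GA in Python might look like this (the target word, population size, selection scheme, and mutation rate are all arbitrary choices):
import random
import string

TARGET = "unicode"                    # stands in for the user's input word
ALPHABET = string.ascii_lowercase

def fitness(chromosome):
    # count positions that already match the target
    return sum(a == b for a, b in zip(chromosome, TARGET))

def crossover(a, b):
    # single-point crossover directly on the letter strings
    cut = random.randrange(1, len(TARGET))
    return a[:cut] + b[cut:]

def mutate(chromosome, rate=0.05):
    # replace each letter with a random one at the given rate
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in chromosome)

population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(100)]
for generation in range(1000):
    population.sort(key=fitness, reverse=True)
    if population[0] == TARGET:
        break
    parents = population[:20]         # simple truncation selection
    population = [mutate(crossover(*random.sample(parents, 2)))
                  for _ in range(100)]
print(generation, population[0])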

Is this a bug of lex?

rule:
%%
AAAA printf("AAAA : %s\n", yytext);
AAA printf("AAA : %s\n", yytext);
AA printf("AA : %s\n", yytext);
And for the input AAAAA, the output is:
AAAA : AAAA
A
Instead of :
AAA : AAA
AA : AA
Is it a bug of lex?
No, it's adhering to the specification.
The rule is (man 1p lex)
During pattern matching, lex shall search the set of patterns for the single longest possible match. Among rules that match the same number of characters, the rule given first shall be chosen.
so it will always greedily match the longest token, AAAA, first. This rule is common in the lexical conventions of many languages. E.g. in C++:
void f(void*=0);
would fail to parse, because the characters *= are interpreted as the multiplication-assignment operator (the longest match), not as * followed by =.
The reason behind this rule is that it can be implemented efficiently. A scanner following this rule needs only O(1) space (including the input, i.e. the input need not fit into memory) and O(N) time. If it instead had to check that the rest of the input can be tokenized as well, it would need O(N) space and as much as O(N^2) time. Memory consumption in particular was crucial in the Middle Ages of computing, when all compilation was done in linear passes. And I'm sure you wouldn't appreciate O(N^2) running time when parsing today's several-hundred-thousand-line source files (e.g. C files including headers). Second, the scanners thus generated are very fast, which helps a lot when parsing.
Last but not least, the rule is simple to understand. As an example of the opposite, consider ANTLR's rule for tokenization, which can fail to match even though a prefix of the current input is a token and the remaining input is tokenizable. For example:
TOK1 : 12
TOK2 : (13)+
would fail to match '12131312'. No such surprises happen with lex, so I suggest taking the rule as-is.
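If it helps to see the maximal-munch rule in isolation, here is a small Python model of it (the rule table mirrors the lex spec above; the fallback echo stands in for lex's default rule):
import re

RULES = [("AAAA", "AAAA"), ("AAA", "AAA"), ("AA", "AA")]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            # longest match wins; the strict > makes earlier rules win ties
            if m and m.end() > best_len:
                best_name, best_len = name, m.end()
        if best_name is None:
            print(text[pos])          # no rule matches: echo, like lex's default
            pos += 1
        else:
            print("%s : %s" % (best_name, text[pos:pos + best_len]))
            pos += best_len

tokenize("AAAAA")   # prints "AAAA : AAAA", then the leftover "A"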