Unicode case folding to upper case

I'm trying to implement a library for reading Microsoft CFB (Compound File Binary) Format files, according to the official specification of that format. The specification is available from this site.
In a nutshell - some of the structures of the file are stored in a red-black tree. I've got a problem with the comparison predicate used for storing these structures in that tree. The specification says that, if the names (the strings are stored as UTF-16, the standard in Windows APIs) of these structures are different, it is necessary to iterate through every UTF-16 code point and:
(...) convert to upper-case with the Unicode Default Case Conversion
Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
The <2> reference says that:
For Windows XP and Windows Server 2003: The compound file implementation
conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding
(http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
However, when I looked up the referenced case folding file and read the UTR #21 "Case Mapping" referenced there, I realized that case folding is defined as an operation that bears much more resemblance to lower-casing than to upper-casing.
By using CaseFolding-4.txt, we can obtain the case folding mappings of upper-case letters to lower-case ones. The mapping is always 1-to-1, since full foldings (those that expand to multiple characters) aren't needed here. However, the reverse mapping of lower-case letters to upper-case ones isn't straightforward anymore. For example,
0392; C; 03B2; # GREEK CAPITAL LETTER BETA
03D0; C; 03B2; # GREEK BETA SYMBOL
Thus, we have no way of knowing whether 03B2 should be converted to 0392 or 03D0. Does the standard define something like folding to upper-case? Maybe I should use case folding, and then convert to upper-case? Or have I understood the specification completely wrong?
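For reference, here is a minimal Python sketch of how the simple case foldings can be extracted from the data file; the file name and parsing details are assumptions based on the excerpt above, not part of the specification:

def load_simple_foldings(path="CaseFolding-4.txt"):
    """Map code points to their simple case folding (statuses C and S only)."""
    fold = {}
    with open(path, encoding="ascii") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blanks
            if not line:
                continue
            code, status, mapping = [x.strip() for x in line.split(";")[:3]]
            if status in ("C", "S"):  # common + simple = the simple variant
                fold[int(code, 16)] = int(mapping, 16)
    return fold

# Both 0x0392 and 0x03D0 fold to 0x03B2, so reversing the map is ambiguous:
# fold = load_simple_foldings()
# fold[0x0392] == fold[0x03D0] == 0x03B2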

Summary: The wording used by Microsoft is...confusing to say the least. It appears that simple upper case mapping should be done, though I can't be certain.
Background
Part of the confusion might be the difference between case folding and case mapping. Case mapping maps every character to a designated case. Case folding, while it is based on lower-casing, is defined to result in case-less characters (UTR #21 §1.3).
Now there are two variants of case mapping and case folding, simple and full. Unlike the simple transformation, the full one can change string length and, as you rightly point out, is not needed here. The specification specifically mentions simple, which is probably the only clear thing in this answer. I do feel the need to mention for future reference that the current Unicode Standard (6.3.0) says that the default case transformation is the full one, though the version Microsoft references (3.1.1) does not appear to make this distinction.
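To make the distinction concrete, Python happens to expose both operations (str.upper/str.lower are full case mappings, str.casefold is full case folding; this is only an illustration, and neither is the simple variant the specification asks for):

print("ß".lower())     # 'ß'  lower-case mapping leaves it unchanged
print("ß".casefold())  # 'ss' case folding maps it to a case-less form
print("Σ".casefold())  # 'σ'  folding is based on lower-casing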
Spec Analysis
(...) convert to upper-case with the Unicode Default Case Conversion Algorithm, simple case conversion variant (simple case foldings), with the following notes.<2> Compare each upper-cased UTF-16 code point binary value.
To me this quote appears to suggest they want upper case, and simply made an error by saying case folding instead of case mapping. But then comes that reference you quoted:
For Windows XP and Windows Server 2003: The compound file implementation conforms to the Unicode 3.0.1 Default Case Conversion Algorithm, simple case folding (http://www.unicode.org/Public/3.1-Update1/CaseFolding-4.txt) with the following exceptions.
They actually mention the case folding data file! At this point, I'm not sure what to think. My main line of thought is that Microsoft wants case folding but erroneously thought it was based on upper-casing rather than lower-casing. Even that is a stretch, but it's the closest I've been able to come to reconciling this possible contradiction, and I hope there's a better explanation.
I've found in section 2.6.1 the following which supports some form of upper-casing:
[...] the directory entry name is compared using a special case-insensitive upper-case mapping, described in Red-Black Tree.
Note that they do in fact use the term mapping here.
The exception list
Taking a look at the exception list for the mentioned Windows XP and Windows Server 2003, most entries are subtractions, suggesting code points Microsoft wants to keep distinct. However, in the table, the code points are actually listed in reverse order to the Unicode case folding data file.
One interpretation of this is that it's just a display quirk. This idea is shot down by the last row where they subtract the case transformation 0x03C2 -> 0x03C2. This transformation does not exist in the data file since the transformation 0x03C2 -> 0x03C3 does (an unlisted case transformation is considered to transform to itself).
Another interpretation is that they do in fact erroneously believe that it's the reverse mapping that's the correct one. As you mentioned though, this runs into trouble, as the reverse mapping is not always straightforward. Otherwise, this interpretation would be fine.
A third interpretation is to consider their reference to the Unicode case folding data file wrong. This of course makes me feel uneasy, but if they actually did mean case mapping originally, they might have just provided the link as a quick reference point. The exception list they mention does have column headers such as "Lowercase UTF-16 code point", but we know that case folding is in fact case-less.
As an aside, I did look at the exception list for the later operating systems, hoping to gain some more insight. I found more confusion. In particular the addition of 0x03C3 -> 0x03A3 troubles me. Since the exception list and the Unicode file list their code points in the opposite order, it appears that the transformation is already in the data file and doesn't need to be added. This part of the specification does not want to be understood!
Conclusion
If you've read all of the above, you'll probably guess that this conclusion is going to be less than ideal. Clearly at one or more points, the specification is in error, but it's hard to tell where. Really there are three possibilities depending on your interpretation as to what kind of case transformation needs to be done.
1. Simple upper case mapping
2. Simple case folding, followed by simple upper case mapping
3. Simple case folding
To me it seems like Microsoft does in fact want upper casing. From there I believe that the case folding reference is an error, and as such my guess is they just want simple upper case mapping.
I highly doubt it's the last option (simple case folding only), though. Either of the first two would give very similar results, with only a small number of code points possibly differing.
It seems like the only way to know for sure would be to either contact Microsoft, or painstakingly look at binaries to see which method is followed.
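For what it's worth, a minimal Python sketch of the first option (simple upper case mapping) might look like the following; the file name, the parsing, and the comparison function are my own illustration built from field 12 of UnicodeData.txt, not anything given in the specification, and it ignores surrogate-pair subtleties:

def load_simple_upper(path="UnicodeData.txt"):
    """Simple uppercase map from field 12 of UnicodeData.txt."""
    upper = {}
    with open(path, encoding="ascii") as f:
        for line in f:
            fields = line.split(";")
            if fields[12]:  # simple uppercase mapping (may be empty)
                upper[int(fields[0], 16)] = int(fields[12], 16)
    return upper

def compare_names(a, b, upper):
    """Compare two names code point by code point after simple upper-casing."""
    ka = [upper.get(ord(c), ord(c)) for c in a]
    kb = [upper.get(ord(c), ord(c)) for c in b]
    return (ka > kb) - (ka < kb)  # negative/zero/positive like a comparator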

In 3.13 Default Case Algorithms (p. 115) of The Unicode Standard
Version 6.2 – Core Specification the text refers to UnicodeData.txt. This contains:
03B2;GREEK SMALL LETTER BETA;Ll;0;L;;;;;N;;;0392;;0392
03D0;GREEK BETA SYMBOL;Ll;0;L;<compat> 03B2;;;;N;GREEK SMALL LETTER CURLED BETA;;0392;;0392
which indicates that both the Greek small letter beta (03B2) and the Greek beta symbol (03D0) have a simple uppercase mapping to the Greek capital letter beta (0392), and, as an aside, that the two symbols have some level of compatibility. The file also contains the remainder of the bidirectional case conversions you are looking for. You may also need to look at SpecialCasing.txt for boundary cases.
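A quick way to confirm this from a Python prompt (str.upper applies the full mappings, but for these two characters it coincides with the simple field quoted above):

>>> hex(ord("\u03b2".upper()))  # GREEK SMALL LETTER BETA
'0x392'
>>> hex(ord("\u03d0".upper()))  # GREEK BETA SYMBOL
'0x392'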

Related

In Python (or any language) what does an "upper" function do to Hindi, Amharic and other non-Latin character sets?

Subject says it all. Been looking for an answer, but cannot seem to find it.
I am writing a web app that will store data in a database and also have language files translated into a wide variety of character sets. At various moments, the text will be presented. I want to control presentation such as spurious blank spaces at the beginning and end of strings. Also I want to ensure some letters are upper or lower case.
My question is: what happens in upper/lower case functions when the character set only has one case?
EDIT Sub question: Are there any unexpected side effects to be aware of?
My guess is that you simply get back the one and only character.
EDIT - Added Description
The main reason for asking this question is that I am writing a webapp that will be distributed and run on machines in remote areas with little or no chance to fix "on-the-spot" bugs. It's not a complicated webapp, but will run with many different language char sets. I want to be certain of my footing before releasing the server.
First of all, the upper() and lower() methods in Python can be applied to Hindi, Amharic, and other non-Latin character sets.
The upper() method converts a lowercase character only if an equivalent uppercase character for it exists; if not, it doesn't.
Or better said: if there is nothing to convert, the character stays the same.
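A quick demonstration (the Devanagari sample string is my own arbitrary example):

hindi = "नमस्ते"                 # Devanagari script has no case distinction
print(hindi.upper() == hindi)  # True: nothing to convert, string unchanged

# One side effect to be aware of: full case mapping can change string length.
print("ß".upper())             # 'SS': German sharp s expands to two characters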

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (piecemeal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), filtering each resulting word on whether its first character is uppercase (e.g. with Python's str.isupper), and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is kinda impossible, and using isupper also works with non-Latin letters (e.g. when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering using a dictionary is a much more straightforward approach and much easier to maintain (I would recommend using a set or another data type suited for fast lookups).
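To sketch that out (the function name and the proper-noun set here are hypothetical placeholders):

import re

PROPER_NOUNS = {"TikTok"}  # hypothetical dictionary of exempt proper nouns

def flag_title_case(text):
    # Disregard any parameters in curly brackets, e.g. {FIDO} or {Fifi}.
    text = re.sub(r"\{[^}]*\}", " ", text)
    # Split on whitespace and non-word characters, dropping empty strings.
    words = [w for w in re.split(r"[\s\W]+", text) if w]
    if len(words) < 2:  # must be more than one word
        return False
    # Flag if any word after the first starts uppercase and is not exempt.
    return any(w[0].isupper() and w not in PROPER_NOUNS for w in words[1:])

print(flag_title_case("Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))  # True
print(flag_title_case("Don't post that video on TikTok."))  # False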
Maybe if you can define your use case more clearly we can work out a pure regex solution...

How should this mixed-character string be split on unicode word boundaries?

Consider the string "abc를". According to unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, 3 different Rust implementations of word boundary detection (regex, unic-segment, unicode-segmentation) have all disagreed, and grouped that string into one word. Which behavior is correct?
As a follow up, if the grouped behavior is correct, what would be a good way to scan this string for the search term "abc" in a way that still mostly respects word boundaries (for the purpose of checking the validity of string translations)? I'd want to match something like "abc를" but not something like "abcdef".
I'm not so certain that the demo for word segmentation should be taken as the ground truth, even if it is on an official site. For example, it considers "abc를" ("abc\uB97C") to be two separate words but considers "abc를" ("abc\u1105\u1173\u11af") to be one, even though the former decomposes to the latter.
The idea of a word boundary isn't exactly set in stone. Unicode has a Word Boundary specification which outlines where word-breaks should and should not occur. However, it has an extensive notes section for elaborating on other cases (emphasis mine):
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
...
My understanding is that the crates you list are following the spec without further contextual analysis. Why the demo disagrees I cannot say, but it may be an attempt to implement one of these edge cases.
To address your specific problem, I'd suggest using Regex with \b for matching a word boundary. This unfortunately follows the same unicode rules and will not consider "를" to be a new word. However, this regex implementation offers an escape hatch to fallback to ascii behaviour. Simply use (?-u:\b) to match a non-unicode boundary:
use regex::Regex;

fn main() {
    let pattern = Regex::new("(?-u:\\b)abc(?-u:\\b)").unwrap();
    println!("{:?}", pattern.find("some abcdef abc를 sentence"));
}
You can run it for yourself on the playground to test your cases and see if this works for you.

Are there any real alternatives to unicode?

As a C++ developer, supporting unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that make it very hard to determine the case of a letter, convert case, or do pretty much anything beyond identifying a single known codepoint or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough not to have unicode support built into the language (i.e. C and C++). Support for unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to unicode! i.e. an encoding that allows easy identification of character classes without needing a lookup data structure (tree, table, whatever), and easy identification of the relationships between characters. I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode was redesigned from scratch, eg. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some, "İ" in other orthographies, but there are tons of other difficulties – eg. German "ß", which is lowercase, but has no uppercase equivalent (well, it has now, but is rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you if they are upper or lower case, you just memorised the properties table which tells you that 41..5A is upper, 61..7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
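That table look-up is exactly what standard libraries expose; for instance, in Python:

import unicodedata

for ch in "Aaßσς1":
    # 'Lu' = uppercase letter, 'Ll' = lowercase letter, 'Nd' = decimal digit
    print(ch, unicodedata.category(ch))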
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation; it only assigns codepoints, i.e. integers, to character definitions, and it maintains the said tables.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between the codepoints and their byte representation.
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics –, why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.

Theory: "Lexical Encoding"

I am using the term "Lexical Encoding" for my lack of a better one.
A Word is arguably the fundamental unit of communication as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language, is a Glyph to another. Unicode 5.1 assigns more than 100,000 values to these Glyphs currently. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words, you should be able to converse in general terms. A "Lexical Encoding" would encode each Word not each Letter, and encapsulate them within a Sentence.
// A simplified example of a "Lexical Encoding"
String sentence = "How are you today?";
int[] encoded = { 93, 22, 14, 330, QUERY };
In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on generalised statistical ranking of word usage, and assigned a constant to the question mark.
Ultimately, a Word has both a Spelling & Meaning though. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure.
What are other examples of "Lexical Encoding" techniques?
If you were interested in where the word-usage statistics come from:
http://www.wordcount.org
This question impinges on linguistics more than programming, but for languages which are highly synthetic (having words composed of multiple combined morphemes), it can be a highly complex problem to try to "number" all possible words, as opposed to languages like English which are at least somewhat isolating, or languages like Chinese which are highly analytic.
That is, words may not be easily broken down and counted based on their constituent glyphs in some languages.
This Wikipedia article on Isolating languages may be helpful in explaining the problem.
There are several major problems with this idea. In most languages, the meaning of a word, and the word associated with a meaning, change very swiftly.
No sooner would you have a number assigned to a word, before the meaning of the word would change. For instance, the word "gay" used to only mean "happy" or "merry", but it is now used mostly to mean homosexual. Another example is the morpheme "thank you" which originally came from German "danke" which is just one word. Yet another example is "Good bye" which is a shortening of "God bless you".
Another problem is that even if one takes a snapshot of a word at any point of time, the meaning and usage of the word would be under contention, even within the same province. When dictionaries are being written, it is not uncommon for the academics responsible to argue over a single word.
In short, you wouldn't be able to do it with an existing language. You would have to consider inventing a language of your own, for the purpose, or using a fairly static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be perfect for the purpose of defining static morphemes in an ever-standard lexicon.
Even in Chinese, where there is rough mapping of character to meaning, it still would not work. Many characters change their meanings depending on both context, and which characters either precede or postfix them.
The problem is at its worst when you try and translate between languages. There may be one word in English, that can be used in various cases, but cannot be directly used in another language. An example of this is "free". In Spanish, either "libre" meaning "free" as in speech, or "gratis" meaning "free" as in beer can be used (and using the wrong word in place of "free" would look very funny).
There are other words which are even more difficult to place a meaning on, such as the word beautiful in Korean; when calling a girl beautiful, there would be several candidates for substitution; but when calling food beautiful, unless you mean the food is good looking, there are several other candidates which are completely different.
What it comes down to is that although we only use about 200k words in English, our vocabularies are actually larger in some respects because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and every other language meaningful for conversation. Human speech is not a well-defined, well-oiled machine. So, although you could create such a lexicon where each "word" had its own unique meaning, it would be very difficult, and nigh on impossible, for machines using current techniques to translate from any human language into your special standardised lexicon.
This is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can) then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PhD; or simply make a heap of money, whatever keeps your ship steaming.
It's easy enough to invent one for yourself. Turn each word into a canonical bytestream (say, lower-case decomposed UCS32), then hash it down to an integer. 32 bits would probably be enough, but if not then 64 bits certainly would.
Before you ding me for giving a snarky answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank or sort or group them, but just to map each one onto a unique identifier that everyone agrees on.
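A rough Python sketch of that recipe (the normalization form and the hash are my own choices, not any standard):

import hashlib
import unicodedata

def word_id(word):
    # Canonical bytestream: lower-case, canonical decomposition (NFD),
    # fixed-width encoding (UTF-32 standing in for "UCS32").
    canonical = unicodedata.normalize("NFD", word.lower()).encode("utf-32-be")
    # Hash down to a 64-bit integer.
    return int.from_bytes(hashlib.blake2b(canonical, digest_size=8).digest(), "big")

print(word_id("Bread") == word_id("bread"))  # True: same canonical form
print(word_id("bread") == word_id("store"))  # False (with overwhelming probability)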
How would the system handle pluralization of nouns or conjugation of verbs? Would these each have their own "Unicode" value?
As a translation scheme, this is probably not going to work without a lot more work. You'd like to think that you can assign a number to each word, then mechanically translate that to another language. In reality, languages have the problem of multiple words that are spelled the same: "the wind blew her hair back" versus "wind your watch".
For transmitting text, where you'd presumably have an alphabet per language, it would work fine, although I wonder what you'd gain there as opposed to using a variable-length dictionary, like ZIP uses.
This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this 'lexical Unicode' as something that would allow you to break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means to achieve a universal translator, perhaps?
Even if you can encode and store, say, an English sentence using a 'lexical unicode', you cannot expect to read it and magically render it in, say, Chinese with the meaning kept intact.
Your analogy to Unicode, however, is very useful.
Bear in mind that Unicode, whilst a 'universal' code, does not embody the pronunciation, meaning or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather the script used by a group of languages). It is elemental at the visual representation level of a glyph (within the bounds of style, formatting and fonts). The Unicode code point for the Latin letter 'A' is just that. It is the Latin letter 'A'. It cannot automagically be rendered as, say, the Arabic letter Alif (ﺍ) or the Indic (Devnagari) letter 'A' (अ).
Keeping to the Unicode analogy, your Lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for a specific script. Your lexical Unicode would have to have a range of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word having different meanings, or different pronunciations (homonyms), would have to have different code points.
In Unicode, for some languages (but not all) where the same character has a different shape depending on its position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - it has a different code point. Likewise in your Lexical Unicode, if a word has a different form depending on its position in the sentence, it may warrant its own code point.
Perhaps the easiest way to come up with code points for the English language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word sequentially. You will have to use a different code for each different meaning of the same word, and you will have to use a different code for different forms - e.g. if the same word can be used as a noun and as a verb, then you will need two codes.
Then you will have to do the same for each other language you want to include - using the most authoritative dictionary for that language.
Chances are that this exercise is more effort than it is worth. If you decide to include all the world's living languages, plus some historic dead ones and some fictional ones - as Unicode does - you will end up with a code space that is so large that your code would have to be extremely wide to accommodate it. You will not gain anything in terms of compression - it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as code.
P.S. for those who are saying this is an impossible task because word meanings change, I do not see that as a problem. To use the Unicode analogy, the usage of letters has changed (admittedly not as rapidly as the meaning of words), but it is not of any concern to Unicode that 'th' used to be pronounced like 'y' in the Middle ages. Unicode has a code point for 't', 'h' and 'y' and they each serve their purpose.
P.P.S. Actually, it is of some concern to Unicode that 'oe' is also 'œ' or that 'ss' can be written 'ß' in German
This is an interesting little exercise, but I would urge you to consider it nothing more than an introduction to the concept of the difference in natural language between types and tokens.
A type is a single instance of a word which represents all instances. A token is a single count for each instance of the word. Let me explain this with the following example:
"John went to the bread store. He bought the bread."
Here are some frequency counts for this example, with the counts meaning the number of tokens:
John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 2
Note that "the" is counted twice--there are two tokens of "the". However, note that while there are ten words, there are only eight of these word-to-frequency pairs. Words being broken down to types and paired with their token count.
Types and tokens are useful in statistical NLP. "Lexical encoding", on the other hand, I would watch out for. This is a segue into much more old-fashioned approaches to NLP, with preprogramming and rationalism abounding. I don't know of any statistical MT that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any kind of well thought out numerical ontology, and if we're just throwing numbers at words to categorize them, we should be thinking about things like memory management and allocation for speed.
I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.
Actually you only need about 600 words for a half decent vocabulary.