Strange list pattern format in CLDR-arabic locale - unicode

I have observed in CLDR-25-data following entries for list pattern formats in arabic locale (similar also in hebrew):
<listPatterns>
<listPattern>
<listPatternPart type="start" draft="contributed">{0}، {1}</listPatternPart>
<listPatternPart type="middle" draft="contributed">{0}، {1}</listPatternPart>
<listPatternPart type="end" draft="contributed">{0}، و {1}</listPatternPart>
<listPatternPart type="2" draft="contributed">{0} و {1}</listPatternPart>
</listPattern>
</listPatterns>
Note that the LDML-specification only speaks about placeholders of the form "{0}" or "{1}" (not like in list pattern parts for types "end" and "2"). See also:
http://cldr.unicode.org/development/development-process/design-proposals/list-formatting
or
http://cldr.unicode.org/translation/lists
I suspect this has something to do with right-to-left-style, but how in detail?
UPDATE:
Now I have written a small Java program to see the real sequence of chars.
String s = "{0} و {1}"; // as displayed in browser or IDE-window
for (char c : s.toCharArray()) {
System.out.println(c);
}
The output is:
{
0
}
و
{
1
}
So it seems to be a display problem, not a problem of the char sequence itself?! I use Internet Explorer version 9 and Eclipse 4.3.

The char sequence is here (in codepoints):
123=>{
48=>0
125=>}
32=>
1608=>و // DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC=true
32=>
123=>{
49=>1
125=>}
Unicode infers the display style also from evaluating the bidirectional context. So here the unicode algorithm seems to apply first the standard LTR-context to the first chars found - hence preserving the char sequence "{0} ".
When the algorithm enters the arabic char it denotes its bidirectional status and applies it to the following next chars. According to the official paper of W3C this means:
The shape of opening bracket glyph "{" changes to "}" in RTL-context (right-to-left). So from the perspective of arabic char the sequence left to arabic char is "1} ", and this is equivalent to the usual LTR-form " {1". After having read the ASCII-char "1" the unicode algorithm evaluates that now the context is LTR again, so displaying the closing bracket in normal form "}". The final visual result (not in terms of codepoints however) is then as if there were one extra closing bracket and one less opening bracket.
I hope SO-readers might find this explanation useful if they encounter similar strange visual effects in bidirectional context.

Related

How to use FT_Load_Char in Arabic (compatibility characters)

This is a follow-up of this question. I'm interested by different glyphs for the same character, also known as "Unicode Compatibility Characters".
Let's take the following two Arabic "reversed-character" words: كلمة ةملك
First word is:
كلمة
in hex code:
0643 0644 0645 0629
Second word is:
ةملك
in hex code:
0629 0645 0644 0643
If I paste those two words in Microsoft Word using Deja Vu Sans, I get this:
With the following pseudo-code using FreeType2, I get:
FT_Face face;
FT_New_Face(library, "DejaVuSans.ttf", 0, &face);
FT_GlyphSlot slot;
FT_Load_Char(face, each_character, FT_LOAD_RENDER);
slot = face->glyph;
//Use slot->bitmap.buffer
FT_Done_Face(face);
What am I missing? How can I have the right glyphs depending of the context?
My key issue is that I store each "character" (I should say glyph - but for me, character was equivalent to glyph) in a table so it's going to be complicated. I'm limited in speed, not in space. Can I have two different unicode characters for the same logical character?
libraqm is a solution to get the glyth for each character depending of its position in the sentence. But I'm still interested to get the character corresponding to the glyth (I know it's not a 1-to-1 relation). For instance, there are 4 characters for the 4 glyths of the letter Kaf as stated in the comment above.

Does Unicode have a special marker character?

My father created in mid 90's an encoding for his engineering purposes for his company's computers. It was close to ISO 8859-2 (Latin 2), but with some differences.
For example there was added a special "MARKER CHARACTER". This character wasn't determined to be a literal, but also it wasn't a control character.
The purpose of this character was to be inserted by machine when needed to split text into parts. See the following Python parser script:
re.sub(r'\{\{', r'~{{', text)
re.sub(r'\[\[', r'~[[', text)
re.sub(r'\]\]', r']]~', text)
re.sub(r'\}\}', r'}}~', text)
parts = text.strip('~').split('~')
inCurly = [False]
inSharp = [False]
whereAmI = ['']
for part in parts:
if part[:2] == '{{':
inCurly.append(True)
whereAmI.append('Curly')
elif part[:2] == '[[':
inSharp.append(True)
whereAmI.append('Sharp')
if whereAmI[-1] == 'Sharp' and not inCurly[-1]:
# some advanced magic on current part,
# if it is directly surrounded by sharp brackets,
# but these sharp brackets are not in curly brackets anyhow
# (not: "{{ (( [[ some text ]] )) }}")
# detecting closing brackets and popping inSharp, inCurly, whereAmI
# joining parts back to text
This is an easy parser for advanced purposes, you can detect more parenthesis or quotation marks as you want. But this have one huge fault. It break things when a ~ is in text.
For this purpose and similar purposes like this (but in C lang I think) he added to his encoding/character set that marker character.
For years I use for this purpose three german "sharp s": ßßß, because it is almost impossible to see three of them in a row. But this is not an ideal solution.
Yesterday my father told me this story and I immediatelly thought: is there some equivalent in an Unicode family? Unicode is a modern developing standard spreaded all over the world in past decade drastically. There should be a special character only for this particular purpose, or not?
I don't think there's anything called that specifically, but you might find zero-width space or information separator, among others, suitable for the purpose. You can arbitrarily select any character as your marker, and use an escape character if it occurs within the string.
In the control pictures block, there is a symbol for the group separator.

How to search for any unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich text entries; these entries have some symbols in them such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-number symbols are not considered "standard UTF-8" by TexMaker's import function (and in any case probably won't be processed by Xelatex or other converters) . I can't simply use the UnicodeString functions in DXL because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. less than some numeric code value. For example, charOf(8217) should create a right-curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I did copy the curly quote from the DOORS module and verified via an online unicode analyzer that it was definitely Unicode decimal value 8217 .
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with ,e.g., \textquoteright in the output stream.
My overall setup works for lower-count chars, since this works:
( c is a single character pulled from a string)
thedeg = charOf(176)
if( thedeg == c )
{
temp += "$\\degree$"
}
Got some help from DXL coding experts over at IBM forums.
Quoting the important stuff (there's some useful code snippets there as well):
Hey, you are right it seems intOf(char) and charOf(int) both do some
modulo 256 and therefore cut anything above that off. Try:
int i=8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.

How do I specify a unicode literal that requires more than four hex digits in Antlr?

I want to define a lexer rule for ranges between unicode characters that have code points that need more than four hexadecimal digits to identify. To be concrete, I want to declare the following rule:
ID_Continue : [\uE0100-\uE01EF] ;
Unfortunately, it doesn't work. This rule will match characters that are not in this range. (I'm not certain to what exact behaviour this results in, but it isn't the one I want.) I've tried also the following (padding with leading zeros and using 8 digits):
ID_Continue : [\U000E0100-\U000E01EF] ;
But it seems to result in the same unwanted behaviour.
I am using Antlr4 and the IntelliJ plugin for it for testing.
Does Antlr4 not support unicode literals above \uFFFF?
No, ANTLR's max is the same as Java's Character.MAX_VALUE
If you look at (a part of) ANTLR4's lexer grammar you will see these rules:
// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
: Esc
( [btnfr"'\\] // The standard escaped character set such as tab, newline, etc.
| UnicodeEsc // A Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
...
fragment UnicodeEsc
: 'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
;
...
fragment Esc : '\\' ;
Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance my MySQL grammar, written for ANTLR3 (C target) can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.
What's a bit strange here is however that I haven't specified that range in the grammar (it uses only the BMP). Still the parser can parse any utf-8 input. Might be a bug in the target runtime, though I'm happy it exists :-D

Meaning of an RTF faulty piece of code

I am working on an RTF file made by someone else on an unknown platform, and everything is interpreted correctly, except some characters, whatever character set I open them from in openoffice. Here is the plain text, after interpretation:
"Même taille que la Terre, même masse, même âgec Vénus a souvent été qualifiée de sœur de la Terre. "
and here is the original ANSI paragraph:
"M\u234\'3fme taille que la Terre, m\u234\'3fme masse, m\u234\'3fme \u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus a souvent \u233\'3ft\u233\'3f qualifi\u233\'3fe de s\u339\'3fur de la Terre."
To zoom in:
"âgec Vénus" becomes "\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
and finally, what we come up with:
"\uc2 \u61825\'ff\'81\uc1 c"
here \uc2 and \uc1 are to say we are going back and forth between 4-bytes and 2-bytes Unicode encoding.
\u61825 is an unknown Unicode character. Indeed, according to the RTF specification, any UTF character greater than 2^15 should be written in a negative form; negative form with ANSI characters should make the "-" (minus) sign visible to the notepad, am I right? So here already I have something I don't understand, how the RTF writer used by the person who made the rtf file in the first place could have done it. Maybe I missed something in the specification, specific versions, character sets, I don't know. If taken as is, 61825 would correspond to F181 which is in a private area of the Unicode table.
And then, the \'ff\'81 would be some use of the ANSI equivalent field of the whole "specific character" group (whose structure is usually \uN\'XX), to code something that would be 4-byte long. And here again, I could not find:
what is the code page (Windows-1252, ISO-8859-1, other?) being refered to (as in all the other places in the file where a \uN\'XX sequence apears, XX are always 3F, the Windows-1252 code for "?", so it did not give me much information)
what does the \'FF (which looks like some control character inside an escape sequence!) stand for, and then why \'81... Actually, the translation of \u61825 to hex is F181, not FF81...I am lost here!
Finally, what the translated text (in French) would make us expect is the ":" (semicolon): "Same size as Earth, same mass, same age: Venus has often been qualified as Earth's sister". It would make sense. But what rtf writer could imagine such a complicated code for the semicolon?
So again, after 1 hour of search, I open the question to you fellows: does someone recognize this, and could tell me what control word encoding is used, is there a big endian/little endian/2's complement mess here with the 61825, and same with the \'ff\'81, which would assemble as FF81 instead of F181, which itself doesn't mean anything as is...here my question is only to know if there would be a way to find the complete original text back from the bizarre RTF encoding!
what the translated text (in french) would make us expect is the ":" (semicolon
Nearly: it should be the ellipsis. You can see the source text eg here.
The ellipsis should normally be written simply as three periods, but there has traditionally been a separate character representing ellipsis in order better to control their spacing, back before complex text layout algorithms existed that could do automatic glyph replacement. Consequently there exists a Unicode compatibility character U+2026 HORIZONTAL ELLIPSIS to allow round-tripping to legacy encodings such as Windows code page 1252, where it is byte 133.
That, however, is not what has been encoded in your RTF document. That would be too easy.
61825 is an unknown Unicode character.
It's a Private Use Area character, which means it could represent absolutely anything. Word has exported certain common symbol fonts as PUA characters - see this post for the background.
So someone at some point may have used a symbol font where code unit 129 (the 0x81 in U+F181, 61825) maps to something that looks like an ellipsis. Quite what that font is, I have no idea! It doesn't seem to be one of the usual suspects (Symbol, Wingdings, Webdings). You might just have to manually replace U+F181 with U+2026 for now unless you can find out more about the source.