In Unicode, why are there two representations for the Arabic digits?

I was reading the specification of Unicode at Wikipedia (Arabic Unicode)
and I see that each of the Arabic digits has 2 Unicode code points.
For example 1 is defined as U+0661 and as U+06F1.
Which one should I use?

According to the code charts, U+0660 .. U+0669 are ARABIC-INDIC DIGIT values 0 through 9, while U+06F0 .. U+06F9 are EXTENDED ARABIC-INDIC DIGIT values 0 through 9.
In the Unicode 3.0 book (5.2 is the current version, but these things don't change much once set), the U+066n series of glyphs are marked 'Arabic-Indic digits' and the U+06Fn series of glyphs are marked 'Eastern Arabic-Indic digits (Persian and Urdu)'.
It also notes:
U+06F4 - 'different glyphs in Persian and Urdu'
U+06F5 - 'Persian and Urdu share glyph different from Arabic'
U+06F6 - 'Persian glyph different from Arabic'
U+06F7 - 'Urdu glyph different from Arabic'
For comparison:
U+066n: ٠١٢٣٤٥٦٧٨٩
U+06Fn: ۰۱۲۳۴۵۶۷۸۹
Or:
Digit | U+066n | U+06Fn
0 | ٠ | ۰
1 | ١ | ۱
2 | ٢ | ۲
3 | ٣ | ۳
4 | ٤ | ۴
5 | ٥ | ۵
6 | ٦ | ۶
7 | ٧ | ۷
8 | ٨ | ۸
9 | ٩ | ۹
(Whether you can see any of those, and how clearly they are differentiated may depend on your browser and the fonts installed on your machine as much as anything else. I can see the difference on 4 and 6 clearly; 5 looks much the same in both.)
Based on this information, if you are working with Arabic from the Middle East, use the U+066n series of digits; if you are working with Persian or Urdu, use the U+06Fn series of digits. As a Unicode application, you should accept either set of codes as valid digits (but you might look askance at a sequence that mixed the two sets of digits - or you might just leave well alone).
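If you ever need to generate one series or the other from ASCII digits yourself, the mapping is just a fixed offset from U+0660 or U+06F0. A minimal sketch (Python is assumed here purely for illustration; the table names are made up):
ARABIC_INDIC = str.maketrans("0123456789", "".join(chr(0x0660 + d) for d in range(10)))
EXTENDED_ARABIC_INDIC = str.maketrans("0123456789", "".join(chr(0x06F0 + d) for d in range(10)))
print("1984".translate(ARABIC_INDIC))           # ١٩٨٤  (Arabic-Indic, U+066n)
print("1984".translate(EXTENDED_ARABIC_INDIC))  # ۱۹۸۴  (Extended Arabic-Indic, U+06Fn)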

In general you should not hard-code such info in your application.
On Windows you can use GetLocaleInfo with LOCALE_SNATIVEDIGITS.
On Mac CFNumberFormatterCopyProperty with kCFNumberFormatterZeroSymbol.
Or use something like ICU.
There are Arabic countries that don't use the Arabic-Indic digits by default. So there is no direct mapping saying Arabic -> Arabic-Indic digits.
And the user might have changed the defaults in the Control Panel anyway.
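For instance, ICU (sketched here via the PyICU bindings, assuming they are installed and that ICU's CLDR data is used for these locales) picks the locale's default digits for you, which is why hard-coding them is unnecessary:
from icu import Locale, NumberFormat   # pip install PyICU (assumed available)
print(NumberFormat.createInstance(Locale("ar_EG")).format(1234))  # Arabic-Indic digits (U+066n) expected
print(NumberFormat.createInstance(Locale("fa_IR")).format(1234))  # Extended Arabic-Indic digits (U+06Fn) expected
print(NumberFormat.createInstance(Locale("ar_MA")).format(1234))  # Morocco typically defaults to Western digits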

Which code point do you prefer for representing the digit 4: U+0664 (٤) or U+06F4 (۴)?
To be consistent, let this choice guide which codes you use for 1, 2, and the other duplicate codes.

Related

How do you determine the byte width of a UTF-16 character?

What are the rules for reading a UTF-16 byte stream to determine how many bytes a character takes up? I've read the standards, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain cases where the standard doesn't hold (or there's an aspect of the standard that I'm missing).
From reading the UTF-16 standard, https://www.rfc-editor.org/rfc/rfc2781:
Value of leading 2 bytes | Resulting character length (bytes)
0x0000-0xD7FF | 2
0xD800-0xDBFF | 4
0xDC00-0xDFFF | Invalid sequence (RFC 2781, section 2.2.2)
0xE000-0xFFFF | 2
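That rule fits in a few lines; a rough sketch (Python assumed, operating on the leading 16-bit word):
def utf16_unit_bytes(lead_word):
    # lead_word is the first 16-bit unit of a code point's UTF-16 encoding
    if 0xD800 <= lead_word <= 0xDBFF:
        return 4                              # high surrogate: one more 16-bit unit follows
    if 0xDC00 <= lead_word <= 0xDFFF:
        raise ValueError("low surrogate cannot start a character")  # RFC 2781, 2.2.2
    return 2                                  # a single 16-bit unit
print(utf16_unit_bytes(0x0041))   # 2
print(utf16_unit_bytes(0xD83D))   # 4 (first unit of U+1F430 RABBIT FACE)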
In practice, this appears to hold true, for some cases at least. I checked using an ad-hoc SQL script (SQL Server 2019; UTF-16 collation), and also verified the results with an online decoder:
Character | Unicode Name | ISO 10646 | UTF-16 encoding (hex, big-endian) | Size (bytes)
A | LATIN CAPITAL LETTER A | U+0041 | 00 41 | 2
Б | CYRILLIC CAPITAL LETTER BE | U+0411 | 04 11 | 2
ァ | KATAKANA LETTER SMALL A | U+30A1 | 30 A1 | 2
🐰 | RABBIT FACE | U+1F430 | D8 3D DC 30 | 4
However, when encoding the following ISO 10646 character into UTF-16, it appears to take 4 bytes, yet the leading 2 bytes give no indication that it will be this long:
Character | Unicode Name | UTF-16 encoding (hex, big-endian) | Size (bytes)
⚕️ | STAFF OF AESCULAPIUS | 26 95 FE 0F | 4
Whilst I'd rather keep my question software-agnostic, the following SQL will reproduce this behaviour on Microsoft SQL Server 2019 with the default collation and default language. (Note that SQL Server is little-endian.)
select cast(N'⚕️' as varbinary);
----------
0x95260FFE
Quite simply: how/why would you read 0x2695 and think "I'll need to read in the next word for this character"? Why doesn't this appear to align with the published UTF-16 standard?
The formal definition of all of this is called an "extended grapheme cluster," and it's defined in the Unicode Text Segmentation report. As Joachim Sauer notes, it's wise to be careful with the term "character" in Unicode.
Code points are what the "U+...." syntax refers to; each code point attempts to capture one "unit" of written language, for example "an acute accent." But what a reader would think of as a character (for example "an e with an acute accent") is a "grapheme cluster" and is made up of one or more code points. What is ultimately rendered to the screen is a "glyph," which is both context- and font-dependent.
Grapheme clusters in Unicode are actually more subtle than this. Unicode attempts to define them in a "neutral" way. (There's really no such thing as "neutral" when thinking about languages, but Unicode does try.) For example, in Slovak, ch, dz, and dž are each one letter, but are considered two grapheme clusters in Unicode. (Try to count the "letters" in a Slovak word. There are words that contain the letter dz and other words that have the letter d followed by the letter z. Oh human writing systems. I love you so much.)
The mapping of grapheme clusters to glyphs is also complex. For example, in Arabic, the single glyph لا is actually two grapheme clusters, ل (ARABIC LETTER LAM) followed by ا (ARABIC LETTER ALEF). If you use your mouse to select the glyph, you'll see there are two selectable pieces, and if you copy and paste them to another window you'll see them transform into their component parts. (Just to make things even more complicated, Unicode also defines a single code point for the ligature, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: ﻻ. If you try to select part of that one, you'll find you can't. It's one "character.")
Your specific case is a bit more special. The Variation Selector predates emoji, and was mostly designed to handle different variations of Han (Chinese) characters. However, as with every Unicode feature, it has eventually come to be used primarily for emoji. VS-16 is the "emoji" presentation form. The most famous example is the red heart, which is HEAVY BLACK HEART ❤, followed by VS-16: ❤️.
Similarly, your character U+2695 STAFF OF AESCULAPIUS is a single code point, and it looks like this by default (text style): ⚕. When you add VS-16, it is rendered in "emoji style": ⚕️. In some ways it's the same "character." Or is it? Depends on what you're using it for.
Emoji style is typically a bit larger and centered in its block, sometimes adding color. Notice where the period after the staff is drawn in each case (there are no extra spaces in the second example; the glyph is just much wider).
There are other combining systems as well:
U+0031: 1
U+0031 U+20e3: 1⃣ (+ COMBINING ENCLOSING KEYCAP, default text style)
U+0031 U+20e3 U+fe0f: 1⃣️ (+ VARIATION SELECTOR-16, emoji style)
All of these predate emoji. Modern emoji is dramatically more complicated, and includes several combining systems of its own (including two that are currently just used for flags).
But luckily, to your actual question, your wife is correct, and you can generally just consume all trailing code points that are marked "combining" to form an extended grapheme cluster, and that is kind of a "character" for some broad enough definition of "character."
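If you want a library to do that consumption for you, the third-party Python "regex" package (an assumption here, not something the answer above relies on) implements the Unicode segmentation rules as the \X pattern:
import regex   # pip install regex; \X matches one extended grapheme cluster
text = "\u2695\ufe0f e\u0301 1\u20e3\ufe0f"   # ⚕️, e + combining acute accent, keycap 1
print(regex.findall(r"\X", text))
# ['⚕️', ' ', 'é', ' ', '1⃣️'] -- each multi-code-point sequence comes back as one cluster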
All of your assertions are completely correct; your interpretation of the UTF-16 standards is correct and complete.
In your empirical observations, however, you've assumed that you only have one character. In actuality, you've run into a nuance of the Unicode implementation. Your "character" is actually two (technically, albeit not visually): U+2695 "STAFF OF AESCULAPIUS" followed by U+FE0F "VARIATION SELECTOR-16". The second character is a non-spacing mark which combines with the base character for the purpose of rendering a character variant.
This results in the byte sequence 26 95 FE 0F; however, as you note, neither of the words falls within the UTF-16 surrogate range. That is because neither of them requires a surrogate pair: they are simply two discrete Unicode characters.
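You can confirm that with any UTF-16 encoder; for example, a quick check in Python (assumed here only for convenience):
print("\u2695\ufe0f".encode("utf-16-be").hex(" "))   # 26 95 fe 0f -- two separate 2-byte code points
print("\U0001F430".encode("utf-16-be").hex(" "))     # d8 3d dc 30 -- one code point, one surrogate pair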
From 7.9 Combining Marks in ISO 10646: Universal Coded Character Set (UCS):
Combining marks are a special class of characters in the Unicode Standard that are intended to combine with a preceding character, called their base.
Combining marks usually have a visible glyphic form... a combining mark may interact graphically with neighbouring characters in various ways.
http://unicode.org/L2/L2010/10038-fcd10646-main.pdf
To explain why I'm answering my own question: I had my SO question all ready to fire off when my wife came into my office, looked over my shoulder, and whispered into my ear, "You know combining characters are a thing, right?". I've still asked the question and answered it myself, in case my wife's sweet nothings help another member of the community.

Why does the character ë have its own ISO code (EB) but ė doesn't?

I'm running into a tricky issue with the character ė (small e with one dot above it). I'm specifically using FPDF to generate PDF files in PHP and it won't support the ė character.
I noticed on Wikipedia the ISO hex for ė is the same as ë. Both are EB.
https://en.wikipedia.org/wiki/Ė
https://en.wikipedia.org/wiki/%C3%8B
Why are ė and ë considered the same character in ISO?
You've got things slightly wrong.
ISO is a standards organization, and it publishes many standards. Unicode also has a parallel ISO standard (ISO/IEC 10646), and there have been other ISO standards for text.
What you are looking at instead is ISO 8859, which is made up of various parts: https://en.wikipedia.org/wiki/ISO/IEC_8859
This is an 8-bit character encoding, so it has a very limited character set (fewer than 256 usable positions once the control codes are excluded). For this reason there are many different parts, and you choose whichever best fits your own country/language. You might choose Latin-1 for West European languages, or better Latin-9 (part 15), which includes the "new" euro currency symbol.
In your example, the code EB is language-specific. In part 13 (Latin-7) it is ė (Baltic), but in parts 1, 2, 3, 4, 9, 10, 14, 15, and 16 it is ë. As you can see, that variant is used in many more languages, so it is available in most of the ISO 8859 parts. The page linked above also has a table listing every variant for each code value.
The main problem now is to detect the original encoding, which can be very hard for anyone who cannot tell the language, and therefore the spelling, of a text. For new text it is better to use Unicode (for example UTF-8), where each character has a unique code point, and where legacy 8-bit text rarely happens to form valid byte patterns, so it can usually be detected.
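To see the ambiguity concretely, here is the same byte decoded under two ISO 8859 parts (Python assumed; the codec names are the standard library's aliases):
raw = bytes([0xEB])
print(raw.decode("iso8859-1"))    # ë  (part 1, Latin-1, West European)
print(raw.decode("iso8859-13"))   # ė  (part 13, Latin-7, Baltic)
print("ë".encode("utf-8").hex(" "), "ė".encode("utf-8").hex(" "))  # c3 ab vs c4 97 -- distinct in UTF-8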

Why do some UTF-8 characters fall into weird squares with four digits?

I find that some UTF-8 characters turn into weird squares with four digits inside in my terminal, like this:
Could anyone please explain why those weird squares appear instead of the correct UTF-8 characters?
PS:
The correct message is (you can look it up in UTF-8 tables):
reboot: 只有 root 能够执行
Which means reboot: Only root can execute.
PPS:
I tested UTF-8 characters whose code points take 5 or 6 hex digits:
Wow, I got a square with six digits inside!
Many thanks to Jonathan!
It means your font doesn't have a symbol for U+80FD or U+591F (etc), so the square is a fallback that allows you to determine what the Unicode symbol was, even though the glyph cannot be displayed accurately.
You either need to get a new font or change locale or something along those lines so that you get to see the message more nearly correctly.
Those glyphs are missing in the font you are using, so their hexadecimal number is rendered instead. Make your terminal use a font with CJK characters.
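If you want to know exactly which code points your font needs to cover, you can list them from the string itself; a quick sketch (Python assumed):
msg = "reboot: 只有 root 能够执行"
print([f"U+{ord(c):04X}" for c in msg if ord(c) > 0x7F])
# ['U+53EA', 'U+6709', 'U+80FD', 'U+591F', 'U+6267', 'U+884C'] -- the glyphs the font must supply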

Does iOS support all Unicode emojis?

Hello All,
I have a problem regarding Unicode characters. I'm able to display Apple artwork Unicode characters in a UITextView.
Like this:
self.textView.text = @"\ue00A";
That works fine.
But now I have many Unicode characters which aren't in the Apple artwork set.
One of them is U+1F3C7.
Now I'm trying to show it in the UITextView:
self.textView.text = @"\u1f3c7";
But it shows me a special character instead of the emoji.
This code point has an emoji icon, but it is showing me Ἴ7 instead.
Doesn't Apple support all Unicode characters?
How can I add my own emojis to my application?
Let me know if my question is not clear.
Doesn't Objective-C use UTF-16 internally, like Java and C#?
If so, then U+1F3C7 wouldn't be "\u1f3c7", but the surrogate-pair, "\uD83C\uDFC7".
Otherwise, there has to be some way to indicate a higher character, because "\u1f3c7" is the same as "\u1f3c" + "7", which is Ἴ7 (capital iota with psili and oxia, then 7).
Edit: After some discussion between the OP and myself, we figured out that the way to do this in Objective-C is the one I know as the C++ way:
"\U0001F3C7"
(\uXXXX with a small u and 4 hex digits works if the code point fits in those 4 hex digits; \UXXXXXXXX with a capital U and 8 hex digits works for everything, but is longer to type).
Now our friend just needs to deal with the matter of font support, which alas is another problem in getting this to actually look as he wants.
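The same escape-length pitfall exists in other languages; a quick check of the surrogate pair (Python assumed here, where \u likewise takes exactly 4 hex digits and \U takes 8):
print("\u1f3c7")                                  # Ἴ7 -- U+1F3C followed by the digit 7
print("\U0001F3C7")                               # 🏇 (if the font supports it)
print("\U0001F3C7".encode("utf-16-be").hex(" "))  # d8 3c df c7 -- the surrogate pair D83C DFC7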

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
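As a rough sketch of that approach (Python assumed; the exact set of categories to allow is a judgement call, not something the standard dictates):
import unicodedata
ALLOWED = ("L", "N", "P", "Zs", "M")   # letters, numbers, punctuation, spaces, combining marks
def looks_like_writing(text):
    return all(unicodedata.category(ch).startswith(ALLOWED) for ch in text)
print(looks_like_writing("naïve café, 123"))   # True
print(looks_like_writing("a\u202eb"))          # False -- U+202E RIGHT-TO-LEFT OVERRIDE is category Cf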
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the part about removing characters from MES-1 was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski