Difference between U+06A4 and U+06A8? (ARABIC LETTER VEH and ARABIC LETTER QAF WITH THREE DOTS ABOVE) - unicode

I'm interested in these two Unicode characters:
U+06A4 ARABIC LETTER VEH ڤ
U+06A8 ARABIC LETTER QAF WITH THREE DOTS ABOVE ڨ
They seem to render the same when placed in the middle of a word:
بڤر
بڨر
From a developer's point of view, how do I distinguish between them? Should I normalize one to another?

These characters are not used much in Arabic. (I don't know whether they are used in other languages that use the Arabic script.)
I don't know the official answer on this, but this is what I can gather. This Wikipedia page is very helpful: Ve (Arabic letter)
The first character U+06A4 ARABIC LETTER VEH ڤ is meant to be the letter representing the "v" sound in Arabic, used when transliterating words from foreign languages (since "v" is not part of the usual Arabic alphabet). Not all Arabs in Arab countries use this letter this way. It looks identical to the second character U+06A8, except when it comes to the final form and the isolated form. Think of it as ف but with three dots instead of one.
The second character U+06A8 ARABIC LETTER QAF WITH THREE DOTS ABOVE ڨ is meant to be the letter representing the "g" sound in some Arabic dialects, also used when transliterating words from foreign languages (since "g" is not part of the modern standard Arabic alphabet). Think of it as ق but with three dots instead of one.
This table illustrates the differences in the isolated and final forms (I am using U+0640 ARABIC TATWEEL ـ to form the initial, medial and final forms):
| Position in word | Isolated | Final | Medial | Initial |
|---|---|---|---|---|
| U+06A4 Veh | ڤ | ـڤ | ـڤـ | ڤـ |
| U+06A8 Qaf with three dots above | ڨ | ـڨ | ـڨـ | ڨـ |
Neither of these characters changes under any normalisation form, as this Python script demonstrates:
>>> veh = "\u06A4"
>>> qaf3 = "\u06A8"
>>> from unicodedata import normalize
>>> for form in ["NFC", "NFKC", "NFD", "NFKD"]:
...     print(normalize(form, veh) == veh)
...     print(normalize(form, qaf3) == qaf3)
...
True
True
# ... six more True lines
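Since normalization leaves both characters alone, the way to distinguish them in code is simply by code point, or by character name via `unicodedata.name()`. A minimal sketch:

```python
import unicodedata

veh = "\u06A4"   # think of it as ف with three dots
qaf3 = "\u06A8"  # think of it as ق with three dots

# The code points are distinct even when a font renders the
# initial/medial forms identically:
print(veh == qaf3)             # False
print(unicodedata.name(veh))   # ARABIC LETTER VEH
print(unicodedata.name(qaf3))  # ARABIC LETTER QAF WITH THREE DOTS ABOVE
```

So there is no need to normalize one to the other; whether you *should* fold them together depends on your application, not on Unicode.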

Related

Delete Greek letters in string

I am pre-processing dirty text fields. I have managed to delete single characters and numbers, but there are still Greek letters (from formulas) that I want to delete altogether.
Any type of Greek letter can occur at any position in the string.
Any ideas how to do it?
select regexp_replace(' ω ω α ω alkanediylbis alkylimino bis alkanolpolyethoxylate the formula where straight branched chain alkylene group also known alkanediyl group that has the range carbon atoms and least carbon atoms length and can the same different and are primary alkyl groups which contain carbon atoms each and can the same different and are alkylene groups which contain the range from carbon atoms each and and are the same different numerals the range each ', '\W+', '')
[Α-Ωα-ω] will match the standard Greek alphabet. (Note that the Α here is a distinct character from the Latin A, though they probably look identical).
Some commonly-used symbols are outside of the standard alphabet, so at the very least, you probably want to match the whole Greek Unicode block using [\u0370-\u03FF].
Unicode also has
The Greek Extended block containing letters with diacritics
The Coptic block with some very similar-looking characters
The Mathematical Operators block with its own ∆/∏/∑ symbols
Several copies of the Greek alphabet in the Mathematical Alphanumeric Symbols block
...and probably more.
Rather than trying to list everything you want to replace, it might be easier to list the things you want to keep. For example, to remove everything outside the printable ASCII range:
select regexp_replace(
'ABCΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω123',
'[^\u0020-\u007E]', '', 'g'
);
regexp_replace
----------------
ABC123
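The same whitelist idea works outside the database too. A minimal Python sketch, keeping only printable ASCII (U+0020 to U+007E) with `re.sub`:

```python
import re

# Keep only printable ASCII and drop everything else, Greek included.
text = "ABC\u0391\u03b1\u0392\u03b2123"  # 'ABCΑαΒβ123'
cleaned = re.sub(r"[^\u0020-\u007E]", "", text)
print(cleaned)  # ABC123
```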

How to compare the same character with multiple code points? [duplicate]

Is there a standard way, in Python, to normalize a Unicode string, so that it comprises only the simplest Unicode entities that can be used to represent it?
I mean, something which would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE']?
Here is where the problem shows:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "á"
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the chars and do manual replacements, etc., but it is not efficient, and I'm pretty sure I would miss half of the special cases, and do mistakes.
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A - U+0301 COMBINING ACUTE ACCENT combination and U+00E1 LATIN SMALL LETTER A WITH ACUTE code points you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you the decomposed form, using combining characters.
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
Note that there is no guarantee that composition and decomposition round-trip; decomposing a character to NFD form and then converting the result back to NFC form does not always reproduce the original character. The Unicode standard maintains a list of exceptions; characters on this list are decomposable, but will not be recomposed back to their combined form, for various reasons. Also see the documentation on the Composition Exclusion Table.
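A concrete case (my own example, not from the question) is U+2126 OHM SIGN, a singleton decomposition that is automatically excluded from composition:

```python
import unicodedata

ohm = "\u2126"  # OHM SIGN
# NFD decomposes OHM SIGN to GREEK CAPITAL LETTER OMEGA (U+03A9)...
decomposed = unicodedata.normalize("NFD", ohm)
print(hex(ord(decomposed)))  # 0x3a9
# ...but NFC will not compose it back: the round trip loses U+2126.
print(unicodedata.normalize("NFC", decomposed) == ohm)  # False
```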
Yes, there is.
unicodedata.normalize(form, unistr)
You need to select one of the four normalization forms.
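For the comparison in the question's title, the usual pattern is to normalize both strings to the same form before comparing. A small sketch:

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# U+00E1 and the pair U+0061 U+0301 are canonically equivalent:
print("\u00e1" == "a\u0301")           # False: different code points
print(nfc_equal("\u00e1", "a\u0301"))  # True: equal after normalization
```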

Counting Arabic characters based on their shapes

Arabic characters change their shape based on their position in a word. I have a long Arabic text, and I want to count all the different Arabic shapes occurring in the text. However, JavaScript's codePointAt() function disappoints me.
For instance, this is arabic "L" => ل
and this is arabic "alif" => ا
If "alif" comes after "L" in a word they takes this shape together => لا
Now javascripts codesPointAt() separates all letters of a word before it outputs the unicode number of the letters. Thus, it sees ل and ا as different characters, which is not what I want.
I am using PHP as a server-side script. It has no unicode functions as far as I know.
What are my options after that?
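One observation (not from the thread) that may help: the positional glyphs and ligatures do have their own code points in the Arabic Presentation Forms blocks (U+FB50-U+FDFF and U+FE70-U+FEFF), and NFKD maps them back to the base letters. Regular text stores only the base letters, though, so counting shapes requires shaping the text first (with a shaping library or engine), not just reading code points. A small Python sketch of the decomposition direction:

```python
import unicodedata

# U+FEFB is the shaped lam-alef ligature as a single code point,
# from the Arabic Presentation Forms-B block.
lam_alef = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
# NFKD maps the presentation form back to the two base letters:
base = unicodedata.normalize("NFKD", lam_alef)
print([unicodedata.name(c) for c in base])
# ['ARABIC LETTER LAM', 'ARABIC LETTER ALEF']
```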

Display issue with diacritics for a phonetic alphabet

I need to write unicode characters and diacritics in a web page. They are part of a phonetic alphabet designed for romanist studies (the Bourciez Alphabet). My problem is a display issue, I believe: the character codes are all OK in unicode, but some diacritics are not displayed as expected.
Most notably, the 'COMBINING DOUBLE BREVE BELOW' (U+035C) does not display as expected: it appears not under the 2 letters to which it is supposed to apply, but under the last of those letters and the next character (another letter, or a space).
Here for instance, the combining diacritic should be under the first 2 "a" characters, but it is displayed under the 2nd and 3rd "a"; yet you can see that the combination has been applied to the first 2 "a"s, because they are displayed in smaller size than the normal "a"s:
[image: result of combining double breve below]
I'm using fonts which have those characters (I tried Arial MS Unicode, Gentium, and Lucida Sans Unicode). They all have the same display issue.
Any idea how I can solve this issue?
I'm having trouble reproducing the problem. Connecting two characters with the breve diacritic seems to be working for me. First I enter the first character of the pair, then the U+035C character, finally the second and it shows as follows.
[image: sample image]
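For reference, the code-point order matters here: a double diacritic such as U+035C goes *between* the two base letters it spans. A quick Python check of such a sequence:

```python
# U+035C COMBINING DOUBLE BREVE BELOW sits between the two base
# letters it connects; the sequence is base, diacritic, base.
s = "a\u035ca"
print(len(s))                    # 3 code points
print([hex(ord(c)) for c in s])  # ['0x61', '0x35c', '0x61']
```

If the mark appears shifted one position to the right, the diacritic was probably typed after the second letter rather than between the two.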

Replacing non-Latin chars with Latin counterparts like Phonebook App

I have a phonebook app where I generate the title for section headers by comparing the first letters of entries.
The indexes are predefined so I expect letters to be assigned from A-Z and for numbers #.
The problem is that there are many letters with accents, including ü, İ, ç etc., in many languages. In my approach, since these chars do not fall in the range A-Z, they are assigned to #, which is not desired.
The native iOS Phonebook app assigns, for example, ü to U, and so on. Is there a simple way to do this mapping without defining a set of chars?
Thanks.
Check out Unicode Normalization. You probably want some combination of NFD and extraction of the adequate data. If you look at this file from Unicode, you will see something like
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9
Where 00E9, i.e. 'é', is decomposed as 0065 0301. You pick up 0065 (e) and discard 0301 (the combining acute accent). This file should get you started nicely. There may be equivalent functions in Objective-C/iOS, but I wouldn't know where to start...
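The decompose-and-discard technique described above can be sketched in Python; the same idea applies on iOS, though the exact API call is platform-specific:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """NFD-decompose, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("\u00fc \u00e7 \u00e9"))  # u c e
```

Note that this only handles characters that actually decompose into a base letter plus marks; letters like ø or ß have no such decomposition and would need explicit handling.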