Counting Arabic characters based on their shapes - Unicode

Arabic characters change their shape based on their position in a word. I have a long Arabic text, and I want to count all the different Arabic shapes occurring in it. However, the codePointAt() function of JavaScript disappoints me.
For instance, this is arabic "L" => ل
and this is arabic "alif" => ا
If "alif" comes after "L" in a word, they take this shape together => لا
Now JavaScript's codePointAt() separates all the letters of a word before it outputs their Unicode code points. Thus, it sees ل and ا as different characters, which is not what I want.
I am using PHP as a server-side script. It has no Unicode functions as far as I know.
What are my options after that?
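To illustrate what the question describes, here is a minimal Python sketch (Python is used only for illustration; the question itself is about JavaScript and PHP). It shows that the text stores lam and alef as two separate code points even though they are drawn as one joined shape, and that Unicode also has a precomposed lam-alef ligature as a compatibility character, which compatibility normalisation maps back to the two letters. The variable names are chosen for this example only.

```python
import unicodedata

lam, alef = "\u0644", "\u0627"
word = lam + alef  # drawn as one joined shape, stored as two code points

# List the code points actually stored in the string.
code_points = [f"U+{ord(ch):04X}" for ch in word]
print(code_points)

# A precomposed lam-alef ligature exists as a compatibility character.
ligature = "\uFEFB"
ligature_name = unicodedata.name(ligature)
print(ligature_name)

# Compatibility normalisation turns the ligature back into lam + alef.
print(unicodedata.normalize("NFKC", ligature) == word)
```

So counting shapes, as opposed to letters, probably means deciding first how each letter's position in the word maps to a shape, since the stored text does not record that.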

Related

difference between U+06A4 and U+06A8? (ARABIC LETTER VEH and ARABIC LETTER QAF WITH THREE DOTS ABOVE)

I'm interested in these two Unicode characters:
U+06A4 ARABIC LETTER VEH ڤ
U+06A8 ARABIC LETTER QAF WITH THREE DOTS ABOVE ڨ
They seem to render the same when placed in the middle of a word:
بڤر
بڨر
From a developer's point of view, how do I distinguish between them? Should I normalize one to another?
These characters are not used much in Arabic. (I don't know if they are used in other languages that use the Arabic script.)
I don't know the official answer on this, but this is what I can gather. This Wikipedia page is very helpful: Ve (Arabic letter)
The first character U+06A4 ARABIC LETTER VEH ڤ is meant to be the letter representing the "v" sound in Arabic, used when transliterating words from foreign languages (since "v" is not part of the usual Arabic alphabet). Not all Arabs in Arab countries use this letter this way. It looks identical to the second character U+06A8, except when it comes to the final form and the isolated form. Think of it as ف but with three dots instead of one.
The second character U+06A8 ARABIC LETTER QAF WITH THREE DOTS ABOVE ڨ is meant to be the letter representing the "g" sound in some Arabic dialects, also used when transliterating words from foreign languages (since "g" is not part of the modern standard Arabic alphabet). Think of it as ق but with three dots instead of two.
This table illustrates the differences in the isolated and final forms (I am using U+0640 ARABIC TATWEEL ـ to form the initial, medial and final forms):
| Position in word | Isolated | Final | Medial | Initial |
|---|---|---|---|---|
| U+06A4 Veh | eg: ڤ | eg: ـڤ | eg: ـڤـ | eg: ڤـ |
| U+06A8 Qaf with three dots above | eg: ڨ | eg: ـڨ | eg: ـڨـ | eg: ڨـ |
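The table entries can be generated rather than typed by hand. The following is a small sketch, assuming the same trick as above (joining the letter to U+0640 ARABIC TATWEEL on one or both sides); the helper name `positional_forms` is made up for this example.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, used here only to force joining

def positional_forms(letter):
    """Return strings that make a renderer show each positional form."""
    return {
        "isolated": letter,
        "final": TATWEEL + letter,
        "medial": TATWEEL + letter + TATWEEL,
        "initial": letter + TATWEEL,
    }

for ch in ("\u06A4", "\u06A8"):
    # The official character name is the reliable way to tell the two apart.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    for position, text in positional_forms(ch).items():
        print("  ", position, text)
```

How the forms actually look still depends on the font and the text renderer; the code only builds the strings.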
Neither of these characters changes when normalised, as demonstrated by this Python script:
>>> veh = "\u06A4"
>>> qaf3 = "\u06A8"
>>> from unicodedata import normalize
>>> for form in ["NFC", "NFKC", "NFD", "NFKD"]:
...     print(normalize(form, veh) == veh)
...     print(normalize(form, qaf3) == qaf3)
...
True
True
# etc

placing an arabic shadda on a hebrew letter

What is the best and safest way to place an Arabic Shaddah (ّ = \u0651 = "ARABIC SHADDA") over a Hebrew letter?
Hebrew textbooks for colloquial Arabic need to place the Shaddah over a letter. This was done in pre-Unicode days, so either fonts were hacked to have this, or else metal glyphs were placed together (in the days of manual typesetting). Unicode, however, doesn't make it easy to position diacritical marks from one language on top of characters from another language.
I'm looking for the best way of doing this in the Unicode standard.
U+0651 is a non-spacing mark, so you should be able to place it right after a base character and it will be applied over that character. This does rely on an appropriate font and whatever is rendering the font to support Unicode correctly.
For example this looks correct to me on Google Chrome and Microsoft Edge browsers ("correct" is relative as I know nothing about Hebrew):
U+05D0 U+0651 אّ HEBREW LETTER ALEF + ARABIC SHADDA
U+05D1 U+0651 בّ HEBREW LETTER BET + ARABIC SHADDA
U+05D2 U+0651 גّ HEBREW LETTER GIMEL + ARABIC SHADDA
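The three sequences above can be built and inspected programmatically. This is a minimal Python sketch, assuming nothing beyond the standard `unicodedata` module; it confirms that U+0651 is classified as a non-spacing mark and builds each base-plus-mark pair in the order described (base character first, mark second). The variable names are illustrative only.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA

# "Mn" is the Unicode general category for non-spacing marks.
print(unicodedata.category(SHADDA))

pairs = []
for base in "\u05D0\u05D1\u05D2":  # alef, bet, gimel
    combined = base + SHADDA  # base first, combining mark second
    pairs.append(combined)
    code_points = " ".join(f"U+{ord(c):04X}" for c in combined)
    print(code_points, combined, unicodedata.name(base), "+", unicodedata.name(SHADDA))
```

Whether the mark is positioned well over the Hebrew letter is still up to the font and the rendering engine; the code only guarantees the character order.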

Display issue with diacritics for a phonetic alphabet

I need to write Unicode characters and diacritics in a web page. They are part of a phonetic alphabet designed for Romanist studies (the Bourciez Alphabet). My problem is a display issue, I believe: the character codes are all OK in Unicode, but some diacritics are not displayed as expected.
Most notably, the 'COMBINING DOUBLE BREVE BELOW' (U+035C) does not display as expected: it appears not under the 2 letters to which it is supposed to apply, but under the last of those letters and the next character (another letter, or a space).
Here for instance, the combining diacritic should be under the first 2 "a" characters, but it is displayed under the 2nd and 3rd "a"; yet you can see that the combination has been applied to the first 2 "a"s, because they are displayed in smaller size than the normal "a"s:
[image: result of combining double breve below]
I'm using fonts which have those characters (I tried Arial Unicode MS, Gentium, and Lucida Sans Unicode). They all have the same display issue.
Any idea how I can solve this issue?
I'm having trouble reproducing the problem. Connecting two characters with the breve diacritic seems to be working for me. First I enter the first character of the pair, then the U+035C character, finally the second and it shows as follows.
[image: sample image]
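The ordering described in the answer (first character, then U+035C, then second character) can be written out as follows. This is only a sketch in Python; the helper name `tie_below` is invented for the example, and it does not change how any particular font or browser draws the mark.

```python
import unicodedata

DOUBLE_BREVE_BELOW = "\u035C"  # COMBINING DOUBLE BREVE BELOW

def tie_below(first, second):
    """Join two characters with the double breve placed between them."""
    return first + DOUBLE_BREVE_BELOW + second

# The mark sits between the first two letters; the third "a" is plain.
text = tie_below("a", "a") + "a"
print(text)
print([f"U+{ord(c):04X}" for c in text])
print(unicodedata.name(DOUBLE_BREVE_BELOW))
```

If the mark is typed after both letters instead of between them, it attaches to the second letter and the character that follows, which matches the symptom in the question.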

how to transform Unicode characters to a different font?

I was able to transform Sinhala Unicode characters to symbols by just copying those characters into MS Word and changing the font to Times New Roman. The letters are shown in the linked image:
sequence of symbols and letters = fnda, rduqj - .Kl rduqj
But now I can't change those Unicode characters into a sequence of symbols and letters. Every time I paste those characters, it doesn't allow me to change to another font type. How can I make it changeable, or is there a better way of getting that sequence of letters?

to extract characters of a particular language

How can I extract only the characters of a particular language from a file that contains characters from that language mixed with alphanumeric characters and English letters?
This depends on a few factors:
Is the string encoded with UTF-8?
Do you want all non-English characters, including things like symbols and punctuation marks, or only non-symbol characters from written languages?
Do you want to capture characters that are non-English or non-Latin? What I mean is, would you want characters like é and ç, or would you only want characters outside of the alphabets of Romance and Germanic languages?
and finally,
What programming language are you wanting to do this in?
Assuming that you are using UTF-8, that you don't want basic punctuation but are okay with other symbols, and that you don't want any standard Latin characters but would be okay with accented characters and the like, you could use a regular expression function in whatever language you are using that searches for all non-ASCII characters. This would eliminate most of what you are probably trying to weed out.
In php it would be:
// Remove every ASCII byte (0x00 to 0x7F) so that only the non-ASCII characters remain.
$string2 = preg_replace('/[\x00-\x7F]+/', '', $string1);
However, this would remove line endings, which you may or may not want.
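For comparison, here is a rough Python sketch of the same idea (delete every ASCII character and keep everything else), with an option to preserve line endings. The function name `strip_ascii` and the `keep_newlines` flag are made up for this example, and it makes the same assumptions as above: anything outside ASCII counts as wanted.

```python
import re

# All ASCII characters, including line feed and carriage return.
ALL_ASCII = re.compile(r"[\x00-\x7F]+")
# All ASCII characters except line feed (\x0A) and carriage return (\x0D).
ASCII_EXCEPT_NEWLINES = re.compile(r"[\x00-\x09\x0B\x0C\x0E-\x7F]+")

def strip_ascii(text, keep_newlines=True):
    """Return text with ASCII characters removed, optionally keeping line endings."""
    pattern = ASCII_EXCEPT_NEWLINES if keep_newlines else ALL_ASCII
    return pattern.sub("", text)

sample = "abc \u00e9\n\u06441"
print(repr(strip_ascii(sample)))
print(repr(strip_ascii(sample, keep_newlines=False)))
```

This does not pick out one specific language; narrowing the result to a single script would need a more specific character range than "not ASCII".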