What is the best and safest way to place an Arabic Shaddah (ّ = \u0651 = "ARABIC SHADDA") over a Hebrew letter?
Hebrew textbooks for colloquial Arabic need to place the Shaddah over a letter. This was done in pre-Unicode days, so either fonts were hacked to have this, or else metal glyphs were placed together (in the days of manual typesetting). Unicode, however, doesn't make it easy to position diacritical marks from one language on top of characters from another language.
I'm looking for the best way of doing this in the Unicode standard.
U+0651 is a non-spacing mark, so you should be able to place it right after a base character and it will be applied over that character. This does rely on an appropriate font and whatever is rendering the font to support Unicode correctly.
For example this looks correct to me on Google Chrome and Microsoft Edge browsers ("correct" is relative as I know nothing about Hebrew):
U+05D0 U+0651 אّ HEBREW LETTER ALEF + ARABIC SHADDA
U+05D1 U+0651 בّ HEBREW LETTER BET + ARABIC SHADDA
U+05D2 U+0651 גّ HEBREW LETTER GIMEL + ARABIC SHADDA
Related
I found that the adjacent letters of the Japanese tilde are small if they are Chinese characters, and large if they are kana.
I would like to ask the seniors, is this a feature of the language, or a bug of vscode? See below
~つ
~着
~羽
~番
~足
~度
~キロ(メートル)
The search function, can confirm, is the same character, so I'm confused.
Moreover, my file encoding is UTF-8, there should be no strange character set errors.
I am a developer who is working with Chinese characters. I am trying to convert part of my project into English. I am currently rewriting the project's internationalization module.
I am unfamiliar with the standards for English, so I don't know if non-ascii is used widely?
If it is: Tell me some characters they use frequently.
Standard English spelling uses en dash (–), curly quotation marks (“, ”, ‘, ’); American English also uses em dash (—). Depending on conventions and preferences, several non-Ascii letters may be used, too, especially in words of French or Latin origin, such as é, ë, ç, and æ. Moreover, even in nonspecialized texts, various special character such as superscript two (²), micro sign (µ), and degree sign (°) may be seen.
I'm giving a tech talk about Unicode and encoding in my company, in which I'm trying to make the point that strings are always encoded, and developers should never carelessly assume that everything is 0-127 ASCII.
I have numerous examples of problems caused by mis-encoded text, but I didn't find any example of simple English text with numbers that's encoded above Unicode code point 127.
The basic English alphabet is mapped in Unicode to the same numerical value as the plain old ASCII: The range A-Z is mapped to [65-90] (or [0x41-0x5a] in hex), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
Does the English alphabet appear elsewhere in the code charts? I do not mean circumflex letters or other Latin variants, just the plain English alphabet.
CJK characters are generally monospaced in all fonts, since that's how those languages tend to be written.
When mixing CJK and English characters, however, you run into a problem: ASCII characters do not in general have the width of a CJK character. This means that if you use ASCII, you lose the monospaced property - which may not always be desirable.
For this purpose, fullwidth characters (U+FF00-FFEE, Wikipedia, Unicode code chart) may be used in place of "regular" characters. These have the property that they have the same width as a single CJK character.
Note, however, that fullwidth characters are virtually never used outside of a CJK context, and even in those contexts, plain ASCII is frequently used as well, when monospacing is considered unimportant.
Plenty of punctuation and symbols have code point values above U+007F:
“Hello.”
He had been given the comprehensive sixty-four-crayon Crayola box—including the gold and silver crayons—and would not let me look.
x ≠ y
The above examples use:
U+201C and U+201D — smart quotes
U+2014 — em-dash
U+2260 — not equal to
See the Unicode charts for more.
Well, if you just mean a-z and A-Z then no, there are no English characters above 127. But words like fiancé, resumé etc are sometimes spelled like that in English and use codepoints above 127.
Then there are various punctuation signs, currency symbols and so on that are above 127. Not sure if this counts as simple English text.
Is anyone aware of a way to add diacritics from different unicode blocks to say, latin letters (or latin diacritics to say, Devanagari letters)? For instance:
Oै
I tried the zero-width-joiner in between, but it had no effect. Any ideas?
I know, for instance, that the Arabic combining diacritics will work on latin letters, but Hebrew will not. Is this random?
Accoding to the Unicode Standard, Chapter 2, Section 2.11, “All combining characters can be applied to any base character and can, in principle, be used with any script.” So the Latin letter O followed by the Devanagari vowel sign ai U+0948 is permitted. But the standard adds: “This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.”
So it is up to implementations. But there are some “cross-script” diacritics. For example, the acute accent has been unified with the Greek tonos mark, so the Latin letter é and the Greek letter έ, when decomposed, contain the same diacritic U+0301. Moreover, this combining mark can be placed after a Cyrillic letter, and this can be regarded as normal (though relatively rare) usage, so we can expect good implementations to render it properly.
Worked fine for me. I just typed in the characters. Probably depends on the program rendering the text.
Oै
how can i extract only the characters in a particular language from a file containing language characters, alphanumeric character english alphabets
This depends on a few factors:
Is the string encoded with UTF-8?
Do you want all non-English characters, including things like symbols and punctuation marks, or only non-symbol characters from written languages?
Do you want to capture characters that are non-English or non-Latin? What I mean is, would you want characters like é and ç or would you only want characters outside of Romantic and Germanic alphabets?
and finally,
What programming language are you wanting to do this in?
Assuming that you are using UTF-8, you don't want basic punctuation but are okay with other symbols, and that you don't want any standard Latin characters but would be okay with accented characters and the like, you could use a string regular expression function in whatever language you are using that searches for all non-Ascii characters. This would elimnate most of what you probably are trying to weed out.
In php it would be:
$string2 = preg_replace('/[^(\x00-\x7F)]*/','', $string1);
However, this would remove line endings, which you may or may not want.