In comics lettering, some authors differentiate between a "capital I with crossbars" and a "capital I without crossbars", generally using the crossbar version for the first-person pronoun and the other for ordinary words. An example can be seen in this picture from Dave Gibbons's lettering in Watchmen.
I was wondering if there is a Unicode code point for the "capital I with crossbars" glyph that is distinct from the normal Latin one.
http://www.fileformat.info/info/unicode/char/49/index.htm lists no such related letter. You might find a font where e.g. ROMAN NUMERAL ONE is rendered differently from the regular LATIN CAPITAL LETTER I, but Unicode does not currently seem to have a pair of characters with this precise distinction.
Update: According to Wikipedia this was introduced in Unicode 9.0 (2016). Thanks #pelson for the link!
Related
I need the copyright symbol like this: 'ⓒ'. I googled how to produce it and found the answer, Unicode \u00a9. But on some (Android) devices it shows up bold. (I don't know how to explain it... the 'c' looks fine, but the circle around it looks bold.) So my colleague says to use \u24d2. Yes, it looks perfect on every device, but I don't know if it's proper. What's the difference between \u00a9 and \u24d2?
Unicode codepoint U+00A9 (©) is COPYRIGHT SIGN, and belongs to the "C1 Controls and Latin-1 Supplement" family of codepoints (U+0080 - U+00FF).
Unicode codepoint U+24D2 (ⓒ) is CIRCLED LATIN SMALL LETTER C, and its capital counterpart U+24B8 (Ⓒ) is CIRCLED LATIN CAPITAL LETTER C. Both belong to the "Enclosed Alphanumerics" family of codepoints (U+2460 – U+24FF), which includes all kinds of letters and numbers wrapped inside circles, parentheses, etc.
While U+00A9 and U+24D2 (or U+24B8) may look similar, they are semantically very different things: one is the copyright sign, the others are just letters with a circle drawn around them.
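The semantic difference can be checked from the character database itself. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration. It reports each character's official name and its NFKC (compatibility-normalized) form.

```python
import unicodedata

def describe(ch):
    """Return (code point label, official name, NFKC form) for one character."""
    label = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    return label, name, nfkc

# The copyright sign and the two circled letters discussed above.
for ch in ("\u00a9", "\u24b8", "\u24d2"):
    print(describe(ch))
```

The circled letters fold to a plain "C" or "c" under compatibility normalization, while the copyright sign is left alone, which is one practical reason not to substitute one for the other in text that may later be normalized or searched.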
I am new to Unicode and have been given the requirement to look at some translated text, iterate over all of the characters of that translation, and determine whether all the characters are valid for the target culture (language and location).
For example, if I am translating a document from English to Greek, I want to detect any English/ASCII "A"s in the Greek translation and report them as errors. This could happen because of corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there any existing strategy for developing this kind of grouping? I see that there is some grouping of characters at (http://www.unicode.org/charts/). But it seems that this is not quite what I am looking for at first glance.
Does anything exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"?
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: “main/standard”, “auxiliary”, “index”, and “punctuation”.
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data at CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that in normal Greek texts, Latin letters are not uncommon, especially in technical areas. For example, the international symbol for the ampere is “A” as a Latin letter; the symbol for the kilogram is “kg”, in Latin letters, even though the word for it is written in Greek letters in Greek.
Thus, no matter how you run the analysis, the occurrence of Latin “A” in Greek text could be flagged as potentially suspicious, but not an error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.
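As a rough illustration of the "flag, don't reject" approach, here is a minimal Python sketch. It does not read CLDR data: the standard library has no script property, so this assumes that a letter's script can be guessed from a word in its Unicode character name (e.g. "GREEK" or "LATIN"). The function name `suspicious_chars` is hypothetical, and a real check would more likely use the CLDR exemplar sets through ICU.

```python
import unicodedata

def suspicious_chars(text, expected_script="GREEK"):
    """Return (index, char, name) for letters whose name lacks the expected script word.

    Assumption: the script is inferred from the character name, which is a
    heuristic, not the real Unicode Script property.
    """
    hits = []
    for i, ch in enumerate(text):
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if expected_script not in name.split():
            hits.append((i, ch, name))
    return hits

# Three Greek capitals followed by a Latin capital A.
sample = "\u0391\u0392\u0393 A"
for index, ch, name in suspicious_chars(sample):
    print(f"position {index}: {ch!r} ({name}) looks out of place")
```

The result is a list of candidates for human review rather than a list of errors, since a Latin "A" may be a legitimate unit symbol.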
Is anyone aware of a way to add diacritics from a different Unicode block to, say, Latin letters (or Latin diacritics to, say, Devanagari letters)? For instance:
Oै
I tried the zero-width-joiner in between, but it had no effect. Any ideas?
I know, for instance, that the Arabic combining diacritics will work on Latin letters, but Hebrew ones will not. Is this random?
According to the Unicode Standard, Chapter 2, Section 2.11, “All combining characters can be applied to any base character and can, in principle, be used with any script.” So the Latin letter O followed by the Devanagari vowel sign ai U+0948 is permitted. But the standard adds: “This does not create an obligation on implementations to support all possible combinations equally well. Thus, while application of an Arabic annotation mark to a Han character or a Devanagari consonant is permitted, it is unlikely to be supported well in rendering or to make much sense.”
So it is up to implementations. But there are some “cross-script” diacritics. For example, the acute accent has been unified with the Greek tonos mark, so the Latin letter é and the Greek letter έ, when decomposed, contain the same diacritic U+0301. Moreover, this combining mark can be placed after a Cyrillic letter, and this can be regarded as normal (though relatively rare) usage, so we can expect good implementations to render it properly.
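The shared diacritic can be verified with canonical decomposition. This is a small sketch using Python's standard `unicodedata` module; the helper name `decomposed` is made up for the example. It also builds the cross-script sequence from the question, which is valid as text even if a given renderer displays it poorly.

```python
import unicodedata

def decomposed(ch):
    """Return the NFD code points of ch as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

latin = decomposed("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
greek = decomposed("\u03ad")  # GREEK SMALL LETTER EPSILON WITH TONOS
print(latin, greek)           # both end in the same combining mark

# Latin O followed by DEVANAGARI VOWEL SIGN AI: two code points, one intended cluster.
mixed = "O" + "\u0948"
print(len(mixed), unicodedata.category("\u0948"))
```

Whether `mixed` is drawn as a single stacked glyph depends on the font and the shaping engine, not on the string itself.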
Worked fine for me. I just typed in the characters. Probably depends on the program rendering the text.
Oै
Does someone know an easy way to find characters in Unicode that are similar to ASCII characters? An example is the "CYRILLIC SMALL LETTER DZE (ѕ)". I'd like to do a search and replace for similar characters. By similar I mean similar to a human reader: you can't see a difference by looking at it.
As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)
If I were you, to spare yourself the tedious work of assembling a list of characters, I'd search for resources on homograph attacks: this is a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also -- and that may be what you most need -- a "confusables" table. Here's another article with mainly punctuation marks, some of which are ASCII, that have visually similar counterparts in the non-ASCII code tables.
What I do hope is that you aren't asking the question to construct such an attack.
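For the search-and-replace itself, a translation table is usually enough once you have a list of lookalikes. The sketch below uses a tiny hand-picked mapping purely for illustration; it is not taken from the "confusables" table, and the names `LOOKALIKES` and `to_ascii_lookalikes` are hypothetical. In practice the mapping would be generated from whatever confusables data you settle on.

```python
# Hypothetical, hand-picked lookalikes for illustration only;
# a real table would be far larger and built from published data.
LOOKALIKES = {
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}
TABLE = str.maketrans(LOOKALIKES)

def to_ascii_lookalikes(text):
    """Replace each mapped character with its ASCII lookalike; leave the rest alone."""
    return text.translate(TABLE)

# "password" spelled with three Cyrillic letters mixed in.
print(to_ascii_lookalikes("pa\u0455\u0455w\u043erd"))
```

Characters that are not in the table pass through unchanged, so the function is safe to run over text that is already plain ASCII.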
See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
Each line describes a Unicode character, for example:
1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;
If a character has a compatibility equivalent, it appears in the decomposition field of the entry (the sixth field), tagged with <compat>. In this example, LATIN SMALL LETTER A WITH RIGHT HALF RING is compatible with the sequence 0061 (ASCII a) followed by 02BE.
As for your character, the entry is
0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405
which, as you can see, does not specify a compatibility character.
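To pull the decomposition out of such entries programmatically, the semicolon-separated fields can be split directly. This is a minimal sketch that relies on the field layout visible in the two lines quoted above (code point first, name second, decomposition sixth); the function name `parse_unicode_data_line` is made up for the example.

```python
def parse_unicode_data_line(line):
    """Split one UnicodeData.txt entry into (code point, name, tag, mapping).

    tag is the decomposition tag without angle brackets (e.g. 'compat'),
    or None when the entry has no tag; mapping is a list of code points.
    """
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)
    name = fields[1]
    parts = fields[5].split()  # sixth field: decomposition
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    mapping = [int(p, 16) for p in parts]
    return code_point, name, tag, mapping

half_ring = parse_unicode_data_line(
    "1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;"
)
dze = parse_unicode_data_line(
    "0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405"
)
print(half_ring)
print(dze)
```

For DZE the mapping comes back empty, which confirms that the database records no equivalence with ASCII "s" even though the glyphs look alike.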
I just coded the first version of an efficient glyph-to-texture function which takes ranges of Unicode characters to store into one or more pov2 textures, and I am searching for information on which code charts are used in which language. I know that the Unicode Consortium gives this per glyph, but that would take really long to check on my own.
I'd like to support as many European languages as possible; Cyrillic is not a necessity.
Edit: I can use every Latin chart, but I would like to save space by removing some extended charts such as Latin Extended-D. I'm pretty sure that the only extension I need to represent every character in my language's alphabet (Slovenian) is Latin-1 + Latin Extended-A, so I save ~600 characters.
thanks
This page might be helpful. Scroll down to the bottom for a list of codepoint ranges.
Found out about some lists.
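One way to test a guess like "Latin-1 + Latin Extended-A is enough" is to check an alphabet against the chosen ranges before building the texture. The sketch below assumes the standard block boundaries (Basic Latin U+0000–U+007F, Latin-1 Supplement U+0080–U+00FF, Latin Extended-A U+0100–U+017F) and includes Basic Latin, which the question does not mention explicitly. The names `RANGES`, `SLOVENIAN` and `outside_ranges` are hypothetical.

```python
# Candidate blocks to bake into the texture (inclusive code point ranges).
RANGES = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
]

# The 25 letters of the Slovenian alphabet, lowercase.
SLOVENIAN = "abc\u010ddefghijklmnoprs\u0161tuvz\u017e"

def outside_ranges(text, ranges=RANGES):
    """Return the sorted, de-duplicated characters of text not covered by ranges."""
    return sorted(
        {ch for ch in text if not any(lo <= ord(ch) <= hi for lo, hi in ranges)}
    )

missing = outside_ranges(SLOVENIAN + SLOVENIAN.upper())
print(missing or "all letters covered")
```

The same function can be run over sample text for each additional language you want to support, to see which extra blocks it would actually require.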