Is the constructiveness of numeral codepoints guaranteed in the Unicode standard?
From the scripts that I've manually checked, they are indeed consecutive (U+30 to U+39 for basic laten and U+2080 to U+2089 for subscripts). However, there are too many numeral sets in the unicode standard for me to check by hand, and doing so says nothing about feature additions.
Any help is appreciated.
Thank you.
No, they are not. For example Suzhou numerals are spread between a bunch of different code points.
All you ever wanted to know about Unicode numerals is here:
http://en.wikipedia.org/wiki/Numerals_in_Unicode
Related
As a C++ developer supporting unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that makes it very hard to determine the case of a letter, convert them or pretty much anything beyond identifying a single known codepoint or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough to not have unicode support builtin the language (i.e. C and C++). Support for unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to unicode! i.e. an encoding that does allow easy identification of character classes, besides having a lookup datastructure (tree, table, whatever), and identifying the relationship between characters? I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode was redesigned from scratch, eg. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some, "İ" in other orthographies, but there are tons of other difficulties – eg. German "ß", which is lowercase, but has no uppercase equivalent (well, it has now, but is rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you if they are upper or lower case, you just memorised the properties table which tells you that 41..5A is upper, 61..7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation, it only assigns codepoints, ie. integers, to character definitions, and it maintains the said tables.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between the codepoints and their byte representation.
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics –, why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.
I have a dataset which mixes use of unicode characters \u0421, 'С' and \u0043, 'C'. Is there some sort of unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
when you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated for codepoints that are similar in use; however, i'm not aware of any extensive list that covers visual similarities across scripts. you might want to search for URL spoofing using intentional misspellings, which was discussed when they came up with punycode. other than that, your best bet might be to search the data for characters outside the expected using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace /с/, 'c'.
I have heard that some characters are not present in the Unicode standard despite being written in everyday life by populations of some areas. Especially I have heard about recent Chinese first names fabricated by assembling existing characters parts, but I can't find any reference for this.
For instance, the character below is very common for 50 million people, yet it was not in Unicode until October 2009:
Is there a list of such characters? (images, or website listing such characters as images)
Also: Here's unicode.org's list of unsupported scripts
Well, there's loads of stuff not present in Unicode (though new characters are still being added).
Some examples:
Due to Han Unification, Unicode uses one codepoint for several similar characters from different languages. People disagree whether these characters are really "the same"; if you believe they should be represented separately, then these separate representations could be said to be "missing" (though this is something of a philosophical question).
In a similar vein, many languages (especially Asian languages) sometimes have several variants of one character/glyph. The distinction between "one character with several representations" (=one codepoint) and "distinct characters" (=different codepoints) is somewhat arbitratry, thus there are cases (e.g. with Kanji characters) where some people feel alternative variants are "missing".
Many historic and rarely used characters are missing.
Many old/historic scripts are not covered, e.g. Demotic. Actually, there is an initiative specifically for including more scripts in Unicode, the Script Encoding Initiative(SEI).
There is also a page by the W3C on this topic, Missing characters and glyphs, with more explanations.
There are tons of characters from the symbol part of the standard that are annoyingly not included.
See the "Missing symmetric versions" section of https://web.archive.org/web/20210830121541/http://xahlee.info/comp/unicode_arrows.html for a bunch of arrow symbols that exist, but only in certain directions. Some are just silly. For example, there is ⥂, ⥃, and ⥄, but there isn't a right pointing version of the last one.
And you can see from http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts that they picked apparently randomly which letters to support in super- and sub-script form. For example, they include the subscript vowels a, e, o, and even schwa (ə), but not i, which would be very useful, as it's a common subscript in mathematical typesetting. Take a look at the wikipedia article for more details (you'll need a unicode font installed, because at least at the time of this writing they regular ascii equivalents are not explicitly listed), but basically they picked about half of the latin alphabet seemingly at random for each of upper- and lower-case super- and sub-script characters.
Also, a lot of symbols that would be convenient for building shapes with unicode do not exist.
It does not support the bilabial trill letter, turned beta, reversed k.
Can someone explaing me about encoding and its importance. I understand that we have various encodings and in each of them first 127 characters are same.
Read Joel Spolsky's excellent article on the subject.
An interesting point that was noted in the discussion of another answer (which I didn't really think the author needed to delete) is that there is a difference between a character set, which (in the other author's words - don't remember his username) defines a mapping between integers and characters (e.g. "Capital A is 65"), and an encoding, which defines how those integers are to be represented in a byte stream. Most old character sets, such as ASCII, have only one very simple encoding: each integer becomes exactly one byte. The Unicode character set, on the other hand, has many different encodings, none of which are equally simple: UTF-8, UTF-16, UTF-32...
Apart from the article above mentioned by Aasmund Eldhuset, I find this Tedx Talk really interesting and explanatory on the same topic.
Hope this helps!
I think encoding is the technique to convert your message into a form that is not readable to unauthorized persons so that you can maintain your secrecy.
I just coded the first version of an efficient glyph-to-texture function which takes ranges of unicode characters to store into one or more pov2 textures and am searching for information regarding which code charts are used in which language. I know that the Unicode Consortium gives this per glyph, but that would take really long to check out on my own.
I'd like to support as many of European languages, Cyrillic not a necessity
Edit: I can use every Latin chart, but I would like to save space with removing some extended charts such as Latin extended-D. I'm pretty sure that the only ext. I need to represent every character in my languages alphabet (Slovenian) is Latin-1 + Latin EXTENDED A, so I save ~600 characters
thanks
This page might be helpful. Scroll down to the bottom for a list of codepoint ranges.
Found out about some lists.