I'm working on an application that eventually reads and prints arbitrary and untrustable Unicode characters to the screen.
There are a number of ways to wreck havoc using Unicode strings, and I would like my program to behave correctly for "dangerous" strings. For instance, the RTL override character will make strings look like they're backwards.
Since the audience is mostly programmers, my solution would be to, first, get the type C canonical form of the string, and then replace anything that's not a printable character on its own with the Unicode code point in the form \uXXXXXX. (The intent is not to have a perfectly accurate representation of the string, it is to have a mostly good representation. The full string data is still available.)
My problem, then, is determining what's an actual printable character and what's a non-printable character. Swift has a Character class, but contrary to, say, Java's Character class, the Swift one doesn't seem to have any method to find out the classification of a character.
How could I carry that plan? Is there anything else I should consider?
Related
Where can I get the complete list of all unicode characters that doesn't behave as simple characters. Examples: character 0x0363 (won't be printed without another one before), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters to replace them with something harmless to avoid unwanted output effects. Regular characters (those who not in this list) should use exactly one character place when printed (= cursor moved +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have some unicode string, usually consists of "usual" characters like 0x20-0x7E or cyrillic letters. Also, there are a lot of other unicode characters that are usual and may be safely assumed as having strlen() = 1. The string is printed on the terminal and I should know the resulting position of the cursor. I don't want to use some complex and non-stable libraries to do that, i want to have simplest possible logic to do that. Every problematic character may be replaced with U+0xFFFD or something like "<U+0363>" (ASCII string with its index instead of character itself). I want to have a list of "possibly-problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not much.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❤️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is DŽ, but many fixed-width fonts have a narrow glyph for it: DŽ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But DŽ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).
As a C++ developer supporting unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that makes it very hard to determine the case of a letter, convert them or pretty much anything beyond identifying a single known codepoint or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough to not have unicode support builtin the language (i.e. C and C++). Support for unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to unicode! i.e. an encoding that does allow easy identification of character classes, besides having a lookup datastructure (tree, table, whatever), and identifying the relationship between characters? I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode was redesigned from scratch, eg. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some, "İ" in other orthographies, but there are tons of other difficulties – eg. German "ß", which is lowercase, but has no uppercase equivalent (well, it has now, but is rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you if they are upper or lower case, you just memorised the properties table which tells you that 41..5A is upper, 61..7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation, it only assigns codepoints, ie. integers, to character definitions, and it maintains the said tables.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between the codepoints and their byte representation.
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics –, why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.
I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument that the characters regardless of their shape have the same meaning and therefore should be represented by the same code. Well, it's not meaningless to me because I am analyzing individual characters which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduct that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduct that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.
Trying to rephrase: Can you map every combining character combination into one code point?
I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Is this true for Basic Multilingual Plane also?
If you mean one char == one number (ie: where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher chars.
If you mean the compatibility sequences (ie: where e + ´ => é): there are single-character representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems...but if you're sticking to the ones that people actually use, you'll be fine.
Can you map every combining character
combination into one code point?
Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.
There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.
it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Depends on the meaning of the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character never is a code point, but for many code points, there exists an abstract character that maps to the code point, this mapping being called “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points”
Is this true for Basic Multilingual Plane also?
There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.
Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does but has limitations. I simply don't understand, are the limitation anything big/key or not a big deal?
You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ascii characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, "."), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in unicode-characters of a string, just count the number of characters with the high bit zero (ASCII characters), and the number of characters with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two.
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
What is not supported is iteration by unicode character, there is no standard function for string length in unicode characters etc. So the higher-level kind of Unicode support (like what is available in Python with length, lower -> upper case conversion, encoding in arbitrary coding etc) is not available.
Lua 5.3 was released now. It comes with a basic UTF-8 library.
You can use the utf8 library to do things about UTF-8 encoding, like getting the length of a UTF-8 string (not number of bytes as string.len), matching each characters (not bytes), etc.
It doesn't provide native support other than encoding, like is this character a Chinese character?
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.