Whether or not it is optimal, I am trying to identify specific characters by their hexadecimal codes. (Is there a better way to identify alphabetic, Arabic, Chinese, or Japanese characters?)
http://play.golang.org/p/b81_rgXr3G
fmt.Printf("%x \n", "가") //eab080
fmt.Printf("%x \n", "ㅎ") //e3858e
So it is true that, for these Korean characters,
e3858e < eab080
Then my question is: do we have any table or chart of each language's hexadecimal boundaries?
I mean, for English
fmt.Printf("%x \n", "A") //41
fmt.Printf("%x \n", "z") //7a
Then 41 < 7a.
As you can see, the English alphabet is bounded between 41 and 7a.
I am trying to do the same thing for other writing systems that do not use the Latin alphabet.
Do I need the unicode package to identify different writing systems? The unicode standard library seems only to provide encoding and decoding of the English alphabet.
Thanks in advance.
No, we do not have any table or chart for each language’s hexadecimal boundary. There is some data about characters typically used in various languages.
This answers the question asked, but you should consider whether that was your real problem. The question refers to writing systems, alphabets, and languages as if they were one thing; they are separate concepts. You should define your practical problem: what information do you really need? In a text in some language, any Unicode character may appear.
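If the underlying need is to tell which script (rather than language) a character belongs to, note that Go's standard unicode package already ships range tables for scripts such as unicode.Latin, unicode.Hangul, unicode.Han, and unicode.Arabic, so there is no need to compare raw UTF-8 bytes. A minimal sketch; the scriptOf helper and its labels are mine, for illustration:

package main

import (
    "fmt"
    "unicode"
)

// scriptOf returns a rough script label for a single rune, using the
// script range tables from the standard unicode package.
func scriptOf(r rune) string {
    switch {
    case unicode.Is(unicode.Latin, r):
        return "Latin"
    case unicode.Is(unicode.Hangul, r):
        return "Hangul"
    case unicode.Is(unicode.Han, r):
        return "Han"
    case unicode.Is(unicode.Arabic, r):
        return "Arabic"
    default:
        return "other/unknown"
    }
}

func main() {
    for _, r := range "A가ㅎ漢ب" {
        fmt.Printf("%c %U %s\n", r, r, scriptOf(r))
    }
}

unicode.Is reports whether the rune falls in the given range table; this classifies by script, which, as noted above, is not the same thing as classifying by language.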
By the way, English also has (at least in some forms of the language) words like fiancé, coöperation, rôle, anæmia, belovèd, etc.
Related
Users sometimes use weird ASCII characters in a program, and I was wondering if there is a way to "normalize" them.
So basically, if the input is ᴀʙᴄᴅᴇꜰɢ, the output would be ABCDEFG. Is there a dictionary somewhere that does something like this? If not, is there a better method than just doing something like str.replace("ᴀ", "A") for all the different "fonts"?
This isn't a language-specific question -- if something like this doesn't exist, then I guess the next step is to create the dictionary myself.
Yes.
By the way, the technical terms are: Latin Capital Letters from the C0 Controls and Basic Latin block, and Latin Letter Small Capitals from the Phonetic Extensions block.
Anyway, the general topic for your question is Unicode confusables. The link is for a mapping. Unicode.org has more material on confusables and everything else Unicode.
(Normalization is always something to consider when processing Unicode text, but it doesn't particularly relate to this issue.)
Your example seems to involve Unicode characters, not ASCII characters. Unicode normalization (FAQ) is a large and complex subject, with many different equivalence classes of characters, depending on what you are trying to do.
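As far as I can tell, most of these small capitals have no compatibility decomposition, so NFKC normalization alone will not fold them to ASCII; you end up needing an explicit mapping, ideally generated from the confusables data mentioned above. A minimal Go sketch, with a hand-written table covering only the runes from the example:

package main

import (
    "fmt"
    "strings"
)

// smallCaps maps a few "small capital" runes to their ASCII
// counterparts. This table is illustrative only; a real one would be
// generated from Unicode's confusables data.
var smallCaps = map[rune]rune{
    'ᴀ': 'A', 'ʙ': 'B', 'ᴄ': 'C', 'ᴅ': 'D',
    'ᴇ': 'E', 'ꜰ': 'F', 'ɢ': 'G',
}

func foldSmallCaps(s string) string {
    return strings.Map(func(r rune) rune {
        if repl, ok := smallCaps[r]; ok {
            return repl
        }
        return r // leave everything else untouched
    }, s)
}

func main() {
    fmt.Println(foldSmallCaps("ᴀʙᴄᴅᴇꜰɢ")) // ABCDEFG
}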
What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
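As an illustration, here is a hedged Go sketch that pulls the code point, name, and General Category out of each line of UnicodeData.txt; the local file path and the choice to keep only Lu are assumptions for the example:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("UnicodeData.txt") // assumed to be in the working directory
    if err != nil {
        panic(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Split(scanner.Text(), ";")
        if len(fields) < 3 {
            continue
        }
        codePoint, name, category := fields[0], fields[1], fields[2]
        if category == "Lu" { // e.g. keep only uppercase letters
            fmt.Println(codePoint, name)
        }
    }
    if err := scanner.Err(); err != nil {
        panic(err)
    }
}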
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
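For what it's worth, in Go such a set of General Category values can be expressed directly with the category range tables in the unicode package. This sketch (my suggestion, not necessarily the subset you want) accepts letters, marks, numbers, punctuation, symbols, and plain spaces, and rejects control and format characters:

package main

import (
    "fmt"
    "unicode"
)

// allowed reports whether r falls in a General Category we accept.
// unicode.L, .M, .N, .P, .S, and .Zs are the standard category tables
// for letters, marks, numbers, punctuation, symbols, and spaces.
func allowed(r rune) bool {
    return unicode.In(r, unicode.L, unicode.M, unicode.N, unicode.P, unicode.S, unicode.Zs)
}

func main() {
    for _, r := range "héllo\u200B!" { // contains a ZERO WIDTH SPACE (category Cf)
        fmt.Printf("%U allowed=%v\n", r, allowed(r))
    }
}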
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski
I found this question, which gives me the ability to check whether a string contains a Chinese character. I'm not sure if the Unicode ranges are correct, but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell you whether the character is traditional or simplified Chinese. How would you go about finding that out?
Update:
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument is that the characters, regardless of their shape, have the same meaning and therefore should be represented by the same code. Well, the distinction is not meaningless to me, because I am analyzing individual characters, and that doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
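To make the "look at the text as a whole" heuristic concrete, here is a rough Go sketch; the decision rules (any kana at all means Japanese, any hangul means Korean) are deliberately naive assumptions:

package main

import (
    "fmt"
    "unicode"
)

// guessCJK makes a crude guess at the language of CJK text: kana
// suggests Japanese, hangul suggests Korean, and otherwise Han-only
// text is assumed to be Chinese.
func guessCJK(s string) string {
    var kana, hangul, han int
    for _, r := range s {
        switch {
        case unicode.Is(unicode.Hiragana, r) || unicode.Is(unicode.Katakana, r):
            kana++
        case unicode.Is(unicode.Hangul, r):
            hangul++
        case unicode.Is(unicode.Han, r):
            han++
        }
    }
    switch {
    case kana > 0:
        return "probably Japanese"
    case hangul > 0:
        return "probably Korean"
    case han > 0:
        return "probably Chinese"
    }
    return "no CJK detected"
}

func main() {
    fmt.Println(guessCJK("これは日本語です"))
    fmt.Println(guessCJK("한국어 문장입니다"))
    fmt.Println(guessCJK("这是中文句子"))
}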
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面, for instance. It belongs both to #2 and #3... As a simplified character, it stands for both 面 and 麵, face and noodles, whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. From that you can deduce that 麵 is a traditional-only character.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks down: if you used this data to deduce that 面 is a simplified-only character, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
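To make the three-set classification concrete, here is a hedged Go sketch over a toy excerpt of the Unihan variant data; the two maps are hand-copied from the examples above, not generated from the real database:

package main

import "fmt"

// Toy excerpt of Unihan variant data for the examples discussed above.
// In the real database these come from the kSimplifiedVariant and
// kTraditionalVariant fields.
var simplifiedVariant = map[rune]rune{'麵': '面', '韓': '韩'}
var traditionalVariant = map[rune]rune{'面': '麵', '韩': '韓'}

func classify(r rune) string {
    _, hasSimp := simplifiedVariant[r]
    _, hasTrad := traditionalVariant[r]
    switch {
    case hasSimp && !hasTrad:
        return "traditional only"
    case hasTrad && !hasSimp:
        return "simplified only -- or shared, like 面; Unihan alone cannot tell"
    case hasSimp && hasTrad:
        return "ambiguous"
    }
    return "shared by both sets"
}

func main() {
    for _, r := range "麵面韓韩" {
        fmt.Printf("%c: %s\n", r, classify(r))
    }
}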
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.
I want to use Korean translations in my (quite large) wxWidgets application. The application uses the wxWidgets translation framework, which is based on gettext.
I have working translations for French, German, and Russian. I want to go Unicode anyway, but my first questions are:
1. Does my application need Unicode support to display Korean and Japanese languages?
2. If so -- just out of interest -- why does Russian work without it, since it uses a Cyrillic letter set?
3. I have thousands of string literals. Do I have to prepend each and every one of them with 'L'? ( wxString foo("foo") --> wxString foo(L"foo") )
4. If so, did someone build a regex or sed or Perl script to do this in ca. 500 .cpp files? ( pleeze! =) )
5. Will this change in wxWidgets 3.0?
6. Unicode questions in general: I use these string literals in many descriptive and many technical ways -- as displayed text as well as in GLSL shaders and XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
7. I do some text processing (comparing, string finding, etc.) -- are there any logical differences between Unicode and ANSI?
8. Is there any remarkable performance impact in using Unicode?
Thank you!
Wendy
Addressing some of your questions…
Does my application need Unicode support to display Korean and Japanese languages?
If so -- just out of interest -- why does Russian work without it, since it uses a Cyrillic letter set?
Russian fits in a single-byte charset, just like western European languages (though it is a different charset). Korean and Japanese (and Chinese) don't. There are many workarounds for this, but the most elegant I know of to date is to use Unicode so that you don't need to rebuild your application for each locale; just change its message catalog.
Unicode questions in general: I use these string literals in many descriptive and many technical ways -- as displayed text as well as in GLSL shaders and XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true?
Only strings that are going to be shown to (non-technical) users need to be localized, so they're the only ones that have to be in Unicode. The most common approach is to use UTF-8 (which is a particular way of encoding Unicode) as that means that ASCII strings – the most common type passed around inside programs – are exactly the same, which simplifies things a lot. The down-side is that you no longer have cheap indexing into the string as not all characters are the same number of bytes long. That can be anything from a non-issue to a right royal hindering PITA, depending on what the program is doing.
I do some text processing (comparing, string finding, etc.) -- are there any logical differences between Unicode and ANSI?
Comparisons work fine, as does simple string finding. Other operations (e.g., getting the 20th character of a string, or working out how many characters into a string you've found a substring) are nasty because you've not got constant character widths. The nastiness can be mitigated by using wide characters, but they're less nice to use for external data (they introduce potential problems with endianness unless you go into working with byte-order marks, and that's another matter right there).
Is there any remarkable performance impact in using Unicode?
Depends on exactly what you do. With UTF-8, if you're mostly dealing with ASCII text in reality, then you get very little in the way of performance problems for most operations. With wide characters, you take more memory for every character, which naturally has performance implications (but which might be acceptable because it does mean you've got constant-time indexing).
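To see that indexing point concretely, here is a small illustration in Go (whose strings are UTF-8 byte sequences); the sample strings are arbitrary:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // Byte length and character (rune) count diverge as soon as the
    // text leaves the ASCII range.
    for _, s := range []string{"hello", "héllo", "안녕하세요"} {
        fmt.Printf("%q: %d bytes, %d runes\n", s, len(s), utf8.RuneCountInString(s))
    }
}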
1. There's a Korean .po file on http://www.wxwidgets.org/about/i18n.php for wxWidgets' own strings. If your application displays wxWidgets' own strings correctly when using that file, then it does not need Unicode support to display Korean and Japanese languages.
2. ISO-8859-5 is an 8-bit character set with Cyrillic letters.
3. Only if 1. does not yield the correct result. But if you want to translate the string, you should have used _().
4. I don't know.
5. wxWidgets 3.0 will not have separate Unicode and ANSI builds. 2.9.1 doesn't have them, either.
6. It depends on how you use the arguments. C and C++ functions usually operate on the representation of strings and are unaware of any particular character encoding. In particular, what you perceive to be a character and what the program considers a character might be different things.
7. See 6.
8. I do not know, but many toolkits use UTF-16 or UTF-32 instead of UTF-8 because those schemes are simpler. It's a size/speed tradeoff.
1. Does my application need Unicode support to display Korean and Japanese languages?
Thanks to Oswald, I found out that you can have a Korean translation without using Unicode in your wxWidgets application: change the settings for non-Unicode-aware programs (under Windows, at least). But I still have to check whether this is enough for a whole application.
3. I have thousands of string literals. Do I have to prepend each and every one of them with 'L'? ( wxString foo("foo") --> wxString foo(L"foo") )
If you have to use Unicode with wxWidgets prior to 3.0, yes, you have to. But do not use 'L' under wxWidgets; use wxT("foo") instead.
4. If so, did someone build a regex or sed or Perl script to do this in ca. 500 .cpp files?
I did, at least a search and replace under Visual Studio:
Search: {"([^"]*)"}
Replace: wxT(\1)
But be careful! This will replace all string literals, turning #include "file.h" into #include wxT("file.h").
Will this change in wxWidgets 3.0?
Yes. See answer/quote above.
Being an application developer, do I need to know Unicode?
Unicode is a standard that defines numeric codes for the characters used in written communication. Or, as they put it themselves:
The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.
There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.
First, go to the source for authoritative, detailed information and implementation guidelines.
As mentioned by others, Joel Spolsky has a good list of these errors.
I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
Developers should also watch out for canonical representation attacks.
Some of the key concepts you should be aware of are:
Glyphs—concrete graphics used to represent written characters.
Composition—combining glyphs to create another glyph.
Encoding—converting Unicode points to a stream of bytes.
Collation—locale-sensitive comparison of Unicode strings.
At the risk of just adding another link, unicode.org is a spectacular resource.
In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters. UTF-8, which is more or less the standard these days, works hard to stay at a single byte per character, and is byte-for-byte identical to ASCII for the 128 seven-bit code points.
(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)
Unicode is an industry-agreed standard for consistently representing text that has the capacity to represent the world's character systems. All developers need to know about it, as globalization is a growing concern.
One (open) source of code for handling Unicode is ICU - Internationalization Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents C interface; uses C++ compiler).
You don't need to learn all of Unicode to use it; it's a hell of a complex standard. You just need to know the main issues and how your programming tools deal with it. To learn that, check Galwegian's link and your programming language and IDE documentation.
E.g.:
You can convert any character from Latin-1 to Unicode, but it doesn't work the other way around for all characters.
In PHP, some functions (like stristr) do not work with Unicode.
Python declares Unicode strings this way: u"Hello World".
That's the kind of thing you must know.
Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.
Unicode is a character set that, unlike ASCII (which contains only 128 characters, covering the English alphabet, with about a quarter of them being non-printable control characters), has room for roughly 1.1 million code points, including characters of every language known (Chinese, Russian, Greek, Arabic, etc.) and of some languages you have probably never even heard of (even lots of dead-language symbols not in use anymore, but useful for archiving ancient documents).
So instead of dealing with dozens of different character encodings, you have one encoding for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch encodings somewhere in the middle). Actually there is still plenty of room left; we are far from having all 1.1 million code points in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of code space.
Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed in bytes is a separate issue. There are several ways to encode Unicode characters: UTF-8 (one to four bytes represent a single character, depending on the code point; English is almost always one byte per character, other Latin-script languages might take two or three, Chinese/Japanese might take more), UTF-16 (most characters are two bytes, some rarely used ones are four bytes), and UTF-32 (every character is four bytes). There are others, but these are the dominant ones.
Unicode is the default encoding for many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java uses Unicode as its default encoding, usually UTF-16; I have heard Python does as well, and will use or already does use UTF-32). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.
Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".
There are various encoding schemes that can represent arbitrary Unicode characters as bytes, including UTF-8, UTF-16, and others.
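As a small illustration (in Go, whose standard library covers both encodings), here is one code point rendered under UTF-8 and UTF-16; the choice of U+20AC is arbitrary:

package main

import (
    "fmt"
    "unicode/utf16"
    "unicode/utf8"
)

func main() {
    r := '€' // U+20AC EURO SIGN

    // UTF-8: a variable number of bytes (three for U+20AC).
    buf := make([]byte, utf8.UTFMax)
    n := utf8.EncodeRune(buf, r)
    fmt.Printf("UTF-8:  % x\n", buf[:n]) // e2 82 ac

    // UTF-16: one or two 16-bit units (one for U+20AC).
    fmt.Printf("UTF-16: %04x\n", utf16.Encode([]rune{r})) // [20ac]
}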