wxWidgets and Unicode - unicode

i want to use korean translations under in my - quite large - wxwidgets application. The application uses the wxwidgets translation framework, which is based on gettext.
I have working translations for french, german and russian. I want to go unicode anyway, but my first question is:
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
I have thousands of string literals. Do i have to prepend each and every one of them with 'L' ? ( wxString foo("foo") --> wxString foo(L"foo") )
if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ? ( pleeze! =) )
Will this change in wxWidgets 3.0?
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Is there any remarkeable performance impact in using Unicode?
Thank you!
Wendy

Addressing some of your questions…
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
Russian fits in a single-byte charset, just like western European languages (though it is a different charset). Korean and Japanese (and Chinese) don't. There are many workarounds for this, but the most elegant I know of to date is to use Unicode so that you don't need to rebuild your application for each locale; just change its message catalog.
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
Only strings that are going to be shown to (non-technical) users need to be localized, so they're the only ones that have to be in Unicode. The most common approach is to use UTF-8 (which is a particular way of encoding Unicode) as that means that ASCII strings – the most common type passed around inside programs – are exactly the same, which simplifies things a lot. The down-side is that you no longer have cheap indexing into the string as not all characters are the same number of bytes long. That can be anything from a non-issue to a right royal hindering PITA, depending on what the program is doing.
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Comparisons work fine, as does simple string finding. Other operations (e.g., getting the 20th character of a string, or working out how many characters into a string you've found a substring) are nasty because you've not got constant character widths. The nastiness can be mitigated by using wide characters, but they're less nice to use for external data (they introduce potential problems with endianness unless you go into working with byte-order marks, and that's another matter right there).
Is there any remarkeable performance impact in using Unicode?
Depends on exactly what you do. With UTF-8, if you're mostly dealing with ASCII text in reality then you get very little in the way of performance problems for most operations. With wide characters, you take more memory for every character, which naturally has performance implications (but which might acceptable because it does mean you've got constant-time indexing).

There's a korean .po file on http://www.wxwidgets.org/about/i18n.php for wxWidget's own strings. If your application displays wxWidget's own strings correctly when using that file, then it does not need Unicode support to display Korean and Japanese languages.
ISO-8859-5 is an 8 bit character set with Cyrillic letters.
Only if 1. does not yield the correct result. But if you want to translate the string, you should have used _().
I don't know.
wxWidgets 3.0 will not have separate Unicode- and ANSI-builds. 2.9.1 doesn't have, either.
It depends on how you use the arguments. C- and C++-functions usually operate on the representation of strings and are unaware of any particular character encoding. Particularly what you perceive to be a character and what the program considers a character might be different things.
See 6.
I do not know, but many toolkits use UTF-16 or UTF-32 instead of UTF-8 because these schemes are simpler. It's a size-speed tradeoff.

1.does my application need unicode support to display korean and japanese
languages?
Thanks to Oswald, i found out that you can have a korean translation without using unicode in your wxwidgets application. Change ( under windows, at least ) settings for non-unicode aware programs. But i still have to check out if this is enough for a whole application.
3.I have thousands of string literals. Do i have to prepend each
and every one of them with 'L' ? (
wxString foo("foo") --> wxString
foo(L"foo") )
If you have to use unicode with wxwidgets prior to 3.0, you have to. But do not use 'L' under wxwidgets, use wxT("foo")
4.if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ?
I did, at least a search and replace under Visual Studio:
Search: {"([^"]*)"}
Replace: wxT(\1)
But be careful! Will replace all string literals, #include "file.h" with #include wxT("file.h")
Will this change in wxWidgets 3.0?
Yes. See answer/quote above.

Related

When to use Unicode (aside with non-unicode!)

I haven't found much (concise) info about when exactly to use Unicode. I understand that many say best practice is to always use Unicode. But Unicode strings DO have more memory footprint. Am I correct to say that Unicode must be used only when
Printing something to screen other than local (for example debugging) use.
Generally, sending any type of text across a network with the two ends being in different locales/country
When you're not sure which to use
I think it would be beneficial if someone explained the basics (concise) of what actually happens with Unicode... am I correct to say that things get messy when :
the physical (byte) string gets sent to a machine using a representation of strings (code page, others... this is already detail although interesting) different from the sender.
The context is using Unicode in a programming language (say C++), but I hope answers to this question can be used for any encoding situation.
Also, I'm aware Unicode and NLS are not the same thing, but is it correct to say that NLS implies usage of Unicode?
P.S. awesome site
Always use Unicode, it will save you and others a lot of pain.
What you may have confused is the issue of encoding. Unicode strings do not necessarily take more memory than the equivalent ASCII (or other encoding) strings, that depends a lot on the encoding used.
Sometimes "Unicode" is used as a synonym for "UCS-2" or "UTF-16". Strictly speaking that use is wrong, because "Unicode" is the standard that defines the set of characters and their unicode codepoints. It does not as such define a mapping to bytes (or words). UTF-16, UTF-8 and other encoding take over the job of mapping the characters to concrete bytes.
The beauty of Unicode is that it frees you from restrictions and lots of headaches. Unicode is the largest character set available to date, i.e. it enables you to actually encode and use virtually any character of any halfway mainstream language in use today. With any other character set you need to think about whether it can actually encode a character or not. Latin-1 cannot encode the character "あ", Shift-JIS cannot encode the character "ڥ" and so on. Only if you're very sure you will never ever need anything other than basic Latin/Arabic/Japanaese/whatever other subset of characters should you choose a specialized encoding such as Latin-1, BIG-5, Shift-JIS or ASCII.
Unicode is the most versatile charset available and therefore a good standard to adhere to.
The Unicode encodings are nothing special, they're just a little more complex in their bit representation since they have to encode many more characters while still trying to be space efficient. For a very detailed excursion into this topic, please see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I have a little utility which is sometimes helpful in seeing the difference between character encodings. http://sodved.awardspace.info/unicode.pl. If you paste in ö into the Raw (UTF-8) field you will see that it is represented by different byte sequences in different encodings. And as the other two good answers describe, some non-unicode encodings cannot represent it at all.

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does but has limitations. I simply don't understand, are the limitation anything big/key or not a big deal?
You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ascii characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, "."), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in unicode-characters of a string, just count the number of characters with the high bit zero (ASCII characters), and the number of characters with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two.
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
What is not supported is iteration by unicode character, there is no standard function for string length in unicode characters etc. So the higher-level kind of Unicode support (like what is available in Python with length, lower -> upper case conversion, encoding in arbitrary coding etc) is not available.
Lua 5.3 was released now. It comes with a basic UTF-8 library.
You can use the utf8 library to do things about UTF-8 encoding, like getting the length of a UTF-8 string (not number of bytes as string.len), matching each characters (not bytes), etc.
It doesn't provide native support other than encoding, like is this character a Chinese character?
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.

What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Mich Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no way how you can detect an encoding. The best thing you could do is something like IE and depend on letter distributions in different languages, as well as standard characters for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russians, Greeks too who all have there own encodings beside UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is back compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If not Unicode it's probably current platform default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.

What do I need to know about Unicode? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
Being a application developer, do I need to know Unicode?
Unicode is a standard that defines numeric codes for glyphs used in written communication. Or, as they say it themselves:
The standard for digital
representation of the characters used
in writing all of the world's
languages. Unicode provides a uniform
means for storing, searching, and
interchanging text in any language. It
is used by all modern computers and is
the foundation for processing text on
the Internet. Unicode is developed and
maintained by the Unicode Consortium.
There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.
First, go to the source for
authoritative, detailed information
and implementation guidelines.
As mentioned by others, Joel Spolsky
has a good list of these
errors.
I also like Elliotte Rusty Harold's
Ten Commandments of Unicode.
Developers should also watch out for
canonical representation attacks.
Some of the key concepts you should be aware of are:
Glyphs—concrete graphics used to represent written characters.
Composition—combining glyphs to create another glyph.
Encoding—converting Unicode points to a stream of bytes.
Collation—locale-sensitive comparison of Unicode strings.
At the risk of just adding another link, unicode.org is a spectacular resource.
In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has everal encoding schemes to handle all those characters - UTF-8, which is more or less the standard these days, works really hard to stay a single byte per character, and is identical to ASCII for the first 7 bits.
(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)
Unicode is an industry agreed standard for consistently representing text that has capacity to represent the World's character systems. All developers need to know about it, as Globalization is a growing concern.
One (open) source of code for handling Unicode is ICU - Internationalization Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents C interface; uses C++ compiler).
You don't need to learn unicode to use it, it's a hell of complex norm. You just need to know the main issues and how your programming tools deal with it. To learn that, check the Galwegian's link and your programming language and ide documentation.
E.G :
You can convert any caracter from latin-1 to unicode but it doesn't work the other way for all caracters.
PHP let you now that some function (like stristr) does not work with unicode.
Python declare unicode string this way : u"Hello World".
That's the kind of thin you must know.
Knowing that, if you do not have a GOOD reason to not use unicode, then just use it.
Unicode is a character set, that other than ASCII (which contains only letters for English, 127 characters, one third of them actually being non-printable control characters) contains roughly 2 million characters, including characters of every language known (Chinese, Russian, Greek, Arabian, etc.) and some languages you have probably never even heard of (even lots of dead language symbols not in use anymore, but useful for archiving ancient documents).
So instead of dealing with dozens of different character encodings, you have one encoding for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch the encoding somewhere in the middle of a text string). Actually there is still plenty of room left, we are far from having all 2 mio characters in use; the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of symbol space.
Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the encoding itself, how it is expressed as "bytes" is a different issue. There are several ways to write Unicode characters like UTF-8 (one to six bytes represent a single character, depending on character number, English is almost always one byte, other Roman languages might be two or three, Chinese/Japanese might be more), UTF-16 (most characters are two byte, some rarely used ones are four byte) and UTF-32, every character is four byte. There are others, but these are the dominant ones.
Unicode is the default encoding for many newer OSes (in Mac OS X almost anything is Unicode) and programming languages (Java uses Unicode as default encoding, usually UTF-16, I heard Python does as well and will use or already does use UTF-32). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.
Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".
There are various encoding schemes that can representing arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.

What are the experiences with using unicode in identifiers

These days, more languages are using unicode, which is a good thing. But it also presents a danger. In the past there where troubles distinguising between 1 and l and 0 and O. But now we have a complete new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard to find bugs.
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?
Besides the similar character bugs you mention and the technical issues that might arise when using different editors (w/BOM, wo/BOM, different encodings in the same file by copy pasting which is only a problem when there are actually characters that cannot be encoded in ASCII and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).
My experience with using unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).
I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
I would also recommend using ascii for identifiers. Comments can stay in a non-english language if the editor/ide/compiler etc. are all locale aware and set up to use the same encoding.
Additionally, some case insensitive languages change the identifiers to lowercase before using, and that causes problems if active system locale is Turkish or Azerbaijani . see here for more info about Turkish locale problem. I know that PHP does this, and it has a long standing bug.
This problem is also present in any software that compares strings using Turkish locales, not only the language implementations themselves, just to point out. It causes many headaches
It depends on the language you're using. In Python, for example, is easierfor me to stick to unicode, as my aplications needs to work in several languages. So when I get a file from someone (something) that I don't know, I assume Latin-1 and translate to Unicode.
Works for me, as I'm in latin-america.
Actually, once everithing is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.
I haven't ever used unicode for identifier names. But what comes to my mind is that Python allows unicode identifiers in version 3: PEP 3131.
Another language that makes extensive use of unicode is Fortress.
Even if you decide not to use unicode the problem resurfaces when you use a library that does. So you have to live with it to a certain extend.