What are the experiences with using unicode in identifiers - unicode

These days, more languages are using unicode, which is a good thing. But it also presents a danger. In the past there where troubles distinguising between 1 and l and 0 and O. But now we have a complete new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard to find bugs.
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?

Besides the similar character bugs you mention and the technical issues that might arise when using different editors (w/BOM, wo/BOM, different encodings in the same file by copy pasting which is only a problem when there are actually characters that cannot be encoded in ASCII and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).

My experience with using unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).

I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.

I would also recommend using ascii for identifiers. Comments can stay in a non-english language if the editor/ide/compiler etc. are all locale aware and set up to use the same encoding.
Additionally, some case insensitive languages change the identifiers to lowercase before using, and that causes problems if active system locale is Turkish or Azerbaijani . see here for more info about Turkish locale problem. I know that PHP does this, and it has a long standing bug.
This problem is also present in any software that compares strings using Turkish locales, not only the language implementations themselves, just to point out. It causes many headaches

It depends on the language you're using. In Python, for example, is easierfor me to stick to unicode, as my aplications needs to work in several languages. So when I get a file from someone (something) that I don't know, I assume Latin-1 and translate to Unicode.
Works for me, as I'm in latin-america.
Actually, once everithing is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.

I haven't ever used unicode for identifier names. But what comes to my mind is that Python allows unicode identifiers in version 3: PEP 3131.
Another language that makes extensive use of unicode is Fortress.
Even if you decide not to use unicode the problem resurfaces when you use a library that does. So you have to live with it to a certain extend.

Related

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and very much like to know from your expertise whether I've understood these concepts correct. So far here is the dumbed down version(for my understanding) of what I've gathered from web:
Character Encodings -> Set of rules that tell the OS how to store characters. Eg., ISO8859-1,MSWIN1252,UTF-8,UCS-2,UTF-16. These rules are also called Code Pages/Character Sets which maps individual characters to numbers. Apparently unicode handles this a bit differently than others. ie., instead of a direct mapping from a number(code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.[ http://www.joelonsoftware.com/articles/Unicode.html ]
Fonts -> These are implementation of character encodings. They are files of different formats (True Type,Open Type,Post Script) that contain mapping for each character in an encoding to number.
Glyphs -> These are visual representation of characters stored in the font files.
And based on the above understanding I have the below questions,
1)For the OS to understand an encoding, should it be installed separately?. Or installing a font that supports an encoding would suffice?. Is it okay to use the analogy of a protocol say TCP used in a network to an encoding as it is just a set of rules. (which ofcourse begs the question, how does the OS understands these network protocols when I do not install them :-p)
2)Will a font always have the complete implementation of a code page or just part of it?. Is there a tool that I can use to see each character in a font(.TTF file?)[Windows font viewer shows how a style of the font looks like but doesn't give information regarding the list of characters in the font file]
3)Does a font file support multiple encodings?. Is there a way to know which encoding(s) a font supports?
I apologize for asking too many questions, but I had these in my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with book that will save you great deal of time (you won't become i18n expert overnight, you know):
Developing International Software - it might be 8 years old but this is still worth every cent you're going to spend on it. Maybe the programming examples regard to Windows (C++ and .Net) but the i18n and L10n knowledge is really there. A colleague of mine said once that it saved him about 2 years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up to date resources on that topic (that is publicly available ones). The only current resource I know is Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably need also something related to i18n in js. Unfortunately, the support is rather poor but there are few libraries which help dealing with the problem. The most notable examples would be Dojo Toolkit and Globalize.
The prior is a bit heavy, although supports many aspects of i18n, the latter is lightweight but unfortunately many stuff is missing. If you choose to use Globalize, you might be interested in the latest Jukka's book:
Going Global with JavaScript & Globalize.js - I read this and as far I can tell, it is great. It doesn't cover the topics you were originally asking for but it is still worth reading, even for hands-on examples of how to use Globalize.
Apparently unicode handles this a bit differently than others. ie.,
instead of a direct mapping from a number(code point) to a glyph, it
maps the code point to an abstract "character" which might be
represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
1) For the OS to understand an encoding, should it be installed
separately?.
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
2)Will a font always have the complete implementation of a code page or just part of it?.
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes to lists of characters. For example, in C# reading a UTF-8 string will automatically make it UTF-16 if you tell it to treat it as a string.
No matter what encoding you use, C# will simply use UTF-16 internally and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and shows the glyphs.
I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine.
Bonus answer: On "how does the OS understand network protocols it doesn't know?": again it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (which application) it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

What's the big deal with unicode?

I've heard a lot of people talk about how some new version of a language now supports unicode, and how much of an achievement unicode is. What's the big deal about being able to support a new characterset. It seems like something which would rarely if ever be used but people mention it quite often. What's the benefit or reason people use or even care about unicode?
Programming languages are used to produce software.
Software is used to solve problems faced by humans.
Producing software has a cost.
Software that solves problems for humans produces value. This value can be expressed in the form of profit, or the reduction of costs, depending on the business model of the software developer. How the value is expressed is irrelevant for the purposes of this discussion; what is relevant is that net value is produced.
There are seven billion humans in the world. A significant fraction of them are most comfortable reading text that is not written in the Latin alphabet.
Software which purports to solve a problem for some fraction of those seven billion humans who do not use the Latin alphabet does so more effectively if developers can easily manipulate text written in non-Latin alphabets.
Therefore, a programming language which supports non-Latin character sets lowers the costs of software developers, thereby enabling them to solve more problems for more people at lower costs, and thereby produce more value.
Unicode is the de facto standard for manipulation of non-Latin text.
Therefore, Unicode is important to the design and implementation of programming languages.
Our goal as programming language designers is the creation of tools which produce maximum value. Supporting Unicode is an easy way to massively increase the scope and range of real human problems that can be solved in software.
In the beginning, there were 256 possible characters and many different Code pages to represent them. It became a tangled mess. Supporting multiple languages and multiple characters sets became a programmer's nightmare.
Then the Unicode Consortium was formed. It created a standard that would allow a single character set with 256 x 256 = 65536 characters (plus combinations thereof) to include almost all languages of the world.
The biggest advantage is that a single character string may contain multiple languages. That is no small thing.
Unicode is now the native character specification used in Windows ever since Windows 2000. it is also allowed as a character set in HTML and on websites.
If your application does not support Unicode, or is not planning to support it, then it is only a matter of time until your application will be left behind.
What's the big deal about being able
to support a new characterset.
Unicode is not just "a new characterset". It's the character set that removes the need to think about character sets.
How would you rather write a string containing the Euro sign?
"\x80", "\x88", "\x9c", "\x9f", "\xa2\xe3", "\xa2\xe6", "\xa3\xe1", "\xa4", "\xa9\xa1", "\xd9\xe6", "\xdb", or "\xff" depending upon the encoding.
"\u20AC", in every locale, on every OS.
Unicode can support pretty much any language in the world. Without such an encoding you would have to worry about choosing the correct encoding for different languages, which is very bothersome (not to mention mixing multiple languages in the same text block, ugh)
Unicode support in a language means that the language's native character/string type supports all those languages as well, without the user having to worry about character encodings or multibyte characters and such while doing computations. Of course, one still has to acnowledge character encodings when doing I/O, but doing your string processing in one single sensible encoding helps a lot.
Well if you care anything about internationalization (AKA the rest of the world) scientific notations, etc you would care about unicode. Unicode is difficult to deal with because we have been so ingrained just ASCII support. But now that modern systems support Unicode, there is no reason really not to just encode your things UTF-8. I know I work in publishing and for a long time we had to do hack things like insert gif images of formulas etc. Now we can put unicode straight in and people can search and copy and paste etc, and our code can deal with it by using unicode regexes etc.
If you wish to communicate with someone whose native language is not English (either the British or American variants), you care. A lot.
As everyone says - support for all the charactersets and formatting used by every other language and locale in the world. Open source and commercial developers both like that because it increases their potential user base by about 20x fold (and growing).
Unicode is a good thing because it eliminates character set problems and leaves one less thing to worry about. Even if your software never leaves the U.S., you never know when you're going to run into a filename or text field with an odd character in it, and Unicode lets you live in ignorance.
Americans like Daisetsu may not care about Unicode, but the rest of the world uses a bit more than 26 Latin letters, and there Unicode is heavily used.
We had hundreds of messed up charsets in the past solely because American computer scientists thought "why would anyone want to use more than 26 Latin characters like we have in English?"
Narrow-mindedness is a bad thing.

wxWidgets and Unicode

i want to use korean translations under in my - quite large - wxwidgets application. The application uses the wxwidgets translation framework, which is based on gettext.
I have working translations for french, german and russian. I want to go unicode anyway, but my first question is:
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
I have thousands of string literals. Do i have to prepend each and every one of them with 'L' ? ( wxString foo("foo") --> wxString foo(L"foo") )
if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ? ( pleeze! =) )
Will this change in wxWidgets 3.0?
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Is there any remarkeable performance impact in using Unicode?
Thank you!
Wendy
Addressing some of your questions…
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
Russian fits in a single-byte charset, just like western European languages (though it is a different charset). Korean and Japanese (and Chinese) don't. There are many workarounds for this, but the most elegant I know of to date is to use Unicode so that you don't need to rebuild your application for each locale; just change its message catalog.
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
Only strings that are going to be shown to (non-technical) users need to be localized, so they're the only ones that have to be in Unicode. The most common approach is to use UTF-8 (which is a particular way of encoding Unicode) as that means that ASCII strings – the most common type passed around inside programs – are exactly the same, which simplifies things a lot. The down-side is that you no longer have cheap indexing into the string as not all characters are the same number of bytes long. That can be anything from a non-issue to a right royal hindering PITA, depending on what the program is doing.
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Comparisons work fine, as does simple string finding. Other operations (e.g., getting the 20th character of a string, or working out how many characters into a string you've found a substring) are nasty because you've not got constant character widths. The nastiness can be mitigated by using wide characters, but they're less nice to use for external data (they introduce potential problems with endianness unless you go into working with byte-order marks, and that's another matter right there).
Is there any remarkeable performance impact in using Unicode?
Depends on exactly what you do. With UTF-8, if you're mostly dealing with ASCII text in reality then you get very little in the way of performance problems for most operations. With wide characters, you take more memory for every character, which naturally has performance implications (but which might acceptable because it does mean you've got constant-time indexing).
There's a korean .po file on http://www.wxwidgets.org/about/i18n.php for wxWidget's own strings. If your application displays wxWidget's own strings correctly when using that file, then it does not need Unicode support to display Korean and Japanese languages.
ISO-8859-5 is an 8 bit character set with Cyrillic letters.
Only if 1. does not yield the correct result. But if you want to translate the string, you should have used _().
I don't know.
wxWidgets 3.0 will not have separate Unicode- and ANSI-builds. 2.9.1 doesn't have, either.
It depends on how you use the arguments. C- and C++-functions usually operate on the representation of strings and are unaware of any particular character encoding. Particularly what you perceive to be a character and what the program considers a character might be different things.
See 6.
I do not know, but many toolkits use UTF-16 or UTF-32 instead of UTF-8 because these schemes are simpler. It's a size-speed tradeoff.
1.does my application need unicode support to display korean and japanese
languages?
Thanks to Oswald, i found out that you can have a korean translation without using unicode in your wxwidgets application. Change ( under windows, at least ) settings for non-unicode aware programs. But i still have to check out if this is enough for a whole application.
3.I have thousands of string literals. Do i have to prepend each
and every one of them with 'L' ? (
wxString foo("foo") --> wxString
foo(L"foo") )
If you have to use unicode with wxwidgets prior to 3.0, you have to. But do not use 'L' under wxwidgets, use wxT("foo")
4.if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ?
I did, at least a search and replace under Visual Studio:
Search: {"([^"]*)"}
Replace: wxT(\1)
But be careful! Will replace all string literals, #include "file.h" with #include wxT("file.h")
Will this change in wxWidgets 3.0?
Yes. See answer/quote above.

Why does anyone use an encoding other than UTF-8? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I want to know why any developer would need to use an encoding other than UTF-8.
Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:
http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
The most important disadvantages are IMHO that UTF-8 might use significantly more space especially in Asian languages such as Chinese, Japanese or Hindi and that not all code points have the same size which makes measurements more difficult and many string operations such as search inefficient.
Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.
Those are the usual excuses for not using Unicode.
As for not using UTF-8 specifically there are different reasons. Some systems, like Windows1 (and stemming from that, .NET) and Java came to be in a time where Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, encoding code points directly as 16-bit words.
Later Unicode was expanded to 21 bits because 65536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2 the transition to UTF-16 was the easiest and most sensible choice. Windows did that transition back in Ye Olde Days of Windows 2000.
So while I think that nearly all application nowadays should support Unicode I don't think it is entirely necessary for them to specifically use UTF-8. There are historic reasons for that and no real benefit in converting existing systems from UTF-16 to UTF-8.
1 NT.
In UTF-8 code points between 0800 and FFFF take up three bytes in UTF-8 but only two in UTF-16. See the wikipedia comparison for more details, but basically if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
UTF-8 is very efficient at encoding plain English text (same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.
For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Because outside the English-speaking world, people have been using various encodings that predate Unicode and are tailored for their respective languages for decades. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them and usually use them as default even if they by now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.
Examples:
ISO-8859-1 in western Europe - actually outdated there as well, as you need ISO-8859-15 for the Euro sign
ISO-2022-JP in Japan for emails, Shift JIS for websites
Big5 in Taiwan
GB2312 in China
The last two examples show that encodings can even be a political issue.
Sometimes they are restricted due to historical/unsupported reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box: and something in that mix means I keep reverting to Cp1512 instead of UTF8).
Sometimes you don't need to use UTF-8 (for example when storing a md5 hash in a database: you only need the hexadecimal range 0-9 A-F: why make it a UTF-8 field, which will take at least a byte extra storage instead of normal ASCII).
Sometimes it's just laziness learning the UTF-8 functions for a particular language.
Because they do not know better.
The only valid criticism to utf-8 is that encodings for common Asian languages are oversized from other encodings.
UTF-8 is superior because
It is ASCII compatible. Most known and tried string operations do not need adaptation.
It is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age. If you have important data in encoding X, spend two minutes on Google and write a conversion function. Even if you have to interface with sourceless legacy app Z, you can run your communications through a pipe so that your logic stays in the 21st century.
UTF-16 isn't fixed length either and assuming it is like many do, will only cause terrible bugs.
Additionally Unicode is very complex and it is almost certain than any fixed-size algorithm adapted from ASCII will yield bad results even in UTF-32.
Say you have this UTF-16 string.
[0][1][2][F|3] [4] [5]
And you want to insert a character with code 8 between [3] and [4]
you would do insert(5,8)
If you don't check for characters outside BMP(serially as in UTF-8 as you cannot know how many double sized characters you have) you get:
[0][1][2][F|8][3][4][5]
Two new garbage characters. So much for your fixed size encoding.
You can of course disallow such characters altogether, but then when your code interfaces with the real world, you might find your program saves the profile for this user who lives in rm -Rf / in .profile instead of [Classical Chinese Proverb].profile.
Or just an angry user that cannot write his thesis on Classical Chinese Proverbs with your software.
One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.
Another legitimate reason is that you need to use a programming language / libraries that do not support UTF8 / Unicode well ... or at all.
Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.
And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.
Its also worth remembering that in some circumstances (where a non-latin set of characters are needed) UTF-8 can actually bloat larger than the 16 bit Unicode encoding. In those cases ucs-2 or utf-16 would be a better choice.
The reasons for using non-Unicode 8-bit character sets / encodings are all back compatibility of some kind, and/or inertia. For that matter, the most frequent reasons for using UTF-8 are compatibility with standards like XML that mandate or prefer UTF-8.
Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.
When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes and libraries like Javascript, JVM, .NET, ICU use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.
Imagine all files to consider are in GB2312 (China mainland standard). Then you might choose GB18030 as Unicode encoding instead. They are compatible the same way as all ASCII is UTF-8. That is useful in China mainland!
You might decide even quicker when you find out that both mentioned GB-standards are required in your IT-product by law (as far as I have heard), if you want to ship in China (mainland).
Another upside is that GB2312, and as such GB18030 as well, are also ASCII compatible.
It is algorithmically not so robust, though. – So if you have no political reasons or any GB2312 legacy, it makes no sense to use it. But if you do, here you got your answer.
Related to the subject, when using MySQL, as if it wasn't complex enough, you get the option the choose which kind of UTF-8 collation you want to use. So what would you use?
UTF-8 general ci
or
UTF-8 unicode ci?
(I tend to use the UTF-8 variant that is used for the database connection)
Because you sometimes want to operate easily on codepoints -- then you'd choose f.e. UCS-2 or UCS-4.
Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.
At my previous employer we used iso-8859-1 for some of our ASP pages to match the collation of our SQL Server, which as you can guess was not Unicode. I wanted to change the collation, but the manager said to wait till we upgrade our SQL Server to do it. Needless to say it never happened - I haven't been with them for a little over a year now, so I don't know if they finally did it.
Unicode certainly is a good place to work from in most cases, but a developer should be familiar with many different types of character encoding. Certainly ASCII might be used if the set of characters is limited.
What if you're a developer and receiving data from a source that doesn't send UTF-8? There could be lots of interface issues if you don't understand your input.
Joel's article on the must-knows for character encoding is good and worth reading.

Why use Unicode if your program is English only?

So I've read Joel's article, and looked through SO, and it seems the only reason to switch from ASCII to Unicode is for internationalization. The company I work for, as a policy, will only release software in English, even though we have customers throughout the world. Since all of our customers are scientists, they have functional enough English to use our software as a non-native speaker. Or so the logic goes. Because of this policy, there is no pressing need to switch to Unicode to support other languages.
However, I'm starting a new project and wanted to use Unicode (because that is what a responsible programmer is supposed to do, right?). In order to do so, we would have to start converting all of the libraries we've written into Unicode. This is no small task.
If internationalization of the programs themselves is not considered a valid reason, how would one justify all the time spent recoding libraries and programs to make the switch to Unicode?
This obviously depends on what your app actually does, but just because you only have an english version in no way means that internationalization is not an issue.
What if I want to store a customer name which uses non-english characters? Or the name of a place in another country?
As an added bonus (since you say you're targeting scientists) is that all sorts of scientific symbols and notiations are supported as part of Unicode.
Ultimately, I find it much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on. Non-unicode means that you use some locale-dependant character set or codepage by default, and so text that looks fine on your computer may be full of garbage characters on someone else's.
Apart from that, you probably don't need to translate all your libraries to Unicode in one go. Write wrappers as needed to convert between Unicode and whichever encoding you use otherwise.
If you use UTF-8 for your Unicode text, you even get the ability to read plain ASCII strings, which should save you some conversion headaches.
They say they will always put it in English now, but you admit you have worldwide clients. A client comes in and says internationalization is a deal breaker, will they really turn them down?
To clarify the point I'm trying to make you say that they will not accept this reasoning, but it is sound.
Always better to be safe than sorry, IMO.
The extended Scientific, Technical and Mathematical character set rules.
Where else can you say ⟦∀c∣c∈Unicode⟧ and similar technical stuff.
Characters beyond the 7-bit ASCII range are useful in English as well. Does anyone using your software even need to write the € sign? Or £? How about distinguishing "résumé" from "resume"?You say it's used by scientists around the world, who may have names like "Jörg" or "Guðmundsdóttir". In a scientific setting, it is useful to talk about wavelengths like λ, units like Å, or angles as Θ, even in English.
Some of these characters, like "ö", "£", and "€" may be available in 8-bit encodings like ISO-8859-1 or Windows-1252, so it may seem like you could just use those encodings and be done with it. The problem is that there are characters outside of those ranges that many people use very frequently, and so lots of existing data is encoded in UTF-8. If your software doesn't understand that when importing data, it may interpret the "£" character in UTF-8 as a sequence of 2 Windows-1252 characters, and render it as "£". If this sort of error goes undetected for long enough, you can start to get your data seriously garbled, as multiple passes of misinterpretation alter your data more and more until it becomes unrecoverable.
And it's good to think about these issues early on in the design of your program. Since strings tend to be very low-level concept that are threaded throughout your entire program, with lots of assumptions about how they work implicit in how they are used, it can be very difficult and expensive to add Unicode support to a program later on if you have never even thought about the issue to begin with.
My recommendation is to always use Unicode capable string types and libraries wherever possible, and make sure any tests you have (whether they be unit, integration, regression, or any other sort of tests) that deal with strings try passing some Unicode strings through your system to ensure that they work and come through unscathed.
If you don't handle Unicode, then I would recommend ensuring that all data accepted by the system is 7-bit clean (that is, there are no characters beyond the 7-bit US-ASCII range). This will help avoid problems with incompatibilities between 8-bit legacy encodings like the ISO-8859 family and UTF-8.
Suppose your program allows me to put my name in it, on a form, a dialog, whatever, and my name can't be written with ascii characters... Even though your program is in English, the data may be in other language...
It doesn't matter that your software is not translated, if your users use international characters then you need to support unicode to be able to do correct capitalization, sorting, etc.
If you have no business need to switch to unicode, then don't do it. I'm basing this on the fact that you thought you'd need to change code unrelated to component you already need to change to make it all work with Unicode. If you can make the component/feature you're working on "Unicode ready" without spreading code churn to lots of other components (especially other components without good test coverage) then go ahead and make it unicode ready. But don't go churn your whole codebase without business need.
If the business need arises later, address it then. Otherwise, you aren't going to need it.
People in this thread may suppose scenarios where it becomes a business requirement. Run those scenarios by your product managers before considering them scenarios worth addressing. Make sure they know the cost of addressing them when you ask.
Well for one, your users might know and understand english, but they can still have 'local' names. If you allow your users to do any kind of input to your application, they might want to use characters that are not part of ascii. If you don't support unicode, you will have no way of allowing these names. You'd be forcing your users to adopt a more simple name just because the application isn't smart enough to handle special characters.
Another thing is, even if the standard right now is that the app will only be released in English, you are also blocking the possibility of internationalization with ASCII, adding to the work that needs to be done when the company policy decides that translations are a good thing. Company policy is good, but has also been known to change.
I'd say this attitude expressed naïveté, but I wouldn't be able to spell naïveté in ASCII-only.
ASCII still works for some computer-only codes, but is no good for the façade between machine and user.
Even without the New Yorker's old-fashioned style of coöperation, how would some poor woman called Zoë cope if her employers used such a system?
Alas, she wouldn't even seek other employment, as updating her résumé would be impossible, and she'd have to resume instead. How's she going to explain that to her fiancée?
The company I work for, **as a policy**, will only release software in English, even though we have customers throughout the world.
1 reason only: Policies change, and when they change, they will break your existing code. Period.
Design for evil, and you have a chance of not breaking your code so soon. In this case, use Unicode. Happened to me on a brazilian specific stock-market legacy system.
Many languages (Java [and thus most JVM-based language implementations], C# [and thus most .NET-based language implementatons], Objective C, Python 3, ...) support Unicode strings by preference or even (nearly) exclusively (you have to go out of your way to work with "strings" of bytes rather than of Unicode characters).
If the company you work for ever intends to use any of these languages and platforms, it would therefore be quite advisable to start planning a Unicode-support strategy; a pilot project in particular might not be a bad idea.
That's a really good question. The only reason I can think of that has nothing to do with I18n or non-English text is that Unicode is particularly suited to being what might be called a hub character set. If you think of your system as a hub with its external dependencies as spokes, you want to isolate character encoding conversions to the spokes, so that your hub system works consistently with your chosen encoding. What makes Unicode a ideal character set for the hub of your system is that it acknowledges the existence of other character sets, it defines equivalences between its own characters and characters in those external character sets, and there's an ongoing process where it extends itself to keep up with the innovation and evolution of external character sets. There are all sorts of weird encodings out there: even when the documentation assures you that the external system or library is using plain ASCII it often turns out to be some variant like IBM775 or HPRoman8, and the nice thing about Unicode is that no matter what encoding is thrown at you, there's a good chance that there's a table on unicode.org that defines exactly how to convert that data into Unicode and back out again without losing information. Then again, equivalents of a-z are fairly well-defined in every character set, so if your data really is restricted to the standard English alphabet, ASCII may do just as well as a hub character set.
A decision on encoding is a decision on two things - what set of characters are permitted and how those characters are represented. Unicode permits you to use pretty much any character ever invented, but you may have your own reasons not to want or need such a wide choice. You might still restrict usernames, for example, to combinations of a-z and underscore, maybe because you have to put them into an external LDAP system whose own character set is restricted, maybe because you need to print them out using a font that doesn't cover all of Unicode, maybe because it closes off the security problems opened up by lookalike characters. If you're using something like ASCII or ISO8859-1, the storage/transmission layer implements a lot of those restrictions; with Unicode the storage layer doesn't restrict anything so you might have to implement your own rules at the application layer. This is more work - more programming, more testing, more possible system states. The tradeoff for that extra work is more flexibility, application-level rules being easier to change than system encodings.
The reason to use unicode is to respect proper abstractions in your design.
Just get used to treating the concept of text properly. It is not hard. There's no reason to create a broken design even if your users are English.
Just think of a customer wanting to use names like Schrödingers Cat for files he saved using your software. Or imagine some localized Windows with a translation of My Documents that uses non-ASCII characters. That would be internationalization that has, though you don't support internationalization at all, have effects on your software.
Also, having the option of supporting internationalization later is always a good thing.
Internationalization is so much more than just text in different languages. I bet it's the niche of the future in the IT-world. Heck, it already is. A lot has already been said, just thought I would add a small thing. Even though your customers right now are satisfied with english, that might change in the future. And the longer you wait, the harder it will be to convert your code base. They might even today have problems with e.g. file names or other types of data you save/load in your application.
Unicode is like cooties. Once it "infects" one area, it's usually hard to contain it given interconnectedness of dependencies. Sooner or later, you'll probably have to tie in a library that is unicode compliant and thus will use wchar_t's or the like. Instead of marshaling between character types, it's nice to have consistent strings throughout.
Thus, it's nice to be consistent. Otherwise you'll end up with something similar to the Windows API that has a "A" version and a "W" version for most APIs since they weren't consistent to start with. (And in some cases, Microsoft has abandoned creating "A" versions altogether.)
You haven't said what language you're using. In some languages, changing from ASCII to Unicode may be pretty easy, whereas in others (which don't support Unicode) it might be pretty darn hard.
That said, maybe in your situation you shouldn't support Unicode: you can't think of a compelling reason why you should, and there are some reasons (i.e. your cost to change your existing libraries) which argue against. I mean, perhaps 'ideally' you should but in practice there might be some other, more important or more urgent, thing to spend your time and effort on at the moment.
If program takes text input from the user, it should use unicode; you never know what language the user is going to use.
When using Unicode, it leaves the door open for internationalization if requirements ever change and you are required to use text in other languages than English.
Also, in your new project you could always just write wrappers for the libraries that internally convert between ASCII and Unicode and vice-versa.
Your potential client may already be running a non-unicode application in a language other than English and won't be able to run your program without swichting the windows unicode locale back and forth, which will be a big pain.
Because the internet is overwhelmingly using Unicode. Web pages use unicode. Text files including your customer's documents, and the data on their clipboards, is Unicode.
Secondly Windows, is natively Unicode, and the ANSI APIs are a legacy.
Modern applications should use Unicode where applicable, which is almost everywhere.