How do UTF-8/Unicode adapt to new writing systems?

An example to clarify my question:
The Hongkongers' native language is Cantonese, but we all write in a different language: Mandarin Chinese. The two languages are kind of similar, and Hongkongers are educated to write in Mandarin Chinese.
Cantonese doesn't have a writing system of its own. We are still happy with Mandarin as our written language, but suppose that one day Hongkongers decided to develop a 'Cantonese script' containing not-yet-existing characters. How would UTF-8/Unicode/fonts have to change to accommodate these new characters?
I mean, who would change the UTF-8/Unicode/font standards? How exactly would Linux/Windows have to be modified in order to display these newly created characters?
(The example is just to make my question clear. We're not talking about politics ;D )

The Unicode coding space has over 1,000,000 code points, and only about 10% of them have been allocated, so there is a lot of room for new characters (even though some areas of the coding space have been set apart for use other than added characters). The Unicode Consortium, working in close cooperation with the relevant body at ISO, assigns code points to new characters on the basis of proposals that demonstrate actual usage or, in some cases, plans with a solid basis and widespread support.
Thus, if a new script were designed and there was a large community that would seriously use it, it would be added, with its characters, into Unicode after due proposals and discussion.
It would then be up to font manufacturers to add glyphs for such characters. This might take a long time, but if there is strong enough need, new fonts and enhancements to existing fonts would emerge.
No change to UTF-8 or other Unicode transfer encodings would be needed. They already encode the entire coding space, whether code points are assigned to characters or not.
Rendering software would need no modifications, unless there are some specialties in the writing system. Normal characters would be rendered just fine, as soon as suitable fonts are available.
However, if the characters added were outside the Basic Multilingual Plane (BMP), the "16-bit subset of Unicode", both rendering and processing (and input) would be problematic. Many programming languages and programs effectively treat Unicode as if it were a 16-bit code and run into problems (possibly solvable, but problems nonetheless) when characters outside the BMP are used. If the writing system had, say, 10,000 characters, it is quite possible that it would have to be allocated outside the BMP.
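As a small illustration (a sketch in Python; the emoji code point U+1F600 is just a stand-in for any character assigned outside the BMP):

    # Sketch: a code point outside the BMP needs 4 bytes in UTF-8 and a
    # surrogate pair in UTF-16; systems that assume "one character = one
    # 16-bit unit" see it as two units.
    ch = "\U0001F600"                 # any supplementary-plane character
    print(ch.encode("utf-8"))         # b'\xf0\x9f\x98\x80' -- 4 bytes
    print(ch.encode("utf-16-le"))     # b'=\xd8\x00\xde' -- surrogate pair D83D DE00
    print(len(ch))                    # 1 in Python 3, but 2 in UTF-16-based APIs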

The Unicode committee adds new characters as they see fit. Then fonts add support for the new characters. Operating systems should not require changes simply to display the new characters. Typing the characters would generally require updates or plug-ins to an operating system's input methods.

Related

Is the Demotic script represented in Unicode?

Does Unicode have signs for Demotic script? Is there any font containing such signs?
Unicode has assigned 1,072 characters for Egyptian hieroglyphs and for Hieratic (which is the parent system of Demotic and the cursive version of the hieroglyphs), so I wonder if there is any Unicode support for Demotic too.
Although Demotic is still not encoded, there are already texts encoded in rich-text documents (using specific fonts).
They are based on the Coptic script, with a few additions for the diacritical yodh on some letters; this works with some ligatures and slightly modified letter forms, but it is not purely a "hack", because the Coptic script was in fact developed from Demotic (in its cursive form used in Thebes) with the simplified forms from Greek, adapted for the Late Ancient Egyptian language (which was transcribed in the same period and the same area of Thebes with BOTH the Demotic and Coptic scripts, while the Demotic script also coexisted with Hieratic, i.e. the highly simplified cursive form of the complex hieroglyphs).
You can see this here:
https://ucbclassics.dreamhosters.com/djm/demotic.html
This work is the working base for a future encoding of Demotic in Unicode, but in the meantime many researchers can use this font (and the keyboard input layout, which is based on Classical Greek with a few modifications) on macOS, Windows and now Linux as well, within several office word processors, and now on the web too (provided the web browsers support OpenType features and webfonts). It does not yet give you pure plain text, but it works: using the Coptic encoding with just a few additional generic diacritics, near-plain-text is possible and even directly readable by Egyptologists.
So the good question is: will Demotic be encoded separately, or will Unicode just consider unifying it with Coptic, with the few additions needed? Unicode already chose to unify Egyptian hieroglyphs with Egyptian Hieratic, but this is quite controversial, as Hieratic is very far from the hieroglyphs (currently encoded in their monumental form carved on stone, which was used with lots of variants over two millennia) and much nearer to Demotic.
So maybe Demotic will be encoded separately by Unicode (to avoid breaking the modern Coptic script still used today) but unified with Hieratic (which would then be separated from the hieroglyphs). This would create a Unicode "Hieratic-Demotic" script, i.e. "Late Egyptian Cursive" (not to be confused with "Egyptian Cursive Hieroglyphs", which are extremely similar to the older monumental hieroglyphs but were developed to be painted on papyrus instead of being carved into monumental stone, so their forms are much less angular and somewhat simplified by the speed of drawing with a brush, the lower precision of the brush, and the diffusion of ink on papyrus). For now nothing is decided. But Egyptologists already have their tools to create documents easily and discuss them... using a rich-text form.
There are other existing fonts, but most of them are not free. They initially required proprietary rich-text formats, but this is no longer the case with free office suites like LibreOffice and OpenOffice (which can also process MS Office formats, and which support the ODF format as well as the old MS formats). Note that ODF is easily convertible to HTML+CSS, which makes publication on the web possible as well.
Note that for Egyptian Demotic you need far fewer characters than for Egyptian hieroglyphs and Egyptian Hieratic: using the Coptic set (mostly based on Greek) with a few diacritics (far fewer than those used in Classical Greek!) along with rich text and specific font designs is still the best choice today.
But the most important problem with borrowing the Coptic script for writing Demotic is directionality (note that this is also a problem within the Greek script when writing Ancient Greek...).
Also, Unicode still does not support boustrophedon correctly, and does not provide a suitable layout model for the hieroglyphs that are already encoded, at the level at which Unicode adapted its model for Hangul syllable-square composition and for the vertical rendering of sinographic scripts! This will also be a problem for other scripts still to be encoded (e.g. SignWriting, or chemical, mathematical and musical notations), all of which have modern uses but require specific layouts that are still not representable in plain text with the Unicode encoding alone.
So you can't do everything you want with Unicode plain text alone, and you need rich-text formats. A solution may be found with HTML+CSS, then supported by OpenType, long before Unicode decides to do something, or resigns itself to doing nothing for a long time (because most modern scripts are already encoded, and there are fewer companies interested in paying for the development of paleographic scripts, and in paying their membership to add them and work on them), or receives new proposals for encoding complex text layouts better than just basic directionality (and the syllabic square layouts of Hangul, or the Arabic-like and Brahmic conjoining layouts, all of which are fully supported by their specific properties)!
Another source you may look at, for a candidate font is
http://paleography.atspace.com/
which introduces this set of 279 paleographic fonts for 30 old scripts, available at:
https://download.cnet.com/Paleofonts/3000-2190_4-10547504.html
or individually at:
https://github.com/reclaimed/paleofonts
(which is where all the archived fonts now reside).
However, this huge set contains only one "Demotic" font (in fact for the "Meroitic Demotic" script, not Egyptian Demotic), and it has only partial coverage, with just mappings on top of ASCII Latin letters and without the needed diacritics and necessary ligatures. And this legacy font set does not have the quality we expect today: no OpenType features (only TrueType), missing or incomplete Unicode mappings, partial coverage, poor metrics, no hinting. They are just good enough to replace fallback fonts that would otherwise display mojibake, or for legacy texts transliterated into other input scripts.
So many of these paleographic scripts will have to be developed by community efforts (e.g. within the open-sourced Noto project, with the help of Unicode contributors and other open-source contributors to work on them and to find and discuss the rare resources used by paleographers). You'll have to be very patient, or try to develop your own community of interest among the rare linguists spread across universities around the world on very small budgets, who often have little knowledge of the technical requirements for developing modern fonts.
However, there is now a renewal of effort, because the tools for developing fonts are easier and more reliable to use, and just a few people with good contacts (in various working languages) could seriously help develop this support, which many linguists and poorly funded students would appreciate for their work to revive this important piece of human heritage: Egyptian Demotic, with its 2,600 years of active use and its real importance for the many cultures it was in contact with, is really a big gap we should fill. Unicode is just waiting for proposals, active experimentation and discussion (which should also involve other standards bodies, like the W3C for CSS Text and OpenType for font designs, and various OS vendors). Of course, if this development requires encoding additional characters in the UCS for plain-text use of these scripts, the ISO working groups will be involved too and will need to agree with Unicode (but we know this can take many years after a new script is proposed for encoding, or after any existing script is proposed for disunification).

Why isn't there a font that contains all Unicode glyphs?

Pretty much as the title says. Rendering all of Unicode correctly, what with composite characters, characters that affect other characters, and ligatures, is really hard, I understand that. We have fonts that seem to be designed for maximum Unicode symbol support (Symbola, Code2001, others) and specialized fonts for certain planes or character ranges (BabelStone Han, others).
I don't know much about the underlying technical details for fonts. Is there a maximum size? Is it a copyright problem? Is essentially redrawing all ~110,000 extant glyphs too hard? I understand style concerns, but why not fall back to a 'default' font that had glyphs for everything? They're on unicode.org, redrawing them all would be pretty hard work but then you'd have a guaranteed fallback font for everything. If you got rights to some pre-existing fonts you could just composite them and that should help a lot. Such a font would be a great help to humanity and I can't see a good technical reason why it doesn't exist or at least an open-source effort to create it, so I presume an invisible-to-me reason why it can't be done.
What is that reason?
"Why would you even want that?" questions aside, from a programming perspective there's a very simple reason: the OpenType spec only affords an addressable glyph index space of one USHORT, so one font can only support 16 bits worth of glyphs identifiers, or 65,536 glyphs max. (And note the terminology: a "glyph" is not the same as a "character" or "letter")
The current version of Unicode, v8 as of this answer, contains 120,737 assigned code points, or almost twice as many as fit in a modern font (2021 edit: v13 upped this number to 143,859). In fact, Unicode hasn't been able to fit in a modern OpenType font since 2001, with the release of Unicode 3.1, which upped the number of code points from 49,259 to 94,205.
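If you want to see that ceiling for yourself, here is a small sketch using the fontTools Python library (the font file name is a placeholder, not a specific font from this answer):

    # pip install fonttools; "SomeFont.ttf" is a hypothetical local font file.
    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")
    print(len(font.getGlyphOrder()))   # total number of glyphs in the font
    print(font["maxp"].numGlyphs)      # same count, stored as a 16-bit field,
                                       # so it can never exceed 65,535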
"So what about font collections?" I hear you ask. Why not use multiple fonts and support all unicode that way? Well now, you've just described Adobe's Sans Pro, and Google's Noto (which are the same font).
As for the "how hard can it be": a uniform style for all glyphs in Unicode, across 129 established written scripts on this planet, each with their own typesetting rules? Incredibly hard. You may think fonts are just files with pictures for letters, and someone types a letter, that picture shows up: that is not how fonts work, and isn't how fonts have worked since the late 1980's.
Modern fonts are the typographic equivalent of a game ROM: sure, it's not much use without the hardware or software to run that ROM on, but all the things that actually matter are in the ROM. Similarly, modern fonts contain all the information for typesetting. Not just pictures: they contain the metadata, the metrics, the positioning and substitution rules for arbitrary sequences, with separate rule sets for each written script that OpenType supports, mandatory and optional ligatures, language-specific character replacements for letters at the start/middle/final position in a word or in isolation, character repositioning relative to arbitrarily complex sequences of other characters either before or after it, arbitrarily complex sequence replacements with other arbitrarily complex sequences, possible bitmap fallbacks for small-point rendering, hinting instructions on how to properly rasterize vector graphics that are inherently not aligned to any particular pixel grid, and more. A modern font is a ridiculously complex application that a font engine consults to figure out how to typeset sequences of code points.
Making a (set of) Unicode-encompassing font(s) that looks good for all contexts is a vast team effort.
So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001. We can, and do, make font families that cover all of Unicode, but with 129 different scripts all with their own typesetting rules, it's a lot of work, and almost (almost) not worth the effort compared to only covering a subset of all languages.
And as for this:
Such a font would be a great help to humanity and I can't see a good technical reason why it doesn't exist or at least an open-source effort to create it, so I presume an invisible-to-me reason why it can't be done.
Just because you didn't know about them doesn't mean they don't exist; millions of people are familiar with them. They exist =)
They're even open source, go out and thank the people who made them!
There is GNU Unifont. It aims to contain all Unicode, except Apple Emoji.
You will probably find what you are looking for at the following links.
Unicode Character Table
HTML Character Entity References
Huge List of Unicode Symbols
List of Unicode Characters of Category "Other Symbol"
This one is fun for finding a particular character, since you can draw the character you are searching for:
Unicode Character Recognition
Can't enter unicode character with Alt+ even with EnableHexNumpad
Basic Questions
Q: How many characters are in Unicode?
A: The short answer is that as of Version 13.0, the Unicode Standard contains 143,859 characters. The long answer is rather more complicated, because of all the different kinds of characters that people might be interested in counting.
Unicode font
A Unicode font is a computer font that maps glyphs to code points defined in the Unicode Standard. The vast majority of modern computer fonts use Unicode mappings, even those fonts which only include glyphs for a single writing system, or even only support the basic Latin alphabet.
Fonts which support a wide range of Unicode scripts and Unicode symbols are sometimes referred to as "pan-Unicode fonts", although as the maximum number of glyphs that can be defined in a TrueType font is restricted to 65,535, it is not possible for a single font to provide individual glyphs for all defined Unicode characters (143,859 characters, with Unicode 13.0).
...
No single "Unicode font" includes all the characters defined in the present revision of ISO 10646 (Unicode) standard, as more and more languages and characters are continually added to it, and common font formats cannot contain more than 65,535 glyphs (about half the number of characters encoded in Unicode).
As a result, font developers and foundries incorporate new characters in newer versions or revisions of a font, or in separate auxiliary fonts intended specifically for particular languages.
Enjoy!

What's the big deal with unicode?

I've heard a lot of people talk about how some new version of a language now supports Unicode, and what an achievement Unicode is. What's the big deal about being able to support a new character set? It seems like something which would rarely, if ever, be used, but people mention it quite often. What's the benefit, and why do people use or even care about Unicode?
Programming languages are used to produce software.
Software is used to solve problems faced by humans.
Producing software has a cost.
Software that solves problems for humans produces value. This value can be expressed in the form of profit, or the reduction of costs, depending on the business model of the software developer. How the value is expressed is irrelevant for the purposes of this discussion; what is relevant is that net value is produced.
There are seven billion humans in the world. A significant fraction of them are most comfortable reading text that is not written in the Latin alphabet.
Software which purports to solve a problem for some fraction of those seven billion humans who do not use the Latin alphabet does so more effectively if developers can easily manipulate text written in non-Latin alphabets.
Therefore, a programming language which supports non-Latin character sets lowers the costs of software developers, thereby enabling them to solve more problems for more people at lower costs, and thereby produce more value.
Unicode is the de facto standard for manipulation of non-Latin text.
Therefore, Unicode is important to the design and implementation of programming languages.
Our goal as programming language designers is the creation of tools which produce maximum value. Supporting Unicode is an easy way to massively increase the scope and range of real human problems that can be solved in software.
In the beginning, there were 256 possible characters and many different code pages to represent them. It became a tangled mess. Supporting multiple languages and multiple character sets became a programmer's nightmare.
Then the Unicode Consortium was formed. It created a standard that would allow a single character set, initially with 256 x 256 = 65,536 characters (and since extended to over a million code points, plus combinations thereof), to include almost all languages of the world.
The biggest advantage is that a single character string may contain multiple languages. That is no small thing.
Unicode has been the native character specification used in Windows ever since Windows 2000. It is also allowed as a character set in HTML and on websites.
If your application does not support Unicode, or is not planning to support it, then it is only a matter of time until your application will be left behind.
What's the big deal about being able to support a new character set?
Unicode is not just "a new characterset". It's the character set that removes the need to think about character sets.
How would you rather write a string containing the Euro sign?
"\x80", "\x88", "\x9c", "\x9f", "\xa2\xe3", "\xa2\xe6", "\xa3\xe1", "\xa4", "\xa9\xa1", "\xd9\xe6", "\xdb", or "\xff" depending upon the encoding.
"\u20AC", in every locale, on every OS.
Unicode can support pretty much any language in the world. Without such an encoding you would have to worry about choosing the correct encoding for different languages, which is very bothersome (not to mention mixing multiple languages in the same text block, ugh)
Unicode support in a language means that the language's native character/string type supports all those languages as well, without the user having to worry about character encodings or multibyte characters and such while doing computations. Of course, one still has to acknowledge character encodings when doing I/O, but doing your string processing in one single sensible encoding helps a lot.
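A minimal sketch of that pattern in Python (file names here are hypothetical): decode at the input boundary, process as text, encode at the output boundary.

    # Bytes at the edges, Unicode text inside.
    with open("input.txt", encoding="utf-8") as f:     # decode on the way in
        text = f.read()
    result = text.upper()                              # all processing on str
    with open("output.txt", "w", encoding="utf-8") as f:
        f.write(result)                                # encode on the way out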
Well, if you care at all about internationalization (AKA the rest of the world), scientific notation, etc., you would care about Unicode. Unicode is difficult to deal with because we have been so ingrained in ASCII-only support. But now that modern systems support Unicode, there is really no reason not to just encode your text as UTF-8. I work in publishing, and for a long time we had to do hacks like inserting GIF images of formulas, etc. Now we can put Unicode straight in, people can search and copy and paste, and our code can deal with it by using Unicode regexes, etc.
If you wish to communicate with someone whose native language is not English (either the British or American variants), you care. A lot.
As everyone says - support for all the character sets and formatting used by every other language and locale in the world. Open-source and commercial developers both like that because it increases their potential user base by about 20-fold (and growing).
Unicode is a good thing because it eliminates character set problems and leaves one less thing to worry about. Even if your software never leaves the U.S., you never know when you're going to run into a filename or text field with an odd character in it, and Unicode lets you live in ignorance.
Americans like Daisetsu may not care about Unicode, but the rest of the world uses a bit more than 26 Latin letters, and there Unicode is heavily used.
We had hundreds of messed up charsets in the past solely because American computer scientists thought "why would anyone want to use more than 26 Latin characters like we have in English?"
Narrow-mindedness is a bad thing.

Why isn't everything we do in Unicode?

Given that Unicode has been around for 18 years, why are there still apps that don't have Unicode support? Even my experiences with some operating systems and Unicode have been painful to say the least. As Joel Spolsky pointed out in 2003, it's not that hard. So what's the deal? Why can't we get it together?
Start with a few questions
How often...
do you need to write an application that deals with something else than ascii?
do you need to write a multi-language application?
do you write an application that has to be multi-language from its first version?
have you heard that Unicode is used to represent non-ascii characters?
have you read that Unicode is a charset? That Unicode is an encoding?
do you see people confusing UTF-8 encoded bytestrings and Unicode data?
Do you know the difference between a collation and an encoding?
Where did you first hear of Unicode?
At school? (really?)
at work?
on a trendy blog?
Have you ever, in your young days, experienced moving source files from a system in locale A to a system in locale B, edited a typo on system B, saved the files, b0rking all the non-ascii comments and... ending up wasting a lot of time trying to understand what happened? (did your editor mix things up? the compiler? the system? the... ?)
Did you end up deciding that never again you will comment your code using non-ascii characters?
Have a look at what's being done elsewhere
Python
Did I mention on SO that I love Python? No? Well I love Python.
But until Python 3.0, its Unicode support sucked. And there were all those rookie programmers, who at that time barely knew how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere when trying to deal with non-ascii characters. Well, they basically got life-traumatized by the Unicode monster, and I know a lot of very efficient/experienced Python coders who are still frightened today by the idea of having to deal with Unicode data.
And with Python 3, there is a clear separation between Unicode and bytestrings, but... look at how much trouble it is to port an application from Python 2.x to Python 3.x if you previously did not care much about that separation, or if you don't really understand what Unicode is.
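For those who haven't seen it, a tiny sketch of what that separation looks like in Python 3:

    text = "café"                    # str: a sequence of Unicode code points
    data = text.encode("utf-8")      # bytes: b'caf\xc3\xa9'
    assert data.decode("utf-8") == text
    # Mixing the two types now fails loudly instead of corrupting data:
    # "café" + b"!"  ->  TypeError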
Databases, PHP
Do you know a popular commercial website that stores its international text as Unicode?
You will (perhaps) be surprised to learn that the Wikipedia backend does not store its data as Unicode at the database level: all text is encoded in UTF-8 and stored as binary data in the database.
One key issue here is how to sort text data if you store it as Unicode code points. This is where Unicode collations come in, which define a sorting order on Unicode code points. But proper support for collations in databases is missing or still in active development. (There are probably a lot of performance issues, too. -- IANADBA) Also, there is no widely accepted standard for collations yet: for some languages, people don't agree on how words/letters/word groups should be sorted.
Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing it.) Of course it's critical for database storage, or for local comparisons. But PHP, for example, has only provided support for normalization since 5.2.4, which came out in August 2007.
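A small Python sketch of why normalization matters before storing or comparing strings:

    import unicodedata

    composed = "\u00e9"              # 'é' as one precomposed code point
    decomposed = "e\u0301"           # 'e' followed by a combining acute accent
    print(composed == decomposed)                                  # False
    print(unicodedata.normalize("NFC", decomposed) == composed)    # True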
And in fact, PHP does not completely support Unicode yet. We'll have to wait for PHP 6 to get Unicode-compatible functions everywhere.
So, why isn't everything we do in Unicode?
Some people don't need Unicode.
Some people don't care.
Some people don't understand that they will need Unicode support later.
Some people don't understand Unicode.
For some others, Unicode is a bit like accessibility for webapps: you start without it, and add support for it later.
A lot of popular libraries/languages/applications lack proper, complete Unicode support, not to mention collation & normalization issues. And until all items in your development stack completely support Unicode, you can't write a clean Unicode application.
The Internet clearly helps spreading the Unicode trend. And it's a good thing. Initiatives like Python3 breaking changes help educating people about the issue. But we will have to wait patiently a bit more to see Unicode everywhere and new programmers instinctively using Unicode instead of Strings where it matters.
As an anecdote: because FedEx apparently does not support international addresses, the Google Summer of Code '09 students were all asked by Google to provide an ascii-only name and address for shipping. If you think that most business actors understand the stakes behind Unicode support, you are just wrong. FedEx does not understand, and their clients do not really care. Yet.
Many product developers don't consider their apps being used in Asia or other regions where Unicode is a requirement.
Converting existing apps to Unicode is expensive and usually driven by sales opportunities.
Many companies have products maintained on legacy systems and migrating to Unicode means a totally new development platform.
You'd be surprised how many developers don't understand the full implications of Unicode in a multi-language environment. It's not just a case of using wide strings.
Bottom line - cost.
Probably because people are used to ASCII and a lot of programming is done by native English speakers.
IMO, it's a function of collective habit, rather than conscious choice.
The widespread availability of development tools for working with Unicode may be a more recent event than you suppose. Working with Unicode was, until just a few years ago, a painful task of converting between character formats and dealing with incomplete or buggy implementations. You say it's not that hard, and as the tools improve that is becoming more true, but there are a lot of ways to trip up unless the details are hidden from you by good languages and libraries. Hell, just cutting and pasting unicode characters could be a questionable proposition a few years back. Developer education also took some time, and you still see people make a ton of really basic mistakes.
The Unicode standard weighs probably ten pounds. Even just an overview of it would have to discuss the subtle distinctions between characters, glyphs, codepoints, etc. Now think about ASCII. It's 128 characters. I can explain the entire thing to someone that knows binary in about 5 minutes.
I believe that almost all software should be written with full Unicode support these days, but it's been a long road to achieving a truly international character set with encoding to suit a variety of purposes, and it's not over just yet.
Laziness, ignorance.
One huge factor is programming language support, most of which use a character set that fits in 8 bits (like ASCII) as the default for strings. Java's String class uses UTF-16, and there are others that support variants of Unicode, but many languages opt for simplicity. Space is so trivial of a concern these days that coders who cling to "space efficient" strings should be slapped. Most people simply aren't running on embedded devices, and even devices like cell phones (the big computing wave of the near future) can easily handle 16-bit character sets.
Another factor is that many programs are written only to run in English, and the developers (1) don't plan (or even know how) to localize their code for multiple languages, and (2) they often don't even think about handling input in non-Roman languages. English is the dominant natural language spoken by programmers (at least, to communicate with each other) and to a large extent, that has carried over to the software we produce. However, the apathy and/or ignorance certainly can't last forever... Given the fact that the mobile market in Asia completely dwarfs most of the rest of the world, programmers are going to have to deal with Unicode quite soon, whether they like it or not.
For what it's worth, I don't think the complexity of the Unicode standard is that big of a contributing factor for programmers, but rather for those who must implement language support. When programming in a language where the hard work has already been done, there is even less reason not to use the tools at hand. C'est la vie, old habits die hard.
All operating systems until very recently were built on the assumption that a character was a byte. Their APIs were built like that, the tools were built like that, the languages were built like that.
Yes, it would be much better if everything I wrote was already... err... UTF-8? UTF-16? UTF-7? UTF-32? Err... mmm... It seems that whatever you pick, you'll annoy someone. And, in fact, that's the truth.
If you pick UTF-16, then all of your data (as in, pretty much the whole economy of the Western world) stops being seamlessly readable, as you lose ASCII compatibility. Add to that, a byte ceases to be a character, which seriously breaks the assumptions upon which today's software is built. Furthermore, some countries do not accept UTF-16. Now, if you pick ANY variable-length encoding, you break some basic premises of lots of software, such as not needing to traverse a string to find the nth character, or being able to read a string from any point in it.
And then there's UTF-32... well, that's four bytes per character. What was the average hard drive or memory size just 10 years ago? UTF-32 was too big!
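A short Python sketch of the trade-offs being described (the strings are arbitrary examples):

    s = "ASCII text"
    print(s.encode("utf-8"))       # b'ASCII text' -- byte-for-byte plain ASCII
    print(s.encode("utf-16-le"))   # b'A\x00S\x00...' -- ASCII compatibility is gone
    print(len("é".encode("utf-32-le")))   # 4 -- every character costs 4 bytes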
So, the only solution is to change everything (software, utilities, operating systems, languages, tools) at the same time to be i18n-aware. Well. Good luck with "at the same time".
And if we can't do everything at the same time, then we always have to keep an eye out for stuff which hasn't been internationalized. Which causes a vicious cycle.
It's easier for end user applications than for middleware or basic software, and some new languages are being built that way. But... we still use Fortran libraries written in the 60s. That legacy, it isn't going away.
Because UTF-16 became popular before UTF-8 and UTF-16 is a pig to work with. IMHO
Because for 99% of applications, Unicode support is not a checkbox on the customer's product comparison matrix.
Add to the equation:
It takes a conscious effort with almost no readily visible benefit.
Many programmers are afraid of it or don't understand it.
Management REALLY doesn't understand it or care about it, at least not until a customer is screaming about it.
The testing team isn't testing for Unicode compliance.
"We didn't localize the UI, so non-English speakers wouldn't be using it anyway."
Tradition and attitude. ASCII and computers are sadly synonyms to many people.
However, it would be naïve to think that the rôle of Unicode is only a matter of exotic languages from Eurasia and other parts of the world. A rich character encoding has a lot to bring even to "plain" English text. Look in a book sometime.
I would say there are mainly two reasons. The first is simply that the Unicode support of your tools just isn't up to snuff. C++ still doesn't have Unicode support, and won't get it until the next standard revision, which will take maybe a year or two to be finished and then another five or ten years to come into widespread use. Many other languages aren't much better, and even if you finally have Unicode support, it might still be more cumbersome to use than plain ASCII strings.
The second reason is in part what is causing the first issue: Unicode is hard. It's not rocket science, but it gives you a ton of problems that you never had to deal with in ASCII. With ASCII you had a clear one byte == one glyph relationship, could address the Nth character of a string with a simple str[N], could just keep all the characters of the whole set in memory, and so on. With Unicode you no longer can do that: you have to deal with the different ways Unicode is encoded (UTF-8, UTF-16, ...), byte order marks, decoding errors, lots of fonts that have only a subset of the characters you would need for full Unicode support, more glyphs than you want to keep in memory at a given time, and so on.
ASCII could be understood just by looking at an ASCII table without any further documentation; with Unicode that is simply no longer the case.
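A Python sketch of the kind of assumption that breaks:

    s = "naïve"
    data = s.encode("utf-8")
    print(len(s))       # 5 code points
    print(len(data))    # 6 bytes: 'ï' occupies two bytes (0xC3 0xAF)
    print(data[2])      # 195 (0xC3): indexing bytes no longer gives "the 3rd character"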
Because of the inertia caused by C++. It had (has) horrible unicode support and dragged back the developers.
I personally do not like how certain formats of Unicode break things so that you can no longer do string[3] to get the 3rd character. Sure, it could be abstracted away, but imagine how much slower a big project that is heavy on strings, such as GCC, would be if it had to traverse a string to figure out the nth character. The only option is caching where the "useful" positions are, and even then it's slow, and in some formats you're now taking a good 5 bytes per character. To me, that is just ridiculous.
More overhead, space requirements.
I suspect it's because software has such strong roots in the west. UTF-8 is a nice, compact format if you happen to live in America. But it's not so hot if you live in Asia. ;)
Unicode requires more work (thinking), you usually only get paid for what is required so you go with the fastest less complicated option.
Well, that's from my point of view. I guess if you expect code to use std::wstring hw(L"hello world"), you have to explain how it all works: to print a wstring you need wcout, as in std::wcout << hw << std::endl; (I think) (though endl seems fine...). So it seems like more work to me. Of course, if I were writing an international app I would have to invest in figuring it out, but until then I don't (as I suspect most developers don't).
I guess this goes back to money, time is money.
It's simple. Because we only have ASCII characters on our keyboards, why would we ever encounter, or care about, characters other than those? It's not so much an attitude as it is what happens when a programmer has never had to think about this issue, or has never encountered it, and perhaps doesn't even know what Unicode is.
edit: Put another way, Unicode is something you have to think about, and thinking is not something most people are interested in doing, even programmers.

Why use Unicode if your program is English only?

So I've read Joel's article, and looked through SO, and it seems the only reason to switch from ASCII to Unicode is for internationalization. The company I work for, as a policy, will only release software in English, even though we have customers throughout the world. Since all of our customers are scientists, they have functional enough English to use our software as a non-native speaker. Or so the logic goes. Because of this policy, there is no pressing need to switch to Unicode to support other languages.
However, I'm starting a new project and wanted to use Unicode (because that is what a responsible programmer is supposed to do, right?). In order to do so, we would have to start converting all of the libraries we've written into Unicode. This is no small task.
If internationalization of the programs themselves is not considered a valid reason, how would one justify all the time spent recoding libraries and programs to make the switch to Unicode?
This obviously depends on what your app actually does, but just because you only have an English version in no way means that internationalization is not an issue.
What if I want to store a customer name which uses non-english characters? Or the name of a place in another country?
As an added bonus (since you say you're targeting scientists), all sorts of scientific symbols and notations are supported as part of Unicode.
Ultimately, I find it much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on. Non-unicode means that you use some locale-dependant character set or codepage by default, and so text that looks fine on your computer may be full of garbage characters on someone else's.
Apart from that, you probably don't need to translate all your libraries to Unicode in one go. Write wrappers as needed to convert between Unicode and whichever encoding you use otherwise.
If you use UTF-8 for your Unicode text, you even get the ability to read plain ASCII strings, which should save you some conversion headaches.
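For instance, a sketch of such a wrapper in Python (Windows-1252 is just an assumed legacy encoding here):

    def from_legacy(data: bytes) -> str:
        """Decode bytes from the legacy encoding into Unicode text."""
        return data.decode("windows-1252")

    def to_legacy(text: str) -> bytes:
        """Encode Unicode text back to the legacy encoding for old callers."""
        return text.encode("windows-1252", errors="replace")

    # And because ASCII is a subset of UTF-8, plain ASCII data needs no conversion:
    print(b"hello".decode("utf-8"))   # 'hello'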
They say they will always release it in English now, but you admit you have worldwide clients. If a client comes in and says internationalization is a deal breaker, will they really turn them down?
To clarify the point I'm trying to make: you say that they will not accept this reasoning, but it is sound.
Always better to be safe than sorry, IMO.
The extended Scientific, Technical and Mathematical character set rules.
Where else can you say ⟦∀c∣c∈Unicode⟧ and similar technical stuff.
Characters beyond the 7-bit ASCII range are useful in English as well. Does anyone using your software even need to write the € sign? Or £? How about distinguishing "résumé" from "resume"? You say it's used by scientists around the world, who may have names like "Jörg" or "Guðmundsdóttir". In a scientific setting, it is useful to talk about wavelengths like λ, units like Å, or angles as Θ, even in English.
Some of these characters, like "ö", "£", and "€" may be available in 8-bit encodings like ISO-8859-1 or Windows-1252, so it may seem like you could just use those encodings and be done with it. The problem is that there are characters outside of those ranges that many people use very frequently, and so lots of existing data is encoded in UTF-8. If your software doesn't understand that when importing data, it may interpret the "£" character in UTF-8 as a sequence of 2 Windows-1252 characters, and render it as "Â£". If this sort of error goes undetected for long enough, you can start to get your data seriously garbled, as multiple passes of misinterpretation alter your data more and more until it becomes unrecoverable.
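Here is a sketch of that failure mode in Python:

    original = "£"
    utf8_bytes = original.encode("utf-8")           # b'\xc2\xa3'
    misread = utf8_bytes.decode("windows-1252")     # 'Â£' -- already wrong
    print(misread)
    # A second round of the same mistake garbles it further:
    print(misread.encode("utf-8").decode("windows-1252"))   # 'Ã‚Â£'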
And it's good to think about these issues early on in the design of your program. Since strings tend to be very low-level concept that are threaded throughout your entire program, with lots of assumptions about how they work implicit in how they are used, it can be very difficult and expensive to add Unicode support to a program later on if you have never even thought about the issue to begin with.
My recommendation is to always use Unicode capable string types and libraries wherever possible, and make sure any tests you have (whether they be unit, integration, regression, or any other sort of tests) that deal with strings try passing some Unicode strings through your system to ensure that they work and come through unscathed.
If you don't handle Unicode, then I would recommend ensuring that all data accepted by the system is 7-bit clean (that is, there are no characters beyond the 7-bit US-ASCII range). This will help avoid problems with incompatibilities between 8-bit legacy encodings like the ISO-8859 family and UTF-8.
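A trivial sketch of such a check in Python:

    def is_seven_bit_clean(data: bytes) -> bool:
        """True if every byte is in the 7-bit US-ASCII range."""
        return all(b < 0x80 for b in data)

    print(is_seven_bit_clean(b"plain ASCII"))            # True
    print(is_seven_bit_clean("café".encode("utf-8")))    # False: 'é' needs bytes >= 0x80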
Suppose your program allows me to put my name in it, on a form, in a dialog, whatever, and my name can't be written with ASCII characters... Even though your program is in English, the data may be in another language...
It doesn't matter that your software is not translated, if your users use international characters then you need to support unicode to be able to do correct capitalization, sorting, etc.
If you have no business need to switch to Unicode, then don't do it. I'm basing this on the fact that you thought you'd need to change code unrelated to the component you already need to change to make it all work with Unicode. If you can make the component/feature you're working on "Unicode ready" without spreading code churn to lots of other components (especially components without good test coverage), then go ahead and make it Unicode ready. But don't churn your whole codebase without a business need.
If the business need arises later, address it then. Otherwise, you aren't going to need it.
People in this thread may suppose scenarios where it becomes a business requirement. Run those scenarios by your product managers before considering them scenarios worth addressing. Make sure they know the cost of addressing them when you ask.
Well, for one, your users might know and understand English, but they can still have 'local' names. If you allow your users to do any kind of input in your application, they might want to use characters that are not part of ASCII. If you don't support Unicode, you will have no way of allowing these names. You'd be forcing your users to adopt a simpler name just because the application isn't smart enough to handle special characters.
Another thing is that even if the standard right now is that the app will only be released in English, by sticking with ASCII you are also blocking the possibility of internationalization, adding to the work that needs to be done when company policy decides that translations are a good thing. Company policy is good, but has also been known to change.
I'd say this attitude expressed naïveté, but I wouldn't be able to spell naïveté in ASCII-only.
ASCII still works for some computer-only codes, but is no good for the façade between machine and user.
Even without the New Yorker's old-fashioned style of coöperation, how would some poor woman called Zoë cope if her employers used such a system?
Alas, she wouldn't even seek other employment, as updating her résumé would be impossible, and she'd have to resume instead. How's she going to explain that to her fiancée?
The company I work for, **as a policy**, will only release software in English, even though we have customers throughout the world.
1 reason only: Policies change, and when they change, they will break your existing code. Period.
Design for evil, and you have a chance of not breaking your code so soon. In this case, use Unicode. This happened to me on a Brazil-specific stock-market legacy system.
Many languages (Java [and thus most JVM-based language implementations], C# [and thus most .NET-based language implementations], Objective-C, Python 3, ...) support Unicode strings by preference or even (nearly) exclusively (you have to go out of your way to work with "strings" of bytes rather than of Unicode characters).
If the company you work for ever intends to use any of these languages and platforms, it would therefore be quite advisable to start planning a Unicode-support strategy; a pilot project in particular might not be a bad idea.
That's a really good question. The only reason I can think of that has nothing to do with I18n or non-English text is that Unicode is particularly suited to being what might be called a hub character set. If you think of your system as a hub with its external dependencies as spokes, you want to isolate character encoding conversions to the spokes, so that your hub system works consistently in your chosen encoding. What makes Unicode an ideal character set for the hub of your system is that it acknowledges the existence of other character sets, it defines equivalences between its own characters and characters in those external character sets, and there's an ongoing process whereby it extends itself to keep up with the innovation and evolution of external character sets. There are all sorts of weird encodings out there: even when the documentation assures you that the external system or library is using plain ASCII, it often turns out to be some variant like IBM775 or HPRoman8, and the nice thing about Unicode is that no matter what encoding is thrown at you, there's a good chance that there's a table on unicode.org that defines exactly how to convert that data into Unicode and back out again without losing information. Then again, equivalents of a-z are fairly well defined in every character set, so if your data really is restricted to the standard English alphabet, ASCII may do just as well as a hub character set.
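A tiny Python sketch of the hub idea (IBM775 is one of the "plain ASCII" impostors mentioned above; the byte value is an arbitrary example):

    raw = bytes([0x8F])            # a byte from an external system that claimed "ASCII"
    text = raw.decode("cp775")     # in IBM775 this byte is 'Å'
    print(text)                    # handled internally as Unicode
    print(text.encode("cp1252"))   # re-encoded for another spoke: b'\xc5'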
A decision on encoding is a decision on two things: what set of characters is permitted, and how those characters are represented. Unicode permits you to use pretty much any character ever invented, but you may have your own reasons not to want or need such a wide choice. You might still restrict usernames, for example, to combinations of a-z and underscore, maybe because you have to put them into an external LDAP system whose own character set is restricted, maybe because you need to print them out using a font that doesn't cover all of Unicode, maybe because it closes off the security problems opened up by lookalike characters. If you're using something like ASCII or ISO-8859-1, the storage/transmission layer implements a lot of those restrictions; with Unicode the storage layer doesn't restrict anything, so you might have to implement your own rules at the application layer. This is more work: more programming, more testing, more possible system states. The tradeoff for that extra work is more flexibility, application-level rules being easier to change than system encodings.
The reason to use unicode is to respect proper abstractions in your design.
Just get used to treating the concept of text properly. It is not hard. There's no reason to create a broken design even if your users are English.
Just think of a customer wanting to use names like Schrödinger's Cat for files he saves using your software. Or imagine a localized Windows with a translation of My Documents that uses non-ASCII characters. That would be internationalization that has effects on your software, even though you don't support internationalization at all.
Also, having the option of supporting internationalization later is always a good thing.
Internationalization is so much more than just text in different languages. I bet it's the niche of the future in the IT world. Heck, it already is. A lot has already been said; I just thought I would add a small thing. Even though your customers right now are satisfied with English, that might change in the future. And the longer you wait, the harder it will be to convert your code base. They might even today have problems with e.g. file names or other types of data you save/load in your application.
Unicode is like cooties. Once it "infects" one area, it's usually hard to contain, given the interconnectedness of dependencies. Sooner or later, you'll probably have to tie in a library that is Unicode compliant and thus will use wchar_t or the like. Instead of marshalling between character types, it's nice to have consistent strings throughout.
Thus, it's nice to be consistent. Otherwise you'll end up with something similar to the Windows API that has a "A" version and a "W" version for most APIs since they weren't consistent to start with. (And in some cases, Microsoft has abandoned creating "A" versions altogether.)
You haven't said what language you're using. In some languages, changing from ASCII to Unicode may be pretty easy, whereas in others (which don't support Unicode) it might be pretty darn hard.
That said, maybe in your situation you shouldn't support Unicode: you can't think of a compelling reason why you should, and there are some reasons (i.e. your cost to change your existing libraries) which argue against. I mean, perhaps 'ideally' you should but in practice there might be some other, more important or more urgent, thing to spend your time and effort on at the moment.
If a program takes text input from the user, it should use Unicode; you never know what language the user is going to use.
When using Unicode, it leaves the door open for internationalization if requirements ever change and you are required to use text in other languages than English.
Also, in your new project you could always just write wrappers for the libraries that internally convert between ASCII and Unicode and vice-versa.
Your potential client may already be running a non-Unicode application in a language other than English, and won't be able to run your program without switching the Windows locale for non-Unicode programs back and forth, which will be a big pain.
Because the internet overwhelmingly uses Unicode. Web pages use Unicode. Text files, including your customers' documents and the data on their clipboards, are Unicode.
Secondly, Windows is natively Unicode, and the ANSI APIs are a legacy.
Modern applications should use Unicode where applicable, which is almost everywhere.