Website localization for multibyte languages

I have started to code a multi-language feature for a medium-sized website with a lot of hardcoded text. As the website is supposed to be translated into Japanese and Korean (multibyte character set) I am considering the following:
If I use string externalization, do the strings for Japanese or Korean need to be in Unicode form within the locale file (i.e. &#21488;&#21271; instead of 台北 as the string value)?
Would it make more sense to store the localization in a DB (e.g. MySQL) and retrieve the respective values via a localization function in PHP?
Your input is much appreciated.
Best regards

$0.02 from someone who has some experience with i18n...
Keep your translations in human-readable form, as it will likely be translators and not coders managing these resources.
If this text (hard-coded, you say) is not subject to frequent change, then you may wish to store these resources as files that you read in at runtime.
If this text is subject to frequent change, then you may wish to explore other alternatives for storing resources, such as databases or in-memory key-value stores.
Depending upon your requirements, you may want to consider a mixture of the above.
But I strongly suggest that you avoid mixing code (the HTML character entities) with your translation resources. Most translators will not understand what they mean and may break them when they are translating. And on the flip-side, a programmer may not understand how to insert code or formatting into the translation resources properly, unless they actually understand that language.
tl;dr
- use UTF-8
- don't mix any code/formatting into the translations themselves
- how you store the translations depends upon your requirements

I doubt that string externalization would be your biggest problem. But let me give you some advice.
String externalization
Of course you would need to separate translatable strings from the code. I would recommend storing translations in a plain-text, UTF-8 encoded file containing key-value pairs:
some.key=some translation
Of course you would need to write a helper script to resolve these keys at runtime. The script would need to detect the end user's language.
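To make the helper idea concrete, here is a minimal sketch in Python rather than PHP. The function names, the "#" comment convention and the sample catalogs are my own assumptions for illustration, not part of any existing library. In a real site the text passed to parse_resources would come from a file opened with encoding="utf-8".

```python
def parse_resources(text):
    """Parse 'key=value' lines into a dict.

    Blank lines and lines starting with '#' are skipped (the '#' comment
    convention is an assumption made for this sketch).
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            table[key.strip()] = value.strip()
    return table


def translate(key, lang, catalogs, default_lang="en"):
    """Look a key up in the requested language, then in the default language."""
    for candidate in (lang, default_lang):
        value = catalogs.get(candidate, {}).get(key)
        if value is not None:
            return value
    return key  # last resort: show the key so the gap is visible


# Hypothetical catalogs; real ones would be loaded from UTF-8 files.
CATALOGS = {
    "en": parse_resources("greeting=Hello\nfarewell=Goodbye\n"),
    "ja": parse_resources("# partial translation\ngreeting=こんにちは\n"),
}

print(translate("greeting", "ja", CATALOGS))
print(translate("farewell", "ja", CATALOGS))  # falls back to English
```

Falling back to the default language, and finally to the key itself, keeps a half-translated site usable while translators catch up.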
Language detection
Web browsers are nice enough to send an Accept-Language header each time they send a request. What you need to do is read the content of this header and check whether you support any of the languages the user has listed. If so, read the resource file (as defined above) and return strings for the given language; otherwise return your default language. The code example below will give you the most desired language (which is not necessarily one you support):
<?php
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
echo $locale;
?>
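The snippet above only reports the user's top preference. The matching step (walking the user's list until you hit a language you actually support) could look roughly like the following Python sketch. The function names and the default language are hypothetical, and the parsing is deliberately simplified: it handles q-values but ignores wildcards and other corner cases of the header grammar.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, most preferred first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a missing q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        entries.append((-quality, position, tag.strip().lower()))
    entries.sort()  # highest quality first, ties keep header order
    return [tag for neg_quality, _, tag in entries if neg_quality < 0]


def pick_language(header, supported, default="en"):
    """Pick the first supported language, trying 'ko-kr' and then plain 'ko'."""
    for tag in parse_accept_language(header):
        if tag in supported:
            return tag
        primary = tag.split("-")[0]
        if primary in supported:
            return primary
    return default


print(pick_language("fr-CH, ko-KR;q=0.9, en;q=0.8", {"en", "ja", "ko"}))  # ko
```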
This is still not the biggest of your challenges.
Styles and style sheets
The real problem with multilingual web sites or web applications is styles. People tend to put style definitions inline, which is problematic to say the least. Also, designers tend to think that Arial is the best font for the entire universe, and that emphasis always has to come in a bold font. The only problem is that the font might be unreadable under some circumstances.
I must admit, I don't know why it happens, but most of the time web browsers tend to ignore the bold attribute for Asian scripts (which is good). Sometimes they do not, though, and that could become a major challenge for end users if your font definition is, say, font-family:Arial; font-size:10px;.
The other problem could be colors. Depending on your web site design, some colors used might be inappropriate for target customers. That is because we all tend to assign meaning to colors based on our cultural background.
Images containing localizable text could also give you a headache. You would need to either externalize such text (and write it down just like any other HTML element), or prepare a multilingual resource structure (e.g. put all images into directories named after the language code: "en", "ja", "ko").
The real challenge, however, is hard-coded formatting tags like <b>, <i>, <u>, <strong>, etc. Nobody should use them nowadays; style classes should be used instead, but common practice is different. You would probably need to replace them with style classes. Each element can have more than one style class, which to my surprise is not common knowledge (for example <p class="main boldText">).
OK, once you have your styles externalized, you will probably be forced to implement some sort of CSS localization mechanism. This is needed in light of what I wrote above. The easiest way to do that is to create a directory structure similar to the one I mentioned before: "en" for English base CSS files, "ja" for Japanese and "ko" for Korean, so each language has its own separate set of CSS files. This is similar to UI skins, only in this case the user won't be able to choose the skin; you decide which CSS to present, since you detect the language anyway.
As for inline style definitions (<p style="whatever">), once you have defined the CSS L10n mechanism, you can override any style by forcing it with the !important keyword. That is, unless somebody in their very wrong mind put this keyword into an inline style definition.
Concatenations
Well, this is your biggest challenge. Even people who understand the need of string externalization tend to concatenate the strings like this:
$result = $label . ": " . $product;
$message = $your_basket_is . $basket_status . ".";
This poses a serious problem for internationalization (and, if it is not resolved, for localization as well). That is because the word order of a sentence tends to be different after the text is translated into another language (this applies especially to Korean). Also, I showed you hard-coded punctuation, which is not necessarily correct for Asian languages. That is what I have to go through on a daily basis :/
What you would probably need to do, is to remove such concatenations, or use some means of message formatting. The PHP example (taken directly from web page I am referencing) would be:
<?php
$fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
echo $fmt->format(array(4560, 123, 4560/123));
$fmt = new MessageFormatter("de", "{0,number,integer} Affen auf {1,number,integer} Bäumen sind {2,number} Affen pro Baum");
echo $fmt->format(array(4560, 123, 4560/123));
?>
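The word-order point can also be shown without the intl extension. Below is a minimal Python sketch that uses named placeholders so that each translation can reorder the arguments. The TEMPLATES table, the render helper and both sentences are made up for illustration (the Japanese string is my own illustrative translation). Unlike MessageFormatter, this does no locale-aware number formatting or plural handling.

```python
# Hypothetical message catalog: one complete sentence per language, with
# named placeholders instead of concatenated fragments.
TEMPLATES = {
    "en": "{user} added {count} items to {basket}.",
    "ja": "{user}さんが{basket}に{count}個の商品を追加しました。",
}


def render(lang, **values):
    """Fill the placeholders of the template for the given language."""
    return TEMPLATES[lang].format(**values)


print(render("en", user="Aiko", count=3, basket="the basket"))
print(render("ja", user="Aiko", count=3, basket="カート"))
```

Note that the Japanese template places the basket before the count and carries its own punctuation, which is exactly what concatenation in code makes impossible.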
As you can see in the MessageFormatter example, numbers are also formatted to match the locale style. This leads us to:
Locale aware formatting
Dates, times, numbers and currencies or other similar information need to be formatted according to user-detected Locale. There is a slight difference here: you should attempt to do that, even if you do not support related language resources (do not have translations). Of course for currency symbol, you would use whatever is your real currency, not the user's default, but the format should respect end user's cultural background.
Summary
I have just presented you with a short introduction to multilingual web site design with focus on Japanese and Korean target markets. If at some point you would need to support Chinese Simplified as well, support for GB18030 encoding would be probably needed as well. This would be very challenging...

You do not want to store all your text as HTML entities. It'll drive you mad. The only reason to do this is if you need to serve your document in an ASCII encoding and cannot embed the characters directly. But in this day and age there's no reason for that; serve your document as UTF-8 and write and store your contents in UTF-8 and be done with it.
Whether or not to store translations in the database depends on many factors, including performance, caching, whether you need to be able to search for the text, whether the text should be editable by non-programmers etc. Usually .mo/.po translation files with gettext are a good way to go unless proven otherwise.

Is there a reason to use the ICU plural syntax for locales without plurals?

I noticed that a few locales have only one form of (cardinal) plural defined by the CLDR. Here are a few examples of popular languages without plurals:
Chinese
Indonesian
Japanese
Supposing the following English ICU message string:
You have {count, plural, =0 {no unread messages} one {# unread message} other {# unread messages}}.
Is there any value in translating the plural ICU syntax to these languages? For example in Chinese:
Plural syntax: 您有{count, plural, other {#}}條未讀消息。
Or since there is no plural, should we recommend translators to simply use the variable instead like this:
Variable syntax: 您有{count}條未讀消息。
I tested two libraries (GlobalizeJs and FormatJs) and both seem to work with both syntaxes.
Is there any known issue in interchanging these syntaxes?
Would it be safe for most libraries to recommend using the variable syntax for locales without plurals?

There is no value to keep the plural marker. The =0 case should say 沒有, though.

A couple of years late but please note that there is huge value depending on whether the phrase only uses plural categories or also uses overrides (like the =0 in your example) and whether you are only trying to be grammatically correct in each language or rather trying to sound like a human and not like a machine.
Also, if it’s a matter of teaching linguists how to use ICU, you’re better off explaining how arguments work because it’s not limited to plurals (there’s also selects which have a similar way of working).
Finally, if you are working with translation files such as YML with ICU in them, I would say it’s best to keep the ICU plural marker. This means you can change the source language of your translations at any point in time (including to a language that doesn’t have plurals, like Chinese), and you won’t have to rewrite all the strings that should have kept the plural marker in the first place.
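To illustrate the difference between plural categories and exact overrides such as =0, here is a small hand-rolled Python sketch. It is not ICU and does not parse ICU syntax: plural_category hard-codes a tiny subset of the cardinal rules (integers only), and the message tables, including the Chinese =0 wording suggested in the earlier answer, are written out as plain dicts for illustration.

```python
def plural_category(lang, n):
    """Tiny hand-written subset of cardinal plural rules (integers only)."""
    if lang == "en":
        return "one" if n == 1 else "other"
    if lang in ("zh", "ja", "id"):
        return "other"  # these locales define only the 'other' category
    raise ValueError("no rule in this sketch for %r" % lang)


def select_form(lang, n, forms):
    """Exact overrides like '=0' win; otherwise use the plural category."""
    exact = forms.get("=%d" % n)
    if exact is not None:
        chosen = exact
    else:
        chosen = forms.get(plural_category(lang, n), forms["other"])
    return chosen.replace("#", str(n))


EN = {
    "=0": "You have no unread messages.",
    "one": "You have # unread message.",
    "other": "You have # unread messages.",
}
ZH = {
    "=0": "您沒有未讀消息。",
    "other": "您有#條未讀消息。",
}

for count in (0, 1, 5):
    print(select_form("en", count, EN), select_form("zh", count, ZH))
```

Even though Chinese has a single plural category, the =0 override still changes the sentence, which the plain variable syntax cannot express.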

How do UTF-8/Unicode adapt to new writing systems?

An example to clarify my question:
The Hongkongers' native language is Cantonese; however, we all write in a different language: Mandarin Chinese. The two languages are kind of similar, and Hongkongers are educated to write in Mandarin Chinese.
Cantonese doesn't have a writing system. We are still happy with Mandarin as our written language, but in case Hongkongers one day decided to develop a 'Cantonese script' containing not-yet-existing characters, how would UTF-8/Unicode/fonts have to change to accommodate these new characters?
I mean, who would change the UTF-8/Unicode/font standards? How exactly would Linux/Windows have to be modified in order to display these newly created characters?
(The example is just to make my question clear. We're not talking about politics ;D )

The Unicode coding space has over 1,000,000 code points, and only about 10% of them have been allocated, so there is a lot of room for new characters (even though some areas of the coding space have been set apart for use other than added characters). The Unicode Consortium, working in close cooperation with the relevant body at ISO, assigns code points to new characters on the basis of proposals that demonstrate actual usage or, in some cases, plans with a solid basis and widespread support.
Thus, if a new script were designed and there was a large community that would seriously use it, it would be added, with its characters, into Unicode after due proposals and discussion.
It would then be up to font manufacturers to add glyphs for such characters. This might take a long time, but if there is strong enough need, new fonts and enhancements to existing fonts would emerge.
No change to UTF-8 or other Unicode transfer encodings would be needed. They already encode the entire coding space, whether code points are assigned to characters or not.
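A quick way to convince yourself of this is to encode a code point that has no character assigned to it. The Python sketch below uses U+0378, which I am assuming is still unassigned in the Unicode data shipped with your Python; the helper name utf8_hex is mine.

```python
import unicodedata


def utf8_hex(code_point):
    """Return the UTF-8 bytes of a code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()


# U+0378: assumed to be unassigned; 'Cn' is the category for unassigned code points.
print(unicodedata.category(chr(0x0378)))
print(utf8_hex(0x0378))    # cdb8
# U+10FFFF: the very last code point of the coding space.
print(utf8_hex(0x10FFFF))  # f48fbfbf

# Round trip: the bytes decode back to the same code point.
assert bytes.fromhex(utf8_hex(0x0378)).decode("utf-8") == chr(0x0378)
```

The encoder never consults the list of assigned characters; it only looks at the numeric value of the code point.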
Rendering software would need no modifications, unless there are some specialties in the writing system. Normal characters would be rendered just fine, as soon as suitable fonts are available.
However, if the characters added were outside the Basic Multilingual Plane (BMP), the “16-bit subset of Unicode”, both rendering and processing (and input) would be problematic. Many programming languages and programs effectively treat Unicode as if it were a 16-bit code and run into problems (possibly solvable, but still) when characters outside the BMP are used. If the writing system had, say, 10,000 characters, it is quite possible that it would have to be allocated outside the BMP.

The Unicode committee adds new characters as they see fit. Then fonts add support for the new characters. Operating systems should not require changes simply to display the new characters. Typing the characters would generally require updates or plug-ins to an operating system's input methods.

What's a good method for writing fixed width field files?

I need to write a file that is probably being interpreted by something like RPG IV on an AS/400 (but I don't know that). The file will be created by reading data from our MySQL database and then writing it in the specified format. It could be quite large (potentially measured in GB, but I haven't determined that yet). Right now I'm thinking Perl's built-in format might actually be my best bet, because things like Xslate and Template Toolkit are designed more for things that aren't fixed width (HTML). My only concern is that format doesn't appear to have conditionals, and it looks like I may need them (I found a field that is left-justified if field A is set, and right-justified and padded if not).

Other possibilities that come to mind are pack and the sprintf family of functions.
I don't think pack supports right-justified text, so that wouldn't be an option.
That leaves (s)printf. You can build format specifiers programmatically to support your conditional logic for justification.
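The same idea, sketched in Python rather than Perl because the printf-style specifiers carry over almost unchanged. The record layout (a 10-character name followed by an 8-character code) and the field names are invented for illustration; only the justification rule is modelled on the question.

```python
def field_spec(width, left_justify):
    """Build a printf-style specifier such as '%-8.8s' or '%8.8s'.

    The precision (.8) truncates over-long values so a record never
    exceeds its fixed width; raising an error instead may be safer.
    """
    flag = "-" if left_justify else ""
    return "%{}{}.{}s".format(flag, width, width)


def format_record(record):
    """Format one record of a hypothetical layout: name (10) + code (8)."""
    name = field_spec(10, True) % record["name"]
    # Conditional justification: left-justified if field 'a' is set,
    # right-justified and space-padded if not.
    code = field_spec(8, bool(record.get("a"))) % record["code"]
    return name + code


print(repr(format_record({"a": "X", "name": "Widget", "code": "42"})))
print(repr(format_record({"a": "", "name": "Widget", "code": "42"})))
```

For a multi-gigabyte output you would call format_record once per database row and write each line straight to the file instead of building the whole file in memory.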

Template Toolkit can do a serviceable job at creating fixed width formatted files. The trick is to use the templates to describe the file and record structure, but have a Perl function format the data for each field.
It may be easier to skip the templates and do all the formatting in Perl. Either way you need to consider how you need to format your fields. In my experience sprintf is better at handling more of the formatting cases required by fixed width formatted files. You will probably still need to implement a few helper functions to handle oddities (like EBCDIC/COBOL signed numbers encoded in ASCII, if you're unlucky enough).
There are a thousand odd special cases in legacy fixed width formatted files; it's almost enough to make me like XML data files. Typically it's the oddest special case that ends up determining the best method for formatting the file.

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web:
Character Encodings -> Sets of rules that tell the OS how to store characters, e.g. ISO8859-1, MSWIN1252, UTF-8, UCS-2, UTF-16. These rules are also called Code Pages/Character Sets, which map individual characters to numbers. Apparently Unicode handles this a bit differently than others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [ http://www.joelonsoftware.com/articles/Unicode.html ]
Fonts -> These are implementations of character encodings. They are files of different formats (TrueType, OpenType, PostScript) that contain a mapping from each character in an encoding to a number.
Glyphs -> These are visual representations of characters stored in the font files.
And based on the above understanding I have the below questions,
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports an encoding suffice? Is it okay to compare an encoding to a network protocol, say TCP, since it is just a set of rules? (Which of course begs the question: how does the OS understand these network protocols when I do not install them :-p)
2) Will a font always have the complete implementation of a code page or just part of it? Is there a tool that I can use to see each character in a font (.TTF file)? [The Windows font viewer shows what a style of the font looks like but doesn't give information regarding the list of characters in the font file]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking too many questions, but I had these in my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.

If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with book that will save you great deal of time (you won't become i18n expert overnight, you know):
Developing International Software - it might be 8 years old but it is still worth every cent you're going to spend on it. The programming examples may be Windows-specific (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said that it saved him about 2 years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up to date resources on that topic (that is publicly available ones). The only current resource I know is Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries which help deal with the problem. The most notable examples would be Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately many features are missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I read this and as far as I can tell, it is great. It doesn't cover the topics you were originally asking about but it is still worth reading, even just for the hands-on examples of how to use Globalize.

"Apparently unicode handles this a bit differently than others. ie., instead of a direct mapping from a number(code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs."
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
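The three numeric levels for 𝄞 can be reproduced with a few lines of Python. This is only a sketch to make the layers visible; the variable names are mine.

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

# CCS: the character's code point.
code_point = ord(ch)

# CES: the serialized bytes in UTF-16LE.
utf16le_bytes = ch.encode("utf-16-le")

# CEF: regroup the bytes into 16-bit code units (little-endian pairs).
code_units = [
    int.from_bytes(utf16le_bytes[i:i + 2], "little")
    for i in range(0, len(utf16le_bytes), 2)
]

print("U+%04X" % code_point)                          # U+1D11E
print(" ".join("%04X" % unit for unit in code_units))  # D834 DD1E
print(" ".join("%02X" % b for b in utf16le_bytes))     # 34 D8 1E DD
```

The two code units form a surrogate pair, which is why software that assumes one 16-bit unit per character miscounts or splits characters like this one.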
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
2) Will a font always have the complete implementation of a code page or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.

I'll provide you with short answers to your questions.
1) It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes to lists of characters. For example, in C# reading a UTF-8 string will automatically make it UTF-16 if you tell it to treat it as a string.
No matter what encoding you use, C# will simply use UTF-16 internally and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and show the glyphs.
2) I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
3) The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine.
Bonus answer: On "how does the OS understand network protocols it doesn't know?": again it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (which application) it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Why use Unicode if your program is English only?

So I've read Joel's article, and looked through SO, and it seems the only reason to switch from ASCII to Unicode is for internationalization. The company I work for, as a policy, will only release software in English, even though we have customers throughout the world. Since all of our customers are scientists, they have functional enough English to use our software as a non-native speaker. Or so the logic goes. Because of this policy, there is no pressing need to switch to Unicode to support other languages.
However, I'm starting a new project and wanted to use Unicode (because that is what a responsible programmer is supposed to do, right?). In order to do so, we would have to start converting all of the libraries we've written into Unicode. This is no small task.
If internationalization of the programs themselves is not considered a valid reason, how would one justify all the time spent recoding libraries and programs to make the switch to Unicode?

This obviously depends on what your app actually does, but just because you only have an English version in no way means that internationalization is not an issue.
What if I want to store a customer name which uses non-english characters? Or the name of a place in another country?
An added bonus (since you say you're targeting scientists) is that all sorts of scientific symbols and notations are supported as part of Unicode.
Ultimately, I find it much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on. Non-Unicode means that you use some locale-dependent character set or codepage by default, and so text that looks fine on your computer may be full of garbage characters on someone else's.
Apart from that, you probably don't need to translate all your libraries to Unicode in one go. Write wrappers as needed to convert between Unicode and whichever encoding you use otherwise.
If you use UTF-8 for your Unicode text, you even get the ability to read plain ASCII strings, which should save you some conversion headaches.

They say they will always put it in English now, but you admit you have worldwide clients. If a client comes in and says internationalization is a deal breaker, will they really turn them down?
To clarify the point I'm trying to make: you say that they will not accept this reasoning, but it is sound.
Always better to be safe than sorry, IMO.

The extended Scientific, Technical and Mathematical character set rules.
Where else can you say ⟦∀c∣c∈Unicode⟧ and similar technical stuff.

Characters beyond the 7-bit ASCII range are useful in English as well. Does anyone using your software ever need to write the € sign? Or £? How about distinguishing "résumé" from "resume"? You say it's used by scientists around the world, who may have names like "Jörg" or "Guðmundsdóttir". In a scientific setting, it is useful to talk about wavelengths like λ, units like Å, or angles as Θ, even in English.
Some of these characters, like "ö", "£", and "€" may be available in 8-bit encodings like ISO-8859-1 or Windows-1252, so it may seem like you could just use those encodings and be done with it. The problem is that there are characters outside of those ranges that many people use very frequently, and so lots of existing data is encoded in UTF-8. If your software doesn't understand that when importing data, it may interpret the "£" character in UTF-8 as a sequence of 2 Windows-1252 characters, and render it as "Â£". If this sort of error goes undetected for long enough, you can start to get your data seriously garbled, as multiple passes of misinterpretation alter your data more and more until it becomes unrecoverable.
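The "Â£" effect is easy to reproduce. The Python sketch below mis-decodes UTF-8 bytes as Windows-1252 once and then a second time, and also shows that a single pass can still be undone. The variable names are mine.

```python
original = "£"

# UTF-8 bytes read as if they were Windows-1252: one character becomes two.
once = original.encode("utf-8").decode("cp1252")
print(once)        # Â£

# Saving that result as UTF-8 and misreading it again makes it worse.
twice = once.encode("utf-8").decode("cp1252")
print(len(twice))  # 4

# While nothing has been lost yet, the damage is reversible pass by pass.
repaired = once.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

The repair only works while every intermediate byte survives. Once a pass hits a byte that the wrong encoding cannot represent, or a lossy replacement is applied, the original text is gone.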
And it's good to think about these issues early on in the design of your program. Since strings tend to be a very low-level concept that is threaded throughout your entire program, with lots of assumptions about how they work implicit in how they are used, it can be very difficult and expensive to add Unicode support to a program later on if you have never even thought about the issue to begin with.
My recommendation is to always use Unicode capable string types and libraries wherever possible, and make sure any tests you have (whether they be unit, integration, regression, or any other sort of tests) that deal with strings try passing some Unicode strings through your system to ensure that they work and come through unscathed.
If you don't handle Unicode, then I would recommend ensuring that all data accepted by the system is 7-bit clean (that is, there are no characters beyond the 7-bit US-ASCII range). This will help avoid problems with incompatibilities between 8-bit legacy encodings like the ISO-8859 family and UTF-8.
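A 7-bit-clean check is only a couple of lines. Here is a minimal Python sketch; the function name is hypothetical, and where you call it (file import, form handling, and so on) depends on your system.

```python
def is_seven_bit_clean(data):
    """Return True if every byte is in the 7-bit US-ASCII range (0x00-0x7F)."""
    return all(byte < 0x80 for byte in data)


print(is_seven_bit_clean(b"resume"))                   # True
print(is_seven_bit_clean("résumé".encode("utf-8")))    # False
print(is_seven_bit_clean("résumé".encode("latin-1")))  # False
```

Rejecting input at the boundary is blunt, but it turns silent corruption into an explicit error that somebody will notice.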

Suppose your program allows me to put my name in it, on a form, a dialog, whatever, and my name can't be written with ASCII characters... Even though your program is in English, the data may be in another language...

It doesn't matter that your software is not translated; if your users use international characters then you need to support Unicode to be able to do correct capitalization, sorting, etc.

If you have no business need to switch to Unicode, then don't do it. I'm basing this on the fact that you thought you'd need to change code unrelated to the component you already need to change to make it all work with Unicode. If you can make the component/feature you're working on "Unicode ready" without spreading code churn to lots of other components (especially other components without good test coverage) then go ahead and make it Unicode ready. But don't go churn your whole codebase without a business need.
If the business need arises later, address it then. Otherwise, you aren't going to need it.
People in this thread may suppose scenarios where it becomes a business requirement. Run those scenarios by your product managers before considering them scenarios worth addressing. Make sure they know the cost of addressing them when you ask.

Well for one, your users might know and understand english, but they can still have 'local' names. If you allow your users to do any kind of input to your application, they might want to use characters that are not part of ascii. If you don't support unicode, you will have no way of allowing these names. You'd be forcing your users to adopt a more simple name just because the application isn't smart enough to handle special characters.
Another thing is, even if the standard right now is that the app will only be released in English, you are also blocking the possibility of internationalization with ASCII, adding to the work that needs to be done when the company policy decides that translations are a good thing. Company policy is good, but has also been known to change.

I'd say this attitude expressed naïveté, but I wouldn't be able to spell naïveté in ASCII-only.
ASCII still works for some computer-only codes, but is no good for the façade between machine and user.
Even without the New Yorker's old-fashioned style of coöperation, how would some poor woman called Zoë cope if her employers used such a system?
Alas, she wouldn't even seek other employment, as updating her résumé would be impossible, and she'd have to resume instead. How's she going to explain that to her fiancée?

"The company I work for, as a policy, will only release software in English, even though we have customers throughout the world."
One reason only: policies change, and when they change, they will break your existing code. Period.
Design for evil, and you have a chance of not breaking your code so soon. In this case, use Unicode. It happened to me on a Brazil-specific stock-market legacy system.

Many languages (Java [and thus most JVM-based language implementations], C# [and thus most .NET-based language implementations], Objective C, Python 3, ...) support Unicode strings by preference or even (nearly) exclusively (you have to go out of your way to work with "strings" of bytes rather than of Unicode characters).
If the company you work for ever intends to use any of these languages and platforms, it would therefore be quite advisable to start planning a Unicode-support strategy; a pilot project in particular might not be a bad idea.

That's a really good question. The only reason I can think of that has nothing to do with I18n or non-English text is that Unicode is particularly suited to being what might be called a hub character set. If you think of your system as a hub with its external dependencies as spokes, you want to isolate character encoding conversions to the spokes, so that your hub system works consistently with your chosen encoding. What makes Unicode an ideal character set for the hub of your system is that it acknowledges the existence of other character sets, it defines equivalences between its own characters and characters in those external character sets, and there's an ongoing process where it extends itself to keep up with the innovation and evolution of external character sets. There are all sorts of weird encodings out there: even when the documentation assures you that the external system or library is using plain ASCII it often turns out to be some variant like IBM775 or HPRoman8, and the nice thing about Unicode is that no matter what encoding is thrown at you, there's a good chance that there's a table on unicode.org that defines exactly how to convert that data into Unicode and back out again without losing information. Then again, equivalents of a-z are fairly well-defined in every character set, so if your data really is restricted to the standard English alphabet, ASCII may do just as well as a hub character set.
A decision on encoding is a decision on two things: which characters are permitted and how those characters are represented. Unicode permits you to use pretty much any character ever invented, but you may have your own reasons not to want or need such a wide choice. You might still restrict usernames, for example, to combinations of a-z and underscore, maybe because you have to put them into an external LDAP system whose own character set is restricted, maybe because you need to print them out using a font that doesn't cover all of Unicode, maybe because it closes off the security problems opened up by lookalike characters. If you're using something like ASCII or ISO8859-1, the storage/transmission layer implements a lot of those restrictions; with Unicode the storage layer doesn't restrict anything, so you might have to implement your own rules at the application layer. This is more work: more programming, more testing, more possible system states. The tradeoff for that extra work is more flexibility, application-level rules being easier to change than system encodings.
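As a rough sketch of what such an application-level rule might look like, here is the username example in Python. The function name and the 32-character limit are assumptions made for illustration, not something the policy above prescribes.

```python
import re

# Hypothetical policy: only lowercase a-z and underscore, 1 to 32 characters.
USERNAME_RE = re.compile(r"[a-z_]{1,32}")


def is_valid_username(name: str) -> bool:
    """Return True only if the whole name matches the restricted alphabet."""
    return USERNAME_RE.fullmatch(name) is not None


ok = is_valid_username("alice_b")
# The first letter below is CYRILLIC SMALL LETTER A (U+0430), a lookalike of "a".
lookalike_rejected = not is_valid_username("\u0430lice")
```

Because the rule lives in application code, loosening it later means changing one pattern rather than migrating stored data to a new encoding.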

The reason to use Unicode is to respect proper abstractions in your design.
Just get used to treating the concept of text properly. It is not hard. There's no reason to create a broken design even if your users are English.

Just think of a customer wanting to use names like Schrödingers Cat for files he saved using your software. Or imagine some localized Windows with a translation of My Documents that uses non-ASCII characters. That is internationalization that affects your software even though you don't support internationalization at all.
Also, having the option of supporting internationalization later is always a good thing.
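To make the file name case concrete, here is a small Python sketch. The `.txt` extension is just an assumption for the example.

```python
name = "Schrödingers Cat.txt"

# An ASCII-only code path cannot represent this name at all.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A Unicode encoding such as UTF-8 round-trips it without loss.
utf8_bytes = name.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

Here `ascii_ok` ends up `False`, while `round_trip` is identical to the original name.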

Internationalization is so much more than just text in different languages. I bet it's the niche of the future in the IT world. Heck, it already is. A lot has already been said; I just thought I would add a small thing. Even though your customers are satisfied with English right now, that might change in the future. And the longer you wait, the harder it will be to convert your code base. They might even have problems today with, e.g., file names or other types of data you save/load in your application.

Unicode is like cooties. Once it "infects" one area, it's usually hard to contain it, given the interconnectedness of dependencies. Sooner or later, you'll probably have to tie in a library that is Unicode-compliant and thus uses wchar_t or the like. Instead of marshaling between character types, it's nice to have consistent strings throughout.
Thus, it's nice to be consistent. Otherwise you'll end up with something similar to the Windows API, which has an "A" version and a "W" version for most APIs since they weren't consistent to start with. (And in some cases, Microsoft has abandoned creating "A" versions altogether.)

You haven't said what language you're using. In some languages, changing from ASCII to Unicode may be pretty easy, whereas in others (which don't support Unicode) it might be pretty darn hard.
That said, maybe in your situation you shouldn't support Unicode: you can't think of a compelling reason why you should, and there are some reasons (e.g. the cost of changing your existing libraries) which argue against it. I mean, perhaps 'ideally' you should, but in practice there might be some other, more important or more urgent, thing to spend your time and effort on at the moment.

If a program takes text input from the user, it should use Unicode; you never know what language the user is going to use.

Using Unicode leaves the door open for internationalization if requirements ever change and you are required to use text in languages other than English.
Also, in your new project you could always just write wrappers for the libraries that internally convert between ASCII and Unicode and vice-versa.
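A minimal sketch of such a wrapper in Python, assuming a hypothetical ASCII-only library function called `legacy_upper` (a stand-in here, not a real library call):

```python
def legacy_upper(data: bytes) -> bytes:
    """Stand-in for an existing library routine that only handles ASCII bytes."""
    return data.upper()


def upper_text(text: str) -> str:
    """Unicode-facing wrapper: convert at the boundary, call the legacy code,
    and convert back. Non-ASCII input raises UnicodeEncodeError instead of
    being silently mangled."""
    return legacy_upper(text.encode("ascii")).decode("ascii")


result = upper_text("hello")
```

The rest of the new project then deals only in Unicode strings, and the conversion (and its failure mode) is confined to one place per wrapped function.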

Your potential client may already be running a non-Unicode application in a language other than English and won't be able to run your program without switching the Windows system locale (the language for non-Unicode programs) back and forth, which will be a big pain.

Because the internet overwhelmingly uses Unicode. Web pages use Unicode. Text files, including your customers' documents, and the data on their clipboards are Unicode.
Secondly, Windows is natively Unicode, and the ANSI APIs are a legacy.
Modern applications should use Unicode where applicable, which is almost everywhere.
