The Question: "Is supporting only the Unicode BMP sufficient to enable native Chinese / Japanese / Korean speakers to use an application in their native language?"
I'm most concerned with Japanese speakers right now, but I'm also interested in the answer for Chinese speakers. If an application only supported characters in the BMP, would that make the application unusable for Chinese/Japanese speakers (i.e. the app did not allow data entry or display of supplementary characters)?
I'm not asking whether the BMP is all you would ever need for any kind of application (clearly not, especially not for every language in the world). I'm asking whether, for CJK speakers in a professional context, in an ordinary modern app that deals with general free-text entry (including names, places, etc.), the BMP is generally enough.
Even if supporting only the BMP is not strictly correct, would it be pretty close / "good enough"? Would the lack of supplementary characters in an application be only an occasional minor inconvenience, or would a Japanese speaker, for example, consider the application completely broken? Especially considering that they could always work around the problem by spelling out problematic words in Hiragana/Katakana.
What about Chinese speakers, who don't have that fallback option: would the lack of supplementary characters be considered a show-stopping problem?
I'm considering a general professional context here, not social or gaming uses. As an example, there are a lot of emoticon characters in the supplementary planes, but I personally would not consider an English app that did not support Unicode emoticon characters to be "broken", at least for most professional use.
The application I'm dealing with right now is written in Java, but I think this question applies more generally. Knowing the answer will also help me (regardless of language) get a better handle on how much effort I'd have to go through with regard to font support.
EDIT
Clarification: by "supports only the BMP" I mean that the application would still handle supplementary characters gracefully.
Unsupported characters (including the surrogate code blocks within the BMP) would be dealt with the way most applications deal with ASCII control codes and other undesirable characters: filtered/disallowed on data entry, and "dealt with" for display if necessary (filtered out or replaced with the Unicode replacement character, U+FFFD).
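For the data-entry side, here is a minimal sketch of such a filter in Java (the class and method names are made up for illustration; the policy is the one described above):

```java
public final class BmpInputFilter {

    // Allow only BMP code points, excluding surrogates (U+D800..U+DFFF)
    // and control characters; everything else is "unsupported".
    private static boolean isAllowed(int cp) {
        return cp <= 0xFFFF
                && !(cp >= 0xD800 && cp <= 0xDFFF)
                && !Character.isISOControl(cp);
    }

    // Data entry: reject the value outright if it contains anything unsupported.
    public static boolean isValidForEntry(String input) {
        return input.codePoints().allMatch(BmpInputFilter::isAllowed);
    }

    // Display: replace anything unsupported with U+FFFD, the replacement character.
    public static String sanitizeForDisplay(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .map(cp -> isAllowed(cp) ? cp : 0xFFFD)
             .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Note that codePoints() already combines well-formed surrogate pairs into supplementary code points above U+FFFF, so the surrogate-range check only ever catches unpaired (malformed) surrogates.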
For people who might be looking for an actual answer to the actual question: the application that prompted this question is now in production, allowing only characters from the BMP (actually a limited subset of it).
Multiple international customers are using the Korean language in production, with Japanese going live soon. Chinese is in planning (I have my doubts that the BMP will be sufficient for that, but we'll see).
It's fine - no reported issues related to unsupported characters.
But that's just anecdotal evidence, really. Just because my customers were fine with it doesn't mean yours will be. For context, the customers of the app are international companies, with hundreds of employees using the application to serve hundreds of thousands of their own customers.
Unfortunately, CJK support in Unicode is broken. The BMP is not enough to properly support CJK, but worse than that: even if you do implement full support for all Unicode planes, it is still broken.
The basic problem is Han unification: characters from all three languages that look similar, but are not really the same, were merged into single code points. The result is that they only look right if you select the correct font to display them. For example, a particular character will only look right to a Chinese reader if you render it with a Chinese font, and only look right to a Japanese reader if you render it with a Japanese font.
There is no universal font. There is no way to determine which language a character is supposed to be from, so you have to somehow guess which font to use. You can try to examine the system language or some other hack like that. You can't support two languages in the same document unless you have additional metadata. If you get raw Unicode strings without any indication of what language they are in, you are screwed.
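If you do have language metadata, the workaround is to pick a per-language font at render time. A rough sketch in Java; the Noto CJK family names are assumptions and only work if those fonts are actually installed:

```java
import java.awt.Font;
import java.util.Map;

public final class CjkFontPicker {

    // Example language-tag-to-font mapping; adjust to whatever fonts
    // are actually installed on the target systems.
    private static final Map<String, String> FONT_BY_LANG = Map.of(
            "ja",    "Noto Sans CJK JP",  // Japanese glyph forms
            "zh-CN", "Noto Sans CJK SC",  // Simplified Chinese forms
            "zh-TW", "Noto Sans CJK TC",  // Traditional Chinese forms
            "ko",    "Noto Sans CJK KR"); // Korean forms

    // The same unified code point (a classic example is U+76F4) is drawn
    // with visibly different strokes by each of these fonts.
    public static Font fontFor(String languageTag, int size) {
        String family = FONT_BY_LANG.getOrDefault(languageTag, Font.SANS_SERIF);
        return new Font(family, Font.PLAIN, size);
    }
}
```

Without that language tag you are back to guessing, which is exactly the problem described above.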
It's a total disaster. You need to talk to your clients to figure out their needs and how they indicate to their systems which font to use for these unified characters.
Edit: I also need to mention that some characters required for people's names are missing from Unicode. Later revisions are better, but of course you also need updated fonts to take advantage of them.
The majority of CJK code points are defined in the BMP, but the rarer CJK ideographs are not: CJK Unified Ideographs Extension B and later live in the Supplementary Ideographic Plane (plane 2). So if you do not need to support those ideographs, the BMP is fine; otherwise it is not.
However, I would consider any implementation that does not recognize and process UTF-16 surrogates, even if it does not handle the Unicode codepoints they represent, to be broken.
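In Java terms, "recognizing surrogates" concretely means iterating by code point rather than by char, so that a surrogate pair is treated as one character even if you then reject it. A small sketch:

```java
public final class CodePointWalk {
    public static void main(String[] args) {
        // 'a' + U+20BB7 (a supplementary ideograph used in Japanese names) + 'b'
        String s = "a\uD842\uDFB7b";

        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // Surrogate-aware iteration: the pair is seen as a single code point.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advances by 2 for supplementary characters
        }
    }
}
```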
Unless you are a font developer or are developing an operating system, you should not care about that; let the OS layer deal with it.
Just implement proper Unicode support in your application and let the operating system deal with how the characters are typed and displayed.
If you are using custom fonts in your application, you may be in trouble.
In the end, to answer your question: no, Unicode is not only the BMP, and you need to support all of Unicode.
Related
We are in the process of converting our Windows-1252-based webshop to Unicode. Unfortunately, we currently have to use middleware between the shop and the ERP system which cannot handle UTF-8 (it corrupts the characters).
We could convert the content to UTF-7 just for the trip through the middleware, but I'd like to avoid converting all data as it enters and exits the middleware.
That is why I thought of using UTF-7 altogether. Is there a technical reason not to use UTF-7 on a website?
HTML5 forbids browsers from supporting UTF-7:
Furthermore, authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into this category; these encodings were never intended for use for Web content.
...
User agents must support the encodings defined in the WHATWG Encoding standard. User agents should not support other encodings.
User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings. [CESU8] [UTF7] [BOCU1] [SCSU]
An extract from the list of character encodings supported by Firefox:
UTF-7 (Unicode): obsolete since Gecko 5.0; support removed for HTML5 compatibility.
Don't use UTF-7.
By the way, middleware which supports UTF-7 but not UTF-8 looks strange. Maybe this middleware can handle the data as binary? In any case, your middleware might be a little too old to still be in use.
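Building on that "handle it as binary" idea: if the middleware is at least ASCII-clean, one option is to Base64-wrap the UTF-8 payload only for the trip through it, and keep the website itself on UTF-8. A sketch in Java (whether your middleware tolerates this is an assumption you would have to verify):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class MiddlewareShim {

    // Before handing text to the middleware: UTF-8 bytes as pure ASCII.
    public static String toWire(String text) {
        return Base64.getEncoder()
                     .encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // After receiving it back on the other side of the middleware.
    public static String fromWire(String wire) {
        return new String(Base64.getDecoder().decode(wire),
                          StandardCharsets.UTF_8);
    }
}
```

This does reintroduce the convert-on-entry/convert-on-exit step the question hoped to avoid, but it confines the workaround to the middleware boundary instead of spreading UTF-7 across the whole site.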
Current versions of Chrome, Firefox, and IE do not support UTF-7 at all (they render a UTF-7-encoded HTML document by displaying its source code as plain text, since they do not recognize any tags). That alone is sufficient reason not to even consider using UTF-7 on the web.
I am currently working on an application that supports multiple languages: English, Spanish, Russian, Polish, etc.
I have set up my SQL server database to have Unicode field types (nvarchar etc).
I am now concerned with setting the correct encoding on the HTML, text, XML files, etc. I am aware that it needs to be a UTF encoding, but I'm not sure whether it should be UTF-8, UTF-16 or UTF-32. Could someone explain the difference and which encoding is best to go with?
If this is about something that is supposed to use web browsers, as it seems, then UTF-8 is the only reasonable choice, since it’s the only encoding that is widely supported in browsers. Regarding the ways to set the encoding, check out the W3C page Character encodings.
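To make the difference concrete, encoding the same sample text in all three forms shows the size trade-off. A quick Java sketch (UTF-32 is looked up by name; it ships with standard JDKs):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class UtfSizes {
    public static void main(String[] args) {
        String sample = "Hello, Привет, 你好"; // Latin + Cyrillic + CJK, 17 characters

        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);      // 27 bytes
        System.out.println(sample.getBytes(StandardCharsets.UTF_16BE).length);   // 34 bytes
        System.out.println(sample.getBytes(Charset.forName("UTF-32BE")).length); // 68 bytes
    }
}
```

ASCII-heavy web content makes UTF-8 the smallest of the three in practice, which, together with browser support, is why it wins for the web.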
I have decided to develop a (Unicode) spell checker for a South Asian language as my final-year project. I want to develop it as a plugin or a web service. But I need to decide on a suitable development platform for it. (It will not just check against a dictionary file; morphological analysis/generation modules (a stemmer) will also be used.)
Would JavaScript be able to handle such processing with a fair response time?
Will I be able to process a large dictionary on the client side?
Are there any better suggestions you can make?
JavaScript is not up to the task, at least not by itself; its Unicode support is too primitive, and in many places actually missing. For example, JavaScript has no support for Unicode grapheme clusters.
If you use Java, then make sure you use the ICU libraries so that you can get all the whizbang Unicode properties you’ll need for text segmentation. The place where Java’s native Unicode processing breaks down is in its regex library, which is why Android JNIs over to the ICU C/C++ regex library. There are a lot of NLP tools written for Java, some of which you might find handy. Most of these that I am aware of though are for English or at least Western languages.
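As a small illustration of the kind of support ICU adds on top of plain Java, here is a hedged sketch of grapheme-cluster segmentation with ICU4J (assumes the com.ibm.icu:icu4j dependency is available):

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public final class GraphemeDemo {
    public static void main(String[] args) {
        // U+304B (か) + U+3099 (combining dakuten): two code points,
        // but one user-perceived character (が).
        String text = "\u304B\u3099";

        BreakIterator bi = BreakIterator.getCharacterInstance(ULocale.ROOT);
        bi.setText(text);

        int clusters = 0;
        for (int start = bi.first(), end = bi.next();
             end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            System.out.println("cluster: " + text.substring(start, end));
            clusters++;
        }
        System.out.println(clusters + " cluster(s), "
                + text.codePointCount(0, text.length()) + " code point(s)");
    }
}
```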
If you are willing to run part of your computation server-side via CGI instead of just client-side action, you are no longer bound by language choice. For example, you might combine Javascript on the client with Perl on the server, whose Unicode support is even better than Java’s. How that would meld together and how to get the performance and behavior you would want depends on just what you actually want to do.
Perl also has quite a good number of industry-standard NLP modules widely available for it, most of which already know how to deal with Unicode, since, like Java, Perl uses Unicode internally.
A brief slide presentation on using NLP tools in Perl for certain sorts of morphological analysis, namely stemming and lemmatization, is available here. The presentation is known to work under Safari, Firefox, and Chrome, but not so well under Opera or Microsoft's Internet Explorer.
I am not aware of any tools specifically targeting Asian languages. However, Perl does support UAX #11 (East Asian Width) and UAX #14 (Unicode Line Breaking) via the Unicode::LineBreak module from CPAN, and it comes with a fully compliant collation module (implementing UTS #10, the Unicode Collation Algorithm) by way of the standard Unicode::Collate module, with locale support, including many Asian locales, available from the also-standard Unicode::Collate::Locale module. If you are working with CJK languages, you may want access to the Unihan database, available via the Unicode::Unihan module from CPAN. Even more fundamentally, Perl has native support for Unicode extended grapheme clusters by way of the \X metacharacter in its built-in regex engine, which neither Java nor JavaScript provides.
All this is the sort of thing you are likely to need, and find terribly lacking, in Javascript.
How do you deal with formal / informal speech when building an application that must have all its phrases in one of those?
Most frameworks will let you pick the language for things such as form validation error messages, by setting it to something like 'en-GB' or 'fr-FR'.
But if you want to change from formal to informal or vice versa, you have to edit the language files.
I know this isn't a big issue in English, but it is in other languages, where you have to pick the correct word for, say, the equivalent of "you", depending on whether the conversation is formal or informal. The same can happen with almost any word in the sentence, depending on the language.
Any thoughts?
Have you ever been told to build an application fully in formal / informal speech?
Does the user even care about this?
Informal vs Formal
The real problem with choosing a form is that it really depends on who you are speaking to. It is probably OK to use informal messages with an English-speaking user, but the same tone would be regarded as offensive by, for example, a Japanese user. That is the essence of Internationalization.
How to deal with it?
I suggest picking one "tone" and using it consistently throughout the application. If it is informal (for example, because the target users are teenagers), so be it. However, let Localization decide how to translate these messages, since the localizers should have deep knowledge of the target market.
If you need to have both formal and informal language in one application, for example depending on the target user's age, you can think of implementing themes, as sketched below. Of course, a theme should not only customize messages but also the User Interface (styles, colors). Again, if you do this, let L10n decide what is good for each international market (some themes might not be applicable there).
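One plausible way to wire up the text part of that theme idea, sketched in Java with hypothetical bundle names (messages-formal / messages-informal are illustrations, not a convention):

```java
import java.util.Locale;
import java.util.ResourceBundle;

public final class ToneMessages {

    public enum Tone { FORMAL, INFORMAL }

    // Resolves e.g. messages-formal_ja.properties vs. messages-informal_ja.properties,
    // so translators can word each tone independently per locale.
    public static ResourceBundle bundleFor(Tone tone, Locale locale) {
        String base = (tone == Tone.FORMAL) ? "messages-formal" : "messages-informal";
        return ResourceBundle.getBundle(base, locale);
    }
}

// Usage: ToneMessages.bundleFor(ToneMessages.Tone.FORMAL, Locale.JAPAN).getString("greeting");
```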
Does the user even care?
Some users do, some users don't; it depends. In my experience, Asian customers (especially Japanese and Chinese) tend to care a lot. Using informal speech or bright colors might come across as if you are being rude to them.
Trying to settle a debate with a client. I didn't localize the strings and images in the application, and it didn't come up during the 3-week define-and-discover phase. They seem to think it's a basic best practice and that it should have been done by default. I disagree, especially if there are no planned languages on the way.
It seems like you would leave this until there is demand from users.
So, I'd like to ask the community to chime in and tell me whether you localize your iPhone apps by default or not.
I ask this not only to help me understand where I might have missed something, but also to help others in the future as to what is considered "default" and "best practice".
This question is going to spark a lot of opinion. Because Apple makes it ridiculously easy on iOS to localize strings, I personally feel you should set all apps up for localization by default. Anywhere you find yourself defining a literal string, substitute it with NSLocalizedString(), until it becomes second nature. Then, if you decide you want to localize later, you don't have to hunt and peck all over the place. If you never localize, you lose nothing except a few keystrokes.
A true localization will probably have localized NIBs too (different languages may need different-sized buttons, for instance). Still, if you assign all strings that will appear in the NIB in code with NSLocalizedString(), rather than in Interface Builder, you'll likely save time in the long run.
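The same wrap-every-literal habit carries over to other stacks. For instance, a minimal Java analogue (the t() helper and the "strings" bundle name are made up for illustration):

```java
import java.util.Locale;
import java.util.ResourceBundle;

public final class I18n {
    private static final ResourceBundle STRINGS =
            ResourceBundle.getBundle("strings", Locale.getDefault());

    // Use t("save_button") wherever you would otherwise hard-code "Save".
    public static String t(String key) {
        return STRINGS.getString(key);
    }
}
```

If you never localize, the default strings.properties is all you ship; if you later do, each new language is just one more properties file.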
The fact that the AppStore makes your app visible in so many countries greatly increases the demand to localize. Read this post by Wil Shipley about the money you could be making by reaching many markets.
First, you have to internationalize it, that is, write your code so that it is easy to add new languages. The default, and first, language is usually English (Apple makes things easier if you start with English, and customers are more likely to buy an app that is English-only than one that is Swahili-only).
Second, you can localize it into the languages that matter for your app's customers: Spanish, Chinese, French, Farsi, ...
How could you possibly know what languages to use for localization without input from your client?
Assuming it was never discussed during requirements, it appears you are in the right here. If they wanted localized versions of their app, they should have requested it.
#greg has a very good point, in that it would have been beneficial to use localized strings from the very beginning, but setting up an app for localization isn't especially difficult. It's the actual translations that are difficult and expensive.
My small sample suggests that the vast majority of apps in the U.S. App store are not localized to any other language.
It may be technically easy, but it can be editorially very difficult unless you have qualified multilingual staff available: hiring multiple contractors to cross-review translations for grammatical correctness, producing multilingual app descriptions, app documentation, web-site support pages, marketing materials, etc., in all the languages expected for an app localized that way, and then keeping all of it editorially synchronized with every update and bug fix.
It also appears that a lot of apps add multilingual support only after international sales reach a level that can carry those initial and ongoing costs.
I ask them going into it.
If I need a default answer, it is "yes, prepare for localization". It takes far less time to add it as you go than to go through remove, rebuild, retest cycles later.