I noticed that a few locales have only one form of (cardinal) plural defined by the CLDR. Here are a few examples of popular languages without plurals:
Chinese
Indonesian
Japanese
Supposing the following English ICU message string:
You have {count, plural, =0 {no unread messages} one {# unread message} other {# unread messages}}.
Is there any value in translating the plural ICU syntax to these languages? For example in Chinese:
Plural Syntax 您有{count, plural, other {#}}條未讀消息。
Or since there is no plural, should we recommend translators to simply use the variable instead like this:
Variable Syntax 您有{count}條未讀消息。
I tested two libraries (GlobalizeJs and FormatJs) and both seem to work with both syntaxes.
Is there any known issue in interchanging these syntaxes?
Would it be safe for most libraries to recommend using the variable syntax for locales without plurals?
There is no value to keep the plural marker. The =0 case should say 沒有, though.
A couple of years late but please note that there is huge value depending on whether the phrase only uses plural categories or also uses overrides (like the =0 in your example) and whether you are only trying to be grammatically correct in each language or rather trying to sound like a human and not like a machine.
Also, if it’s a matter of teaching linguists how to use ICU, you’re better off explaining how arguments work because it’s not limited to plurals (there’s also selects which have a similar way of working).
Finally, if you are working with translation files such as YML with ICU in them, I would say it’s best to keep the ICU plural marker. This means you can change the source language of your translations at any point in time (including to a language that doesn’t have plurals like chinese), and you won’t have to rewrite all strings that should have kept the plural marker in the first place.
Apple's search API documents specify:
lang =
The language, English or Japanese, you want to use when returning search results. Specify the language using the five-letter codename. For example: en_us.
The default is en_us (English).
examples: en_us, ja_jp
However, I cannot find this 5 letter standard codename. Are they just expecting a ISO 639-1 Code concatenated with a 2 letter country code?
Has anybody ran into this issue before?
As you can see the document below, en_us, ja_jp is not a "example" but the supported values for lang parameter key.
https://www.apple.com/itunes/affiliates/resources/documentation/itunes-store-web-service-search-api.html
This means, currently this api supports only these two langeages.
Hope this helps.
Simply put: why does the ITfLanguageProfileNotifySink interface sink not seem to invoke OnLanguageChange method when changing input language (from Language Bar) to one of the Chinese (Traditional), such as ChangJie or New Phonetic? This is on Windows 7. I get it going from say English to Chinese, but then when I select a different Chinese (sub-) language, there is no notification?!
Because you haven't changed the language, you've changed the active Input Method.
One can have multiple input methods for the same language. To track input methods, you'll have to use ITfActiveLanguageProfileNotifySink and figure out how to interpret the CLSIDs for the various input methods in the OnActivated function.
I have to write some code that translate numbers from english to french (from 1 to 999) using the DCG formalism of Prolog. Do I have to write down two separate grammar rules (one for english and one for french)or not?
Can this piece of code found on the internet help me?
https://groups.google.com/forum/?fromgroups=#!topic/comp.lang.prolog/ZF8p5cs4q0U
Please Help.
You can do it in two steps (english to number, number to french) or you can try doing it directly from english to french. The two-step option is more generic (i.e. would let you convert it both ways, and you can easily extend it to support more languages), and you already have a working code available (the one in the linked topic), so I suggest following this route.
Just remember that, the same way a DCG rule allows you to parse some text, it allows you to generate it too. As the linked topic shows:
?- phrase(number(N), [one, hundred, and, twenty, seven]).
N = 127
?- phrase(number(127), L).
L = [one, hundred, and, twenty, seven]
If you replace the second part with phrase(number_fr(127), L), using the rules you implemented, you'd have the number you parsed earlier expressed in french.
I have started to code a multi-language feature for a medium-sized website with a lot of hardcoded text. As the website is supposed to be translated into Japanese and Korean (multibyte character set) I am considering the following:
If I use string externalization, do the strings for Japanese or Korean need to be in unicode form within the locale file (i.e. 台北 instead of 台北 as string value)?
Would it make more sense to store the localization in a DB (i.e. MySQL) and retrieve the respective values via a localization function in PHP?
Your thought input is much appreciated.
Best regards
$0.02 from someone who has some experience with i18n...
Keep your translations in human-readable form, as it will likely be translators and not coders managing these resources.
If this text (hard-coded, you say) is not subject to frequent change, then you may wish to store these resources as files that you read in at runtime.
If this text is subject to frequent change, then you may wish to explore other alternatives for storing resources, such as databases or in-memory key-value stores.
Depending upon your requirements, you may want to consider a mixture of the above.
But I strongly suggest that you avoid mixing code (the HTML character entities) with your translation resources. Most translators will not understand what they mean and may break them when they are translating. And on the flip-side, a programmer may not understand how to insert code or formatting into the translation resources properly, unless they actually understand that language.
tl;dr
- use UTF-8
- don't mix any code/formatting into the translations themselves
- how you store the translations depends upon your requirements
I doubt that string externalization would be your biggest problem. But let me give you some advise.
String externalization
Of course you would need to separate translatable strings from the code. I would recommend storing translation in plain text, UTF-8 encoded file containing key-value pairs:
some.key=some translation
Of course you would need to write a helper script to resolve this at runtime. The script would need to detect end-user's language.
Language detection
Web browsers are so nice to send AcceptLanguage header each time they send a request. What you need to do, is to read the content of this header and check if you support any of the language user has listed. If so, read the resource file (as defined above) and return strings for given language, return your default language otherwise. The code example below will give you the most desired language (which is not necessary the one you support):
<?php
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
echo $locale;
?>
This is still, not the biggest of your challenges.
Styles and style sheets
The real problem with multilingual web sites or web applications are styles. People tend to put style definitions in-line, which is problematic to say the least. Also, designers tend to think that Arial is the best font for entire Universe, as well as emphasis always have to come with bolded font. The only problem is, the font might be unreadable under some circumstances.
I must admit, I don't know why it happens, but most of the times web browsers tend to ignore bold attribute for Asian scripts (which is good), but sometimes they do not and it could became a major challenge for end users if your font definition is say font-family:Arial; font-size:10px;.
The other problem could be colors. Depending on your web site design, some colors used might be inappropriate for target customers. That is because we all tend to assign meaning to colors based on our cultural background.
Images containing localizable text could also give you a headache, you would need to either externalize such texts (and write them down just like any other HTML element), or prepare multilingual resources structure (i.e. put all images to directories named after language code ("en", "ja", "ko")).
The real challenge however, are hard-coded formatting tags like <b>, <i>, <u>, <strong>, etc. Nobody should use them nowadays, style classes should be used instead but the common practice is different. You would probably need to replace them with style classes; each element could have more than one style class, which to my surprise is not common knowledge (for example <p class="main boldText">).
OK, once you have your styles externalized, you would probably be forced to implement some sort of CSS Localization Mechanism. This is needed in the lights of what I wrote above. The easiest way to do that is to create directory structure similar to the one I mentioned before - "en" for English base CSS files, "ja" for Japanese and "ko" for Korean, so each language would have their own, separate set of CSS files. This is similar to UI skins, only in that case user won't be able to choose the skin, you will decide on which CSS to present them - you would detect language anyway.
As for in-line style definitions (<p style="whatever">), after you define CSS L10n Mechanism, you could override any style by forcing it with !important keyword. That is, unless somebody in his very wrong mind put this keyword to in-line style definition.
Concatenations
Well, this is your biggest challenge. Even people who understand the need of string externalization tend to concatenate the strings like this:
$result = $label + ": " + $product;
$message = "$your_basket_is + $basket_status + ".";
This poses serious problem for Internationalization (and if it is not resolved for Localization as well). That is because, the order of the sentence tend to be different after translating text into different language (this especially regards to Korean). Also, I showed you hard-coded punctuations, which are not necessary correct for Asian languages. That is what I have to go through on a daily basis :/
What you would probably need to do, is to remove such concatenations, or use some means of message formatting. The PHP example (taken directly from web page I am referencing) would be:
<?php
$fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
echo $fmt->format(array(4560, 123, 4560/123));
$fmt = new MessageFormatter("de", "{0,number,integer} Affen auf {1,number,integer} Bäumen sind {2,number} Affen pro Baum");
echo $fmt->format(array(4560, 123, 4560/123));
?>
As you can see in this example, numbers are also formatted to much locale style. This leads us to:
Locale aware formatting
Dates, times, numbers and currencies or other similar information need to be formatted according to user-detected Locale. There is a slight difference here: you should attempt to do that, even if you do not support related language resources (do not have translations). Of course for currency symbol, you would use whatever is your real currency, not the user's default, but the format should respect end user's cultural background.
Summary
I have just presented you with a short introduction to multilingual web site design with focus on Japanese and Korean target markets. If at some point you would need to support Chinese Simplified as well, support for GB18030 encoding would be probably needed as well. This would be very challenging...
You do not want to store all your text as HTML entities. It'll drive you mad. The only reason to do this is if you need to serve your document in an ASCII encoding and cannot embed the characters directly. But in this day and age there's no reason for that; serve your document as UTF-8 and write and store your contents in UTF-8 and be done with it.
Whether or not to store translations in the database depends on many factors, including performance, caching, whether you need to be able to search for the text, whether the text should be editable by non-programmers etc. Usually .mo/.po translation files with gettext are a good way to go unless proven otherwise.