Is it possible to type in Furigana (and Ruby characters) using Unicode? - unicode

I am currently making a Corona app where I would like to include Japanese text. For those of you who do not know, it appears that Japanese has multiple languages to write in text (Kanji, Hiragana, etc.). Furigana is a way to have Kanji characters with Hiragana in what looks to be subtext (or Ruby characters). See the Ruby slide on this page for an example.
I am looking for a way to use Furigana in my app. I was hoping there was a way to do it using Unicode. Well, I stumbled upon the Interlinear Annotation characters and tested them out (using unicodeToUtf8 and the LastResort font) in Corona as follows: :
local iaAnchor = unicodeToUtf8(0xfff9)
local iaSep = unicodeToUtf8(0xfffa)
local iaTerm = unicodeToUtf8(0xfffb)
local options = {
parent = localGroup,
text = iaAnchor .. "漢" .. iaSep .. "かん" .. iaTerm .. iaAnchor .."字" .. iaSep .. "じ" .. iaTerm,
x = 285,
y = 195,
font = "LastResort",
fontSize = 24,
}
local testText = display.newText(options)
Unfortunately, I had no success and ended up getting something like this:
So, my question is, is it possible to get Furigana (and Ruby characters) to work using Unicode? Or is this not an actual usable feature in Unicode? I just want to make sure that I am not wasting my time trying to get this stuff to work.
I checked out the Interlinear Annotation Characters section in this Unicode report, but the jargon is a bit too thick for me to understand what they're trying to say. Are they implying at all that such characters should not be used in regular practice? If so, then the previous resources on Unicode Ruby Characters is a bit misleading.

Interlinear Annotation Characters are a generic tool for annotating text (like Furigana, Bopomofo, or other phonetic guides), but the Unicode Standard doesn't specify how they should be interpreted or rendered. That is, you will probably have to implement rendering support for them yourself because most libraries do not know what to do with them.
It might be easier to use a higher-level protocol that already supports rendering Ruby text. For example, if you have access to an API that can render HTML, you can use the <ruby>/<rt> tags—which have well-defined rendering semantics.

Related

Converting emoji from hex code to unicode

I want to use emojis in my iOS and Android app. I checked the list of emojis here and it lists out the hex code for the emojis. When I try to use the hex code such as U+1F600 directly, I don't see the emoji within the app. I found one other way of representing emoji which looks like \uD83D\uDE00. When using this notation, the emoji is seen within the app without any extra code. I think this is a Unicode string for the emoji. I think this is more of a general question that specific to emojis. How can I convert an emoji hex code to the Unicode string as shown above. I didn't find any list where the Unicode for the emojis is listed.
It seems that your question is really one of "how do I display a character, knowing its code point?"
This question turns out to be rather language-dependent! Modern languages have little trouble with this. In Swift, we do this:
$ swift
Welcome to Apple Swift version 3.0.2 (swiftlang-800.0.63 clang-800.0.42.1). Type :help for assistance.
1> "\u{1f600}"
$R0: String = "😀"
In JavaScript, it is the same:
$ node
> "\u{1f600}"
'😀'
In Java, you have to do a little more work. If you want to use the code point directly you can say:
new StringBuilder().appendCodePoint(0x1f600).toString();
The sequence "\uD83D\uDE00" also works in all three languages. This is because those "characters" are actually what Unicode calls surrogates and when they are combined together a certain way they stand for a single character. The details of how this all works can be found on the web in many places (look for UTF-16 encoding). The algorithm is there. In a nutshell you take the code point, subtract 10000 hex, and spread out the 20 bits of that difference like this: 110110xxxxxxxxxx110111xxxxxxxxxx.
But rather than worrying about this translation, you should use the code point directly if your language supports it well. You might also be able to copy-paste the emoji character into a good text editor (make sure the encoding is set to UTF-8). If you need to use the surrogates, your best best is to look up a Unicode chart that shows you something called the "UTF-16 encoding."
In Delphi XE #$1F600 is equivalent to #55357#56832 or D83D DE04 smile.
Within a program, I use it in the following way:
const smilepage : array [1..3] of WideString =(#$1F600,#$1F60A,#$2764);
JavaScript - two way
let hex = "😀".codePointAt(0).toString(16)
let emo = String.fromCodePoint("0x"+hex);
console.log(hex, emo);

Why Julia returns "\uf8ff" when I use  (Apple logo) unicode?

I thought Julia supports raw unicode input, such as:
julia> test = "π£¢∞§"
"π£¢∞§"
julia> 😘 = 1 ;
julia> print(😘 )
1
However, it seems julia does not support  (Apple logo).
julia>  = 123
ERROR: syntax: invalid character ""
julia> test = ""
"\uf8ff"
I wonder what's the underlying reason for that, and whether there is a way I can use  character in Julia?
I believe this link more properly explains the case of the unicode character that you see as apple's logo.
The problem is that the unicode value used is one of several that is set aside for private use. That means that each operating system, or application, or implementation is free to use those unicode characters for anything they want. It just so happens that Apple has chosen to use unicode character U+F8FF (decimal value 63743, or on the web as either  or ) as the Apple Logo. But some Windows fonts put in a Windows logo. And some other fonts put in a Klingon Mummification glyph. Or elven script. Or anything they want. And if it isn't defined in your local font, you'll just see a square.
My opinion is that Julia simply doesn't use this special value for anything. This also explains why your "π£¢∞§" characters work nicely - they are proper unicode characters, more largely supported by different platforms.
As a side note, i too see a simple square instead of the apple logo on this instance.
Edit
Here is a list of unicode characters supported by Julia.
To expand on Alex's answer...
Apple's logo () isn't an official Unicode symbol. I think there are very few commercial logos and symbols in the main Unicode tables.
However, Unicode provides some 'anything goes' areas (called PUAs - private use areas) that companies and individuals can fill with their own symbols, so that their users can access certain special glyphs. The main PUA is U+E000 to U+F8FF. Depending on which font you're using, you'll find all kinds of stuff assigned to these codes. On a Mac, I can usually get the Apple logo at "\uf8ff", with the right font selected, but not the Ubuntu symbol or the Windows logo, unless I choose another font. (There's also a fallback mechanism, whereby if you request a code point that the current font doesn't have, the OS will find a suitable substitute in another font and use that.)
[
In Julia, you can only use certain Unicode characters for variable names. Julia wouldn't allow anything from the private use area anyway, unless some fonts were distributed to every computer and everyone agreed on who had which Unicode point. (Mathematica makes extensive use of PUA symbols in their notebooks, because they can and do install their own fonts, and can then access various glyphs from the PUA in the notebook with guaranteed results.)
You are allowed to use emoji characters as variable names, so you could try the Emoji apple, rather than the Apple apple:

Is there a way I can add unicode text to a MBCS MFC menu

I have a MFC application compiled with the MBCS character set. I have a submenu off of my main menu that I would like to add unicode characters to. Can that be done?
You can force the use of Unicode strings even in MBCS apps by explicitely calling the Unicode form of an API and passing it a Unicode string.
In your case, ModifyMenuW() is the API that sets the menu item text (assuming the menu item already exists):
ModifyMenuW(GetMenu()->m_hMenu,ID_APP_ABOUT, MF_BYCOMMAND , 0, L"\u573F");
This code displays a Chinese ideogram (I have no idea of its meaning) instead of the original text
The L in front of the string says it's a Unicode string. \u573F is the way you encode a Unicode char in your C++ ASCII source file. The W at the end of the API name: It stands for Wide and denotes the Unicode form of the API.
Note that if your goal is to translate the full UI of your app, this is a complete other story: The method I showed here is only suitable for one-shot calls. You can't create a full UI that way.
You can translate your MBCS app to Japanese, Russian, whatever,... without switching to Unicode (Although it would be a very good idea to do that switch. But that can be costly for legacy apps).
You have 2 friends to help you out there: appTranslator lets you very easily translate your app (and manage your translations (Disclaimer: This is my own ad ;-) and Microsoft AppLocale helps you test MBCS apps in different codepages without actually changing the codepage of your computer (which requires a reboot).

Website localization for multibyte languages

I have started to code a multi-language feature for a medium-sized website with a lot of hardcoded text. As the website is supposed to be translated into Japanese and Korean (multibyte character set) I am considering the following:
If I use string externalization, do the strings for Japanese or Korean need to be in unicode form within the locale file (i.e. 台北 instead of 台北 as string value)?
Would it make more sense to store the localization in a DB (i.e. MySQL) and retrieve the respective values via a localization function in PHP?
Your thought input is much appreciated.
Best regards
$0.02 from someone who has some experience with i18n...
Keep your translations in human-readable form, as it will likely be translators and not coders managing these resources.
If this text (hard-coded, you say) is not subject to frequent change, then you may wish to store these resources as files that you read in at runtime.
If this text is subject to frequent change, then you may wish to explore other alternatives for storing resources, such as databases or in-memory key-value stores.
Depending upon your requirements, you may want to consider a mixture of the above.
But I strongly suggest that you avoid mixing code (the HTML character entities) with your translation resources. Most translators will not understand what they mean and may break them when they are translating. And on the flip-side, a programmer may not understand how to insert code or formatting into the translation resources properly, unless they actually understand that language.
tl;dr
- use UTF-8
- don't mix any code/formatting into the translations themselves
- how you store the translations depends upon your requirements
I doubt that string externalization would be your biggest problem. But let me give you some advise.
String externalization
Of course you would need to separate translatable strings from the code. I would recommend storing translation in plain text, UTF-8 encoded file containing key-value pairs:
some.key=some translation
Of course you would need to write a helper script to resolve this at runtime. The script would need to detect end-user's language.
Language detection
Web browsers are so nice to send AcceptLanguage header each time they send a request. What you need to do, is to read the content of this header and check if you support any of the language user has listed. If so, read the resource file (as defined above) and return strings for given language, return your default language otherwise. The code example below will give you the most desired language (which is not necessary the one you support):
<?php
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
echo $locale;
?>
This is still, not the biggest of your challenges.
Styles and style sheets
The real problem with multilingual web sites or web applications are styles. People tend to put style definitions in-line, which is problematic to say the least. Also, designers tend to think that Arial is the best font for entire Universe, as well as emphasis always have to come with bolded font. The only problem is, the font might be unreadable under some circumstances.
I must admit, I don't know why it happens, but most of the times web browsers tend to ignore bold attribute for Asian scripts (which is good), but sometimes they do not and it could became a major challenge for end users if your font definition is say font-family:Arial; font-size:10px;.
The other problem could be colors. Depending on your web site design, some colors used might be inappropriate for target customers. That is because we all tend to assign meaning to colors based on our cultural background.
Images containing localizable text could also give you a headache, you would need to either externalize such texts (and write them down just like any other HTML element), or prepare multilingual resources structure (i.e. put all images to directories named after language code ("en", "ja", "ko")).
The real challenge however, are hard-coded formatting tags like <b>, <i>, <u>, <strong>, etc. Nobody should use them nowadays, style classes should be used instead but the common practice is different. You would probably need to replace them with style classes; each element could have more than one style class, which to my surprise is not common knowledge (for example <p class="main boldText">).
OK, once you have your styles externalized, you would probably be forced to implement some sort of CSS Localization Mechanism. This is needed in the lights of what I wrote above. The easiest way to do that is to create directory structure similar to the one I mentioned before - "en" for English base CSS files, "ja" for Japanese and "ko" for Korean, so each language would have their own, separate set of CSS files. This is similar to UI skins, only in that case user won't be able to choose the skin, you will decide on which CSS to present them - you would detect language anyway.
As for in-line style definitions (<p style="whatever">), after you define CSS L10n Mechanism, you could override any style by forcing it with !important keyword. That is, unless somebody in his very wrong mind put this keyword to in-line style definition.
Concatenations
Well, this is your biggest challenge. Even people who understand the need of string externalization tend to concatenate the strings like this:
$result = $label + ": " + $product;
$message = "$your_basket_is + $basket_status + ".";
This poses serious problem for Internationalization (and if it is not resolved for Localization as well). That is because, the order of the sentence tend to be different after translating text into different language (this especially regards to Korean). Also, I showed you hard-coded punctuations, which are not necessary correct for Asian languages. That is what I have to go through on a daily basis :/
What you would probably need to do, is to remove such concatenations, or use some means of message formatting. The PHP example (taken directly from web page I am referencing) would be:
<?php
$fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
echo $fmt->format(array(4560, 123, 4560/123));
$fmt = new MessageFormatter("de", "{0,number,integer} Affen auf {1,number,integer} Bäumen sind {2,number} Affen pro Baum");
echo $fmt->format(array(4560, 123, 4560/123));
?>
As you can see in this example, numbers are also formatted to much locale style. This leads us to:
Locale aware formatting
Dates, times, numbers and currencies or other similar information need to be formatted according to user-detected Locale. There is a slight difference here: you should attempt to do that, even if you do not support related language resources (do not have translations). Of course for currency symbol, you would use whatever is your real currency, not the user's default, but the format should respect end user's cultural background.
Summary
I have just presented you with a short introduction to multilingual web site design with focus on Japanese and Korean target markets. If at some point you would need to support Chinese Simplified as well, support for GB18030 encoding would be probably needed as well. This would be very challenging...
You do not want to store all your text as HTML entities. It'll drive you mad. The only reason to do this is if you need to serve your document in an ASCII encoding and cannot embed the characters directly. But in this day and age there's no reason for that; serve your document as UTF-8 and write and store your contents in UTF-8 and be done with it.
Whether or not to store translations in the database depends on many factors, including performance, caching, whether you need to be able to search for the text, whether the text should be editable by non-programmers etc. Usually .mo/.po translation files with gettext are a good way to go unless proven otherwise.

wxWidgets and Unicode

i want to use korean translations under in my - quite large - wxwidgets application. The application uses the wxwidgets translation framework, which is based on gettext.
I have working translations for french, german and russian. I want to go unicode anyway, but my first question is:
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
I have thousands of string literals. Do i have to prepend each and every one of them with 'L' ? ( wxString foo("foo") --> wxString foo(L"foo") )
if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ? ( pleeze! =) )
Will this change in wxWidgets 3.0?
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Is there any remarkeable performance impact in using Unicode?
Thank you!
Wendy
Addressing some of your questions…
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
Russian fits in a single-byte charset, just like western European languages (though it is a different charset). Korean and Japanese (and Chinese) don't. There are many workarounds for this, but the most elegant I know of to date is to use Unicode so that you don't need to rebuild your application for each locale; just change its message catalog.
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
Only strings that are going to be shown to (non-technical) users need to be localized, so they're the only ones that have to be in Unicode. The most common approach is to use UTF-8 (which is a particular way of encoding Unicode) as that means that ASCII strings – the most common type passed around inside programs – are exactly the same, which simplifies things a lot. The down-side is that you no longer have cheap indexing into the string as not all characters are the same number of bytes long. That can be anything from a non-issue to a right royal hindering PITA, depending on what the program is doing.
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Comparisons work fine, as does simple string finding. Other operations (e.g., getting the 20th character of a string, or working out how many characters into a string you've found a substring) are nasty because you've not got constant character widths. The nastiness can be mitigated by using wide characters, but they're less nice to use for external data (they introduce potential problems with endianness unless you go into working with byte-order marks, and that's another matter right there).
Is there any remarkeable performance impact in using Unicode?
Depends on exactly what you do. With UTF-8, if you're mostly dealing with ASCII text in reality then you get very little in the way of performance problems for most operations. With wide characters, you take more memory for every character, which naturally has performance implications (but which might acceptable because it does mean you've got constant-time indexing).
There's a korean .po file on http://www.wxwidgets.org/about/i18n.php for wxWidget's own strings. If your application displays wxWidget's own strings correctly when using that file, then it does not need Unicode support to display Korean and Japanese languages.
ISO-8859-5 is an 8 bit character set with Cyrillic letters.
Only if 1. does not yield the correct result. But if you want to translate the string, you should have used _().
I don't know.
wxWidgets 3.0 will not have separate Unicode- and ANSI-builds. 2.9.1 doesn't have, either.
It depends on how you use the arguments. C- and C++-functions usually operate on the representation of strings and are unaware of any particular character encoding. Particularly what you perceive to be a character and what the program considers a character might be different things.
See 6.
I do not know, but many toolkits use UTF-16 or UTF-32 instead of UTF-8 because these schemes are simpler. It's a size-speed tradeoff.
1.does my application need unicode support to display korean and japanese
languages?
Thanks to Oswald, i found out that you can have a korean translation without using unicode in your wxwidgets application. Change ( under windows, at least ) settings for non-unicode aware programs. But i still have to check out if this is enough for a whole application.
3.I have thousands of string literals. Do i have to prepend each
and every one of them with 'L' ? (
wxString foo("foo") --> wxString
foo(L"foo") )
If you have to use unicode with wxwidgets prior to 3.0, you have to. But do not use 'L' under wxwidgets, use wxT("foo")
4.if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ?
I did, at least a search and replace under Visual Studio:
Search: {"([^"]*)"}
Replace: wxT(\1)
But be careful! Will replace all string literals, #include "file.h" with #include wxT("file.h")
Will this change in wxWidgets 3.0?
Yes. See answer/quote above.