How do I make text-transform: uppercase work properly with Greek?

The issue I came across has to do with the capitalization of Greek characters by the text-transform: uppercase property.
In Greek, vowels can carry acute accents, in both lowercase and uppercase; for instance, "one" in Greek is ένα, and at the beginning of a sentence it would be Ένα. But when a word or a phrase is written in all caps, Greek grammar says it should have no accented letters.
As it stands, CSS's text-transform: uppercase capitalizes Greek letters while preserving accents, which is grammatically wrong (so ένα becomes ΈΝΑ, while it should be ΕΝΑ).
How do I make text-transform: uppercase work properly for Greek?

CSS will handle this fine if it knows that the language is Greek. Merely using Greek characters does not tell CSS that the language is Greek; that requires the lang attribute on the element itself or on an ancestor (up to and including the html tag).
<p lang='el' style="text-transform: uppercase">ένα</p>
should get the job done for you, rendering
ΕΝΑ
See fiddle at http://jsfiddle.net/34tww2g8/.
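If you would rather keep the rule in a stylesheet than inline, the same condition can be expressed with the :lang() selector. A minimal sketch (the rule below is illustrative, not taken from the fiddle):

```css
/* Uppercase anything marked as Greek via the lang attribute */
:lang(el) { text-transform: uppercase; }
```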

Torazaburo's answer is correct, but older browsers (mostly IE versions) have no proper support for uppercasing Greek accented characters, so there you need JavaScript to replace the accented characters with unaccented ones:
replaceAccented();

function replaceAccented() {
    var e = document.getElementsByTagName('*'), l = e.length, i;
    // Old-IE fallback: emulate getComputedStyle with currentStyle
    if (typeof getComputedStyle == "undefined")
        getComputedStyle = function(e) { return e.currentStyle; };
    for (i = 0; i < l; i++) {
        if (getComputedStyle(e[i]).textTransform == "uppercase") {
            e[i].innerHTML = greekReplaceAccented(e[i].innerHTML);
        }
    }
}

function greekReplaceAccented(str) {
    var charList = {
        'Ά':'Α','ά':'α','Έ':'Ε','έ':'ε','Ή':'Η','ή':'η','Ί':'Ι','ί':'ι','ΐ':'ϊ',
        'Ό':'Ο','ό':'ο','Ύ':'Υ','ύ':'υ','ΰ':'ϋ','Ώ':'Ω','ώ':'ω','ς':'Σ'
    };
    return str.replace(/./g, function(c) { return c in charList ? charList[c] : c; });
}
Here is the working function in a fiddle.
You can comment out the replaceAccented() call to see what is actually fixed by the JavaScript, or to test which browsers need such a workaround.
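On modern engines you could also avoid the lookup table entirely: decompose with String.normalize and strip only the combining acute (U+0301). This matches the table above, since the diaeresis is kept (ΐ uppercases to Ϊ). A sketch, not part of the original answer:

```javascript
// Strip only the acute accent (tonos); keep the diaeresis, per Greek all-caps rules
function greekUpper(str) {
    return str.normalize("NFD")        // split έ into ε + U+0301
              .replace(/\u0301/g, "")  // drop the combining acute
              .normalize("NFC")
              .toUpperCase();          // Unicode-aware: ε → Ε, ς → Σ
}

console.log(greekUpper("ένα")); // ΕΝΑ
```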

What you are describing isn't really a bug in CSS. CSS is designed to style the elements of a page, and that definition is culture-agnostic. What you are describing would require CSS to handle localization of a page, loading culture-specific styles (en, fr, el, ...).
There are a number of links online that discuss globalization and localization as they relate to CSS.
Check the Mozilla site, which discusses this same topic; look at the section on "Create localizable UI".

Related

What's the simplest and most reliable way to get a string containing nothing but _all unicode emojis_?

Using the command line, a script, etc.
http://www.unicode.org/emoji/
"Unicode symbols" don't occupy one clean plane or run of Unicode blocks (see Wikipedia: Unicode symbols), and to make matters worse there is no definitive subset of Unicode symbols that can be considered "proper emojis" (as far as I know).
E.g. the emojis covered by Twitter don't cover all of the Unicode symbols; they are dispersed all over the place, though particularly concentrated in some blocks, such as Miscellaneous Symbols, Emoticons, Miscellaneous Symbols and Pictographs, Transport and Map Symbols, and Dingbats.
But once you've settled on the Unicode block ranges you want, you can easily print them out. I'll use the Emoticons range (U+1F600–U+1F64F) as an example, since those are the most universally recognized emojis.
var emojis = "";
// Walk the Emoticons block one code point at a time
for (var code = 0x1F600; code <= 0x1F64F; code++) {
    emojis += String.fromCodePoint(code);
}
console.log(emojis);
parseInt() - JavaScript | MDN
String.fromCodePoint() - JavaScript | MDN
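If you end up wanting several of the blocks mentioned above rather than just Emoticons, the same loop generalizes to a list of ranges. A sketch (the two ranges below are just examples from the discussion):

```javascript
// Concatenate every code point in a list of [first, last] inclusive ranges
function rangesToString(ranges) {
    var out = "";
    ranges.forEach(function (r) {
        for (var cp = r[0]; cp <= r[1]; cp++) {
            out += String.fromCodePoint(cp);
        }
    });
    return out;
}

// Emoticons + Dingbats
console.log(rangesToString([[0x1F600, 0x1F64F], [0x2700, 0x27BF]]));
```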

Is it possible to type in Furigana (and Ruby characters) using Unicode?

I am currently making a Corona app where I would like to include Japanese text. For those of you who do not know, it appears that Japanese has multiple languages to write in text (Kanji, Hiragana, etc.). Furigana is a way to have Kanji characters with Hiragana in what looks to be subtext (or Ruby characters). See the Ruby slide on this page for an example.
I am looking for a way to use Furigana in my app. I was hoping there was a way to do it using Unicode. Well, I stumbled upon the Interlinear Annotation characters and tested them out (using unicodeToUtf8 and the LastResort font) in Corona as follows:
local iaAnchor = unicodeToUtf8(0xfff9)  -- INTERLINEAR ANNOTATION ANCHOR
local iaSep    = unicodeToUtf8(0xfffa)  -- INTERLINEAR ANNOTATION SEPARATOR
local iaTerm   = unicodeToUtf8(0xfffb)  -- INTERLINEAR ANNOTATION TERMINATOR

local options = {
    parent = localGroup,
    text = iaAnchor .. "漢" .. iaSep .. "かん" .. iaTerm .. iaAnchor .. "字" .. iaSep .. "じ" .. iaTerm,
    x = 285,
    y = 195,
    font = "LastResort",
    fontSize = 24,
}
local testText = display.newText(options)
Unfortunately, I had no success and ended up with garbled output (screenshot omitted).
So, my question is, is it possible to get Furigana (and Ruby characters) to work using Unicode? Or is this not an actual usable feature in Unicode? I just want to make sure that I am not wasting my time trying to get this stuff to work.
I checked out the Interlinear Annotation Characters section in this Unicode report, but the jargon is a bit too thick for me to understand what they're trying to say. Are they implying at all that such characters should not be used in regular practice? If so, then the previous resources on Unicode Ruby Characters is a bit misleading.
Interlinear Annotation Characters are a generic tool for annotating text (like Furigana, Bopomofo, or other phonetic guides), but the Unicode Standard doesn't specify how they should be interpreted or rendered. That is, you will probably have to implement rendering support for them yourself because most libraries do not know what to do with them.
It might be easier to use a higher-level protocol that already supports rendering Ruby text. For example, if you have access to an API that can render HTML, you can use the <ruby>/<rt> tags—which have well-defined rendering semantics.
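For comparison, this is what the well-defined HTML route looks like: if your environment can render HTML, the annotation from the Corona snippet above becomes simply:

```html
<ruby>漢<rt>かん</rt>字<rt>じ</rt></ruby>
```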

Unicode Keystroke Characters?

Does unicode have characters in it similar to stuff like the things formed by the <kbd> tag in HTML? I want to use it as part of a game to indicate that the user can press a key to perform a certain action, for example:
Press R to reset, or S to open the settings menu.
Are there characters for that? I don't need anything fancy like ⇧ Shift or Tab ⇆, single-letter keys are plenty. I am looking for something that would work somewhat like the Enclosed Alphanumerics subrange.
If there are characters for that, where could I find a page describing them? All the Google searches I tried only turned up "unicode character keyboard shortcuts" stuff.
If there are not characters for that, how can I display something like that as part of (or at least in line with) a text string in Processing 2.0.1?
(The rendering referred to is not the default rendering of kbd, which simply shows the content in the system’s default monospace font. But e.g. in StackOverflow pages, a style sheet is used to format kbd so that it looks like a keycap.)
Somewhat surprisingly, there is a Unicode way to create something that looks like a character in a keycap: enter the character, then immediately COMBINING ENCLOSING KEYCAP U+20E3.
Font support for this character is very limited, but it does include a few free fonts. Unfortunately, none of them is a sans-serif font, and the character shown inside should normally appear in such a font – after all, real keycaps contain very simple shapes for characters, without serifs. And generally, a character and an enclosing mark should be taken from the same font; otherwise they might be incompatible. However, it seems that taking the normal character from the sans-serif font (FreeSans) in GNU FreeFont and the combining mark from the serif font (FreeSerif) of the same source creates a reasonable presentation:
I’m afraid it won’t work here in text, but I’ll try: A⃣ .
Whether this works depends on the use of suitable fonts, as mentioned, but also on the rendering software. Programs have been rather bad at displaying combining marks, but there has been some improvement. I tested this in Word 2007, where it works OK, and also on web browsers (Chrome, Firefox, IE) with good results using code like this:
<style>
.cap { font-family: FreeSerif; }
.cap span { font-family: FreeSans; }
</style>
<span class="cap"><span>A</span>⃣</span>
It isn’t perfect when using the fonts mentioned. The character in the cap is not quite centered. Moreover, if I try to use the technique e.g. for the character Å (which is present on normal Nordic keyboards), the ring above A extends out of the cap. You could tweak this by setting the font size of the letter in the cap to, say, 85% of the font size of the combining mark, but then the horizontal position of the letter is even more off.
To summarize, it is possible to do such things at the character level, but if you can use other methods, like using a border or a background image for a character, you can probably achieve better rendering.
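If you generate such strings in code, the character-level approach is a one-liner: the base character followed by U+20E3. A sketch (JavaScript here only for illustration; the question itself targets Processing):

```javascript
// COMBINING ENCLOSING KEYCAP follows the base character it encloses
function keycap(ch) {
    return ch + "\u20E3";
}

console.log("Press " + keycap("R") + " to reset, or " + keycap("S") + " for settings.");
```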

How to mark all CJK text in a document?

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to language, except for English, and output a new file, e.g., here is a sample line:
The 恐龙 ate 鱼.
As this contains text in Chinese characters, this will get marked like this:
The \language[cn]{恐龙} ate \language[cn]{鱼}.
The document is saved as UTF-8.
Text in Chinese should be marked \language[cn]{*}.
Text in Japanese should be marked \language[ja]{*}.
Text in Korean should be marked \language[ko]{*}.
The content never continues from one line to the next.
If the code is ever in doubt about whether something is Chinese, Japanese, or Korean, it is best if it defaults to Chinese.
How can I mark the text according to the language present?
A crude algorithm:
use 5.014;
use utf8;
use open qw(:std :utf8);    # so say emits UTF-8 instead of mangled bytes

while (<DATA>) {
    s{(\p{Hangul}+)}{\\language[ko]{$1}}g;
    s{(\p{Hani}+)}{\\language[cn]{$1}}g;
    s{(\p{Hiragana}+|\p{Katakana}+)}{\\language[ja]{$1}}g;
    say;
}

__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.
(Also see Detect chinese character using perl?)
There are problems with that: Daenyth comments that e.g. 恐竜 is misidentified as Chinese. I find it unlikely that you are really working with mixed English–CJK, and suspect you are just giving bad example text; perform a lexical analysis first to differentiate Chinese from Japanese.
I'd like to provide a Python solution. Whatever the source language, it is based on Unicode script information (from the Unicode Character Database, aka UCD). Perl ships a rather detailed UCD compared to Python.
Python does not expose script information in its "unicodedata" module, but someone has added it here: https://gist.github.com/2204527 (tiny and useful). My implementation is based on it. By the way, it is not whitespace-sensitive (no lexical analysis needed).
# coding=utf8
import unicodedata2

text = u"""The恐龙ate鱼.
The 恐竜ate 魚.
Theキョウリュウ ate うお.
The공룡 ate 물고기. """

langs = {
    'Han': 'cn',
    'Katakana': 'ja',
    'Hiragana': 'ja',
    'Hangul': 'ko'
}

alist = [(x, unicodedata2.script_cat(x)[0]) for x in text]
alist.append(("", ""))  # sentinel so the final run gets flushed

newlist = []
langlist = []
prevlang = ""
for raw, lang in alist:
    if prevlang in langs and prevlang != lang:
        newlist.append("\\language[%s]{" % langs[prevlang] + "".join(langlist) + "}")
        langlist = []
    if lang not in langs:
        newlist.append(raw)
    else:
        langlist.append(raw)
    prevlang = lang

newtext = "".join(newlist)
print newtext
The Output is :
$ python test.py
The\language[cn]{恐龙}ate\language[cn]{鱼}.
The \language[cn]{恐竜}ate \language[cn]{魚}.
The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
The\language[ko]{공룡} ate \language[ko]{물고기}.
While Korean doesn't use many sinograms [漢字/Kanji] anymore, they still pop up sometimes. Some Japanese sinograms are solely Japanese, like 竜, but many are identical to either Simplified or Traditional Chinese, so you're kind of stuck: you need to look at a full sentence when you have some "Han" chars. If it has some hiragana/katakana plus kanji, the probability is very high that it's Japanese. Likewise, a bunch of Hangul syllables and a couple of sinograms tell you the sentence is Korean.
Then, if it's all Han characters, i.e. Chinese, you can look at whether some of the chars are simplified: kZVariant denotes a Simplified Chinese char. Oh, and kSpecializedSemanticVariant is very often used for Japanese-specific simplified chars. 内 and 內 may look the same to you, but the first is Japanese, the second Traditional Chinese and Korean (Korean uses Traditional Chinese characters as its standard).
I have code somewhere that returns, for one codepoint, the script name. That could help. You go through a sentence, and see what's left at the end. I'll put up the code somewhere.
EDIT: the code
http://pastebin.com/e276zn6y
In response to the comment below:
The function above is built on data provided by Unicode.org. While not an expert per se, I contributed quite a bit to the Unihan database – and I happen to speak CJK. Yes, all three. I do have some code that takes advantage of the kXXX properties in the Unihan database, but (a) I wasn't aware we were supposed to write code for the OP, and (b) it would require logistics that might go beyond what the OP is ready to implement. My advice stands: with the function above, loop through one full sentence. If all codepoints are "Han" (or "Han" + "Latin"), chances are high it's Chinese. If on the other hand the result is a mix of "Han" + "Hangul" (+ possibly "Latin"), you can't go wrong with Korean. Likewise, a mix of "Han" and "Katakana"/"Hiragana" means you have Japanese.
A QUICK TEST
Some code to be used with the function I linked to before.
function guessLanguage(x) {
    // scriptName() comes from the code linked above
    var results = {};
    var s = '';
    var i, j = x.length;
    for (i = 0; i < j; i++) {
        s = scriptName(x.substr(i, 1));
        if (results.hasOwnProperty(s)) {
            results[s] += 1;
        } else {
            results[s] = 1;
        }
    }
    console.log(results);
    var mostCount = 0;
    var mostName = '';
    for (x in results) {
        if (results.hasOwnProperty(x)) {
            if (results[x] > mostCount) {
                mostCount = results[x];
                mostName = x;
            }
        }
    }
    return mostName;
}
Some tests:
r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ!");
Object
Common: 2
Han: 5
Hiragana: 9
Katakana: 4
__proto__: Object
"Hiragana"
The r object contains the number of occurrences of each script. Hiragana is the most frequent, and Hiragana+Katakana --> 2/3 of the sentence.
r=guessLanguage("我唔知道,佢講乜話.")
Object
Common: 2
Han: 8
__proto__: Object
"Han"
An obvious case of Chinese (Cantonese in this case).
r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
Common: 11
Han: 4
Hangul: 19
__proto__: Object
"Hangul"
Some Han characters, and a whole lot of Hangul. A Korean sentence, assuredly.
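To turn those per-script counts into the language guess described earlier (kana implies Japanese, Hangul implies Korean, and an all-Han mix defaults to Chinese), a small mapping function is enough. This is a sketch of that heuristic, not code from the linked pastebin, and the function name is made up:

```javascript
// Map a script-count object (like the ones printed above) to a language code
function scriptsToLanguage(counts) {
    if (counts.Hiragana || counts.Katakana) return "ja"; // kana occurs only in Japanese
    if (counts.Hangul) return "ko";                      // Hangul occurs only in Korean
    if (counts.Han) return "cn";                         // all-Han: default to Chinese
    return "";                                           // no CJK scripts found
}

console.log(scriptsToLanguage({Common: 2, Han: 5, Hiragana: 9, Katakana: 4})); // ja
```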

RichTextBox use to retrieve Text property in C++

I am using a hidden RichTextBox to retrieve Text property from a RichEditCtrl.
rtb->Text returns the text portion in either English or national languages – just great!
But I need this text in the form \u12232? \u32232? instead of national characters and symbols, to work with my db and RichEditCtrl. Any idea how to get from “пассажирским поездом Невский” to “\u12415?\u12395?\u23554?\u20219?\u30456?\u35527?\u21729?” (where each national character is represented as “\u23232?”)?
If you have, that would be great.
I am using visual studio 2008 C++ combination of MFC and managed code.
Cheers and have a wonderful weekend
If you need a System::String as an output as well, then something like this would do it:
String^ s = rtb->Text;
StringBuilder^ sb = gcnew StringBuilder(s->Length);
for (int i = 0; i < s->Length; ++i) {
    // Escaped backslash: we want a literal "\u" in the output
    sb->AppendFormat("\\u{0:D5}?", (int)s[i]);
}
String^ result = sb->ToString();
By the way, are you sure the format is as described? \u is traditionally an escape sequence for a hexadecimal Unicode codepoint, exactly 4 hex digits long, e.g. \u0F3A, and it is not normally followed by ?. If you actually want hex, the format specifier {0:X4} will do the trick.
You don't need to use escaping to put formatted Unicode in a RichText control. You can use UTF-8. See my answer here: Unicode RTF text in RichEdit.
I'm not sure what your restrictions are on your database, but maybe you can use UTF-8 there too.