Unicode Library - Where to Find One (NEW FOCUS: Unicode in XSL-FO) - unicode

EDIT: I am no longer asking for a Unicode library. Not only has one been linked, but the original question was inappropriate to ask, as mentioned below. This question is now focused on how to implement Unicode in XSL-FO.
My primary question now is what steps are required to use Unicode characters. I already have the necessary Unicode character references, but I understand that the proper 'font' needs to be selected as well, and am led to believe there are other steps that need to be taken in order to implement it in my XSL-FO document.

What do you mean by "write foreign characters"? An XSL-FO file is just an XML file, so you can use any Unicode reference to figure out the character number and then an XML numeric character reference to include it.
For example, the Unicode code point for the Euro symbol € is U+20AC, so in XML (XSL-FO) that would be &#x20AC;
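A numeric character reference is just the code point written between &# (decimal) or &#x (hexadecimal) and a semicolon. As a minimal sketch in Python (the helper name numeric_char_refs is made up for illustration and is not part of any XSL-FO tool), both forms can be generated like this:

```python
def numeric_char_refs(ch):
    """Return the (hexadecimal, decimal) XML numeric character references for one character."""
    code_point = ord(ch)  # Unicode code point as an integer
    return "&#x{:X};".format(code_point), "&#{};".format(code_point)


hex_ref, dec_ref = numeric_char_refs("\u20ac")  # the Euro sign
print(hex_ref, dec_ref)  # &#x20AC; &#8364;
```

Either string can be pasted into the text of an fo:block. Whether the glyph actually renders still depends on the selected font containing it.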

I experienced the same kind of problem. The problem with Unicode characters is hardcoded in FONET.DLL. In the class TrueTypeFont, the method MapCharacter is written as:
    public override ushort MapCharacter(char c)
    {
        if (c > Byte.MaxValue)
            return (ushort) FirstChar;
        return mapping.MapCharacter(c);
    }
So any character with a value greater than 255 is replaced by FirstChar instead of being looked up, which means it is effectively "ignored". I downloaded the sources (from https://fonet.codeplex.com/) and modified the method to:
    public override ushort MapCharacter(char c)
    {
        return mapping.MapCharacter(c);
    }
Using this library with this new method, the euro-symbol magically became visible!

Related

How to search for any unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich text entries; these entries have some symbols in them such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-number symbols are not considered "standard UTF-8" by TexMaker's import function (and in any case probably won't be processed by XeLaTeX or other converters). I can't simply use the UnicodeString functions in DXL because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. less than some numeric code value. For example, charOf(8217) should create a right-curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I did copy the curly quote from the DOORS module and verified via an online Unicode analyzer that it was definitely Unicode decimal value 8217.
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with, e.g., \textquoteright in the output stream.
My overall setup works for lower-count chars, since this works
(c is a single character pulled from a string):
    thedeg = charOf(176)
    if (thedeg == c)
    {
        temp += "$\\degree$"
    }
Got some help from DXL coding experts over at IBM forums.
Quoting the important stuff (there's some useful code snippets there as well):
Hey, you are right it seems intOf(char) and charOf(int) both do some
modulo 256 and therefore cut anything above that off. Try:
int i=8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.
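The truncation described in the quoted reply can be illustrated outside DXL. The following is a sketch in Python, not DXL, and it assumes the conversion really does keep only the value modulo 256 as the reply suggests. The names low_byte, LATEX_REPLACEMENTS and to_latex are invented for this example, and the replacement table is only a starting point.

```python
def low_byte(code_point):
    """Model of the reported behaviour: only the value modulo 256 survives."""
    return code_point % 256


print(low_byte(8217))  # 25, so a comparison against U+2019 can never match
print(low_byte(176))   # 176, which is why the degree sign worked

# Hypothetical replacement table: code point -> LaTeX markup.
# The trailing {} stops LaTeX from swallowing a following space.
LATEX_REPLACEMENTS = {
    0x2018: r"\textquoteleft{}",
    0x2019: r"\textquoteright{}",
    0x201C: r"\textquotedblleft{}",
    0x201D: r"\textquotedblright{}",
    0x00B0: r"$\degree$",
}


def to_latex(text):
    """Replace every character found in the table, leave the rest untouched."""
    return "".join(LATEX_REPLACEMENTS.get(ord(ch), ch) for ch in text)


print(to_latex("it\u2019s 5\u00b0"))  # it\textquoteright{}s 5$\degree$
```

Keeping the table keyed by full code points avoids the comparison problem entirely, provided the host language hands back the full value rather than a truncated one.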

When using UTF-8, is it better to reference character for international use using decimal or hex... and why?

When using UTF-8, which character reference is better, or more widely supported worldwide on various browsers... using decimal references or hex references?
UPDATE
For instance, for replacing quotation marks...
&#34; or &#x22;
which one is better to use, and why?
All HTML entities use only the ASCII subset, so the fact that you encode your document in UTF-8, as opposed to any other byte oriented encoding which extends ASCII, is unrelated.
Anyway:
When using UTF-8, you can just copy and paste the relevant characters into the document, without references at all. E.g. StackOverflow does not convert this ⫅ to an entity (see the source of this page).
If you prefer using entities, then I would use the hex references, purely because this is the way Unicode code points are usually written in the charts. References are so widely supported that I do not think you will hit a compatibility problem with either hex or decimal references.
There is no functional difference between decimal references and hexadecimal references. Old browsers did not support the latter, but then we are talking about really old browsers like Netscape 4 and IE 4.
Hexadecimal references are usually more handy, because in character code standards and other reference works, characters are referred to by their code numbers in hexadecimal. Using them, you avoid the conversion from hexadecimal to decimal (and thereby may avoid some mistakes).
There is no reason to use either &#34; or &#x22; in text. (In attribute values, they, or &quot;, are needed in rare cases.)
This does not depend on the document encoding (UTF-8 or something else), except in the sense that when using UTF-8, you do not need the references (except for the markup-significant characters < and &). UTF-8 lets you enter any character as such, though you might still use references if you find that more comfortable than finding an editor that lets you enter the characters themselves.
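That decimal and hexadecimal references are interchangeable is easy to check. This is a small sketch using html.unescape from Python's standard library, which is only one of many decoders; browsers apply the same rules when parsing a page.

```python
import html

decimal_ref = "&#34;"   # decimal reference for the quotation mark
hex_ref = "&#x22;"      # hexadecimal reference for the same character
named_ref = "&quot;"    # named entity for the same character

# All three forms decode to one and the same character.
decoded = {html.unescape(ref) for ref in (decimal_ref, hex_ref, named_ref)}
print(decoded)  # {'"'}

# The same holds for characters outside ASCII, e.g. the left curly quote U+201C.
print(html.unescape("&#8220;") == html.unescape("&#x201C;"))  # True
```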

Getting first symbol from a glyph

Related (in fact, perhaps a duplicate of): how to extract characters from a Korean string in VBA
The linked question doesn't give me satisfactory answers and it's 2 years old so I'm making a new question.
I want to find the first symbol in a Korean glyph, i.e. "한" -> "ㅎ" or "가" -> "ㄱ". I also want to recognize inputs that are already single symbols, such as "ㄱ".
I'm working with NSString, which I believe uses UTF-8. Do I have to convert the string to EUC-KR, then start reading bytes, or what?
As a disclaimer, I have no experience in working with the iPhone or NSString, except for what I've read in the documentation in order to answer this question. I'm addressing the question mainly as a Unicode problem.
In order to find the first symbol (jamo) from a Korean glyph, you have to perform a decomposition as described in my answer to how to extract characters from a Korean string in VBA (it's a new answer so you didn't see it when you posted your question). To apply my answer (which is derived directly from the Unicode standard), you have to work with the Unicode code points (numerical values) of the Korean syllables. It looks like calling the method dataUsingEncoding passing NSUnicodeStringEncoding as a parameter should do the trick.
In order to identify single symbols, you have to check whether the Unicode code point of the character you are checking is in any of the following ranges:
1100-11FF (Hangul Jamo). I think this should cover most of the real life cases.
A960-A97F (Hangul Jamo Extended-A)
D7B0-D7FF (Hangul Jamo Extended-B)
3130-318F (Hangul Compatibility Jamo)
FFA0-FFDC (Halfwidth Jamo)
Check the Unicode Code Charts for a complete reference.
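The decomposition and the range check can be sketched as follows. This is Python rather than Objective-C, and the function names is_jamo and first_jamo are made up for illustration. The arithmetic follows the Hangul syllable decomposition in the Unicode standard: precomposed syllables start at U+AC00, and each leading consonant covers 21 × 28 = 588 syllables. Since the examples in the question use compatibility jamo such as "ㅎ", the sketch returns those by default.

```python
S_BASE = 0xAC00                # first precomposed syllable, "가"
L_BASE = 0x1100                # first leading consonant in the Hangul Jamo block
V_COUNT, T_COUNT = 21, 28      # number of vowels and of trailing consonants (incl. none)
N_COUNT = V_COUNT * T_COUNT    # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT         # 11172 precomposed syllables in total

# Compatibility jamo for the 19 leading consonants, in Unicode order.
COMPAT_LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

# The jamo blocks listed above.
JAMO_RANGES = [
    (0x1100, 0x11FF), (0xA960, 0xA97F), (0xD7B0, 0xD7FF),
    (0x3130, 0x318F), (0xFFA0, 0xFFDC),
]


def is_jamo(ch):
    """True if the character lies in one of the jamo blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in JAMO_RANGES)


def first_jamo(ch, compatibility=True):
    """Return the leading consonant of a syllable, the jamo itself, or None."""
    s_index = ord(ch) - S_BASE
    if 0 <= s_index < S_COUNT:
        l_index = s_index // N_COUNT
        return COMPAT_LEADS[l_index] if compatibility else chr(L_BASE + l_index)
    return ch if is_jamo(ch) else None


print(first_jamo("한"), first_jamo("가"), first_jamo("ㄱ"))  # ㅎ ㄱ ㄱ
```

Passing compatibility=False returns the conjoining jamo from the 1100-11FF block instead, which is what canonical decomposition produces.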

How to convert unicode escape code to character in Objective C (on iPhone)

I have a string that contains Unicode escape codes, e.g. @"D\u017cem" (\u017c is the code for ż). I would like to convert that string to one containing the actual characters. In the example that would be @"Dżem".
Is there any method in SDK or library that can do such replacement AND work on iPhone?
(Obviously I can do the replacement myself, changing characters one by one, but it is rather cumbersome)
According to Apple,
It is not safe to include high-bit characters in your source code
Note that the "universal character name" \u017c is replaced at compile time with an implementation-defined value which in practice is the UTF8 representation, so the end result is the same as you would get if you (correctly) did the replacement you are talking about. If you're having a problem with some other source-processing tool, you might be better served by teaching that tool to recognize C99 universal character names.
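The compile-time replacement only applies to literals in source code. If the escapes arrive at run time, for example in text read from a file, they have to be decoded by hand. The logic is small; here it is sketched in Python rather than Objective-C, with a made-up helper name decode_u_escapes. It handles only the four-digit \uXXXX form and does not combine surrogate pairs.

```python
import re

# A literal backslash, the letter u, then exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace every \\uXXXX escape in text with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


raw = r"D\u017cem"            # raw string: the backslash is kept as-is
print(decode_u_escapes(raw))  # Dżem
```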
I suggest you start using NSLocalizedString()
http://www.pushplay.net/2009/08/developing-localized-iphone-applications/
http://developer.apple.com

RichTextBox use to retrieve Text property in C++

I am using a hidden RichTextBox to retrieve the Text property from a RichEditCtrl.
rtb->Text; returns the text portion in either English or national languages – just great!
But I need this text as \u12232?\u32232? escapes instead of national characters and symbols, to work with my database and RichEditCtrl. Any idea how to get from “пассажирским поездом Невский” to “\u12415?\u12395?\u23554?\u20219?\u30456?\u35527?\u21729?” (where each national character is represented as “\u23232?”)?
If you have, that would be great.
I am using visual studio 2008 C++ combination of MFC and managed code.
Cheers and have a wonderful weekend
If you need a System::String as an output as well, then something like this would do it:
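    String^ s = rtb->Text;
    StringBuilder^ sb = gcnew StringBuilder(s->Length);
    for (int i = 0; i < s->Length; ++i) {
        // "\\u" is a literal backslash plus 'u'; a bare "\u" would be read
        // by the compiler as the start of a universal character name.
        // {0:D5} writes the UTF-16 code unit as a zero-padded decimal number.
        sb->AppendFormat("\\u{0:D5}?", (int)s[i]);
    }
    // Take the result from the builder, not from the input string.
    String^ result = sb->ToString();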
By the way, are you sure the format is as described? \u is traditionally an escape sequence for a hexadecimal Unicode code point, exactly 4 hex digits long, e.g. \u0F3A. It's also not normally followed by ?. If you actually want the hexadecimal form, the format specifier {0:X4} should do the trick.
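To make the difference between the two formats concrete, here is a sketch in Python (not C++/CLI) that produces both from the same input. The function names are invented for this example. It works on UTF-16 code units, like the loop over a System::String does, and unlike the D5 specifier it does not zero-pad the decimal form.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of integers."""
    data = text.encode("utf-16-le")
    return struct.unpack("<{}H".format(len(data) // 2), data)


def decimal_escapes(text):
    """The form asked for in the question: backslash-u, decimal value, question mark."""
    return "".join("\\u{}?".format(unit) for unit in utf16_units(text))


def hex_escapes(text):
    """The traditional form: backslash-u followed by exactly four hex digits."""
    return "".join("\\u{:04X}".format(unit) for unit in utf16_units(text))


print(decimal_escapes("по"))  # \u1087?\u1086?
print(hex_escapes("по"))      # \u043F\u043E
```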
You don't need to use escaping to put formatted Unicode in a RichText control. You can use UTF-8. See my answer here: Unicode RTF text in RichEdit.
I'm not sure what your restrictions are on your database, but maybe you can use UTF-8 there too.