How can I display "«" in Japanese? Shift JIS encoding - unicode

I have text and I want to add '«' at the start and '»' at the end of the string.
For example: «MyText».
#define START_MARKER 0x00AB // '«'
#define END_MARKER 0x00BB // '»'
char startMarker[2], endMarker[2];
sprintf(startMarker, "%c",START_MARKER);
sprintf(endMarker, "%c",END_MARKER);
const UTF16String unicodeStartMarker = ai::UnicodeString(startMarker).as_ASUnicode();
const UTF16String unicodeEndMarker = ai::UnicodeString(endMarker).as_ASUnicode();
sUnicodePlaceholder = unicodeStartMarker + UTF8String(sFieldName).GetUnicodeString() + unicodeEndMarker;
It works for all languages except Japanese.
In Japanese I get a different character instead.
The character "«" is also defined as "U+00ab", but with the suffix SJIS (the character encoding for the Japanese language).
Do you have any idea how I can display "«" in Japanese?
Thanks!
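A possible explanation, offered as an assumption because the question does not show which encoding the ai::UnicodeString constructor applies to a char buffer: the sprintf call stores the single byte 0xAB, and that byte means different things in different legacy encodings. The sketch below uses plain Java rather than the Illustrator SDK (the class and method names are made up for illustration). It shows the byte decoding to « under windows-1252 but to a half-width katakana under Shift_JIS, and shows that building the string from code points skips the byte step altogether.

```java
import java.nio.charset.Charset;

class MarkerDemo {
    // Decode the single byte 0xAB under the named legacy charset.
    static String decodeByteAB(String charsetName) {
        return new String(new byte[] {(byte) 0xAB}, Charset.forName(charsetName));
    }

    // Build the placeholder from code points, so no byte-level charset is involved.
    static String wrap(String fieldName) {
        return "\u00AB" + fieldName + "\u00BB";
    }

    public static void main(String[] args) {
        // Print the code point each charset assigns to byte 0xAB.
        System.out.printf("windows-1252: U+%04X%n", (int) decodeByteAB("windows-1252").charAt(0));
        System.out.printf("Shift_JIS:    U+%04X%n", (int) decodeByteAB("Shift_JIS").charAt(0));
        System.out.println(wrap("MyText"));
    }
}
```

If that is indeed the cause, the corresponding fix in the original C++ would be to construct the markers from a UTF-8 or UTF-16 representation of U+00AB and U+00BB instead of from a platform-encoded char buffer.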

Related

Fixing mojibake for email

My application for macOS archives emails from email clients and IMAP accounts. One user got an email from a Windows user and archived the email from Mail. My app reads the data directly from the hard disk.
The user has an email with some nice mojibake.
I can identify a few of the characters:
‰ → ä
¸ → ü
But I can't figure out what the original encoding is. I made myself an encoding table for the characters "ö ä ü ß", and my data is not in that table:
Code:
dim theLeft as string = "ö ä ü ß"
for currentEncoding as integer = 0 to Encodings.Count - 1
dim EncodingInternetName as String = Encodings.Item(currentEncoding).internetName
if EncodingInternetName.IndexOf("iso") = -1 and EncodingInternetName.IndexOf("windows") = -1 then Continue
dim newString as string = DefineEncoding(theLeft, Encodings.Item(currentEncoding)) '<---convert
newString = ConvertEncoding(newString, Encodings.UTF8)
Result.Add(EncodingInternetName + " " + newString)
next
Does anyone have an idea what encoding was used for the Mojibake?
dim theString as String = "ˆ ‰ ¸ ﬂ"
theString = theString.ConvertEncoding(Encodings.WindowsANSI)
theString = theString.DefineEncoding(Encodings.UTF8)
theString = theString.ConvertEncoding(Encodings.MacRoman)
theString = theString.DefineEncoding(Encodings.WindowsANSI)
theString = theString.ConvertEncoding(Encodings.UTF8)
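The pairs quoted in the question fit one specific mix-up: ä and ü are bytes 0xE4 and 0xFC in Windows-1252, and MacRoman displays those same bytes as ‰ and ¸. As a minimal sketch of that idea in Java (the class and method names are illustrative, and it assumes the JDK ships the x-MacRoman charset, which standard JDK builds do), the damage and its repair look like this:

```java
import java.nio.charset.Charset;

class MojibakeDemo {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");
    static final Charset MAC_ROMAN = Charset.forName("x-MacRoman");

    // Simulate the damage: Windows-1252 bytes misread as MacRoman.
    static String garble(String original) {
        return new String(original.getBytes(WINDOWS_1252), MAC_ROMAN);
    }

    // Undo it: recover the raw bytes via MacRoman, then decode them as Windows-1252.
    static String repair(String garbled) {
        return new String(garbled.getBytes(MAC_ROMAN), WINDOWS_1252);
    }

    public static void main(String[] args) {
        String original = "\u00F6 \u00E4 \u00FC \u00DF"; // o-umlaut, a-umlaut, u-umlaut, sharp s
        String garbled = garble(original);
        System.out.println(garbled);
        System.out.println(repair(garbled));
    }
}
```

The repair only works while every garbled character still exists in MacRoman; anything already replaced by a substitute character on the way is lost.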

Is there an uppercase letter that has two lowercase alternatives?

Are there two distinct lowercase letters ch1 and ch2 (ch1 <> ch2) such that uppercase(ch1) == uppercase(ch2)? Does Unicode actually contain such a pair?
A follow-up question is: For any ch that's a lowercase letter, is the following expression always true?
ch == lowercase(uppercase(ch))
Did a quick test in Java:
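import java.util.Locale;

import static java.lang.Character.isLetter;
import static java.lang.Character.isLowerCase;

public class CaseCollisions {
    public static void main(String[] args) {
        // Compare every pair of distinct lowercase letters in the BMP.
        for (char ch1 = 0; ch1 < 65534; ch1++) {
            if (!isLetter(ch1) || !isLowerCase(ch1)) {
                continue;
            }
            String s1 = "" + ch1;
            for (char ch2 = (char) (ch1 + 1); ch2 < 65535; ch2++) {
                if (!isLetter(ch2) || !isLowerCase(ch2)) {
                    continue;
                }
                String s2 = "" + ch2;
                // Report pairs whose uppercase forms are identical.
                if (s1.toUpperCase(Locale.US).equals(s2.toUpperCase(Locale.US))) {
                    System.out.println("ch1=" + ch1 + " (" + (int) ch1 + "), ch2=" + ch2 + " (" + (int) ch2 + ")");
                }
            }
        }
    }
}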
It prints:
ch1=i (105), ch2=ı (305)
ch1=s (115), ch2=ſ (383)
ch1=µ (181), ch2=μ (956)
ch1=ΐ (912), ch2=ΐ (8147)
ch1=ΰ (944), ch2=ΰ (8163)
ch1=β (946), ch2=ϐ (976)
ch1=ε (949), ch2=ϵ (1013)
ch1=θ (952), ch2=ϑ (977)
ch1=ι (953), ch2=ι (8126)
ch1=κ (954), ch2=ϰ (1008)
ch1=π (960), ch2=ϖ (982)
ch1=ρ (961), ch2=ϱ (1009)
ch1=ς (962), ch2=σ (963)
ch1=φ (966), ch2=ϕ (981)
ch1=в (1074), ch2=ᲀ (7296)
ch1=д (1076), ch2=ᲁ (7297)
ch1=о (1086), ch2=ᲂ (7298)
ch1=с (1089), ch2=ᲃ (7299)
ch1=т (1090), ch2=ᲄ (7300)
ch1=т (1090), ch2=ᲅ (7301)
ch1=ъ (1098), ch2=ᲆ (7302)
ch1=ѣ (1123), ch2=ᲇ (7303)
ch1=ᲄ (7300), ch2=ᲅ (7301)
ch1=ᲈ (7304), ch2=ꙋ (42571)
ch1=ṡ (7777), ch2=ẛ (7835)
ch1=ﬅ (64261), ch2=ﬆ (64262)
So the answer is: yes, there are distinct characters that have the same upper-case representation.
Upper, title, and lower case mappings are locale-specific, so different locales may give different results (e.g. French upper case may lose accents).
Unicode also defines a standard way to convert to upper case or to lower case, with an exception for the Turkic languages, which may have different rules (marked with T in the CaseFolding.txt file of the Unicode database), and further special cases for Turkish, Greek, and Lithuanian in SpecialCasing.txt.
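To make the Turkish exception concrete, here is a small Java sketch (the class name is illustrative) comparing the locale-neutral mapping with the Turkish one. Under a Turkish locale the dotted and dotless i are kept apart, so "i" uppercases to İ (U+0130) and "I" lowercases to ı (U+0131).

```java
import java.util.Locale;

class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Uppercase "i" under the given locale.
    static String upperI(Locale locale) {
        return "i".toUpperCase(locale);
    }

    // Lowercase "I" under the given locale.
    static String lowerI(Locale locale) {
        return "I".toLowerCase(locale);
    }

    public static void main(String[] args) {
        System.out.println(upperI(Locale.ROOT) + " " + lowerI(Locale.ROOT));
        System.out.println(upperI(TURKISH) + " " + lowerI(TURKISH));
    }
}
```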
In most cases there is a unique way to convert lower to upper case (and back), but see KELVIN SIGN (U+212A), which maps to K, and other signs that use the same glyphs as ordinary letters (these should go away if you remove compatibility characters with a normalization).
One case is the Greek letter sigma. There is only one upper-case form, but two lower-case forms, and which one is used depends on whether the letter is at the end of a word.
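The sigma case can be checked directly in Java, whose String.toLowerCase applies the final-sigma rule. This is a small sketch with an illustrative class name; the strings are written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.Locale;

class SigmaDemo {
    // Locale-neutral lowercase; Java picks the final or medial sigma from context.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // GREEK CAPITAL OMICRON, DELTA, OMICRON, SIGMA
        String word = "\u039F\u0394\u039F\u03A3";
        String lowered = lower(word);
        // The last letter becomes U+03C2 (final sigma), not U+03C3.
        System.out.printf("last letter: U+%04X%n", (int) lowered.charAt(lowered.length() - 1));
        // Both lowercase sigmas share one uppercase form.
        System.out.println(upper("\u03C3").equals(upper("\u03C2")));
    }
}
```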
You will find more information in the Unicode document about Unicode database: http://www.unicode.org/reports/tr44/#Casemapping and in the Unicode standard (linked in the document, as well the two files I named above).
Note: some case mappings increase the number of code points, so when converting back, one should check for the longest match.
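For the follow-up question, a short Java sketch (the class and method names are illustrative) shows that ch == lowercase(uppercase(ch)) does not always hold. Several lowercase letters fail to survive the round trip, and ß even changes length.

```java
import java.util.Locale;

class RoundTripDemo {
    // lowercase(uppercase(s)) using locale-neutral rules.
    static String roundTrip(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // LONG S, SHARP S, DOTLESS I, FINAL SIGMA
        String[] samples = {"\u017F", "\u00DF", "\u0131", "\u03C2"};
        for (String s : samples) {
            String back = roundTrip(s);
            System.out.println(s + " -> " + s.toUpperCase(Locale.ROOT) + " -> " + back
                    + " (unchanged: " + s.equals(back) + ")");
        }
    }
}
```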

How to add ^B (ASCII code - 2) in String in scala

I have a string abc,def. I need to convert every ',' into "^B" (ASCII code 2). How can I do that in Scala?
I tried:
var l = str.replace(',', 2.asInstanceOf[Char])
var l = str.replace(',', 2.tochar)
but neither works.
STX has code 2 in both ASCII and UTF-16:
"abc,def".replace(',', '\u0002')

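The same replacement can be checked on the JVM with a short Java sketch (the class and method names are illustrative). Control characters are not printable, so a successful replacement can look as if nothing happened; inspecting the numeric value of the character settles it.

```java
class StxDemo {
    // Replace every comma with STX (code 2).
    static String commasToStx(String s) {
        return s.replace(',', (char) 2);
    }

    public static void main(String[] args) {
        String result = commasToStx("abc,def");
        // Print each character's numeric code instead of the invisible character itself.
        for (int i = 0; i < result.length(); i++) {
            System.out.print((int) result.charAt(i) + " ");
        }
        System.out.println();
    }
}
```
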
UILabel Convert Unicode(Japanese) and display

After hours of research I gave up.
I receive text data from a WebService. In some cases the text is in Japanese, and the WS returns a Unicode-escaped version of it. For example: \U00e3\U0082\U008f
I know that this is a Japanese char.
I am trying to display this Unicode char or string inside a UILabel.
Since the simple setText method doesn't display the correct chars, I used this (copied) routine:
unichar unicodeValue = (unichar) strtol([[[p innerData] valueForKey:@"title"] UTF8String], NULL, 16);
char buffer[2];
int len = 1;
if (unicodeValue > 127) {
buffer[0] = (unicodeValue >> 8) & (1 << 8) - 1;
buffer[1] = unicodeValue & (1 << 8) - 1;
len = 2;
} else {
buffer[0] = unicodeValue;
}
[[cell title] setText:[[NSString alloc] initWithBytes:buffer length:len encoding:NSUTF8StringEncoding] ];
But no success: the UILabel is empty.
I know that one way could be convert the chars to hex and then from hex to String...is there a simpler way?
SOLVED
First, make sure that your server is sending UTF-8 and not escaped Unicode code points. The only way I found is to json_encode strings which contain Unicode characters.
Then, in iOS, use unescaping as described in this link: Using Objective C/Cocoa to unescape unicode characters, ie \u1234
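The three escaped values in the question, E3 82 8F, are exactly the UTF-8 encoding of U+308F (わ), which suggests the service escapes each UTF-8 byte as if it were a code point. Under that assumption, a minimal Java sketch of the repair (the class and method names are illustrative, and the input is assumed to be already unescaped into three characters) turns each character back into one byte and decodes the bytes as UTF-8:

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodedDemo {
    // Treat each char (expected in the range 0x00..0xFF) as one raw byte, then decode as UTF-8.
    static String reinterpretAsUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three "code points" from the question, already unescaped.
        String received = "\u00E3\u0082\u008F";
        String fixed = reinterpretAsUtf8(received);
        System.out.printf("U+%04X%n", (int) fixed.charAt(0));
    }
}
```

This is only a workaround for the client side; fixing the server so that it sends real UTF-8, as the accepted solution does, avoids the problem at the source.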

Escape Double-Byte Characters for RTF

I am trying to escape double-byte (usually Japanese or Chinese) characters from a string so that they can be included in an RTF file. Thanks to poster falconcreek, I can successfully escape special characters (e.g. umlaut, accent, tilde) that are single-byte.
- (NSString *)stringFormattedRTF:(NSString *)inputString
{
NSMutableString *result = [NSMutableString string];
for ( int index = 0; index < [inputString length]; index++ ) {
NSString *temp = [inputString substringWithRange:NSMakeRange( index, 1 )];
unichar tempchar = [inputString characterAtIndex:index];
if ( tempchar > 127) {
[result appendFormat:@"\\\'%02x", tempchar];
} else {
[result appendString:temp];
}
}
return result;
}
It appears this is looking for any unicode characters with a decimal value higher than 127 (which basically means anything not ASCII). If I find one, I escape it and translate that to a hex value.
EXAMPLE: Small "e" with acute accent gets escaped and converted to its hex value, resulting in "\'e9"
While Asian characters are above the decimal value 127, the output from the above appears to read the first byte of the Unicode double-byte character and encode that, then pass the second byte as is. For the end user it ends up as ????.
Suggestions are greatly appreciated. Thanks.
UPDATED Code sample based on suggestion. Not detecting. :(
NSString *myDoubleByteTestString = @"blah は凄くいいアップです blah åèüñ blah";
NSMutableString *resultDouble = [NSMutableString string];
for ( int index = 0; index < [myDoubleByteTestString length]; index++ )
{
NSString *tempDouble = [myDoubleByteTestString substringWithRange:NSMakeRange( index, 1 )];
NSRange doubleRange = [tempDouble rangeOfComposedCharacterSequenceAtIndex:index];
if(doubleRange.length > 2)
{
NSLog(@"%@ is a double-byte character. Escape it.", tempDouble);
// How to escape double-byte?
[resultDouble appendFormat:tempDouble];
}
else
{
[resultDouble appendString:tempDouble];
}
}
Take a look at the code at rangeOfComposedCharacterSequenceAtIndex: to see how to get all the characters in a composed character. You'll then need to encode each of the characters in the resulting range.
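One alternative to the two-digit hex escape, which can only carry a single byte, is RTF's Unicode escape: a backslash, the letter u, the UTF-16 code unit as a signed 16-bit decimal number, and then a fallback character for readers that do not understand it. The Java sketch below illustrates that approach with made-up names; it is not the poster's method, and it assumes the document keeps RTF's default of one fallback character per escape.

```java
class RtfEscapeDemo {
    // Escape RTF control characters and write every non-ASCII UTF-16 unit as a Unicode escape.
    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                out.append('\\').append(c);      // RTF syntax characters
            } else if (c < 128) {
                out.append(c);                   // plain ASCII passes through
            } else {
                // Signed 16-bit decimal, so units above 32767 come out negative.
                out.append("\\u").append((int) (short) c).append('?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // e with acute accent, then HIRAGANA LETTER HA
        System.out.println(escape("\u00E9 \u306F"));
    }
}
```

Because the loop works on UTF-16 code units, a character outside the BMP is written as two consecutive escapes, one per surrogate.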