Why can't _printf_l print a multibyte string in a Chinese locale? - unicode

I am using Windows 7 and VS2008 to test the following code:
wchar_t *pWCBuffer = L"你好,世界"; // some Chinese character
char *pMBBuffer = (char *)malloc( BUFFER_SIZE );
_locale_t locChinese = _create_locale(LC_CTYPE, "chs");
_wcstombs_l(pMBBuffer, pWCBuffer, BUFFER_SIZE, locChinese );
_printf_l("Multibyte character: %s\n\n", locChinese, pMBBuffer );
I convert a wide string to a multibyte string and then print it out using the Chinese locale, but the printed string is not right; it is something weird like: ─π║├ú¼╩└╜τ
How could I print out the right multi-byte string?

This is not an absolute answer, because Unicode on different platforms can be tricky. But if your Windows 7 is an English version, you might want to try the PowerShell ISE to see the output. I use that to print out Unicode when writing programs in Ruby too.
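Alternatively, if the goal is to get the CP936 bytes to show up correctly in the plain console, one thing worth trying is to switch the console output codepage before printing. A minimal sketch of the idea (assuming the console font can display Chinese and that the conversion itself succeeded):
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main(void)
{
    const wchar_t *pWCBuffer = L"你好,世界";   // same text as in the question
    char pMBBuffer[64] = {0};
    _locale_t locChinese = _create_locale(LC_CTYPE, "chs");
    _wcstombs_l(pMBBuffer, pWCBuffer, sizeof(pMBBuffer), locChinese);  // convert to CP936 bytes
    SetConsoleOutputCP(936);   // ask the console to interpret the output bytes as CP936
    printf("Multibyte character: %s\n", pMBBuffer);
    _free_locale(locChinese);
    return 0;
}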

Related

Create Unicode from a hex number in C++

My objective is to take a character which represents the UK pound symbol and convert it to its Unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>
int main()
{
char x = 163;
unsigned char ux = x;
const char *str = "\u00A3";
printf("x: %d\n", x);
printf("ux: %d %x\n", ux, ux);
printf("str: %s\n", str);
return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string holding the Unicode representation of the UK pound sign: "\u00A3"
What exactly is your question? Anyway, you say you're writing C++, but you're using char*, printf and stdio.h, so you're really writing C, and base C has no real Unicode support. Remember that a char in C is not a "character", it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you printf the "\u00A3" string in your sample program, the compiler has already translated that escape into bytes of the execution character set (typically the UTF-8 sequence 0xC2 0xA3), and it is your terminal that interprets those bytes and draws the £ character. The fact that it prints correctly only means your terminal's encoding happens to match. You can see this for yourself: print the bytes individually, for example printf("%x\n", (unsigned char)str[0]), and you will get the first byte of that sequence rather than a single "character".
If you want to use Unicode correctly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++11 or newer and use std::wstring and friends. But what you are doing is not real Unicode and will not work as you expect in the long run.
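That said, for the narrow goal in the question (turning the byte 0xA3 into the UTF-8 bytes for U+00A3) you do not need a full library. Here is a minimal hand-rolled sketch in plain C; the buffer name and sizes are just for this example:
#include <stdio.h>
int main(void)
{
    unsigned char ux = 0xA3;               // the pound sign in Latin-1 / CP1252
    char utf8[3];
    // code points U+0080..U+07FF encode to two UTF-8 bytes: 110xxxxx 10xxxxxx
    utf8[0] = (char)(0xC0 | (ux >> 6));    // 0xC2
    utf8[1] = (char)(0x80 | (ux & 0x3F));  // 0xA3
    utf8[2] = '\0';
    printf("utf8 bytes: %02X %02X -> %s\n",
           (unsigned char)utf8[0], (unsigned char)utf8[1], utf8);
    return 0;
}
On a UTF-8 terminal the last %s prints £, i.e. the same two bytes the "\u00A3" literal compiles to.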

How does WideCharToMultiByte deal with codepages?

When I execute the code below, why am I getting '?' for the first case? As far as I know, codepage 932 supports line-drawing characters.
How does this API deal with codepages? As far as I know, it searches for the character in the codepage and returns that character's byte value(s) from the codepage.
typedef struct dbcs {
unsigned char HighByte;
unsigned char LowByte;
} DBCS;
static DBCS set[5] = {0x25,0x5D};
unsigned char array[2];
#include <windows.h>
#include <stdio.h>
int main()
{
// printf("hello world");
int str_size;
LPCWSTR charpntr;
LPSTR getcd;
LPBOOL flg;
int i ;
array[0] = set[0].LowByte;
array[1] = set[0].HighByte;
charpntr = &array;
str_size = WideCharToMultiByte(932, 0, charpntr, 1, getcd, 2, NULL, NULL);
printf(" value of %u", getcd);
printf("number of bytes %d character is %s", str_size, getcd);
printf("\n");
array[0] = set[0].LowByte;
array[1] = set[0].HighByte;
charpntr = &array;
str_size = WideCharToMultiByte(437, 0, charpntr, 1, getcd, 2, NULL, NULL);
printf(" value of %u", getcd);
printf("number of bytes %d character is %s", str_size, getcd);
printf("\n");
}
Result of execution in Code::Blocks: (output screenshot not reproduced here)
Windows codepage 932 is not a simple thing - as it uses multibyte characters.
I have no Windows here, so I have been experimenting with the encoding of the character you are using in Python 3, in a UTF-8 terminal: it works fine with cp437 and UTF-8, but Python refuses to encode the character to what it calls "cp932", or to any of its aliases listed in the Wikipedia article:
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
It may be a fault in Python's internal Unicode tables (fetched directly from the Unicode consortium), or possibly this codepage doesn't map this character at all.
Anyway, there are problems in your code: one is that you never initialize getcd. Reading the docs for WideCharToMultiByte(), one sees that it must not be NULL, so you have to allocate a proper return buffer for it.
So, try putting the getcd declaration as:
char getcd[6]={};
That should give you enough space for even the widest characters you experiment with, plus room for a terminating \x00.
And another thing: if these line-drawing characters are present in CP932 at all, they are definitely multibyte there, so the output buffer and its size (the cbMultiByte parameter, the "2" after getcd) must leave room for at least two bytes. The "1" after charpntr is cchWideChar, the number of wide characters to convert, and is fine for a single character. If no other error kicks in, and the character exists in cp932, this alone might fix your issue.
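Putting the two fixes together, a minimal sketch of what the corrected call might look like (assuming U+255D, the '╝' box-drawing character, is what the set[] bytes were meant to form, and keeping in mind it may simply have no mapping in cp932):
#include <windows.h>
#include <stdio.h>
#include <string.h>
int main(void)
{
    wchar_t wide[2] = { 0x255D, 0 };   // the wide character the set[] bytes form
    char getcd[6] = {0};               // a real output buffer instead of an uninitialized pointer
    int str_size;
    str_size = WideCharToMultiByte(932, 0, wide, 1, getcd, sizeof(getcd), NULL, NULL);
    printf("cp932: %d byte(s), text: %s\n", str_size, getcd);
    memset(getcd, 0, sizeof(getcd));
    str_size = WideCharToMultiByte(437, 0, wide, 1, getcd, sizeof(getcd), NULL, NULL);
    printf("cp437: %d byte(s), text: %s\n", str_size, getcd);
    return 0;
}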

Decode a string with both Unicode and Utf-8 codes in Python 2.x

Say we have a string:
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
Somehow the two em-dash symbols ('—', Unicode \u2014) were not correctly encoded as '\xe2\x80\x94' in UTF-8. Is there an easy way to decode this string? It should be decoded as 寒假——厦门
Manually using the replace function is OK:
t = u'\u2014'
s = s.replace('\u2014', t.encode('utf-8'))
print s
However, it is not automatic. If we extract the Unicode,
index = s.find('\u')
t = s[index : index+6]
then t = '\\u2014'. How to convert it to UTF-8 code?
You're missing extra slashes in your replace()
It should be:
s.replace("\\u2014", u'\u2014'.encode("utf-8") )
More importantly, you should not end up in this situation in the first place.
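If you need this to be automatic rather than replacing each escape by hand, one possible sketch (assuming every literal \uXXXX escape in the byte string should become the UTF-8 bytes of that code point) is a regex substitution:
# -*- coding: utf-8 -*-
import re
s = '\xe5\xaf\x92\xe5\x81\x87\\u2014\\u2014\xe5\x8e\xa6\xe9\x97\xa8'
# turn every literal \uXXXX escape into the UTF-8 bytes of that code point
fixed = re.sub(r'\\u([0-9a-fA-F]{4})',
               lambda m: unichr(int(m.group(1), 16)).encode('utf-8'),
               s)
print fixed                  # 寒假——厦门 on a UTF-8 terminal
print fixed.decode('utf-8')  # the same text as a unicode object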

C# - Get ANSI code value of a character

I'd like to retrieve the ANSI code value of a given character.
E.g. when I now get the int value of the trademark character, I get 8482.
Instead I would like to get 153, which is the value of the trademark character in codepage 1252.
Some help would be appreciated.
Jurgen
Found it myself:
Encoding ansiEncoding = Encoding.GetEncoding(1252);
byte[] bytes = ansiEncoding.GetBytes(c.ToString()); // c is the char to convert; GetBytes takes a string or char[]
int code = bytes[0]; // 153 (0x99) for '™' in codepage 1252

How do I encode Unicode character codes in a PowerShell string literal?

How can I encode the Unicode character U+0048 (H), say, in a PowerShell string?
In C# I would just do this: "\u0048", but that doesn't appear to work in PowerShell.
Replace '\u' with '0x' and cast it to System.Char:
PS > [char]0x0048
H
You can also use the "$()" syntax to embed a Unicode character into a string:
PS > "Acme$([char]0x2122) Company"
AcmeT Company
Where T is how the console renders the ™ (trademark) character when it cannot display the real glyph.
Note: this method works only for characters in Plane 0, the BMP (Basic Multilingual Plane), chars < U+10000.
According to the documentation, PowerShell Core 6.0 adds support with this escape sequence:
PS> "`u{0048}"
H
see https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_special_characters?view=powershell-6#unicode-character-ux
Maybe this isn't the PowerShell way, but this is what I do. I find it to be cleaner.
[regex]::Unescape("\u0048") # Prints H
[regex]::Unescape("\u0048ello") # Prints Hello
For those of us still on 5.1 and wanting to use characters beyond the BMP (for which a plain [char] cast does not work), I made this function so you can simply build strings like so:
'this is my favourite park ',0x1F3DE,'. It is pretty sweet ',0x1F60A | Unicode
#takes in a stream of strings and integers,
#where integers are unicode codepoints,
#and concatenates these into valid UTF16
Function Unicode {
Begin {
$output=[System.Text.StringBuilder]::new()
}
Process {
$output.Append($(
if ($_ -is [int]) { [char]::ConvertFromUtf32($_) }
else { [string]$_ }
)) | Out-Null
}
End { $output.ToString() }
}
Note that getting these to display in your console is a whole other problem, but if you're outputting to an Outlook email or an Out-GridView window it will just work (since UTF-16 is native for .NET interfaces).
This also means you can output plain control or punctuation characters (not necessarily exotic Unicode) pretty easily if you're more comfortable with decimal, since you don't actually need the 0x (hex) syntax to make the integers. 'hello',32,'there' | Unicode puts an ordinary space between the two words, the same as if you had written 0x20 instead.
Another way using PowerShell.
$Heart = $([char]0x2665)
$Diamond = $([char]0x2666)
$Club = $([char]0x2663)
$Spade = $([char]0x2660)
Write-Host $Heart -BackgroundColor Yellow -ForegroundColor Magenta
Use the command help Write-Host -Full to read all about it.
To make it work for characters outside the BMP you need to use Char.ConvertFromUtf32()
'this is my favourite park ' + [char]::ConvertFromUtf32(0x1F3DE) +
'. It is pretty sweet ' + [char]::ConvertFromUtf32(0x1F60A)
Note that some characters like 🌎 lie outside the BMP and are represented by a surrogate pair (two [char] code units):
PS> "C:\foo\bar\$([char]0xd83c)$([char]0xdf0e)something.txt"
Will print:
C:\foo\bar\🌎something.txt
You can find these surrogate values here, in the "unicode escape" row:
https://dencode.com/string
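If you would rather compute the pair than look it up, a small sketch using [char]::ConvertFromUtf32 (the variable name is just for illustration) shows the two code units that make up 🌎:
$globe = [char]::ConvertFromUtf32(0x1F30E)          # "🌎" as a two-char .NET string
'{0:x4} {1:x4}' -f [int]$globe[0], [int]$globe[1]   # d83c df0e, the pair used above
"C:\foo\bar\${globe}something.txt"                  # same result as the two-[char] version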