Dealing with u"strings" in gdb - unicode

Why can't I cast u"string" to wchar_t* in gdb?
(gdb) print (wchar_t*)L"abc"
$60 = 0x568ae3b0 L"abc"
(gdb) print (wchar_t*)u"abc"
$61 = 0x567c5078 L"\x620061c\020I\x56640948\x567c50d0\x567c4f80\x567c4f30"
u"string" is an array of unsigned short which is almost the same as array of wchars.

wchar_t is NOT almost the same as unsigned short on Linux; it's almost the same as unsigned int. On Linux, wchar_t is 32 bits and holds UTF-32, not UTF-16. You'll have to use a Unicode library (probably ICU) to convert.
You can cast it to char32_t* though.
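A quick way to see the size mismatch outside gdb (a minimal sketch of my own, assuming a typical Linux/glibc toolchain compiled with -std=c11, where wchar_t is 4 bytes):

#include <stdio.h>
#include <uchar.h>   /* char16_t, char32_t */
#include <wchar.h>

int main(void)
{
    /* On Linux/glibc this typically prints 4 2 4:
       wchar_t is UTF-32 sized, so it matches char32_t, not char16_t. */
    printf("%zu %zu %zu\n",
           sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));

    /* u"" literals are char16_t arrays; U"" literals are char32_t arrays. */
    const char16_t *utf16 = u"abc";
    const char32_t *utf32 = U"abc";
    printf("first code units: 0x%x 0x%x\n",
           (unsigned)utf16[0], (unsigned)utf32[0]);
    return 0;
}

Since a u"" literal holds 2-byte code units, reading it through a 4-byte wchar_t* makes gdb pair up adjacent UTF-16 units (plus whatever follows in memory), which is exactly the garbage shown in $61 above.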

Related

Create Unicode from a hex number in C++

My objective is to take a character which represents the UK pound symbol and convert it to its Unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>

int main()
{
    char x = 163;
    unsigned char ux = x;
    const char *str = "\u00A3";

    printf("x: %d\n", x);
    printf("ux: %d %x\n", ux, ux);
    printf("str: %s\n", str);
    return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string holding the Unicode UK pound representation: "\u00A3"
What exactly is your question? Anyway, you say you're writing C++, but you're using char*, printf and stdio.h, so you're really writing C, and plain C has no built-in notion of Unicode strings. Remember that a char in C is not a "character", it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you printf the "\u00A3" string in your sample program, you are not printing a Unicode character as such: the compiler translated the \u00A3 escape into its encoding in the execution character set (on a typical UTF-8 system, the two bytes 0xC2 0xA3), and your terminal is helping you out by interpreting those bytes as the £ character. It comes out right only because the terminal's encoding happens to match the encoding the compiler used. You can see this for yourself: print str[0] on its own and you'll get a single meaningless byte (0xC2 on a UTF-8 system), not a pound sign.
If you want to handle Unicode correctly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++ and std::wstring and friends (or, with C++11 and later, std::u16string and std::u32string). But what you are doing is not real Unicode handling and will not work as you expect in the long run.
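To make the "bytes, not characters" point concrete, here is a minimal sketch of my own (assuming the input byte is a Latin-1 code point such as 0xA3 and that the compiler's execution charset is UTF-8) that builds the UTF-8 bytes for the pound sign by hand:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char ux = 0xA3;          /* Latin-1 code point for the pound sign */
    char out[3];

    /* Code points U+0080..U+07FF encode as two UTF-8 bytes:
       110xxxxx 10xxxxxx */
    out[0] = (char)(0xC0 | (ux >> 6));
    out[1] = (char)(0x80 | (ux & 0x3F));
    out[2] = '\0';

    /* Should print 1 on a compiler whose execution charset is UTF-8,
       i.e. the same bytes the compiler put into the "\u00A3" literal. */
    printf("%d\n", strcmp(out, "\u00A3") == 0);
    printf("%s\n", out);              /* shows £ on a UTF-8 terminal */
    return 0;
}

The two bytes produced, 0xC2 0xA3, are exactly what ends up inside the "\u00A3" string in the question's program.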

How does WideCharToMultiByte deal with codepages?

When I execute the code below, why am I getting '?' for the first case? AFAIK, code page 932 supports line-drawing characters.
How does this API deal with code pages? AFAIK, it looks the character up in the code page and returns that character's position in the code page.
typedef struct dbcs {
    unsigned char HighByte;
    unsigned char LowByte;
} DBCS;

static DBCS set[5] = {0x25, 0x5D};
unsigned char array[2];

#include <windows.h>
#include <stdio.h>

int main()
{
    // printf("hello world");
    int str_size;
    LPCWSTR charpntr;
    LPSTR getcd;
    LPBOOL flg;
    int i;

    array[0] = set[0].LowByte;
    array[1] = set[0].HighByte;
    charpntr = &array;
    str_size = WideCharToMultiByte(932, 0, charpntr, 1, getcd, 2, NULL, NULL);
    printf(" value of %u", getcd);
    printf("number of bytes %d character is %s", str_size, getcd);
    printf("\n");

    array[0] = set[0].LowByte;
    array[1] = set[0].HighByte;
    charpntr = &array;
    str_size = WideCharToMultiByte(437, 0, charpntr, 1, getcd, 2, NULL, NULL);
    printf(" value of %u", getcd);
    printf("number of bytes %d character is %s", str_size, getcd);
    printf("\n");
}
Result of execution in CodeBlocks:
Windows code page 932 is not a simple thing, as it uses multibyte characters.
I have no Windows here, so I have been experimenting with the encoding of the character you are using in Python 3, in a UTF-8 terminal: it works fine with cp437 and UTF-8, but Python refuses to encode the character to what it calls "cp932", or to any of the aliases listed in the Wikipedia article:
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
It may be a fault in Python's internal Unicode tables (fetched directly from the Unicode consortium), or possibly this code page doesn't map this character at all.
Anyway, there are problems in your code: one is that you never initialize getcd. Reading the docs for WideCharToMultiByte(), one sees that it must point to an actual output buffer (not just a dangling LPSTR), so you have to allocate a proper return buffer there.
So, try putting the getcd declaration as:
char getcd[6] = {0};
That should give you enough space for even the widest characters you experiment with, and leaves room for a \x00 string terminator.
And another thing: if these line-drawing characters are present in CP932, they are definitely multibyte, so the output buffer has to be big enough for them. The "2" after getcd is the cbMultiByte parameter, i.e. the size of that output buffer; with the getcd array suggested above you can simply pass sizeof(getcd). (The "1" after charpntr is cchWideChar, the number of wide characters to convert, which is fine for a single character.) If no other error kicks in, and the character exists in cp932, fixing the buffer handling alone might resolve your issue.
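For reference, a minimal sketch of my own showing the same call with a proper UTF-16 source string and a real output buffer (this assumes the character in question is U+255D, one of the box-drawing characters; whether CP932 actually maps it is exactly what the lpUsedDefaultChar flag will tell you):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t wide[] = L"\x255D";   /* U+255D, box-drawing character */
    char out[8] = {0};                  /* real buffer, not an uninitialized pointer */
    BOOL used_default = FALSE;

    int n = WideCharToMultiByte(932, 0, wide, 1,
                                out, (int)(sizeof(out) - 1),
                                NULL, &used_default);

    /* n is the number of bytes written; 0 means failure (check GetLastError()).
       used_default == TRUE means the character has no mapping in CP932 and
       was replaced by the default character, typically '?'. */
    printf("bytes: %d, usedDefault: %d, value: %02X %02X\n",
           n, (int)used_default,
           (unsigned)(unsigned char)out[0], (unsigned)(unsigned char)out[1]);
    return 0;
}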

How does Perl store integers in-memory?

say pack "A*", "asdf"; # Prints "asdf"
say pack "s", 0x41 * 256 + 0x42; # Prints "BA" (0x41 = 'A', 0x42 = 'B')
The first line makes sense: you're taking an ASCII-encoded string and packing it into a string as an ASCII string. In the second line, the packed form is "\x42\x41" because of the little-endianness of short integers on my machine.
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number, since that's how (I assume) Perl stores numbers, as little-endian sequence of bytes. Is there a way to do so without unpacking it? I'm trying to get the correct mental model for the thing that pack() returns.
For instance, in C, I can do this:
#include <stdio.h>

int main(void) {
    char c[2];
    short *x = (short *)c;  /* reinterpret the two bytes as a short */
    c[0] = 0x42;
    c[1] = 0x41;
    printf("%d\n", *x);     // Prints 16706 == 0x41 * 256 + 0x42
    return 0;
}
If you're really interested in how Perl stores data internally, I'd recommend PerlGuts Illustrated. But usually, you don't have to care about stuff like that because Perl doesn't give you access to such low-level details. These internals are only important if you're writing XS extensions in C.
If you want to "cast" a two-byte string to a C short, you can use the unpack function like this:
$ perl -le 'print unpack("s", "BA")'
16706
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number,
You need to unpack it first.
To be able to use it as a number in C, you need
char* packed = "\x42\x41";
int16_t int16;
memcpy(&int16, packed, sizeof(int16_t));
To be able to use it as a number in Perl, you need
my $packed = "\x42\x41";
my $num = unpack('s', $packed);
which is basically
use Inline C => <<'__EOI__';
SV* unpack_s(SV* sv) {
    STRLEN len;
    char* buf;
    int16_t int16;

    SvGETMAGIC(sv);
    buf = SvPVbyte(sv, len);
    if (len != sizeof(int16_t))
        croak("usage");

    Copy(buf, &int16, 1, int16_t);
    return newSViv(int16);
}
__EOI__

my $packed = "\x42\x41";
my $num = unpack_s($packed);
since that's how (I assume) perl stores numbers, as little-endian sequence of bytes.
Perl stores numbers in one of following three fields of a scalar:
IV, a signed integer of size perl -V:ivsize (in bytes).
UV, an unsigned integer of size perl -V:uvsize (in bytes). (ivsize=uvsize)
NV, a floating point number of size perl -V:nvsize (in bytes).
In all cases, native endianness is used.
I'm trying to get the correct mental model for the thing that pack() returns.
pack is used to construct "binary data" for interfacing with external APIs.
I see pack as a serialization function. It takes Perl values as input and outputs a serialized form. The fact that the output serialized form happens to be a Perl byte string is more of an implementation detail than a core piece of functionality.
As such, all you're really expected to do with the resulting string is feed it to unpack, though the serialized form is convenient for moving data between processes, hosts, even planets.
If you're interested in treating it as a number instead, consider using vec (note that for 16-bit elements vec reads the bytes big-endian, hence the different value):
say vec "BA", 0, 16; # prints 16961
To take a closer look at the string's internal representation, take a look at Devel::Peek, though you're not going to see anything surprising with a pure ASCII string.
use Devel::Peek;
Dump "BA";

SV = PV(0xb42f80) at 0xb56300
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0xb60cc0 "BA"\0
  CUR = 2
  LEN = 16

Why can't _printf_l print a multibyte string in the Chinese locale?

I am using Windows 7, VS2008 to test following code:
wchar_t *pWCBuffer = L"你好,世界"; // some Chinese character
char *pMBBuffer = (char *)malloc( BUFFER_SIZE );
_locale_t locChinese = _create_locale(LC_CTYPE, "chs");
_wcstombs_l(pMBBuffer, pWCBuffer, BUFFER_SIZE, locChinese );
_printf_l("Multibyte character: %s\n\n", locChinese, pMBBuffer );
I convert a wide string to a multibyte string and then print it out using the Chinese locale, but the printed string is not right; it is something weird like: ─π║├ú¼╩└╜τ
How can I print the multibyte string correctly?
This is not an absolute answer, because Unicode on different platforms can be tricky. But if your Windows 7 is an English version, you might want to try the PowerShell ISE to see the output. I use that to print out Unicode when writing programs in Ruby too.
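One likely explanation (my own observation, not confirmed in the thread): the conversion to CP936 succeeds, but the console is still interpreting the bytes in its default OEM code page, which is exactly how 你好,世界 in GBK comes out as ─π║├ú¼╩└╜τ. A minimal sketch, assuming a Windows console whose font can display Chinese and a source file saved in an encoding VS2008 accepts, that switches the console output code page before printing:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

#define BUFFER_SIZE 100

int main(void)
{
    const wchar_t *pWCBuffer = L"你好,世界";
    char *pMBBuffer = (char *)malloc(BUFFER_SIZE);

    /* Convert the UTF-16 string to the CP936 multibyte encoding. */
    _locale_t locChinese = _create_locale(LC_CTYPE, "chs");
    _wcstombs_l(pMBBuffer, pWCBuffer, BUFFER_SIZE, locChinese);

    /* Tell the console to interpret the bytes we are about to write
       as code page 936 instead of the default OEM code page. */
    SetConsoleOutputCP(936);

    _printf_l("Multibyte string: %s\n", locChinese, pMBBuffer);

    _free_locale(locChinese);
    free(pMBBuffer);
    return 0;
}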

Objective-C Decimal to Base 16 Hex conversion

Does anyone have a code snippet or a class that will take a long long and turn it into a 16-character hex string?
I'm looking to turn data like this
long long decimalRepresentation = 1719886131591410351;
and turn it into this
//Base 16 Hex Output: 17DE435307A07300
The %x format specifier doesn't want to work for me:
NSLog(@"Hex: %x", decimalRepresentation);
//console : "Hex: 7a072af"
As you can see that's not even close. Any help is truly appreciated!
%x prints an unsigned integer in hexadecimal representation, and sizeof(long long) != sizeof(unsigned). See e.g. "Data Type Size and Alignment" in the 64-bit transitioning guide.
Use the ll length modifier (that's two lowercase Ls) to get the desired output:
NSLog(@"%llx", myLongLong);