Create Unicode from a hex number in C++ - unicode

My objective is to take a character which represents to UK pound symbol and convert it to it's unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>
int main()
{
char x = 163;
unsigned char ux = x;
const char *str = "\u00A3";
printf("x: %d\n", x);
printf("ux: %d %x\n", ux, ux);
printf("str: %s\n", str);
return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string representing the unicode UK pound representation: "\u00A3"

What exactly is your question? Anyway, you say you're writing C++, but you're using char* and printf and stdlib.h so you're really writing C, and base C does not support unicode. Remember that a char in C is not a "character" it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you printf the "\u00A3" string in your sample program, you are not printing a unicode character, you are actually printing those literal bytes, and your terminal is helping you out and interpreting them as a unicode character. The fact that it correctly prints the £ character is just coincidence. You can see this for yourself. If you printf str[0] in your sample program you should just see the "\" character.
If you want to use unicode correctly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++11 or newer and use std::wstring and friends. But what you are doing is not real unicode and will not work as you expect in the long run.

Related

How does WideCharToMultiByte deal with codepages?

When I execute the below code, why am I getting '?' for the first case? AFAIK, codepage 932 supports line draw characters.
How does this API deal with codepages? AFAIK, it searches and maps the character in the codepage, then returns the position of the character from the codepage.
typedef struct dbcs {
unsigned char HighByte;
unsigned char LowByte;
} DBCS;
static DBCS set[5] = {0x25,0x5D};
unsigned char array[2];
#include <windows.h>
#include <stdio.h>
int main()
{
// printf("hello world");
int str_size;
LPCWSTR charpntr;
LPSTR getcd;
LPBOOL flg;
int i ;
array[0] = set[0].LowByte;
array[1] = set[0].HighByte;
charpntr = &array;
str_size = WideCharToMultiByte(932, 0, charpntr, 1, getcd, 2, NULL, NULL);
printf(" value of %u", getcd);
printf("number of bytes %d character is %s", str_size, getcd);
printf("\n");
array[0] = set[0].LowByte;
array[1] = set[0].HighByte;
charpntr = &array;
str_size = WideCharToMultiByte(437, 0, charpntr, 1, getcd, 2, NULL, NULL);
printf(" value of %u", getcd);
printf("number of bytes %d character is %s", str_size, getcd);
printf("\n");
}
Result of execution in CodeBlocks:
Windows codepage 932 is not a simple thing - as it uses multibyte characters.
I have no Windows here, so I have been experimenting with the encoding of the character you are using in Python3, in an UTF-8 terminal: it works fine with cp437 and UTF-8, but Python refuses to encode the character to what it calls "cp932", or any of its aliases listed in the Wikipedia article:
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
It may be a fault in Python's internal Unicode tables (fetched directly from the Unicode consortium), or possibly, this codepage don't map this character at all.
Anyway, there are problems in your code: one is that you never initialize getcd. Reading the docs for WideCharToMultiByte(), one see it should not be set to NULL, so you have to have the proper return buffer allocated there.
So, try putting the getcd declaration as:
char getcd[6]={};
That should give you enough space for even the widest characters you experiment with, and include a string \x00 terminator.
And another thing is that if these line drawing characters are present in CP932, they are definitely multibyte - thus the cbMultiByte parameter for the call (the "1" after charptr) should be set to at least 2. If no other error kicks in, and the char exists in cp932, this alone might fix your issue.

Dealing with u"strings" in gdb

Why can't I cast u"string" to wchar_t* in gdb?
(gdb) print (wchar_t*)L"abc"
$60 = 0x568ae3b0 L"abc"
(gdb) print (wchar_t*)u"abc"
$61 = 0x567c5078 L"\x620061c\020I\x56640948\x567c50d0\x567c4f80\x567c4f30"
u"string" is an array of unsigned short which is almost the same as array of wchars.
wchar_t is NOT almost the same as unsigned short on linux, it's almost the same as unsinged. On linux, wchar_t is for UTF-32, not UTF-16. You'll have to use a Unicode library to convert (probably ICU) .
You can cast it to char32_t* though.

How does Perl store integers in-memory?

say pack "A*", "asdf"; # Prints "asdf"
say pack "s", 0x41 * 256 + 0x42; # Prints "BA" (0x41 = 'A', 0x42 = 'B')
The first line makes sense: you're taking an ASCII encoded string, packing it into a string as an ASCII string. In the second line, the packed form is "\x42\x41" because of the little endian-ness of short integers on my machine.
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number, since that's how (I assume) Perl stores numbers, as little-endian sequence of bytes. Is there a way to do so without unpacking it? I'm trying to get the correct mental model for the thing that pack() returns.
For instance, in C, I can do this:
#include <stdio.h>
int main(void) {
char c[2];
short * x = c;
c[0] = 0x42;
c[1] = 0x41;
printf("%d\n", *x); // Prints 16706 == 0x41 * 256 + 0x42
return 0;
}
If you're really interested in how Perl stores data internally, I'd recommend PerlGuts Illustrated. But usually, you don't have to care about stuff like that because Perl doesn't give you access to such low-level details. These internals are only important if you're writing XS extensions in C.
If you want to "cast" a two-byte string to a C short, you can use the unpack function like this:
$ perl -le 'print unpack("s", "BA")'
16706
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number,
You need to unpack it first.
To be able to use it as a number in C, you need
char* packed = "\x42\x41";
int16_t int16;
memcpy(&int16, packed, sizeof(int16_t));
To be able to use it as a number in Perl, you need
my $packed = "\x42\x41";
my $num = unpack('s', $packed);
which is basically
use Inline C => <<'__EOI__';
SV* unpack_s(SV* sv) {
STRLEN len;
char* buf;
int16_t int16;
SvGETMAGIC(sv);
buf = SvPVbyte(sv, len);
if (len != sizeof(int16_t))
croak("usage");
Copy(buf, &int16, 1, int16_t);
return newSViv(int16);
}
__EOI__
my $packed = "\x42\x41";
my $num = unpack_s($packed);
since that's how (I assume) perl stores numbers, as little-endian sequence of bytes.
Perl stores numbers in one of following three fields of a scalar:
IV, a signed integer of size perl -V:ivsize (in bytes).
UV, an unsigned integer of size perl -V:uvsize (in bytes). (ivsize=uvsize)
NV, a floating point numbers of size perl -V:nvsize (in bytes).
In all case, native endianness is used.
I'm trying to get the correct mental model for the thing that pack() returns.
pack is used to construct "binary data" for interfacing with external APIs.
I see pack as a serialization function. It takes as input Perl values, and outputs a serialized form. The fact the output serialized form happens to be a Perl bytestring is more of an implementation detail than a core functionality.
As such, all you're really expected to do with the resulting string is feed it to unpack, though the serialized form is convenient to have it move around processes, hosts, planets.
If you're interested in serializing it to a number instead, consider using vec:
say vec "BA", 0, 16; # prints 16961
To take a closer look at the string's internal representation, take a look at Devel::Peek, though you're not going to see anything surprising with a pure ASCII string.
use Devel::Peek;
Dump "BA";
SV = PV(0xb42f80) at 0xb56300
REFCNT = 1
FLAGS = (POK,READONLY,pPOK)
PV = 0xb60cc0 "BA"\0
CUR = 2
LEN = 16

Objective-C character encoding - Change char to int, and back

Simple task: I need to convert two characters to two numbers, add them together and change that back to an character.
What I have got: (works perfect in Java - where encoding is handled for you, I guess):
int myChar1 = (int)([myText1 characterAtIndex:i]);
int myChar2 = (int)([myText2 characterAtIndex:keyCurrent]);
int newChar = (myChar1 + myChar2);
//NSLog(#"Int's %d, %d, %d", textChar, keyChar, newChar);
char newC = ((char) newChar);
NSString *tmp1 = [NSString stringWithFormat:#"%c", newC];
NSString *tmp2 = [NSString stringWithFormat:#"%#", newString];
newString = [NSString stringWithFormat:#"%#%#", tmp2, tmp1]; //Adding these char's in a string
The algorithm is perfect, but now I can't figure out how to implement encoding properties. I would like to do everything in UTF-8 but have no idea how to get a char's UTF-8 value, for instance. And if I've got it, how to change that value back to an char.
The NSLog in the code outputs the correct values. But when I try to do the opposite with the algorithm (I.e. - the values) then it goes wrong. It gets the wrong character value for weird/odd characters.
NSString works with unichar characters that are 2 bytes long (16 bits). Char is one byte long so you can only store code point from U+0000 to U+00FF (i.e. Basic Latin and Latin-1 Supplement).
You should do you math on unichar values then use +[NSString stringWithCharacters:length:] to create the string representation.
But there is still an issue with that solution. You code may generate code points between U+D800 and U+DFFF that aren't valid Unicode characters. The standard reserves them to encode code points from U+10000 to U+10FFFF in UTF-16 by pairs of 16-bit code units. In such a case, your string would be ill-formed and could neither be displayed nor converted in UTF8.
Also, the temporary variable tmp2 is useless and you should not create a new newString as you concatenate the string but rather use a NSMutableString.
I am assuming that your strings are NSStrings consisting of numerals which represent a number. If that is the case, you could try the following:
Include the following headers:
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
Then use the following code:
// convert NSString to UTF8 string
const char * utf8String1 = [myText1 UTF8String];
const char * utf8String2 = [myText2 UTF8String];
// convert UTF8 string into long integers
long num1 = strtol(utf8String1, NULL 0);
long num2 = strtol(utf8String2, NULL 0);
// perform calculations
long calc = num1 - num2;
// convert calculated value back into NSString
NSString * calcText = [[NSString alloc] initWithFormat:#"%li" calc];
// convert calculated value back into UTF8 string
char calcUTF8[64];
snprintf(calcUTF8, 64, "%li", calc);
// log results
NSLog(#"calcText: %#", calcText);
NSLog(#"calcUTF8: %s", calcUTF8);
Not sure if this is what you meant, but from what I understood, you wanted to create a NSString with the UTF-8 string encoding from a char?
If that's what you want, maybe you can use the initWithCString:encoding: method in NSString.

Objective-C Decimal to Base 16 Hex conversion

Does anyone have a code snippet or a class that will take a long long and turn it into a 16 byte Hex string?
I'm looking to turn data like this
long long decimalRepresentation = 1719886131591410351;
and turn it into this
//Base 16 Hex Output: 17DE435307A07300
The %x operator doesn't want to work for me
NSLog(#"Hex: %x",decimalRepresentation);
//console : "Hex: 7a072af"
As you can see that's not even close. Any help is truly appreciated!
%x prints an unsigned integer in hexadecimal representation and sizeof(long long) != sizeof(unsigned). See e.g. "Data Type Size and Alignment" in the 64bit transitioning guide.
Use the ll specifier (thats two lower-case L) to get the desired output:
NSLog(#"%llx", myLongLong);