How does WideCharToMultiByte deal with codepages? - unicode

When I execute the code below, why am I getting '?' for the first case? AFAIK, codepage 932 supports line-drawing characters.
How does this API deal with codepages? As I understand it, it looks the character up in the codepage's mapping table and returns that character's byte value(s) in the codepage.
typedef struct dbcs {
    unsigned char HighByte;
    unsigned char LowByte;
} DBCS;

static DBCS set[5] = {0x25, 0x5D};
unsigned char array[2];

#include <windows.h>
#include <stdio.h>

int main()
{
    // printf("hello world");
    int str_size;
    LPCWSTR charpntr;
    LPSTR getcd;
    LPBOOL flg;
    int i;

    array[0] = set[0].LowByte;
    array[1] = set[0].HighByte;
    charpntr = &array;
    str_size = WideCharToMultiByte(932, 0, charpntr, 1, getcd, 2, NULL, NULL);
    printf(" value of %u", getcd);
    printf("number of bytes %d character is %s", str_size, getcd);
    printf("\n");

    array[0] = set[0].LowByte;
    array[1] = set[0].HighByte;
    charpntr = &array;
    str_size = WideCharToMultiByte(437, 0, charpntr, 1, getcd, 2, NULL, NULL);
    printf(" value of %u", getcd);
    printf("number of bytes %d character is %s", str_size, getcd);
    printf("\n");
}
Result of execution in CodeBlocks:

Windows codepage 932 is not a simple thing, as it uses multibyte characters.
I have no Windows machine here, so I have been experimenting with the encoding of the character you are using in Python 3, in a UTF-8 terminal: it encodes fine to cp437 and to UTF-8, but Python refuses to encode the character to what it calls "cp932", or to any of the aliases listed in the Wikipedia article:
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
It may be a fault in Python's internal Unicode tables (fetched directly from the Unicode Consortium), or possibly this codepage does not map this character at all.
Anyway, there are problems in your code. One is that you never initialize getcd: reading the docs for WideCharToMultiByte(), one sees that it should not be NULL, so you have to have a proper output buffer allocated there.
So, try declaring getcd as:
char getcd[6] = {0};
That should give you enough space for even the widest characters you are experimenting with, plus room for the terminating '\0'.
Another thing: if these line-drawing characters are present in CP932, they are definitely multibyte, so the output buffer whose size you pass in cbMultiByte (the "2" after getcd) must hold at least 2 bytes, plus one more if you want to print the result as a NUL-terminated string. (The "1" after charpntr is cchWideChar, the number of wide characters in the input, and 1 is correct there for a single character.) If no other error kicks in, and the character really exists in cp932, fixing the buffer handling alone might fix your issue.
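Putting those fixes together, here is a minimal sketch of what the call could look like with a real wide-character string and a zero-initialized output buffer. It assumes the character you are after is U+255D, which is what the two bytes 0x5D 0x25 form as a little-endian wchar_t; lpUsedDefaultChar then tells you whether the codepage really maps it or had to fall back to the default character ('?').

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* U+255D (box drawing, double up and left) as a proper wide-character
       string, instead of assembling bytes by hand. */
    const wchar_t wc[] = L"\x255D";
    char getcd[8] = {0};          /* zero-initialized output buffer */
    BOOL usedDefault = FALSE;
    int str_size, i;

    str_size = WideCharToMultiByte(932, 0, wc, 1, getcd, (int)sizeof(getcd),
                                   NULL, &usedDefault);
    printf("cp932: %d byte(s):", str_size);
    for (i = 0; i < str_size; i++)
        printf(" %02X", (unsigned char)getcd[i]);
    printf("  (default char used: %d)\n", usedDefault);

    usedDefault = FALSE;
    str_size = WideCharToMultiByte(437, 0, wc, 1, getcd, (int)sizeof(getcd),
                                   NULL, &usedDefault);
    printf("cp437: %d byte(s):", str_size);
    for (i = 0; i < str_size; i++)
        printf(" %02X", (unsigned char)getcd[i]);
    printf("  (default char used: %d)\n", usedDefault);
    return 0;
}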

Related

Create Unicode from a hex number in C++

My objective is to take a character which represents the UK pound symbol and convert it to its Unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>
int main()
{
    char x = 163;
    unsigned char ux = x;
    const char *str = "\u00A3";
    printf("x: %d\n", x);
    printf("ux: %d %x\n", ux, ux);
    printf("str: %s\n", str);
    return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string representing the unicode UK pound representation: "\u00A3"
What exactly is your question? Anyway, you say you're writing C++, but you're using char*, printf, and stdio.h, so you're really writing C, and plain C has no built-in notion of Unicode. Remember that a char in C is not a "character", it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you printf the "\u00A3" string in your sample program, you are not printing a Unicode character as such: the compiler stored that string as a sequence of bytes (its UTF-8 encoding, 0xC2 0xA3, on a typical setup), and your terminal is helping you out by interpreting those bytes as the £ character. It prints correctly only because your compiler's execution character set and your terminal happen to agree on UTF-8. You can see this for yourself: print str[0] on its own and you will get just the first byte of that sequence, not a whole character.
If you want to use unicode correctly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++11 or newer and use std::wstring and friends. But what you are doing is not real unicode and will not work as you expect in the long run.
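To illustrate the point about bytes: a code point in the range U+0080..U+07FF becomes exactly two UTF-8 bytes, which you can compute by hand. A minimal sketch (the encode_utf8 helper is just for illustration, not a substitute for a proper Unicode library):

#include <stdio.h>

/* Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes. */
static void encode_utf8(unsigned int cp, char out[3])
{
    out[0] = (char)(0xC0 | (cp >> 6));
    out[1] = (char)(0x80 | (cp & 0x3F));
    out[2] = '\0';
}

int main(void)
{
    char pound[3];
    encode_utf8(0xA3, pound);      /* U+00A3 -> 0xC2 0xA3 */
    printf("%s\n", pound);         /* prints £ on a UTF-8 terminal */
    return 0;
}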

Iterate through alphabet in Swift explanation

I accidentally wrote this simple code to print the alphabet in the terminal:
var alpha: Int = 97
while (alpha <= 122) {
    write(1, &alpha, 1)
    alpha += 1
}
write(1, "\n", 1)
// I'm using the write() function from C, to avoid a newline after each symbol
And I've got this output:
abcdefghijklmnopqrstuvwxyz
Program ended with exit code: 0
So, here is the question: Why does it work?
To my mind, it should display a row of numbers, because an integer variable is being used. In C it would be a char variable, meaning it refers to a character at some position in the ASCII table. Then:
char alpha = 97;
would hold the code for the letter 'a', and by incrementing the alpha variable in a loop we would display each ASCII character up to code 122.
In Swift, though, I couldn't assign an integer to a Character or String variable. I used an Int and then declared several more variables to hold UnicodeScalar values, but by accident I found out that when I call write I can point to my integer rather than the new UnicodeScalar variable, and it still works! The code is very short and readable, but I don't completely understand how it works, or why it works at all.
Has anyone had such situation?
Why does it work?
This works “by chance” because the integer is stored in little-endian byte order.
The integer 97 is stored in memory as 8 bytes:
0x61 0x00 0x00 0x00 0x00 0x00 0x00 0x00
and in write(1, &alpha, 1) the address of that memory location is passed to the write system call. Since the last parameter (nbyte) is 1, only the first byte at that memory address is written to the standard output: that is 0x61, or 97, the ASCII code of the letter a.
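For comparison, here is the same trick written out in C; this sketch assumes a little-endian machine and the POSIX write() the question refers to:

#include <unistd.h>

int main(void)
{
    int alpha = 97;             /* stored as 0x61 0x00 0x00 0x00 on a little-endian machine */
    while (alpha <= 122) {
        write(1, &alpha, 1);    /* emit only the lowest byte: the ASCII letter */
        alpha += 1;
    }
    write(1, "\n", 1);
    return 0;
}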
In Swift though, I couldn't assign an integer to Character or String type variable.
The Swift equivalent of char is CChar, a type alias for Int8:
var alpha: CChar = 97
Here is a solution which does not rely on the memory layout and works for non-ASCII characters as well:
let first: UnicodeScalar = "α"
let last: UnicodeScalar = "ω"
for v in first.value...last.value {
    if let c = UnicodeScalar(v) {
        print(c, terminator: "")
    }
}
print()
// αβγδεζηθικλμνξοπρςστυφχψω

USART format data type

I would like to ask how I can send data via USART as an integer, I mean a variable which stores a number. I am able to send a char variable, but the terminal shows me the ASCII representation of that number, and I need to see the number itself.
I edited the code as shown below, but it gives me the error: "conflicting types for 'USART_Transmit'".
#include <avr/io.h>
#include <util/delay.h>

#define FOSC 8000000 // Clock Speed
#define BAUD 9600
#define MYUBRR FOSC/16/BAUD-1

void USART_Init( unsigned int ubrr );
void USART_Transmit( unsigned char data );
unsigned char USART_Receive( void );

int main( void )
{
    unsigned char str[5] = "serus";
    unsigned char strLenght = 5;
    unsigned int i = 47;

    USART_Init ( MYUBRR );
    //USART_Transmit('S' );

    while(1)
    {
        /*USART_Transmit( str[i++] );
        if(i >= strLenght)
            i = 0;*/
        USART_Transmit(i);
        _delay_ms(250);
    }
    return(0);
}

void USART_Init( unsigned int ubrr )
{
    /* Set baud rate */
    UBRR0H = (unsigned char)(ubrr>>8);
    UBRR0L = (unsigned char)ubrr;
    /* Enable receiver and transmitter */
    UCSR0B = (1<<RXEN)|(1<<TXEN);
    /* Set frame format: 8data, 2stop bit */
    UCSR0C = (1<<USBS)|(3<<UCSZ0);
}

void USART_Transmit( unsigned int data )
{
    /* Wait for empty transmit buffer */
    while ( !( UCSR0A & (1<<UDRE)) )
        ;
    /* Put data into buffer, sends the data */
    UDR0 = data;
}

unsigned char USART_Receive( void )
{
    /* Wait for data to be received */
    while ( !(UCSR0A & (1<<RXC)) )
        ;
    /* Get and return received data from buffer */
    return UDR0;
}
Do you have any idea what is wrong?
PS: I hope you understand what I'm trying to explain.
I like to use sprintf to format numbers for serial.
At the top of your file, put:
#include <stdio.h>
Then write some code in a function like this:
char buffer[16];
sprintf(buffer, "%d\n", number);
char * p = buffer;
while (*p) { USART_Transmit(*p++); }
The first two lines construct a null-terminated string in the buffer. The last two lines are a simple loop to send all the characters in the buffer. I put a newline in the format string to make it easier to see where one number ends and the other begins.
Technically a UART serial connection is just a stream of bits divided into symbols of a certain length. It's perfectly possible to send the data in raw form, but this comes with a number of issues that must be addressed:
How to identify the start and end of a transmission unambiguously?
How to deal with endianess on either side of the connection?
How to serialize and deserialize the data in a robust way?
How to deal with transmission errors?
At the end of the day it turns out that you can never resolve all the ambiguities, and binary data must somehow be escaped or otherwise encoded to prevent misinterpretation.
As far as delimiting transmissions is concerned, that was addressed by the creators of the ASCII standard through the set of non-printable control characters. Of particular interest to you are these two special control characters:
STX / 0x02 / Start of Text
ETX / 0x03 / End of Text
There are also other control characters which form a pretty complete set for building data structures; you don't need JSON or XML for this. However, ASCII itself does not support the transmission of arbitrary binary data; the standard staple for that task has long been, and still is, base64 encoding. Use that if you really need to transmit arbitrary binary bytes.
Numbers you probably should not transmit in binary form at all; just send the digits as text. If you use octal or hexadecimal digits, parsing them back into integers is very simple (it boils down to a bit of masking and shifting).
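As a sketch of that approach (the framing constants, buffer size, and the USART_SendNumber helper name are just illustrative), one could send a number as decimal ASCII digits between STX and ETX, reusing the USART_Transmit() from the question:

#include <stdio.h>                  /* snprintf */

#define FRAME_STX 0x02              /* Start of Text */
#define FRAME_ETX 0x03              /* End of Text   */

void USART_Transmit(unsigned char data);   /* as declared in the question */

/* Send an unsigned integer as decimal ASCII digits, framed by STX/ETX. */
void USART_SendNumber(unsigned int value)
{
    char digits[11];                /* enough for a 32-bit value plus the NUL */
    int len = snprintf(digits, sizeof(digits), "%u", value);

    USART_Transmit(FRAME_STX);
    for (int i = 0; i < len; i++)
        USART_Transmit((unsigned char)digits[i]);
    USART_Transmit(FRAME_ETX);
}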

CoreFoundation UTF-16 un-paired surrogate

I'm trying to convert from UTF-16 to, say, UTF-32 using the Apple Core Foundation API:
cfString = CFStringCreateWithBytes(nullptr, str, strLen, kCFStringEncodingUTF16, FALSE);
auto range = CFRangeMake(0, CFStringGetLength(cfString));
CFStringGetBytes(cfString, range, kCFStringEncodingUTF32, 0, false, buffer, bufferSize, usedSize);
Most of the time that works, until the input buffer contains an unpaired surrogate, say U+DF9F; Core Foundation then simply returns the output without the ill-formed character.
So, to be at least somewhat Unicode compliant, I have to detect that situation manually and, following the Unicode documentation, substitute the standard replacement character U+FFFD: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
The same happens with other encodings: for example, with a stray 0x80 byte in the middle of UTF-8 input, CFStringCreateWithBytes always returns nullptr instead of pointing out the invalid character.
Is that the expected behaviour of Core Foundation (or undefined), or is there maybe a way to make CF report malformed input somehow?
UPDATE:
I did exactly the following:
UInt8 str[] = {0x41, 0x00, 0x9f, 0xdf}; // corresponding to the letter A plus an unpaired surrogate
CFStringRef mystr = CFStringCreateWithBytes(nullptr, str, 4, kCFStringEncodingUTF16, false);
After that, mystr has a length of 2 characters according to CFStringGetLength(), so it looks like the invalid character gets accepted.
std::vector<char> str(7);
CFStringGetCString(mystr, &*str.begin(), str.size(), kCFStringEncodingUTF8);
That gives me false, so no conversion to UTF-8 is possible, and the Xcode debug watches show nothing for the string mystr.
So the output is empty both as UTF-8 and as a C string. After that I checked the conversion to UTF-32 with the get-bytes routine:
result = CFStringGetBytes(s, range, kCFStringEncodingUTF32BE, 0, false, buffer, bufferSize, usedSize);
That gives me usedSize=4, result=1, and the output contains 0x0041, so only the 'A' was converted. That is why I think no substitution happened for the malformed surrogate.
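For reference, the manual pre-pass described above could look roughly like this; it is a sketch that assumes the buffer already holds native-endian UTF-16 code units (as in the example), and the sanitize_utf16 helper name is just for illustration. It replaces unpaired surrogates with U+FFFD before the data is handed to CFStringCreateWithBytes():

#include <stdint.h>
#include <stddef.h>

/* Replace unpaired surrogates with U+FFFD in place. */
static void sanitize_utf16(uint16_t *units, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {            /* high (leading) surrogate */
            if (i + 1 < count &&
                units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
                i++;                                  /* valid pair, skip the low half */
            } else {
                units[i] = 0xFFFD;                    /* unpaired high surrogate */
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            units[i] = 0xFFFD;                        /* stray low (trailing) surrogate */
        }
    }
}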

Objective-C character encoding - Change char to int, and back

Simple task: I need to convert two characters to two numbers, add them together and change that back to a character.
What I have got (works perfectly in Java, where encoding is handled for you, I guess):
int myChar1 = (int)([myText1 characterAtIndex:i]);
int myChar2 = (int)([myText2 characterAtIndex:keyCurrent]);
int newChar = (myChar1 + myChar2);
//NSLog(@"Int's %d, %d, %d", textChar, keyChar, newChar);
char newC = ((char) newChar);
NSString *tmp1 = [NSString stringWithFormat:@"%c", newC];
NSString *tmp2 = [NSString stringWithFormat:@"%@", newString];
newString = [NSString stringWithFormat:@"%@%@", tmp2, tmp1]; // Adding these chars to a string
The algorithm is perfect, but now I can't figure out how to handle the encoding properly. I would like to do everything in UTF-8 but have no idea how to get a char's UTF-8 value, for instance, and once I have it, how to change that value back to a char.
The NSLog in the code outputs the correct values. But when I try to do the reverse of the algorithm (i.e. subtracting the values), it goes wrong: it gets the wrong character value for unusual characters.
NSString works with unichar characters, which are 2 bytes (16 bits) long. A char is one byte long, so it can only store code points from U+0000 to U+00FF (i.e. Basic Latin and Latin-1 Supplement).
You should do your math on unichar values and then use +[NSString stringWithCharacters:length:] to create the string representation.
But there is still an issue with that solution: your code may generate code points between U+D800 and U+DFFF, which aren't valid Unicode characters. The standard reserves them for encoding the code points from U+10000 to U+10FFFF in UTF-16 as pairs of 16-bit code units. In such a case your string would be ill-formed and could neither be displayed nor converted to UTF-8.
Also, the temporary variable tmp2 is useless, and instead of creating a new newString on every concatenation you should use an NSMutableString.
I am assuming that your strings are NSStrings consisting of numerals which represent a number. If that is the case, you could try the following:
Include the following headers:
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
Then use the following code:
// convert NSString to UTF8 string
const char * utf8String1 = [myText1 UTF8String];
const char * utf8String2 = [myText2 UTF8String];
// convert UTF8 string into long integers
long num1 = strtol(utf8String1, NULL, 0);
long num2 = strtol(utf8String2, NULL, 0);
// perform calculations
long calc = num1 - num2;
// convert calculated value back into NSString
NSString * calcText = [[NSString alloc] initWithFormat:@"%li", calc];
// convert calculated value back into UTF8 string
char calcUTF8[64];
snprintf(calcUTF8, 64, "%li", calc);
// log results
NSLog(@"calcText: %@", calcText);
NSLog(@"calcUTF8: %s", calcUTF8);
Not sure if this is what you meant, but from what I understood, you want to create an NSString with the UTF-8 string encoding from a char?
If that's what you want, maybe you can use the initWithCString:encoding: method of NSString.