CoreFoundation UTF-16 un-paired surrogate - unicode

I'm trying to convert from UTF-16 to, say, UTF-32 using the Apple Core Foundation API:
cfString = CFStringCreateWithBytes(nullptr, str, strLen, kCFStringEncodingUTF16, FALSE);
auto range = CFRangeMake(0, CFStringGetLength(cfString));
CFStringGetBytes(cfString, range, kCFStringEncodingUTF32, 0, false, buffer, bufferSize, usedsize);
Most of the time that works, until the input buffer contains an unpaired surrogate, say U+DF9F. Core Foundation then simply returns the output without the ill-formed characters.
So to be reasonably Unicode compliant, I have to detect that situation manually and follow the Unicode documentation to substitute U+FFFD as the standard replacement: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
The same goes for other encodings: with a stray 0x80 byte in the middle of UTF-8 input, CFStringCreateWithBytes always returns nullptr instead of pointing to the invalid character.
Is that expected behaviour or undefined behaviour in Core Foundation, or is there some way to tune CF to report malformed input?
UPDATE:
I did exactly the following:
UInt8 str[] = {0x41, 0x00, 0x9f, 0xdf}; // corresponds to Unicode 'A' followed by an unpaired surrogate
CFStringRef mystr = CFStringCreateWithBytes(nullptr, str, 4, kCFStringEncodingUTF16, false);
After that, mystr has a length of 2 characters according to CFStringGetLength(), so it looks like the invalid character gets processed.
std::vector<char> str(7);
CFStringGetCString(mystr, &*str.begin(), str.size(), kCFStringEncodingUTF8);
That returns false, so no conversion to UTF-8 is possible, and the Xcode debugger's watch window shows nothing for the string mystr.
So the output is empty for UTF-8 and for the C string. After that I checked the conversion to UTF-32 with the get-bytes routine:
result = CFStringGetBytes(s, range, kCFStringEncodingUTF32BE, 0, false, buffer, bufferSize, usedSize);
That gives me usedSize=4, result=1, and the output contains 0x0041, so only the A symbol was converted. That is why I think no substitution happened for the malformed surrogate.
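The manual substitution described in the question can be sketched independently of Core Foundation. The Python sketch below is only an illustration and says nothing about what CF does internally. The function names `decode_utf16le_lenient` and `to_utf32be` are made up for this example, and it assumes little-endian input, as in the byte array from the UPDATE. Every surrogate code unit that is not part of a well-formed pair is replaced with U+FFFD.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def decode_utf16le_lenient(data):
    """Return a list of code points; unpaired surrogates become U+FFFD."""
    even = len(data) - (len(data) % 2)
    units = [u for (u,) in struct.iter_unpack("<H", data[:even])]
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        next_is_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and next_is_low:
            # Well-formed pair: combine into one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone high or low surrogate: substitute instead of dropping it.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    if even != len(data):
        # A dangling odd byte is ill-formed as well.
        out.append(REPLACEMENT)
    return out


def to_utf32be(code_points):
    """Serialize code points as big-endian UTF-32 without a BOM."""
    return b"".join(struct.pack(">I", cp) for cp in code_points)


code_points = decode_utf16le_lenient(bytes([0x41, 0x00, 0x9F, 0xDF]))
print([hex(cp) for cp in code_points])  # ['0x41', '0xfffd']
print(to_utf32be(code_points).hex())    # 00000041 followed by 0000fffd
```

With the input from the UPDATE, this yields two code points instead of one, so the caller can see that something was substituted.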

Related

How to convert string in UTF-8 to ASCII ignoring errors and removing non ASCII characters

I am new to Scala.
Please advise how to convert strings in UTF-8 to ASCII, ignoring errors and removing non-ASCII characters from the output.
For example, how do I remove the non-ASCII character \uc382 from the result string "hello���", so that "hello" is printed in the output?
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had text in UTF-8 as bytes that is now in a String then it was converted.
If you have text in a String and you want it in ASCII as bytes, you can convert it later.
It seems that you just want to keep only the UTF-16 code units for the C0 Controls and Basic Latin code points. Fortunately, such code points take only one code unit each, so we can filter them directly without converting them to code points.
import java.nio.charset.StandardCharsets
"hello\uC382"
  .filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
  .getBytes(StandardCharsets.US_ASCII)
  .foreach(println)
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
import java.nio.CharBuffer
import java.nio.charset.{CodingErrorAction, StandardCharsets}
val encoder = StandardCharsets.ISO_8859_1
  .newEncoder()
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
val chars = CharBuffer.allocate(string.length())
  .put(string)
chars.rewind()
val buffer = encoder.encode(chars)
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes) // prints the array reference, not its contents
bytes.foreach(println)
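For comparison only, the same "ignore what cannot be encoded" idea can be sketched in Python, whose `str.encode` takes an `errors` argument that plays the role of `CodingErrorAction.IGNORE`. This is not a Scala replacement, just a short way to check which bytes survive. The variable names are made up for the example.

```python
text = "\u00f1hello\uc382"  # same sample string as above: n-tilde, hello, U+C382

# errors="ignore" silently drops characters the target charset cannot represent.
ascii_bytes = text.encode("ascii", errors="ignore")
latin1_bytes = text.encode("iso-8859-1", errors="ignore")

print(ascii_bytes)   # b'hello'
print(latin1_bytes)  # b'\xf1hello'
```

ASCII drops both non-ASCII characters, while ISO-8859-1 keeps the ñ as the single byte 0xF1 and drops only U+C382.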

I was wondering if someone could explain to me .decode and .encode in hashlib?

I understand that you have a hex string and perform SHA256 on it twice and then byte-swap the final hex string. The goal of this code is to find a Merkle Root by concatenating two transactions. I would like to understand what's going on in the background a bit more. What exactly are you decoding and encoding?
import hashlib
transaction_hex = "93a05cac6ae03dd55172534c53be0738a50257bb3be69fff2c7595d677ad53666e344634584d07b8d8bc017680f342bc6aad523da31bc2b19e1ec0921078e872"
transaction_bin = transaction_hex.decode('hex')
hash = hashlib.sha256(hashlib.sha256(transaction_bin).digest()).digest()
hash.encode('hex_codec')
'38805219c8ac7e9a96416d706dc1d8f638b12f46b94dfd1362b5d16cf62e68ff'
hash[::-1].encode('hex_codec')
'ff682ef66cd1b56213fd4db9462fb138f6d8c16d706d41969a7eacc819528038'
transaction_hex is a regular string of lowercase ASCII characters, and the decode() method with the 'hex' argument changes it to a (binary) string (or a bytes object in Python 3) with bytes 0x93 0xa0 etc. In C it would be an array of unsigned char of length 64 in this case.
This array/byte string of length 64 is then hashed with SHA256, and its result (another binary string of size 32) is hashed again. So hash is a string of length 32, or a bytes object of that length in Python 3. Then encode('hex_codec') is a synonym for encode('hex') in Python 2. Neither method call works on a bytes object in Python 3, where codecs.encode(hash, 'hex_codec') or simply hash.hex() does the same job. It outputs an ASCII (lowercase hex) string again, replacing each raw byte (which is just a small integer) with a two-character string that is its hexadecimal representation. So the final bit reverses the byte order of the double hash and outputs it as hexadecimal, in a form which I usually call "lowercase hex ASCII".
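As a rough Python 3 equivalent of the steps above, the following sketch uses `bytes.fromhex` in place of `decode('hex')` and `bytes.hex()` in place of `encode('hex_codec')`. The variable name `double_hash` is chosen for this example; everything else follows the question's code.

```python
import hashlib

transaction_hex = (
    "93a05cac6ae03dd55172534c53be0738a50257bb3be69fff2c7595d677ad5366"
    "6e344634584d07b8d8bc017680f342bc6aad523da31bc2b19e1ec0921078e872"
)

# Hex text -> raw bytes: 128 hex digits become 64 bytes.
transaction_bin = bytes.fromhex(transaction_hex)

# SHA-256 applied twice; each digest() is 32 raw bytes.
double_hash = hashlib.sha256(hashlib.sha256(transaction_bin).digest()).digest()

# Raw bytes -> lowercase hex text, first as computed, then byte-swapped.
print(double_hash.hex())
print(double_hash[::-1].hex())
```

Note that `[::-1]` is applied to the raw bytes before hex-encoding, so whole bytes are reversed and not individual hex digits.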

How does WideCharToMultiByte deal with codepages?

When I execute the below code, why am I getting '?' for the first case? AFAIK, codepage 932 supports line draw characters.
How does this API deal with codepages? AFAIK, it searches and maps the character in the codepage, then returns the position of the character from the codepage.
typedef struct dbcs {
unsigned char HighByte;
unsigned char LowByte;
} DBCS;
static DBCS set[5] = {0x25,0x5D};
unsigned char array[2];
#include <windows.h>
#include <stdio.h>
int main()
{
// printf("hello world");
int str_size;
LPCWSTR charpntr;
LPSTR getcd;
LPBOOL flg;
int i ;
array[0] = set[0].LowByte;
array[1] = set[0].HighByte;
charpntr = &array;
str_size = WideCharToMultiByte(932, 0, charpntr, 1, getcd, 2, NULL, NULL);
printf(" value of %u", getcd);
printf("number of bytes %d character is %s", str_size, getcd);
printf("\n");
array[0] = set[0].LowByte;
array[1] = set[0].HighByte;
charpntr = &array;
str_size = WideCharToMultiByte(437, 0, charpntr, 1, getcd, 2, NULL, NULL);
printf(" value of %u", getcd);
printf("number of bytes %d character is %s", str_size, getcd);
printf("\n");
}
Result of execution in CodeBlocks:
Windows codepage 932 is not a simple thing - as it uses multibyte characters.
I have no Windows here, so I have been experimenting with the encoding of the character you are using in Python3, in an UTF-8 terminal: it works fine with cp437 and UTF-8, but Python refuses to encode the character to what it calls "cp932", or any of its aliases listed in the Wikipedia article:
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
It may be a fault in Python's internal Unicode tables (fetched directly from the Unicode consortium), or possibly this codepage doesn't map this character at all.
Anyway, there are problems in your code: one is that you never initialize getcd. Reading the docs for WideCharToMultiByte(), one sees that it should not be NULL, so you need a properly allocated output buffer there.
So, try putting the getcd declaration as:
char getcd[6]={};
That should give you enough space for even the widest characters you experiment with, plus the terminating \x00 of the string.
And another thing is that if these line drawing characters are present in CP932, they are definitely multibyte, so the cbMultiByte parameter (the "2" after getcd) has to be at least 2, which your call already satisfies. The "1" after charpntr is cchWideChar, the number of wide characters in the input, and is correct for a single character. If no other error kicks in, and the character exists in CP932, fixing the buffer alone might resolve your issue.
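The experiment described in this answer can be reproduced with a short Python sketch. It rests on an assumption: Python's `cp932` and `cp437` codec tables are not guaranteed to match what Windows uses inside WideCharToMultiByte (Windows may, for instance, apply best-fit mappings), so treat the result as a hint only. The helper `try_encode` is made up for this example.

```python
# The two bytes from the question, low byte first, read as one UTF-16LE code unit.
ch = bytes([0x5D, 0x25]).decode("utf-16-le")
print(hex(ord(ch)))  # 0x255d


def try_encode(char, codec):
    """Return the encoded bytes, or None if the codec has no mapping."""
    try:
        return char.encode(codec)
    except UnicodeEncodeError:
        return None


cp437_bytes = try_encode(ch, "cp437")
cp932_bytes = try_encode(ch, "cp932")

# For contrast: a light box-drawing character, U+2500.
light_cp932 = try_encode("\u2500", "cp932")

print("cp437:", cp437_bytes)
print("cp932:", cp932_bytes)
print("cp932, U+2500:", light_cp932)
```

If the double-line character has no mapping in a codepage while a light box-drawing character does, that would be consistent with the '?' seen for codepage 932 in the question.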

Inserting ASCII symbols into a String (Swift)

I'm trying to insert a symbol with ASCII code 255 (Telnet IAC) into a String, but when converting the data back to utf8 I'm getting a different symbol:
var s = "\u{ff}"
print(s.utf8.count) // 2
try! s.write(toFile: "output.txt", atomically: true, encoding: .utf8)
The file contains C3 BF, not FF. I've also tried using
var s = "\(Character(UnicodeScalar(255)))"
but this produced the same result. How to escape it properly?
ASCII defines 128 characters from 0x00 to 0x7F. 0xFF (255) is not included.
In Unicode, U+00FF (in Swift, "\u{ff}") represents "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
And its UTF-8 representation is 0xC3 0xBF: in UTF-8, characters with code points from U+0080 to U+07FF are represented as two-byte sequences.
Also, you need to know that 0xFF is not a valid byte in any UTF-8 byte sequence, which means a UTF-8 text file can never contain a 0xFF byte.
If you want to output "\u{ff}" as a single-byte 0xFF, use ISO-8859-1 (aka ISO-Latin-1) instead:
try! s.write(toFile: "output.txt", atomically: true, encoding: .isoLatin1)
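The byte values quoted above are easy to double-check outside Swift. The following Python sketch is only a cross-check of the encodings, not a substitute for the Swift call, and its variable names are made up for this example.

```python
s = "\u00ff"  # LATIN SMALL LETTER Y WITH DIAERESIS

utf8_bytes = s.encode("utf-8")          # two-byte sequence
latin1_bytes = s.encode("iso-8859-1")   # single byte

print(utf8_bytes.hex(), latin1_bytes.hex())  # c3bf ff

# A lone 0xFF byte can never be decoded as UTF-8.
try:
    b"\xff".decode("utf-8")
    ff_is_valid_utf8 = True
except UnicodeDecodeError:
    ff_is_valid_utf8 = False

print(ff_is_valid_utf8)  # False
```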

Get the UTF-8 Encoding of a Character in Bytes

On a String, I can use utf8 and count to get the number of bytes required to encode the String with UTF-8 encoding:
"a".utf8.count // 1
"チャオ".utf8.count // 9
"チ".utf8.count // 3
However, I don't see an equivalent method on a single Character value. To get the number of bytes required to encode a character in the string to UTF-8, I could iterate through the string by character, convert the Character to a String, and get the utf8.count of that String:
"チャオ".characters.forEach({print(String($0).utf8.count)}) // 3, 3, 3
This seems unnecessarily verbose. Is there a way to get the UTF-8 encoding of a Character in Swift?
Character has no direct (public) accessor to its UTF-8 representation.
There are some internal methods in Character.swift dealing with the UTF-8 bytes, but the public API is implemented in String.UTF8View in StringUTF8.swift.
Therefore String(myChar).utf8.count is the correct way to obtain the length of the character's UTF-8 representation.