Manually converting unicode codepoints into UTF-8 and UTF-16 - unicode

I have a university programming exam coming up, and one section is on unicode.
I have checked all over for answers to this, and my lecturer is useless so that’s no help, so this is a last resort for you guys to possibly help.
The question will be something like:
The string 'mЖ丽' has these Unicode codepoints: U+006D, U+0416 and
U+4E3D. With your answers written in hexadecimal, manually encode the
string into UTF-8 and UTF-16.
Any help at all will be greatly appreciated as I am trying to get my head round this.

Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
The clearest description I've seen so far for the rules to encode UCS codepoints to UTF-8 are from the utf-8(7) manpage on many Linux systems:
Encoding
The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
[... removed obsolete five and six byte forms ...]
The xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent the
code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
as 0xfffe and 0xffff (UCS noncharacters) should not appear in
conforming UTF-8 streams.
It might be easier to remember a 'compressed' version of the chart:
The initial byte of a multi-byte sequence starts with one 1 bit per byte in the sequence, followed by a 0; subsequent (continuation) bytes start with 10.
From 0x80: 5 bits in the lead byte, plus one continuation byte
From 0x800: 4 bits in the lead byte, plus two continuation bytes
From 0x10000: 3 bits in the lead byte, plus three continuation bytes
You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:
2**(5+1*6) == 2048 == 0x800
2**(4+2*6) == 65536 == 0x10000
2**(3+3*6) == 2097152 == 0x200000
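If it helps, the same arithmetic can be checked mechanically; here is a rough Python sketch (my own illustration, any language with exponentiation would do) that derives the boundaries from the bit budget of each form:

# payload bits in the lead byte, plus 6 bits per continuation byte
for lead_bits, cont_bytes in ((7, 0), (5, 1), (4, 2), (3, 3)):
    limit = 2 ** (lead_bits + 6 * cont_bytes)
    print(f"{1 + cont_bytes}-byte form covers codepoints below {hex(limit)}")
# prints 0x80, 0x800, 0x10000 and 0x200000 as the upper bounds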
I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)
Update
Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:
U+4E3E
This fits in the 0x00000800 - 0x0000FFFF range (0x800 ≤ 0x4E3E ≤ 0xFFFF), so the representation will be of the form:
1110xxxx 10xxxxxx 10xxxxxx
0x4E3E is 100111000111110b. Drop the bits into the x above (start from the right, we'll fill in missing bits at the start with 0):
1110x100 10111000 10111110
There is an x spot left over at the start, fill it in with 0:
11100100 10111000 10111110
Convert from bits to hex:
0xE4 0xB8 0xBE
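The same shuffling can be written as shifts and masks; this small Python sketch (my own illustration, not part of the derivation above) rebuilds the three bytes for U+4E3E:

cp = 0x4E3E
# 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
b1 = 0xE0 | (cp >> 12)          # top 4 bits of the codepoint
b2 = 0x80 | ((cp >> 6) & 0x3F)  # middle 6 bits
b3 = 0x80 | (cp & 0x3F)         # low 6 bits
print(f"{b1:02X} {b2:02X} {b3:02X}")  # E4 B8 BE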

The descriptions on Wikipedia for UTF-8 and UTF-16 are good.
Procedures for your example string:
UTF-8
UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:
1-byte UTF-8 = 0xxxxxxx = 7 bits = 0x00-0x7F
The initial byte of a 2-, 3- or 4-byte UTF-8 sequence starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on (continuation) bytes always start with the two-bit pattern 10, leaving 6 bits for data:
2-byte UTF-8 = 110xxxxx 10xxxxxx = 5+6 (11) bits = 0x80-0x7FF
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx = 4+6+6 (16) bits = 0x800-0xFFFF
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 3+6+6+6 (21) bits = 0x10000-0x10FFFF†
†Unicode codepoints are undefined beyond 0x10FFFF.
Your codepoints are U+006D, U+0416 and U+4E3D requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:
U+006D = 1101101b = 01101101b = 0x6D
U+0416 = 10000 010110b = 11010000 10010110b = 0xD0 0x96
U+4E3D = 0100 111000 111101b = 11100100 10111000 10111101b = 0xE4 0xB8 0xBD
Final byte sequence:
6D D0 96 E4 B8 BD
or if nul-terminated strings are desired:
6D D0 96 E4 B8 BD 00
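As a quick cross-check, any UTF-8 encoder should agree; for example this Python sketch (my own illustration) prints the same bytes:

utf8 = 'mЖ丽'.encode('utf-8')
print(' '.join(f'{b:02X}' for b in utf8))  # 6D D0 96 E4 B8 BD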
UTF-16
UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:
U+0000 to U+D7FF uses the 2-byte values 0x0000 to 0xD7FF
U+D800 to U+DFFF are invalid codepoints, reserved for 4-byte UTF-16
U+E000 to U+FFFF uses the 2-byte values 0xE000 to 0xFFFF
U+10000 to U+10FFFF uses 4-byte UTF-16, encoded as follows:
Subtract 0x10000 from the codepoint.
Express the result as 20-bit binary.
Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx to encode the upper and lower 10 bits into two 16-bit words.
Using your codepoints:
U+006D = 0x006D
U+0416 = 0x0416
U+4E3D = 0x4E3D
Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:
big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E
With nul-termination, U+0000 = 0x0000:
big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
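If you want to see the BOM and both byte orders produced by a real encoder, here is a Python sketch (my own illustration; the explicit -be/-le codecs omit the BOM, the plain utf-16 codec adds one in native order):

s = 'mЖ丽'
for codec in ('utf-16-be', 'utf-16-le', 'utf-16'):
    print(codec, ' '.join(f'{b:02X}' for b in s.encode(codec)))
# utf-16-be 00 6D 04 16 4E 3D        (big-endian, no BOM)
# utf-16-le 6D 00 16 04 3D 4E        (little-endian, no BOM)
# utf-16    FF FE 6D 00 16 04 3D 4E  (BOM plus native order; little-endian on most machines)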
Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:
U+1F031 = 0x1F031 - 0x10000 = 0xF031 = 0000111100 0000110001b =
1101100000111100 1101110000110001b = 0xD83C 0xDC31
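The surrogate arithmetic is easy to check in code; a short Python sketch (my own illustration) for the same example:

cp = 0x1F031
v = cp - 0x10000               # 20-bit value
high = 0xD800 | (v >> 10)      # 110110 + upper 10 bits
low  = 0xDC00 | (v & 0x3FF)    # 110111 + lower 10 bits
print(f"{high:04X} {low:04X}")  # D83C DC31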

The following program will do the necessary work. It may not be "manual" enough for your purposes, but at a minimum you can check your work.
#!/usr/bin/perl
use 5.012;
use strict;
use utf8;
use autodie;
use warnings;
use warnings qw< FATAL utf8 >;
no warnings qw< uninitialized >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
use feature qw< unicode_strings >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;

my ($x) = "mЖ丽";

# Write the string through encoding layers so the files hold raw UTF-8/UTF-16 bytes.
open(U8,">:encoding(utf8)","/tmp/utf8-out");
print U8 $x;
close(U8);
open(U16,">:encoding(utf16)","/tmp/utf16-out");
print U16 $x;
close(U16);

# Dump the file bytes with od, and cross-check with Encode's in-memory encoding.
system("od -t x1 /tmp/utf8-out");
my $u8 = encode("utf-8",$x);
print "utf-8: 0x".unpack("H*",$u8)."\n";
system("od -t x1 /tmp/utf16-out");
my $u16 = encode("utf-16",$x);
print "utf-16: 0x".unpack("H*",$u16)."\n";

Related

Problems writing string with ASCII value > 127 to serial port with Powershell script

My problem is to retrieve real-time data from an inverter (Voltronic family).
The inverter has a server and, if correctly asked, can send back information according to a communication protocol.
The communication is done through the serial port.
In particular a string similar to "XXXX" + <crc> + CR has to be sent, and the relevant data are sent back.
In my case the only string I need to send is "QPIGS". In response I get back a lot of information that allows me to produce a sort of control desk.
Since the string I need is always and only this one, I made an off-line calculation of the <crc> that I need to complete the request.
The <crc> value is composed of two bytes, "·©". The first is the middle dot, hex b7, and the second is the copyright sign, hex a9.
So the complete string should be "QPIGS·©". If I add the CR in PowerShell ("`r"), the complete string should be "QPIGS·©`r".
The script is very simple:
$port= new-Object System.IO.Ports.SerialPort COM1,2400,None,8,one
$port.ReadTimeout = 1000
$port.open()
$str='QPIGS·©`r'
$port.WriteLine($str)
Start-Sleep -Milliseconds 300
while ($x = $port.ReadExisting())
{
Write-Host $x
}
$port.Close()
But unfortunately it didn't work.
The inverter recognises the string, but it doesn't match what it was expecting, so it sends back a NACK response. The exchange happens but is not successful.
In order to investigate more deeply I used a serial port sniffer to see what was really sent to the inverter, and I found that what was sent is the following:
175 15/10/2022 17:06:29 IRP_MJ_WRITE DOWN 51 50 49 47 53 3f 3f 0a QPIGS??. 8 8 COM1
instead of what I was expecting
175 15/10/2022 17:06:29 IRP_MJ_WRITE DOWN 51 50 49 47 53 b7 a9 0d QPIGS·©. 8 8 COM1
It seems that the two <crc> bytes are ignored and substituted with two ? characters, hex 3f.
I imagine it is an encoding problem, but I can't find a solution.
Thanks for your help.
Tip of the hat to CherryDT for his comments and links to relevant related posts.
the complete string should be "QPIGS·©" [ + "`r" for a CR]
If you send strings to a serial port, its character encoding matters.
The default encoding is ASCII, which means that only Unicode characters in the 7-bit ASCII subrange can be sent, which excludes · (MIDDLE DOT, U+00B7) and © (COPYRIGHT SIGN, U+00A9) - that is, any Unicode character whose code point is greater than 0x7f (127) is "lossily" converted to a verbatim ASCII-range ? character, 0x3f (63).
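You can reproduce the substitution outside of the serial-port code; this Python sketch (my own illustration, just to show the encoding behaviour, not the SerialPort API) compares the ASCII and Windows-1252 encodings of the command string:

cmd = 'QPIGS\u00b7\u00a9\r'   # QPIGS + MIDDLE DOT + COPYRIGHT SIGN + CR
for codec in ('ascii', 'cp1252'):
    data = cmd.encode(codec, errors='replace')
    print(codec, ' '.join(f'{b:02x}' for b in data))
# ascii  51 50 49 47 53 3f 3f 0d   (the two high bytes become '?')
# cp1252 51 50 49 47 53 b7 a9 0d   (the desired bytes)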
You have two basic options:
Avoid string processing altogether and send an array of bytes: convert the QPIGS substring to an array of (ASCII-range) byte values and append the byte values 0xb7, 0xa9 and 0xd (CR):
Because .NET strings are Unicode strings (encoded as UTF-16LE code units), you can take advantage of the fact that the code-point range 0x0 - 0x7f coincides with the ASCII code-point range, so you can simply cast ASCII-range characters to [byte[]] (via a [char[]] cast):
# Results in the following byte array:
# [byte[]] (0x51, 0x50, 0x49, 0x47, 0x53, 0xb7, 0xa9, 0xd)
[byte[]] $bytes = [char[]] 'QPIGS' + 0xb7, 0xa9, [char] "`r"
$port.Write($bytes, 0, $bytes.Count)
Use the port's .Encoding property to specify a character encoding in which the string "QPIGS·©`r" results in the desired byte values:
In this case you need the Windows-1252 encoding, where · is represented as byte value 0xb7, and © as 0xa9, and all ASCII-range characters are represented by their usual byte values:
$port.Encoding = [System.Text.Encoding]::GetEncoding(1252)
$port.Write("QPIGS·©`r")

how does UTF-8 end up with bigger bits than UTF-16 [duplicate]


ASCII characters converted to 2 hexadecimal numbers

Could someone tell me why characters from the extended ASCII table are being converted to two hexadecimal numbers (two bytes) instead of one? For example:
a = 61
â = C3 A2 (even though it would normally be encoded as E2)
This is "Hex UTF-8 bytes".
U+007F (127) -> 1 Byte
U+07FF (2,047) -> 2 Byte
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A2&mode=char
http://unicode.mayastudios.com/examples/utf8.html
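A short Python check (my own illustration, not part of the linked converters) makes the difference concrete:

for codec in ('utf-8', 'latin-1'):
    data = 'â'.encode(codec)
    print(codec, ' '.join(f'{b:02X}' for b in data))
# utf-8   C3 A2   (two bytes)
# latin-1 E2      (a single byte, the "extended ASCII" value)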

What is this character separator: ^_?

I dumped a SQLite3 table (from an Anki deck) to a CSV file. I found that the sfld column is separated by ^_.
What is this character or escape character in Unicode?
It's a control-underscore (Control-_), 0x1F, the Unit Separator character from the ASCII (and ISO 8859-x and Unicode) control characters.
The upper-case letters in ASCII, ISO 8859-x and Unicode have code points (all numbers in hex):
41 U+0041 LATIN CAPITAL LETTER A
…
5A U+005A LATIN CAPITAL LETTER Z
The subsequent characters are:
5B U+005B LEFT SQUARE BRACKET
5C U+005C REVERSE SOLIDUS
5D U+005D RIGHT SQUARE BRACKET
5E U+005E CIRCUMFLEX ACCENT
5F U+005F LOW LINE
The control characters like Control-A have a code 0x40 less than the upper-case letters, so you have
01 U+0001 START OF HEADING (aka SOH or Control-A)
…
1A U+001A SUBSTITUTE (aka SUB or Control-Z)
and then you get:
1B U+001B ESCAPE (aka ESC or Control-[)
1C U+001C FILE SEPARATOR (aka FS or Control-\)
1D U+001D GROUP SEPARATOR (aka GS or Control-])
1E U+001E RECORD SEPARATOR (aka RS or Control-^)
1F U+001F UNIT SEPARATOR (aka US or Control-_)
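The 0x40 offset is easy to verify; a tiny Python sketch (my own illustration):

for ch in ('[', '\\', ']', '^', '_'):
    print(f"Control-{ch} -> U+{ord(ch) - 0x40:04X}")
# Control-_ -> U+001F, the character shown as ^_ in the dump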

Convert text to binary and store in a single array in matlab

I need to convert the given text (not in file format) into binary values and store them in a single array that is to be given as input to another function in MATLAB.
Example:
Hi how are you ?
It is to be converted into binary and stored in an array. I have used the dec2bin() function but I did not succeed in getting the required output.
Sounds a bit like a trick question. In MATLAB, a character array (string) is just a different representation of 16-bit unsigned character codes.
>> str = 'Hi, how are you?'
str =
Hi, how are you?
>> whos str
Name      Size            Bytes  Class    Attributes
str       1x16               32  char
Note that the 16 characters occupy 32 bytes, or 2 bytes (16 bits) per character. From the documentation for char:
Valid codes range from 0 to 65535, where codes 0 through 127 correspond to 7-bit ASCII characters. The characters that MATLAB® can process (other than 7-bit ASCII characters) depend upon your current locale setting. To convert characters into a numeric array, use the double function.
Now, you could use double as it recommends to get the character codes into double arrays, but a minimal representation would simply involve uint16:
int16bStr = uint16(str)
To split this into bytes, typecast into 8-bit integers:
typecast(int16bStr,'uint8')
which yields 32 uint8 values (bytes), which are suitable for conversion to binary representation with dec2bin, if you want to see the binary (but these arrays are already binary data).
If you don't expect anything other than ASCII characters, just throw out the extra bits from the start:
>> int8bStr = uint8(str)
int8bStr =
72  105   44   32  104  111  119   32   97  114  101   32  121  111  117   63
>> binStr = reshape(dec2bin(int8bStr.'),1,[])
binStr =
110011101110111001111111111111110000001001001011111011000000 <...snip...>