UTF-8 & Unicode, what's with 0xC0 and 0x80?

I've been reading about Unicode and UTF-8 over the last couple of days, and I often come across a bitwise comparison similar to this:
int strlen_utf8(char *s)
{
    int i = 0, j = 0;
    while (s[i])
    {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
Can someone clarify the comparison with 0xc0 and how it checks the most significant bit?
Thank you!
EDIT: ANDed, not comparison, used the wrong word ;)

It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.
The bit mask 0xc0 is 11 00 00 00 so what the AND is doing is extracting only the top two bits:
    ab cd ef gh
AND 11 00 00 00
    -----------
  = ab 00 00 00
This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
"Why?", I hear you ask. Well, that's a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:
UTF-8
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx
U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx
U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx
U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx
So, what this little snippet is doing is going through every byte of your UTF-8 string and counting up all the bytes that aren't continuation bytes (i.e., it's getting the length of the string, as advertised). See the Wikipedia article on UTF-8 for more detail and Joel Spolsky's excellent Unicode primer.
An interesting aside, by the way: you can classify bytes in a UTF-8 stream as follows:
With the high bit set to 0, it's a single byte value.
With the two high bits set to 10, it's a continuation byte.
Otherwise, it's the first byte of a multi-byte sequence and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc).
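For illustration, here is a rough Python equivalent of the C snippet above (a sketch using the same mask trick, not a drop-in replacement):
def utf8_length(data: bytes) -> int:
    """Count characters by skipping continuation bytes (those matching 10xxxxxx)."""
    count = 0
    for b in data:
        if (b & 0xC0) != 0x80:  # not a continuation byte, so it starts a new character
            count += 1
    return count

print(utf8_length("héllo".encode("utf-8")))  # 5, even though the encoding is 6 bytes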

Related

Utf8 encoding makes me confused

let buf1 = Buffer.from("3", "utf8");
let buf2 = Buffer.from("Здравствуйте", "utf8");
// <Buffer 33>
// <Buffer d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5>
Why does the character '3' encode to 33 in buf1, while the first character of buf2 encodes to d0 97?
Because 3 is not З, despite the similarity to the untrained eye. Look closer and you'll see the difference, however subtle.
The former is Unicode code point U+0033 - DIGIT THREE (see here), while the latter is U+0417 - CYRILLIC CAPITAL LETTER ZE (see here), encoded in UTF-8 as d0 97.
The Russian word actually means hello, pronounced (very roughly, since I only know hello and goodbye, taught by a Russian girlfriend many decades ago) "Strasvoytza", with no "three" anywhere in the concept.
The first character of the second buffer is the Cyrillic character "Ze" https://en.m.wikipedia.org/wiki/Ze_(Cyrillic) and not the Arabic numeral 3 https://en.m.wikipedia.org/wiki/3
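To see the same thing outside Node, a quick Python sketch (the bytes match what Buffer shows):
print("3".encode("utf-8").hex())   # 33   - ASCII digit three, one byte
print("З".encode("utf-8").hex())   # d097 - U+0417 CYRILLIC CAPITAL LETTER ZE, two bytes
print(hex(ord("З")))               # 0x417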

convert 16 bits signed (x2) to 32bits unsigned

I've got a problem with a Modbus device:
The device sends data using the Modbus protocol.
I read 4 bytes from the Modbus communication that represent a pressure value.
I have to convert these 4 bytes into an unsigned 32-bit integer.
Here is the Modbus documentation:
COMBINING 16bit REGISTERS TO 32bit VALUE
Pressure registers 2 & 3 in SENSOR INPUT REGISTER MAP of this guide are stored as u32 (UNSIGNED 32bit INTEGER)
You can calculate pressure manually:
1) Determine what display you have - if register values are positive skip to step 3.
2) Convert negative register 2 & 3 values from Signed to Unsigned (note: 65536 = 2^16):
(reg 2 value) + 65536 = 35464 ; (reg 3 value) + 65536 = 1
3) Shift register #3 as this is the upper 16 bits: 65536 * (converted reg 3 value) = 65536
4) Put two 16bit numbers together: (converted reg 2 value) + (converted reg 3 value) = 35464 + 65536 = 101000 Pa
Pressure information is then 101000 Pascal.
I don't find it very clear... For example, we aren't given the 4 bytes that produce this calculation.
So, if anybody has a formula to convert my bytes into a 32-bit unsigned int, that would be very helpful.
You should be able to read your bytes in some kind of representation (hex, dec, bin, oct, ...).
Let's assume you're receiving the following byte frame:
in hex:
0x00, 0x06, 0x68, 0xA0
in bin:
0000 0000, 0000 0110, 0110 1000, 1010 0000
All of these are different representations of the same 4-byte value.
Another thing you should know is the byte order (endianness):
If your frame is transmitted in big endian, you read the bytes in the order you received them (so 0x00, 0x06, 0x68, 0xA0 is correct).
If the frame is transmitted in little endian, you need to perform the following operation:
Switch the first 2 bytes with the last 2:
0x68, 0xA0, 0x00, 0x06
and then swap the first byte with the second, and the third byte with the fourth:
0xA0, 0x68, 0x06, 0x00
so if your frame is in little endian, the correct frame will be 0xA0, 0x68, 0x06, 0x00.
If you don't know the endianness, assume it's big endian.
Now you simply have to 'put' your values together:
0x00, 0x06, 0x68, 0xA0 will become 0x000668A0
or
0000 0000, 0000 0110, 0110 1000, 1010 0000 will become 00000000000001100110100010100000
Once you have your hex or bin, you can convert your bin to an integer or convert your hex to an integer
Here you can find an interesting tool for converting hex to float, uint32, int32, and int16 in any endianness.
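If you want to do that combination by hand, here is a small Python sketch with the example frame above (shift each byte into place and OR them together):
b0, b1, b2, b3 = 0x00, 0x06, 0x68, 0xA0       # big-endian frame from the example
value = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3
print(hex(value), value)                      # 0x668a0 420000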
TL;DR
If you can use Python, you should use struct:
import struct

frame = [0x00, 0x06, 0x68, 0xA0]  # or [0, 6, 104, 160] in dec, or [0b00000000, 0b00000110, 0b01101000, 0b10100000] in bin
# '>L' unpacks a big-endian unsigned 32-bit integer
print(struct.unpack('>L', bytes(frame))[0])  # 420000
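And if, as in the device documentation quoted in the question, you start from the two 16-bit register values instead of raw bytes, here is a minimal sketch of those steps (the function name and example register values are just for illustration, chosen to reproduce the documented 101000 Pa result):
def pressure_from_registers(reg2, reg3):
    # Step 2: convert negative signed 16-bit readings to unsigned (add 2**16)
    lo = reg2 + 65536 if reg2 < 0 else reg2   # register 2 holds the low 16 bits
    hi = reg3 + 65536 if reg3 < 0 else reg3   # register 3 holds the high 16 bits
    # Steps 3 and 4: shift the high word up and add the low word
    return hi * 65536 + lo

print(pressure_from_registers(-30072, 1))     # 35464 + 1 * 65536 = 101000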

How do I print a string in one line in MARIE?

I want to print a set of letters in one line in MARIE. I modified the code to print Hello World and came up with:
ORG 0 / implemented using "do while" loop
WHILE, LOAD STR_BASE / load str_base into ac
ADD ITR / add index to str_base
STORE INDEX / store (str_base + index) into ac
CLEAR / set ac to zero
ADDI INDEX / get the value at ADDR
SKIPCOND 400 / SKIP if ADDR = 0 (or null char)
JUMP DO / jump to DO
JUMP PRINT / JUMP to END
DO, STORE TEMP / output value at ADDR
LOAD ITR / load iterator into ac
ADD ONE / increment iterator by one
STORE ITR / store ac in iterator
JUMP WHILE / jump to while
PRINT, SUBT ONE
SKIPCOND 000
JUMP PR
HALT
PR, OUTPUT
JUMP WHILE
ONE, DEC 1
ITR, DEC 0
INDEX, HEX 0
STR_BASE, HEX 12 / memory location of str
STR, HEX 48 / H
HEX 65 / E
HEX 6C / L
HEX 6C / L
HEX 6F / O
HEX 0 / carriage return
HEX 57 / W
HEX 6F / O
HEX 72 / R
HEX 6C / L
HEX 64 / D
HEX 0 / NULL char
My program ends up halting after two iterations. I can't seem to figure out how to print a set of characters on one line. Thanks.
Your value of STR_BASE is almost certainly incorrect. Based on what is here, I would say it needs to be 18 instead of 12. Also, you would want to either replace the current null char between "HELLO" and "WORLD" with a space or simply remove that line, depending on your intended output.

Two byte report count for hid report descriptor

I'm trying to create an HID report descriptor for USB 3.0 with a report count of 1024 bytes.
The documentation at usb.org for HID does not seem to mention a two-byte report count. Nonetheless, I have seen some people use 0x96 (instead of 0x95) to enter a two-byte count, such as:
0x96, 0x00, 0x02, // REPORT_COUNT (512)
which was taken from here:
Custom HID device HID report descriptor
Likewise, in this same example, 0x26 is used for a two-byte logical maximum.
Where do these 0x96 and 0x26 prefixes come from? I don't see any documentation for them.
REPORT_COUNT is defined in the Device Class Definition for HID 1.11 document in section 6.2.2.7 Global Items on page 36 as:
Report Count 1001 01 nn Unsigned integer specifying the number of data
fields for the item; determines how many fields are included in the
report for this particular item (and consequently how many bits are
added to the report).
The nn in the above code is the item length indicator (bSize) and is defined earlier in section 6.2.2.2 Short Items as:
bSize Numeric expression specifying size of data:
0 = 0 bytes
1 = 1 byte
2 = 2 bytes
3 = 4 bytes
Rather confusingly, the valid values of bSize are listed in decimal. So, in binary, the bits for nn would be:
00 = 0 bytes (i.e. there is no data associated with this item)
01 = 1 byte
10 = 2 bytes
11 = 4 bytes
Putting it all together for REPORT_COUNT, which is an unsigned integer, the following alternatives could be specified:
1001 01 00 = 0x94 = REPORT_COUNT with no length (can only have value 0?)
1001 01 01 = 0x95 = 1-byte REPORT_COUNT (can have a value from 0 to 255)
1001 01 10 = 0x96 = 2-byte REPORT_COUNT (can have a value from 0 to 65535)
1001 01 11 = 0x97 = 4-byte REPORT_COUNT (can have a value from 0 to 4294967295)
Similarly, for LOGICAL_MAXIMUM, which is a signed integer (usually, there is an exception):
0010 01 00 = 0x24 = LOGICAL_MAXIMUM with no length (can only have value 0?)
0010 01 01 = 0x25 = 1-byte LOGICAL_MAXIMUM (can have a value from -128 to 127)
0010 01 10 = 0x26 = 2-byte LOGICAL_MAXIMUM (can have a value from -32768 to 32767)
0010 01 11 = 0x27 = 4-byte LOGICAL_MAXIMUM (can have a value from -2147483648 to 2147483647)
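For what it's worth, here is a small Python sketch of how such a prefix byte and its little-endian data bytes go together for the unsigned REPORT_COUNT case (the helper name is made up; signed items like LOGICAL_MAXIMUM would additionally need sign-aware size selection, which is why 255 is usually encoded as 0x26, 0xFF, 0x00):
def report_count_item(count):
    """Build a REPORT_COUNT short item: prefix 1001 01 nn plus 1, 2 or 4
    little-endian data bytes, using the smallest size that fits."""
    if count <= 0xFF:
        return bytes([0x95]) + count.to_bytes(1, "little")
    if count <= 0xFFFF:
        return bytes([0x96]) + count.to_bytes(2, "little")
    return bytes([0x97]) + count.to_bytes(4, "little")

print(report_count_item(8).hex())     # 9508
print(report_count_item(512).hex())   # 960002
print(report_count_item(1024).hex())  # 960004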
The specification is unclear on what value a zero-length item defaults to in general. It only mentions, at the end of section 6.2.2.4 Main Items, that MAIN item types and, within that type, INPUT item tags, have a default value of 0:
Remarks - The default data value for all Main items is zero (0).
- An Input item could have a data size of zero (0) bytes. In this case the value of each data bit for the item can be assumed to be zero. This is functionally identical to using an item tag that specifies a 4-byte data item followed by four zero bytes.
It would be reasonable to assume 0 as the default for other item types too, but for REPORT_COUNT (a GLOBAL item) a value of 0 is not really a sensible default (IMHO). The specification doesn't really say.

Need help identifying and computing a number representation

I need help identifying the following number format.
For example, the following number format in MIB:
0x94 0x78 = 2680
0x94 0x78 in binary: [1001 0100] [0111 1000]
It seems that if the MSB is 1, it means another byte follows it, and if it is 0, it is the end of the number.
So the value 2680 is [001 0100] [111 1000], which formatted properly is [0000 1010] [0111 1000].
What is this number format called and what's a good way for computing this besides bit manipulation and shifting to a larger unsigned integer?
I have seen this called either 7bhm (7-bit has-more) or VLQ (variable length quantity); see http://en.wikipedia.org/wiki/Variable-length_quantity
This is stored big-endian (most significant byte first), as opposed to the C# BinaryReader.Read7BitEncodedInt method described at Encoding an integer in 7-bit format of C# BinaryReader.ReadString
I am not aware of any method of decoding other than bit manipulation.
Sample PHP code can be found at
http://php.net/manual/en/function.intval.php#62613
or in Python I would do something like
def encode_7bhm(i):
    # emit the low 7 bits last, with the high bit clear to mark the end
    o = [chr(i & 0x7f)]
    i //= 128
    while i > 0:
        # earlier 7-bit groups get the high "has more" bit set
        o.insert(0, chr(0x80 | (i & 0x7f)))
        i //= 128
    return ''.join(o)

def decode_7bhm(s):
    o = 0
    for i in range(len(s)):
        v = ord(s[i])
        o = 128 * o + (v & 0x7f)
        if v & 0x80 == 0:
            # found end of encoded value
            break
    else:
        # ran out of string and the end marker was never found - error!
        raise TypeError
    return o
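With the helpers above, the example value from the question round-trips as expected:
print(encode_7bhm(2680) == '\x94\x78')  # True
print(decode_7bhm('\x94\x78'))          # 2680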