Why are constants in C header files of libraries always defined as hexadecimal?

No matter which C-compatible library I use, when I look at its header-defined constants, they are always given as hexadecimal values. Here, for instance, in GL/gl.h:
#define GL_POINTS 0x0000
#define GL_LINES 0x0001
#define GL_LINE_LOOP 0x0002
#define GL_LINE_STRIP 0x0003
#define GL_TRIANGLES 0x0004
#define GL_TRIANGLE_STRIP 0x0005
#define GL_TRIANGLE_FAN 0x0006
#define GL_QUADS 0x0007
#define GL_QUAD_STRIP 0x0008
#define GL_POLYGON 0x0009
Is there any particular reason for this convention? Why not simply use decimal values instead?

There are a number of possible reasons:
1) Bit flags are much easier to express as hex, since each hex digit represents exactly 4 bits.
2) Even for values which aren't explicitly bit flags, there are often intentional bit patterns that are more evident when written as hex.
For instance, the AlphaFunction values all share the high byte 0x02 and differ only in the low byte:
#define GL_NEVER 0x0200
#define GL_LESS 0x0201
#define GL_EQUAL 0x0202
#define GL_LEQUAL 0x0203
#define GL_GREATER 0x0204
#define GL_NOTEQUAL 0x0205
#define GL_GEQUAL 0x0206
#define GL_ALWAYS 0x0207
3) Hex values are allowed to have leading zeroes, so it is easier to line up the values, which makes reading (and proof-reading) easier. You might be surprised that leading zeroes are allowed in hex and octal literals but not in decimal ones; a decimal value written with a leading zero would in fact be parsed as octal. The C++ spec says quite explicitly:
A decimal integer literal (base ten) begins with a digit other than 0 and consists of a sequence of decimal digits.
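A tiny standalone C illustration of that octal pitfall (not from any library header):
#include <stdio.h>

int main(void)
{
    int hex_padded  = 0x0010;  /* leading zeroes are fine in hex: still 16 */
    int oct_literal = 010;     /* a leading zero makes this octal: 8, not 10 */

    printf("%d %d\n", hex_padded, oct_literal);    /* prints: 16 8 */
    return 0;
}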

If the constant values refer to bit flags, and are intended to be combined, then hex notation is a convenient way of showing which bits each constant represents.
For example, from a Boost header:
// Type encoding:
//
// bit 0: callable builtin
// bit 1: non member
// bit 2: naked function
// bit 3: pointer
// bit 4: reference
// bit 5: member pointer
// bit 6: member function pointer
// bit 7: member object pointer
#define BOOST_FT_type_mask 0x000000ff // 1111 1111
#define BOOST_FT_callable_builtin 0x00000001 // 0000 0001
#define BOOST_FT_non_member 0x00000002 // 0000 0010
#define BOOST_FT_function 0x00000007 // 0000 0111
#define BOOST_FT_pointer 0x0000000b // 0000 1011
#define BOOST_FT_reference 0x00000013 // 0001 0011
#define BOOST_FT_non_member_callable_builtin 0x00000003 // 0000 0011
#define BOOST_FT_member_pointer 0x00000020 // 0010 0000
#define BOOST_FT_member_function_pointer 0x00000061 // 0110 0001
#define BOOST_FT_member_object_pointer 0x000000a3 // 1010 0011

It is shorter, but more importantly, if they are bit flags, it is easier to combine them and make masks.
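To make the combining concrete, here is a small self-contained C sketch; the flag names are invented for illustration and are not taken from any real header:
#include <stdio.h>

/* Hypothetical flags: each hex digit maps onto four bits, so the set bits
 * are visible at a glance. */
#define FLAG_READ    0x0001  /* 0000 0001 */
#define FLAG_WRITE   0x0002  /* 0000 0010 */
#define FLAG_APPEND  0x0004  /* 0000 0100 */
#define FLAG_CREATE  0x0008  /* 0000 1000 */
#define FLAG_RW_MASK 0x0003  /* 0000 0011 -- the read and write bits */

int main(void)
{
    unsigned int mode = FLAG_READ | FLAG_CREATE;   /* combine: 0x0009 */

    if (mode & FLAG_READ)
        printf("read bit is set\n");

    if ((mode & FLAG_RW_MASK) == FLAG_READ)
        printf("read-only as far as the read/write mask is concerned\n");

    printf("mode = 0x%04x\n", mode);               /* prints mode = 0x0009 */
    return 0;
}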

Related

how to set max-functions in pcie device tree? (about the syntax "max-functions = /bits/ 8 <8>;")

In linux-6.15.68, in Documentation/devicetree/bindings/pci/rockchip-pcie-ep.txt, I see these explanations. (please see the marked line.)
Optional Property:
- num-lanes: number of lanes to use
- max-functions: Maximum number of functions that can be configured (default 1).
pcie0-ep: pcie@f8000000 {
    compatible = "rockchip,rk3399-pcie-ep";
    #address-cells = <3>;
    #size-cells = <2>;
    rockchip,max-outbound-regions = <16>;
    clocks = <&cru ACLK_PCIE>, <&cru ACLK_PERF_PCIE>,
             <&cru PCLK_PCIE>, <&cru SCLK_PCIE_PM>;
    clock-names = "aclk", "aclk-perf",
                  "hclk", "pm";
    max-functions = /bits/ 8 <8>; // <---- see this line
    num-lanes = <4>;
    reg = <0x0 0xfd000000 0x0 0x1000000>, <0x0 0x80000000 0x0 0x20000>;
    <skip>
In the example dts, what does "max-functions = /bits/ 8 <8>;" mean?
I found in Documentation/devicetree/bindings/pci/snps,dw-pcie-ep.yaml, it says
max-functions:
$ref: /schemas/types.yaml#/definitions/uint32
description: maximum number of functions that can be configured
But I don't know how to read the $ref document.
ADD
I found this.
The storage size of an element can be changed using the /bits/ prefix. The /bits/ prefix allows for the creation of 8, 16, 32, and 64-bit elements. The resulting array will not be padded to a multiple of the default 32-bit element size.
e.g. interrupts = /bits/ 8 <17 0xc>;
e.g. clock-frequency = /bits/ 64 <0x0000000100000000>;
Does this mean 17 and 0xc are both stored as 8-bit values, and that when the source is compiled to a dtb the 8-bit layout is kept? The Linux code parses the dtb file, so does the dtb contain that type information too?
The Device Tree Compiler v1.4.0 onwards supports some extra syntaxes for specifying property values that are not present in The Devicetree Specification up to at least version v0.4-rc1. These extra property value syntaxes are documented in the Device Tree Compiler's Device Tree Source Format and include:
A number in an array between angle brackets can be specified as a character literal such as 'a', '\r' or '\xFF'.
The size of elements in an array between angle brackets can be set using the prefix /bits/ and a bit-width of 8, 16, 32, or 64, defaulting to 32-bit integers.
The binary Flattened Devicetree (DTB) Format contains no explicit information on the type of a property value. A property value is just a string of bytes. Numbers (and character literals) between angle brackets in the source are converted to bytes in big-endian byte order in accordance with the element size in bits divided by 8. For example:
<0x11 'a'> is encoded in the same way as the bytestring [00 00 00 11 00 00 00 61].
/bits/ 8 <17 0xc> is encoded in the same way as the bytestring [11 0c].
It is up to the reader of the property value to "know" what type it is expecting. For example, the Rockchip AXI PCIe endpoint controller driver in the Linux kernel ("drivers/pci/controller/pcie-rockchip-ep.c") "knows" that the "max-functions" property should have been specified as a single byte and attempts to read it using the statement err = of_property_read_u8(dev->of_node, "max-functions", &ep->epc->max_functions);. (It is probably encoded as a single byte property for convenience so that it can be copied directly into the u8 max_functions member of a struct pci_epc.)
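As a concrete picture of that byte layout, here is a small standalone C sketch (illustrative only, not kernel code) of what a reader of the property value sees for a default 32-bit cell versus a /bits/ 8 value:
#include <stdio.h>
#include <stdint.h>

/* Property values in a dtb are just byte strings; multi-byte elements
 * are stored big-endian, and the reader must know the expected width. */

static const uint8_t as_u32_cell[] = { 0x00, 0x00, 0x00, 0x08 }; /* max-functions = <8>;          */
static const uint8_t as_u8_cell[]  = { 0x08 };                   /* max-functions = /bits/ 8 <8>; */

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

int main(void)
{
    printf("u32 property: %u (4 bytes)\n", (unsigned)read_be32(as_u32_cell));
    printf("u8 property:  %u (1 byte)\n",  (unsigned)as_u8_cell[0]);
    return 0;
}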

Create Unicode from a hex number in C++

My objective is to take a character which represents the UK pound symbol and convert it to its Unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>
int main()
{
    char x = 163;
    unsigned char ux = x;
    const char *str = "\u00A3";
    printf("x: %d\n", x);
    printf("ux: %d %x\n", ux, ux);
    printf("str: %s\n", str);
    return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string holding the Unicode representation of the pound sign: "\u00A3"
What exactly is your question? Anyway, you say you're writing C++, but you're using char*, printf and stdio.h, so you're really writing C, and plain C has no built-in notion of Unicode. Remember that a char in C is not a "character", it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you printf the "\u00A3" string in your sample program you are not printing a Unicode character as such: the compiler translated the \u00A3 escape into a short byte sequence (0xC2 0xA3 in a UTF-8 execution character set), and your terminal is helping you out by interpreting those bytes as the £ character. It only prints correctly because your compiler's execution character set and your terminal's encoding happen to agree. You can see this for yourself: if you print str[0] on its own you get just the first byte of that sequence, not a complete character.
If you want to handle Unicode correctly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++11 or newer and use std::wstring and friends. But what you are doing now is not real Unicode handling and will not work as you expect in the long run.
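That said, for the narrow goal in the question, turning the single byte 0xA3 (the Latin-1 code for the pound sign) into a UTF-8 string at run time, the two-byte UTF-8 form of any codepoint below 0x800 can be computed by hand. A minimal C sketch of that arithmetic (not a general-purpose Unicode routine):
#include <stdio.h>

int main(void)
{
    unsigned char ux = 0xA3;              /* Latin-1 code for the pound sign */
    char out[3];

    /* Two-byte UTF-8 for codepoints 0x80-0x7FF: 110xxxxx 10xxxxxx */
    out[0] = (char)(0xC0 | (ux >> 6));    /* 0xC2 */
    out[1] = (char)(0x80 | (ux & 0x3F));  /* 0xA3 */
    out[2] = '\0';

    printf("%s\n", out);  /* shows the pound sign on a UTF-8 terminal */
    return 0;
}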

Scala Convert Between Unsigned and Signed Bytes

I have an unsigned data representation like this:
val str = "145 38 0 255 0 1"
I now want to get this unsigned representation as a byte string, after which I can manipulate the individual bits to extract information. So basically what I want is the unsigned version.
For example, the unsigned binary representation of 145 is 10010001, but the signed interpretation comes out as -111.
scala> 145.byteValue
res128: Byte = -111 // I would need an unsigned value instead!
So is there a function or an approach to convert the 145 to an unsigned representation?
Signed and unsigned bytes (or Ints or Longs) are just two different ways to interpret the same binary bits.
In signed bytes the first bit from the left (most significant bit) is interpreted as the sign.
0 means positive, 1 means negative.
For a negative byte, the magnitude shown after the minus sign is 256 minus the unsigned value.
So in your case we get 256 - 145 = 111, i.e. -111.
Java / Scala bitwise operators work on the underlying bits, not on the signed/unsigned interpretation, but of course the results, when printed, are still interpreted as signed.
Actually, to save on confusion, I would just work with Ints (or Shorts).
But it will work just as well with bytes.
For example, to get the n-th bit you could do something like:
// n counts bits from 1 (least significant) to 8 (most significant)
def bitValue(byte: Byte, n: Int): Int = {
  val mask = 1 << (n - 1)   // same as Math.pow(2, n - 1).toInt, without floating point
  if ((mask & byte) == 0) 0 else 1
}
bitValue(145.byteValue, 8)
res27: Int = 1
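For comparison, here is the same idea sketched in C rather than Scala (purely an illustration of the principle, assuming two's complement bytes): the bit pattern 0x91 reads as 145 unsigned and -111 signed, and masking with 0xFF in a wider type recovers the unsigned value:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t u = 0x91;            /* 1001 0001 read as unsigned: 145 */
    int8_t  s = (int8_t)0x91;    /* same bits read as signed: -111 (two's complement) */

    printf("unsigned: %d, signed: %d\n", u, s);

    /* Widen and mask to recover the unsigned value from the signed byte. */
    printf("s & 0xFF = %d\n", s & 0xFF);              /* 145 */

    /* Bit n, counting from 1 at the least significant end, as above. */
    int n = 8;
    printf("bit %d = %d\n", n, (u >> (n - 1)) & 1);   /* 1 */
    return 0;
}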

Manually converting unicode codepoints into UTF-8 and UTF-16

I have a university programming exam coming up, and one section is on unicode.
I have checked all over for answers to this and my lecturer is no help, so this is a last resort and I'm hoping you can help.
The question will be something like:
The string 'mЖ丽' has these unicode codepoints U+006D, U+0416 and
U+4E3D, with answers written in hexadecimal, manually encode the
string into UTF-8 and UTF-16.
Any help at all will be greatly appreciated as I am trying to get my head round this.
Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
The clearest description I've seen so far for the rules to encode UCS codepoints to UTF-8 are from the utf-8(7) manpage on many Linux systems:
Encoding
The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:
0x00000000 - 0x0000007F:
    0xxxxxxx
0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
[... removed obsolete five and six byte forms ...]
The xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent the
code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
as 0xfffe and 0xffff (UCS noncharacters) should not appear in
conforming UTF-8 streams.
It might be easier to remember a 'compressed' version of the chart:
Initial bytes of multi-byte sequences start with one 1 bit for each byte in the sequence, followed by a 0; continuation bytes always start with 10. The thresholds and the payload bits left in the lead byte are:
from 0x80: 5 bits in the lead byte, plus one continuation byte
from 0x800: 4 bits in the lead byte, plus two continuation bytes
from 0x10000: 3 bits in the lead byte, plus three continuation bytes
You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:
2**(5+1*6) == 2048 == 0x800
2**(4+2*6) == 65536 == 0x10000
2**(3+3*6) == 2097152 == 0x200000
I find the rules for deriving the chart easier to remember than the chart itself. Here's hoping you're good at remembering rules too. :)
Update
Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:
U+4E3E
This falls in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:
1110xxxx 10xxxxxx 10xxxxxx
0x4E3E is 100111000111110 in binary. Drop the bits into the x positions above (start from the right; we'll fill in any missing bits at the start with 0):
1110x100 10111000 10111110
There is an x spot left over at the start, fill it in with 0:
11100100 10111000 10111110
Convert from bits to hex:
0xE4 0xB8 0xBE
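If you want to check a hand-encoding like the one above mechanically, a small C routine that follows the chart can reproduce the byte sequence (a sketch for verification, not a hardened library function; it assumes the codepoint is below 0x110000 and not a surrogate):
#include <stdio.h>
#include <stdint.h>

/* Encode one codepoint into buf, returning the number of bytes written. */
static int utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x4E3E, buf);   /* the worked example above */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);        /* prints: E4 B8 BE */
    printf("\n");
    return 0;
}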
The descriptions of UTF-8 and UTF-16 on Wikipedia are good.
Procedures for your example string:
UTF-8
UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:
1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = 0x00-0x7F
The initial byte of a 2-, 3- or 4-byte UTF-8 sequence starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on bytes always start with the two-bit pattern 10, leaving 6 bits for data:
2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 5+6 (11) bits = 0x80-0x7FF
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 4+6+6 (16) bits = 0x800-0xFFFF
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 3+6+6+6 (21) bits = 0x10000-0x10FFFF†
†Unicode codepoints are undefined beyond 0x10FFFF.
Your codepoints are U+006D, U+0416 and U+4E3D requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:
U+006D = 1101101 (binary) → 01101101 (binary) = 0x6D
U+0416 = 10000 010110 (binary) → 11010000 10010110 (binary) = 0xD0 0x96
U+4E3D = 0100 111000 111101 (binary) → 11100100 10111000 10111101 (binary) = 0xE4 0xB8 0xBD
Final byte sequence:
6D D0 96 E4 B8 BD
or if nul-terminated strings are desired:
6D D0 96 E4 B8 BD 00
UTF-16
UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:
U+0000 to U+D7FF uses the 2-byte values 0x0000 to 0xD7FF
U+D800 to U+DFFF are invalid codepoints, reserved for 4-byte UTF-16
U+E000 to U+FFFF uses the 2-byte values 0xE000 to 0xFFFF
U+10000 to U+10FFFF uses 4-byte UTF-16, encoded as follows:
Subtract 0x10000 from the codepoint.
Express the result as 20-bit binary.
Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx (binary) to encode the upper and lower 10 bits into two 16-bit words.
Using your codepoints:
U+006D = 0x006D
U+0416 = 0x0416
U+4E3D = 0x4E3D
Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:
big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E
With nul-termination, U+0000 = 0x0000:
big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:
U+1F031: 0x1F031 - 0x10000 = 0xF031 = 0000111100 0000110001 (binary)
→ 1101100000111100 1101110000110001 (binary) = 0xD83C 0xDC31
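The surrogate arithmetic can be checked the same way. Here is a short C sketch of exactly those steps (subtract 0x10000, split into two 10-bit halves, add the 0xD800/0xDC00 prefixes), again just for verification:
#include <stdio.h>
#include <stdint.h>

/* Encode one codepoint as UTF-16 code units; returns 1 or 2 units.
 * Assumes a valid codepoint (not a surrogate, at most 0x10FFFF). */
static int utf16_encode(uint32_t cp, uint16_t *out)
{
    if (cp < 0x10000) {                         /* BMP: one 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* now a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: upper 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: lower 10 bits */
    return 2;
}

int main(void)
{
    uint16_t units[2];
    int n = utf16_encode(0x1F031, units);   /* the example above */
    for (int i = 0; i < n; i++)
        printf("%04X ", units[i]);          /* prints: D83C DC31 */
    printf("\n");
    return 0;
}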
The following program will do the necessary work. It may not be "manual" enough for your purposes, but at a minimum you can check your work.
#!/usr/bin/perl
use 5.012;
use strict;
use utf8;
use autodie;
use warnings;
use warnings qw< FATAL utf8 >;
no warnings qw< uninitialized >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
use feature qw< unicode_strings >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;
my ($x) = "mЖ丽";
# Write the string through PerlIO :encoding layers so Perl does the encoding.
open(U8, ">:encoding(utf8)", "/tmp/utf8-out");
print U8 $x;
close(U8);
open(U16, ">:encoding(utf16)", "/tmp/utf16-out");
print U16 $x;
close(U16);

# Dump the files, and also encode in memory with Encode for comparison.
system("od -t x1 /tmp/utf8-out");
my $u8 = encode("utf-8", $x);
print "utf-8: 0x" . unpack("H*", $u8) . "\n";
system("od -t x1 /tmp/utf16-out");
my $u16 = encode("utf-16", $x);
print "utf-16: 0x" . unpack("H*", $u16) . "\n";