UTF-8 in decimal - encoding

Is it even possible to represent UTF-8 encoding in decimal? I think only values up to 255 would be correct, am I right?
As far as I know, we can only represent UTF-8 in hex or binary form.

I think it is possible. Let's look at an example:
The Unicode code point for ∫ is U+222B.
Its UTF-8 encoding is E2 88 AB, in hexadecimal representation. In octal, this would be 342 210 253. In decimal, it would be 226 136 171. That is, if you represent each byte separately.
If you look at the same 3 bytes as a single number, you have E288AB in hexadecimal; 70504253 in octal; and 14846123 in decimal.
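A quick way to check these numbers is to let Python do the encoding. This is only a small sketch; the variable names are made up for illustration.

```python
# Encode the integral sign (U+222B) and view the same bytes in several bases.
encoded = "\u222b".encode("utf-8")

per_byte_hex = [format(b, "02X") for b in encoded]  # one hex string per byte
per_byte_oct = [format(b, "o") for b in encoded]    # one octal string per byte
per_byte_dec = list(encoded)                        # one decimal value per byte

# Read the three bytes as one big-endian integer instead.
as_one_number = int.from_bytes(encoded, "big")

print(per_byte_hex, per_byte_oct, per_byte_dec)
print(format(as_one_number, "X"), format(as_one_number, "o"), as_one_number)
```

Each byte always fits in 0 to 255, which is why the per-byte decimal form never needs values above 255.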

Related

Is this a bug in the passlib base64 encoding?

I am trying to decode and re-encode a bytestring using passlib's base64 encoding:
from passlib.utils import binary
engine = binary.Base64Engine(binary.HASH64_CHARS)
s2 = engine.encode_bytes(engine.decode_bytes(b"1111111111111111111111w"))
print(s2)
This prints b'1111111111111111111111A' which is of course not what I expected. The last character is different.
Where is my mistake? Is this a bug?
No, it's not a bug.
In all variants of Base64, every encoded character represents just 6 bits and depending on the number of bytes encoded you can end up with 0, 2 or 4 insignificant bits on the end.
In this case the encoded string 1111111111111111111111w is 23 characters long. That means 23*6 = 138 bits, which can be decoded to 17 bytes (136 bits) + 2 insignificant bits.
The encoding you use here is not Base64 but Hash64:
Base64 character map used by a number of hash formats; the ordering is wildly different from the standard base64 character map.
In the character map
HASH64_CHARS = u("./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
we find A at index 12 (001100) and w at index 60 (111100).
Now the 'trick' here is that
binary.Base64Engine(binary.HASH64_CHARS) has a default parameter big=False, which means that the encoding is done in little-endian format by default.
In your example it means that w is 001111 and A is 001100. During decoding the last two bits are cut off, as they are not needed, as explained above. When you encode it again, A is taken as the first character in the character map that can be used to encode 0011 plus two insignificant bits.
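The effect can be reproduced without passlib. The sketch below is not passlib's implementation; decode_le and encode_le are hypothetical helpers that assume the little-endian layout described above (the first character holds the lowest 6 bits) and simply drop the leftover high bits.

```python
# Character map quoted in the answer above.
HASH64_CHARS = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def decode_le(text):
    """Hypothetical helper: little-endian 6-bit decode, discarding leftover bits."""
    value = 0
    for position, char in enumerate(text):
        value |= HASH64_CHARS.index(char) << (6 * position)
    n_bytes = len(text) * 6 // 8            # 23 chars -> 17 whole bytes
    value &= (1 << (8 * n_bytes)) - 1       # cut off the insignificant bits
    return value.to_bytes(n_bytes, "little")

def encode_le(data):
    """Hypothetical helper: little-endian 6-bit encode, padding with zero bits."""
    value = int.from_bytes(data, "little")
    n_chars = (len(data) * 8 + 5) // 6      # round up to whole characters
    return "".join(HASH64_CHARS[(value >> (6 * i)) & 0x3F] for i in range(n_chars))

original = "1111111111111111111111w"
roundtrip = encode_le(decode_le(original))
print(roundtrip)
```

Only the last character changes, because it is the only one that carries bits with no room in the 17 decoded bytes.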

How can U+203C be represented as (226, 128, 188) in the Swift chapter Strings and Characters?

While reading the Strings and Characters chapter of The Swift Programming Language, I could not work out how U+203C (‼, the double exclamation mark) can be represented by (226, 128, 188) in UTF-8.
How does that happen?
I hope you already know how UTF-8 reserves certain bits to indicate that the Unicode character occupies several bytes. (This website can help).
First, write 0x203C in binary:
0x203C = 10000000111100
That is 14 significant bits, too many for the 11 payload bits of a two-byte sequence. Because of the "header bits" in the UTF-8 encoding scheme, it takes 3 bytes, which together carry 16 payload bits, so the value is padded with two leading zeros:
0x203C = 0010 000000 111100
             1st byte  2nd byte  3rd byte
             --------  --------  --------
header       1110      10        10
actual data      0010    000000    111100
-----------------------------------------
full byte    11100010  10000000  10111100
decimal      226       128       188
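The same split can be done with shifts and masks. This is a minimal sketch for the three-byte case only (code points U+0800 to U+FFFF); the variable names are just for illustration.

```python
code_point = 0x203C  # DOUBLE EXCLAMATION MARK

byte1 = 0b11100000 | (code_point >> 12)          # header 1110 + top 4 bits
byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # header 10 + middle 6 bits
byte3 = 0b10000000 | (code_point & 0x3F)         # header 10 + low 6 bits

manual = [byte1, byte2, byte3]
builtin = list(chr(code_point).encode("utf-8"))  # what Python's encoder produces
print(manual, builtin)
```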

Representing HEX: 1D524 with utf-8 representation?

I have to represent a character given by the hexadecimal 1D524 in its utf-8 form (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).
I've already converted the hexadecimal chain to binary, which gives me: 1 1101 0101 0010 0100. But when I try to represent those in utf-8, I get 11110111 10010101 10001001 1000xxxx. Assuming all bytes should have 8 bits, I'm missing 4 bits, so obviously I'm doing something wrong.
Help?
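One likely cause of the missing bits: the four-byte template has 21 x slots (3 + 6 + 6 + 6), while 1D524 is only 17 bits long. The value has to be padded with leading zeros to 21 bits first, so that it fills the template from the right. A small sketch of that, using string slicing to stay close to the pencil-and-paper method (variable names are illustrative):

```python
code_point = 0x1D524

# Pad the 17-bit value with leading zeros to the 21 payload bits of a 4-byte sequence.
bits = format(code_point, "b").zfill(21)

# Split into 3 + 6 + 6 + 6 bits and prepend the header of each byte.
chunks = [bits[0:3], bits[3:9], bits[9:15], bits[15:21]]
headers = ["11110", "10", "10", "10"]
utf8_bits = [header + chunk for header, chunk in zip(headers, chunks)]

utf8_bytes = bytes(int(b, 2) for b in utf8_bits)
print(utf8_bits, utf8_bytes.hex())
```

The result can be compared against chr(0x1D524).encode("utf-8") as a sanity check.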

How to denote 160 bits SHA-1 string as 20 characters in ANSI?

For the input "hello", SHA-1 returns "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d", which is 40 hex digits. I know 1 byte can be written as 1 character, so it should be possible to convert the 160-bit output to 20 characters. But when I look up "aa" in an ASCII table, there is no such hex value, and I'm confused about that. How do I map a 160-bit SHA-1 string to 20 characters in ANSI?
ASCII only has 128 characters (7 bits), while ANSI has 256 (8 bits). As for the ANSI value of hex value AA (decimal 170), the corresponding ANSI character would be ª (see for example here).
Now, you have to keep in mind that a number of both ASCII and ANSI characters (0-31) are non-printable control characters (system bell, null character, etc.), so turning your hash into a readable 20-character string will not be possible in most cases. For instance, your example contains the hex value 0F, which would translate to a shift-in character.
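To see this concretely, here is a short sketch using Python's hashlib. It assumes Latin-1 (ISO 8859-1) as a stand-in for "ANSI", because that codec maps every byte value 0 to 255 to exactly one character; an actual Windows code page may leave some byte values undefined.

```python
import hashlib

digest = hashlib.sha1(b"hello").digest()  # the raw 20 bytes (160 bits)
hex_text = digest.hex()                   # the familiar 40 hex digits

# Assumption: Latin-1 stands in for "ANSI"; one byte becomes one character.
as_chars = digest.decode("latin-1")

# Characters that cannot be displayed, such as the control character 0x0F.
unprintable = [hex(ord(c)) for c in as_chars if not c.isprintable()]

print(len(digest), len(as_chars), repr(as_chars[0]), unprintable)
```

The mapping is reversible (as_chars.encode("latin-1") gives back the digest), but the string is not readable text, which is why hashes are normally shown as hex or Base64.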

0x00000000 hexadecimal?

I had always been taught 0–9 to represent values zero to nine, and A, B, C, D, E, F for 10-15.
I see this format 0x00000000 and it doesn't fit into the pattern of hexadecimal. Is there a guide or a tutorial somewhere that can explain it?
I googled for hexadecimal but I can't find any explanation of it.
So my 2nd question is, is there a name for the 0x00000000 format?
0x simply tells you that the number after it is in hex,
so 0x00 is 0, 0x10 is 16, 0x11 is 17, etc.
The 0x is just a prefix (used in C and many other programming languages) to mean that the following number is in base 16.
Other notations that have been used for hex include:
$ABCD
ABCDh
X'ABCD'
"ABCD"X
Yes, it is hexadecimal.
Otherwise, you couldn't write A on its own, for example: the compiler for C and Java would treat it as a variable identifier. The added prefix 0x tells the compiler it's a hexadecimal number, so:
int ten_i = 10;
int ten_h = 0xA;
ten_i == ten_h; // this boolean expression is true
The leading zeroes indicate the size: 0x0080 hints the number will be stored in two bytes; and 0x00000080 represents four bytes. Such notation is often used for flags: if a certain bit is set, that feature is enabled.
P.S. As an off-topic note: if the number starts with 0, then it's interpreted as octal number, for example 010 == 8. Here 0 is also a prefix.
Everything after the x are hex digits (the 0x is just a prefix to designate hex), representing 32 bits (if you were to put 0xFFFFFFFF in binary, it would be 1111 1111 1111 1111 1111 1111 1111 1111).
Hexadecimal digits are often prefaced with 0x to indicate that they are hexadecimal digits.
In this case, there are 8 digits, each representing 4 bits, so that is 32 bits or a word. I'm guessing you saw this in an error, and it is a memory address. This value means null, as the hex value is 0.
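The points made in these answers can be tried out in a few lines of Python, which uses the same 0x prefix as C and Java. A minimal sketch with illustrative variable names:

```python
# The 0x prefix only changes how the literal is written, not its value.
values = [0x00, 0x10, 0x11, 0xA]

# When parsing text in base 16, the prefix is optional.
parsed = int("0x00000080", 16)
same = int("80", 16)

# Leading zeros are presentation only; here the value is padded back
# to eight hex digits (32 bits) for output.
padded = format(parsed, "#010x")

# Eight hex digits correspond to 32 binary digits.
all_ones = format(0xFFFFFFFF, "b")

# Python spells octal literals 0o10; C and Java use a bare leading 0 (010).
octal_eight = 0o10

print(values, parsed, padded, len(all_ones), octal_eight)
```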