Signed and unsigned integers -- why are bytes treated differently?

I am learning High Level Assembly Language at the moment and was going over the concept of signed and unsigned integers. It seems simple enough; however, sign extension has confused me.
Take the byte 10011010, which I would read as 154 in decimal. Indeed, using a binary calculator with a word or anything larger selected shows this as 154 in decimal.
However, if I select the unit to be a byte and type in 10011010, it is suddenly treated as -102 in decimal. If I then widen it starting from a byte, it is sign extended and always stays -102 in decimal.
If I start with anything larger than a byte, it remains 154 in decimal.
Could somebody please explain this seeming disparity?

When you select the unit as a byte, the MSB of 10011010 is treated as the sign bit, so the one-byte signed interpretation of that bit pattern is -102 (two's complement).
For integer sizes larger than 8 bits, say 16 bits, the number becomes 0000000010011010, which does not have a 1 in the MSB, so it is treated as a positive integer: 154 in decimal. When you convert the signed 8-bit value to a wider type, sign extension copies the sign bit into the new high bits, which preserves the negative interpretation (-102) in the larger storage too.
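If it helps to see this in code, here is a small sketch (in Go, since the behaviour is the same in any language with sized integer types):

package main

import "fmt"

func main() {
	b := byte(0b10011010) // the raw bit pattern 0x9A

	fmt.Println(uint8(b)) // 154: unsigned interpretation

	s := int8(b)   // signed (two's complement) interpretation: MSB is the sign bit
	fmt.Println(s) // -102

	// Widening the signed value sign-extends: the new high bits are copies of
	// the sign bit, so the 16-bit pattern is 1111111110011010 and the value stays -102.
	w := int16(s)
	fmt.Printf("%d  %016b\n", w, uint16(w)) // -102  1111111110011010

	// Widening the unsigned value zero-extends: 0000000010011010 is still 154.
	fmt.Printf("%d  %016b\n", uint16(b), uint16(b)) // 154  0000000010011010
}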

Last character change in Access Token [duplicate]

I was just trying out JWT tokens on a rails app, using this jwt library:
https://github.com/jwt/ruby-jwt
JWT.encode({sss: "333"}, 'SECRET_KEY')
which returns the token below:
eyJhbGciOiJIUzI1NiJ9.eyJzc3MiOiIzMzMifQ.CwX_1FztYHVpyx_G27u938SceilsVc5AB5Akwqlo2HA
Then I decoded the above token:
JWT.decode("eyJhbGciOiJIUzI1NiJ9.eyJzc3MiOiIzMzMifQ.CwX_1FztYHVpyx_G27u938SceilsVc5AB5Akwqlo2HA", 'SECRET_KEY')
which correctly returns the response below:
[{"sss"=>"333"}, {"alg"=>"HS256"}]
But if I change the last letter of the token to 'B' instead of the current 'A', it still returns the same response, which is weird.
JWT.decode("eyJhbGciOiJIUzI1NiJ9.eyJzc3MiOiIzMzMifQ.CwX_1FztYHVpyx_G27u938SceilsVc5AB5Akwqlo2HB", 'SECRET_KEY')
Getting this response even though the token I provided is wrong:
[{"sss"=>"333"}, {"alg"=>"HS256"}]
Actually, I am getting the same response for all characters up to 'D'.
If I use 'F' or other characters above that, it shows an error as expected:
JWT.decode("eyJhbGciOiJIUzI1NiJ9.eyJzc3MiOiIzMzMifQ.CwX_1FztYHVpyx_G27u938SceilsVc5AB5Akwqlo2HF", 'SECRET_KEY')
JWT::VerificationError (Signature verification raised)
from (irb):34
What could be the reason for this? Is it the expected behavior or am I doing something wrong here?
The reason is the base64url encoding. The three parts of a JWT are all base64url encoded. Base64 encoding transforms the input data into a 6-bit representation, mapped to a set of 64 ASCII characters. If you have 3 bytes of source data (24 bits), the base64-encoded result is 4 characters long, each character representing a 6-bit value, so 4 * 6 bits = 24 bits. If the number of bits to be encoded is not divisible by 6 without remainder, there is one extra character carrying 2 or 4 insignificant bits.
In your case, the encoded signature has 43 characters, which means 43 * 6 = 258 bits.
So you could theoretically encode 258 bits, but the signature is only 256 bits (32 bytes) long, which means there are 2 insignificant bits on the end.
A look at the base64 encoding table shows that 'A' to 'D' represent the 6-bit values 0 (000000) to 3 (000011), so the first four bits, which are still significant, are identical for all of them, and only the last two, insignificant bits change. But the character 'E' stands for 4 (000100) and would change the last bit of the 256-bit value.
To see why, consider the last four base64 characters of the signature together with the byte boundaries of the original data: changing the final character within the range 'A' to 'D' only changes the last two bits of the decoded bit string, and those two bits lie beyond the last bit of the original 256-bit signature, so the decoded data itself is unchanged.
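You can check this directly by base64url-decoding the signature part of the token from the question, without involving the JWT library at all. A quick sketch in Go (note that Go's base64 decoder, by default, does not reject non-zero bits in the final, partially used character; there is a Strict mode that does, and the behaviour observed in the question suggests the decoder used there is similarly lenient):

package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

func main() {
	// The 42 leading characters of the signature part of the token in the question.
	const sig = "CwX_1FztYHVpyx_G27u938SceilsVc5AB5Akwqlo2H"

	ref, _ := base64.RawURLEncoding.DecodeString(sig + "A")
	for _, last := range []string{"A", "B", "C", "D", "E"} {
		decoded, err := base64.RawURLEncoding.DecodeString(sig + last)
		if err != nil {
			fmt.Println(last, "-> decode error:", err)
			continue
		}
		// A, B, C and D decode to the same 32 bytes; E changes the last byte.
		fmt.Printf("%s -> last byte %02x, identical to the 'A' version: %t\n",
			last, decoded[len(decoded)-1], bytes.Equal(decoded, ref))
	}
}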
If you're really concerned about the 2 bits on the end, you can consider changing the signature algorithm to HS384.
Then you have a 384-bit (= 48-byte) hash, which is represented in 64 base64 characters. 384 is divisible by both 8 and 6 without remainder, so there are no insignificant bits on the end and any change to the last character will lead to a failed verification.
HS512 would have the same "problem" as HS256, with even 4 insignificant bits on the end, but nonetheless a longer hash (512 bits vs. 384 bits vs. 256 bits) is considered more secure.
Conclusion: it's all fine, nothing wrong here. The verification of a signature is based on its binary value, which is not affected by the peculiarities of the encoding. You can change the algorithm, if you're worried, but I think it's not really necessary and the choice of an algorithm should not be based on that.

A protocol is telling me to encode the numeric value 150 to 0x01 0x50, and the value 35 to 0x00 0x35?

So I'm trying to implement the 'ECR' protocol that talks to a credit card terminal (Ingenico/Telium device in Costa Rica).
The documentation for the 'length' bytes states:
Length of field DATA (it does not include ETX nor LRC)
Example: if length of field Message Data
is 150 bytes; then, 0x01 0x50 is sent.
I would think that the value '150' should be sent as 0x00 0x96.
I've verified that that is not a typo. In a working example message which has 35 bytes of data, they really do send 0x00 0x35.
Am I missing something? Is this form of encoding the decimal representation of a value to its literal representation in hex a thing? Does it have a name? Why would anyone do this?
It has a name, and it was common in the past: it is Binary-Coded Decimal, or BCD for short, see https://en.wikipedia.org/wiki/Binary-coded_decimal.
In fact, Intel CPUs (all but the 64-bit versions) had special instructions to deal with it.
How it works: every decimal digit is encoded in 4 bits (a nibble), so a byte can hold two decimal digits, and a string of such bytes describes an integer. Note that converting to a string (or back from a string) is simple: you split the nibbles and then it is just an addition ('0' + nibble); the C language requires that the character encodings of the digits be consecutive (and ordered).
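A minimal sketch of the packing in Go (the function names are just for illustration), reproducing the 150 -> 0x01 0x50 example from the question:

package main

import "fmt"

// bcdEncode packs a non-negative number into big-endian packed BCD:
// each byte holds two decimal digits, one per nibble.
func bcdEncode(n int, width int) []byte {
	out := make([]byte, width)
	for i := width - 1; i >= 0; i-- {
		lo := byte(n % 10)
		n /= 10
		hi := byte(n % 10)
		n /= 10
		out[i] = hi<<4 | lo
	}
	return out
}

// bcdDecode reverses the packing.
func bcdDecode(b []byte) int {
	n := 0
	for _, v := range b {
		n = n*100 + int(v>>4)*10 + int(v&0x0F)
	}
	return n
}

func main() {
	fmt.Printf("% X\n", bcdEncode(150, 2))     // 01 50
	fmt.Printf("% X\n", bcdEncode(35, 2))      // 00 35
	fmt.Println(bcdDecode([]byte{0x01, 0x50})) // 150
}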
If you work a lot with decimals, it is convenient and fast: there is no need to transform to binary (which requires shifts and additions, or multiplications) and back (shifts or divisions). So in the past, when most CPUs didn't have floating-point co-processors, this was very convenient, especially if you only needed to add or subtract numbers, and there was no need to handle precision errors (which banks don't like; wasn't one of the Superman movies, Superman III, about a villain getting rich by inserting rounding errors into a bank system? It shows the worries of the time).
It also has fewer problems with word size: banks need accounts that can hold billions with a precision of cents. BCD makes it easier to port a program to different platforms, with different endianness and different numbers of bits. Again, this mattered in the past, when 8-bit, 16-bit, 32-bit, 36-bit, etc. machines were common and there was no real standard architecture.
It is an obsolete system: newer CPUs have no problem converting decimal to binary and back, and we have enough bits to handle cents. Note that the financial sector still avoids floating point, using integers with a fixed point (usually 2 decimal digits) instead. But protocols and some sectors tend not to change very often (for interoperability).

In Swift, how is a UTF-16 surrogate pair represented at the bit level?

I'm currently learning Swift using the book The Swift Programming Language (3.1).
The book states that Swift's String and Character types are fully Unicode compliant, with each character represented by a 21-bit Unicode scalar value, and that each character can be viewed in its UTF-8, UTF-16, or UTF-32 encoding.
I understand how UTF-8 and UTF-32 work at the byte and bit level, but I'm having trouble understanding how UTF-16 works at the bit level.
I know that for characters whose code point fits into 16 bits, UTF-16 simply represents the character as a 16-bit number. But for characters that require more than 16 bits, two 16-bit blocks are used (called a surrogate pair, I believe).
But how are those two 16-bit blocks laid out at the bit level?
A "Unicode Scalar Value" is
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Every Unicode scalar value can be represented as a sequence of one or two UTF-16 code units, as described in the
Unicode Standard:
D91 UTF-16 encoding form
The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
Table 3-5. UTF-16 Bit Distribution
Scalar Value                UTF-16
xxxxxxxxxxxxxxxx            xxxxxxxxxxxxxxxx
000uuuuuxxxxxxxxxxxxxxxx    110110wwwwxxxxxx 110111xxxxxxxxxx
Note: wwww = uuuuu - 1
There are 2^20 Unicode scalar values in the "Supplementary Planes" (U+10000..U+10FFFF), which means that 20 bits are sufficient to encode all of them in a surrogate pair. Technically this is done by subtracting 0x010000 from the value before splitting it into two blocks of 10 bits.
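For concreteness, here is that calculation spelled out for U+1F600 (a sketch in Go; Swift's String.utf16 view produces exactly the same code units, since UTF-16 is defined by the standard, not by the language):

package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	r := rune(0x1F600) // a scalar value above U+FFFF, so it needs a surrogate pair

	// Manual encoding, following D91 / Table 3-5:
	v := r - 0x10000           // subtract 0x010000, leaving a 20-bit value
	hi := 0xD800 + (v >> 10)   // high surrogate: 110110 prefix + top 10 bits
	lo := 0xDC00 + (v & 0x3FF) // low surrogate: 110111 prefix + bottom 10 bits
	fmt.Printf("manual: %04X %04X\n", hi, lo) // D83D DE00

	// The standard library agrees:
	h, l := utf16.EncodeRune(r)
	fmt.Printf("stdlib: %04X %04X\n", h, l) // D83D DE00
}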
The UTF-16 range D800..DFFF is reserved. Values below and above that range are ordinary 16-bit code points. A high surrogate in D800..DBFF, minus D800, gives the top 10 bits of the 20-bit offset (code point minus 10000) of a code point beyond FFFF; the following low surrogate in DC00..DFFF, minus DC00, gives the bottom 10 bits. Of course, endianness then comes into the picture, making us all wish we could just use UTF-8. Sigh.

Go: Rune, Int32, and binary encoding

I get that a rune is an alias for int32 because it's supposed to hold all valid Unicode code points. There are apparently about 1,114,112 Unicode code points currently, so it would make sense that they would have to be stored in four bytes, i.e. an int32-sized register, which can store an integer up to 2,147,483,647.
I have a few questions about binary encoding of UTF-8 characters and integers, however.
It appears that rune and int32 both occupy four bytes. If 2147483647 is the highest integer able to be represented in four bytes (four 8-bit octets), why is its binary representation 1111111111111111111111111111111, i.e., 31 1's instead of 32? Is there a bit reserved for its sign? It's possible that there's a bug in the binary converter I used, because -2147483648 should still be able to be represented in 4 bytes (as it's still representable in the int32 type), but it is output there as 1111111111111111111111111111111110000000000000000000000000000000, i.e., 33 1's and 31 0's, which clearly overruns a four-byte allowance. What's the story there?
In the binary conversion, how would the compiler differentiate between a rune like 'C' (01000011, according to the unicode-to-binary table I used) and the integer 67 (also 01000011, according to the binary-to-decimal converter I used)? Intuition tells me that some of the bits must be reserved for that information. Which ones?
I've done a fair amount of Googling, but I'm obviously missing the resources that explain this well, so feel free to explain like I'm 5. Please also feel free to correct terminology misuses.
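A small sketch illustrating the point in question: in Go, rune is a pure compile-time alias for int32, so 'C' and 67 are literally the same 32-bit value, and no bits are reserved for type information; only the format verb (or the code using the value) decides how to interpret it.

package main

import "fmt"

func main() {
	var r rune = 'C' // rune is an alias for int32
	var i int32 = 67

	fmt.Println(r == i)         // true: same type, same 32-bit value
	fmt.Printf("%b\n", r)       // 1000011: the value 67 in binary
	fmt.Printf("%c %d\n", r, r) // C 67: the format verb picks the interpretation

	// The missing 32nd bit in 2147483647 is the sign bit of the two's-complement form.
	fmt.Printf("%032b\n", int32(2147483647))  // 01111111111111111111111111111111
	fmt.Printf("%032b\n", uint32(2147483648)) // 10000000000000000000000000000000, which read as int32 is -2147483648
}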

Clarification on Joel Spolsky's Unicode article

I'm reading the popular Unicode article from Joel Spolsky and there's one illustration that I don't understand.
What does "Hex Min, Hex Max" mean? What do those values represent? Min and max of what?
Binary can only have 1 or 0. Why do I see tons of letter "v" here?
http://www.joelonsoftware.com/articles/Unicode.html
The Hex Min/Max columns define the range of Unicode characters that each row covers (typically represented by their Unicode number in hex).
The v's refer to the bits of the original code point number.
So the first line is saying:
The unicode characters in the range 0 (hex 00) to 127 (hex 7F) (a 7
bit number) are represented by a 1 byte bit string starting with '0'
followed by all 7 bits of the unicode number.
The second line is saying:
The unicode numbers in the range 128 (hex 80) to 2047 (hex 07FF) (an 11
bit number) are represented by a 2 byte bit string where the first
byte starts with '110' followed by the first 5 of the 11 bits, and the
second byte starts with '10' followed by the remaining 6 of the 11 bits
etc
Hope that makes sense
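To make the v's concrete, here is a small sketch (in Go) that prints the actual byte patterns for a 2-byte and a 4-byte example:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// U+00E9 (é) fits in 11 bits, so it takes the 2-byte form 110vvvvv 10vvvvvv,
	// with the 11 bits 000 1110 1001 distributed across the v's.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 'é')
	for _, b := range buf[:n] {
		fmt.Printf("%08b ", b) // 11000011 10101001
	}
	fmt.Println()

	// A code point above U+FFFF needs the 4-byte form 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv.
	n = utf8.EncodeRune(buf, '😀') // U+1F600
	for _, b := range buf[:n] {
		fmt.Printf("%08b ", b) // 11110000 10011111 10011000 10000000
	}
	fmt.Println()
}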
Note that the table in Joel's article covers code points that do not, and never will, exist in Unicode. In fact, UTF-8 never needs more than 4 bytes, though the scheme underlying UTF-8 could be extended much further, as shown.
A more nuanced version of the table is available in How does a file with Chinese characters know how many bytes to use per character? It points out some of the gaps. For example, the bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in valid UTF-8. You can also see information about invalid UTF-8 at Really good bad UTF-8 example test data.
In the table you showed, the Hex Min and Hex Max values are the minimum and maximum U+wxyz values that can be represented using the number of bytes in the 'byte sequence in binary' column. Note that the maximum code point in Unicode is U+10FFFF (and that value is defined/reserved as a non-character). This is the maximum value that can be represented by the surrogate encoding scheme in UTF-16 using just 4 bytes (two UTF-16 code units).