Rationale for CBOR negative integer encoding

I am confused as to why CBOR chooses to encode negative integers as unsigned binary numbers with the value defined as -1 minus the unsigned value, instead of e.g. regular two's complement representation. Is there an obvious advantage that I'm missing, apart from increased negative range (which, IMO, is of questionable value weighed against increased complexity)?

Advantages:
There's only one allowed encoding type for each integer value, so all encoders will emit consistent output. If the encoders use the shortest encoding for each value as recommended by the spec, they'll emit identical output.
Picking the shortest numeric field is easier for non-negative numbers than for signed negative numbers, and CBOR aims to let tiny IoT devices transmit data readily.
It fits twice as many values into each integer encoding field width, thus making the data more compact. (It'd be yet more compact if the integer encodings didn't overlap, but that'd be notably more complicated.)
It can handle twice as large a negative value before needing the bignum extension.
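For concreteness, here is a minimal sketch in Python of the major type 1 rule described above (the function name is my own, and this is not a full CBOR encoder, just the negative-integer case):

import struct

def cbor_encode_negative(n: int) -> bytes:
    # CBOR major type 1: the encoded argument is the unsigned value -1 - n
    # (RFC 8949). This sketch omits the bignum fallback needed for n < -2**64.
    assert n < 0
    arg = -1 - n                       # e.g. -500 -> 499
    mt = 1 << 5                        # major type 1 in the top three bits
    if arg < 24:
        return bytes([mt | arg])       # argument packed into the initial byte
    elif arg < 2**8:
        return bytes([mt | 24]) + struct.pack(">B", arg)
    elif arg < 2**16:
        return bytes([mt | 25]) + struct.pack(">H", arg)
    elif arg < 2**32:
        return bytes([mt | 26]) + struct.pack(">I", arg)
    else:
        return bytes([mt | 27]) + struct.pack(">Q", arg)

print(cbor_encode_negative(-1).hex())    # 20
print(cbor_encode_negative(-500).hex())  # 3901f3

Because the argument is unsigned, the same fixed field widths that hold 0 through 2^64-1 for major type 0 hold -1 through -2^64 here, which is where the doubled negative range in the last point comes from.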

Decoding Arbitrary-Length Values Using a Fixed Block Size?

Background
In the past I've written an encoder/decoder for converting an integer to/from a string using an arbitrary alphabet; namely this one:
abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789
Lookalike characters are excluded, so 1, I, l, O, and 0 are not present in this alphabet; this was done for user convenience, to make a value easier to read and to type out.
As mentioned above, my previous project, python-ipminify, converts a 32-bit IPv4 address to a string using an alphabet similar to the above, but excluding upper-case characters. In my current undertaking, I don't have the constraint of excluding upper-case characters.
I wrote my own Python implementation for this project, based on the excellent question and answer here on how to build a URL shortener.
I have published a stand-alone example of the logic here as a Gist.
Problem
I'm now writing a performance-critical implementation of this in a compiled language, most likely Rust, but I'd need to port it to other languages as well. I'm also having to accept an arbitrary-length array of bytes, rather than an arbitrary-width integer, as is the case in Python.
I suppose that as long as I use an unsigned integer and a consistent endianness, I could treat the byte array as one long arbitrary-precision unsigned integer and do division over it, though I'm not sure how performance will scale with that. I'd hope that arbitrary-precision unsigned integer libraries would use vector instructions where possible, but I'm not sure how that works when the input size in bits is not evenly divisible by the supported operand widths, e.g. 8, 16, 32, 64, 128, 256, or 512 bits.
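For reference, a rough sketch of that big-integer approach in Python (the function names are mine; Python's built-in int is arbitrary precision, so int.from_bytes/int.to_bytes do the heavy lifting):

ALPHABET = "abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789"
BASE = len(ALPHABET)

def encode(data: bytes) -> str:
    # Treat the whole byte array as one big-endian unsigned integer.
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or ALPHABET[0]

def decode(text: str, length: int) -> bytes:
    # The original byte length has to be supplied (or stored alongside the
    # string), because leading zero bytes are otherwise unrecoverable.
    n = 0
    for ch in text:
        n = n * BASE + ALPHABET.index(ch)
    return n.to_bytes(length, "big")

Note that decode needs the original byte length back, which is essentially the same ambiguity about leading zeros that the Question section below raises for zero-padded blocks.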
I have also considered breaking up the byte array into 256-bit (32 byte) blocks and using SIMD instructions (I only need to support x86_64 on recent CPUs) directly to operate on larger unsigned integers, but I'm not exactly sure how to deal with size % 32 != 0 blocks; I'd probably need to zero-pad, but I'm not clear on how I would know when to do this during decoding, i.e. when I don't know the underlying length of the source value, only that of the decoded value.
Question
If I'm going the arbitrary unsigned integer width route, I'd essentially be at the mercy of the library author, which is probably fine; I'd imagine that these libraries would be fairly optimized to vectorize as much as possible.
If I try to go the block route, I'd probably zero-pad any remaining bits in the block if the input length was not divisible by the block size during encoding. However, would it even be possible to decode such a value without knowing the decoded value size?

Why does NumberLong(9007199254740993) match NumberLong(9007199254740992) in MongoDB from the mongo shell?

This situation happens when the given number is big enough (greater than 9007199254740992). With more testing, I found that many adjacent numbers can match a single number.
Not only does NumberLong(9007199254740996) match NumberLong("9007199254740996"), but so do NumberLong(9007199254740995) and NumberLong(9007199254740997).
When I want to act on a record using its number, I can actually use three different adjacent numbers to get back the same record.
The accepted answer from here makes sense; I quote the most relevant part below:
Caveat: Don't try to invoke the constructor with a too large number, i.e. don't try db.foo.insert({"t" : NumberLong(1234657890132456789)}); Since that number is way too large for a double, it will cause roundoff errors. Above number would be converted to NumberLong("1234657890132456704"), which is wrong, obviously.
Here are some supplements to make things more clear:
Firstly, the mongo shell is a JavaScript shell, and JS does not distinguish between integer and floating-point values: all numbers in JS are represented as floating-point values, which means the mongo shell uses 64-bit floating-point numbers by default. If the shell sees "9007199254740995", it will treat it as a string and convert it to a long long. But when we omit the double quotes, the shell sees the unquoted 9007199254740995 and treats it as a floating-point number.
Secondly, JS uses the 64-bit floating-point format defined in the IEEE 754 standard to represent numbers. The largest magnitude it can represent is about 1.7976931348623157 × 10^308, and the smallest non-zero magnitude is about 5 × 10^-324.
There are an infinite number of real numbers, but only a limited number of real numbers can be accurately represented in the JS floating point format. This means that when you deal with real numbers in JS, the representation of the numbers will usually be an approximation of the actual numbers.
This brings about the so-called rounding-error issue. Because integers are also represented in the binary floating-point format, the loss of precision in the trailing digits happens for the same reason it does for decimal fractions.
The JS number format can accurately represent all integers between -2^53 (-9007199254740992) and 2^53 (9007199254740992).
Here, since the numbers are bigger than 9007199254740992, rounding error certainly occurs. The binary representations of NumberLong(9007199254740995), NumberLong(9007199254740996) and NumberLong(9007199254740997) are the same. So when we query with these three numbers in this way, we are practically asking for the same thing; as a result, we get back the same record.
It is important to understand that this problem is not specific to JS: it affects any programming language that uses binary floating-point numbers.
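To see this outside the mongo shell, here is a quick check in Python, whose float is the same IEEE 754 binary64 type:

# 2**53 + 1 is the first integer a double cannot represent exactly.
print(9007199254740993.0 == 9007199254740992.0)        # True
# Above 2**53 only even integers are representable, so these three
# literals all round to the same double, 9007199254740996.0:
print(float(9007199254740995) == float(9007199254740996) == float(9007199254740997))  # True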
You are misusing the NumberLong constructor.
The correct usage is to give it a string argument, as stated in the relevant documentation.
NumberLong("2090845886852")

Portability of auto kind/type conversions in numerical operations in Fortran

According to the Fortran standard, if the operands of a numeric operation have different data kind/types, then the resulting value has a kind/type determined by the operand with greater decimal precision. Before the operation is evaluated, the operand with the lower decimal precision is first converted to the higher-precision kind/type.
Now, the use of a high-precision data kind/type implies there is accuracy to a certain level of significant digits, but kind/type conversion does not seem to guarantee such things1. For this reason, I avoid mixing single- and double-precision reals.
But does this mean that automatic kind/type conversions should be avoided at all costs? For example, I would not hesitate to write x = y**2 where both x and y are reals (of the same kind), but the exponent is an integer.
Let us limit the scope of this question to the result of a single operation between two operands. We are not considering the outcome of equations with operations between multiple values where other issues might creep in.
Let us also assume we are using a portable type/kind system. For example, in the code below selected_real_kind is used to define the kind assigned to double-precision real values.
Then, I have two questions regarding numerical expressions with type/kind conversions between two operands:
Is it "portable", in practice? Can we expect the same result for an operation that uses automatic type/kind conversion from different compilers?
Is it "accurate" (and "portable") if the lower-precision operands are limited to integers or whole-number reals? To make this clear, can we always assume that 0==0.0d0, 1==1.0d0, 2==2.0d0, ... , for all compilers? And if so, then can we always assume that simple expressions such as (1 - 0.1230d0) == (1.0d0 - 0.1230d0) are true, and therefore the conversion is both accurate and portable?
To provide a simple example, would automatic conversion from an integer to a double-precision real like shown in the code below be accurate and/or portable?
program main
  implicit none
  integer, parameter :: dp = selected_real_kind(p=15)
  print *, ((42 - 0.10_dp) == (42.0_dp - 0.10_dp))
end program
I have tested with gfortran and ifort, using different operands and operations, but have yet to see anything to cause concern as long as I limit the conversions to integers or whole-number reals. Am I missing anything here, or just revealing my non-CS background?
1According to these Intel Fortran docs (for example), integers converted to a real type have their fractional part filled with zeros. For the conversion of a single-precision real to a higher-precision real, the additional significand bits of the higher-precision result are set to zero. So, for example, when a single-precision real operand with a non-zero fractional part (such as 1.2) is converted to a double, the conversion does not automatically increase the accuracy of the value: 1.2 does not become 1.2000000000000000d0 but instead becomes something like 1.2000000476837158d0. How much this actually matters probably depends on the application.
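The behaviour described in the footnote is not Fortran-specific; a quick check with Python and NumPy (assuming NumPy is available) shows the same widening, since binary32 and binary64 work the same way there:

import numpy as np

single = np.float32(1.2)           # nearest binary32 value to 1.2
widened = np.float64(single)       # extra significand bits are filled with zeros
print(widened)                     # 1.2000000476837158
print(widened == np.float64(1.2))  # False: widening does not restore accuracy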

Kademlia XOR Distance as an Integer

In the Kademlia paper it mentions using the XOR of the NodeID interpreted as an integer. Let's pretend my NodeID1 is aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d and my NodeID2 is ab4d8d2a5f480a137067da17100271cd176607a1. What's the appropriate way to interpret this as an integer for comparison of NodeID1 and NodeID2? Would I convert these into BigInt and XOR those two BigInts? I saw that in one implementation. Could I also just convert each NodeID into decimal and XOR those values?
I found this question but I'm trying to better understand exactly how this works.
Note: This isn't for implementation, I'm just trying to understand how the integer interpretation works.
For a basic Kademlia implementation you only need two bitwise operations on the IDs: XOR and comparison. In both cases the ID is conceptually a 160-bit unsigned integer with overflow, i.e. modulo 2^160 arithmetic. It can be decomposed into an array of 20 bytes or of 5 u32 values, assuming correct endianness conversion in the latter case. The most common endianness for network protocols is big-endian, so byte 0 will contain the most significant 8 bits out of 160.
The XOR and comparison can then be applied on a subunit-by-subunit basis: the XOR is just an XOR over all the bytes, and the comparison is a lexicographic array comparison.
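As an illustration, a minimal sketch of both operations on 20-byte big-endian IDs in Python (the helper names are my own):

def xor_distance(a: bytes, b: bytes) -> bytes:
    # XOR is applied byte by byte; there are no carries, so each byte is independent.
    assert len(a) == len(b) == 20
    return bytes(x ^ y for x, y in zip(a, b))

def is_closer(d1: bytes, d2: bytes) -> bool:
    # For equal-length big-endian arrays, lexicographic comparison is exactly
    # the comparison of the 160-bit unsigned integers they encode.
    return d1 < d2

id1 = bytes.fromhex("aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d")
id2 = bytes.fromhex("ab4d8d2a5f480a137067da17100271cd176607a1")
d = xor_distance(id1, id2)
print(d.hex(), int.from_bytes(d, "big"))  # the same distance viewed as bytes and as one big integer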
Using bigint library functions is probably sufficient for an implementation, but it is not optimal, because bigints have size and signedness overhead compared to implementing the necessary bit-twiddling on fixed-size arrays.
A more complete implementation may also need some additional arithmetic and utility functions.
Could I also just convert each NodeID into decimal and XOR those values?
Considering the size of the numbers, a decimal representation is not particularly useful. For the human reader, hexadecimal or the individual bits are more useful, and computers operate on binary, practically never on decimal.

Compressing Sets of Integers Into Smaller Integers

Along the lines of How to encode integers into other integers, I am wondering if it is possible to encode one integer or a set of integers into one smaller integer or a smaller set of integers, and if so, how it is done. For example, encoding an 8-bit integer into a 4-bit integer, or a 256-bit integer into a 16-bit integer. It doesn't seem possible, but perhaps there is something along these lines. Basically, how do I get a set of integers to take up less space? Not necessarily encoding into another sequence of bytes, but maybe even into a data structure that is more compact.
Sure, you can always encode them into fewer bits. However, you won't be able to decode them back to the original bits. Though you neglected to mention that step, I'm guessing that's what you're looking for.
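A toy Python check of the pigeonhole problem: keeping only 4 of an 8-bit value's bits maps 16 distinct inputs onto every output, so the original value cannot be recovered:

from collections import Counter

# Map every possible byte onto its low 4 bits and count the collisions.
counts = Counter(b & 0x0F for b in range(256))
print(counts[0x0])   # 16 -- sixteen different bytes share each 4-bit code
print(len(counts))   # 16 distinct outputs for 256 distinct inputs

Lossless shrinking is only possible when the set of values you actually use is smaller than the full range, which is what general-purpose compression and more compact data structures exploit.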