Decoding Arbitrary-Length Values Using a Fixed Block Size? - encoding

Background
In the past I've written an encoder/decoder for converting an integer to/from a string using an arbitrary alphabet; namely this one:
abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789
Lookalike characters are excluded, so 1, I, l, O, and 0 are not present in this alphabet. This was done for user convenience and to make it easier to read and to type out a value.
As mentioned above, my previous project, python-ipminify converts a 32-bit IPv4 address to a string using an alphabet similar to the above, but excluding upper-case characters. In my current undertaking, I don't have the constraint of excluding upper-case characters.
I wrote my own Python for this project using the excellent question and answer here on how to build a URL-shortener.
I have published a stand-alone example of the logic here as a Gist.
Problem
I'm now writing a performance-critical implementation of this in a compiled language, most likely Rust, but I'd need to port it to other languages as well.. I'm also having to accept an arbitrary-length array of bytes, rather than an arbitrary-width integer, as is the case in Python.
I suppose that as long as I use an unsigned integer and use consistent endianness, I could treat the byte array as one long arbitrary-precision unsigned integer and do division over it, though I'm not sure how performance will scale with that. I'd hope that arbitrary-precision unsigned integer libraries would try to use vector instructions where possible, but I'm not sure how this would work when the input length does not match a specific instruction length, i.e. when the input size in bits is not evenly divisible by supported instructions, e.g. 8, 16, 32, 64, 128, 256, 512 bits.
I have also considered breaking up the byte array into 256-bit (32 byte) blocks and using SIMD instructions (I only need to support x86_64 on recent CPUs) directly to operate on larger unsigned integers, but I'm not exactly sure how to deal with size % 32 != 0 blocks; I'd probably need to zero-pad, but I'm not clear on how I would know when to do this during decoding, i.e. when I don't know the underlying length of the source value, only that of the decoded value.
Question
If I'm going the arbitrary unsigned integer width route, I'd essentially be at the mercy of the library author, which is probably fine; I'd imagine that these libraries would be fairly optimized to vectorize as much as possible.
If I try to go the block route, I'd probably zero-pad any remaining bits in the block if the input length was not divisible by the block size during encoding. However, would it even be possible to decode such a value without knowing the decoded value size?

Related

Avoiding long integers (32 bit) in Simulink-generated code

I'm working with some signal processing code (C language) generated by Matlab Simulink, targetting a DSP with 24-bit integers. The code that Simulink generated relies upon the existence of 32-bit integers and uses them in calculations, only at the end truncating the higher-order bits into the 24-bit result. Unfortunately the compiler for this architecture targets a limited subset of C and doesn't currently support 32-bit longs, instead having short/int/long all as the same 24-bit integer.
We've tried specifying the bit-widths of the integer types for the custom processor target as 24 bits, however this gave errors and the documentation for hardware targets appears to confirm that this is not permitted (3rd bullet):
The Number of bits parameters describe the native word size of the microprocessor and the bit lengths of char, short, int, and long data. For code generation to succeed:
The bit lengths must be such that char <= short <= int <= long.
Bit lengths must be multiples of 8, with a maximum of 32.
The bit length for long data must not be less than 32.
However as a Simulink neophyte it's quite possible I'm looking in the wrong places - is it in fact possible to have Simulink target a device with only 24-bit integers?

Compressing Sets of Integers Into Smaller Integers

Along the lines of How to encode integers into other integers, I am wondering if it is possible to encode one integer or a set of integers into one smaller integer or a smaller set of integers, and if so, how it is done. For example, encoding an 8 bit integer into a 4 bit integer, a 256 integer into a 16 bit integer. It doesn't seem possible but perhaps there is something along these lines. Basically, how to get a set of integers to take up less space. Not necessarily encoding into another sequence of bytes, but maybe even into a data structure that is more compact.
Sure, you can always encode them into fewer bits. However you won't be able to decode them back to the original bits. Though you neglected to mention that step, I'm guessing that's what you're looking for.

Rationale for CBOR negative integers

I am confused as to why CBOR chooses to encode negative integers as unsigned binary numbers with the value defined as -1 minus the unsigned value, instead of e.g. regular two's complement representation. Is there an obvious advantage that I'm missing, apart from increased negative range (which, IMO, is of questionable value weighed against increased complexity)?
Advantages:
There's only one allowed encoding type for each integer value, so all encoders will emit consistent output. If the encoders use the shortest encoding for each value as recommended by the spec, they'll emit identical output.
Picking the shortest numeric field is easier for non-negative numbers than for signed negative numbers, and CBOR aims for tiny IOT devices to readily transmit data.
It fits twice as many values into each integer encoding field width, thus making the data more compact. (It'd be yet more compact if the integer encodings didn't overlap, but that'd be notably more complicated.)
It can handle twice as large a negative value before needing the bignum extension.

How are datatypes that need more than 32 bits stored in a 32 bit OS

How are datatypes that would need more than 32 bits stored in the system?
For example consider an unsigned int or a long which can have a value greater than 2 to the power of 32, how is it stored in the memory?
Any OS, or compiler, will use the number of bits that it needs. So if an OS or language has a need for 64-bit integers, it will just store such integers into an 8-byte representation.
There are standards for this, for integers as well as floating point numbers. See this article on Wikipedia for more: http://en.wikipedia.org/wiki/Computer_numbering_formats
The 32-bits in a 32-bit architecture to the number of bits the CPU registers are wide (There are some exceptions, such as floating point registers). This does not mean that the system can't handle datatypes larger than this, only that it must deal with these datatypes 32-bits at a time.
For example, machines have an "Add With Carry" instruction, which allows the machine to chain link multiple adds together so that arbitrarily sized numbers, say two 512-bit numbers, can be added in 16 steps (512/32).

Are there any real-world uses for converting numbers between different bases?

I know that we need to convert decimal, octal, and hexadecimal into binary, but I am confused about conversion of decimal to octal or octal to hexadecimal or decimal to hexadecimal.
Why and where we need these types of conversion?
Different bases are good for different purposes.
Decimal is obviously what most people know how to deal with, so is good for output of real quantities to end users.
Hex is short and has an even ratio of exactly 2 characters per byte, so it's good for expressing large numbers like SHA1 hashes or private keys and the like in a type-able format, particularly since those numbers don't really represent a quantity, so users don't need to be able to understand them as numbers.
Octal is mostly for legacy reasons -- UNIX file permission codes are traditionally expressed as octal numbers, for example, because three bits per digit corresponds nicely to the three bits per user-category of the UNIX permission encoding scheme.
One sometimes will want to use numbers in one base for a purpose where another base is desired. Thus, the various conversion functions available. In truth, however, my experience is that in practice you almost never convert from one base to another much, except to convert numbers from some non-binary base into binary (in the form of your language of choice's native integral type) and back out into whatever base you need to output. Most of the time one goes from one non-binary base to another is when learning about bases and getting a feel for what numbers in different bases look like, or when debugging using hexadecimal output. Even then, if a computer does it the main method is to convert to binary and then back out, because current computers are just inherently good at dealing with base-2 numbers and not-so-good at anything else.
One important place you see numbers actually stored and operated on in decimal is in some financial applications or others where it's important that "number-of-decimal-place" level precision be preserved. Sometimes fixed-point arithmetic can work for currency, but not always, and if it doesn't using binary-floating-point is a bad idea. Older systems actually had built in support for this in the form of binary-coded-decimal arithmetic. In BCD, each 4 bits acts as a decimal digit, so you give up a chunk of every 4 bits of storage in exchange for maintaining your level of precision in the base-of-choice of the non-computing world.
Oddly enough, there is one common use case for other bases that's a bit hidden. Modern languages with large number support (e.g. Python 2.x's long type or Java's BigInteger and BigDecimal type) will usually store the numbers internally in an array with each element being a digit in some base. Then they implement the math they support on strings of digits of that base. Really efficient bigint implementations may actually use use a base approaching 2^(bits in machine native word size); a base 2^64 number is obviously impossible to usefully output in that form, but doing the calculations in chunks of that size ends up making the best use of space and the CPU. (I don't know if that's the best base; it may be best to use a base of half that number of bits to simplify overflow handling from one digit to the next. It's been awhile since I wrote my own bigint and I never implemented the faster/more-complicated versions of multiplication and division.)
MIME uses hexadecimal system for Quoted Printable encoding (e.g. mail subject in Unicode) abd 64-based system for Base64 encoding.
If your workplace is stuck in IPv4 CIDR - you'll be doing quite a lot of bin -> hex -> decimal conversions managing most of the networking equipment until you get them memorized (or just use some random, simple tool).
Even that usage is a bit few-and-far-between - most businesses just adopt the lazy "/24 everything" approach.
If you do a lot of graphics work - there's the chance you'll want to convert colors between systems and need to convert from hex -> dec... most tools have this built in to the color picker, though.
I suppose there's no practical reason to be able to do other than it's really simple and there's no point not learning how to do it. :)
... unless, for some reason, you're trying to do mantissa binary math in your head.
All of these bases have their uses. Hexadecimal in particular is useful as a shorthand for binary. Every hexadecimal digit is equivalent to 4 bits, so you can write a full 32-bit value as a string of 8 hex digits. Likewise, octal digits are equivalent to 3 bits, and are used frequently as a shorthand for things like Unix file permissions (777 = set read, write, execute bits for user/group/other).
No one base is special--they all have their (obscure) uses. Decimal is special to us because it reflects human experience (10 fingers) but that's really the only reason.
A real world use case: a program prints error code in decimal, to get info from a database or the internet you need the hexadecimal format, because the bits of the error 'number' convey extra info you need to look at it in binary.
I'm there are occasional uses for this. One use case would be a little app that allows user who wants to convert decimal to octal ... like you can with lots of calculators.
But I'm not sure I understand the point of the question. Standard libraries typically don't provide methods like String toOctal(String decimal). Instead, you would normally convert from a decimal String to a primitive integer and then from the primitive integer to (say) an octal String.