Does a pronounceable encoding exist? - encoding

I am using UUIDs, but they are not particularly nice to read, write, and communicate, so I would like to encode them. I could use Base64 or Base32, but neither is much friendlier: Base64 mixes upper- and lower-case letters with symbols, and Base32 is a bit better but can still produce clumsy strings.
I was wondering if there's a nice and clean way to encode a number into palatable phonemes, so to achieve better readability and hopefully a bit of compression.

I hope you don't use this idea: The Automated Curse Generator :)

Bubble Babble is a good one to try. It generates nonsensical but readable output like:
xesef-disof-gytuf-katof-movif-baxux

This question is very old; interestingly, as old as the solution I'm about to present, but it hasn't been mentioned here yet.
It's Proquint. Similar to Bubble Babble, but the differences make the results easier to read, in my opinion.
Here's how it works, from their documentation:
In sum, we propose encoding a 16-bit string as a proquint [PRO-nouncable QUINT-uplet] of alternating consonants and vowels as follows.
Four-bits as a consonant:
0 1 2 3 4 5 6 7 8 9 A B C D E F
b d f g h j k l m n p r s t v z
Two-bits as a vowel:
0 1 2 3
a i o u
Whole 16-bit word, where "con" = consonant, "vo" = vowel:
 0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|con    |vo |con    |vo |con    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Separate proquints using dashes, which can go un-pronounced or be pronounced "eh". The suggested
optional magic number prefix to a sequence of proquints is "0q-".
Here are some IP dotted-quads and their corresponding proquints.
127.0.0.1 lusab-babad
63.84.220.193 gutih-tugad
63.118.7.35 gutuk-bisog
140.98.193.141 mudof-sakat
64.255.6.200 haguz-biram
128.30.52.45 mabiv-gibot
147.67.119.2 natag-lisaf
212.58.253.68 tibup-zujah
216.35.68.215 tobog-higil
216.68.232.21 todah-vobij
198.81.129.136 sinid-makam
12.110.110.204 budov-kuras
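For illustration, here is a minimal Python sketch of the 16-bit mapping quoted above (my own reading of the spec, not the reference implementation):

CONSONANTS = "bdfghjklmnprstvz"  # one consonant per 4-bit value
VOWELS = "aiou"                  # one vowel per 2-bit value

def proquint16(word):
    """Encode one 16-bit integer as a five-letter proquint."""
    assert 0 <= word <= 0xFFFF
    return (CONSONANTS[(word >> 12) & 0xF] +
            VOWELS[(word >> 10) & 0x3] +
            CONSONANTS[(word >> 6) & 0xF] +
            VOWELS[(word >> 4) & 0x3] +
            CONSONANTS[word & 0xF])

def ipv4_to_proquint(a, b, c, d):
    """Encode a dotted quad as two dash-separated proquints."""
    return proquint16((a << 8) | b) + "-" + proquint16((c << 8) | d)

print(ipv4_to_proquint(127, 0, 0, 1))  # lusab-babad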

Why not use something similar to what PGP does to create readable keys? Simply find a nice list of distinctive words. If you're using 128-bit UUIDs, a list of 256 words (2^8) means each UUID becomes 16 words.
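A rough Python sketch of that approach, using a placeholder 256-entry word list (in practice you would choose 256 short, distinct, pronounceable words, for example from the PGP word list):

import uuid

WORDS = ["word%03d" % i for i in range(256)]  # placeholder list of 256 words

def uuid_to_words(u):
    # one word per byte of the 128-bit UUID -> 16 words
    return "-".join(WORDS[b] for b in u.bytes)

def words_to_uuid(s):
    index = {w: i for i, w in enumerate(WORDS)}
    return uuid.UUID(bytes=bytes(index[w] for w in s.split("-")))

u = uuid.uuid4()
print(uuid_to_words(u))
assert words_to_uuid(uuid_to_words(u)) == u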
Stupid question, but why are people reading/writing UUIDs etc. in your application?

If all you want is a way to communicate hex values readably (ie, over the phone, or when instructing someone verbally what to type), then I suggest you use one of the various phonetic alphabets, such as the NATO Phonetic Alphabet or the US Army/Navy Phonetic Alphabet.
In the latter, the letters A-F are spoken as "able", "baker", "charlie", "dog", "easy", and "fox", respectively, so you would read the hex sequence "3fd2cc0e" as "three fox dog two charlie charlie zero easy". A uuid would be read out in exactly the same fashion.
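For example, a small Python sketch of that reading, using spelled-out digits plus the Army/Navy letters quoted above:

SPOKEN = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "a": "able", "b": "baker", "c": "charlie",
    "d": "dog", "e": "easy", "f": "fox",
}

def speak_hex(s):
    # read a hex string aloud, one digit at a time
    return " ".join(SPOKEN[ch] for ch in s.lower())

print(speak_hex("3fd2cc0e"))  # three fox dog two charlie charlie zero easy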

Bubble Babble and Base32 are inefficient, especially in your case. I suggest that you make your own algorithm. Since there are 20 consonants and 6 vowels (including 'y'), you can form approximately 20*6*2 + 6*6 = 276 consonant-vowel, vowel-consonant, and vowel-vowel pairs, so every byte of your number can be represented by a pair. With a bit of tweaking your algorithm could produce pronounceable words much shorter than Bubble Babble. You could even experiment and replace all odd digits with a single consonant or vowel. For example, 0123456789ABCDEF (hex) encodes to ABECIDOFUGYHKRM, and 3141592654 (dec) encodes to HHIA-ROIR. You are left with ten spare consonants which can be paired with vowels to replace some double consonants, etc.
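A possible Python sketch of that pairing idea (the letter sets and pair ordering here are arbitrary choices of mine, not a standard encoding):

CONSONANTS = "bcdfghjklmnpqrstvwxz"  # 20 consonants
VOWELS = "aeiouy"                    # 6 vowels, counting 'y'

# 20*6 + 6*20 + 6*6 = 276 distinct two-letter pairs; only the first 256 are used
PAIRS = ([c + v for c in CONSONANTS for v in VOWELS] +
         [v + c for v in VOWELS for c in CONSONANTS] +
         [v + w for v in VOWELS for w in VOWELS])

def encode_bytes(data):
    return "".join(PAIRS[b] for b in data)

def decode(text):
    index = {p: i for i, p in enumerate(PAIRS[:256])}
    return bytes(index[text[i:i + 2]] for i in range(0, len(text), 2))

data = (3141592654).to_bytes(4, "big")
print(encode_bytes(data))
assert decode(encode_bytes(data)) == data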

S/KEY uses a dictionary of 2048 words to map 64 bit numbers to a sequence of 6 predefined words/syllables. (People will always find swear words if they are looking for them ;) )
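Roughly, the S/KEY mapping looks like this (a sketch only: the real RFC 2289 word list is not included here, and the 2-bit checksum it folds into the last word is skipped):

WORDS2048 = ["w%04d" % i for i in range(2048)]  # placeholder for the 2048-word dictionary

def encode64(n):
    # split a 64-bit value (padded to 66 bits) into six 11-bit chunks
    assert 0 <= n < 2 ** 64
    n <<= 2  # real S/KEY puts a 2-bit checksum in these low bits
    chunks = [(n >> (11 * i)) & 0x7FF for i in reversed(range(6))]
    return " ".join(WORDS2048[c] for c in chunks)

print(encode64(0x1234567890ABCDEF))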

Urbit's phonetic naming system hasn't been mentioned yet. It uses 3 characters for 8 bits and 6 for 16, so it's less efficient than Proquint or Bubble Babble, but it divides more cleanly: each byte gets its own syllable.

and hopefully a bit of compression
Not sure exactly what you mean there; making something "readable" or "pronounceable" will inevitably expand the space required for it. Maybe you meant "hopefully a bit of redundancy"? It would be good if, even if the user makes a small mistake, the system can detect and perhaps even correct it.
Really it depends very much on how big your UUIDs are and how they are most often communicated. If they need to be communicated over phone or VoIP, you want more audible redundancy. If they need to be entered into mobile devices with numeric keypads, it tends to be difficult to enter alphabetic characters, moreso if they are case-sensitive. If they are written down a lot, you need to worry about characters that look similar (O and 0 and o, for instance). If they need to be memorised, then probably strings of real words are the best (have a look at the PGP Word List).
However, I think a great all-round solution is just using numeric digits. They're a lot harder to confuse with each other (both when spoken and written) than some alphabetic characters, easy to enter on mobile devices, and people aren't too bad at memorising numbers.
And the length of the string is not too bad either. Let's compare base32 with base 10 (decimal). The length of a decimal string is log_10(32) times the length of the corresponding base32 string, or about 1.5 times as long; ten characters of base32 correspond to about 15 decimal digits.
Not much of a penalty, IMO, seeing as in base 32 it's easy to confuse C and T, or S, F and X (when spoken), and someone speaking with a foreign accent is more likely to cause trouble.
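To make the length comparison concrete for a 128-bit UUID (a quick Python check of the numbers above):

import math

bits = 128
print(math.ceil(bits / math.log2(32)))  # 26 base32 characters
print(math.ceil(bits / math.log2(10)))  # 39 decimal digits
print(math.log(32, 10))                 # ~1.51, i.e. decimal is about 1.5x longer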

If they were easy to read they probably wouldn't be particularly unique.

Related

How can I calculate the impact on collision probability when truncating a hash?

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.
I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given I have a limited Math background), let alone use to determine the impact on collision probability that truncating the hash would have.
The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.
You can calculate the chance of collisions with this formula:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).
mypetition's answer is great.
I found a few other equations that are more-or-less accurate and/or simplified here, along with a great explanation and a handy comparison of real-world probabilities:
1 − e^(−k(k−1)/(2N)) - sample plot here
k(k−1)/(2N) - sample plot here
k^2/(2N) - sample plot here
...where k is the number of IDs you'll be generating (the "messages") and N is the number of distinct values your hash digest (or your truncated hexadecimal number) can represent; that is the largest representable number plus 1, to account for 0.
A bit more about "N"
If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars) and you truncate it down to, say, 12 hex chars (e.g. "38BF05A71DDF"), the largest number you could produce is 0xFFFFFFFFFFFF (281474976710655, which is 16^12 - 1, or 256^6 - 1 if you prefer to think in terms of bytes). But since "0" itself counts as one of the values you could theoretically produce, you add that 1 back, which leaves you simply with 16^12.
So you can think of N as 16 ^ (numberOfHexDigits).
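Plugging the question's numbers into the approximation (assuming 10,000 IDs and truncation to 12 hex characters):

import math

k = 10_000            # number of IDs ("messages")
N = 16 ** 12          # values representable by 12 hex characters

p = 1 - math.exp(-k * (k - 1) / (2 * N))
print(p)              # ~1.8e-7, roughly a 1 in 5.6 million chance of any collision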

What is a realistic maximum number of unicode combining characters?

I'm looking for the maximum number of Unicode combining characters that appear after a non-combining one in realistic natural text.
I know that in Unicode text there can be an arbitrary number of combining characters placed anywhere in the text. However, I am writing a specialized application that has to operate under constrained resources, and for that and other technical reasons, displaying an arbitrary number of combining characters after a non-combining one is not an option. I would still like to display natural languages properly if possible, and support for a small number of combining characters should not be a problem.
My intuition is that natural languages don't need more than two or three combining characters after a base character, but I'm not sure and can't find any source for that number.
OK, for lack of a better answer, here's what I did (for future reference if needed):
I ended up using a SmallVec-like thing with a threshold of 8 bytes before allocation and an upper limit of about 50 bytes (text stored in UTF-8). That should make everyone happy, I think, and performance doesn't suffer.
Take those numbers with a pinch of salt, they are arbitrary and I might tune them anyway.

Is there any classic 3 byte fingerprint function?

I need a checksum/fingerprint function for short strings (say, 16 to 256 bytes) which fits in a 24-bit word. Is there any well-known algorithm for that?
I propose to use a 24-bit CRC as an easy solution. CRCs are available in all lengths and always simple to compute. Wikipedia has a matching entry. The quality is far better than a modulo-reduced sum, because swapping characters will most likely produce a different CRC.
The next step (if a wrong string with the same checksum is a real threat) would be a cryptographic MAC like CMAC. While its output is too long out of the box, it can be truncated to the first 24 bits.
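One concrete, well-known 24-bit CRC is the OpenPGP CRC-24 from RFC 4880; a simple bit-at-a-time Python port looks roughly like this (a table-driven version would be faster):

CRC24_INIT = 0xB704CE
CRC24_POLY = 0x1864CFB

def crc24(data: bytes) -> int:
    crc = CRC24_INIT
    for octet in data:
        crc ^= octet << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= CRC24_POLY
    return crc & 0xFFFFFF

print(hex(crc24(b"some short string")))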
Simplest thing to do is a basic checksum - add up the bytes in the string, mod (2^24).
You have to watch out for character set issues when converting to bytes though, so everyone agrees on the same encoding of characters to bytes.
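That checksum is a one-liner, with the encoding made explicit (Python sketch):

def checksum24(s: str) -> int:
    # sum the UTF-8 bytes, reduced modulo 2^24
    return sum(s.encode("utf-8")) % (2 ** 24)

print(checksum24("some short string"))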

Are there any real-world uses for converting numbers between different bases?

I know that we need to convert decimal, octal, and hexadecimal into binary, but I am confused about conversion of decimal to octal or octal to hexadecimal or decimal to hexadecimal.
Why and where do we need these types of conversion?
Different bases are good for different purposes.
Decimal is obviously what most people know how to deal with, so is good for output of real quantities to end users.
Hex is short and has an even ratio of exactly 2 characters per byte, so it's good for expressing large numbers like SHA1 hashes or private keys and the like in a type-able format, particularly since those numbers don't really represent a quantity, so users don't need to be able to understand them as numbers.
Octal is mostly for legacy reasons -- UNIX file permission codes are traditionally expressed as octal numbers, for example, because three bits per digit corresponds nicely to the three bits per user-category of the UNIX permission encoding scheme.
Sometimes you will have a number in one base when another base is desired; hence the various conversion functions available. In truth, however, my experience is that in practice you almost never convert directly from one base to another, except to parse a number from some non-binary base into binary (in the form of your language of choice's native integral type) and format it back out into whatever base you need for output. About the only time you go from one non-binary base to another is when learning about bases and getting a feel for what numbers in different bases look like, or when debugging using hexadecimal output. Even then, when a computer does it, the usual method is to convert to binary and then back out, because current computers are inherently good at dealing with base-2 numbers and not so good at anything else.
One important place you see numbers actually stored and operated on in decimal is in some financial applications or others where it's important that "number-of-decimal-place" level precision be preserved. Sometimes fixed-point arithmetic can work for currency, but not always, and if it doesn't using binary-floating-point is a bad idea. Older systems actually had built in support for this in the form of binary-coded-decimal arithmetic. In BCD, each 4 bits acts as a decimal digit, so you give up a chunk of every 4 bits of storage in exchange for maintaining your level of precision in the base-of-choice of the non-computing world.
Oddly enough, there is one common use case for other bases that's a bit hidden. Modern languages with large-number support (e.g. Python 2.x's long type or Java's BigInteger and BigDecimal types) will usually store the numbers internally in an array, with each element being a digit in some base. Then they implement the math they support on strings of digits in that base. Really efficient bigint implementations may actually use a base approaching 2^(bits in machine native word size); a base-2^64 number is obviously impossible to usefully output in that form, but doing the calculations in chunks of that size ends up making the best use of space and the CPU. (I don't know if that's the best base; it may be better to use a base of half that number of bits to simplify overflow handling from one digit to the next. It's been a while since I wrote my own bigint, and I never implemented the faster/more complicated versions of multiplication and division.)
MIME uses the hexadecimal system for Quoted-Printable encoding (e.g. a mail subject in Unicode) and a base-64 system for Base64 encoding.
If your workplace is stuck in IPv4 CIDR - you'll be doing quite a lot of bin -> hex -> decimal conversions managing most of the networking equipment until you get them memorized (or just use some random, simple tool).
Even that usage is a bit few-and-far-between - most businesses just adopt the lazy "/24 everything" approach.
If you do a lot of graphics work - there's the chance you'll want to convert colors between systems and need to convert from hex -> dec... most tools have this built in to the color picker, though.
I suppose there's no practical reason to be able to do it, other than that it's really simple and there's no point in not learning how. :)
... unless, for some reason, you're trying to do mantissa binary math in your head.
All of these bases have their uses. Hexadecimal in particular is useful as a shorthand for binary. Every hexadecimal digit is equivalent to 4 bits, so you can write a full 32-bit value as a string of 8 hex digits. Likewise, octal digits are equivalent to 3 bits, and are used frequently as a shorthand for things like Unix file permissions (777 = set read, write, execute bits for user/group/other).
No one base is special--they all have their (obscure) uses. Decimal is special to us because it reflects human experience (10 fingers) but that's really the only reason.
A real-world use case: a program prints an error code in decimal; to get info from a database or the internet you need the hexadecimal format, and because the bits of the error 'number' convey extra info, you also need to look at it in binary.
I'm sure there are occasional uses for this. One use case would be a little app that lets a user convert decimal to octal ... like you can with lots of calculators.
But I'm not sure I understand the point of the question. Standard libraries typically don't provide methods like String toOctal(String decimal). Instead, you would normally convert from a decimal String to a primitive integer and then from the primitive integer to (say) an octal String.
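For example, in Python the round trip goes through a native integer (most languages have equivalent parse and format routines):

n = int("0755", 8)     # octal string -> integer
print(n)               # 493
print(format(n, "x"))  # "1ed" (hexadecimal)
print(format(n, "b"))  # "111101101" (binary)
print(int("1ed", 16))  # 493, round-tripping through hex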

Variable-byte encoding clarification

I am very new to the world of byte encoding so please excuse me (and by all means, correct me) if I am using/expressing simple concepts in the wrong way.
I am trying to understand variable-byte encoding. I have read the Wikipedia article (http://en.wikipedia.org/wiki/Variable-width_encoding) as well as a book chapter from an Information Retrieval textbook. I think I understand how to encode a decimal integer. For example, if I wanted to provide variable-byte encoding for the integer 60, I would have the following result:
1 0 1 1 1 1 0 0
(please let me know if the above is incorrect). If I understand the scheme, then I'm not completely sure how the information is compressed. Is it because usually we would use 32 bits to represent an integer, so that representing 60 would result in 1 1 1 1 0 0 preceded by 26 zeros, thus wasting that space as opposed to representing it with just 8 bits instead?
Thank you in advance for the clarifications.
The way you do it is by reserving one of the bits to mean "I'm not done with the value." Usually, that's the most significant bit.
When you read a byte, you process the lower 7 bits. If the most significant bit is 1, then you know there's one more byte to read, and you repeat the process, adding the next 7 bits to the current 7 bits.
The MIDI format uses that exact encoding to represent lengths of MIDI events, in the following manner:
1. ExpectedValue = 0
2. byte = ReadFromFile
3. ExpectedValue = ExpectedValue + (byte AND 0x7F)
4. if byte > 127 then
       ExpectedValue = ExpectedValue SHL 7
       Goto 2
5. Done
For example, the value 0x80 would be represented using the bytes 0x81 0x00. You can try running the algorithm on those two bytes, and you see you'll get the right value.
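Here is a runnable Python version of that variable-length scheme (7 payload bits per byte, high bit set means "more bytes follow"), plus a matching encoder:

def read_varlen(data, pos=0):
    # decode one variable-length value starting at data[pos]
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:          # high bit clear: this was the last byte
            return value, pos

def write_varlen(value):
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))

print(read_varlen(bytes([0x81, 0x00])))  # (128, 2): the 0x80 example above
assert read_varlen(write_varlen(1234567))[0] == 1234567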
UTF-8 works similarly, but it uses a slightly more complex scheme to tell you how many bytes you should be expecting. This allows for some error correction, since you can easily tell if the bytes you're getting match the length claimed. Wikipedia describes their structure quite well.
You hit the nail on the head.
There are many encoding schemes, such as gamma and delta, which are special cases of elias coding. These are bit-level codes, as opposed to the byte-level code you used, and are useful when you have a strong skew towards small numbers (which can often be achieved by encoding deltas instead of absolute values).
Bit-level encoding schemes are much more difficult to implement than byte-level schemes and the additional CPU burden may outweigh the time saved by having less data to read, though most modern CPUs have "highest-bit" and "lowest-bit" instructions that dramatically improve the performance of bit-level codecs. As CPU speeds continue to outpace RAM speeds, bit-level schemes will become more attractive, though the simplicity of byte-level codecs is a big factor too.
Yes, you are right, you save space by encoding using one byte instead of 4.
Generally, you will save memory if the values you are encoding are much smaller than the maximum value that would have fit in your original fixed-width encoding.