Most compact way to encode a sequence of random variable length binary codes? - encoding

Let's say you have a List<List<Boolean>> and you want to encode that into binary form in the most compact way possible.
I don't care about read or write performance. I just want to use the minimal amount of space. Also, the example is in Java, but we are not limited to the Java system. The length of each "List" is unbounded. Therefore any solution that encodes the length of each list must in itself encode a variable length data type.
Related to this problem is encoding of variable length integers. You can think of each List<Boolean> as a variable length unsigned integer.
Please read the question carefully. We are not limited to the Java system.
EDIT
I don't understand why a lot of the answers talk about compression. I am not trying to do compression per se, just to write random sequences of bits down. The catch is that each sequence of bits has a different length, and order needs to be preserved.
You can think of this question in a different way. Let's say you have an arbitrary list of lists of random unsigned integers (unbounded). How do you encode this list in a binary file?
Research
I did some reading and found what I really am looking for is Universal code
Result
I am going to use a variant of Elias Omega Coding described in the paper A new recursive universal code of the positive integers
I now understand the trade-off: the shorter the representation of the small integers, the longer the representation of the large ones. By choosing a universal code with a "large" representation of the very first integers, you save a lot of space in the long run when you need to encode arbitrarily large integers.

I am thinking of encoding a bit sequence like this:
head | value
------+------------------
00001 | 0110100111000011
Head has variable length. Its end is marked by the first occurrence of a 1. Count the number of zeroes in head: the length of the value field is 2 ^ zeroes. Since the length of value is then known, this encoding can be repeated for the next sequence. Since the size of head grows only as the log of the length of value, the overhead converges to 0% as the encoded value gets larger.
Addendum
If you want to fine tune the length of value more, you can add another field that stores the exact length of value. The length of the length field could be determined by the length of head. Here is an example with 9 bits.
head | length | value
------+--------+-----------
00001 | 1001 | 011011001
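A minimal Python sketch of the head | length | value layout from the addendum. It assumes bit strings are represented as Python strings of '0' and '1', and that the length field uses exactly as many bits as there are zeroes in head (which is what the 9-bit example shows). The names encode and decode are made up for illustration.

```python
def encode(bits: str) -> str:
    """Encode one bit string (e.g. "0110") as head | length | value."""
    n = len(bits)
    k = n.bit_length()                    # bits needed to write n; 0 when n == 0
    head = "0" * k + "1"                  # k zeroes, terminated by the first 1
    length = format(n, "b") if k else ""  # exact length of value, on k bits
    return head + length + bits


def decode(stream: str, pos: int = 0):
    """Decode one sequence starting at pos; return (bits, next position)."""
    k = 0
    while stream[pos] == "0":             # count the zeroes of head
        k += 1
        pos += 1
    pos += 1                              # skip the terminating 1
    n = int(stream[pos:pos + k], 2) if k else 0
    pos += k
    return stream[pos:pos + n], pos + n
```

Because decode returns the position where the next sequence starts, several encoded sequences can simply be concatenated and read back in order.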

I don't know much about Java, so I guess my solution will HAVE to be general :)
1. Compact the lists
Since Booleans are inefficient, each List<Boolean> should be compacted into a List<Byte>, it's easy, just grab them 8 at a time.
The last "byte" may be incomplete, so you need to store how many bits have been encoded of course.
2. Serializing a list of elements
You have 2 ways to proceed: either you encode the number of items in the list, or you use a pattern to mark the end. I would recommend encoding the number of items: the pattern approach requires escaping, which is messy, and it's more difficult with packed bits.
To encode the length you can use a variable-length scheme, i.e. the number of bytes necessary to encode a length grows with the logarithm of that length. Here is one I have already used. You indicate how many bytes are used to encode the length itself with a prefix on the first byte:
0... .... > this byte encodes the number of items (7 bits of effective)
10.. .... / .... .... > 2 bytes
110. .... / .... .... / .... .... > 3 bytes
It's quite space efficient, and decoding occurs on whole bytes, so not too difficult. One could remark it's very similar to the UTF8 scheme :)
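A rough Python sketch of this prefix scheme, assuming the pattern continues the obvious way (k - 1 leading ones, a zero, then 7 * k payload bits) and stopping at 8 bytes so the whole prefix fits in the first byte. The function names are hypothetical.

```python
def encode_length(n: int) -> bytes:
    """Write n on k bytes: k - 1 leading ones, a zero, then 7 * k payload bits."""
    k = 1
    while n >= 1 << (7 * k):
        k += 1
    if k > 8:
        raise ValueError("this sketch stops at 8 bytes (56 payload bits)")
    prefix = ((1 << (k - 1)) - 1) << (8 * k - (k - 1))  # k - 1 ones at the top
    return (prefix | n).to_bytes(k, "big")


def decode_length(data: bytes, pos: int = 0):
    """Read one length starting at pos; return (n, next position)."""
    first = data[pos]
    k = 1
    while first & (0x80 >> (k - 1)):      # count the leading ones
        k += 1
    if k > 8:
        raise ValueError("invalid prefix byte 0xFF")
    raw = int.from_bytes(data[pos:pos + k], "big")
    return raw & ((1 << (7 * k)) - 1), pos + k
```

Decoding only ever looks at whole bytes, which is the property the scheme shares with UTF-8.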
3. Apply recursively
List< List< Boolean > > becomes [Length Item ... Item] where each Item is itself the representation of a List<Boolean>
4. Zip
I suppose there is a zlib library available for Java, or anything else like deflate or LZW. Pass it your buffer and make sure to specify that you want as much compression as possible, whatever the time it takes.
If there is any repetitive pattern (even ones you did not see) in your representation, it should be able to compress it. Don't trust it blindly though, and DO check that the "compressed" form is smaller than the "uncompressed" one; it's not always the case.
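As an illustration of the "check that it is actually smaller" advice, here is a small Python sketch using the standard zlib module at its highest compression level. The one-byte flag in front of the payload is an assumption of this sketch, not part of the scheme described above; it is just one way for a reader to know which form was stored.

```python
import zlib


def pack(buffer: bytes) -> bytes:
    """Store the compressed form only when it is actually smaller."""
    compressed = zlib.compress(buffer, 9)  # 9 = best compression, slowest
    if len(compressed) < len(buffer):
        return b"\x01" + compressed        # flag 1: zlib payload follows
    return b"\x00" + buffer                # flag 0: raw payload follows


def unpack(blob: bytes) -> bytes:
    """Inverse of pack."""
    if blob[0] == 1:
        return zlib.decompress(blob[1:])
    return blob[1:]
```

With this guard the worst case costs one extra byte over the uncompressed buffer.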
5. Examples
Where one notices that keeping track of the edge of the lists is space consuming :)
// Tricky here, we indicate how many bits are used, but they are packed into bytes ;)
List<Boolean> list = [false,false,true,true,false,false,true,true]
encode(list) == [0x08, 0x33] // [00001000, 00110011] (2 bytes)
// Easier: the length actually indicates the number of elements
List<List<Boolean>> super = [list,list]
encode(super) == [0x02, 0x08, 0x33, 0x08, 0x33] // [00000010, ...] (5 bytes)
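The two examples above can be reproduced with a short Python sketch. To stay brief it assumes every length fits in a single byte (fewer than 128 items), packs bits most significant bit first, and pads the last byte with zeroes on the right; the function names are made up for illustration.

```python
def encode_list(bits):
    """Encode one list of booleans as [bit count, packed bytes...]."""
    n = len(bits)
    if n >= 128:
        raise ValueError("this sketch only handles single-byte lengths")
    out = bytearray([n])
    for i in range(0, n, 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | int(b)    # most significant bit first
        byte <<= 8 - len(chunk)            # pad an incomplete last byte
        out.append(byte)
    return bytes(out)


def encode_lists(lists):
    """Encode a list of lists as [item count, item, item, ...]."""
    if len(lists) >= 128:
        raise ValueError("this sketch only handles single-byte lengths")
    return bytes([len(lists)]) + b"".join(encode_list(l) for l in lists)
```

The inner length counts bits while the outer length counts items, which is the asymmetry the comments in the example point out.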
6. Space consumption
Suppose we have a List<Boolean> of n booleans, the space consumed to encode it is:
booleans = ceil( n / 8 )
To encode the number of bits (n), we need:
length = 1 for 0 <= n < 2^7 = 128
length = 2 for 2^7 <= n < 2^14 = 16384
length = 3 for 2^14 <= n < 2^21 = 2097152
...
length = ceil( log2(n + 1) / 7 ) # for n != 0 ;)
Thus to fully encode a list:
bytes =
if n == 0: 1
else : ceil( log2(n + 1) / 7 ) + ceil( n / 8 )
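For checking figures by hand, the same count can be written as a tiny Python function. It uses integer arithmetic (int.bit_length) instead of floating-point logarithms so that the boundaries at 128, 16384, ... fall exactly where the ranges above put them. The name encoded_size is just for illustration.

```python
def encoded_size(n: int) -> int:
    """Bytes needed for a list of n booleans: length prefix plus packed bits."""
    if n == 0:
        return 1
    length_bytes = (n.bit_length() + 6) // 7   # 7 payload bits per length byte
    payload_bytes = (n + 7) // 8               # ceil(n / 8)
    return length_bytes + payload_bytes
```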
7. Small Lists
There is one corner case though: the low end of the spectrum (ie almost empty list).
For n == 1, bytes is evaluated to 2, which may indeed seem wasteful. I would not however try to guess what will happen once the compression kicks in.
You may wish though to pack even more. It's possible if we abandon the idea of preserving whole bytes...
Keep the length encoding as is (on whole bytes), but do not "pad" the List<Boolean>. A one element list becomes 0000 0001 x (9 bits)
Try to 'pack' the length encoding as well
The second point is more difficult, we are effectively down to a double length encoding:
Indicates how many bits encode the length
Actually encode the length on these bits
For example:
0 -> 0 0
1 -> 0 1
2 -> 10 10
3 -> 10 11
4 -> 110 100
5 -> 110 101
8 -> 1110 1000
16 -> 11110 10000 (=> 1 byte and 2 bits)
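The table above follows a simple rule: a unary prefix (k - 1 ones, then a zero) says that the length is written on k bits, and the k bits follow. A Python sketch, with bit strings as Python strings and hypothetical function names:

```python
def encode_packed_length(n: int) -> str:
    """Unary prefix giving the bit count, then n itself in binary."""
    bits = format(n, "b")                  # "0" for n == 0
    k = len(bits)
    return "1" * (k - 1) + "0" + bits


def decode_packed_length(stream: str, pos: int = 0):
    """Read one packed length; return (n, next position)."""
    k = 1
    while stream[pos] == "1":              # each 1 adds one bit to the length
        k += 1
        pos += 1
    pos += 1                               # skip the terminating 0
    return int(stream[pos:pos + k], 2), pos + k
```

The encoded length always takes 2 * k bits, which is where the doubling in the comparison below comes from.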
It works pretty well for very small lists, but quickly degenerates (both lengths in bits, for n != 0):
# Original scheme
length = 8 * ceil( log2(n + 1) / 7 )
# New scheme
length = 2 * ceil( log2(n + 1) )
The breaking point? 8
Yep, you read it right, it's only better for lists with fewer than 8 elements... and only better by a few bits.
n -> bits spared
[0,1] -> 6
[2,3] -> 4
[4,7] -> 2
[8,15] -> 0 # Turn point
[16,31] -> -2
[32,63] -> -4
[64,127] -> -6
[128,255] -> 0 # Interesting eh ? That's the whole byte effect!
And of course, once the compression kicks in, chances are it won't really matter.
I understand you may appreciate recursive's algorithm, but I would still advise to compute the figures of the actual space consumption or even better to actually test it with archiving applied on real test sets.
8. Recursive / Variable coding
I have read with interest TheDon's answer, and the link he submitted to Elias Omega Coding.
They are sound answers, in the theoretical domain. Unfortunately they are quite impractical. The main issue is that they have extremely interesting asymptotic behaviors, but when do we actually need to encode a gigabyte worth of data? Rarely if ever.
A recent study of memory usage at work suggested that most containers were used for a dozen items (or a few dozen). Only in some very rare cases do we reach the thousand. Of course, for your particular problem the best way would be to actually examine your own data and see the distribution of values, but from experience I would say you cannot just concentrate on the high end of the spectrum, because your data lies in the low end.
An example of TheDon's algorithm. Say I have a list [0,1,0,1,0,1,0,1]
len('01010101') = 8 -> 1000
len('1000') = 4 -> 100
len('100') = 3 -> 11
len('11') = 2 -> 10
encode('01010101') = '10' '0' '11' '0' '100' '0' '1000' '1' '01010101'
len(encode('01010101')) = 2 + 1 + 2 + 1 + 3 + 1 + 4 + 1 + 8 = 23
Let's make a small table, with various thresholds at which to stop the recursion. It shows the number of bits of overhead for various ranges of n.
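range of n | threshold 2 | threshold 3 | threshold 4 | threshold 5 | My proposal
-----------+-------------+-------------+-------------+-------------+------------
[0,3]      | 3           | 4           | 5           | 6           | 8
[4,7]      | 10          | 4           | 5           | 6           | 8
[8,15]     | 15          | 9           | 5           | 6           | 8
[16,31]    | 16          | 10          | 5           | 6           | 8
[32,63]    | 17          | 11          | 12          | 6           | 8
[64,127]   | 18          | 12          | 13          | 14          | 8
[128,255]  | 19          | 13          | 14          | 15          | 16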
To be fair, I concentrated on the low end, and my proposal is suited for this task. I wanted to underline that it's not so clear-cut though. Especially because near 1, the log function is almost linear, and thus the recursion loses its charm. The threshold helps tremendously and 3 seems to be a good candidate...
As for Elias omega coding, it's even worse. From the wikipedia article:
17 -> '10 100 10001 0'
That's it, a whopping 11 bits.
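For anyone who wants to check such figures on other values, here is a short Python sketch of Elias omega coding for positive integers, written from the usual description of the code (prepend the binary form of n, then recurse on its length minus one, and terminate with a 0). Bit strings are plain Python strings here.

```python
def elias_omega_encode(n: int) -> str:
    """Elias omega code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias omega is defined for n >= 1")
    code = "0"                             # terminating bit
    while n > 1:
        group = format(n, "b")
        code = group + code                # prepend the binary form of n
        n = len(group) - 1                 # then encode its length minus one
    return code


def elias_omega_decode(stream: str, pos: int = 0):
    """Decode one integer starting at pos; return (n, next position)."""
    n = 1
    while stream[pos] == "1":              # a group always starts with a 1
        group = stream[pos:pos + n + 1]
        pos += n + 1
        n = int(group, 2)
    return n, pos + 1                      # skip the terminating 0
```

Encoding 17 this way yields the groups 10, 100, 10001 and the final 0, i.e. the 11 bits quoted above.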
Moral: You cannot choose an encoding scheme without considering the data at hand.
So, unless your List<Boolean> have a length in the hundreds, don't bother and stick to my little proposal.

I'd use variable-length integers to encode how many bits there are to read. The MSB would indicate if the next byte is also part of the integer. For instance:
11000101 10010110 00100000
Would actually mean:
10001 01001011 00100000
Since the integer is continued 2 times.
These variable-length integers would tell how many bits there are to read. And there'd be another variable-length int at the beginning of all to tell how many bit sets there are to read.
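A small Python sketch of this continuation-bit integer, assuming the most significant 7-bit group comes first, as in the example above. The function names are made up for illustration.

```python
def encode_varint(n: int) -> bytes:
    """7 payload bits per byte; a set MSB means another byte follows."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()                       # most significant group first
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_varint(data: bytes, pos: int = 0):
    """Read one integer starting at pos; return (n, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                # MSB clear: this was the last byte
            return value, pos
```

Encoding the 21-bit value 10001 01001011 00100000 gives back exactly the three bytes 11000101 10010110 00100000 from the example.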
From there on, supposing you don't want to use compression, the only way I can see to optimize it size-wise is to adapt it to your situation. If you often have larger bit sets, you might for instance want to use short integers instead of bytes for the variable-length integer encoding, so that you potentially waste fewer bits in the encoding itself.
EDIT I don't think there exists a perfect way to achieve all you want, all at once. You can't create information out of nothing, and if you need variable-length integers, you obviously have to encode the integer length too. There is necessarily a tradeoff between space and information, but there is also minimal information that you can't cut out to use less space. No system where factors grow at different rates will ever scale perfectly. It's like trying to fit a straight line over a logarithmic curve. You can't do that. (And besides, that's pretty much exactly what you're trying to do here.)
You cannot encode the length of the variable-length integer outside of the integer and get unlimited-size variable integers at the same time, because that would require the length itself to be variable-length, and whatever algorithm you choose, it seems common sense to me that you'll be better off with just one variable-length integer instead of two or more of them.
So here is my other idea: in the integer "header", write one 1 for each byte the variable-length integer requires from there. The first 0 denotes the end of the "header" and the beginning of the integer itself.
I'm trying to grasp the exact equation to determine how many bits are required to store a given integer for the two ways I gave, but my logarithms are rusty, so I'll plot it down and edit this message later to include the results.
EDIT 2
Here are the equations:
Solution one, 7 data bits per continuation bit (one full byte at a time), for x >= 1:
y = 8 * ceil(log(x + 1) / (7 * log(2)))
Solution one, 3 data bits per continuation bit (one nibble at a time):
y = 4 * ceil(log(x + 1) / (3 * log(2)))
Solution two, 1 byte per header bit plus separator:
y = 9 * ceil(log(x + 1) / (8 * log(2))) + 1
Solution two, 1 nibble per header bit plus separator:
y = 5 * ceil(log(x + 1) / (4 * log(2))) + 1
I suggest you take the time to plot them (best viewed with a logarithmic-linear coordinates system) to get the ideal solution for your case, because there is no perfect solution. In my opinion, the first solution has the most stable results.
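If plotting is inconvenient, the four sizes can also be tabulated directly. The Python sketch below counts bits with integer arithmetic rather than floating-point logarithms, so the jumps land exactly at powers of two; the function names and the chunk parameter (8 for bytes, 4 for nibbles) are assumptions of this sketch.

```python
def chunks_needed(x: int, payload_bits: int) -> int:
    """Number of chunks needed to hold x, with payload_bits per chunk."""
    return max(1, -(-x.bit_length() // payload_bits))   # ceiling division


def solution_one(x: int, chunk: int = 8) -> int:
    """Continuation bit in every chunk: chunk - 1 data bits per chunk."""
    return chunk * chunks_needed(x, chunk - 1)


def solution_two(x: int, chunk: int = 8) -> int:
    """Unary header (one 1 per chunk, then a 0) followed by full chunks."""
    k = chunks_needed(x, chunk)
    return k * (chunk + 1) + 1


if __name__ == "__main__":
    for x in (1, 100, 255, 256, 10**6, 10**12):
        print(x, solution_one(x), solution_one(x, 4),
              solution_two(x), solution_two(x, 4))
```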

I guess for "the most compact way possible" you'll want some compression, but Huffman Coding may not be the way to go as I think it works best with alphabets that have static per-symbol frequencies.
Check out Arithmetic Coding - it operates on bits and can adapt to dynamic input probabilities. I also see that there is a BSD-licensed Java library that'll do it for you, which seems to expect single bits as input.
I suppose for maximum compression you could concatenate each inner list (prefixed with its length) and run the coding algorithm again over the whole lot.

I don't see how encoding an arbitrary set of bits differs from compressing/encoding any other form of data. Note that you only impose a loose restriction on the bits you're encoding: namely, they are lists of lists of bits. With this small restriction, this list of bits becomes just data, arbitrary data, and that's what "normal" compression algorithms compress.
Of course, most compression algorithms work on the assumption that the input is repeated in some way in the future (or in the past), as in the LZxx family of compressors, or has a given frequency distribution for symbols.
Given your prerequisites and how compression algorithms work, I would advise doing the following:
Pack the bits of each list using the fewest possible bytes, using bytes as bitfields, encoding the length, etc.
Try huffman, arithmetic, LZxx, etc on the resulting stream of bytes.
One can argue that this is the pretty obvious and easiest way of doing this, and that this won't work as your sequence of bits have no known pattern. But the fact is that this is the best you can do in any scenario.
UNLESS you know something about your data, or some transformation on those lists that brings out a pattern of some kind. Take for example the coding of the DCT coefficients in JPEG encoding. The way of listing those coefficients (diagonally, in zig-zag) is chosen to favor a pattern in the output of the different coefficients of the transformation. This way, traditional compression can be applied to the resulting data. If you know something about those lists of bits that allows you to re-arrange them in a more compressible way (a way that shows some more structure), then you'll get compression.

I have a sneaking suspicion that you simply can't encode a truly random set of bits into a more compact form in the worst case. Any kind of RLE is going to inflate the set on just the wrong input even though it'll do well in the average and best cases. Any kind of periodic or content specific approximation is going to lose data.
As one of the other posters stated, you've got to know SOMETHING about the dataset to represent it in a more compact form and / or you've got to accept some loss to get it into a predictable form that can be more compactly expressed.
In my mind, this is an information-theoretic problem with the constraint of infinite information and zero loss. You can't represent the information in a different way and you can't approximate it as something more easily represented. Ergo, you need at least as much space as you have information and no less.
http://en.wikipedia.org/wiki/Information_theory
You could always cheat, I suppose, and manipulate the hardware to encode a discrete range of values on the media to tease out a few more "bits per bit" (think multiplexing). You'd spend more time encoding it and reading it though.
Practically, you could always try the "jiggle" effect where you encode the data multiple times in multiple ways (try interpreting as audio, video, 3d, periodic, sequential, key based, diffs, etc...) and in multiple page sizes and pick the best. You'd be pretty much guaranteed to have the best REASONABLE compression and your worst case would be no worse than your original data set.
Dunno if that would get you the theoretical best though.

Theoretical Limits
This is a difficult question to answer without knowing more about the data you intend to compress; the answer to your question could be different with different domains.
For example, from the Limitations section of the Wikipedia article on Lossless Compression:
Lossless data compression algorithms cannot guarantee compression for all input data sets. In other words, for any (lossless) data compression algorithm, there will be an input data set that does not get smaller when processed by the algorithm. This is easily proven with elementary mathematics using a counting argument. ...
Basically, since it's theoretically impossible to compress all possible input data losslessly, it's not even possible to answer your question effectively.
Practical compromise
Just use Huffman, DEFLATE, 7Z, or some ZIP-like off-the-shelf compression algorithm and encode the bits as variable length byte arrays (or lists, or vectors, or whatever they are called in Java or whatever language you like). Of course, reading the bits back out may require a bit of decompression, but that could be done behind the scenes. You can make a class which hides the internal implementation and offers methods that return a list or array of booleans for some range of indices, despite the fact that the data is stored internally in packed byte arrays. Updating the boolean at a given index or indices may be a problem but is by no means impossible.

List-of-Lists-of-Ints-Encoding:
When you come to the beginning of a list, write down the bits for ASCII '['. Then proceed into the list.
When you come to any arbitrary binary number, write down bits corresponding to the decimal representation of the number in ASCII. For example the number 100, write 0x31 0x30 0x30. Then write the bits corresponding to ASCII ','.
When you come to the end of a list, write down the bits for ']'. Then write ASCII ','.
This encoding will encode any arbitrarily-deep nesting of arbitrary-length lists of unbounded integers. If this encoding is not compact enough, follow it up with gzip to eliminate the redundancies in ASCII bit coding.

You could convert each List into a BitSet and then serialize the BitSet-s.

Well, first off you will want to pack those booleans together so that you are getting eight of them to a byte. C++'s standard bitset was designed for this purpose. You should probably be using it natively instead of vector, if you can.
After that, you could in theory compress it when you save to get the size even smaller. I'd advise against this unless your back is really up against the wall.
I say in theory because it depends a lot on your data. Without knowing anything about your data, I really can't say any more on this, as some algorithms work better than others on certain kinds of data. In fact, simple information theory tells us that in some cases any compression algorithm will produce output that takes up more space than you started with.
If your bitset is rather sparse (not a lot of 0's, or not a lot of 1's), or is streaky (long runs of the same value), then it is possible you could get big gains with compression. In almost every other circumstance it won't be worth the trouble. Even in that circumstance it may not be. Remember that any code you add will need to be debugged and maintained.

As you point out, there is no reason to store your boolean values using any more space than a single bit. If you combine that with some basic construct, such as each row begins with an integer coding the number of bits in that row, you'll be able to store a 2D table of any size where each entry in the row is a single bit.
However, this is not enough. A string of arbitrary 1's and 0's will look rather random, and any compression algorithm breaks down as the randomness of your data increases - so I would recommend a process like Burrows-Wheeler Block sorting to greatly increase the amount of repeated "words" or "blocks" in your data. Once that's complete a simple Huffman code or Lempel-Ziv algorithm should be able to compress your file quite nicely.
To allow the above method to work for unsigned integers, you would compress the integers using Delta Codes, then perform the block sorting and compression (a standard practice in Information Retrieval postings lists).

If I understood the question correctly, the bits are random, and we have a random-length list of independently random-length lists. Since nothing here is tied to bytes, I will discuss this as a bit stream. Since files actually contain bytes, you will need to pack eight bits into each byte and leave up to 7 bits of the last byte unused.
The most efficient way of storing the boolean values is as-is. Just dump them into the bitstream as a simple array.
At the beginning of the bitstream you need to encode the array lengths. There are many ways to do it, and you can save a few bits by choosing the best one for your arrays. For this you will probably want to use Huffman coding with a fixed codebook so that commonly used and small values get the shortest sequences. If a list is very long, you probably won't care much that its size is encoded in a longer form.
A precise answer as to what the codebook (and thus the huffman code) is going to be cannot be given without more information about the expected list lengths.
If all the inner lists are of the same size (i.e. you have a 2D array), you only need the two dimensions, of course.
Deserializing: decode the lengths and allocate the structures, then read the bits one by one, assigning them to the structure in order.

@zneak's answer (beat me to it), but use Huffman-encoded integers, especially if some lengths are more likely.
Just to be self-contained: Encode the number of lists as a huffman encoded integer, then for each list, encode its bit length as a huffman encoded integer. The bits for each list follow with no intervening wasted bits.
If the order of the lists doesn't matter, sorting them by length would reduce the space needed, only the incremental length increase of each subsequent list need be encoded.

List-of-List-of-Ints-binary:
Start traversing the input list
For each sublist:
    Output 0xFF 0xFE
    For each item in the sublist:
        Output the item as a stream of bits, LSB first.
        If the pattern 0xFF appears anywhere in the stream,
        replace it with 0xFF 0xFD in the output.
        Output 0xFF 0xFC
Decoding:
If the stream has ended then end any previous list and end reading.
Read bits from input stream. If pattern 0xFF is encountered, read the next 8 bits.
    If they are 0xFE, end any previous list and begin a new one.
    If they are 0xFD, assume that the value 0xFF has been read (discard the 0xFD)
    If they are 0xFC, end any current integer at the bit before the pattern, and begin reading a new one at the bit after the 0xFC.
    Otherwise indicate error.

If I understand correctly our data structure is ( 1 2 ( 33483 7 ) 373404 9 ( 337652222 37333788 ) )
Format like so:
byte 255 - escape code
byte 254 - begin block
byte 253 - list separator
byte 252 - end block
So we have:
#include <stdio.h>

struct bigdat {
    int nmem;   /* Won't overflow -- out of memory first */
    int kind;   /* 0 = number, 1 = recurse */
    void *data; /* points to array of bytes for kind 0, array of bigdat for kind 1 */
};

int serialize(FILE *f, struct bigdat *op) {
    int i;
    if (op->kind == 0) {
        /* A number: emit its bytes, escaping any that collide with a control code. */
        unsigned char *num = (unsigned char *)op->data;
        for (i = 0; i < op->nmem; i++) {
            if (num[i] >= 252)
                fputc(255, f);
            fputc(num[i], f);
        }
    } else {
        /* A nested block: begin marker, members separated by 253, end marker. */
        struct bigdat *blocks = (struct bigdat *)op->data;
        fputc(254, f);
        for (i = 0; i < op->nmem; i++) {
            if (i) fputc(253, f);
            serialize(f, &blocks[i]);
        }
        fputc(252, f);
    }
    return 0;
}
There is a law about numeric digit distribution that says for sets of sets of arbitrary unsigned integers, the higher the byte value the less it happens so put special codes at the end.
Not encoding length in front of each takes up far less room, but makes deserialize a difficult exercise.

This question has a certain induction feel to it. You want a function: (bool list list) -> (bool list) such that an inverse function (bool list) -> (bool list list) generates the same original structure, and the length of the encoded bool list is minimal, without imposing restrictions on the input structure. Since this question is so abstract, I'm thinking these lists could be mind bogglingly large - 10^50 maybe, or 10^2000, or they can be very small, like 10^0. Also, there can be a large number of lists, again 10^50 or just 1. So the algorithm needs to adapt to these widely different inputs.
I'm thinking that we can encode the length of each list as a (bool list), and add one extra bool to indicate whether the next sequence is another (now larger) length or the real bitstream.
let encode2d(list1d::Bs) = encode1d(length(list1d), true) # list1d # encode2d(Bs)
encode2d(nil) = nil
let encode1d(1, nextIsValue) = true :: nextIsValue :: []
encode1d(len, nextIsValue) =
let bitList = toBoolList(len) # [nextIsValue] in
encode1d(length(bitList), false) # bitList
let decode2d(bits) =
let (list1d, rest) = decode1d(bits, 1) in
list1d :: decode2d(rest)
let decode1d(bits, n) =
let length = fromBoolList(take(n, bits)) in
let nextIsValue :: bits' = skip(n, bits) in
if nextIsValue then bits' else decode1d(bits', length)
assumed library functions
-------------------------
toBoolList : int -> bool list
this function takes an integer and produces the boolean list representation
of the bits. All leading zeroes are removed, except for input '0'
fromBoolList : bool list -> int
the inverse of toBoolList
take : int * a' list -> a' list
returns the first count elements of the list
skip : int * a' list -> a' list
returns the remainder of the list after removing the first count elements
The overhead is per individual bool list. For an empty list, the overhead is 2 extra list elements. For 10^2000 bools, the overhead would be 6645 + 14 + 5 + 4 + 3 + 2 = 6673 extra list elements.

Related

How can 3-state bit packed together?

I am looking for a clever solution that would allow packing at least nine 3-state 'bits' into a 16-bit integer. It should also still be possible to easily set the value of one of these 3-state 'bits'.
As an example, it could be used to encode a tic-tac-toe position, the three states being _ (empty), X (me), O (opponent) for the nine squares of the board.
Naturally using 2 bits per square would do the job, but it would require 18 bits overall. Is there an encoding that would use at most 1.7 bits per square, and still stay simple to work with?
You can store ten 3-state values in a 16-bit integer, since 3^10 = 59049 < 65536. Simply encode a 10-digit base-3 number into a 16-bit integer, and pull the digits out going the other way.
To encode each digit d, the repeated operation is n = 3*n + d. To decode the digits in the opposite order, the repeated operations are d = n % 3 and n /= 3.
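The two loops described above, plus a way to change a single cell, might look like this in Python. The function names are made up for illustration, and cell index i in get_cell / set_cell is counted from the last-encoded (least significant) digit, which is a choice of this sketch.

```python
def pack(cells):
    """Pack 3-state values (0, 1 or 2) into one integer, first cell first."""
    n = 0
    for d in cells:
        n = 3 * n + d
    return n


def unpack(n, count):
    """Recover count cells, in the same order they were given to pack."""
    digits = []
    for _ in range(count):
        digits.append(n % 3)               # digits come out last cell first
        n //= 3
    return digits[::-1]


def get_cell(n, i):
    """Read the i-th base-3 digit, counting from the least significant."""
    return (n // 3 ** i) % 3


def set_cell(n, i, d):
    """Return n with its i-th base-3 digit replaced by d."""
    return n + (d - get_cell(n, i)) * 3 ** i
```

Setting a cell costs one division and one multiplication by a power of three, so it stays reasonably simple even without bit masks.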

how to get reverse(not complement or inverse) of a binary number

I am implementing the Cooley-Tukey FFT (radix-2 DIF / DIT) algorithm in MATLAB. For the bit-reversal step I need the reverse of a binary number. Can anyone suggest how to get the reverse of a binary number (like 100111 -> 111001)? Anyone who has worked on an FFT implementation can help me with the algorithm as well.
Topic: How to do bit reversal in Matlab? .
If you're using double precision floating point ('double') numbers
which are integers, you can do this:
dr = bin2dec(fliplr(dec2bin(d,n))); % Bits in dr are in reverse order
where n is the number of bits to be reversed and where 0 <= d < 2^n.
You will experience no precision problems at all as long as the
integers are no more than 52 bits long.
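The same string-flipping idea, sketched in Python for comparison (not MATLAB code): format the value on n bits, reverse the string, and parse it back. The function name is made up for illustration.

```python
def reverse_bits(d: int, n: int) -> int:
    """Reverse the n-bit binary representation of d, where 0 <= d < 2**n."""
    if not 0 <= d < (1 << n):
        raise ValueError("d does not fit in n bits")
    return int(format(d, f"0{n}b")[::-1], 2)


# Bit-reversed index order for an 8-point radix-2 FFT.
order = [reverse_bits(i, 3) for i in range(8)]
```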
And
Re: How to do bit reversal in Matlab?
How large will the numbers be that you need to reverse? May I ask what
is the purpose of it? Maybe there is a more efficient way to solve the
whole problem. If the numbers are large you can just store the bits as
a string. To reverse it just read the string backwards! Or use
fliplr().
(There may be better places to ask).
If it were VHDL I'd suggest an alias with 'REVERSE'RANGE.
Taken from the help section;
Y = swapbytes(X) reverses the byte ordering of each element in array X, converting little-endian values to big-endian (and vice versa). The input array must contain all full, noncomplex, numeric elements.

Simple compression algorithm in C++ interpretable by matlab

I'm generating ~1million text files containing arrays of doubles, tab delimited (these are simulations for research). Example output below. Each million text files I expect to be ~5 TB, which is unacceptable. So I need to compress.
However, all my data analysis will be done in matlab. And every matlab script will need to access all million of these text files. I can't decompress the whole million using C++, then run the matlab scripts, because I lack the HD space. So my question is, are there some very simple, easy to implement algorithms or other ways of reducing my text file sizes so that I can write the compression in C++ and read it in matlab?
example text file
0.0220874 0.00297818 0.000285954 1.70E-05 1.52E-07
0.0542912 0.00880725 0.000892849 6.94E-05 4.51E-06
0.0848582 0.0159799 0.00185915 0.000136578 7.16E-06
0.100415 0.0220033 0.00288016 0.000250445 1.38E-05
0.101889 0.0250725 0.00353148 0.000297856 2.34E-05
0.0942061 0.0256 0.00393893 0.000387219 3.01E-05
0.0812377 0.0238492 0.00392418 0.000418365 4.09E-05
0.0645259 0.0206528 0.00372185 0.000419891 3.23E-05
0.0487525 0.017065 0.00313825 0.00037539 3.68E-05
If it matters.. the complete text files represent joint probability mass functions, so they sum to 1. And I need lossless compression.
UPDATE Here is an IDIOTS guide to writing binary in C++ and reading it Matlab, with some very basic explanation along the way.
C++ code to write a small array to a binary file.
#include <cstdio>   // FILE, fopen, fwrite, fclose
#include <iostream>
using namespace std;
int main()
{
float writefloat;
const int rows=2;
const int cols=3;
float JPDF[rows][cols];
JPDF[0][0]=.19493;
JPDF[0][1]=.111593;
JPDF[0][2]=.78135;
JPDF[1][0]=.33333;
JPDF[1][1]=.151535;
JPDF[1][2]=.591355;
JPDF is an array of type float that I save 6 values to. It's a 2x3 array.
FILE * out_file;
out_file = fopen ( "test.bin" , "wb" );
To be honest, I don't quite get what the first line is doing. It seems to be making a pointer of type FILE named out_file. The second line fopen says make a new file for writing (the 'w' of the second parameter), and make it a binary file (the 'b' of the wb).
fwrite(&rows,sizeof(int),1,out_file);
fwrite(&cols,sizeof(int),1,out_file);
Here I encode the size of my array (# rows, # cols). Note that we pass fwrite the address of the variables rows and cols, not the variables themselves (& takes the address). The second parameter is the size in bytes of each element to write. Since rows and cols are both ints, I use sizeof(int). The '1' is the number of elements to write. And out_file is our pointer to the file we're writing to.
for (int i=0; i<3; i++)
{
for (int j=0; j<2; j++)
{
writefloat=JPDF[j][i];
fwrite (&writefloat , sizeof(float), 1, out_file);
}
}
fclose (out_file);
return 0;
}
Now I'll iterate through my array and write each value in bytes to my file. The indexing is a little backwards looking in that I'm iterating down each column rather than across a row in the inner loop. We'll see why in a sec. Again, I'm passing the address of writefloat, which takes on the value of the current array element in each iteration. Since each array element is a float, I'm using sizeof(float) here instead of sizeof(int).
Just to be incredibly, stupidly clear, here's a diagram of how I think of the file we've just created.
[4 bytes: rows][4 bytes: cols][4 bytes: JPDF[0][0]][4 bytes: JPDF[1][0]] ...
[4 bytes: JPDF[1][2]]
..where each chunk of bytes is written in binary (0s and 1s).
To interpret in MATLAB:
FID=fopen('test.bin');
sizes=fread(FID,2,'int')
FID sort of works like a pointer here. Secretly, it probably is a pointer. Then we use fread, which operates very similarly to C++ fread. FID is our 'pointer' to our file. The 'int' tells the function how many bytes each chunk contains. So sizes=fread(FID,2,'int') says 'read 2 chunks of size INT bytes from FID in binary, and return the 2 elements in vector form'. Now, sizes(1)=rows, and sizes(2)=cols.
s=fread(FID,[sizes(1) sizes(2)],'float')
The next part wasn't completely clear to me originally: I thought I'd have to tell fread to skip the 'header' of my binary that contains the row/col info. However, it secretly maintains a pointer to where you left off. So now I read the rest of the binary file, using the fact that I know the dimensions of the array. Note that while the second parameter [M,N] is [rows,cols], fread fills the matrix in column order, which is why we wrote the array data in column order.
The one caveat is that I think I can only use the MATLAB types 'int' and 'float' if they match what the C++ program wrote (e.g., both sides use 4-byte ints and floats with the same byte order). But I'm not sure about this.
The output is:
sizes =
2
3
s =
0.194930002093315 0.111593000590801 0.781350016593933
0.333330005407333 0.151535004377365 0.59135502576828
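As a cross-check of the file layout described above, here is a minimal Python sketch that builds the same byte layout in memory and reads it back. It assumes little-endian 4-byte ints and floats (an assumption about the machine that wrote the file), and the function names pack_matrix and unpack_matrix are made up for this illustration.

```python
import struct

def pack_matrix(matrix):
    """Pack a 2-D list as [int rows][int cols][float32 values in column order]."""
    rows, cols = len(matrix), len(matrix[0])
    # Column order: walk down each column before moving to the next one.
    column_major = [matrix[r][c] for c in range(cols) for r in range(rows)]
    header = struct.pack("<ii", rows, cols)            # assumed little-endian
    body = struct.pack(f"<{rows * cols}f", *column_major)
    return header + body

def unpack_matrix(data):
    """Read the header first, then use it to interpret the remaining floats."""
    rows, cols = struct.unpack_from("<ii", data, 0)
    values = struct.unpack_from(f"<{rows * cols}f", data, 8)
    # Undo the column-order layout to get back rows of values.
    return [[values[c * rows + r] for c in range(cols)] for r in range(rows)]

matrix = [[0.19493, 0.111593, 0.78135],
          [0.33333, 0.151535, 0.591355]]
data = pack_matrix(matrix)
restored = unpack_matrix(data)
```

The values come back with float32 precision, which is why the MATLAB output above shows digits such as 0.194930002093315 rather than exactly 0.19493.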
To do better than four bytes per number, you need to determine to what precision you need these numbers. Since they are probabilities, they are all in [0,1]. You should be able to specify a precision as a power of two, e.g. that you need to know each probability to within 2^-n of the actual. Then you can simply multiply each probability by 2^n, round to the nearest integer, and store just the n bits in that integer.
In the worst case, I can see that you are never showing more than six digits for each probability. You can therefore code them in 20 bits, assuming a constant fixed precision past the decimal point. Multiply each probability by 2^20 (1048576), round, and write out 20 bits to the file. Each probability will take 2.5 bytes. That is smaller than the four bytes for a float value.
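A minimal Python sketch of this fixed-point idea, assuming 20 bits per value. The function names are hypothetical, and the clamp for a probability of exactly 1.0 is my own addition, since 1.0 * 2^20 would otherwise need a 21st bit.

```python
def pack_probabilities(probs, bits=20):
    """Quantize each probability to `bits` bits and pack them back to back."""
    scale = 1 << bits
    acc = 0
    for p in probs:
        # Clamp so that p == 1.0 still fits in `bits` bits (assumption, see lead-in).
        q = min(round(p * scale), scale - 1)
        acc = (acc << bits) | q
    total_bits = bits * len(probs)
    pad = (-total_bits) % 8                 # zero bits added to reach a whole byte
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")

def unpack_probabilities(data, count, bits=20):
    """Inverse of pack_probabilities; `count` must be known to the reader."""
    total_bits = bits * count
    pad = (-total_bits) % 8
    acc = int.from_bytes(data, "big") >> pad
    mask = (1 << bits) - 1
    return [((acc >> (bits * (count - 1 - i))) & mask) / (1 << bits)
            for i in range(count)]

probs = [0.19493, 0.33333, 0.111593, 0.151535, 0.78135, 0.591355]
packed = pack_probabilities(probs)
restored = unpack_probabilities(packed, len(probs))
```

Six values take 120 bits, i.e. 15 bytes, which matches the 2.5 bytes per probability mentioned above.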
And either way is way smaller than the average of 11.3 bytes per value in your example file.
You can get better compression even than that if you can exploit known patterns in your data. Assuming that there are any. I see that in your example, on each line the values go down by some factor at each step. If that is real and not just an artifact of the generation of the example, then you can successively use fewer bits for each sample. Also if the first sample is really always less than 1/8, then you can drop the top three bits off that one, since those bits would always be zero. If the second column is always less than 1/32, you can drop the first five bits off all of those. And so on. Assuming that the magnitudes in the example are maximums across all of the data sets (obviously not true, but just using that as an illustrative case), and assuming you need six decimal digits after the decimal point, I could code each row of six values in 50 bits, for an average of a little over one byte per probability.
And for one last smidgen of compression, since the values add to one, you don't have to store the last value.
Matlab can read binary files. Why not save your files as binary instead of text?
Saving each number as a float would only require 4 bytes (a float is 4 bytes on virtually every common platform). You could use doubles, but it appears that you aren't using the full double resolution. Under your current scheme each digit of every number consumes a byte of space. All of your numbers are easily 4+ chars long, some as long as 10 chars. Implementing this change should cut down your file sizes by more than 50%.
Additionally, you might consider using a more elegant data format like HDF5, which both supports compression and is supported by MATLAB.
Update:
There are lots of examples of how to write binary files in C++, just google it.
Additionally to read in a binary file in Matlab simply use fread
The difference between representing a number as ASCII vs binary is really simple. All files are written using binary; the difference is in how that information gets interpreted. Text files are generally read using ASCII, which provides a nice mapping between 8-bit bytes and characters. When you see a string like "255", what you have is an array of bytes where each byte encodes one character in the array. However, when you are storing numbers it's really wasteful to store each digit using a different byte. A single byte can store values between 0 and 255. So why use three bytes to store the string "255" when I can use a single byte to store the value 255?
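To make the size difference concrete, here is a tiny Python sketch that stores the value 255 both ways. It is only an illustration of the point above, not code from the original program.

```python
import struct

value = 255
as_text = str(value).encode("ascii")    # b"255": one byte per decimal digit
as_binary = struct.pack("B", value)     # b"\xff": a single unsigned byte

# Reading the binary form back needs the same format code that wrote it.
(decoded,) = struct.unpack("B", as_binary)
```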
You can always go ahead and zip everything using a standard library like zlib. Afterwards you could use a custom dll written in C++ that unzips your data in chunks you can manage. So basically:
Data --> Zip --> Dll (Loaded by Matlab via LoadLibrary) --> Matlab

Set of unambiguous looking letters & numbers for user input

Is there an existing subset of the alphanumerics that is easier to read? In particular, is there a subset that has fewer characters that are visually ambiguous, and by removing (or equating) certain characters we reduce human error?
I know "visually ambiguous" is somewhat waffly of an expression, but it is fairly evident that D, O and 0 are all similar, and 1 and I are also similar. I would like to maximize the size of the set of alpha-numerics, but minimize the number of characters that are likely to be misinterpreted.
The only precedent I am aware of for such a set is the Canada Postal code system that removes the letters D, F, I, O, Q, and U, and that subset was created to aid the postal system's OCR process.
My initial thought is to use only capital letters and numbers as follows:
A
B = 8
C = G
D = 0 = O = Q
E = F
H
I = J = L = T = 1 = 7
K = X
M
N
P
R
S = 5
U = V = Y
W
Z = 2
3
4
6
9
This problem may be difficult to separate from the given type face. The distinctiveness of the characters in the chosen typeface could significantly affect the potential visual ambiguity of any two characters, but I expect that in most modern typefaces the above characters that are equated will have a similar enough appearance to warrant equating them.
I would be grateful for thoughts on the above – are the above equations suitable, or perhaps are there more characters that should be equated? Would lowercase characters be more suitable?
I needed a replacement for hexadecimal (base 16) for similar reasons (e.g. for encoding a key, etc.), the best I could come up with is the following set of 16 characters, which can be used as a replacement for hexadecimal:
0 1 2 3 4 5 6 7 8 9 A B C D E F Hexadecimal
H M N 3 4 P 6 7 R 9 T W C X Y F Replacement
In the replacement set, we consider the following:
All characters used have major distinguishing features that would only be omitted in a truly awful font.
Vowels A E I O U omitted to avoid accidentally spelling words.
Sets of characters that could potentially be very similar or identical in some fonts are avoided completely (none of the characters in any set are used at all):
0 O D Q
1 I L J
8 B
5 S
2 Z
By avoiding these characters completely, the hope is that the user will enter the correct characters, rather than trying to correct mis-entered characters.
For sets of less similar but potentially confusing characters, we only use one character in each set, hopefully the most distinctive:
Y U V
Here Y is used, since it always has the lower vertical section, and a serif in serif fonts
C G
Here C is used, since it seems less likely that a C would be entered as G, than vice versa
X K
Here X is used, since it is more consistent in most fonts
F E
Here F is used, since it is not a vowel
In the case of these similar sets, entry of any character in the set could be automatically converted to the one that is actually used (the first one listed in each set). Note that E must not be automatically converted to F if hexadecimal input might be used (see below).
Note that there are still similar-sounding letters in the replacement set, this is pretty much unavoidable. When reading aloud, a phonetic alphabet should be used.
Where characters that are also present in standard hexadecimal are used in the replacement set, they are used for the same base-16 value. In theory mixed input of hexadecimal and replacement characters could be supported, provided E is not automatically converted to F.
Since this is just a character replacement, it should be easy to convert to/from hexadecimal.
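As a rough sketch of that conversion in Python, using the two rows of the table above as translation tables. The function names are hypothetical, and the look-alike normalization (U and V to Y, G to C, K to X, E to F) assumes pure replacement-set input, since E must not be converted if hexadecimal input is mixed in.

```python
HEX = "0123456789ABCDEF"
REPLACEMENT = "HMN34P67R9TWCXYF"

_TO_REPLACEMENT = str.maketrans(HEX, REPLACEMENT)
_TO_HEX = str.maketrans(REPLACEMENT, HEX)
# Similar-looking characters folded onto the one character that is actually used.
_LOOKALIKES = str.maketrans("UVGKE", "YYCXF")

def hex_to_replacement(hex_string):
    """Convert a hexadecimal string to the replacement character set."""
    return hex_string.upper().translate(_TO_REPLACEMENT)

def replacement_to_hex(text):
    """Convert user input in the replacement set back to hexadecimal."""
    return text.upper().translate(_LOOKALIKES).translate(_TO_HEX)

encoded = hex_to_replacement("deadbeef")
decoded = replacement_to_hex("xvtxwuyf")   # lower case, with V and U typed for Y
```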
Upper case seems best for the "canonical" form for output, although lower case also looks reasonable, except for "h" and "n", which should still be relatively clear in most fonts:
h m n 3 4 p 6 7 r 9 t w c x y f
Input can of course be case-insensitive.
There are several similar systems for base 32, see http://en.wikipedia.org/wiki/Base32. However, these obviously need to introduce more similar-looking characters, in return for 25% more information per character (5 bits instead of 4).
Apparently the following set was also used for Windows product keys in base 24, but again has more similar-looking characters:
B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
My set of 23 unambiguous characters is:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
I needed a set of unambiguous characters for user input, and I couldn't find anywhere that others have already produced a character set and set of rules that fit my criteria.
My requirements:
No capitals: this is supposed to be used in URIs, and typed by people who might not have a lot of typing experience, for whom even the shift key can slow them down and cause uncertainty. I also want someone to be able to say "all lowercase" so as to reduce uncertainty, so I want to avoid capital letters.
Few or no vowels: an easy way to avoid creating foul language or surprising words is to simply omit most vowels. I think keeping "e" and "y" is ok.
Resolve ambiguity consistently: I'm open to using some ambiguous characters, so long as I only use one character from each group (e.g., out of lowercase s, uppercase S, and five, I might only use five); that way, on the backend, I can just replace any of these ambiguous characters with the one correct character from their group. So, the input string "3Sh" would be replaced with "35h" before I look up its match in my database.
Only needed to create tokens: I don't need to encode information like base64 or base32 do, so the exact number of characters in my set doesn't really matter, besides my wanting it to be as large as possible. It only needs to be useful for producing random UUID-type id tokens.
Strongly prefer non-ambiguity: I think it's much more costly for someone to enter a token and have something go wrong than it is for someone to have to type out a longer token. There's a tradeoff, of course, but I want to strongly prefer non-ambiguity over brevity.
The confusable groups of characters I identified:
A/4
b/6/G
8/B
c/C
f/F
9/g/q
i/I/1/l/7 - just too ambiguous to use; note that european "1" can look a lot like many people's "7"
k/K
o/O/0 - just too ambiguous to use
p/P
s/S/5
v/V
w/W
x/X
y/Y
z/Z/2
Unambiguous characters:
I think this leaves only 9 totally unambiguous lowercase/numeric chars, with no vowels:
d,e,h,j,m,n,r,t,3
Adding back in one character from each of those ambiguous groups (and trying to prefer the character that looks most distinct, while avoiding uppercase), there are 23 characters:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
Analysis:
Using the rule of thumb that a UUID with a numerical equivalent range of N possibilities is sufficient to avoid collisions for sqrt(N) instances:
an 8-digit UUID using this character set should be sufficient to avoid collisions for about 300,000 instances
a 16-digit UUID using this character set should be sufficient to avoid collisions for about 80 billion instances.
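Here is a small Python sketch of how such tokens could be generated, normalized and sized. The look-alike table is my reading of the confusable groups listed above (for example S and s become 5, b and G become 6), and the function names are made up for this example.

```python
import math
import secrets

ALPHABET = "cdefhjkmnprtvwxy2345689"   # the 23 characters listed above

# Ambiguous characters replaced by the one member of their group that is kept.
_LOOKALIKES = str.maketrans("AbGBgqsSzZ", "4668995522")

def new_token(length):
    """Return a random token drawn from the unambiguous alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def normalize(user_input):
    """Fold look-alikes first, then lowercase what remains (C -> c, and so on)."""
    return user_input.translate(_LOOKALIKES).lower()

def safe_instances(length):
    """Square-root rule of thumb: sqrt of the number of possible tokens."""
    return math.isqrt(len(ALPHABET) ** length)

token = new_token(8)
cleaned = normalize("3Sh")
```

With this alphabet, safe_instances(8) is 23^4 = 279,841 and safe_instances(16) is 23^8, about 78 billion, which is where the rounded figures above come from.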
Mainly drawing inspiration from this UX thread, mentioned by @rwb:
Several programs use similar things. The list in your post seems to be very similar to those used in these programs, and I think it should be enough for most purposes. You can always add redundancy (error-correction) to "forgive" minor mistakes; this will require you to space out your codes (see Hamming distance), though.
No references as to the particular method used in deriving the lists, except trial and error with humans (which is great for non-OCR: your users are humans)
It may make sense to use character grouping (say, groups of 5) to increase context ("first character in the second of 5 groups")
Ambiguity can be eliminated by using complete nouns (from a dictionary with few look-alikes; word-edit-distance may be useful here) instead of characters. People may confuse "1" with "i", but few will confuse "one" with "ice".
Another option is to make your code into a (fake) word that can be read out loud. A markov model may help you there.
If you have the option to use only capitals, I created this set based on characters which users commonly mistyped, however this wholly depends on the font they read the text in.
Characters to use: A C D E F G H J K L M N P Q R T U V W X Y 3 4 6 7 9
Characters to avoid:
B similar to 8
I similar to 1
O similar to 0
S similar to 5
Z similar to 2
What you seek is an unambiguous, efficient human-computer code. What I recommend is to encode the entire data with literal (meaningful) words, nouns in particular.
I have been developing software to do just that, and most efficiently. I call it WCode. Technically it's just Base-1024 encoding, wherein you use words instead of symbols.
Here are the links:
Presentation: https://docs.google.com/presentation/d/1sYiXCWIYAWpKAahrGFZ2p5zJX8uMxPccu-oaGOajrGA/edit
Documentation: https://docs.google.com/folder/d/0B0pxLafSqCjKOWhYSFFGOHd1a2c/edit
Project: https://github.com/San13/WCode (Please wait while I get around uploading...)
This would be a general problem in OCR. For end-to-end solutions where the OCR encoding is controlled, specialised fonts have been developed to solve the "visual ambiguity" issue you mention.
See: http://en.wikipedia.org/wiki/OCR-A_font
As additional information: you may want to know about Base32 encoding, in which the symbol for the digit '1' is not used because users may confuse it with the symbol for the letter 'l'.
Unambiguous looking letters for humans are also unambiguous for optical character recognition (OCR). By removing all pairs of letters that are confusing for OCR, one obtains:
!+2345679:BCDEGHKLQSUZadehiopqstu
See https://www.monperrus.net/martin/store-data-paper
It depends how large you want your set to be. For example, just the set {0, 1} will probably work well. Similarly the set of digits only. But probably you want a set that's roughly half the size of the original set of characters.
I have not done this, but here's a suggestion. Pick a font, pick an initial set of characters, and write some code to do the following. Draw each character to fit into an n-by-n square of black and white pixels, for n = 1 through (say) 10. Cut away any all-white rows and columns from the edge, since we're only interested in the black area. That gives you a list of 10 codes for each character. Measure the distance between any two characters by how many of these codes differ. Estimate what distance is acceptable for your application. Then do a brute-force search for a set of characters which are that far apart.
Basically, use a script to simulate squinting at the characters and see which ones you can still tell apart.
Here's some python I wrote to encode and decode integers using the system of characters described above.
def base20encode(i):
    """Convert a non-negative integer into a base-20 string of unambiguous characters."""
    if not isinstance(i, int):
        raise TypeError('This function must be called on an integer.')
    if i < 0:
        raise ValueError('This function must be called on a non-negative integer.')
    chars, s = '012345689ACEHKMNPRUW', ''
    while i > 0:
        i, remainder = divmod(i, 20)
        s = chars[remainder] + s
    return s or chars[0]  # encode 0 as '0' rather than as an empty string

def base20decode(s):
    """Map look-alike characters to their canonical form, then decode the base-20 string."""
    if not isinstance(s, str):
        raise TypeError('This function must be called on a string.')
    # X maps to K (not K to X), because K is the character kept in the alphabet.
    s = s.upper().translate(str.maketrans('BGDOQFIJLT7XSVYZ', '8C000E11111K5UU2'))
    chars, i, exponent = '012345689ACEHKMNPRUW', 0, 1
    for number in s[::-1]:
        i += chars.index(number) * exponent
        exponent *= 20
    return i

print(base20decode(base20encode(10)))  # round trip, prints 10
base58:123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz

Compression in Scala

I'm working in Scala with VERY large lists of Int (the values themselves may also be large) and I need to compress them and hold them in memory.
The only requirement is that I can pull (and decompress) the first number on the list to work with, without touching the rest of the list.
I have many good ideas but most of them translate the numbers to bits.
Example:
you can write any number x as the tuple (floor(log2(x)), x - 2^floor(log2(x))). The first element we write as a string of 1's with a 0 at the end (unary code) and the second in binary, using floor(log2(x)) bits. e.g.:
1 -> 0,0 -> 0
...
5 -> 2,1 -> 110 01
...
8 -> 3,0 -> 1110 000
9 -> 3,1 -> 1110 001
...
While an Int takes a fixed 32 bits of memory and a Long 64, with this compression x requires about 2log(x) bits for storage and can grow indefinitely. This compression does reduce memory in most cases.
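The scheme in the examples (a unary length prefix followed by the low bits) can be sketched in Python as follows. Bits are kept in a plain string of '0' and '1' purely for readability, which is an assumption of this sketch; a real implementation would pack them into bytes or words. The function names are hypothetical.

```python
def encode(x):
    """Encode a positive integer as n ones, a zero, then the low n bits of x."""
    if x < 1:
        raise ValueError("only positive integers can be encoded")
    n = x.bit_length() - 1                      # floor(log2(x))
    low_bits = format(x - (1 << n), f"0{n}b") if n else ""
    return "1" * n + "0" + low_bits

def decode_first(bits):
    """Decode the first number and return it with the untouched rest of the stream."""
    n = 0
    while bits[n] == "1":                       # count the unary prefix
        n += 1
    low_bits = bits[n + 1:n + 1 + n]
    value = (1 << n) + (int(low_bits, 2) if n else 0)
    return value, bits[2 * n + 1:]

stream = "".join(encode(x) for x in [5, 8, 1, 9])
first, rest = decode_first(stream)
```

Each number costs 2 * floor(log2(x)) + 1 bits, and decode_first only looks at the bits of the first code word, which is the property asked for above.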
How would you handle such type of data? Is there something such as bitarray or something?
Any other way to compress such data in Scala?
Thanks
Depending on the sparseness and range of your data set, you may keep your data as a list of deltas instead of numbers. That's used for sound compression, for instance, and can be either lossy or lossless, depending on your needs.
For instance, if you have Int numbers but know they will hardly ever be more than a (signed) Byte apart, you could do something like this list of bytes:
-1 // Use -1 to imply the next number cannot be computed as a byte delta
0, 0, 4, 0 // 1024 encoded as bytes
1 // 1025 as a delta
-5 // 1020 as a delta
-1 // Next number can't be computed as a byte delta
0, 0, -1, -1 // 65535 encoded as bytes -- -1 doesn't have special meaning here
10 // 65545 as a delta
So you don't have to handle bits using this particular encoding. But, really, you won't get good answers without a very clear indication of the particular problem, the characteristics of the data, etc.
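A minimal Python sketch of the byte-delta scheme in the listing above, with signed bytes represented as small integers in a list. It assumes the numbers fit in a signed 32-bit Int stored big-endian, and that a delta of exactly -1 must be escaped because -1 is the marker; the function names are made up for this illustration.

```python
import struct

ESCAPE = -1   # marks "the next four bytes are a full 32-bit number"

def encode_deltas(numbers):
    """Encode 32-bit ints as signed bytes: small deltas directly, the rest escaped."""
    out, prev = [], None
    for n in numbers:
        delta = None if prev is None else n - prev
        if delta is not None and -128 <= delta <= 127 and delta != ESCAPE:
            out.append(delta)
        else:
            out.append(ESCAPE)
            # Big-endian 32-bit value, viewed as four signed bytes.
            out.extend(struct.unpack("4b", struct.pack(">i", n)))
        prev = n
    return out

def decode_deltas(encoded):
    """Inverse of encode_deltas."""
    numbers, i = [], 0
    while i < len(encoded):
        if encoded[i] == ESCAPE:
            (value,) = struct.unpack(">i", struct.pack("4b", *encoded[i + 1:i + 5]))
            i += 5
        else:
            value = numbers[-1] + encoded[i]
            i += 1
        numbers.append(value)
    return numbers

encoded = encode_deltas([1024, 1025, 1020, 65535, 65545])
decoded = decode_deltas(encoded)
```

For this input the encoded list is the same sequence of bytes as in the listing above: 13 bytes instead of the 20 that five plain Ints would take.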
Rereading your question, it seems you are not discarding compression techniques that turn data into bits. If not, then I suggest Huffman -- predictive if needed -- or something from the Lempel-Ziv family.
And, no, Scala has no library to handle binary data, unfortunately. Though paulp probably has something like that in the compiler itself.