linux tools for highest compression of sorted integers - encoding

I am looking for a tried-and-true Linux tools specifically designed for:
simple run length encoding
compression of sorted integers.
1. I have a sorted list of repeated words, one per line, 1.5 Gb text file (617 million lines). There are ~200 unique words, ranging from 1 to < 20 characters in length.
I can "manually compress" using uniq -c to get a 3.4 Kb file that is 1.2 Kb when gzipped. I could then do a "manually decompression" with a simple awk function. A tested, optimized, maintained, dedicated tool is much preferable and less error-prone and more time effective to writing my own code, of course.
gzip --best gives a 1.5Mb file, which is a woefully poor compression ratio for this particular problem.
bzip2 --best gives a 62 Kb compressed file, which is good, but obviously suboptimal compression ratio. And it takes far longer than simply uniq -c.
A simple tool with a straightforward implementation of Run Length Encoding seems optimal, but I cannot find anything standard and reliable.
2. I have a sorted list of positive integers, one per line. Each integer is approximately in the range of 1 Million to 300 million. There is no algorithmic pattern or formula, they are random. But the difference between consecutive integers is tightly distributed around 0 to 30, though with a tail.
Huffman Coding of the difference of consecutive integers (or the difference of the difference) should give very high compression ratio. but I cannot find a simple, dedicated tool for sorted integers.
Another SO answer gives links to C libraries for these problems, but I am looking for a maintained, standalone Linux binary.
These are simple problems, but I don't have time to write my own code, debug it, test it, optimize it, etc. This is a tiny piece of a larger project. I am surprised there are not dedicated Linux utility tools for these problems.

For #1, compress the compressed output with gzip. On the other end, decompress twice. (The maximum compression of the gzip format is 1032:1. For highly redundant data, the result of the compression is itself compressible.)
For #2, if you don't have time to write a super simple piece of code to write five bits for each difference (0..30, with 31 being an escape code followed by the next integer with no differencing), then you can't possibly finish your project.

Related

Loading a CSV File taking forever

I am trying to load a 4.0 GB large CSV file into Matlab. I have 40GB of RAM. However, the table does not seem to finish loading. (Activity Monitor showed fast increase of RAM use up to 38.64GB and stopped after that. CPU still in heavily in use.)
According to the force quit menu of apple, matlab is has not gotten stuck. (I'd guess a missing "Matlab is not responding"-message signals that.)
1st Question: Why does it even take up that much RAM? I've read RAM duplicates. Can I do something in this regard?
2nd Question: Can I speed this project up. Split the CSV somehow?
3rd Question: Can I speed up my computer? It is taking forever, while using only 30% of CPU capacity... Why does it not use more? The vents are not crazy loud, so I guess "it's chilling".
Edit: It went up to 72.80 and is now decreasing...
Edit: Now back down at 55.something
There are a few concepts you should be aware of with Matlab.
Strings are stored as UINT16 (sort of, I can never get this right). Importantly what this means is that every character requires 2 bytes. If you stored the entire file as a long string it would take up 8 GB.
Values, whether they are arrays or scalars, are stored with headers. This means that storing a string (technically a character array, strings - the ones with double quotes instead of single quotes - may be different) requires a header that is roughly 104 bytes. This means something like 'test' requires roughly 108 bytes! If you can store an array of numbers then the 104 byte overhead is minimal. If you have a cell array of scalars, then each scalar is taking up 112 byes (assuming the scalar is an 8 byte double). This might be a bit confusing but in the end it means if you're not careful reading a CSV file your memory requirements can explode.
So what can you do. Tables store columns as arrays where possible. You can try readtable although I think the underlying implementation might not be memory efficient.
For large files Matlab suggests using the datastore function. It will fix your memory problem although it may be a bit slow.
The other option is to read the entire file into memory and to do your own custom processing. For example, assuming you don't have anything escaped (i.e. commas that are not actually delimiters), you can find all relevant delimiters by using:
%Find comma or newline
I = regexp(temp,',|\n')
Here's an example of extracting various columns. As indicated above, this has a large overhead for strings (character arrays) but is efficient for numbers.
%Fake data as an example, 3 columns with middle one numeric
temp = sprintf('asdf,1234,temp\nfred,324,chip\ncheese,12,you are always right');
I = regexp(temp,',|\n');
starts = [0 I];
ends = [I length(temp)+1];
n_columns = 3;
%extract column 2
c2 = arrayfun(#(x,y) str2double(temp(x+1:y-1)),starts(2:n_columns:end),ends(2:n_columns:end));
%extract column 1
c1 = arrayfun(#(x,y) temp(x+1:y-1),...
starts(1:n_columns:end),ends(1:n_columns:end),'un',0);
Depending on your use case this may work or it may not. To read the file into memory you can use fileread
In answer to question(2): It is quite straightforward to split the csv up, assuming there are more rows than columns...
bigfile= csvread(filename);
bigLen=length(bigfile);
size=unint64(bliglen/2)
csvwrite('first.csv', bigfile(1:size,:));
csvwrite('second.csv', bigfile(size:beglen,:));
Or even doing this with SEVERAL files; it may not make it faster overall, but it would allow you to observe the process as each file is read.
I think that MatLab itself has a limit of how much input it is allowed to take in. I'm sure you can set that in the preferences if you have the high enough version.
Check this out: http://www.mathworks.com/help/matlab/matlab_env/set-workspace-and-variable-preferences.html

How well do Non-cryptographic hashes detect errors in data vs. CRC-32 etc.?

Non-cryptographic hashes such as MurmurHash3 and xxHash are almost exclusively designed for hash tables, but they appear to function comparably (and even favorably) to CRC-32, Adler-32 and Fletcher-32. Non-crypto hashes are often faster than CRC-32 and produce more "random" output similar to slow cryptographic hashes (MD5, SHA). Despite this, I only ever see CRC-32 or MD5 recommended for data integrity/checksum purposes.
In the table below, I tested 32-bit checksum/CRC/hash functions to determine how well they detect small differences in data:
The results in each cell means: A) number of collisions found, and B) minimum and maximum probability that any of the 32 output bits are set to 1. To pass test B, the max and min should be as close as possible to 50. Anything under 45 or over 55 indicates bias.
Looking at the table, MurmurHash3 and Jenkins lookup2 compare favorably to CRC-32 (which actually fails one test). They are also well-distributed. DJB2 and FNV1a pass collision tests but aren't well distributed. Fletcher32 and Adler32 struggle with the NullBytes and 8RandBytes tests.
So then my question is, compared to other checksums, how suitable are 'non-cryptographic hashes' for detecting errors or differences in files? Is there any reason a CRC-32/Adler-32/CRC-64 might outperform any decent 32-bit/64-bit hash?
Is there any reason this function would be inferior to CRC-32 or
Adler-32 for detecting errors in data?
Yes, for certain kinds of error characteristics. A CRC can be designed to very effectively detect small numbers of bit errors in a packet, as you might expect on an actual communications or storage channel. That's what it's designed for.
For large numbers of errors, any 32-bit check that fills the 32 bits and does a reasonably good job of being sensitive to all of the bits of the packet will work about as well as any other. So your's would be as good as a CRC-32, and a smidge better than an Adler-32. (The Adler-32 deliberately does not use all possible 32-bit values, so has a slightly higher false positive rate than 32-bit checks that use all possible values.)
By the way, looking a little more at your algorithm, it does not distribute over all 32-bit values until you have many bytes of input. So your check would not be as good as any other 32-bit check on a large number of errors until you have covered the possible 32-bit values of the check.

Elias Gamma Coding and upper bound

While reading about Elias Gamma coding on wikipedia, I see it mentions that:
"Gamma coding is used in applications where the largest encoded value is not known ahead of time."
and that:
"It is used most commonly when coding integers whose upper-bound cannot be determined beforehand."
I don't really understand what is meant by these sentences, because whenever this algorithm is coded, the largest value of the test data or range of the test data would be known before hand. Any help is appreciated!
As far as I'm acquainted with Elias-gamma/delta encoding, the first sentence simply states that these compression methods are global, which means that it does not rely on the input data to generate the code. In other words, these methods do not need to process the input before performing the compression (as local methods do); it compresses the data with a function that does not depend on information from the database.
As for the second sentence, it may be taken as a guarantee that, although there may be some very large integers, the encoding will still perform well (and will represent such values with feasible amount of bytes, i.e., it is a universal method). Notice that, if you knew the biggest integer, some approaches (like minimal hashes) could perform better.
As a last consideration, the same page you referred to also states that:
Gamma coding is used in applications where the largest encoded value is not known ahead of time, or to compress data in which small values are much more frequent than large values.
This may be obtained by generating lists of differences from the original lists of integers, and passing such differences to be compressed instead. For example, in a list of increasing numbers, you could generate:
list: 1 5 29 32 35 36 37
diff: 1 4 24 3 3 1 1
Which will give you many more small numbers, and therefore a greater level of compression, than the first list.

How many string characters should I read to get a good hash?

Here is a little conundrum for you: If you use a hash algorithm like CRC-64 then how many bytes in a string would be necessary to read to calculate a good hash? Lets say all your strings are at least 2 KB long then it seems a waste or resources using the whole string to calculate the cache, but just how many characters do you think is enough? Would just 8 ASCII-characters be enough since it equals 64-bits? Wont using more than 8 ASCII characters just be pointless? I want to know your though on this.
Update:
With a 'good hash' I mean the point where the likelihood of hash collisions can not get any less by using even more bytes to calculate it.
If you use CRC-64 over 8 bytes or less then there is no point in using CRC-64: just use the 8 bytes "as is". A CRC does not have any added value unless the input is longer than the intended output.
As a general rule, if your hash function has an output of n bits then collisions begin to appear once you have accumulated about 2n/2 strings. In shorter words, if you use 64 bits, then it is very unlikely that you encounter a collision in the first 2 billions of strings. If you get a 160-bit or more output, then collisions are virtually unfeasible (you will encounter much less collisions than hardware failures such as the CPU catching fire). This assumes that the hash function is "perfect". If your hash function begins by selecting a few data bytes, then, necessarily, the bytes that you do not select cannot have any influence on the hash output, so you'd better use the "good" bytes -- which utterly depends on the kind of strings that you are hashing. There is no general rule here.
My advice would be to first try using a generic hash function over the whole string; I usually recommend MD4. MD4 is a cryptographic hash function, which has been utterly broken, but for a problem with no security involved, it is still very good at mixing data elements (cryptographically speaking, a CRC is so much more broken than MD4). MD4 has been reported to actually be faster than CRC-32 on some platforms, so you could give it a shot. On a basic PC (my 2.4 GHz Core2), a MD4 implementation works at about 700 MBytes/s, so we are talking about 35000 hashed 2 kB strings per second, which is not bad.
What are the chances that the first 8 letters of two different strings are the same? Depending on what these strings are, it could be very high, in which case you'll definitely get hash collisions.
Hash the whole thing. A few kilobytes is nothing. Unless you actually have a need to save nanoseconds in your program, not hashing the full strings would be premature optimization.

array data compression that is holding 13268 bits(1.66kBytes)

i.e array is having 100*125 bits of data for each aircraft+8 ascii messages each of 12 characters
what compression technique should i apply to such data
Depends mostly on what those 12500 bits look like, since that's the biggest part of your data. If there aren't any real patterns in it, or if they aren't byte-sized or word-sized patterns, "compressing" it may actually make it bigger, since almost every compression algorithm will add a small amount of extra data just to make decompression possible.