Why doesn't lz4 keep small/uncompressible values uncompressed? - lossless-compression

When compressing small values (< 500 bytes or so), and for uncompressible random values,
lz4 returns data that is much larger than the original value (e.g. 27 bytes from 4).
When a large number of such values are compressed separately (e.g. in a key-value store), this adds up.
The question is: why doesn't lz4 use, say, a separate magic number for values that didn't become smaller after compression, leaving the original data as-is and adding only 4 bytes of overhead?
The same applies to many other compression formats.
Code with demonstration: https://jsfiddle.net/gczy7f3k/2/
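As an illustration of the workaround the question proposes, here is a minimal sketch in Python (assuming the third-party python-lz4 package; this wrapper is not part of lz4 itself): a one-byte flag marks whether the stored value is compressed or kept raw.

```python
# Minimal sketch (assumes the third-party `lz4` package is installed).
# Stores the value raw with a 1-byte flag whenever compression does not help,
# which is the "separate magic number" idea from the question.
import lz4.frame

RAW, COMPRESSED = b"\x00", b"\x01"

def pack(value: bytes) -> bytes:
    compressed = lz4.frame.compress(value)
    if len(compressed) < len(value):
        return COMPRESSED + compressed
    return RAW + value          # small/incompressible: keep as-is, 1 byte overhead

def unpack(blob: bytes) -> bytes:
    flag, payload = blob[:1], blob[1:]
    return lz4.frame.decompress(payload) if flag == COMPRESSED else payload

value = b"abcd"                             # 4-byte value
print(len(lz4.frame.compress(value)))       # noticeably larger than 4
print(len(pack(value)))                     # 5: the original plus the flag byte
assert unpack(pack(value)) == value
```

With such a wrapper the worst case per value is one byte of overhead instead of the full frame header.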

Related

MATLAB: Large text file to Matrix conversion

I need to read a text file in MATLAB containing 436 float values in each line (text file size is 25GB, so you can estimate the number of lines) and then convert it to a matrix so that I can take the transpose. How to do it? Won't the format specifier be too long?
Let's assume that the floats in your file are written in a format with 15 digits after the decimal point, so each float takes 17 characters. Let's also assume that they are separated by one character and that there is a \n at the end of each line, giving 436*18 = 7848 characters per line, each using one byte in ASCII. Your file is about 25 GB, so you have roughly (25*2^30)/7848 ≈ 3.42e+06 lines (using 2^30 bytes per gigabyte; the scale is roughly the same if you prefer the 10^9-byte definition).
So a matrix of size 4e6 x 436 (I'm using a larger upper bound on your matrix size) will, assuming each float takes 4 bytes, take roughly 6.5 GB. This is nothing crazy, and you can find this amount of contiguous memory to allocate when reading the matrix with the load function, given a decent amount of RAM on your machine. I have 8 GB at the moment, and rand(4*1e6,436) allocates the desired amount of memory, although it ends up using swap space and slowing down. I assume that load itself has some overhead, but if you have 16 GB of RAM (which is not crazy nowadays) you can safely go with load.
Now, if you think that you won't find that much contiguous memory, I suggest that you just split your file into chunks, say 10 matrices, and load and transpose them separately. How you do that is up to you and depends on the application, and on whether there are any sparsity patterns in the data. Also, make sure (if you don't need the extra precision) that you are using single-precision floats.
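For concreteness, the same back-of-the-envelope arithmetic in Python (the per-character and per-float byte counts are the assumptions stated above, not measurements):

```python
# Rough size estimate from the answer's assumptions (sketch, not MATLAB code).
chars_per_float = 17 + 1                  # 17 chars per float + 1 separator/newline
chars_per_line  = 436 * chars_per_float   # 7848 bytes per line (ASCII: 1 byte/char)
file_bytes      = 25 * 2**30              # 25 GB, using 2^30 bytes per gigabyte

lines = file_bytes / chars_per_line
print(f"estimated lines: {lines:.4e}")    # ~3.42e+06

rows = 4_000_000                          # generous upper bound used in the answer
single = rows * 436 * 4 / 2**30           # single precision, 4 bytes per value
double = rows * 436 * 8 / 2**30           # double precision, 8 bytes per value
print(f"matrix in RAM: {single:.2f} GB single, {double:.2f} GB double")
```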

PNG: What is the benefit of using multiple IDAT-Chunks?

I would like to know what the benefit of using multiple IDAT-Chunks inside a PNG Image is.
The PNG documentation says
There may be multiple IDAT chunks; if so, they shall appear consecutively with no other intervening chunks. The compressed datastream is then the concatenation of the contents of the data fields of all the IDAT chunks.
I can't imagine it's because of the maximum size (2^32 bytes) of the data-block inside the chunk.
Recall that all PNG chunks (including IDAT chunks) have a prefix with the chunk length. To put all the compressed stream in a single huge IDAT chunk would cause these two inconveniences:
On the encoder side: the compressor doesn't know the total compressed data size until it has finished the compression. Then, it would need to buffer the full compressed data in memory before writing the chunk prefix.
On the decoder side: it depends on how chunk decoding is implemented. If the decoder buffers each chunk in memory (allocating the space given by the chunk length prefix) and, after filling it and checking the CRC, passes the content to the decompressor, then, again, a single huge IDAT chunk would be a memory hog.
Considering this, I believe that using rather small IDAT chunks (say, 16KB or 64KB) should be recommended practice. The overhead (12 bytes per chunk, less than 1/5000 if len=64KB) is negligible.
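To make the concatenation rule from the spec quote concrete, here is a minimal Python sketch (standard library only; "example.png" is a hypothetical file name) of how a reader joins the IDAT payloads into a single zlib stream, whatever the chunk sizes happen to be:

```python
# Sketch: read a PNG's chunks and decompress the concatenated IDAT data.
import struct, zlib

def idat_stream(path):
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n"    # PNG signature
        idat = bytearray()
        while True:
            length, ctype = struct.unpack(">I4s", f.read(8))
            data = f.read(length)
            f.read(4)                               # skip the CRC (not verified here)
            if ctype == b"IDAT":
                idat += data                        # concatenate all IDAT payloads
            elif ctype == b"IEND":
                break
        return zlib.decompress(bytes(idat))         # one zlib stream, however it was chunked

raw = idat_stream("example.png")                    # filtered scanlines, still to be unfiltered
print(len(raw), "bytes of filtered scanline data")
```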
It appears that when reading a PNG file, libpng limits the chunks of data it buffers to 8192 bytes even if the IDAT chunk size in the file is larger. This puts an upper limit on the allocation size needed for libpng to read and decompress IDAT chunks. However, a checksum error still cannot be detected until the entire IDAT chunk has been read and this could take much longer with large IDAT chunks.
Assuming you're not concerned with early detection of CRC errors (if they do occur, they'll still be detected, just later on), small IDAT chunks don't offer any benefit to the reader. Indeed, small IDAT chunks mean more separate calls into zlib and more per-call preamble/postamble cost, so they're generally less efficient in processing time as well as in space on disk.
For the writer, it's convenient to write finite-length IDAT chunks because you can determine before the write how long the chunk will be. If you want to write a single IDAT chunk then you must either complete the compression before beginning to write anything (requiring a lot of temporary storage), or you must seek within your output to update the IDAT chunk length once you know how long it is.
If you're compressing the image and streaming the result concurrently this might be impossible. If you're writing the image to disk then this is probably not a big deal.
In short, small chunks are for the compressing-on-the-fly, streaming-output use case. In most other situations you're better off with just a single chunk.
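For the streaming-writer case described above, here is a sketch in Python (standard library only; names and the 64 KB chunk size are arbitrary choices, not a libpng API) that emits compressed output as bounded-size IDAT chunks:

```python
# Sketch: emit zlib output as a sequence of bounded-size IDAT chunks while streaming.
# `out` is any writable binary stream; `scanlines` must already be filtered PNG rows.
import struct, zlib

def write_chunk(out, ctype: bytes, data: bytes):
    out.write(struct.pack(">I", len(data)))          # 4-byte length prefix
    out.write(ctype + data)
    out.write(struct.pack(">I", zlib.crc32(ctype + data)))   # CRC over type + data

def write_idat_stream(out, scanlines, chunk_size=64 * 1024):
    comp = zlib.compressobj()
    buf = bytearray()
    for row in scanlines:                  # each row: filter byte + pixel bytes
        buf += comp.compress(row)
        while len(buf) >= chunk_size:      # flush full-size IDAT chunks as we go
            write_chunk(out, b"IDAT", bytes(buf[:chunk_size]))
            del buf[:chunk_size]
    buf += comp.flush()
    for i in range(0, len(buf), chunk_size):
        write_chunk(out, b"IDAT", bytes(buf[i:i + chunk_size]))
```

Each chunk's length is known before it is written, so neither whole-image buffering nor seeking back is needed; the price is the 12 bytes of per-chunk overhead mentioned above.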

After encoding data size is increasing

I have text data in XML format and its length is around 816814 bytes. It contains some image data as well as some text data.
We are using the ZLIB algorithm for compression, and after compressing, the compressed data length is 487239 bytes.
After compressing, we encode the data using BASE64Encoder. But after encoding the compressed data, the size increases: the length of the encoded data is 666748 bytes.
Why does the size increase after encoding? Is there any better encoding technique?
Regards,
Siddesh
As noted, when you are encoding binary 8-bit bytes with 256 possible values into a smaller set of characters, in this case 64 values, you will necessarily increase the size. For a set of n allowed characters, the expansion factor for random binary input will be log(256)/log(n), at a minimum.
If you would like to reduce this impact, then use more characters. Chances are that whatever medium you are using, it can handle more than 64 characters transparently. Find out how many by simply sending all 256 possible bytes, and see which ones make it through. Test the candidate set thoroughly, and then ideally find documentation of the medium that backs up that set of n < 256.
Once you have the set, then you can use a simple hard-wired arithmetic code to convert from the set of 256 to the set of n and back.
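The expansion factor is easy to check numerically; a small Python sketch (the 487239-byte figure is the compressed size from the question, the random bytes are just a stand-in):

```python
# Sketch: expansion factor when coding 8-bit bytes into a set of n allowed characters.
import base64, math, os

for n in (16, 64, 85, 128, 256):
    print(n, math.log(256) / math.log(n))    # 2.0, 1.333..., ~1.25, ~1.14, 1.0

compressed_len = 487239                      # size after zlib, from the question
print(math.ceil(compressed_len / 3) * 4)     # Base64 output: 649652 bytes (4 chars per 3 bytes)

blob = os.urandom(1000)                      # random bytes stand in for compressed data
print(len(base64.b64encode(blob)) / len(blob))   # ≈ 1.33
```

That 649652 is close to the 666748 bytes reported in the question; the remaining difference is consistent with the encoder inserting a two-byte line break after every 76 characters of output.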
That is perfectly normal.
Base64 encoding is required if your transmission medium is not designed to carry binary data, only textual data (e.g. XML).
So your zlib-compressed data gets base64 encoded.
Plainly speaking, the transcoder turns "non-ASCII" bytes into an ASCII form in a way that can be reversed.
As a rule of thumb, it's around a 33% size increase ( http://en.wikipedia.org/wiki/Base64#Examples )
This is the downside of base64. You are better off using a protocol that supports file transfer... but for files encoded within XML, you are pretty much out of options.

How to apply RLE to a binary image?

I have a binary image, and I need to compress it using run-length encoding (RLE). I used the regular RLE algorithm with a maximum run count of 16.
Instead of reducing the file size, it increases it. For example, in a 5*5 matrix, 10 values have a repeat count of one, which makes the file bigger.
How can I avoid this? Is there a better way, such as applying RLE only partially to the matrix?
If it's for your own use only, you can create a custom image file format, and in the header you can mark whether RLE is used or not, the range of X and Y coordinates, and possibly the bit planes for which it is used. But if you want to produce an image file that follows a defined image file format that uses RLE (.pcx comes to mind), you must follow the file format specification. If I remember correctly, in .pcx there wasn't any option to disable RLE partially.
If you are not required to use RLE and you are only looking for an easy-to-implement compression method, then before using any compression, first check how many bytes your 5x5 binary matrix file takes. If the file size is 25 bytes or more, then you are storing each element in at least one byte (8 bits), or alternatively you have a lot of data which is not matrix content. If you don't need to store the size, a 5x5 binary matrix takes 25 bits, which is 4 bytes and 1 bit, so practically 5 bytes. I'm quite sure there's no compression method that is generally useful for files of 5 bytes. If you have matrices of different sizes, you can use e.g. unsigned 16-bit integer fields (2 bytes each) for a maximum matrix width/height of 65535, or unsigned 32-bit integer fields (4 bytes each) for a maximum of 4294967295.
For example 100x100 binary matrix takes 10000 bits, which is 1250 bytes. Add 2 x 2 = 4 bytes for 16-bit size fields or 2 x 4 = 8 bytes for 32-bit size fields. After this, you can plan what would be the best compression method.
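For comparison, here is a Python sketch of the plain bit-packing described above (with 16-bit size fields) next to a naive RLE with 4-bit run counts, as in the question; the 5x5 checkerboard matrix is made up purely for illustration:

```python
# Sketch: store a binary matrix packed to 1 bit per element with 16-bit size fields,
# and compare against a naive RLE with 4-bit run counts (max 16, as in the question).
import struct

def pack_bits(matrix):
    rows, cols = len(matrix), len(matrix[0])
    bits = [b for row in matrix for b in row]
    bits += [0] * (-len(bits) % 8)              # pad to a whole number of bytes
    packed = bytearray()
    for k in range(0, len(bits), 8):
        byte = 0
        for bit in bits[k:k + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return struct.pack(">HH", rows, cols) + bytes(packed)   # 2 x 16-bit size header

def count_runs(matrix, max_run=16):
    bits = [b for row in matrix for b in row]
    runs = 0
    while bits:
        run = 1
        while run < len(bits) and bits[run] == bits[0] and run < max_run:
            run += 1
        runs += 1                                # each run costs ~5 bits (value + count)
        bits = bits[run:]
    return runs

m5 = [[(r + c) % 2 for c in range(5)] for r in range(5)]   # 5x5 checkerboard
print(len(pack_bits(m5)))   # 8 bytes: 4-byte header + 4 bytes of packed bits
print(count_runs(m5))       # 25 runs of length 1: RLE would make it bigger
```

On a matrix with long uniform runs the RLE side would win; on small or noisy matrices, packed bits plus the fixed header are hard to beat.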

Array data compression for an array holding 13268 bits (1.66 kBytes)

i.e. the array holds 100*125 bits of data for each aircraft, plus 8 ASCII messages of 12 characters each.
What compression technique should I apply to such data?
Depends mostly on what those 12500 bits look like, since that's the biggest part of your data. If there aren't any real patterns in it, or if they aren't byte-sized or word-sized patterns, "compressing" it may actually make it bigger, since almost every compression algorithm will add a small amount of extra data just to make decompression possible.
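A quick way to see this point in practice (Python sketch; random bytes merely stand in for the 12500-bit blocks, which aren't shown in the question):

```python
# Sketch: generic compressors expand truly random data slightly and only help
# when the input has byte- or word-level structure.
import os, zlib

random_block = os.urandom(12500 // 8)                   # 1562 bytes, no structure
patterned    = bytes([0xAA, 0x55] * (12500 // 16))      # same size, obvious pattern

print(len(random_block), "->", len(zlib.compress(random_block, 9)))   # slightly larger
print(len(patterned),    "->", len(zlib.compress(patterned, 9)))      # much smaller
```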