zlib: how to decompress a stream of compressed data chunks?

I have a problem transferring compressed data across a network.
The data size is around a few hundred MB. My plan is to divide the data into 1 MB chunks, compress each chunk with zlib, and then stream the compressed data over the network. On the other end, the data will be decompressed with zlib.
My question: since I stream the compressed data, there is no information about where each compressed chunk starts and ends in the stream. I am not sure whether zlib can decompress such a stream.
If it can, please kindly let me know which flush mode I should use in the deflate/inflate calls.
Thanks!

It is not clear why you are dividing the data into chunks, or why you would need to do any special flushing. If you just mean feeding the data to zlib in chunks, that is how zlib is normally used. zlib doesn't care how you feed it the data -- big chunks, little chunks, one byte at a time, one giant chunk, etc. -- the compressed result will not change.
Flushing does change the compressed result, degrading it slightly or significantly depending on how often you flush and which flush mode you use.
Flushing is used when you want to ensure that some portion of the data is fully recoverable at a known boundary in the compressed data, or when you want to be able to recover part of the data if not all of it is received.
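To illustrate this, here is a minimal C sketch of a plain inflate loop that consumes the compressed stream in whatever sized pieces the network happens to deliver (recv_some() is a hypothetical stand-in for your network read, and the 16 KB buffer sizes are arbitrary); zlib neither knows nor cares where your 1 MB chunks began:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Hypothetical network read: fills buf with up to max bytes,
   returns the number of bytes read, 0 at end of stream. */
extern size_t recv_some(unsigned char *buf, size_t max);

int decompress_stream(void)
{
    unsigned char in[16384], out[16384];
    z_stream strm;
    int ret = Z_OK;

    memset(&strm, 0, sizeof(strm));
    if (inflateInit(&strm) != Z_OK)
        return -1;

    do {
        strm.avail_in = (uInt)recv_some(in, sizeof(in));  /* any piece size works */
        if (strm.avail_in == 0)
            break;                                 /* no more input */
        strm.next_in = in;

        /* run inflate() until this piece of input has been consumed */
        do {
            strm.next_out = out;
            strm.avail_out = sizeof(out);
            ret = inflate(&strm, Z_NO_FLUSH);
            if (ret != Z_OK && ret != Z_STREAM_END) {
                inflateEnd(&strm);
                return -1;                         /* corrupt or truncated data */
            }
            size_t have = sizeof(out) - strm.avail_out;
            fwrite(out, 1, have, stdout);          /* consume the decompressed bytes */
        } while (strm.avail_out == 0);
    } while (ret != Z_STREAM_END);

    inflateEnd(&strm);
    return ret == Z_STREAM_END ? 0 : -1;
}

This is essentially the pattern from zlib's own zpipe.c example, with the file read replaced by a network read.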

If the chunking strategy is a must, you could define a simple framing protocol between your host and the remote end, such as:
02 123456 the-compressed-data 654321 the-compressed-data
The three numbers are, respectively:
1. the number of chunks of data (here, 2 chunks)
2. the length in bytes of the first compressed chunk
3. the length in bytes of the second compressed chunk
A sketch of this kind of framing is shown below.
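For example, here is a minimal sketch of the sending side under such a scheme, assuming each chunk is compressed independently with zlib's compress2() so the receiver can call uncompress() on each framed chunk by itself. send_all() is a hypothetical socket-write helper, and the 4-byte big-endian length prefix is just one possible choice; the chunk count, if you need it, can be sent the same way up front.

#include <stdlib.h>
#include <zlib.h>

/* Hypothetical socket write: returns 0 on success, nonzero on error. */
extern int send_all(const void *buf, size_t len);

/* Compress one chunk on its own and send it as
   [4-byte big-endian compressed length][compressed bytes]. */
static int send_one_chunk(const unsigned char *data, uLong len)
{
    uLongf clen = compressBound(len);
    unsigned char *cbuf = malloc(clen);

    if (cbuf == NULL ||
        compress2(cbuf, &clen, data, len, Z_DEFAULT_COMPRESSION) != Z_OK) {
        free(cbuf);
        return -1;
    }

    unsigned char hdr[4] = {
        (unsigned char)(clen >> 24), (unsigned char)(clen >> 16),
        (unsigned char)(clen >> 8),  (unsigned char)(clen)
    };
    int rc = send_all(hdr, sizeof(hdr)) || send_all(cbuf, clen);

    free(cbuf);
    return rc ? -1 : 0;
}

The receiver then reads a length, reads exactly that many bytes, and hands them to uncompress() (or an inflate loop) one chunk at a time. Note that compressing 1 MB chunks independently gives up a little compression compared with one continuous stream, since each chunk starts with an empty dictionary.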

Kafka data compression technique

I loaded data (selective data) from Oracle into Kafka with a replication factor of 1 (so only one copy), and the data size in Kafka is 1 TB. Kafka stores the data in a compressed format, but I want to know the actual data size in Oracle. Since we loaded only selected tables and data, I am not able to check the actual data size in Oracle directly. Is there a formula I can apply to estimate the Oracle data size corresponding to this 1 TB loaded into Kafka?
Kafka version - 2.1
Also, it took 4 hours to move the data from Oracle to Kafka, and the data size over the wire could be different. How can I estimate the amount of data sent over the wire and the bandwidth consumed?
There is as yet insufficient data for a meaningful answer.
Kafka supports GZip, LZ4 and Snappy compression, with different compression factors and different saturation thresholds. All three methods are "learning based": they consume bytes from a stream, build a dictionary, and output symbols from that dictionary. As a result, short data streams will not be compressed very much, because the dictionary hasn't learned much yet. And if the dictionary becomes unsuitable for the new incoming bytes, the compression ratio goes down again.
This means that the structure of the data can completely change the compression performance.
On the whole, in real-world applications with reasonable data (i.e. not a sparse DTM matrix or a storage system for PDF or Office documents) you can expect on average between 1.2x and 2.0x compression. The larger the data chunks, the higher the compression. The actual content of the "message" also carries great weight, as you can imagine.
Oracle then allocates data in data blocks, which means you get some slack space overhead, but then again it can compress those blocks. Oracle also performs deduplication in some instances.
Therefore, a meaningful and reasonably precise answer would have to depend on several factors that we don't know here.
As a ballpark figure, I'd say that the actual "logical" data behind that 1 TB in Kafka ought to range between 0.7 and 2 TB, and I'd expect the Oracle footprint to be anywhere from 0.9 to 1.2 TB if compression is available on the Oracle side, or 1.2 to 2.4 TB if it is not.
But this is totally a shot in the dark. You could have already-compressed binary data stored (say, XLSX or JPEG-2000 files or MP3 songs), and that would actually grow in size when compression is applied. Or you might have swaths of sparse matrix data that can easily compress 20:1 or more even with the most cursory gzipping. In the first case, the 1 TB might remain more or less 1 TB with compression removed; in the second case, the same 1 TB could just as easily grow to 20 TB or more.
I am afraid the simplest way to know is to instrument both storage systems and the network, and directly monitor traffic and data usage.
Once you know the parameters of your setup, you can extrapolate them to different amounts of data (so, say, if 1 TB in Kafka requires 2.5 TB of network traffic and becomes 2.1 TB of Oracle tablespace, then it stands to reason that 2 TB in Kafka would require 5 TB of traffic and occupy 4.2 TB on the Oracle side)... but even then, only provided the nature of the data does not change in the interim.

Put PNG scanline image data into a zlib stream with no compression?

I am making a simple PNG image from scratch and I already have the scanline data for it. Now I want to put that data into a zlib stream without compressing it. How can I do that? I have read the "ZLIB Compressed Data Format Specification version 3.3" at https://www.ietf.org/rfc/rfc1950.txt but still don't understand it. Could someone give me a hint about how the bytes of the zlib stream are laid out?
Thanks in advance!
As mentioned in RFC1950, the details of the compression algorithm are described in another RFC: DEFLATE Compressed Data Format Specification version 1.3 (RFC1951).
There we find
3.2.3. Details of block format
Each block of compressed data begins with 3 header bits
containing the following data:
first bit BFINAL
next 2 bits BTYPE
Note that the header bits do not necessarily begin on a byte
boundary, since a block does not necessarily occupy an integral
number of bytes.
BFINAL is set if and only if this is the last block of the data
set.
BTYPE specifies how the data are compressed, as follows:
00 - no compression
[... a few other types]
which is the one you want. These 2 bits of BTYPE, in combination with the last-block marker BFINAL, are all you need to write "uncompressed" zlib-compatible data:
3.2.4. Non-compressed blocks (BTYPE=00)
Any bits of input up to the next byte boundary are ignored.
The rest of the block consists of the following information:
0 1 2 3 4...
+---+---+---+---+================================+
| LEN | NLEN |... LEN bytes of literal data...|
+---+---+---+---+================================+
LEN is the number of data bytes in the block. NLEN is the
one's complement of LEN.
So, the pseudo-algorithm is:
1. Write the initial 2 bytes 78 9c ("default compression").
2. For every block of 32768 bytes or fewerᵃ:
   - if it's the last block, write 01, else write 00
   - write [block length] [COMP(block length)]ᵇ
   - write the literal data itself
3. Repeat until all data is written.
Don't forget to add the Adler-32 checksum at the end, in big-endian order, after 'compressing' the data this way. The Adler-32 checksum verifies the uncompressed, original data. In the case of PNG images, that data has already been passed through the PNG filters, with a filter-type byte prefixed to each scanline -- and that is "the" data that gets compressed by this DEFLATE-compatible scheme.
ᵃ This is a value that happened to be convenient for me at the time; it ought to be safe to write blocks as large as 65535 bytes (just don't try to cross that line).
ᵇ Both as words with the low byte first, then high byte. It is briefly mentioned in the introduction.
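As a rough C sketch of the recipe above (hedged: put(), a single-byte output helper, is a hypothetical stand-in for however you emit bytes, and this version uses 65535-byte stored blocks, the maximum mentioned in footnote a; adler32() comes from zlib itself):

#include <stddef.h>
#include <zlib.h>   /* only used here for adler32() */

/* Hypothetical output helper: append one byte to the output stream. */
extern void put(unsigned char c);

/* Wrap 'data' in a zlib stream made only of stored (BTYPE=00) blocks. */
void zlib_store(const unsigned char *data, size_t len)
{
    put(0x78); put(0x9c);                        /* zlib header, "default compression" */

    uLong adler = adler32(0L, Z_NULL, 0);        /* running Adler-32 of the raw data */
    do {
        size_t n = len > 65535 ? 65535 : len;    /* a stored block holds at most 65535 bytes */

        put(len <= 65535 ? 0x01 : 0x00);         /* BFINAL=1 on the last block, BTYPE=00 */
        put(n & 0xff);  put((n >> 8) & 0xff);    /* LEN, low byte first */
        put(~n & 0xff); put((~n >> 8) & 0xff);   /* NLEN, one's complement of LEN */

        for (size_t i = 0; i < n; i++)
            put(data[i]);                        /* the literal bytes */

        adler = adler32(adler, data, (uInt)n);
        data += n;
        len  -= n;
    } while (len > 0);

    put((adler >> 24) & 0xff);                   /* Adler-32 trailer, big-endian */
    put((adler >> 16) & 0xff);
    put((adler >>  8) & 0xff);
    put( adler        & 0xff);
}

For a PNG, 'data' here would be the filtered image data (a filter-type byte plus the scanline bytes, for every row), and the resulting stream is what you would split across one or more IDAT chunks.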

Adler-32 checksum(s) in a PNG file

I am currently writing a C program that builds a PNG image from a data file generated by another. The image is a palette type.
Is the Adler-32 checksum calculated on the uncompressed data for...
a) each compressed block within an IDAT data chunk?
b) all compressed blocks within an IDAT data chunk?
c) all compressed blocks spanning all IDAT data chunks?
From the documents at http://www.w3.org/TR/PNG/, https://www.rfc-editor.org/rfc/rfc1950 and RFC1951 (at the same address as the previous one), I am of the opinion that it is case 'c' above, allowing one's deflate implementation to chop and change how the data are compressed for each block, and to disregard how the compressed blocks are split between consecutive IDAT chunks.
Is this correct?
There can be only one compressed image data stream in a PNG file, and that is a single zlib stream with a single Adler-32 check at the end that is the Adler-32 of all of the uncompressed data (as pre-processed by the filters and interlacing). That zlib stream may or may not be broken up into multiple IDAT chunks. Each IDAT chunk has its own CRC-32, which is the CRC-32 of the chunk type code "IDAT" and the compressed data within.
I'm not sure what you mean by "allowing one's deflate implementation to chop and change how the data are compressed for each block". The deflate implementation for a valid PNG file must compress all of the filtered image data as a single zlib stream.
After you've compressed it as a single zlib stream, you can break up that stream however you like into a series of IDAT chunks, or as a single IDAT chunk.
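To make the two different checks concrete, here is a hedged C sketch of writing one IDAT chunk from an already-compressed buffer, using zlib's crc32(); write_bytes() and write_be32() are assumed output helpers. The chunk's CRC-32 covers the chunk type "IDAT" plus the compressed bytes it carries, while the Adler-32 of the uncompressed image data is already sitting inside the zlib stream itself, as its last four bytes:

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>   /* crc32() */

/* Assumed output helpers. */
extern void write_bytes(const unsigned char *p, size_t n);
extern void write_be32(uint32_t v);

/* Emit one IDAT chunk carrying 'n' bytes of the (already zlib-compressed)
   image datastream. */
void write_idat_chunk(const unsigned char *compressed, size_t n)
{
    write_be32((uint32_t)n);                        /* chunk length */
    write_bytes((const unsigned char *)"IDAT", 4);  /* chunk type   */
    write_bytes(compressed, n);                     /* chunk data   */

    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, (const Bytef *)"IDAT", 4);
    crc = crc32(crc, compressed, (uInt)n);
    write_be32((uint32_t)crc);                      /* chunk CRC-32 */
}

Calling this once with the whole compressed stream, or many times with consecutive slices of it, produces an equally valid PNG image datastream.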
PNG IDAT chunks are independent of the compressed blocks. The Adler-32 checksum is part of the zlib compression only and has nothing to do with the PNG's overall chunk structure.
From the PNG Specification:
There can be multiple IDAT chunks; if so, they must appear consecutively with no other intervening chunks. The compressed datastream is then the concatenation of the contents of all the IDAT chunks. The encoder can divide the compressed datastream into IDAT chunks however it wishes. (Multiple IDAT chunks are allowed so that encoders can work in a fixed amount of memory; typically the chunk size will correspond to the encoder's buffer size.) It is important to emphasize that IDAT chunk boundaries have no semantic significance and can occur at any point in the compressed datastream.
(emphasis mine)

PNG: What is the benefit of using multiple IDAT chunks?

I would like to know what the benefit is of using multiple IDAT chunks inside a PNG image.
The PNG documentation says
There may be multiple IDAT chunks; if so, they shall appear consecutively with no other intervening chunks. The compressed datastream is then the concatenation of the contents of the data fields of all the IDAT chunks.
I can't imagine it's because of the maximum size (2^31-1 bytes) of the data block inside a chunk.
Recall that all PNG chunks (including IDAT chunks) are prefixed with the chunk length. Putting the entire compressed stream into a single huge IDAT chunk would cause two inconveniences:
On the encoder side: the compressor doesn't know the total compressed size until it has finished compressing, so it would need to buffer the full compressed data in memory before writing the chunk's length prefix.
On the decoder side: it depends on how chunk decoding is implemented. If the decoder buffers each chunk in memory (allocating the space given by the chunk length prefix) and, after filling it and checking the CRC, passes the content to the decompressor, then again a single huge IDAT chunk would be a memory hog.
Considering this, I believe that using rather small IDAT chunks (say, 16 KB or 64 KB) should be recommended practice. The overhead (12 bytes per chunk, less than 1/5000 for 64 KB chunks) is negligible.
It appears that when reading a PNG file, libpng limits the chunks of data it buffers to 8192 bytes even if the IDAT chunk size in the file is larger. This puts an upper limit on the allocation size needed for libpng to read and decompress IDAT chunks. However, a checksum error still cannot be detected until the entire IDAT chunk has been read and this could take much longer with large IDAT chunks.
Assuming you're not concerned with early detection of CRC errors (if they occur, they'll still be detected, just later on), small IDAT chunks don't offer any benefit to the reader. Indeed, small IDAT chunks imply more separate calls to zlib and more preamble/postamble costs within zlib, so they're generally less efficient in processing time as well as in space on disk.
For the writer, it's convenient to write finite-length IDAT chunks because you can determine before the write how long the chunk will be. If you want to write a single IDAT chunk then you must either complete the compression before beginning to write anything (requiring a lot of temporary storage), or you must seek within your output to update the IDAT chunk length once you know how long it is.
If you're compressing the image and streaming the result concurrently this might be impossible. If you're writing the image to disk then this is probably not a big deal.
In short, small chunks are for the compressing-on-the-fly, streaming-output use case. In most other situations you're better off with just a single chunk.
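For what it's worth, here is a hedged C sketch of that streaming case: filtered scanline data is fed into an already-initialized deflate z_stream, and an IDAT chunk is emitted each time an arbitrarily chosen 16 KB output buffer fills; emit_idat() is a hypothetical helper that writes one complete chunk (length, "IDAT", data, CRC-32):

#include <stddef.h>
#include <zlib.h>

/* Hypothetical helper: write one complete IDAT chunk containing these bytes. */
extern void emit_idat(const unsigned char *data, size_t n);

/* Feed a piece of filtered image data into an already-initialized deflate
   stream, emitting an IDAT chunk whenever the output buffer fills.
   Call with last = 1 for the final piece of image data. */
void stream_idat(z_stream *strm, unsigned char *rows, size_t len, int last)
{
    unsigned char out[16384];                 /* arbitrary buffer / chunk size */
    int flush = last ? Z_FINISH : Z_NO_FLUSH;

    strm->next_in = rows;
    strm->avail_in = (uInt)len;
    do {
        strm->next_out = out;
        strm->avail_out = sizeof(out);
        deflate(strm, flush);                 /* returns Z_STREAM_END once everything is flushed */

        size_t have = sizeof(out) - strm->avail_out;
        if (have > 0)
            emit_idat(out, have);             /* one chunk per filled (or final) buffer */
    } while (strm->avail_out == 0);
}

The total compressed size never needs to be known in advance, at the cost of writing one small chunk header per buffer-full of output.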