How to read and decode the zlib and deflate blocks in a PNG image file

Recently I started writing a PNG decoder for fun. To understand the format and the compression, I have read the following:
PNG (Portable Network Graphics) Specification, Version 1.2
RFC 1950("ZLIB Compressed Data Format Specification")
RFC 1951 ("DEFLATE Compressed Data Format Specification")
From these I have a basic understanding of the format and the compression.
So at first I decided to implement a simple decoder for a simple PNG image file, and for that I have chosen a PNG image containing a single IDAT chunk, which in turn contains a single zlib stream and deflate block.
The image file used is this:
Image file used for decoding
I have extracted the zlib data from the image file, and in a hex editor it looks like this:
Hex view of IDAT CHUNK
The binary representation of the part marked in red is this:
Binary representation of zlib block
From what I have understood from reading the specs, I have decoded it as follows:
Decoded binary representation of zlib block
BFINAL=FINAL BLOCK
BTYPE=DYNAMIC HUFFMAN
HLIT=29
HDIST=29
HCLEN=11
The part marked in green contains the (HCLEN + 4) code lengths for the code length alphabet.
The lengths read are as follows: 6, 6, 0, 2, 3, 3, 5, 4, 4, 4, 3, 6, 4, 6, 5
The Huffman codes generated for the above code lengths are as follows:
Generated huffman code
After assigning them to the corresponding code length alphabet symbols, the result is as follows (note: symbol 18 is not used, as its code length was zero):
Assigned huffman codes
Now, when I started decoding the (HLIT + 257) code lengths of the Huffman codes for the literal/length alphabet using the assigned Huffman codes, the first decoded symbol came out as 16. But that cannot be right: symbol 16 means "copy the previous code length", and there is no previous code length, since this is the first decoded symbol.
Hence there is some error in my understanding of the format that I cannot figure out, and this is where I need help.
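For reference, these header fields are read least-significant bit first within each byte, and the (HCLEN + 4) code length code lengths arrive in the permuted order 16, 17, 18, 0, 8, 7, ... defined in RFC 1951. A minimal bit-reader sketch in Python (the class and function names are my own, not taken from the asker's decoder):

    class BitReader:
        # Minimal LSB-first bit reader for a deflate stream (hypothetical helper,
        # not from any particular library).
        def __init__(self, data):
            self.data, self.pos, self.bit = data, 0, 0

        def read(self, n):
            value = 0
            for i in range(n):
                value |= ((self.data[self.pos] >> self.bit) & 1) << i
                self.bit += 1
                if self.bit == 8:
                    self.bit, self.pos = 0, self.pos + 1
            return value

    # Order in which the code length code lengths are stored (RFC 1951, section 3.2.7)
    CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]

    def read_dynamic_header(deflate_data):
        r = BitReader(deflate_data)   # deflate data, i.e. after the 2-byte zlib header
        bfinal = r.read(1)
        btype = r.read(2)             # 2 means "dynamic Huffman"
        hlit = r.read(5) + 257        # number of literal/length code lengths
        hdist = r.read(5) + 1         # number of distance code lengths
        hclen = r.read(4) + 4         # number of code length code lengths
        cl_lengths = {sym: 0 for sym in CL_ORDER}
        for i in range(hclen):
            cl_lengths[CL_ORDER[i]] = r.read(3)
        return bfinal, btype, hlit, hdist, cl_lengths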

The way you're representing the codes doesn't make sense, since you are showing a bunch of zeros that aren't part of the codes, so you can't tell what their lengths are. Also you are showing the codes in reverse, as compared to how they show up in the bytes.
More importantly, somehow you got the assignments wrong. Here are the correct codes, showing only the bits in the codes, in the correct bit order, with the correct assignments:
00 - 0
010 - 7
110 - 8
001 - 11
0101 - 5
1101 - 6
0011 - 10
1011 - 12
00111 - 9
10111 - 13
001111 - 3
101111 - 4
011111 - 16 (followed by two bits, +3 is the repeat count)
111111 - 17 (followed by three bits, +3 is the zeros count)
The first code length code is 001111, which says the literal 0 has code length 3.
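For anyone who wants to reproduce that table: the following Python sketch (mine, not part of the original answer) builds the canonical Huffman codes from the code lengths given in the question, using the algorithm of RFC 1951 section 3.2.2, and prints each code bit-reversed so it matches the bit order shown above:

    CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]
    raw = [6, 6, 0, 2, 3, 3, 5, 4, 4, 4, 3, 6, 4, 6, 5]   # the HCLEN + 4 = 15 lengths from the question
    lengths = {CL_ORDER[i]: raw[i] for i in range(len(raw))}

    def canonical_codes(lengths):
        # RFC 1951, section 3.2.2: codes of the same length get consecutive values,
        # shorter codes first, lower symbol values first.
        max_len = max(lengths.values())
        bl_count = [0] * (max_len + 1)
        for l in lengths.values():
            if l:
                bl_count[l] += 1
        next_code, code = [0] * (max_len + 1), 0
        for bits in range(1, max_len + 1):
            code = (code + bl_count[bits - 1]) << 1
            next_code[bits] = code
        codes = {}
        for sym in sorted(lengths):
            l = lengths[sym]
            if l:
                codes[sym] = format(next_code[l], '0{}b'.format(l))
                next_code[l] += 1
        return codes

    for sym, code in sorted(canonical_codes(lengths).items(), key=lambda kv: (len(kv[1]), kv[1])):
        print(code[::-1], '-', sym)   # bit-reversed, i.e. as the bits appear in a binary dump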

Related

When is fixed-length encoding better than Huffman?

For the word "sleeplessness", Huffman encoding takes 27 bits while fixed-length encoding takes 39 bits.
Is there a word, or a general condition, under which Huffman coding will need more bits than fixed-length encoding?
A Huffman coding using the probability of the symbols in the message will never need more bits than a fixed-length coding, though only if we ignore the bits required to transmit a description of the code itself. The Huffman code description plus the Huffman-coded message for short messages will often be larger than a fixed-length code that requires no description.
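To check the numbers in the question, here is a small Python sketch (my own; the function name is just for illustration) that computes the total Huffman-coded length from the symbol frequencies with the usual heap-based construction and compares it to a 3-bit fixed-length code:

    import heapq
    from collections import Counter

    def huffman_bits(message):
        # Total bits to Huffman-code the message using its own symbol frequencies
        # (the cost of transmitting a description of the code itself is ignored).
        weights = list(Counter(message).values())
        if len(weights) == 1:
            return len(message)          # degenerate case: a single distinct symbol
        heapq.heapify(weights)
        total = 0
        while len(weights) > 1:
            a, b = heapq.heappop(weights), heapq.heappop(weights)
            total += a + b               # every merge adds one bit to each symbol below it
            heapq.heappush(weights, a + b)
        return total

    word = "sleeplessness"
    fixed_bits_per_symbol = (len(set(word)) - 1).bit_length()      # 3 bits for 5 distinct symbols
    print(huffman_bits(word), len(word) * fixed_bits_per_symbol)   # 27 39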

Put PNG scanline image data into a zlib stream with no compression?

I am making a simple PNG image from scratch, and I already have the scanline data for it. Now I want to put it into a zlib stream without compressing it. How can I do that? I have read the "ZLIB Compressed Data Format Specification version 3.3" at "https://www.ietf.org/rfc/rfc1950.txt" but I still do not understand it. Could someone give me a hint about how to lay out the bytes of the zlib stream?
Thanks in advance!
As mentioned in RFC1950, the details of the compression algorithm are described in another castle RFC: DEFLATE Compressed Data Format Specification version 1.3 (RFC1951).
There we find
3.2.3. Details of block format
Each block of compressed data begins with 3 header bits
containing the following data:
first bit BFINAL
next 2 bits BTYPE
Note that the header bits do not necessarily begin on a byte
boundary, since a block does not necessarily occupy an integral
number of bytes.
BFINAL is set if and only if this is the last block of the data
set.
BTYPE specifies how the data are compressed, as follows:
00 - no compression
[... a few other types]
which is the one you wanted. These 2 bits of BTYPE, in combination with the last-block marker BFINAL, are all you need to write "uncompressed" zlib-compatible data:
3.2.4. Non-compressed blocks (BTYPE=00)
Any bits of input up to the next byte boundary are ignored.
The rest of the block consists of the following information:
0 1 2 3 4...
+---+---+---+---+================================+
| LEN | NLEN |... LEN bytes of literal data...|
+---+---+---+---+================================+
LEN is the number of data bytes in the block. NLEN is the
one's complement of LEN.
So, the pseudo-algorithm is:
set the initial 2 bytes to 78 9c ("default compression").
for every block of 32768 or less bytesᵃ
if it's the last block, write 01, else write 00
... write [block length] [COMP(block length)]ᵇ
... write the immediate data
repeat until all data is written.
Don't forget to add the Adler-32 checksum at the end of the compressed data, in big-endian order, after 'compressing' it this way. The Adler-32 checksum verifies the uncompressed, original data. In the case of PNG images, that data has already been processed by the PNG filters and has a filter-type byte prepended to each row – and that is "the" data that gets compressed by this FLATE-compatible algorithm.
ᵃ This is a value that happened to be convenient for me at the time; it ought to be safe to write blocks as large as 65535 bytes (just don't try to cross that line).
ᵇ Both as words with the low byte first, then high byte. It is briefly mentioned in the introduction.
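Putting the pseudo-algorithm together, a Python sketch might look like the following (the function name and the 32768-byte block size are my own choices; zlib.decompress is used only to sanity-check the output against a standard inflater):

    import struct
    import zlib

    def stored_zlib_stream(data, block_size=32768):
        # Wrap raw data in zlib framing using only "stored" (BTYPE=00) deflate
        # blocks, i.e. no actual compression.
        out = bytearray(b'\x78\x9c')         # zlib header: CMF/FLG ("default compression")
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b'']
        for i, block in enumerate(blocks):
            bfinal = 1 if i == len(blocks) - 1 else 0
            out.append(bfinal)               # header bits BFINAL + BTYPE=00, rest of byte is padding
            out += struct.pack('<HH', len(block), len(block) ^ 0xFFFF)   # LEN, NLEN (one's complement)
            out += block
        out += struct.pack('>I', zlib.adler32(data) & 0xFFFFFFFF)        # Adler-32, big-endian
        return bytes(out)

    # Round-trip check against the standard inflater:
    payload = b'filtered PNG scanline bytes go here'
    assert zlib.decompress(stored_zlib_stream(payload)) == payload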

Size of binary file after base64 encoding? Need explanation on the solution

So I'm studying for the upcoming exam, and there's this question: given a binary file with a size of 31 bytes, what will its size be after encoding it to base64?
The solution the teacher gave us was (40 + 4) bytes, as it needs to be a multiple of 4.
I can't arrive at this solution, and I have no idea how to solve this, so I was hoping somebody could help me figure it out.
Base64 encoding divides the input data into six-bit blocks, and each block is encoded as one ASCII character.
If you have 31 bytes of input, you have 31*8/6 six-bit blocks to encode. As a rule of thumb, every three bytes of input produce four bytes of output.
If the input data is not a multiple of six bits, base64 encoding fills the last block with 0 bits.
In your example you have 42 blocks of six bits, the last one filled with the missing 0 bits.
Base64 implementations then pad the encoded data with '=' symbols so that the final result is a multiple of 4, giving 44 bytes.
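You can sanity-check this with Python's base64 module (the helper function is just for illustration):

    import base64
    import math

    def b64_size(n_bytes):
        # Hypothetical helper: 4 output characters per started group of 3 input bytes.
        return 4 * math.ceil(n_bytes / 3)

    print(b64_size(31))                          # 44
    print(len(base64.b64encode(b'\x00' * 31)))   # 44, matches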

Are there any UTF-8 encoded characters that have byte 60 or 62 (`<` and `>`) somewhere other than as the first byte of their representation?

I need to debug an XML parser, and I am wondering if I can construct "malicious" input that will cause it to not recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won't have trouble with other special characters such as &, =, ", etc.
UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation part of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
As you can see, a value 60, which is 00111100, is a single-byte codepoint of value 60, and the same byte cannot occur as part of any multibyte sequence.
The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser inadvertently accepted such input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you'd have to check the entire multibyte sequence of which it is a part (since it can of course occur legitimately as a part of different code points). Ultimately, if you can't trust the browser, you don't really get around decoding everything and just checking the resulting code point sequence for occurrences of U+003C etc., and don't even bother looking at the byte stream.
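These rules are easy to check mechanically. The following Python sketch (my own code) classifies a code unit by its high bits and then brute-forces over all code points to confirm that 0x3C and 0x3E only ever occur as the single-byte characters '<' and '>':

    def classify(byte):
        # Classify a single UTF-8 code unit by its high bits (a sketch of the rules above).
        if (byte & 0b10000000) == 0:
            return 'single-byte code point'
        if (byte & 0b11000000) == 0b10000000:
            return 'continuation byte'
        if (byte & 0b11100000) == 0b11000000:
            return 'leading byte of a 2-byte sequence'
        if (byte & 0b11110000) == 0b11100000:
            return 'leading byte of a 3-byte sequence'
        if (byte & 0b11111000) == 0b11110000:
            return 'leading byte of a 4-byte sequence'
        return 'never valid in well-formed UTF-8'

    print(classify(60), classify(62))   # both are single-byte code points ('<' and '>')

    # Brute-force confirmation that 0x3C / 0x3E never occur inside a longer sequence:
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:      # surrogates cannot be encoded in UTF-8
            continue
        encoded = chr(cp).encode('utf-8')
        assert 0x3C not in encoded and 0x3E not in encoded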
In UTF-8, no. In other encodings, yes.
In UTF-8, by design, all bytes of a multibyte character will always have the highest bit set. Vice versa, a byte that doesn't have the highest bit set is always an ASCII character.
However, this is not true for other encodings, which are also valid for XML.
For more information about UTF-8, check e.g. Wikipedia.
A poorly designed UTF-8 decoder could interpret the bytes C0 BC and C0 BE as U+003C and U+003E. As @KerrekSB stated in his answer:
The standard mandates that a code point must be represented with the minimal number of code units.
But a poor algorithm might still decode a malformed two-byte UTF-8 sequence that is not the minimal number of code units:
C0 BC = 11000000 10111100 = 00000111100 = 3C hex = 60 dec = '<'
So in your testing be sure to include malformed UTF-8 sequences and verify that they are rejected.
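For instance, a strict decoder such as Python's built-in one rejects that overlong sequence outright, since 0xC0 can never appear in well-formed UTF-8:

    # Python's strict UTF-8 decoder refuses the overlong two-byte form of '<':
    try:
        b'\xc0\xbc'.decode('utf-8')
    except UnicodeDecodeError as err:
        print(err)   # reports that byte 0xc0 is an invalid start byte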

UTF-8 Encoding size

What Unicode characters fit in 1, 2, 3, and 4 bytes? Can someone point me to a complete character chart?
Characters are encoded according to their position in the range. You can actually find the algorithm on the Wikipedia page for UTF-8 - you can implement it very quickly:
Wikipedia UTF-8 Encoding
U+0000 to U+007F are (correctly) encoded with one byte
U+0080 to U+07FF are encoded with 2 bytes
U+0800 to U+FFFF are encoded with 3 bytes
U+010000 to U+10FFFF are encoded with 4 bytes
The wikipedia article on UTF-8 has a good enough description of the encoding:
1 byte = code points 0x000000 to 0x00007F (inclusive)
2 bytes = code points 0x000080 to 0x0007FF
3 bytes = code points 0x000800 to 0x00FFFF
4 bytes = code points 0x010000 to 0x10FFFF
The charts can be downloaded directly from unicode.org. It's a set of about 150 PDF files, because a single chart would be huge (maybe 30 MiB).
Also be aware that Unicode (compared to something like ASCII) is much more complex to process - there are things like right-to-left text, byte order marks, code points that can be combined ("composed") to create a single character, different ways of representing the exact same string (and a process to convert strings into a canonical form suitable for comparison), a lot more white-space characters, etc. I'd recommend downloading the entire Unicode specification and reading most of it if you're planning to do more than "not much".
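A quick way to spot-check these boundaries, for illustration, using Python:

    # Spot-check the byte counts at the range boundaries listed above:
    for cp in (0x0000, 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF):
        print('U+{:06X} -> {} byte(s)'.format(cp, len(chr(cp).encode('utf-8'))))
    # prints 1, 1, 2, 2, 3, 3, 4 and 4 bytes respectively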
UTF-8 characters occupy from 1 up to a limit of 6 bytes, although the current range of code points is covered with just 4 bytes. UTF-8 uses the first byte to determine how long (in bytes) the character is - see the Wikipedia page:
UTF-8 Wikipedia
Single byte UTF-8 is effectively ASCII - UTF-8 was designed to be compatible with it, which is why it's more prevalent than UTF-16, for example.
Edit: Apparently, it was agreed that UTF-8 code points would not exceed 21 bits (4-byte sequences) - but the scheme has the technical capability to handle up to 31 bits (6-byte UTF-8).