What's bloating my PNG?

Background:
I am working on making a bunch of PNGs as small as possible. I'm playing with tools like PngOut, PngCrush, and OptiPng.
Problem:
I ran across a file that is 1434 KB in size but it's only 230 x 230 pixels. When I open the file in Paint.Net and save it as a new file, the new file is only 77 KB. That's a whopping 1.325 MB of extra junk in there!
Objectives:
I would like to understand what exactly is bloating the file, and also how I can automatically get rid of such bloat when it is encountered, but I'm having trouble accomplishing either of these objectives. OptiPng is not removing the metadata.
Progress:
I found exiftool which seems all-around awesome, but it's not showing any crazy tags.
RIOT can produce a new version of the image without the extra data, but it doesn't give me any solid clues as to what the bloat is--it's definitely not XMP info or Comments (the only metadata I can choose to include). But RIOT automatically and forcibly removes IPTC info and EXIF profiles--could it be one of those?
Desired Feedback
Your thoughts on how to programmatically or automatically losslessly crush and remove metadata from PNGs (and, for that matter, other image types) are appreciated. However, I'd like to not just throw away information in a file without first understanding what it is.
Update
I found Steel Bytes Jpeg & PNG Stripper, and it does strip the metadata from the file (and has a command-line mode), yielding an 84 KB file which I can then PNG optimize, but this still doesn't help me understand what I'm removing, and I feel like I need to understand before proceeding. I don't need to get permission to optimize these images used in a production public-facing web site, but I do need to be confident of what I'm doing before making such a change.
Update 2
I hadn't noticed that OptiPng has an option, -strip all, which strips metadata. This is better than the Steel Bytes stripper by a long shot, as it has far more options for handling the original file: keep a backup, or write the optimized file to a new location. It can also optimize the image data of the PNG at the same time, requiring only one tool instead of two.
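For example, an invocation along these lines (the file name is hypothetical, and the exact option spellings should be checked against your OptiPng version):

# strip metadata, optimize, and keep a backup of the original
optipng -o2 -strip all -backup image.png

# or write the optimized copy to a separate directory instead
optipng -o2 -strip all -dir optimized image.png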
Addendum
Here's what exiftool -a -G [file.png] shows:
[ExifTool] ExifTool Version Number : 9.60
[File] File Name : file.png
[File] Directory : .
[File] File Size : 1446 kB
[File] File Modification Date/Time : 2014:03:31 16:37:20-07:00
[File] File Access Date/Time : 2014:05:15 15:47:53-07:00
[File] File Creation Date/Time : 2014:05:15 15:47:53-07:00
[File] File Permissions : rw-rw-rw-
[File] File Type : PNG
[File] MIME Type : image/png
[PNG] Image Width : 230
[PNG] Image Height : 230
[PNG] Bit Depth : 8
[PNG] Color Type : RGB with Alpha
[PNG] Compression : Deflate/Inflate
[PNG] Filter : Adaptive
[PNG] Interlace : Noninterlaced
[PNG] Significant Bits : 8 8 8 8
[PNG] Pixels Per Unit X : 2834
[PNG] Pixels Per Unit Y : 2834
[PNG] Pixel Units : Meters
[PNG] Creation Time : 3/31/14
[PNG] Software : Adobe Fireworks CS6
[XMP] XMP Toolkit : Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27
[XMP] Creator Tool : Adobe Fireworks CS6 (Macintosh)
[XMP] Create Date : 2012:10:24 19:01:30Z
[XMP] Modify Date : 2014:03:31 23:34:45Z
[XMP] Format : image/png
[Composite] Image Size : 230x230

The raw pixel data of your 230 x 230 pixel image should not be anywhere near that big, even when badly compressed or not compressed at all. Therefore, all of this data must reside in non-standard PNG data chunks. Notably, the Software and Creator Tool tags show Adobe Fireworks, which is known to store its editable source document inside private PNG chunks, so that is a likely culprit.
Use pngcheck to find out which chunks are in the file and how big they are. Then use the W3C PNG specification to find out what particular chunks are used for and, if necessary, look elsewhere for documentation of private chunks.
Unless you are dealing with a seriously non-standard file, every chunk whose name indicates it is not required (an ancillary chunk, whose type name begins with a lowercase letter) is a possible candidate for removal.
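For example, something along these lines (file name hypothetical) lists every chunk with its type and length:

pngcheck -v file.png

Any large chunk other than IHDR, PLTE, IDAT, or IEND is worth a closer look.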

Related

Does exiftool require the complete file for extracting metadata

This question is about extraction of metadata only.
Does exiftool require a complete file in order to work properly?
Scenario:
I want to extract the metadata of a 20 GB video file. Do I need to provide exiftool with the complete file (via stdin), or is it enough to provide it with a certain number of bytes?
Motivation:
I am programmatically (golang) calling exiftool in a streaming context and want the extraction to be as fast as possible. Magic-number file-type detection works with the first 33 bytes, and I am wondering whether something similar is possible for exiftool metadata as well.
The answer depends upon the file and the location of the metadata within that file.
There are a couple of threads on the subject on the ExifTool forums (link 1, link 2), and Phil Harvey, the author, says that in the case of MP4/MOV videos the metadata is at the end of the file about half the time.
Using the -fast option might help. I've done some quick tests using cURL and a large image file (see the second-to-last example under Piping Examples), and in that case cURL didn't download the whole image file, just enough to extract the metadata. It might be different with a video file though, as I haven't tested that situation.
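That piping pattern looks roughly like this (the URL is a placeholder); exiftool reads from standard input when the file name is -, and -fast is meant to avoid scanning to the end of the file when possible:

curl -s http://example.com/bigvideo.mp4 | exiftool -fast -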

Get Maximum Compression from 7zip compression algorithm

I am trying to compress some of my large document files, but most of the files are getting compressed by only about 10% at most. I am using 7zip terminal commands.
7z a filename.7z -m0=LZMA -mx=9 -mmt=on -aoa -mfb=64 filename.pptx
Any suggestions on changing the parameters? I need at least a 30% compression ratio.
.pptx and .docx files are internally ZIP archives. You cannot expect much compression from an already-compressed file.
The documentation states that LZMA2 handles incompressible data better, so you can try:
7z a -m0=lzma2 -mx filename.7z filename.pptx
But the required 30% is almost unreachable.
If you really need that compression, you could use the fact that a pptx is just a fancy zip file:
Unzip the pptx, then compress it with 7zip. To recover an equivalent (but not identical) pptx, decompress with 7zip and recompress with zip.
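A minimal sketch of that round trip, with hypothetical file names (the recovered pptx is equivalent but not byte-identical to the original):

# unpack the pptx (it is just a ZIP), then compress the unpacked tree with 7-Zip
unzip -q presentation.pptx -d presentation_src
7z a -m0=lzma2 -mx=9 presentation.7z presentation_src

# to recover a working pptx later: extract and re-zip the contents
7z x presentation.7z
cd presentation_src
zip -q -r ../presentation_restored.pptx .
cd ..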
There are probably some complications, for example with epub there is a certain file that must be stored uncompressed as first file in the archive at a certain offset from the start. I'm not familiar with pptx, but it might have similar requirements.
I think it's unlikely that the small reduction in file size is worth the trouble, but it's the only approach I can think of.
Depending on what's responsible for the size of the pptx, you could also try to compress the contained files, for example by recompressing PNG files with a better compressor, stripping unnecessary data (e.g. metadata or change histories), or applying lossy compression with lower quality settings to JPEG files.
Just an idea for maximum compression:
'recompress' these .zip archives (the .docx, .pptx, .jar, ...) using -m0 (store = no compression), and then
apply LZMA2 to the result.
LZMA2 is pretty good; however, if the file contains many JPEGs, consider giving the open-source packer PeaZip, or more specifically paq8o, a try. PAQ8 has a built-in JPEG compressor and supports range compression, so it also copes with JPEGs that sit inside some other file. WinZip's zipx format, in contrast, requires standalone JPEG files and is useless in this case.
But again, to make PAQ compress your target file effectively, you'll need to 'null' the zip/deflate compression first, i.e. turn it into an uncompressed zip.
PAQ is admittedly a little exotic, but in my eyes it's more honest and transparent than zipx. PAQ is unsupported, so, as always, it's a good idea to search for whatever you don't have or don't know, and you will find something.
Zipx, in contrast, can be a little treacherous: it looks like a normal zip and the files are listed properly in WinRAR or 7zip, but extracting the JPEGs will fail, so an inexperienced user may think the zip is corrupted. It is much harder to figure out that it is a zipx, which so far only WinZip or The Unarchiver (unar.exe) can handle properly.
PPTX, XLSX, and DOCX files can indeed be compressed effectively if there are many of them. By unzipping each of them into its own directory, an archiver can find commonalities between them, deduplicating the boilerplate XML as well as any text they have in common.
If you must use the ZIP format, first create a zero-compression "store" archive containing all of them, then ZIP that. This is necessary because each file in a ZIP archive is compressed from scratch without taking advantage of redundancies across different files.
By taking advantage of boilerplate deduplication, 30% should be a piece of cake.
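A rough sketch of both approaches, with hypothetical paths (assuming the documents live in a docs/ directory):

# unpack each document into its own directory, then archive the unpacked
# trees together; 7z's solid compression stores the shared boilerplate once
for f in docs/*.pptx; do
    unzip -q "$f" -d "unpacked/$(basename "$f" .pptx)"
done
7z a -m0=lzma2 -mx=9 documents.7z unpacked

# ZIP-only variant: store everything uncompressed first, then compress that
zip -q -r -0 documents-store.zip docs
zip -q -9 documents-compressed.zip documents-store.zip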

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally, the PDF is version 1.3. I'd like to decompress it to see its low-level workings and make some changes. It's easy with Ghostscript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite aren't a good combination for this. The PDF file you get out is NOT the same as the one you put in. This is because of the way Ghostscript and pdfwrite work: the input is fully interpreted into a sequence of graphics primitives, which is sent to the Ghostscript graphics library. These are then sent to the requested device; most devices render the result to a bitmap, but the pdfwrite family reassembles those graphics primitives into a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed....
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. Part of MuPDF is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file.
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
It has some issues with large PDFs (decompression can take too much time and memory), and it can produce incomplete output (with warnings in the console) for some partially broken or nonstandard PDFs.
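For reference, a typical decompression call would look something like this (in.pdf and out.pdf are placeholders):

# rewrite the PDF with all streams uncompressed, leaving form fields and
# other objects otherwise intact
qpdf --stream-data=uncompress in.pdf out.pdf

# QDF mode additionally normalizes the output for easy inspection and editing
qpdf --qdf --object-streams=disable in.pdf out.pdf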

Is partial gz decompression possible?

I work with images that are stored as .gz files (my image processing software can read .gz files directly, which saves disk space and I/O time), and I need to check the header of each file.
The header is just a small struct of a fixed size at the start of each image, and for images that are not compressed, checking it is very fast. For reading the compressed images, I have no choice but to decompress the whole file and then check this header, which of course slows down my program.
Would it be possible to read the first segment of a .gz file (say a couple of K), decompress this segment and read the original contents? My understanding of gz is that after some bookkeeping at the start, the compressed data is stored sequentially -- is that correct?
so instead of
1. open big file F
2. decompress big file F
3. read 500-byte header
4. re-compress big file F
do this instead (a shell sketch follows below)
1. open big file F
2. read first 5 K from F as stream A
3. decompress A as stream B
4. read 500-byte header from B
I am using libz.so but solutions in other languages are appreciated!
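A shell-level sketch of that second workflow, with hypothetical file names, assuming the 500-byte header falls within the first 5 KB of compressed data (gzip will complain about the truncated input, but it still emits everything it managed to decompress):

head -c 5120 image.gz | gzip -cd 2>/dev/null | head -c 500 > header.bin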
You can use gzip -cd file.gz | dd ibs=1024 count=10 to uncompress just the first 10 KiB, for example.
gzip -cd decompresses to the standard output.
Pipe | this into the dd utility.
The dd utility copies the standard input to the standard output.
So dd ibs=1024 sets the input block size to 1024 bytes instead of the default 512.
And count=10 copies only 10 input blocks, thus halting the gzip decompression.
You'll want to do gzip -cd file.gz | dd count=1 using the standard 512 block size and just ignore the extra 12 bytes.
A comment highlights that you can use gzip -cd file.gz | head -c $((1024*10)), or in this specific case gzip -cd file.gz | head -c 512. The claim in another comment that the original dd approach relies on gzip decompressing in 1024-byte blocks doesn't seem to be true; for example, dd ibs=2 count=10 decompresses only the first 20 bytes.
Yes, it is possible.
But don't reinvent the wheel: the HDF5 data format supports different compression algorithms (gzip among them), and you can address individual pieces of the data. It is compatible with Linux and Windows, and there are wrappers for many languages. It also supports reading and decompressing in parallel, which is very useful if you use high compression ratios.
(The original answer included a chart comparing read speeds for the different compression algorithms, measured from Python through PyTables.)
A Deflate stream can have multiple blocks back to back. But you can always decompress just the number of bytes you want, even if it's part of a larger block. The zlib function gzread takes a length arg, and there are various other ways to decompress a specific amount of plaintext bytes, regardless of how long the full stream is. See the zlib manual for a list of functions and how to use them.
It's not clear if you want to modify just the headers. (You mention recompressing the whole file, but option B doesn't recompress anything). If so, write headers in a separate Deflate block so you can replace that block without recompressing the rest of the image. Use Z_FULL_FLUSH when you call the zlib deflate function to write the headers. You probably don't need to record the compressed length of the headers anywhere; I think it can be computed when reading them to figure out which bytes to replace.
If you aren't modifying anything, recompressing the whole file makes no sense. You can seek and restart decompression from the start after finding headers you like...

Multifile favicon.ico much bigger than sum of sizes of component files

I want to create a multi-file favicon.ico according to the great favicon cheat sheet.
I created 3 .png files, optimized them with OptiPNG, and got files of 1, 2, and 3 KB.
Now I'm trying to create favicon.ico from them using ImageMagick, but the final file is around 15 KB (the sum of the component files is around 6 KB).
What causes this effect, and how can I avoid it?
A workaround is to rely on the HTTP gzip compression.
For example, I packed 3 PNG pictures (3580 bytes in total) into a 15086-byte favicon.ico file. When I download it with gzip compression enabled on the server side, I get 2551 bytes of data. This is even more efficient than downloading the PNG pictures one by one, as gzip is usually not enabled for that kind of file.
However, gzip is often not configured for .ico files (this is more for text files). On Apache, this can be fixed by adding:
<IfModule mod_deflate.c>
(... other rules here...)
AddOutputFilterByType DEFLATE image/x-icon
</IfModule>
in an Apache configuration file, such as /etc/apache2/mods-available/deflate.conf.
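To verify that the rule is in effect, you can compare the transferred size with and without gzip; for example (the URL is a placeholder):

# bytes downloaded without compression
curl -s -o /dev/null -w "%{size_download}\n" http://example.com/favicon.ico

# bytes downloaded when the client advertises gzip support
curl -s -H "Accept-Encoding: gzip" -o /dev/null -w "%{size_download}\n" http://example.com/favicon.ico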
This is not the answer you expected and I hope someone will come with a magic command line to produce the lightweight favicon.ico we deserve!
Another workaround is to use a different tool to create the favicon file.
I found the png2ico tool (on Stack Overflow, of course ;)). It gave me an 8 KB file (the sum of the components is around 6 KB), which is a size I can live with.
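Its invocation is simply the output .ico followed by the component PNGs (file names here are hypothetical):

png2ico favicon.ico favicon-16.png favicon-32.png favicon-48.png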
Please read the comments on this answer: philippe_b observed that png2ico sometimes gives poor results.
Only the 256x256 32-bit icon is stored PNG-compressed; the others are stored as classic uncompressed icons inside the .ICO file. So there is a logical increase in file size, since the smaller sizes are decompressed when they are packed into the .ico.