Is partial gz decompression possible?

For working with images that are stored as .gz files (my image processing software can read .gz files directly, which saves disk space and read time), I need to check the header of each file.
The header is just a small struct of a fixed size at the start of each image, and for images that are not compressed, checking it is very fast. For reading the compressed images, I have no choice but to decompress the whole file and then check this header, which of course slows down my program.
Would it be possible to read the first segment of a .gz file (say a couple of K), decompress this segment and read the original contents? My understanding of gz is that after some bookkeeping at the start, the compressed data is stored sequentially -- is that correct?
so instead of
1. open big file F
2. decompress big file F
3. read 500-byte header
4. re-compress big file F
do
1. open big file F
2. read first 5 K from F as stream A
3. decompress A as stream B
4. read 500-byte header from B
I am using libz.so but solutions in other languages are appreciated!
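As a rough sketch of exactly that approach with zlib's inflate API (the file name image.gz, the 5 K prefix and the 500-byte header are just illustrative, and error handling is minimal):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    unsigned char in[5 * 1024], out[500];
    FILE *f = fopen("image.gz", "rb");
    if (!f) return 1;
    size_t nin = fread(in, 1, sizeof in, f);   /* step 2: read the first ~5 K of F */
    fclose(f);

    z_stream s;
    memset(&s, 0, sizeof s);
    if (inflateInit2(&s, 15 + 16) != Z_OK)     /* 15 + 16: expect a gzip wrapper */
        return 1;
    s.next_in  = in;  s.avail_in  = (uInt)nin;
    s.next_out = out; s.avail_out = sizeof out;
    int ret = inflate(&s, Z_NO_FLUSH);         /* steps 3-4: inflate only the header */
    inflateEnd(&s);
    if (ret != Z_OK && ret != Z_STREAM_END) return 1;

    /* out now holds the first 500 decompressed bytes, i.e. the header */
    return 0;
}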

You can use gzip -cd file.gz | dd ibs=1024 count=10 to uncompress just the first 10 KiB, for example.
gzip -cd decompresses to the standard output.
Pipe | this into the dd utility.
The dd utility copies the standard input to the standard output.
So ibs=1024 sets the input block size to 1024 bytes instead of the default 512.
And count=10 copies only 10 input blocks, thus halting the gzip decompression.
For your 500-byte header you'll want to do gzip -cd file.gz | dd count=1 using the standard 512-byte block size, and just ignore the extra 12 bytes.
A comment highlights that you can use gzip -cd file.gz | head -c $((1024*10)), or in this specific case gzip -cd file.gz | head -c 512. The claim that the original dd invocation relies on gzip decompressing in 1024-byte chunks doesn't seem to be true: for example, dd ibs=2 count=10 decompresses only the first 20 bytes.
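If you want the same trick from inside a C program rather than the shell, a minimal sketch is to popen the same gzip pipeline and read only the bytes you need (the file name image.gz and the 512-byte read are just illustrative):

#include <stdio.h>

int main(void) {
    unsigned char header[512];                      /* 500-byte header plus some slack */
    FILE *p = popen("gzip -cd image.gz", "r");      /* decompress to a pipe */
    if (!p) return 1;
    size_t n = fread(header, 1, sizeof header, p);  /* read only the first 512 bytes */
    pclose(p);                                      /* gzip stops once the pipe is closed */
    /* ... inspect the n bytes of header here ... */
    return n == sizeof header ? 0 : 1;
}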

Yes, it is possible.
But don't reinvent the wheel: the HDF5 format supports different compression algorithms (gzip among them), and it lets you address individual pieces of a dataset. It works on Linux and Windows, and there are wrappers for many languages. It also supports reading and decompressing in parallel, which is very useful if you use high compression rates.
The original answer includes a chart comparing read speed with different compression algorithms, measured from Python through PyTables.
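As a rough idea of what addressing a piece looks like from C, here is a hedged sketch using the HDF5 C API. The file name images.h5 and dataset path /image0 are hypothetical, and it assumes the image bytes are stored as a one-dimensional, chunked, gzip-compressed dataset; HDF5 then decompresses only the chunks that overlap the selection:

#include "hdf5.h"

int main(void) {
    hid_t file = H5Fopen("images.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/image0", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t offset[1] = {0}, count[1] = {500};      /* the first 500 bytes only */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(1, count, NULL);
    unsigned char header[500];
    H5Dread(dset, H5T_NATIVE_UCHAR, mspace, fspace, H5P_DEFAULT, header);

    /* ... check the header fields here ... */
    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return 0;
}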

A Deflate stream can have multiple blocks back to back, but you can always decompress just the number of bytes you want, even if that is only part of a larger block. The zlib function gzread takes a length argument, and there are various other ways to decompress a specific number of plaintext bytes, regardless of how long the full stream is. See the zlib manual for a list of functions and how to use them.
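For example, a minimal sketch with gzread (the file name image.gz is illustrative; gzopen also reads files that were never compressed, which matches the mix of files in the question):

#include <stdio.h>
#include <zlib.h>

int main(void) {
    unsigned char header[500];                 /* the fixed-size header from the question */
    gzFile f = gzopen("image.gz", "rb");
    if (!f) return 1;
    int n = gzread(f, header, sizeof header);  /* reads and decompresses only enough to fill header */
    gzclose(f);
    if (n != (int)sizeof header) return 1;
    /* ... check the header fields here ... */
    return 0;
}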
It's not clear if you want to modify just the headers. (You mention recompressing the whole file, but option B doesn't recompress anything). If so, write headers in a separate Deflate block so you can replace that block without recompressing the rest of the image. Use Z_FULL_FLUSH when you call the zlib deflate function to write the headers. You probably don't need to record the compressed length of the headers anywhere; I think it can be computed when reading them to figure out which bytes to replace.
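A hedged sketch of that idea with the zlib deflate API is below; write_gz is a hypothetical helper, the 500-byte header length and buffer size are illustrative, and error handling is omitted:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Write a gzip file whose 500-byte header ends at its own flush point,
   so it can later be replaced without recompressing the image data. */
int write_gz(FILE *out, const unsigned char *hdr, const unsigned char *img, size_t imglen) {
    unsigned char buf[16384];
    z_stream s;
    memset(&s, 0, sizeof s);
    /* windowBits 15 + 16 asks zlib for a gzip wrapper instead of a raw zlib one */
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);

    s.next_in = (unsigned char *)hdr;
    s.avail_in = 500;
    do {                                   /* flush the header into its own block(s) */
        s.next_out = buf; s.avail_out = sizeof buf;
        deflate(&s, Z_FULL_FLUSH);
        fwrite(buf, 1, sizeof buf - s.avail_out, out);
    } while (s.avail_out == 0);

    s.next_in = (unsigned char *)img;
    s.avail_in = (uInt)imglen;
    int ret;
    do {                                   /* then compress the image data and finish */
        s.next_out = buf; s.avail_out = sizeof buf;
        ret = deflate(&s, Z_FINISH);
        fwrite(buf, 1, sizeof buf - s.avail_out, out);
    } while (ret != Z_STREAM_END);

    deflateEnd(&s);
    return 0;
}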
If you aren't modifying anything, recompressing the whole file makes no sense. You can seek and restart decompression from the start after finding headers you like...

Related

dumpcap, save to text file and line separated

I'm trying to build a solution where dumpcap saves to text file in the format:
timestamp_as_detailed_as_possible, HEX-raw-packet
My goal is to have this continuously streaming each single data packet to the file, separated by newline.
Two questions:
Is it possible for dumpcap to take care of fragmented packets, so I'm guaranteed each line contains 1 single full packet?
Is it OK to have another thread afterwards running and reading lines from the same file, do something with the data and then delete the line when processed - without this interfering with dumpcap?
Is it OK to have another thread afterwards running and reading lines from the same file, do something with the data and then delete the line when processed - without this interfering with dumpcap?
No. But this is the wrong approach. A pipe is actually what you should use here, i.e. dumpcap writing to a pipe and the analyzing process reading from it:
dumpcap -w - | analyzer
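For illustration, a minimal analyzer sketch that reads the capture from that pipe with libpcap; the output format here (timestamp and length) is just a placeholder for whatever processing you actually want:

#include <pcap/pcap.h>
#include <stdio.h>

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    /* "-" makes libpcap read the capture from standard input,
       i.e. from the pipe that dumpcap -w - writes into */
    pcap_t *p = pcap_open_offline("-", errbuf);
    if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }

    struct pcap_pkthdr *hdr;
    const u_char *data;
    int rc;
    while ((rc = pcap_next_ex(p, &hdr, &data)) >= 0) {
        if (rc == 0) continue;             /* timeout; only happens on live captures */
        printf("%ld.%06ld %u bytes\n",
               (long)hdr->ts.tv_sec, (long)hdr->ts.tv_usec, hdr->caplen);
    }
    pcap_close(p);
    return 0;
}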
Is it possible for dumpcap to take care of fragmented packets, so I'm guaranteed each line contains 1 single full packet?
No, and it is also unclear what exactly you expect here. Usually there is no fragmentation done at the IP level at all, since TCP tries to keep the packet size no larger than the MTU anyway. And TCP should be treated as a byte stream only, i.e. don't expect anything you send to end up in a single packet, or that multiple sends will result in multiple packets.
I'm trying to build a solution where dumpcap saves to text file
Dumpcap doesn't save to text files; it saves to binary pcap or pcapng files.
You might want to consider using tcpdump instead, although you'd have to pipe it to a separate program/script to massage its output into the format you want.

Is there a way to replay a gzip-compressed log file in kdb+ without uncompressing it first?

Streaming execute, -11!, does not work on named pipes, so an obvious solution of redirecting the gzip -cd output to a named pipe and passing it to -11! does not work.
-11! accepts a compressed file and streams it so long as it was compressed with -19! (using 2 as the compression algorithm parameter, which is gzip).
The only difference between a normal gzipped file and a kdb compressed one is a few bytes at the beginning of the file.
EDIT (see comments): thanks, this isn't quite true - the bytes differ at the end of the file as well.
So a possible solution is to prepend your gzipped files (if they weren't produced by -19!) with the appropriate byte array first.
For anyone using kdb+ v3.4+, streaming execution for named pipes was introduced as the function .Q.fps.
Here is a simple example of .Q.fps in action. First create a named pipe and write to it from the command line (the write is backgrounded because it blocks until q reads from the pipe):
mkfifo test.pipe
echo "aa" > test.pipe &
Next in a q session:
q).Q.fps[0N!]`:test.pipe
,"aa"
Where 0N! is a function used to display the contents of the file.

Multifile favicon.ico much bigger than sum of sizes of component files

I want to create a multi-file favicon.ico according to the great favicon cheat sheet.
I created 3 .png files, optimized them with OptiPNG, and ended up with files of 1, 2 and 3 kB.
Now I'm trying to create favicon.ico from them using ImageMagick, but the final file is around 15 kB (the sum of the component files is around 6 kB).
What causes this effect and how can I avoid it?
A workaround is to rely on the HTTP gzip compression.
For example, I packed 3 PNG pictures (3580 bytes in total) in a 15086 bytes favicon.ico file. When I download it with gzip compression enabled on the server side, I get 2551 bytes of data. This is even more efficient than downloading the PNG pictures one by one, as gzip is usually not enabled for this kind of file.
However, gzip is often not configured for .ico files (this is more for text files). On Apache, this can be fixed by adding:
<IfModule mod_deflate.c>
(... other rules here...)
AddOutputFilterByType DEFLATE image/x-icon
</IfModule>
in an Apache configuration file, such as /etc/apache2/mods-available/deflate.conf.
This is not the answer you expected, and I hope someone will come up with a magic command line to produce the lightweight favicon.ico we deserve!
Another workaround is to use a different tool to create the favicon file.
I found the png2ico tool (on Stack Overflow, of course ;)). It gave me an 8 kB file (the sum of the components is around 6 kB), which is a size I can live with.
Please read the comments on this answer - philippe_b observed that png2ico sometimes gives him poor results.
Only the 256x256 32-bit icon is PNG-compressed; the others are stored as classic icons in the .ICO file. So there is a logical increase in file size, since the smaller icons are decompressed and stored in raw form.

Does pcap_t *pcap_open_offline(const char *fname, char *errbuf) from libpcap read the whole pcap file into memory?

Does
pcap_t *pcap_open_offline(const char *fname, char *errbuf)
from libpcap read the whole pcap file into memory? If it does, do I have to use tcpslice or similar tools to split the pcap file up?
Thanks.
A strange way of wording your question, but I'll try and answer what I can.
pcap_open_offline() takes a .dump file (or similarly named output from tcpdump, tcpslice, or libpcap's pcap_dump_open() + pcap_dump() functions) as an input.
This file is exactly the same in format and function as a live trace of a network device, i.e. you can use this pcap_t object in pcap_next, pcap_loop, etc.
Altering a dump file in any way (e.g. stripping information or parsing out only what you want with tcpslice or wireshark) will render it unreadable by pcap_open_offline(), as it will not be formatted in the manner of a live packet trace.
However, it does not load the entire file at any one time into memory. It streams the file, as you would stream packets from a live trace.
To summarize: pcap_open_offline() opens an unaltered tcpdump/tcpslice dump and reads it like a live stream. It does not load the entire file into memory, as dumps can get quite large! Instead it goes through the file, loading only one packet's worth of it at a time.
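As a quick illustration of that streaming behaviour, a minimal sketch (the capture file name is hypothetical):

#include <pcap/pcap.h>
#include <stdio.h>

/* Called once per packet; libpcap reads the savefile incrementally,
   so only one packet's worth of data is held at a time. */
static void handle(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes) {
    (void)user; (void)bytes;
    printf("packet: %u captured bytes\n", h->caplen);
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_offline("capture.pcap", errbuf);
    if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }
    pcap_loop(p, -1, handle, NULL);        /* -1: process packets until end of file */
    pcap_close(p);
    return 0;
}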

What is the difference between tar and gz?

When I decompress the file "file.tar.gz" in the iPhone SDK, it gives file.tar, but both the .tar and the .tar.gz are the same size. Any help please?
*.tar means that multiple files are combined to one. (Tape Archive)
*.gz means that the files are compressed as well. (GZip compression)
Edit: the fact that the sizes are the same doesn't say a lot - sometimes files simply can't be compressed any further.
As Rhapsody said, tar is an archive containing multiple files, and gz is a file that is compressed using gzip. The reason two formats are used is that gzip only supports compressing a single file - perhaps due to the UNIX philosophy that a program should do one thing, and do it well.
In any case, if you have the option you may want to use bzip2 compression, which is more efficient (i.e. compresses files to a smaller size) than gzip.