How can I write compressed files on the fly using Perl? - perl

I am generating relatively large files using Perl. The files I am generating are of two kinds:
Table files, i.e. textual files I print line by line (row by row), which contain mainly numbers. A typical line looks like:
126891 126991 14545 12
Serialized objects I create then store into a file using Storable::nstore. These objects usually contain some large hash with numeric values. The values in the object might have been packed to save on space (and the object unpacks each value before using it).
Currently I'm usually doing the following:
use IO::Compress::Gzip qw(gzip $GzipError);
# create normal, uncompressed file ($out_file)
# ...
# compress file using gzip
my $gz_out_file = "$out_file.gz";
gzip $out_file => $gz_out_file or die "gzip failed: $GzipError";
# delete uncompressed file
unlink($out_file) or die "can't unlink file $out_file: $!";
This is quite inefficient since I first write the large file to disk, then gzip reads it again and compresses it. So my questions are as follows:
Can I create a compressed file without first writing a file to disk? Is it possible to create a compressed file sequentially, i.e. printing line-by-line like in scenario (1) described earlier?
Does gzip sound like an appropriate choice? Are there any other recommended compressors for the kind of data I have described?
Does it make sense to pack values in an object that will later be stored and compressed anyway?
My considerations are mainly saving on disk space and allowing fast decompression later on.

You can use IO::Zlib or PerlIO::gzip to tie a file handle to compress on the fly.
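For example, a minimal sketch using the PerlIO::gzip layer (the file name table.gz is just a placeholder); everything printed to the handle is compressed as it is written:
use strict;
use warnings;
use PerlIO::gzip;

# The :gzip layer compresses on the fly, so rows can be printed line by line
open my $fh, '>:gzip', 'table.gz' or die "Can't open table.gz: $!";
print {$fh} "126891 126991 14545 12\n";
close $fh or die "Can't close table.gz: $!";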
As for what compressors are appropriate, just try several and see how they do on your data. Also keep an eye on how much CPU/memory they use for compression and decompression.
Again, test to see how much pack helps with your data, and how much it affects your performance. In some cases, it may be helpful. In others, it may not. It really depends on your data.

You can also open() a filehandle to a scalar instead of a real file, and use that filehandle with IO::Compress::Gzip. I haven't actually tried it, but it should work. I use something similar with Net::FTP to avoid creating files on disk.
Since v5.8.0, Perl has built using PerlIO by default. Unless you've changed this (i.e., Configure -Uuseperlio), you can open filehandles directly to Perl scalars via:
open($fh, '>', \$variable) || ..
from open()
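A quick sketch of that idea (untested, as the answer above notes): build the rows in an in-memory scalar through a filehandle, then compress the scalar straight to a .gz file (table.gz is a placeholder name):
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

my $buffer = '';
open my $fh, '>', \$buffer or die "Can't open in-memory handle: $!";
print {$fh} "126891 126991 14545 12\n" for 1 .. 3;   # build the rows in memory
close $fh;

# IO::Compress::Gzip's one-shot gzip accepts a scalar reference as input
gzip \$buffer => 'table.gz' or die "gzip failed: $GzipError";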

IO::Compress::Gzip has an OO interface that can be used for this.
use strict;
use warnings;
use IO::Compress::Gzip;
my $z = IO::Compress::Gzip->new('out.gz');
$z->print($_, "\n") for 0 .. 10;
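For scenario (2), one possible sketch (my assumption, not something tested here) is to write the Storable image through a compressing layer such as PerlIO::gzip, using nstore_fd:
use strict;
use warnings;
use Storable qw(nstore_fd);
use PerlIO::gzip;

my %big_hash = ( 126891 => 14545, 126991 => 12 );   # example data

open my $fh, '>:gzip', 'object.gz' or die "Can't open object.gz: $!";
nstore_fd(\%big_hash, $fh) or die "nstore_fd failed: $!";
close $fh or die "Can't close object.gz: $!";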

Related

Perl - Piping gunzip Output to File::ReadBackwards

I have a Perl project (a CGI script, running on Apache) that has previously always used gunzip and tac (piping gunzip to tac, and piping that to a filehandle) in order to accomplish its workload, which is to process large, flat text files, sometimes on the order of 10 GB or more each in size. More specifically, in my use case these files need to be decompressed on the fly AND read backwards at times as well (there are times when both are a requirement - mainly for speed).
When I started this project, I looked at using File::ReadBackwards but decided on tac instead for performance reasons. After some discussion on a slightly-related topic last night and several suggestions to try and keep the processing entirely within Perl, I decided to give File::ReadBackwards another shot to see how it stands up under this workload.
Some preliminary tests indicate that it may in fact be comparable, and possibly even better, than tac. However, so far I've only been able to test this on uncompressed files. But it now has grabbed my interest so I'd like to see if I could make it work with compressed files as well.
Now I'm pretty sure I could unzip a file to another file and then read that backwards, but I think that would have terrible performance. Especially because the user has the option to limit results to X number, for the exact reason of helping performance, so I do not want to have to process/decompress the entirety of a file every single time I pull any lines out of it. Ideally I would like to be able to do what I do now, which is to decompress and read backwards on the fly, with the ability to bail out as soon as I hit my quota if needed.
So, my dilemma is that I need to find a way to pipe output from gunzip, into File::ReadBackwards, if possible.
On a side note, I would be willing to give IO::Uncompress::Gunzip a chance as well (compare the decompression performance against a plain, piped gunzip process), either for performance gain (which would surprise me) or for convenience/the ability to pipe output to File::ReadBackwards (which I find slightly more likely).
Does anyone have any ideas here? Any advice is much appreciated.
You can't. File::ReadBackwards requires a seekable handle (i.e. a plain file and not a pipe or socket).
To use File::ReadBackwards, you'd have to first send the output to a named temporary file (which you could create using File::Temp).
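A rough sketch of that workaround (file.gz and the gunzip invocation are placeholders):
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::ReadBackwards;

my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;

# Decompress to the temporary file first (not on the fly, hence the cost)
system("gunzip -c -- file.gz > $tmp_name") == 0
    or die "gunzip failed: $?";

my $bw = File::ReadBackwards->new($tmp_name)
    or die "Can't read $tmp_name backwards: $!";
while (defined(my $line = $bw->readline)) {
    print $line;   # lines come out last-to-first
}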
While File::ReadBackwards won't work as desired, here is another take.
In the original approach you first gunzip before tac-ing, and the whole file is read just to get to its end; tac is there only for convenience. (For a plain uncompressed file one can get the file size from metadata and then seek toward the end of the file so as not to read the whole thing.)
Then try the same, or similar, in Perl. The IO::Uncompress::Gunzip module also has a seek method. It does have to uncompress data up to the seek point
Note that the implementation of seek in this module does not provide true random access to a compressed file/buffer
but with it we still avoid copying uncompressed data (into variables) and so pay only the minimal price here, of uncompressing data in order to seek. In my timings this saves upwards of an order of magnitude, making it far closer to the system's gunzip (competitive for file sizes on the order of 10 MB).
For that we also need the uncompressed size, which the module's seek uses, and which I get with the system's gzip -l. Thus I still need to parse the output of an external tool; so there's that issue.†
use warnings;
use strict;
use feature 'say';
use IO::Uncompress::Gunzip qw($GunzipError);
my $file = shift;
die "Usage: $0 file\n" if not $file or not -f $file;
my $z = IO::Uncompress::Gunzip->new($file) or die "Error: $GunzipError";
my $us = (split ' ', (`gunzip -l $file`)[1])[1]; # CHECK gunzip's output
say "Uncompressed size: $us";
# Go to 1024 bytes before uncompressed end (should really be more careful
# since we aren't guaranteed that size estimate)
$z->seek($us-1024, 0);
while (my $line = $z->getline) {
    print $line if $z->eof;
}
(Note: docs advertise SEEK_END but it didn't work for me, neither as a constant nor as 2. Also note that the constructor does not fail for non-existent files so the program doesn't die there.)
This only prints the last line. Collect those lines into an array instead, for more involved work.
For compressed text files on the order of 10 MB in size this runs as fast as gunzip | tac. For files around 100 MB in size it is slower by a factor of two. This is a rather rudimentary estimate, and it depends on all manner of detail, but I am comfortable saying that it will be noticeably slower for larger files.
However, the code above has a particular problem with the file sizes possible in this case, in the tens of GB. The good old gzip format has a limitation, nicely stated in the gzip manual:
The gzip format represents the input size modulo 2^32 [...]
So the sizes obtained by --list for files larger than 4 GB undermine the above optimization: we seek to a place early in the file instead of near its end (for a 17 GB file the size is reported by -l as 1 GB, so we seek there), and then in fact read the bulk of the file with getline.
The best solution would be to use the known value for the uncompressed data size, if it is known. Otherwise, if the compressed file size exceeds 4 GB, seek to the compressed size (as far as we can safely go), and after that use read with very large chunks:
my $buf;
my $len = 10*1_024_000;  # only hundreds of reads, but a large buffer
$z->read($buf, $len) while not $z->eof;
my @last_lines = split /\n/, $buf;
The last step depends on what actually needs to be done. If it is indeed to read lines backwards then you can do while (my $el = pop @last_lines) { ... } for example, or reverse the array and work away. Note that it is likely that the last read will be far smaller than $len.
On the other hand, it could so happen that the last read buffer is too small for what's needed; so one may want to always copy the needed number of lines and keep that across reads.
The buffer size to read ($len) clearly depends on specifics of the problem.
Finally, if this is too much bother you can pipe gunzip and keep a buffer of lines.
use String::ShellQuote qw(shell_quote);

my $num_lines = ...;  # user supplied
my @last_lines;

my $cmd = shell_quote('gunzip', '-c', '--', $file);
my $pid = open my $fh, '-|', $cmd or die "Can't open $cmd: $!";

push @last_lines, scalar <$fh> for 0..$num_lines;  # to not have to check

while (<$fh>) {
    push @last_lines, $_;
    shift @last_lines;
}
close $fh;

while (my $line = pop @last_lines) {
    print $line;  # process backwards
}
I put $num_lines items on the array right away so as not to have to test the size of @last_lines against $num_lines on every shift, i.e. on every read. (This improves runtime by nearly 30%.)
Any hint of the number of lines (of uncompressed data) is helpful, so that we skip ahead and avoid copying data into variables, as much as possible.
# Stash $num_lines on array
<$fh> for 0..$num_to_skip; # skip over an estimated number of lines
# Now push+shift while reading
This can help quite a bit, depending on how well we can estimate the number of lines. Altogether, in my tests this is still slower than gunzip | tac | head, by around 50% in the very favorable case where I skip 90% of the file.
† The uncompressed size can be found without going to external tools as
my $us = do {
    my $d;
    open my $fh, '<', $file or die "Can't open $file: $!";
    seek($fh, -4, 2) and read($fh, $d, 4) >= 4 and unpack('V', $d)
        or die "Can't get uncompressed size: $!";
};
Thanks to mosvy for a comment with this.
If we still stick with using the system's gunzip, then the safety of running an external command with user input (the filename), practically bypassed here by checking that the file exists, needs to be taken into account; use String::ShellQuote to compose the command:
use String::ShellQuote qw(shell_quote);
my $cmd = shell_quote('gunzip', '-l', '--', $file);
# my $us = ... qx($cmd) ...;
Thanks to ikegami for the comment.

Share hash across Perl scripts

Is it possible to share a hash created by one Perl script with another Perl script on a Linux machine?
./hash_script.pl # Creates a hash after parsing a file
# Takes several minutes and hash consumes 4Gb of memory
./script1.pl # Reads hash
./script2.pl # Reads hash
I want to create the hash once and use it many times, whenever script1.pl and script2.pl are run.
If your hash_script script dumps its hash into a file somewhere (using Data::Dumper or some other means), you can load that hash in a subsequent script with do.
In script1/script2:
our %sharedhash; #whatever name the hash has in the dumped file
do 'hash_dump_file.txt' or die "Couldn't read hash: $@";
print $sharedhash{stuff};
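For completeness, a sketch of the dumping side in hash_script.pl (the hash and file names are just the ones used above):
use strict;
use warnings;
use Data::Dumper;

our %sharedhash = ( stuff => 42 );   # in reality, filled by parsing the file

$Data::Dumper::Purity = 1;
open my $out, '>', 'hash_dump_file.txt' or die "Can't write dump: $!";
print {$out} Data::Dumper->Dump([ \%sharedhash ], [ '*sharedhash' ]);
close $out;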
I would recommend using the Perl module Storable. Storable can take any data structure and store it onto a disk.
use Storable; # It automatically imports all functions. Grrr...
...
store \%hash, $file_name;
However, if the hash takes 4 GB, it is probably way too big to be used effectively as an in-memory Perl hash. This is why other answers recommend a SQL or NoSQL database: a hash has to keep the entire structure in memory to manipulate it, while a SQL or NoSQL database can pull up just the data that's required.
However, try Storable, and see how long it takes.
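The reading side in script1.pl/script2.pl is just as short (assuming the same $file_name):
use Storable qw(retrieve);

my $hashref = retrieve($file_name);   # returns a reference to the stored hash
print $hashref->{stuff};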
You don't say why you need such a big hash in memory, but the use of some NoSQL database would probably be more suitable.
Take a look at Redis or MongoDB.
How about creating a package and loading it in the other scripts? But if its size is about 4 GB, that's too much for this approach. The next solution could be memcached or something like that.
Could you write other information about your hash? How are you using it in the other two scripts?
dbmopen / dbmclose allow you to have a normal hash implemented by means of a file on your drive. I've never tried reading a DBM from a different script than the one that created it, but I see no reason why it shouldn't work.
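A minimal sketch of that idea (the DBM file name shared_hash is a placeholder); the scripts only have to agree on the file name:
# Writer
my %hash;
dbmopen(%hash, 'shared_hash', 0644) or die "dbmopen failed: $!";
$hash{stuff} = 42;   # stored on disk rather than in a 4 GB in-memory hash
dbmclose(%hash);

# Reader (a separate script)
my %shared;
dbmopen(%shared, 'shared_hash', 0644) or die "dbmopen failed: $!";
print $shared{stuff};
dbmclose(%shared);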

How can I make writes to a gzip file from my Perl script non-blocking?

I'm currently writing a script that takes a database as input and generates all valid combinations from the 10+ tables, following certain rules. Since the output is pretty darn huge, I'm dumping it through gzip into a file, like this:
open( my $OUT, '|-', "gzip > file" ) or die "Can't start gzip: $!";
for ( @data ) {
    my $line = calculate($_);
    print $OUT $line;
}
Due to the nature of the beast, though, I end up having to make hundreds of thousands of small writes, one for each line. This means that between each calculation it waits for gzip to receive the data and finish compressing it. At least I think so; I might be wrong.
In case I'm right, though, I'm wondering how I can make this print asynchronous, i.e. have it fire the data at gzip and then go on processing the data.
Give IO::Compress::Gzip a try. It accepts a filehandle to write to. You can set O_NONBLOCK on that filehandle.
Pipes already use a buffer so that the writing program doesn't have to wait for the reading program. However, that buffer is usually fairly small (it's normally only 64KB on Linux) and not easily changed (it requires recompiling the kernel). If the standard buffer is not enough, the easiest thing to do is include a buffering program in the pipeline:
open( my $OUT, '|-', "bfr | gzip > file" );
bfr simply reads STDIN into an in-memory buffer, and writes to STDOUT as fast as the next program allows. The default is a 5MB buffer, but you can change that with the -b option (e.g. bfr -b10m for a 10MB buffer).
Naturally, you could do it in a thread or with a fork, as you wish.
http://hell.jedicoder.net/?p=82

How do I properly format plain text data for a simple Perl dictionary app?

I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small, but with everything under the __DATA__ section its size reaches 30 MB. In order to share the work with my friends, I've packed the script into a stand-alone executable using the pp utility of the PAR::Packer module with the highest compression level (9), and now I have a single-file dictionary app of about 17 MB.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I receive an error like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (rather weird goal most of the time these days), then the zip/unzip is the way to go.
However if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example indexed by a first letter), and only load needed chunks.
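A hypothetical sketch of that chunking idea, assuming one file per first letter (dict/a.txt, dict/b.txt, ...) with tab-separated word/definition lines:
use strict;
use warnings;

my %chunk;   # first letter => { word => definition }, loaded lazily

sub lookup {
    my ($word) = @_;
    my $letter = lc substr($word, 0, 1);
    $chunk{$letter} ||= load_chunk("dict/$letter.txt");
    return $chunk{$letter}{$word};
}

sub load_chunk {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my %defs;
    while (my $line = <$fh>) {
        chomp $line;
        my ($word, $def) = split /\t/, $line, 2;
        $defs{$word} = $def;
    }
    return \%defs;
}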
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is what you get as the logical end of an approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to only retrieve the data which is actually needed.
In your case probably something simple like SQLite or Berkeley DB/DBM files should be OK.
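For example, a lookup against SQLite through DBI could look like this (the database, table and column names are assumptions):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=dictionary.db', '', '',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT definition FROM entries WHERE word = ?');
$sth->execute('example');
my ($definition) = $sth->fetchrow_array;
print $definition // "not found", "\n";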
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think virus definitions file vs. antivirus executable for real world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra commandline options to pp, it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl
use strict;
use warnings;
use Words;
for my $i (1 .. 2) {
    print "Run $i\n";
    while (defined(my $word = Words->next_word)) {
        print "\t$word\n";
    }
}
Words.pm
package Words;
use strict;
use warnings;

my $start = tell DATA
    or die "could not find current position: $!";

sub next_word {
    if (eof DATA) {
        seek DATA, $start, 0
            or die "could not seek: $!";
        return undef;
    }
    chomp(my $word = scalar <DATA>);
    return $word;
}

1;
__DATA__
a
b
c

Can I use a filehandle instead of a filename for creating DBM files?

I'm using MLDBM to persist some Perl data structures and I'm wondering if there's an alternative to the following:
tie %hash, "MLDBM", $dbm_file, O_CREAT | O_RDWR, 0644;
Primarily, I'd like to be able to use STDOUT, rather than a known file name. This could then be redirected to a file on the shell-side.
I've been searching with keywords like "tie", "DBM" and "filehandle", but the hits tend to talk about tying filehandles to things, as opposed to things to filehandles.
Any suggestions?
Well, MLDBM wouldn't care, as it just passes the parameters to the underlying dbm library (e.g., DB_File or GDBM_File). But I'm not aware of any dbm library that accepts a filehandle instead of a filename. Also, a dbm file will need to be seekable, so the shell would have to be redirecting to an actual file, not a pipe. And STDOUT would probably be opened write-only, which wouldn't work for a dbm file.
If you're just using MLDBM for persistence, and not because the database is too big for memory, then you could try a different approach. Use Storable to persist your data structures. It can read & write to an already-open filehandle.
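A minimal sketch of that Storable route (writer side; run it as, say, perl dump.pl > data.storable, which is exactly the shell-side redirection you mention):
use strict;
use warnings;
use Storable qw(store_fd);

my %hash = ( apple => 'fruit', carrot => 'vegetable' );   # example data

binmode STDOUT;
store_fd(\%hash, \*STDOUT) or die "store_fd failed: $!";

# A consumer script would read it back with something like:
#   my $hashref = Storable::fd_retrieve(\*STDIN);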
Remember that STDOUT is a stream, a sequence of bytes that must be read sequentially like a tape. The DBM modules provide record-oriented persistence where you can read from and write to arbitrary records.
To fake DBM over STDOUT, you would need to output some sort of journal format. Writing to STDOUT seems to be a higher priority for you than using DBM, so maybe a different format would be more appropriate.
With more information about your application, we could offer suggestions that will be more useful to you.