Perl file size limit

I know how to read, write, and open files in Perl. What I am trying to achieve is this: how do I create a new file when the existing file exceeds 'x' size? For instance, say the limit is 3MB: before writing to the same file, check its size; if the size exceeds 3MB, create a new one, chmod it if needed, then write.
I don't know if my question is clear. Here is what I have so far:

$size = -s '/path/to/file.txt';
if (($size / 1048576) > 3) {
    print "too big";
}
else {
    do_something();
}

You can use stat for this:
http://perldoc.perl.org/functions/stat.html
stat gives you a lot of information about a given file, including the size.
Example:
use File::stat;
my $filesize = stat("test.txt")->size;

Once you've determined that the file is big enough to rotate using -s, Logfile::Rotate can be used to do the rotation.
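For example (a minimal sketch, untested; File, Count and Gzip are Logfile::Rotate constructor options, while the path and the 3MB threshold are placeholders):

use Logfile::Rotate;

my $path  = '/path/to/file.txt';
my $limit = 3 * 1024 * 1024;

# rotate first if the current file is already too big
if (((-s $path) // 0) > $limit) {
    my $rotator = Logfile::Rotate->new(
        File  => $path,
        Count => 5,        # keep five old copies around
        Gzip  => 'no',     # don't compress the rotated files
    );
    $rotator->rotate;
}

open my $fh, '>>', $path or die "Can't open $path: $!";
print {$fh} "new record\n";
close $fh;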

Related

Perl - Piping gunzip Output to File::ReadBackwards

I have a Perl project (a CGI script running on Apache) that has previously always used gunzip and tac (piping gunzip to tac, and piping that to a filehandle) to accomplish its workload, which is to process large, flat text files, sometimes on the order of 10GB or more each. More specifically, in my use case these files sometimes need to be decompressed on-the-fly AND read backwards (there are times when both are a requirement, mainly for speed).
When I started this project, I looked at using File::ReadBackwards but decided on tac instead for performance reasons. After some discussion on a slightly-related topic last night and several suggestions to try and keep the processing entirely within Perl, I decided to give File::ReadBackwards another shot to see how it stands up under this workload.
Some preliminary tests indicate that it may in fact be comparable, and possibly even better, than tac. However, so far I've only been able to test this on uncompressed files. But it now has grabbed my interest so I'd like to see if I could make it work with compressed files as well.
Now I'm pretty sure I could unzip a file to another file and then read that backwards, but I think that would have terrible performance. Especially because the user has the option to limit results to X lines precisely to help performance, I do not want to have to decompress the entirety of a file every single time I pull a few lines out of it. Ideally I would like to be able to do what I do now, which is to decompress and read it backwards on-the-fly, with the ability to bail out as soon as I hit my quota if needed.
So, my dilemma is that I need to find a way to pipe output from gunzip, into File::ReadBackwards, if possible.
On a side note, I would be willing to give IO::Uncompress::Gunzip a chance as well (compare the decompression performance against a plain, piped gunzip process), either for performance gain (which would surprise me) or for convenience/the ability to pipe output to File::ReadBackwards (which I find slightly more likely).
Does anyone have any ideas here? Any advice is much appreciated.
You can't. File::ReadBackwards requires a seekable handle (i.e. a plain file and not a pipe or socket).
To use File::ReadBackwards, you'd have to first send the output to a named temporary file (which you could create using File::Temp).
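A rough sketch of that route (untested; it assumes the compressed path is in $gz_file and that the system gunzip is available):

use strict;
use warnings;
use File::Temp qw(tempfile);
use File::ReadBackwards;

my $gz_file = shift;   # compressed input

# Decompress into a temporary file that is removed automatically on exit.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);

open my $gz, '-|', 'gunzip', '-c', '--', $gz_file
    or die "Can't run gunzip: $!";
print {$tmp_fh} $_ while <$gz>;
close $gz     or die "gunzip failed: $!";
close $tmp_fh or die "Can't write $tmp_name: $!";

# The plain temporary file is seekable, so File::ReadBackwards works on it.
my $bw = File::ReadBackwards->new($tmp_name)
    or die "Can't read $tmp_name backwards: $!";
while (defined(my $line = $bw->readline)) {
    print $line;
    # last if ...;   # bail out early once a quota is hit
}

Whether this beats the current gunzip-to-tac pipeline will depend on how much of the file ends up being read, since the whole file still has to be decompressed to disk first.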
While File::ReadBackwards won't work as desired, here is another take.
In the original approach you first gunzip before tac-ing, and the whole file is read just to get to its end; thus tac is there only for convenience. (For a plain uncompressed file one can get the file size from file metadata and then seek toward the end of the file so as not to have to read the whole thing.)
Then try the same, or something similar, in Perl. The IO::Uncompress::Gunzip module also has a seek method. It does have to uncompress data up to that point; as the docs note,
Note that the implementation of seek in this module does not provide true random access to a compressed file/buffer
but with it we still avoid copying uncompressed data (into variables) and so pay the minimal price here, namely uncompressing data in order to seek. In my timings this saves upwards of an order of magnitude, bringing it far closer to the system's gunzip (competitive at file sizes on the order of 10MB).
For that we also need the uncompressed size, which the module's seek uses; I get it with the system's gzip -l. So I still need to parse the output of an external tool; there's that issue.†
use warnings;
use strict;
use feature 'say';

use IO::Uncompress::Gunzip qw($GunzipError);

my $file = shift;
die "Usage: $0 file\n" if not $file or not -f $file;

my $z = IO::Uncompress::Gunzip->new($file)
    or die "Error: $GunzipError";

my $us = (split ' ', (`gunzip -l $file`)[1])[1];  # CHECK gunzip's output
say "Uncompressed size: $us";

# Go to 1024 bytes before the uncompressed end (should really be more careful
# since we aren't guaranteed that size estimate)
$z->seek($us - 1024, 0);

while (my $line = $z->getline) {
    print $line if $z->eof;
}
(Note: docs advertise SEEK_END but it didn't work for me, neither as a constant nor as 2. Also note that the constructor does not fail for non-existent files so the program doesn't die there.)
This only prints the last line. Collect those lines into an array instead, for more involved work.
For compressed text files on the order of 10MB in size this runs as fast as gunzip | tac. For files around 100MB in size it is slower by a factor of two. This is a rather rudimentary estimate, and it depends on all manner of detail, but I am comfortable saying that it will be noticeably slower for larger files.
However, the code above has a particular problem with the file sizes possible in this case, in the tens of GB. The good old gzip format has a limitation, nicely stated in the gzip manual:
The gzip format represents the input size modulo 2^32 [...]
So sizes obtained by --list for files larger than 4GB undermine the above optimization: we seek to a place early in the file instead of near its end (for a 17GB file the size is reported by -l as 1GB and so we seek there), and then in fact read the bulk of the file with getline.
The best solution would be to use the known value for the uncompressed data size, if that is known. Otherwise, if the compressed file size exceeds 4GB, seek to its compressed size (as far as we can safely go) and after that use read with very large chunks:
my $buf;
my $len = 10*1_024_000;   # only hundreds of reads, but a large buffer
$z->read($buf, $len) while not $z->eof;

my @last_lines = split /\n/, $buf;
The last step depends on what actually needs to be done. If it is indeed to read lines backwards then you can do while (my $el = pop @last_lines) { ... } for example, or reverse the array and work away. Note that the last read will likely be far smaller than $len.
On the other hand, it could happen that the last read buffer is too small for what's needed, so one may want to always keep the needed number of lines across reads (a sketch of this follows below).
The buffer size to read ($len) clearly depends on specifics of the problem.
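For instance, a sketch of carrying the tail across reads might look like this (it assumes $z is the opened IO::Uncompress::Gunzip handle and $num_lines is the number of lines wanted):

my @tail;
my $carry = '';
my $len   = 10*1_024_000;

until ($z->eof) {
    my $buf;
    $z->read($buf, $len) or last;
    $buf = $carry . $buf;
    my @lines = split /\n/, $buf, -1;   # -1 keeps a trailing partial field
    $carry = pop @lines;                # possibly incomplete last line
    push @tail, @lines;
    splice @tail, 0, @tail - $num_lines if @tail > $num_lines;
}
push @tail, $carry if length $carry;    # flush any trailing partial line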
Finally, if this is too much bother, you can pipe gunzip and keep a rolling buffer of lines:
use String::ShellQuote qw(shell_quote);

my $num_lines = ...;   # user supplied
my @last_lines;

my $cmd = shell_quote('gunzip', '-c', '--', $file);
my $pid = open my $fh, '-|', $cmd
    or die "Can't open $cmd: $!";

# prime the buffer with $num_lines lines so we don't have to check its size below
push @last_lines, scalar <$fh> for 1..$num_lines;

while (<$fh>) {
    push @last_lines, $_;
    shift @last_lines;
}
close $fh;

while (my $line = pop @last_lines) {
    print $line;   # process backwards
}
I put $num_lines lines on the array right away so as not to have to test the size of @last_lines against $num_lines on every read before shifting. (This improves runtime by nearly 30%.)
Any hint of the number of lines (of uncompressed data) is helpful, so that we can skip ahead and avoid copying data into variables as much as possible:
# Stash $num_lines lines on the array, as above
<$fh> for 0..$num_to_skip;   # skip over an estimated number of lines
# Now push+shift while reading, as above
This can help quite a bit, but it depends on how well we can estimate the number of lines. Altogether, in my tests this is still slower than gunzip | tac | head, by around 50% in the very favorable case where I skip 90% of the file.
† The uncompressed size can be found without going to external tools, as
my $us = do {
    my $d;
    open my $fh, '<', $file or die "Can't open $file: $!";
    seek($fh, -4, 2) and read($fh, $d, 4) >= 4 and unpack('V', $d)
        or die "Can't get uncompressed size: $!";
};
Thanks to mosvy for a comment with this.
If we still stick with the system's gunzip, then the safety of running an external command with user input (the filename), largely sidestepped here by the -f check on that file, needs to be taken into account, for example by using String::ShellQuote to compose the command:
use String::ShellQuote qw(shell_quote);
my $cmd = shell_quote('gunzip', '-l', '--', $file);
# my $us = ... qx($cmd) ...;
Thanks to ikegami for the comment.

perl file size calculation not working

I am trying to write a simple perl script that will iterate through the regular files in a directory and calculate the total size of all the files put together. However, I am not able to get the actual size of the file, and I can't figure out why. Here is the relevant portion of the code. I put in print statements for debugging:
$totalsize = 0;
while ($_ = readdir (DH)) {
    print "current file is: $_\t";
    $cursize = -s $_;
    print "size is: $cursize\n";
    $totalsize += $cursize;
}
This is the output I get:
current file is: test.pl size is:
current file is: prob12.pl size is:
current file is: prob13.pl size is:
current file is: prob14.pl size is:
current file is: prob15.pl size is:
So the file size remains blank. I tried using $cursize = $_ instead but the only effect of that was to retrieve the file sizes for the current and parent directories as 4096 bytes each; it still didn't get any of the actual file sizes for the regular files.
I have looked online and through a couple of books I have on perl, and it seems that perl isn't able to get the file sizes because the script can't read the files. I tested this by putting in an if statement:
print "Cannot read file $_\n" if (! -r _);
Sure enough for each file I got the error saying that the file could not be read. I do not understand why this is happening. The directory that has the files in question is a subdirectory of my home directory, and I am running the script as myself from another subdirectory in my home directory. I have read permissions to all the relevant files. I tried changing the mode on the files to 755 (from the previous 711), but I still got the Cannot read file output for each file.
I do not understand what's going on. Either I am mixed up about how permissions work when running a perl script, or I am mixed up about the proper way to use -s _. I appreciate your guidance. Thanks!
If it isn't just your typo (-s _ instead of the correct -s $_), then please remember that readdir returns file names relative to the directory you've opened with opendir. The proper way would be something like:
my $base_dir = '/path/to/somewhere';
opendir DH, $base_dir or die;
while ($_ = readdir DH) {
    print "size of $_: " . (-s "$base_dir/$_") . "\n";
}
closedir DH;
You could also take a look at the core module IO::Dir which offers a tie way of accessing both the file names and the attributes in a simpler manner.
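For instance, a small sketch along those lines (per the IO::Dir docs the tied-hash values are File::stat objects, so ->size works on them):

use strict;
use warnings;
use IO::Dir;

my $base_dir = '/path/to/somewhere';
tie my %dir, 'IO::Dir', $base_dir;

my $totalsize = 0;
for my $name (keys %dir) {
    next unless -f "$base_dir/$name";   # regular files only
    $totalsize += $dir{$name}->size;    # value is a File::stat object
}
print "total: $totalsize bytes\n";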
You have a typo:
$cursize = -s _;
Should be:
$cursize = -s $_;

Easiest way to parse a single, large text file across multiple client machines?

I've been given the task of writing a webapp that analyzes text files given a single regular expression. The text files I am given range anywhere from 500MB to 3GB. I am currently using Perl as my parsing engine. I've been reading about MapReduce and Hadoop, but it seems like the setup is only worth it given very, very large amounts of data, much larger than the amounts I am parsing.
What would be a good way to go about this? Right now a 500MB file takes anywhere from 4 to 6 minutes to parse, which isn't too bad, but the 3GB files take forever, and the webserver usually times out before it can get output from the Perl script and generate a report.
Let's partition your file into 100 chunks, and use seek to let an arbitrary process work on an arbitrary part of the file.
my $chunk = $ARGV[0];   # a user input, from 0 to 99
my $size = -s $THE_FILE;
my $startByte = int($chunk * $size / 100);
my $endByte   = int(($chunk + 1) * $size / 100);

open my $fh, '<', $THE_FILE or die "Can't open $THE_FILE: $!";
seek $fh, $startByte, 0;
scalar <$fh> if $chunk > 0;   # skip a partial line in case we seek'd into the middle of one
while (<$fh>) {
    # ... process this section of the file ...
    last if tell($fh) >= $endByte;
}
Now run this program 100 times on whatever machines you have available, passing the arguments 0 to 99 once to each program.
Actually Hadoop is surprisingly easy to install and use (especially if you don't have huge data and don't need to optimize it). I had a similar task a while ago (processing logs in the range of about 5GB) and it took me no more than a couple of hours to install it on 5 machines, just using the tutorial and docs on their site. Then the programming is really easy: just read from STDIN and write to STDOUT!
Probably making your own split-and-distribute script (even if you build it on top of something like Gearman) will take longer than installing Hadoop.
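If you do go the Hadoop Streaming route, the mapper can be a tiny Perl filter along these lines (a sketch; the pattern is just a stand-in for whatever regex your webapp receives):

#!/usr/bin/perl
# Streaming mapper: read lines on STDIN, emit matching lines on STDOUT.
use strict;
use warnings;

my $re = qr/ERROR/;                 # placeholder pattern
while (my $line = <STDIN>) {
    print $line if $line =~ $re;
}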

How do I properly format plain text data for a simple Perl dictionary app?

I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small, but with everything under the __DATA__ section its size reaches 30 MB. In order to share the work with my friends, I then packed the script into a stand-alone executable using the pp utility of the PAR::Packer module with the highest compression level, 9, and now I have a single-file dictionary app about 17MB in size.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I receive an error like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits, apart from eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, what should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (a rather unusual goal most of the time these days), then zip/unzip is the way to go.
However, if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example, indexed by first letter) and only load the needed chunks.
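A hypothetical sketch of that layout, with one plain-text file per first letter (the data/*.txt names and the tab-separated format are assumptions, not your current format):

# Load one per-letter chunk on demand and cache it.
my %cache;

sub entries_for {
    my ($word) = @_;
    my $letter = lc substr $word, 0, 1;
    $cache{$letter} //= do {
        open my $fh, '<', "data/$letter.txt"
            or die "Can't open chunk for '$letter': $!";
        my %entries;
        while (<$fh>) {
            chomp;
            my ($w, $def) = split /\t/, $_, 2;
            $entries{$w} = $def;
        }
        \%entries;
    };
    return $cache{$letter}{$word};
}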
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is the logical end of the approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to retrieve only the data that is actually needed.
In your case, probably something simple like SQLite or Berkeley DB/DBM files would be OK.
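As a sketch of the SQLite route (the dict.db file and the entries/word/definition names are assumptions, not something your current data already has):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=dict.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
my $sth = $dbh->prepare('SELECT definition FROM entries WHERE word = ?');

sub lookup {
    my ($word) = @_;
    $sth->execute($word);
    my ($definition) = $sth->fetchrow_array;
    $sth->finish;
    return $definition;
}

my $def = lookup('apple') // 'not found';
print "$def\n";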
Does it bring me any extra benefits, apart from eliminating Padre's file-opening issue, if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently; think of a virus definitions file vs. the antivirus executable for a real-world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra command-line options to pp; it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl

use strict;
use warnings;

use Words;

for my $i (1 .. 2) {
    print "Run $i\n";
    while (defined(my $word = Words->next_word)) {
        print "\t$word\n";
    }
}
Words.pm
package Words;

use strict;
use warnings;

my $start = tell DATA
    or die "could not find current position: $!";

sub next_word {
    if (eof DATA) {
        seek DATA, $start, 0
            or die "could not seek: $!";
        return undef;
    }
    chomp(my $word = scalar <DATA>);
    return $word;
}

1;
__DATA__
a
b
c

Check length per file instead of entire request in CGI Upload

I am attempting to modify the Uber-Uploader Perl script so that, when an upload is checked to see whether it meets the minimum requirements, the check is done per file instead of for the entire request.
I'm not too experienced with Perl and don't know how to do this. Currently the script simply does this:
elsif ($ENV{'CONTENT_LENGTH'} > $config{'max_upload_size'}) {
    my $max_size = &format_bytes($config{'max_upload_size'}, 99);
    die("Maximum upload size of $max_size exceeded");
}
All that does is check the content length of the request (which contains multiple files), and fail when the total is greater than the max allowed size.
Is there a way to check it per file?
Note: Changing upload scripts is not an option. Don't try
I'm not sure what you mean by "Changing upload scripts is not an option.", but have you tried something like this?
use CGI;

my $q = CGI->new();
my @files = $q->upload();

foreach my $file (@files) {
    if ((-s $file) > $config{'max_upload_size'}) {
        die("Maximum upload size exceeded for $file");
    }
}
(NOTE: this is untested code!!!!!)