Limitations of the Perl Tie::File module

I tried using the Tie::File module to write a text file that should eventually contain 1 billion lines, but it throws an error after writing 16 million:
"Out of memory!"
"Callback called exit at C:/perl/lib/Tie/File.pm line 979 during global destruction."
This is the code I tried:
use Tie::File;
tie @array, 'Tie::File', "Out.txt";
$start = time();
for ($i = 0; $i <= 15000000; $i++) {
$array[$i].= "$i,";
}
$end = time();
print("time taken: ", $end - $start, "seconds");
untie @array;
I don't know why it throws this error. Is there any way to overcome it? It also took about 55 minutes to write those 16 million records before failing. Is that a usual amount of time to take?

The Tie::File module is known to be quite slow, and it is best used where the advantage of having random access to the lines of a file outweighs the poor performance.
But this isn't a problem with the module; it is a limitation of Perl, or, more accurately, of your computer system. If you take the module out of the picture and just try to create an ordinary array with 1,000,000,000 elements, Perl will die with an Out of memory! error. The limit for my 32-bit build of Perl 5 version 20 is around 30 million elements. For 64-bit builds it will be substantially more.
Tie::File doesn't keep the whole file in memory; it pages data in and out to save space, so it can handle very large files. Just not that large!
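For what it's worth, Tie::File's read cache can be capped with its memory option, which takes a size in bytes and defaults to roughly 2 MB. A minimal sketch is below; it may not be enough to avoid the error in your situation, and it does nothing for the speed.
use Tie::File;

# Limit Tie::File's cache to about 10 MB instead of the default ~2 MB.
tie my @array, 'Tie::File', 'Out.txt', memory => 10_000_000
    or die "Cannot tie Out.txt: $!";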
In this case you have no need of the advantages of Tie::File, and you should just write the data sequentially to the file. Something like this:
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '>', 'Out.txt';
my $time = time;
for my $i (0 .. 15_000_000) {
print $fh "$i,\n";
}
$time = time - $time;
printf "Time taken: %d seconds\n", $time;
This program ran in seven seconds on my system.
Please also note the use strict and use warnings at the start of the program. They are essential for any Perl program and will quickly reveal many simple problems that you would otherwise overlook. With use strict in place, each variable must be declared with my, as close as possible to its first point of use.

Related

Perl to read each line

I'm a beginner with Perl scripting. The script below is meant to check whether a file's modified time is more than 600 seconds ago. I read the filenames from filelist.txt.
When I try to print the file modified time, it shows up blank.
Could you help me see where I'm going wrong?
filelist.txt
a.sh
file.txt
Perl script
#!/usr/bin/perl
my $filename = '/root/filelist.txt';
open(INFO, $filename) or die("Could not open file.");
foreach $eachfile (<INFO>) {
my $file="/root/$eachfile";
my $file_timestamp = (stat $file)[9];
my $timestamp = localtime($epoch_timestamp);
my $startTime = time();
my $fm = $startTime - $file_timestamp;
print "The file modified time is = $file_timestamp\n";
if ($rm > 600) {
print "$file modified time is greater than 600 seconds";
}
else {
print "$file modified time is less than 600 seconds\n";
}
}
You didn't include use strict; or use warnings;, which is your downfall.
You set $fm; you test $rm. These are not the same variable. Using strictures and warnings would have pointed out the error of your ways. Experts use them routinely to make sure they aren't making silly mistakes. Beginners should use them too, for exactly the same reason.
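For example, with use strict in place a script like this does not even compile; the undeclared $rm is caught immediately (the script name and line number in the message are just illustrative):
use strict;
use warnings;

my $fm = 700;
if ($rm > 600) {    # $rm was never declared with my...
    print "big\n";
}
# ...so compilation fails with something like:
# Global symbol "$rm" requires explicit package name at check.pl line 5.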
This revised script:
Uses use strict; and use warnings;
Makes sure each variable is defined with my
Doesn't contain $epoch_timestamp or $timestamp
Uses lexical file handles ($info) and the three argument form of open
Closes the file
Includes newlines at the ends of messages
Chomps the file name read from the file
Prints the file name so it can be seen that the chomp is doing its stuff
Locates the files in the current directory instead of /root
Avoids parentheses around the argument to die
Includes the file name in the argument to die
Could be optimized by moving my $startTime = time; outside the loop
Uses $fm in the test
Could be improved so that the greater than/less than messages handled the equals case correctly
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $filename = './filelist.txt';
open my $info, '<', $filename or die "Could not open file $filename";
foreach my $eachfile (<$info>)
{
chomp $eachfile;
my $file="./$eachfile";
print "[$file]\n";
my $file_timestamp = (stat $file)[9];
my $startTime = time();
my $fm = $startTime - $file_timestamp;
print "The file modified time is = $file_timestamp\n";
if ($fm > 600) {
print "$file modified time is greater than 600 seconds\n";
}
else {
print "$file modified time is less than 600 seconds\n";
}
}
close $info;
Tangentially: if you're working in /root, the chances are you are running as user root. I trust you have good backups. Experimenting in programming as root is a dangerous game. A simple mistake can wipe out the entire computer, doing far more damage than if you were running as a mere mortal user (rather than the god-like super user).
Strong recommendation: Don't learn to program as root!
If you ignore this advice, make sure you have good backups, and know how to recover from them.
(FWIW: I run my Mac as a non-root user; I even run system upgrades as a non-root user. I do occasionally use root privileges via sudo or equivalents, but I never login as root. I have no need to do so. And I minimize the amount of time I spend as root to minimize the chance of doing damage. I've been working on Unix systems for 30 years; I haven't had a root-privileged accident in over 25 years, because I seldom work as root.)
What others have run into before: when you read the filename from the INFO filehandle you end up with a newline character at the end of the string, and then trying to open /root/file1<newline> fails because no file of that name exists.
Try calling:
chomp $eachfile
before constructing $file
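That is, inside your loop, something along these lines (everything else unchanged):
foreach my $eachfile (<INFO>) {
    chomp $eachfile;                  # remove the trailing newline
    my $file = "/root/$eachfile";     # now the path has no stray newline
    my $file_timestamp = (stat $file)[9];
    # ... rest of the loop as before ...
}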

Parsing huge text file in Perl

I have a genome file of about 30 GB, similar to the sample below:
>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC
I am trying to parse the file and get the task done quickly, using the code below to read it character by character,
but the characters are not getting printed.
open (FH,"<:raw",'genome.txt') or die "cant open the file $!\n";
until ( eof(FH) ) {
$ch = getc(FH);
print "$ch\n";# not printing ch
}
close FH;
Your mistake is forgetting an eof:
until (eof FH) { ... }
But that is very unlikely to be the most efficient solution: Perl is slower than, say, C, so we want as few loop iterations as possible, and as much of the work as possible done inside the perl internals. This means that reading a file character by character is slow.
Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.
Either pick a natural record delimiter (like \n), or read a certain number of bytes:
local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
# do something with the bytes
}
(see perlvar)
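For example, assuming the task were something like counting base frequencies (that goal is my assumption, not something from the question), a chunked read might look like this; the > header lines would still need separate handling:
use strict;
use warnings;

open my $fh, '<:raw', 'genome.txt' or die "can't open the file $!\n";
my %count = (A => 0, C => 0, G => 0, T => 0);
{
    local $/ = \65536;                # read 64 KB at a time
    while (my $chunk = <$fh>) {
        # tr/// in scalar context just counts the matching characters.
        $count{A} += $chunk =~ tr/A//;
        $count{C} += $chunk =~ tr/C//;
        $count{G} += $chunk =~ tr/G//;
        $count{T} += $chunk =~ tr/T//;
    }
}
close $fh;
print "$_: $count{$_}\n" for sort keys %count;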
You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.
# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.
sysopen FH, "genome.txt", O_RDONLY or die;
my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
print($char .= "\n"); # appending should be better than concatenation.
}
If we have gone that far, using Inline::C is only a small, and possibly preferable, further step.
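For completeness, here is an (untested) sketch of what the Inline::C route might look like; count_char is just a made-up example function, not anything from the question:
use strict;
use warnings;
use Inline C => <<'END_C';
/* Count how many times the first character of "which" occurs in "s". */
int count_char(char *s, char *which) {
    int n = 0;
    while (*s) {
        if (*s == which[0]) n++;
        s++;
    }
    return n;
}
END_C

print count_char("GAGAGGTGTGGAGAGGAGAGG", "G"), "\n";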

Storing time series data, without a database

I would like to store time series data, such as CPU usage over 6 months (I will poll the CPU usage every 2 minutes, so that later I can get several resolutions, such as 1 week, 1 month, or even finer ones like 5 minutes, etc.).
I'm using Perl, and I don't want to use RRDtool or a relational database. I was thinking of implementing my own storage using some sort of circular buffer (ring buffer) with the following properties:
6 Months = 186 Days = 4,464 Hours = 267,840 Minutes.
Dividing it into 2 minutes sections: 267,840 / 2 = 133,920.
133,920 is the ring-buffer size.
Each element in the ring-buffer will be a hashref with the key as the epoch (converted easily into date time using localtime) and the value is the CPU usage for that given time.
I will serialize this ring-buffer (using Storable I guess)
Any other suggestions?
Thanks,
I suspect you're overthinking this. Why not just use a flat (e.g. TAB-delimited) file with one line per time interval, each line containing a timestamp and the CPU usage? That way you can simply append new entries to the file as they are collected.
If you want to automatically discard data older than 6 months, you can use a separate file for each day (or week, or month, or whatever) and delete the old files. This is more efficient than reading and rewriting the entire file every time.
Writing and parsing such files is trivial in Perl. Here's some example code, off the top of my head:
Writing:
use strict;
use warnings;
use POSIX qw'strftime';
my $dir = '/path/to/log/directory';
my $now = time;
my $date = strftime '%Y-%m-%d', gmtime $now; # ISO 8601 datetime format
my $time = strftime '%H:%M:%S', gmtime $now;
my $data = get_cpu_usage_somehow();
my $filename = "$dir/cpu_usage_$date.log";
open FH, '>>', $filename
or die "Failed to open $filename for append: $!\n";
print FH "${date}T${time}\t$data\n";
close FH or die "Error writing to $filename: $!\n";
Reading:
use strict;
use warnings;
use POSIX qw'strftime';
my $dir = '/path/to/log/directory';
foreach my $filename (sort glob "$dir/cpu_usage_*.log") {
open FH, '<', $filename
or die "Failed to open $filename for reading: $!\n";
while (my $line = <FH>) {
chomp $line;
my ($timestamp, $data) = split /\t/, $line, 2;
# do something with timestamp and data (or save for later processing)
}
}
(Note: I can't test either of these example programs right now, so they might contain bugs or typos. Use at your own risk!)
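To actually discard data older than six months, as suggested above, a small cleanup pass over the per-day files is enough. A sketch in the same spirit (also untested; the 183-day cutoff is just an approximation of six months):
use strict;
use warnings;
my $dir = '/path/to/log/directory';
for my $file (glob "$dir/cpu_usage_*.log") {
    next unless -M $file > 183;   # -M gives the file's age in days
    unlink $file or warn "Failed to delete $file: $!\n";
}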
As @Borodin suggests, use SQLite or DBM::Deep.
If you want to stick to Perl itself, go with DBM::Deep:
A unique flat-file database module, written in pure perl. ... Can handle millions of keys and unlimited levels without significant slow-down. Written from the ground-up in pure perl -- this is NOT a wrapper around a C-based DBM. Out-of-the-box compatibility with Unix, Mac OS X and Windows.
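A minimal sketch of how that might look for this use case, using epoch seconds as keys (get_cpu_usage_somehow is the same hypothetical collector as in the answer above):
use strict;
use warnings;
use DBM::Deep;

# Creates cpu_usage.db on first use; later runs simply reopen it.
my $db = DBM::Deep->new('cpu_usage.db');

# Record the current sample, keyed by epoch time.
$db->{ time() } = get_cpu_usage_somehow();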
You mention your need for storage, which could be satisfied by a simple text file as advocated by @llmari. (And, of course, using a CSV format would allow the file to be manipulated easily in a spreadsheet.)
But, if you plan on collecting a lot of data, and you wish to eventually be able to query it with good performance, then go with a tool designed for that purpose.

Parsing multiple files at a time in Perl

I have a large data set (around 90 GB) to work with. There are data files (tab-delimited) for each hour of each day, and I need to perform operations on the entire data set. For example, getting the share of each OS, which is given in one of the columns. I tried merging all the files into one huge file and performing a simple count operation, but it was simply too huge for the server memory.
So I guess I need to perform the operation on one file at a time and then add up the results at the end. I am new to Perl and especially naive about performance issues. How do I do such operations in a case like this?
As an example, two columns of the file are:
ID OS
1 Windows
2 Linux
3 Windows
4 Windows
Let's do something simple: counting the share of each OS in the data set. Each .txt file has millions of these lines and there are many such files. What would be the most efficient way to operate on all the files?
Unless you're reading the entire file into memory, I don't see why the size of the file should be an issue.
my %osHash;
while (<>)
{
my ($id, $os) = split("\t", $_);
if (!exists($osHash{$os}))
{
$osHash{$os} = 0;
}
$osHash{$os}++;
}
foreach my $key (sort(keys(%osHash)))
{
print "$key : ", $osHash{$key}, "\n";
}
While Paul Tomblin's answer dealt with filling the hash, here's the same plus opening the files:
use strict;
use warnings;
use 5.010;
use autodie;
my @files = map { "file$_.txt" } 1..10;
my %os_count;
for my $file (@files) {
open my $fh, '<', $file;
while (<$fh>) {
my ($id, $os) = split /\t/;
... #Do something with %os_count and $id/$os here.
}
}
We just open each file serially; since you need to read all lines from all files, there isn't much more you can do about it. Once you have the hash, you could store it somewhere and load it when the program starts, then skip all lines up to the last one you read, or simply seek there if your records permit it, which doesn't look like the case here.
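For the "store it somewhere" part, Storable would do; a minimal sketch (the file name is arbitrary):
use strict;
use warnings;
use Storable qw(store retrieve);

my %os_count = ( Windows => 3, Linux => 1 );

# Save the partial counts to disk ...
store \%os_count, 'os_count.stor';

# ... and load them again on the next run.
my $loaded = retrieve 'os_count.stor';
print "$_: $loaded->{$_}\n" for sort keys %$loaded;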

Subroutines vs scripts in Perl

I'm fairly new to Perl and was wondering what the best practices regarding subroutines are in Perl. Can a subroutine be too big?
I'm working on a script right now, and it might need to call another script. Should I just integrate the old script into the new one in the form of a subroutine? I need to pass one argument to the script and need one return value.
I'm guessing I'd have to do some sort of black magic to get the output from the original script, so turning it into a subroutine makes sense, right?
Avoiding "black magic" is always a good idea when writing code. You never want to jump through hoops and come up with an unintuitive hack to solve a problem, especially if that code needs to be supported later. It happens, admittedly, and we're all guilty of it. Circumstances can weigh heavily on "just getting the darn thing to work."
The point is, the best practice is always to make the code clean and understandable. Remember, and this is especially true with Perl code in my experience, any code you wrote yourself more than a few months ago may as well have been written by someone else. So even if you're the only one who needs to support it, do yourself a favor and make it easy to read.
Don't cling to broad sweeping ideas like "favor more files over larger files" or "favor smaller methods/subroutines over larger ones" etc. Those are good guidelines to be sure, but apply the spirit of the guideline rather than the letter of it. Keep the code clean, understandable, and maintainable. If that means the occasional large file or large method/subroutine, so be it. As long as it makes sense.
A key design goal is separation of concerns. Ideally, each subroutine performs a single well-defined task. In this light, the main question revolves not around a subroutine's size but its focus. If your program requires multiple tasks, that implies multiple subroutines.
In more complex scenarios, you may end up with groups of subroutines that logically belong together. They can be organized into libraries or, even better, modules. If possible, you want to avoid a scenario where you end up with multiple scripts that need to communicate with each other, because the usual mechanism for one script to return data to another script is tedious: the first script writes to standard output and the second script must parse that output.
Several years ago I started work at a job requiring that I build a large number of command-line scripts (at least, that's how it turned out; in the beginning, it wasn't clear what we were building). I was quite inexperienced at the time and did not organize the code very well. In hindsight, I should have worked from the premise that I was writing modules rather than scripts. In other words, the real work would have been done by modules, and the scripts (the code executed by a user on the command line) would have remained very small front-ends to invoke the modules in various ways. This would have facilitated code reuse and all of that good stuff. Live and learn, right?
Another option that hasn't been mentioned yet for reusing the code in your scripts is to put common code in a module. If you put shared subroutines into a module or modules, you can keep your scripts short and focused on what they do that is special, while isolating the common code in an easy-to-access, reusable form.
For example, here is a module with a few subroutines. Put this in a file called MyModule.pm:
package MyModule;
# Always do this:
use strict;
use warnings;
use IO::Handle; # For OOP filehandle stuff.
use Exporter qw(import); # This lets us export subroutines to other scripts.
# These may be exported.
our @EXPORT_OK = qw( gather_data_from_fh open_data_file );
# Automatically export everything allowed.
# Generally best to leave empty, but in some cases it makes
# sense to export a small number of subroutines automatically.
our @EXPORT = @EXPORT_OK;
# Array of directories to search for files.
our @SEARCH_PATH;
# Parse the contents of a IO::Handle object and return structured data
sub gather_data_from_fh {
my $fh = shift;
my %data;
while ( my $line = $fh->getline ) {
# Parse the line
chomp $line;
my ($key, @values) = split ' ', $line;
$data{$key} = \@values;
}
return \%data;
}
# Search a list of directories for a file with a matching name.
# Open it and return a handle if found.
# Die otherwise
sub open_data_file {
my $file_name = shift;
for my $path ( @SEARCH_PATH, '.' ) {
my $file_path = "$path/$file_name";
next unless -e $file_path;
open my $fh, '<', $file_path
or die "Error opening '$file_path' - $!\n"
return $fh;
}
die "No matching file found in path\n";
}
1; # Need to have trailing TRUE value at end of module.
Now in script A, we take a filename to search for and process and then print formatted output:
use strict;
use warnings;
use MyModule;
# Configure which directories to search
@MyModule::SEARCH_PATH = qw( /foo/foo/rah /bar/bar/bar /eeenie/meenie/mynie/moe );
#get file name from args.
my $name = shift;
my $fh = open_data_file($name);
my $data = gather_data_from_fh($fh);
for my $key ( sort keys %$data ) {
print "$key -> ", join ', ', #{$data->{$key}};
print "\n";
}
Script B searches for a file, parses it, and then writes the parsed data structure to a YAML file.
use strict;
use warnings;
use MyModule;
use YAML qw( DumpFile );
# Configure which directories to search
@MyModule::SEARCH_PATH = qw( /da/da/da/dum /tutti/frutti/unruly /cheese/burger );
#get file names from args.
my $infile = shift;
my $outfile = shift;
my $fh = open_data_file($infile);
my $data = gather_data_from_fh($fh);
DumpFile( $outfile, $data );
Some related documentation:
perlmod - About Perl modules in general
perlmodstyle - Perl module style guide; this has very useful info.
perlnewmod - Starting a new module
Exporter - The module used to export functions in the sample code
use - the perlfunc article on use.
Some of these docs assume you will be sharing your code on CPAN. If you won't be publishing to CPAN, simply ignore the parts about signing up and uploading code.
Even if you aren't writing for CPAN, it is beneficial to use the standard tools and CPAN file structure for your module development. Following the standard allows you to use all of the tools CPAN authors use to simplify the development, testing and installation process.
I know that all this seems really complicated, but the standard tools make each step easy. Even adding unit tests to your module distribution is easy thanks to the great tools available. The payoff is huge, and well worth the time you will invest.
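As a concrete (hypothetical) starting point, the distribution for the MyModule example above could be laid out like this:
MyModule/
    Makefile.PL
    lib/MyModule.pm    (the module shown above)
    t/basic.t          (unit tests)
with a minimal ExtUtils::MakeMaker Makefile.PL:
use ExtUtils::MakeMaker;
WriteMakefile(
    NAME    => 'MyModule',
    VERSION => '0.01',
);
With that in place, the usual perl Makefile.PL, make, make test and make install cycle works, and prove -l t runs the tests during development.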
Sometimes it makes sense to have a separate script, sometimes it doesn't. The "black magic" isn't that complicated.
#!/usr/bin/perl
# square.pl
use strict;
use warnings;
my $input = shift;
print $input ** 2;
#!/usr/bin/perl
# sum_of_squares.pl
use strict;
use warnings;
my ($from, $to) = @ARGV;
my $sum;
for my $num ( $from .. $to ) {
$sum += `square.pl $num` // die "square.pl failed: $? $!";
}
print $sum, "\n";
Easier and better error reporting on failure is automatic with IPC::System::Simple:
#!/usr/bin/perl
# sum_of_squares.pl
use strict;
use warnings;
use IPC::System::Simple 'capture';
my ($from, $to) = @ARGV;
my $sum;
for my $num ( $from .. $to ) {
$sum += capture( "square.pl $num" );
}
print $sum, "\n";