Parsing multiple files at a time in Perl

I have a large data set (around 90GB) to work with. There are data files (tab-delimited) for each hour of each day, and I need to perform operations on the entire data set, for example, getting the share of OSes listed in one of the columns. I tried merging all the files into one huge file and performing a simple count operation, but it was simply too big for the server memory.
So I guess I need to perform the operation on each file, one at a time, and then add up the results at the end. I am new to Perl and am especially naive about performance issues. How do I do such operations in a case like this?
As an example, two columns of the file are:
ID OS
1 Windows
2 Linux
3 Windows
4 Windows
Let's do something simple: counting the share of each OS in the data set. Each .txt file has millions of these lines, and there are many such files. What would be the most efficient way to operate on all the files?

Unless you're reading the entire file into memory, I don't see why the size of the file should be an issue.
my %osHash;
while (<>)
{
    chomp;                              # the OS is the last field, so drop the trailing newline
    my ($id, $os) = split("\t", $_);
    if (!exists($osHash{$os}))
    {
        $osHash{$os} = 0;
    }
    $osHash{$os}++;
}

foreach my $key (sort(keys(%osHash)))
{
    print "$key : ", $osHash{$key}, "\n";
}

While Paul Tomblin's answer dealt with filling the hash, here's the same plus opening the files:
use strict;
use warnings;
use 5.010;
use autodie;
my @files = map { "file$_.txt" } 1..10;
my %os_count;

for my $file (@files) {
    open my $fh, '<', $file;            # autodie turns open failures into exceptions
    while (<$fh>) {
        my ($id, $os) = split /\t/;
        ... # Do something with %os_count and $id/$os here.
    }
}
We just open each file serially. Since you need to read all lines from all files, there isn't much more you can do about it. Once you have the hash, you could store it somewhere and load it when the program starts, then skip all lines up to the last one you read, or simply seek to that position if your record format permits it, which doesn't look like the case here.
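For completeness, here is a rough, untested sketch of that per-file counting approach end to end. The data/*.txt glob and the two-column, tab-delimited layout are assumptions taken from the question, so adjust them to your real file names:
use strict;
use warnings;
use autodie;

# Hypothetical file list; you could also pass the hourly files on
# the command line and use @ARGV here instead.
my @files = glob 'data/*.txt';

my %os_count;
my $total = 0;

for my $file (@files) {
    open my $fh, '<', $file;            # autodie turns failures into exceptions
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $os) = split /\t/, $line;
        next unless defined $os;        # skip malformed lines
        $os_count{$os}++;
        $total++;
    }
    close $fh;
}

# Report each OS and its share of all records.
for my $os (sort keys %os_count) {
    printf "%-10s %10d  (%.2f%%)\n",
        $os, $os_count{$os}, 100 * $os_count{$os} / ($total || 1);
}
Only one line is held in memory at a time, so the 90GB total never has to fit in RAM; only %os_count does.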

Related

Reading lines of a file into a hash in parallel in Perl

I have thousands of files. My goal is to insert the lines of those files into a hash (a large number of those lines repeat).
For now, I iterate through an array of files and, for each file, I open it and split each row (because each row has the format <path>,<number>).
Then I insert it into the %paths hash. I also write each line into one main file (trying to save time by combining them).
Piece of my code:
open(my $fh_main, '>', "$main_file") or die;

foreach my $dir (@dirs)
{
    my $test = $dir."/"."test.csv";
    open(my $fh, '<', "$test") or die;
    while (my $row = <$fh>)
    {
        print $fh_main $row;
        chomp($row);
        my ($path, $counter) = split(",", $row);
        my $abs_path = abs_path($path);
        $paths{$abs_path} += $counter;
    }
    close($fh);
}
close($fh_main);
Because there are a lot of files, I would like to split the iteration at least in half. I thought of using the Parallel::ForkManager module,
in order to insert the files in parallel into a hash A and into a hash B (if possible, more than two hashes).
Then I can combine those two (or more) hashes into one main hash. There should not be a memory issue (I'm running on a machine that does not have memory issues).
I read the documentation, but every attempt failed and each iteration ended up running on its own. I would like to see an initial example of how I should solve this issue.
Also, I would like to hear other opinions on how to implement this in a cleaner and wiser way.
Edit: maybe I didn't understand exactly what the module does. I would like to create a fork in the script so that one half of the files is collected by process 1 and the other half by process 2. The first one to finish will write to a file and the other one will read from it. Is it possible to implement? Will it reduce the run time?
Try MCE::Map. It will automatically gather the output of the sub-processes into a list, which in your case can be a hash. Here's some untested pseudocode:
use MCE::Map qw[ mce_map ];

# note that MCE passes the argument via $_, not @_
sub process_file {
    my $file = $_;
    my %result_hash;
    ...;    # fill %result_hash from $file here
    return %result_hash;
}

my %result_hash = mce_map \&process_file, \@list_of_files;
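If you would rather use Parallel::ForkManager, which the question mentions, here is a rough, untested sketch. The directory list, the test.csv name and the <path>,<number> line format are taken from the question; the merging happens in the run_on_finish callback:
use strict;
use warnings;
use Cwd qw(abs_path);
use Parallel::ForkManager;

my @dirs = @ARGV;                  # the directories holding test.csv files
my %paths;                         # merged result, lives in the parent

my $pm = Parallel::ForkManager->new(4);   # at most 4 children at once

# Runs in the parent whenever a child exits; $data is whatever the
# child handed to finish().
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $signal, $core, $data) = @_;
    return unless $data;
    $paths{$_} += $data->{$_} for keys %$data;
});

for my $dir (@dirs) {
    $pm->start and next;           # parent: move on to the next dir; child: fall through

    my %partial;
    my $test = "$dir/test.csv";
    open my $fh, '<', $test or die "Can't open $test: $!";
    while (my $row = <$fh>) {
        chomp $row;
        my ($path, $counter) = split /,/, $row;
        $partial{ abs_path($path) } += $counter;
    }
    close $fh;

    $pm->finish(0, \%partial);     # ship the partial hash back to the parent
}

$pm->wait_all_children;
Be aware that each child pays a serialization cost to send its partial hash back to the parent, so with many small files the MCE::Map approach above may end up simpler and just as fast.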

quickest way to count the number of files in a directory containing hundreds of thousands of files

I have a Solaris system that processes large numbers of files and stores their information in a database (yes, I know that using the database is the quickest way to get information about the number of files we have). I need a fast way to monitor the files as they progress through the system on their way to being stored in the database.
Currently I use a Perl script that reads the directory into an array and then grabs the size of the array and sends it to a monitoring script. Unfortunately, as our system grows, this monitor is getting slower and slower.
I am looking for a method that will operate much more quickly, instead of pausing for 15-20 seconds between updates while it performs the count operation on all the directories involved.
I am relatively certain that my bottleneck is the read directory into array operation.
I don't need any information about the files, I don't need sizes or file names, just the number of files in the directory.
In my code I do not count hidden files or the text files I use to hold configuration information. It would be great if this functionality were preserved, but it is certainly not mandatory.
I have found some references to counting inodes with C code or something along those lines but I am not very experienced in that area.
I would like to make this monitor as real-time as possible.
The Perl code I use looks like this:
opendir (DIR, $currentDir) or die "Cannot open directory: $!";
@files = grep ! m/^\./ && ! /config_file/, readdir DIR; # skip hidden files and config files
closedir(DIR);
$count = @files;
What you do right now reads the whole directory (more or less) into memory only to discard that content for its count. Avoid that by streaming the directory instead:
my $count = 0;
opendir(my $dh, $curDir) or die "opendir($curDir): $!";
while (my $de = readdir($dh)) {
    next if $de =~ /^\./ or $de =~ /config_file/;
    $count++;
}
closedir($dh);
Importantly, don't use glob() in any of its forms. glob() will expensively stat() every entry, which is not overhead you want.
Now, you might have much more sophisticated and lighter weight ways of doing this depending on OS capabilities or filesystem capabilities (Linux, by way of comparison, offers inotify), but streaming the dir as above is about as good as you'll portably get.
Keep it short.
my @files = readdir(DIR);
my $count = @files - 2;
The -2 is because readdir counts "." and ".." as directory entries.
print "$count files found\n";
exit;

Storing time series data, without a database

I would like to store time series data, such as CPU usage over 6 months (I will poll the CPU usage every 2 minutes, so later I can get several resolutions, such as 1 week, 1 month, or even higher resolutions such as 5 minutes, etc.).
I'm using Perl, and I don't want to use RRDtool or a relational database. I was thinking of implementing my own storage using some sort of circular buffer (ring buffer) with the following properties:
6 months = 186 days = 4,464 hours = 267,840 minutes.
Dividing it into 2-minute sections: 267,840 / 2 = 133,920.
133,920 is the ring-buffer size.
Each element in the ring buffer will be a hashref with the epoch as the key (easily converted into a datetime using localtime) and the CPU usage at that time as the value.
I will serialize this ring buffer (using Storable, I guess).
Any other suggestions?
Thanks,
I suspect you're overthinking this. Why not just use a flat (e.g. tab-delimited) file with one line per time interval, with each line containing a timestamp and the CPU usage? That way, you can just append new entries to the file as they are collected.
If you want to automatically discard data older than 6 months, you can do this by using a separate file for each day (or week or month or whatever) and deleting old files. This is more efficient than reading and rewriting the entire file every time.
Writing and parsing such files is trivial in Perl. Here's some example code, off the top of my head:
Writing:
use strict;
use warnings;
use POSIX qw'strftime';
my $dir = '/path/to/log/directory';
my $now = time;
my $date = strftime '%Y-%m-%d', gmtime $now; # ISO 8601 datetime format
my $time = strftime '%H:%M:%S', gmtime $now;
my $data = get_cpu_usage_somehow();
my $filename = "$dir/cpu_usage_$date.log";
open FH, '>>', $filename
or die "Failed to open $filename for append: $!\n";
print FH "${date}T${time}\t$data\n";
close FH or die "Error writing to $filename: $!\n";
Reading:
use strict;
use warnings;
use POSIX qw'strftime';
my $dir = '/path/to/log/directory';
foreach my $filename (sort glob "$dir/cpu_usage_*.log") {
    open FH, '<', $filename
        or die "Failed to open $filename for reading: $!\n";
    while (my $line = <FH>) {
        chomp $line;
        my ($timestamp, $data) = split /\t/, $line, 2;
        # do something with timestamp and data (or save for later processing)
    }
}
(Note: I can't test either of these example programs right now, so they might contain bugs or typos. Use at your own risk!)
As @Borodin suggests, use SQLite or DBM::Deep as recommended here.
If you want to stick to Perl itself, go with DBM::Deep:
A unique flat-file database module, written in pure perl. ... Can handle millions of keys and unlimited levels without significant slow-down. Written from the ground-up in pure perl -- this is NOT a wrapper around a C-based DBM. Out-of-the-box compatibility with Unix, Mac OS X and Windows.
You mention your need for storage, which could be satisfied by a simple text file as advocated by @llmari. (And, of course, using a CSV format would allow the file to be manipulated easily in a spreadsheet.)
But, if you plan on collecting a lot of data, and you wish to eventually be able to query it with good performance, then go with a tool designed for that purpose.
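If you go the DBM::Deep route, a minimal, untested sketch might look like this; the cpu_usage.db file name is an assumption, and get_cpu_usage_somehow() is the same placeholder used in the example above:
use strict;
use warnings;
use DBM::Deep;

# Hypothetical on-disk hash, keyed by epoch timestamp.
my $db = DBM::Deep->new('cpu_usage.db');

my $now   = time;
my $usage = get_cpu_usage_somehow();   # placeholder, as above

$db->{$now} = $usage;                  # written straight to disk, nothing to serialize by hand

# Reading a sample back later:
print scalar localtime($now), ": $db->{$now}\n";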

How to sort rows in a text file in Perl?

I have a couple of text files (A.txt and B.txt) which look like this (they might have ~10,000 rows each):
processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521
I need to check if every row in file A.txt is present in B.txt as well (B.txt might have more too, that is okay).
The thing is that rows can be in any order in the two files, so I am thinking I will sort both files in some particular order in O(n log n) and then match each line in A.txt against the next lines in B.txt in O(n). I could use a hash, but the files are big and this comparison happens only once, after which the files are regenerated, so I don't think that is a good idea.
What is the best way to sort the files in Perl? Any ordering would do, it just needs to be some ordering.
For example, in dictionary ordering, this would be
processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521
As I mentioned before, any ordering would be just as fine, as long as Perl is fast in doing it.
I want to do it from within Perl code, after opening the file like so
open (FH, "<A.txt");
Any comments, ideas etc would be helpful.
To sort the file in your script, you will still have to load the entire thing into memory. If you're doing that, I'm not sure what the advantage of sorting it is versus just loading it into a hash.
Something like this would work:
my %seen;

open(A, "<A.txt") or die "Can't read A: $!";
while (<A>) {
    chomp;
    $seen{$_} = 1;
}
close A;

open(B, "<B.txt") or die "Can't read B: $!";
while (<B>) {
    chomp;
    delete $seen{$_};
}
close B;

print "Lines found in A, missing in B:\n";
print "$_\n" for keys %seen;
Here's another way to do it. The idea is to create a flexible data structure that allows you to answer many kinds of questions easily with grep.
use strict;
use warnings;

my ($fileA, $fileB) = @ARGV;

# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV}++ while <>;

# Do whatever you need.
my @all_lines = keys %h;
my @in_both   = grep { keys %{$h{$_}} == 2 } keys %h;
my @in_A      = grep { exists $h{$_}{$fileA} } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB} } @in_A;
my @in_A_mult = grep { $h{$_}{$fileA} > 1 } @in_A;
Well, I routinely parse very large (600MB) daily Apache log files with Perl, and I use a hash to store the information. I also go through about 30 of these files in one script instance, using the same hash. It's not a big deal assuming you have enough RAM.
May I ask why you must do this in native Perl? If the cost of calling a system call or 3 is not an issue (e.g. you do this infrequently and not in a tight loop), why not simply do:
my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted |wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
    print "Stuff in A exists that aren't in B\n";
}
Please note that the comm parameters might need to be different, depending on what exactly you want.
As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I've never had occasion to try either, so I don't know which would be better for you.
Another possibility would be to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I couldn't say whether using a DBM file would be faster or slower than the sort, but the code would probably be simpler.
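A rough, untested sketch of that AnyDBM_File idea follows; the seen.dbm file name is an assumption, and the overall flow mirrors the in-memory hash answer above, except that the tied hash lives on disk and can grow past available RAM:
use strict;
use warnings;
use AnyDBM_File;
use Fcntl;

# Tie the hash to an on-disk DBM file instead of keeping it in memory.
tie my %seen, 'AnyDBM_File', 'seen.dbm', O_RDWR | O_CREAT, 0644
    or die "Cannot tie DBM file: $!";

open my $fh_a, '<', 'A.txt' or die "Can't read A.txt: $!";
$seen{$_} = 1 while <$fh_a>;
close $fh_a;

open my $fh_b, '<', 'B.txt' or die "Can't read B.txt: $!";
delete $seen{$_} while <$fh_b>;
close $fh_b;

print "Lines found in A, missing in B:\n";
print for keys %seen;

untie %seen;
unlink glob 'seen.dbm*';    # the DBM file is only a scratch area here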
Test if A.txt is a subset of B.txt
open my $fh_b, '<', 'B.txt' or die "Can't read B.txt: $!";
my %bFile;
while (<$fh_b>) {
    chomp;
    my ($process, $id1, $id2) = split /,/;
    $bFile{$process}{$id1}{$id2}++;
}
close $fh_b;

my $missingRows = 0;
open my $fh_a, '<', 'A.txt' or die "Can't read A.txt: $!";
while (<$fh_a>) {
    chomp;
    my ($process, $id1, $id2) = split /,/;
    unless ($bFile{$process}{$id1}{$id2}) {
        $missingRows++;
        last;    # one miss means A is not a subset of B
    }
}
close $fh_a;

my $is_Atxt_Subset_Btxt = $missingRows ? 0 : 1;
That gives you a test for all rows in A being in B while only reading all of B into memory, then checking each row of A as it is read.

How do I efficiently parse a CSV file in Perl?

I'm working on a project that involves parsing a large csv formatted file in Perl and am looking to make things more efficient.
My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes over the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting the processing in half would be a significant improvement to the entire application.
My question is, what is the most time efficient means of parsing a large CSV file using only built in tools?
Note: each line has a varying number of tokens, so we can't just ignore line boundaries and split on commas only. Also, we can assume fields will contain only alphanumeric ASCII data (no special characters or other tricks). Finally, I don't want to get into parallel processing, although it might work effectively.
edit
It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third-party modules (even if hosted on CPAN).
another edit
Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.
yet another edit
I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.
The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.
About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:
my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split(/,/, $line);
    push @data, \@fields;
}
This will fail if any fields contain embedded commas. A more robust (but slower) approach would be to use Text::ParseWords. To do that, replace the split with this:
my #fields = Text::ParseWords::parse_line(',', 0, $line);
Here is a version that also respects quotes (e.g. foo,bar,"baz,quux",123 -> "foo", "bar", "baz,quux", "123").
sub csvsplit {
    my $line = shift;
    my $sep = (shift or ',');

    return () unless $line;

    my @cells;
    $line =~ s/\r?\n$//;

    my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;

    # NOTE: this handles simple quoted fields, but not doubled quotes ("")
    # inside a quoted field or fields containing embedded newlines.
    while ($line =~ /$re/g) {
        my $value = defined $1 ? $1 : $2;
        push @cells, (defined $value ? $value : '');
    }

    return @cells;
}
Use it like this:
while (my $line = <FILE>) {
    my @cells = csvsplit($line); # or csvsplit($line, $my_custom_separator)
}
As other people mentioned, the correct way to do this is with Text::CSV, and either the Text::CSV_XS back end (for FASTEST reading) or Text::CSV_PP back end (if you can't compile the XS module).
If you're allowed to get extra code locally (eg, your own personal modules) you could take Text::CSV_PP and put it somewhere locally, then access it via the use lib workaround:
use lib '/path/to/my/perllib';
use Text::CSV_PP;
Additionally, if there's no alternative to having the entire file read into memory and (I assume) stored in a scalar, you can still read it like a file handle, by opening a handle to the scalar:
my $data = stupid_required_interface_that_reads_the_entire_giant_file();
open my $text_handle, '<', \$data
or die "Failed to open the handle: $!";
And then read via the Text::CSV interface:
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
while (my $row = $csv->getline($text_handle)) {
...
}
or the sub-optimal split on commas:
while (my $line = <$text_handle>) {
    my @csv = split /,/, $line;
    ... # regular work as before.
}
With this method, the data is only copied a bit at a time out of the scalar.
You can do it in one pass if you read the file line by line. There is no need to read the whole thing into memory at once.
# (no error handling here!)
open FILE, $filename;
while (<FILE>) {
    my @csv = split /,/;
    # now parse the csv however you want.
}
Not really sure if this is significantly more efficient though, Perl is pretty fast at string processing.
You need to benchmark your import to see what is causing the slowdown. If, for example, you are doing a DB insertion that takes 85% of the time, this optimization won't help much.
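For the benchmarking itself, the core Benchmark module is enough. Here is a rough, untested sketch comparing a naive split against the csvsplit routine shown earlier; the somefile.csv name is an assumption, and csvsplit is assumed to be defined as above:
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $file = 'somefile.csv';              # hypothetical sample input

# Slurp the lines once so both candidates parse identical data.
open my $fh, '<', $file or die "Can't read '$file': $!";
my @lines = <$fh>;
close $fh;

cmpthese(-5, {                          # run each sub for at least 5 CPU seconds
    naive_split => sub {
        for my $line (@lines) {
            my @fields = split /,/, $line;
        }
    },
    csvsplit => sub {
        for my $line (@lines) {
            my @fields = csvsplit($line);   # the routine from the earlier answer
        }
    },
});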
Edit
Although this feels like code golf, the general algorithm is to read the whole file, or part of the file, into a buffer.
Iterate byte by byte through the buffer until you find a CSV delimiter or a newline.
When you find a delimiter, increment your column count.
When you find a newline increment your row count.
If you hit the end of your buffer, read more data from the file and repeat.
That's it. But reading a large file into memory is really not the best way, see my original answer for the normal way this is done.
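Here is a rough, untested sketch of that buffer-scanning idea, counting rows and fields only; the somefile.csv name and the 64KB chunk size are arbitrary choices:
use strict;
use warnings;

my $filename = 'somefile.csv';          # hypothetical input
open my $fh, '<', $filename or die "Can't read '$filename': $!";

my $rows        = 0;
my $cols_in_row = 1;
my $max_cols    = 0;

while (read($fh, my $buffer, 64 * 1024)) {   # pull in 64KB chunks
    for my $ch (split //, $buffer) {         # walk the buffer character by character
        if ($ch eq ',') {
            $cols_in_row++;                  # delimiter: one more column in this row
        }
        elsif ($ch eq "\n") {
            $rows++;                         # newline: one more row
            $max_cols    = $cols_in_row if $cols_in_row > $max_cols;
            $cols_in_row = 1;
        }
    }
}
close $fh;

print "$rows rows, widest row has $max_cols fields\n";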
Assuming that you have your CSV file loaded into the $csv variable and that you do not need the text in this variable after you have successfully parsed it:
my $result = [[]];
while ($csv =~ s/(.*?)([,\n]|$)//s) {
    push @{$result->[-1]}, $1;
    push @$result, [] if $2 eq "\n";
    last unless $2;
}
If you need to have $csv untouched:
local $_;
my $result = [[]];
foreach ($csv =~ /(?:(?<=[,\n])|^)(.*?)(?:,|(\n)|$)/gs) {
    next unless defined $_;
    if ($_ eq "\n") {
        push @$result, [];
    }
    else {
        push @{$result->[-1]}, $_;
    }
}
Answering within the constraints imposed by the question, you can still cut out the first split by slurping your input file into an array rather than a scalar:
open(my $fh, '<', $input_file_path) or die;
my @all_lines = <$fh>;

for my $line (@all_lines) {
    chomp $line;
    my @fields = split ',', $line;
    process_fields(@fields);
}
And even if you can't install (the pure-Perl version of) Text::CSV, you may be able to get away with pulling up its source code on CPAN and copy/pasting the code into your project...