How to increase the performance of a Perl script

I have two files here, newFile and LookupFile (both of which are huge).
The contents of newFile are searched for in LookupFile and further processing happens. This script works fine, however, it is taking a long time to execute. Could you please let me know what can be done here to increase the performance? Could you please let me know if we can convert the files into hashes to increase performance?
My files look like below
NewFile and LookupFile:
acl sourceipaddress subnet destinationipaddress subnet portnumber
.
.
Script:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp::Tiny 'read_file';
use File::Copy;
use Data::Dumper;
use File::Copy qw(copy);
my %options = (
LookupFile => {
type => "=s",
help => "File name",
variable => 'gitFile',
required => 1,
}, newFile => {
type => "=s",
help => "file containing the acl lines to checked for",
variable => 'newFile',
required => 1,
} );
$opts->addOptions(%options);
$opts->parse();
$opts->validate();
my $newFile = $opts->getOption('newFile');
my $LookupFile = $opts->getOption('LookupFile');
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>)
{
chomp($line);
my @columns = split(' ',$line);
$var = @columns;
my $fld1;
my $cnt;
my $fld2;
my $fld3;
my $fld4;
my $fld5;
my $dIP;
my $sIP;
my $sHOST;
my $dHOST;
if(....)
if (....) further checks and processing
)

The first thing to do before any optimization is to profile your code. Rather than guessing, this will tell you which lines take up the most time and how often they're called. Devel::NYTProf is a good tool for the job.
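For example, you might run it along these lines (the script name is a placeholder for your own):
perl -d:NYTProf yourscript.pl
nytprofhtml    # writes an HTML report into ./nytprof/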
This is a problem.
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
read_file reads the whole file in as one big string (it should be my $contents = read_file(...); using an array here is awkward). Then it splits the whole thing on newlines, copying everything in the file again. This is slow, hard on memory, and unnecessary.
Instead, use read_lines. This will split the file into lines as it reads, avoiding a costly copy.
my @lookups = read_lines($LookupFile);
my @new = read_lines($newFile);
Next problem is $newFile is opened again and iterated through line by line.
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>) {
This is a waste as you've already read that file into memory. Use one or the other. However, in general, it's better to work with files line-by-line than to slurp them all into memory.
The above will speed things up, but they don't get at the crux of the problem. This is likely the real problem...
The contents in newFile will be searched in LookupFile and further processing happens.
You didn't show what you're doing, but I'm going to imagine it looks something like this...
for my $line (@lines) {
for my $thing (@lookups) {
...
}
}
That is, for each line in one file, you're looking at every line in the other. This is what is known as an O(n^2) algorithm meaning that as you double the size of the files you quadruple the time.
If each file has 10 lines, it will take 100 (10^2) turns through the inner loop. If they have 100 lines, it will take 10,000 (100^2). With 1,000 lines it will take 1,000,000 (1,000^2).
With O(n^2) as the sizes get bigger things get very slow very quickly.
Could you please let me know if we can convert files into hash to increase performance?
You've got the right idea. You could convert the lookup file to a hash to speed things up. Let's say they're both lists of words.
# input
foo
bar
biff
up
down
# lookup
foo
bar
baz
And you want to check if any lines in input match any lines in lookup.
First you'd read lookup in and turn it into a hash. Then you'd read input and check if each line is in the hash.
use strict;
use warnings;
use autodie;
use v5.10;
...
# Populate `%lookup`
my %lookup;
{
open my $fh, '<', $lookupFile;
while(my $line = <$fh>) {
chomp $line;
$lookup{$line} = 1;
}
}
# Check if any lines are in %lookup
open my $fh, '<', $inputFile;
while(my $line = <$fh>) {
chomp $line;
print "$line\n" if $lookup{$line};
}
This way you only iterate through each file once. This is an O(n) algorithm, meaning it scales linearly, because hash lookups are basically instantaneous. If each file has 10 lines, it will only take 10 iterations of each loop. If they have 100 lines it will only take 100 iterations of each loop. 1000 lines, 1000 iterations.
Finally, what you really want to do is skip all this and create a database for your data and search that. SQLite is a SQL database that requires no server, just a file. Put your data in there and perform SQL queries on it using DBD::SQLite.
While this means you have to learn SQL, and there is a cost to building and maintaining the database, this is fast and most importantly very flexible. SQLite can do all sorts of searches quickly without you having to write a bunch of extra code. SQL databases are very common, so learning SQL is a good investment.
Since you're splitting the file up with my @columns = split(' ',$line); it's probably a file with many fields in it. That will likely map to a SQL table very well.
SQLite can even import files like that for you. See this answer for details on how to do that.
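For instance, creating and querying such a table with DBD::SQLite might look roughly like this (the database name, table layout and column names below are just an illustration, not taken from your data):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=acl.db", "", "", { RaiseError => 1 });
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS acl (
        source_ip   TEXT, source_subnet TEXT,
        dest_ip     TEXT, dest_subnet   TEXT,
        port        TEXT
    )
});

# ... load the rows once, then query as often as you like ...
my $sth = $dbh->prepare("SELECT * FROM acl WHERE source_ip = ? AND port = ?");
$sth->execute('10.1.2.3', '443');
while (my $row = $sth->fetchrow_hashref) {
    print "$row->{source_ip} $row->{port}\n";
}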

Related

PERL: Jumping to lines in a huge text file

I have a very large text file (~4 GB).
It has the following structure:
S=1
3 lines of metadata of block where S=1
a number of lines of data of this block
S=2
3 lines of metadata of block where S=2
a number of lines of data of this block
S=4
3 lines of metadata of block where S=4
a number of lines of data of this block
etc.
I am writing a Perl program that reads in another file;
for each line of that file (which must contain a number),
it searches the huge file for the S-value equal to that number minus 1,
and then analyzes the lines of data of the block belonging to that S-value.
The problem is, the text file is HUGE, so processing each line with a
foreach $line {...} loop
is very slow. As the S=value is strictly increasing, are there any methods to jump to a particular line of the required S-value?
are there any methods to jump to a particular line of the required S-value?
Yes, if the file does not change then create an index. This requires reading the file in its entirety once and noting the positions of all the S=# lines using tell. Store it in a DBM file with the key being the number and the value being the byte position in the file. Then you can use seek to jump to that point in the file and read from there.
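A minimal sketch of that idea, assuming the big file is called huge.txt and the index goes into a DBM file huge.idx:
use strict;
use warnings;
use DB_File;

tie my %index, 'DB_File', 'huge.idx' or die "Can't tie index file: $!";

open my $fh, '<', 'huge.txt' or die $!;
my $pos = tell $fh;
while (my $line = <$fh>) {
    $index{$1} = $pos if $line =~ /^S=(\d+)/;   # byte offset of each "S=#" line
    $pos = tell $fh;
}

# Later, jump straight to the block for a given S-value:
seek $fh, $index{42}, 0;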
But if you're going to do that, you're better off exporting the data into a proper database such as SQLite. Write a program to insert the data into the database and add normal SQL indexes. This will probably be simpler than writing your own index. Then you can query the data efficiently using normal SQL, and make complex queries. If the file changes you can either redo the export, or use normal SQL insert and update statements to update the database. And it will be easy for anyone who knows SQL to work with, as opposed to a bunch of custom indexing and search code.
I know the OP has already accepted an answer, but a method that's served me well is to slurp the file into an array, based on changing the "record separator" ($/).
If you do something like this (not tested, but this should be close):
$/ = "S=";
my @records = <$fh>;
print $records[4];
The output should be the entire fifth record (the array starts at 0, but your data starts at 1), starting with the record number (5) on a line by itself (you might need to strip that out later), followed by all the remaining lines in that record.
It's very simple and fast, although it is a memory pig...
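A slightly fuller, runnable version of the same idea (the file name here is an assumption):
use strict;
use warnings;

open my $fh, '<', 'huge.txt' or die $!;
local $/ = "S=";                  # records are now terminated by "S="
my @records = <$fh>;
# Every record except the last ends with the next "S="; strip that if it gets in the way:
s/S=\z// for @records;
print $records[4];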
If the blocks of text are of the same length (in bytes or characters) you can calculate the position of the needed S-value in the file and seek there, then read. Otherwise, in principle you need to read lines to find the S value.
However, if there are only a few S-values to find you can estimate the needed position and seek there, then read enough to capture an S-value. Then analyze what you read to see how far off you are, and either seek again or read lines with <> to get to the S-value.
use warnings;
use strict;
use feature 'say';
use Fcntl qw(:seek);
my ($file, $s_target) = @ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
$s_target //= 5;  # default: S=5
open my $fh, '<', $file or die $!;
my $est_text_len = 1024;
my $jump_by = $est_text_len * $s_target; # to seek forward in file
my ($buff, $found);
seek $fh, $jump_by, SEEK_CUR; # get in the vicinity
while (1) {
my $rd = read $fh, $buff, $est_text_len;
warn "error reading: $!" if not defined $rd;
last if $rd == 0;
while ($buff =~ /S=([0-9]+)/g) {
my $s_val = $1;
# Analyze $s_val and $buff:
# (1) if overshot $s_target adjust $jump_by and seek back
# (2) if in front of $s_target read with <> to get to it
# (3) if $s_target is in $buff extract needed text
if ($s_val == $s_target) {
say "--> Found S=$s_val at pos ", pos $buff, " in buffer";
seek $fh, - $est_text_len + pos($buff) + 1, SEEK_CUR;
while (<$fh>) {
last if /S=[0-9]+/; # next block
print $_;
}
$found = 1;
last;
}
}
last if $found;
}
Tested with your sample, enlarged and cleaned up (change the literal S=n inside the data lines, since it is the same as the pattern being matched!), with $est_text_len and $jump_by set to 100 and 20.
This is a sketch. A full implementation needs to negotiate over- and under-seeking as outlined in the comments in the code. If text-block sizes don't vary much it can get in front of the needed S-value in two seek-and-reads, and then read with <> or use a regex as in the example.
Some comments
The "analysis" sketched above need be done carefully. For one, a buffer may contain multiple S-value lines. Also, note that the code keeps reading if an S-value isn't in buffer.
Once you are close enough and in front of $s_target read lines by <> to get to it.
The read may not get as much as requested so you should really put that in a loop. There are recent posts with that.
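A sketch of such a loop, which keeps reading until $est_text_len bytes have arrived or EOF is hit:
my $buff = '';
while (length($buff) < $est_text_len) {
    my $rd = read $fh, $buff, $est_text_len - length($buff), length($buff);
    warn "error reading: $!" if not defined $rd;
    last if !$rd;    # EOF or error
}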
Change to sysread from read for efficiency. In that case use sysseek, and don't mix with <> (which is buffered).
The code above presumes one S-value to find; adjust for more. It absolutely assumes that S-values are sorted.
This is clearly far more complex than reading lines but it does run much faster, with a very large file and only a few S-values to find. If there are many values then this may not help.
The foreach (<$fh>), indicated in the question, would cause the whole file to be read first (to build the list for foreach to go through); use while (<$fh>) instead.
If the file doesn't change (or the same file needs to be searched many times) you can first process it once to build an index of exact positions of S-values. Thanks to Danny_ds for a comment.
Binary search of a sorted list is an O(log N) operation. Something like this using seek:
open my $fh, '<', $big_file or die $!;
my $target = 123_456_789;
my $low = 0;
my $high = -s $big_file;
while ($high - $low > 0.01 * -s $big_file) {
my $mid = int( ($low + $high) / 2 );
seek $fh, $mid, 0;
while (<$fh>) {
if (/^S=(\d+)/) {
if ($1 < $target) { $low = $mid; }
else { $high = $mid }
last;
}
}
}
seek $fh, $low, 0;
while (<$fh>) {
# now you are searching through the 1% of the file that contains
# your target S
}
Sort the numbers in the second file. Now you can proceed through the huge file in order, processing each S-value as needed.
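A sketch of that single-pass approach (the variable names here are placeholders):
my @wanted = sort { $a <=> $b } @numbers_from_second_file;   # S-values to find, ascending
my $next = shift @wanted;
open my $fh, '<', $big_file or die $!;
while (my $line = <$fh>) {
    if (defined $next and $line =~ /^S=(\d+)/ and $1 == $next) {
        # ... process this block's lines ...
        $next = shift @wanted;
    }
}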

Data::Dumper wraps second word's output

I'm experiencing a rather odd problem while using Data::Dumper to try and check on my importing of a large list of data into a hash.
My Data looks like this in another file.
##Product ID => Market for product
ABC => Euro
XYZ => USA
PQR => India
Then in my script, I'm trying to read in my list of data into a hash like so:
open(CONFIG_DAT_H, "<", $config_data);
while(my $line = <CONFIG_DAT_H>) {
if($line !~ /^\#/) {
chomp($line);
my @words = split(/\s*\=\>\s/, $line);
%product_names->{$words[0]} = $words[1];
}
}
close(CONFIG_DAT_H);
print Dumper (%product_names);
My parsing is working for the most part that I can find all of my data in the hash, but when I print it using the Data::Dumper it doesn't print it properly. This is my output.
$VAR1 = 'ABC';
';AR2 = 'Euro
$VAR3 = 'XYZ';
';AR4 = 'USA
$VAR5 = 'PQR';
';AR6 = 'India
Does anybody know why the Dumper is printing the '; characters over the first two letters on my second column of data?
There is one unclear thing in the code: is product_names a hash or a hashref?
If it is a hash, you should use $product_names{key} syntax, not %product_names->{key}, and you need to pass a reference to Data::Dumper, so Dumper(\%product_names).
If it is a hashref then it should be written with the correct sigil, so $product_names->{key} and Dumper($product_names).
As noted by mob, if your input has anything other than \n at the end of its lines it needs to be cleaned up more explicitly, say with s/\s*$// per the comment. See the answer by ikegami.
I'd also like to add that the loop can be simplified by losing the if branch:
open my $config_dat_h, "<", $config_data or die "Can't open $config_data: $!";
while (my $line = <$config_dat_h>)
{
next if $line =~ /^\#/; # or /^\s*\#/ to account for possible spaces
# ...
}
I have changed to the lexical filehandle, the recommended practice with many advantages. I have also added a check for open, which should always be in place.
Hmm... this appears wrong to me, even if you're using Perl 6:
%product_names->{$words[0]} = $words[1];
I don't know Perl 6 very well, but in Perl 5 the element access should look like the line below, assuming that %product_names exists and is declared:
$product_names{...} = ... ;
If you could post the full code, I can help solve this problem.
The file uses CR LF as line endings. This would become evident by adding the following to your code:
local $Data::Dumper::Useqq = 1;
You could convert the file to use unix line endings (seeing as you are on a unix system). This can be achieved using the dos2unix utility.
dos2unix config.dat
Alternatively, replace
chomp($line);
with the more flexible
$line =~ s/\s+\z//;
Note: %product_names->{$words[0]} makes no sense. It happens to do what you want in old versions of Perl, but it rightfully throws an error in newer versions. $product_names{$words[0]} is the proper syntax for accessing the value of an element of a hash.
Tip: You should be using print Dumper(\%product_names); instead of print Dumper(%product_names);.
Tip: You might also find local $Data::Dumper::Sortkeys = 1; useful. Data::Dumper has such bad defaults :(
Tip: Using split(/\s*=>\s*/, $line, 2) instead of split(/\s*=>\s*/, $line) would permit the value to contain =>.
Tip: You shouldn't use global variable without reason. Use open(my $CONFIG_DAT_H, ...) instead of open(CONFIG_DAT_H, ...), and replace other instances of CONFIG_DAT_H with $CONFIG_DAT_H.
Tip: Using next if $line =~ /^#/; would avoid a lot of indenting.
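Putting those tips together, the loop might look something like this (the config file name is an assumption; the rest just follows the advice above):
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Sortkeys = 1;
local $Data::Dumper::Useqq    = 1;

my $config_data = 'config.dat';   # assumed file name

my %product_names;
open(my $CONFIG_DAT_H, '<', $config_data) or die "Can't open $config_data: $!";
while (my $line = <$CONFIG_DAT_H>) {
    next if $line =~ /^#/;
    $line =~ s/\s+\z//;                              # handles both LF and CR LF line endings
    next unless length $line;
    my ($id, $market) = split(/\s*=>\s*/, $line, 2);
    $product_names{$id} = $market;
}
close($CONFIG_DAT_H);

print Dumper(\%product_names);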

Remove duplicate lines on file by substring - preserve order (PERL)

I'm trying to write a Perl script to deal with some 3+ GB text files that are structured like:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations:
Count the number of delimiters per line and compare it to a static number (i.e. 5); those lines that exceed said number should be output to file.control.
Remove duplicates from the file by substr($line, 0, 7) - the first 7 numbers - while preserving order. I want that output in file.output.
I have coded this as a simple shell script (just bash), but it took too long to process; the same script calling Perl one-liners was quicker, but I'm interested in a way to do this purely in Perl.
The code I have so far is:
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my $substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
And I want the file.output file to look like:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like (assuming the delimiter control number is 6):
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Posting edits: tried code
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I don't know if I'm supposed to post new code here, but I tried the above, based on your example. What baffles me (I'm still very new to Perl) is that it doesn't output to either filehandle, but when I redirected from the command line just as you said, it worked perfectly. The problem is that I need to output into 2 different files.
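A likely reason nothing reaches the files in the snippet above is that print fh1 $_; and print fh2; print to bareword filehandles literally named fh1 and fh2, not to the lexical handles $fh1 and $fh2 that were opened earlier. A minimal corrected sketch of those two statements:
print $fh1 $_;    # goes to outputcontrol.txt
print $fh2 $_;    # goes to outputoutput.txt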
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3 GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar.
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_.
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Looping over the lines of a file, splitting each line into columns and creating an array of each column

Sorry if my question is too obvious, I'm new to Perl.
My code is the following:
open (FILE1, "$ARG[0]") or die
@lines1;
$i=1;
while (<FILE>) {
chomp;
push (@lines1, $_);
my @{columns$1}= split (/\s+/, $lines1[$i]);
$i++;
}
It gives an error saying
Can't declare array dereference at the line my @{columns$1}= split (/\s+/, $lines1[$i]);
I wanted to create columns1, columns2, columns3... and each one of them would have the columns of the corresponding line (columns1 of the line 1, columns2 of line 2 and so on...)
Before this I tried to do it the way shown below, and every time it split a line it overwrote the @columns1 array, so only the last line was saved; at the end I had the values of the 10th line (because it starts counting at 0).
for my $i (0..9) {
@columns1 = split (/\s+/, $lines1[$i]);
}
To split a table file into its columns, you could do the following:
#!/usr/bin/perl
#ALWAYS put 'use warnings' and 'use strict' on the beginning of your code. It makes
#your life easier when debugging your code, and save you from having empty variables
#making weird things all over your code, and many other things.
#It is a good practice for "safe Perl coding".
use warnings;
use strict;
my ($file) = @ARGV;
open(my $in, '<', $file) or die "Can't open $file: $!"; #In your code you used an old filehandle format, FILE1.
#You should use the new format - $file1 as it allows you
#to use any scalar variable as a filehandle.
my @column1;
while(<$in>) {
chomp;
#Here comes the splitting:
my @table = split(/\s+/);
#if you want to print the first column:
print "$table[0]\n"; #remember that Perl starts to count from 0;
#if you know which columns you want to work with:
push(@column1, $table[0]);
}
Even though I am an adept of the do-first-and-learn-by-fixing-your-mistakes approach, you should really take some time to work through the basics of Perl, as @mpapec said. Learning the basics will save a lot of time and effort when dealing with problems like yours.

Nested while loop which does not seem to keep variables appropriately

I'm an amateur Perl coder, and I'm having a lot of trouble figuring out what is causing this particular issue. It seems as though it's a variable issue.
sub patch_check {
my $pline;
my $sline;
while (<SYSTEMINFO>) {
chomp($_);
$sline = $_;
while (<PATCHLIST>) {
chomp($_);
$pline = $_;
print "sline $sline pline $pline underscoreline $_ "; #troubleshooting
print "$sline - $pline\n";
if ($pline =~ /($sline)/) {
#print " - match $pline -\n";
}
} #end while
}
}
There is more code, but I don't think it is relevant. When I print $sline in the first loop it works fine, but not in the second loop. I tried making the variables global, but that did not work either.
The point of the subroutine is that I want to open one file (patches) and see whether its lines are in another (systeminfo). I also tried reading the files into arrays and doing foreach loops.
Does anyone have another solution?
It looks like your actual goal here is to find lines which are in both files, correct? The normal (and much more efficient! - it only requires you to read in each file once, rather than reading all of one file for each line in the other) way to do this in Perl would be to read the lines from one file into a hash, then use hash lookups on each line in the other file to check for matches.
Untested (but so simple it should work) code:
sub patch_check {
my %slines;
while (<SYSTEMINFO>) {
# Since we'll just be comparing one file's lines
# against the other file's lines, there's no real
# reason to chomp() them
$slines{$_}++;
}
# %slines now has all lines from SYSTEMINFO as its
# keys and the values are the number of times the
# line appears, in case that's interesting to you
while (<PATCHLIST>) {
print "match: $_" if exists $slines{$_};
}
}
Incidentally, if you're reading your data from SYSTEMINFO and PATCHLIST, then you're doing it the old-fashioned way. When you get a chance, read up on lexical filehandles and the three-argument form of open if you're not already familiar with them.
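For example, opening the same two files with lexical filehandles and three-argument open might look like this (the file names are just guesses):
open my $systeminfo_fh, '<', 'systeminfo.txt' or die "Can't open systeminfo.txt: $!";
open my $patchlist_fh,  '<', 'patchlist.txt'  or die "Can't open patchlist.txt: $!";

while (my $line = <$systeminfo_fh>) {
    # ... work with $line ...
}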
Your code is not entering the PATCHLIST while loop the 2nd time through the SYSTEMINFO while loop because you already read all the contents of PATCHLIST the first time through. You'd have to re-open the PATCHLIST filehandle to accomplish what you're trying to do.
That's a pretty inefficient way to see if the lines of one file match the lines of another file. Take a look at grep with the -f flag for another way.
grep -f PATCHFILE SYSTEMINFO
What I like to do in such cases is read one file and create hash keys from the values you are looking for, then read the second file and check whether those keys already exist. This way you have to read each file only once.
Here is example code, untested:
sub patch_check {
my %patches = ();
open(my $PatchList, '<', "patch.txt") or die $!;
open(my $SystemInfo, '<', "SystemInfo.txt") or die $!;
while ( my $PatchRow = <$PatchList> ) {
$patches{$PatchRow} = 0;
}
while ( my $SystemRow = <$SystemInfo> ) {
if ( exists $patches{$SystemRow} ) {
# The patch is in SystemInfo
# Do whatever you want
}
}
}
You cannot read one file inside the read loop of another; the inner filehandle is exhausted after the first pass. Slurp one file into an array first, then keep the read loop over the other file as the outer loop and iterate over the slurped lines in the inner loop.
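A sketch of that arrangement, reusing the filehandle names from the question:
my @patches = <PATCHLIST>;              # slurp the smaller file once
chomp @patches;

while (my $sline = <SYSTEMINFO>) {      # outer read loop
    chomp $sline;
    for my $pline (@patches) {          # inner loop over the slurped lines
        print "match: $pline\n" if $pline =~ /\Q$sline\E/;
    }
}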