Here is my script:
#!/usr/bin/perl -w
use warnings;
use strict;
no warnings 'uninitialized';
`rm /slot/ems12093/oracle/working/marchfound.txt`;
`touch /slot/ems12093/oracle/working/marchfound.txt`;
`rm /slot/ems12093/oracle/working/newcontact.txt`;
`touch /slot/ems12093/oracle/working/newcontact.txt`;
my ( $filename, $handle, @contact_list, $file_list, $k, @file_list2, $i, $e, $m, $fh, $f, $g,
     $file1, $data, $file_location, $arrSize, $namefile );

$file_location = '/slot/ems12093/oracle/working/marchfound.txt';
$filename      = '/slot/ems12093/oracle/working/contact.txt';

open( $handle, '<', $filename ) or die $!;
@contact_list = <$handle>;
close $handle;
chomp @contact_list;

chdir( '/scratch/mount_point/dnbfiles/oracle_cr/' );
$file_list = qx(ls|grep -i "2016_03_Mar_EA");
chomp( $file_list );

$k = "/scratch/mount_point/dnbfiles/oracle_cr/2016_03_Mar_EA";
chdir( $k );
@file_list2 = qx(ls|grep -i contact|grep -i full|grep -Ev "Glb");
chomp @file_list2;

foreach $file1 ( @file_list2 ) {
    foreach $i ( @contact_list ) {
        $e = "zgrep $i $file1";
        $f = qx($e);
        if ( $f ) {
            print "working\n";
            $g = "$f, $file1";
            open $data, '>>', $file_location or die $!;
            print $data "$g\n";
            close $data;
            @contact_list = grep { !/$i/ } @contact_list;
            $arrSize = @contact_list;
            print "$arrSize\n";
        }
    }
}

$m = "/slot/ems12093/oracle/working/";
chdir( $m );
chomp @contact_list;

$namefile = '/slot/ems12093/oracle/working/newcontact.txt';
open( $fh, '<', $namefile ) or die $!;
@contact_list = <$fh>;
close $fh;

print "done\n";
Here I am taking an input file, contact.txt, which has 370k records (for example mail addresses), and checking whether those records are present in the March month's zipped database 2016_03_Mar_EA.
The database in turn contains approx. 1.6 million records (name, designation, mail, etc.), so it's going to take a LOT of time to check and print all 355k * 1.6m combinations.
Please suggest if there is any way that I can improve my script to get a faster result.
Not purely speed-specific, but you should make the modifications below.
1) contact.txt has 370k records, therefore you should not slurp the whole file at once. So instead of doing
@contact_list = <$handle>;
You should read data line by line using
while (<$handle>) {
    # process one contact at a time
}
2) You are changing directories and executing shell commands to get the desired files. It'd be better to use File::Find::Rule. It's easier to use; see below:
my @files = File::Find::Rule->file()->name( '*.pm' )->in( @INC );
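Adapted to your layout it might look roughly like this (a sketch only; the month directory and the filter patterns are taken from the shell pipeline in your script, and maxdepth(1) is my assumption to mimic a plain ls):

use File::Find::Rule;

my $month_dir = '/scratch/mount_point/dnbfiles/oracle_cr/2016_03_Mar_EA';
my @files = File::Find::Rule->file()
                            ->maxdepth(1)                 # like plain ls, do not recurse
                            ->name( qr/contact/i )        # grep -i contact
                            ->in( $month_dir );
@files = grep { /full/i && !/Glb/ } @files;               # grep -i full | grep -Ev "Glb"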
The way you are doing this, I'd bet most of the time is spent uncompressing the database dump (which will happen 370k times). Uncompress it once, before doing the matches. (That assumes you have enough disk space.)
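A rough sketch of that idea, assuming the monthly dumps are gzip files ending in .gz (the variable names come from your script):

foreach my $file1 ( @file_list2 ) {
    ( my $plain = $file1 ) =~ s/\.gz\z//;
    system("gzip -dc '$file1' > '$plain'") == 0 or die "could not uncompress $file1: $?";
    foreach my $i ( @contact_list ) {
        my $f = qx(grep -F '$i' '$plain');        # plain grep, no per-contact unzip
        next unless $f;
        open my $data, '>>', $file_location or die $!;
        print $data "$f, $file1\n";
        close $data;
    }
    unlink $plain;                                # free the disk space again
}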
If you are not matching actual regexps, fgrep will save some (marginal) time (though I suspect this optimization is done internally by grep).
The advice on not slurping files is good for saving memory, and should not affect speed much for a single scan through the data. However, you are unnecessarily scanning the array multiple times in order to get rid of duplicate contacts:
@contact_list = grep { !/$i/ } @contact_list;
and that not only slows the whole shebang down, it also wastes memory, as @contact_list is being copied in memory.
You can read by line, keep track in a hash, and skip the loop body on duplicates:
next if exists $seen{$i};
$seen{$i}++
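Combining both points, the reading loop might look something like this (a sketch; the path is the one from your script):

my %seen;
my @contact_list;

open my $handle, '<', '/slot/ems12093/oracle/working/contact.txt' or die $!;
while ( my $i = <$handle> ) {
    chomp $i;
    next if exists $seen{$i};     # skip duplicates instead of re-grepping an array
    $seen{$i}++;
    push @contact_list, $i;       # or process the contact right here, one at a time
}
close $handle;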
I'm using Perl to open two text files, process them, and then write the output to other files.
I have a file, INPUT, where every line is a customer. I process each line into variables that are used to substitute text in another file, TEMP. The result should be written into an individual file for each customer, OUTPUT.
My program seems to produce output for only the first customer. The rest of the output files remain empty.
#!/usr/bin/perl -w
if ( $#ARGV < 0 ) {
    print "Usage: proj5.pl <mm/dd/yyyy>\n";
    exit;
}
my $date = $ARGV[0];
open(INFO, "p5Customer.txt") or die("Could not open p5Customer.txt file\n");
open(TEMP, "template.txt") or die("Could not open template.txt file\n");
my $directory = "Emails";
mkdir $directory unless(-e $directory);
foreach $info (<INFO>){
    ($email, $fullname, $title, $payed, $owed) = split /,/, $info;
    next if($owed < $payed);
    chomp($owed);
    $filepath = "$directory/$email";
    unless(open OUTPUT, '>>'.$filepath){
        die "Unable to create '$filepath'\n";
    }
    foreach $detail (<TEMP>){
        $detail =~ s/EMAIL/$email/g;
        $detail =~ s/(NAME|FULLNAME)/$fullname/g;
        $detail =~ s/TITLE/$title/g;
        $detail =~ s/AMOUNT/$owed/g;
        $detail =~ s{DATE}{$date}g;
        print OUTPUT $detail;
    }
    close(OUTPUT);
}
close(INFO);
close(TEMP);
As has been said, you need to open your template file again each time you read from it. There's a bunch of other issues with your code too
Always use strict and use warnings 'all' and declare every variable with my as close as possible to where it is first used
$#ARGV is the index of the last element of @ARGV, so $#ARGV < 0 is much better written as @ARGV < 1
You should use lexical file handles, and the three-parameter form of open, so open(INFO, "p5Customer.txt") should be open my $info_fh, '<', "p5Customer.txt"
You should use while instead of for to read from a file
It is easier to use the default variable $_ for short loops
It is pointless to capture a substring in a regular expression if you're not going to use it, so (NAME|FULLNAME) should be NAME|FULLNAME
There is no point in closing input files before the end of your program
It is also much better to use an existing template system, such as
Template::Toolkit
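With Template Toolkit the per-customer substitution could look roughly like this (the template file name and placeholder names are my assumptions, not from the question):

use Template;

# Hypothetical template.tt containing tags such as [% email %], [% fullname %],
# [% title %], [% amount %] and [% date %]
my $tt = Template->new() or die Template->error();

my %vars = (
    email    => $email,
    fullname => $fullname,
    title    => $title,
    amount   => $owed,
    date     => $date,
);

$tt->process( 'template.tt', \%vars, "$directory/$email" )
    or die $tt->error();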
This should work for you
#!/usr/bin/perl
use strict;
use warnings 'all';
if ( @ARGV < 1 ) {
    print "Usage: proj5.pl <mm/dd/yyyy>\n";
    exit;
}

my $date = $ARGV[0];

open my $info_fh, '<', 'p5Customer.txt' or die qq{Could not open "p5Customer.txt" file: $!};

my $directory = "Emails";
mkdir $directory unless -e $directory;

while ( <$info_fh> ) {

    chomp;
    my ($email, $fullname, $title, $payed, $owed) = split /,/;
    next if $owed < $payed;

    open my $template_fh, '<', 'template.txt' or die qq{Could not open "template.txt" file: $!};

    my $filepath = "$directory/$email";
    open my $out_fh, '>', $filepath or die qq{Unable to create "$filepath": $!};

    while ( <$template_fh> ) {
        s/EMAIL/$email/g;
        s/FULLNAME|NAME/$fullname/g;
        s/TITLE/$title/g;
        s/AMOUNT/$owed/g;
        s/DATE/$date/g;
        print $out_fh $_;
    }

    close($out_fh);
}
Your problem is that the TEMP loop is nested inside the INPUT loop, so the TEMP filehandle is read to its end while the INPUT loop is still on the first line of the INPUT file; on later iterations there is nothing left to read from it.
Best to store the TEMP file data in memory first (a hash table, or simply an array of lines) and work on that copy inside the INPUT loop, as sketched below.
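A minimal sketch of that, keeping the question's filehandles and leaving the rest of the loop as it is:

open(TEMP, "template.txt") or die("Could not open template.txt file\n");
my @template = <TEMP>;                 # read the template once, up front
close(TEMP);

foreach $info (<INFO>) {
    # ... split the line and open OUTPUT as before ...
    foreach my $line (@template) {
        my $detail = $line;            # work on a copy so substitutions don't persist
        $detail =~ s/EMAIL/$email/g;
        # ... the remaining substitutions, then print OUTPUT $detail ...
    }
    close(OUTPUT);
}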
Good luck.
My problem is how to make my script fast (I am working with big files).
I have the script below; it adds "bbb" between words if those words appear in another file that contains sequences of words.
For example, file2.txt: i eat big pizza .my big pizza ...
file1.txt (sequences):
eat big pizza
big pizza
The result (newfile.txt):
i eatbbbbigbbbpizza.my bigbbbpizza ...
my script:
use strict;
use warnings;
use autodie;

open Newfile, ">./newfile.txt" or die "Cannot create Newfile.txt";

my %replacement;
my ($f1, $f2) = ('file1.txt', 'file2.txt');

open(my $fh, $f1);
my @seq;
foreach (<$fh>) {
    chomp;
    s/^\s+|\s+$//g;
    push @seq, $_;
}
close $fh;

@seq = sort bylen @seq;

open($fh, $f2);
foreach (<$fh>) {
    foreach my $r (@seq) {
        my $t = $r;
        $t =~ s/\h+/bbb/g;
        s/$r/$t/g;
    }
    print Newfile;
}
close $fh;
close Newfile;
exit 0;

sub bylen {
    length($b) <=> length($a);
}
Instead of an array
my @seq;
define your words as a hash.
my %seq;
Instead of pushing the words
push @seq, $_;
store the words in the hash. Precalculate the replacement and move it out of the loop.
my $t = $_;
$t =~ s/\h+/bbb/g;
$seq{$_} = $t;
Precompute the list of words before the outer loop:
my @seq = keys %seq;
And use hash look-ups to find the replacement in the inner loop:
my $t = $seq{$r};
This might be a bit faster, but do not expect too much.
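Put together, the middle of the script might then look like this (a sketch that keeps the rest of the original program, including the Newfile handle and the bylen sub, unchanged):

my %seq;

open( my $fh, $f1 );
foreach (<$fh>) {
    chomp;
    s/^\s+|\s+$//g;
    ( my $t = $_ ) =~ s/\h+/bbb/g;    # precompute the replacement once per sequence
    $seq{$_} = $t;
}
close $fh;

my @seq = sort bylen keys %seq;       # longest sequences first, as before

open( $fh, $f2 );
foreach (<$fh>) {
    foreach my $r (@seq) {
        s/$r/$seq{$r}/g;              # hash look-up instead of rebuilding $t each time
    }
    print Newfile;
}
close $fh;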
In most cases it is better to reduce the problem by preparing the input in a way that makes the solution easier. For example, grep -f is much faster than your Perl loops. Use grep to find the lines which need a replacement, and do the replacement with Perl or sed.
Another way is to parallelize the job. You can divide your input into n parts and run n processes on n CPUs in parallel. See the GNU parallel tutorial.
What about a regexp like this (beware that this approach can cause security concerns)?
use strict;
use warnings;
open (my $Newfile, '>', 'newfile.txt') or die "Cannot create Newfile.txt: $!";

my ($f1, $f2) = qw(file1.txt file2.txt);

open (my $fh, $f1) or die "Can't open $f1 for reading: $!";
my @seq = map { split ' ', $_ } <$fh>;
close $fh;

# an improvement would be to use a hash to avoid duplicates
my $regexp = '(' . join('|', @seq) . ')';

open($fh, $f2) or die "Can't open $f2 for reading: $!";
foreach my $line (<$fh>) {
    $line =~ s/$regexp/$1bbb/g;
    print $Newfile $line;
}
close $fh;
close $Newfile;
exit 0;
I'm trying to select only the .log files in my directory and then search in those files for the word "unbound" and print the entire line into a new output file with the same name as the log file (number###.log) but with a .txt extension. This is what I have so far:
#!/usr/bin/perl
use strict;
use warnings;
my $path = $ARGV[0];
my $outpath = $ARGV[1];
my @files;
my $files;
opendir(DIR,$path) or die "$!";
@files = grep { /\.log$/ } readdir(DIR);
my @out;
my $out;
opendir(OUT,$outpath) or die "$!";
my $line;
foreach $files (@files) {
    open (FILE, "$files");
    my @line = <FILE>;
    my $regex = Unbound;
    open (OUT, ">>$out");
    print grep { $line =~ /$regex/ } <>;
}
close OUT;
close FILE;
closedir(DIR);
closedir (OUT);
I'm a beginner, and I don't really know how to create a new text file with the acquired output.
A few things I'd suggest to improve this code:
declare your loop iterators within the loop. foreach my $file ( @files ) {
use 3 arg open: open ( my $input_fh, "<", $filename );
use glob rather than opendir then grep. foreach my $file ( <$path/*.txt> ) {
grep is good for extracting things into arrays. Your grep reads the whole file to print it, which isn't necessary. Doesn't matter much if the file is short though.
perltidy is great for reformatting code.
you're opening 'OUT' on a directory path (I think?), which isn't going to work; you can't print output to a directory handle. You need to do something different to write to separate output files, and opendir isn't valid for output anyway.
because you're using opendir/readdir, you're actually getting bare filenames, not full paths. So you might be in the wrong place to actually open the files. Prepending the path name or doing a chdir are possible solutions. But that's one of the reasons I like glob: it returns the path as well.
So with that in mind - how about:
#!/usr/bin/perl
use strict;
use warnings;

use File::Basename;

#Extract paths
my $input_path  = $ARGV[0];
my $output_path = $ARGV[1];

#Error if paths are invalid.
unless ( defined $input_path
    and -d $input_path
    and defined $output_path
    and -d $output_path )
{
    die "Usage: $0 <input_path> <output_path>\n";
}

foreach my $filename (<$input_path/*.log>) {

    # extract the 'name' bit of the filename.
    # be slightly careful with this - it's based
    # on an assumption which isn't always true.
    # File::Spec is a more powerful way of accomplishing this.
    # but should grab 'number####' from /path/to/file/number####.log
    my $output_file = basename( $filename, '.log' );

    #open input and output filehandles.
    open( my $input_fh,  "<", $filename ) or die $!;
    open( my $output_fh, ">", "$output_path/$output_file.txt" ) or die $!;
    print "Processing $filename -> $output_path/$output_file.txt\n";

    #iterate input, extracting into $line
    while ( my $line = <$input_fh> ) {

        #check if $line matches your RE.
        if ( $line =~ m/Unbound/ ) {

            #write it to output.
            print {$output_fh} $line;
        }
    }

    #tidy up our filehandles. Although technically, they'll
    #close automatically because they leave scope
    close($output_fh);
    close($input_fh);
}
Here is a script that takes advantage of Path::Tiny. Now, at this stage of your learning process, you are probably better off understanding @Sobrique's solution, but using modules such as Path::Tiny or Path::Class will make it easier to write these one-off scripts quickly and correctly.
Also, I didn't really test this script, so watch out for bugs.
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;
run(\@ARGV);

sub run {
    my $argv = shift;

    unless (@$argv == 2) {
        die "Need source and destination paths\n";
    }

    my $it = path($argv->[0])->realpath->iterator({
        recurse         => 0,
        follow_symlinks => 0,
    });

    my $outdir = path($argv->[1])->realpath;

    while (my $path = $it->()) {
        next unless -f $path;
        next unless $path =~ /[.]log\z/;

        my $logfh = $path->openr;
        my $outfile = $outdir->child($path->basename('.log') . '.txt');
        my $outfh;

        while (my $line = <$logfh>) {
            next unless $line =~ /Unbound/;
            unless ($outfh) {
                $outfh = $outfile->openw;
            }
            print $outfh $line;
        }

        if ($outfh) {
            close $outfh
                or die "Cannot close output '$outfile': $!";
        }
    }
}
Notes
realpath will croak if the path provided does not exist.
Similarly for openr and openw.
I am reading input files line-by-line to keep the memory footprint of the program independent of the sizes of input files.
I do not open the output file until I know I have a match to print to.
When matching a file extension using a regular expression pattern, keep in mind that \n is a valid character in Unix file names, and the $ anchor will match it.
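A quick standalone demonstration of that last point:

# "foo.log\n" is a legal Unix file name; /$/ lets it match, /\z/ does not
print "matched with \$\n"  if "foo.log\n" =~ /[.]log$/;    # prints
print "matched with \\z\n" if "foo.log\n" =~ /[.]log\z/;   # prints nothing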
I am trying to split up a large file (around 17.6 million rows) into 6-7 smaller files based on a column value. Currently, I am using the SQL bcp utility to dump all the data into one table and then create separate files using bcp out.
But someone suggested I use Perl, as it would be faster and wouldn't need a table. As I am not a Perl guy, I am not sure how to do it in Perl.
Any help would be appreciated.
INPUT file :
inputfile.txt
0010|name|address|city|.........
0020|name|number|address|......
0030|phone no|state|street|...
output files:
0010.txt
0010|name|address|city|.........
0020.txt
0020|name|number|address|......
0030.txt
0030|phone no|state|street|...
It is simplest to keep a hash of output file handles, keyed by the file name. This program shows the idea. The number at the start of each record is used to create the name of the file where it belongs, and a file of that name is opened unless we already have a file handle for it.
All of the handles are closed once all of the data has been processed. Any errors are caught by use autodie, so explicit checking of the open, print and close calls is unnecessary.
use strict;
use warnings;
use autodie;
open my $in_fh, '<', 'inputfile.txt';
my %out_fh;
while (<$in_fh>) {
    next unless /^(\d+)/;
    my $filename = "$1.txt";
    open $out_fh{$filename}, '>', $filename unless $out_fh{$filename};
    print { $out_fh{$filename} } $_;
}

close $_ for values %out_fh;
Note that close caught me out here because, unlike most operators that work on $_ if you pass no parameters, a bare close will close the currently selected file handle. That is a bad choice IMO, but it's way too late to change it now.
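A standalone illustration of the gotcha (not part of the program above):

for (values %out_fh) {
    close;                       # wrong: closes the currently selected handle (normally STDOUT), not $_
}

close $_ for values %out_fh;     # what the program above actually does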
17.6 million rows is going to be a pretty large file, I'd imagine. It'll still be slow for Perl to process.
That said, you're going to want something like the below:
use strict;
use warnings;
my $input = 'FILENAMEHERE.txt';
my %results;

open(my $fh, '<', $input) or die "cannot open input file: $!";
while (<$fh>) {
    my ($key) = split /\|/, $_;          # escape the pipe, or split treats it as an empty alternation
    push @{ $results{$key} }, $_;        # autovivifies the array for a new key
}

for my $filename (keys %results) {
    open(my $out, '>', "$filename.txt") or die "Cannot open output file $filename.txt: $!";
    print $out @{ $results{$filename} }; # the stored lines still have their newlines
    close($out);
}
I haven't explicitly tested this, but it should get you going in the right direction.
$ perl -F'\|' -lane '
$key = $F[0];
$fh{$key} or open $fh{$key}, ">", "$key.txt" or die $!;
print { $fh{$key} } $_
' inputfile.txt
perl -Mautodie -ne'
sub out { $h{$_[0]} ||= open(my $f, ">", "$_[0].txt") && $f }
print { out($1) } $_ if /^(\d+)/;
' file
I need some Perl help in putting these two pieces of code to work together. I was able to get them working individually for testing, but I need help bringing them together, especially with the loop constructs. I'm not sure if I should go with foreach... anyway, the code is below.
Also, any best-practice advice would be great, as I'm learning this language. Thanks for your help.
Here's the process flow I am looking for:
read a directory
look for a particular file
use the file name to strip out some key information to create a newly processed file
process the input file
create the newly processed file for each input file read (if I read in 10, I create 10 new files)
Part 1:
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";

while (defined(my $file = readdir($dh))) {
    next if ($file =~ /^\.+$/);

    #Get filename attributes
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        print "$1\n";
        print "$2\n";
        print "$3\n";
    }

    print "$file\n";
}
Part 2:
use strict;
use Digest::MD5 qw(md5_hex);
#Create new file
open (NEWFILE, ">/backups/processed/foo$1.name.$2-foo_p$3.out") || die "cannot create file";
my $data = '';
my $line1 = <>;
chomp $line1;
my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ( "^A", "^E", "^D" );

while (<>) {
    my $digest = md5_hex($data);
    chomp;
    my (@values) = split /,/;
    my $extra = "__mykey__$sep1$digest$sep2";
    $extra .= "$heading[$_]$sep1$values[$_]$sep2" for (0..scalar(@values));
    $data .= "$extra$eorec";
    print NEWFILE "$data";
}
#print $data;
close (NEWFILE);
You are using an old style of Perl programming. I recommend using functions and CPAN modules (http://search.cpan.org). Perl pseudocode:
use Modern::Perl;
# use...
sub get_input_files {
    # return an array of files (@files)
}

sub extract_file_info {
    # takes the file name and returns an array of values (filename attrs)
}

sub process_file {
    # reads the input file, takes the previous attribs and builds the output file
}

my @ifiles = get_input_files;
foreach my $ifile (@ifiles) {
    my @attrs = extract_file_info($ifile);
    process_file($ifile, @attrs);
}
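For instance, one of the stubs could be filled in like this (a sketch that simply reuses the filename regex from Part 1 of the question):

sub extract_file_info {
    my ($file) = @_;
    return unless $file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/;
    return ( $1, $2, $3 );
}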
Hope it helps
I've bashed your two code fragments together (making the second a sub that the first calls for each matching file) and, if I understood your description of the objective correctly, this should do what you want. Comments on style and syntax are inline:
#!/usr/bin/env perl
# - Never forget these!
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
# Parens on postfix "if" are optional; I prefer to omit them
next if $file =~ /^\.+$/;
if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
process_file($file, $1, $2, $3);
}
print "$file\n";
}
sub process_file {
my ($orig_name, $foo_x, $name_x, $p_x) = @_;
my $new_name = "/backups/processed/foo$foo_x.name.$name_x-foo_p$p_x.out";
# - From your description of the task, it sounds like we actually want to
# read from the found file, not from <>, so opening it here to read
# - Better to use lexical ("my") filehandle and three-arg form of open
# - "or" has lower operator precedence than "||", so less chance of
# things being grouped in the wrong order (though either works here)
# - Including $! in the error will tell why the file open failed
open my $in_fh, '<', $orig_name or die "cannot read $orig_name: $!";
open(my $out_fh, '>', $new_name) or die "cannot create $new_name: $!";
my $data = '';
my $line1 = <$in_fh>;
chomp $line1;
my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ("^A", "^E", "^D");
while (<$in_fh>) {
chomp;
my $digest = md5_hex($data);
my (@values) = split /,/;
my $extra = "__mykey__$sep1$digest$sep2";
$extra .= "$heading[$_]$sep1$values[$_]$sep2"
for (0 .. scalar(@values));
# - Useless use of double quotes removed on next two lines
$data .= $extra . $eorec;
#print $out_fh $data;
}
# - Moved print to output file to here (where it will print the complete
# output all at once) rather than within the loop (where it will print
# all previous lines each time a new line is read in) to prevent
# duplicate output records. This could also be achieved by printing
# $extra inside the loop. Printing $data at the end will be slightly
# faster, but requires more memory; printing $extra within the loop and
# getting rid of $data entirely would require less memory, so that may
# be the better option if you find yourself needing to read huge input
# files.
print $out_fh $data;
# - $in_fh and $out_fh will be closed automatically when they go out of
#   scope at the end of the block/sub, so there's no real point to
#   explicitly closing them unless you're going to check whether the close
# succeeded or failed (which can happen in odd cases usually involving
# full or failing disks when writing; I'm not aware of any way that
# closing a file open for reading can fail, so that's just being left
# implicit)
close $out_fh or die "Failed to close file: $!";
}
Disclaimer: perl -c reports that this code is syntactically valid, but it is otherwise untested.