Parallel reading of input file with Parallel::Loops module - perl

I often come across a scenario where I need to parse a very large input file and then process the lines for final output. With many of these files it can take a while to process.
Since it's usually the same process, and I usually want to store the processed data in a hash for the final manipulation, it seems that something like Parallel::Loops might help speed the process up.
If I'm not thinking this through correctly, please let me know.
I've used Parallel::Loops before to process many files at a time with great results, but I can't figure out how to process many lines from one file as I don't know how to pass each line of the file in as a reference.
If I try to do this:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Parallel::Loops;
my $procs = 12;
my $pl = Parallel::Loops->new($procs);
my %data;
$pl->share(\%data);
my $input_file = shift;
open( my $in_fh, "<", $input_file ) || die "Can't open the file for reading: $!";
$pl->while( <$in_fh>, sub {
<some kind of munging and processing here>
});
I get the error:
Can't use string ("6334") as a subroutine ref while "strict refs" in use at /usr/local/share/perl/5.14.2/Parallel/Loops.pm line 518, <$in_fh> line 501.
I know that I need to pass a reference to the parallel object but I can't figure out how to make a reference to a readline element.
I also know that I can slurp the whole file in first and then pass an array reference of all of the lines, but for very large files that takes a lot of memory, and intuitively a lot more time as it technically needs to then read the file twice.
Is there a way to pass each line of a file into the Parallel::Loops object so that I can process many of the lines of a file at once?

I'm not in a position to test this as my laptop doesn't have Parallel::Loops installed and I have no consistent internet access.
However, from the documentation, the while method clearly takes two subroutine references as parameters, and you are passing <$in_fh> as the first. The method probably coerces its parameters to scalars using a prototype, which means you are passing a simple string where a subroutine reference is expected.
Because of my situation I am far from certain, but you may get a result from
$pl->while(
    sub {
        scalar <$in_fh>;
    },
    sub {
        # Process a line of data
    }
);
I hope this helps. I will investigate further when I get home on Friday.
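A fuller sketch of that idea, hedged like the rest of this answer: per the Parallel::Loops documentation, the condition sub of while runs in the parent before each child is forked, so a lexical set there is visible in the child. I have not been able to test this either, and the "processing" is a placeholder:

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Parallel::Loops;

my $pl = Parallel::Loops->new(12);

my %data;
$pl->share(\%data);

open my $in_fh, '<', shift or die "Can't open the file for reading: $!";

my $line;
$pl->while(
    # condition: runs in the parent, so $line is set before each fork
    sub { defined( $line = <$in_fh> ) },
    # body: runs in a forked child, which inherits the parent's $line
    sub {
        chomp( my $copy = $line );
        $data{$copy} = length $copy;   # placeholder "processing"
    },
);
```

Since forking per line is expensive, in practice you would probably want each child to grab a batch of lines rather than a single one.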

Related

How to pipe to and read from the same tempfile handle without race conditions?

Was debugging a Perl script for the first time in my life and came across this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while ( <$fh> ) {
    ...
}
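For a concrete, self-contained version of that pipe open (using perl itself, via $^X, as a stand-in for the real command; the output lines are invented):

```perl
use strict;
use warnings;

# read a command's output straight through a pipe -- no temp file at all
# ($^X, the currently running perl, stands in for "cmd" here)
open my $fh, '-|', $^X, '-e', 'print "item $_\n" for 1 .. 3'
    or die "Can't start command: $!";

my @got;
while ( my $line = <$fh> ) {
    chomp $line;
    push @got, $line;
}
close $fh or die "Command exited with an error: $?";

print "$_\n", for @got;
```

The list form of the pipe open also sidesteps the shell, so $blah can't be misinterpreted the way it could be inside backticks.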
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
my( $fh, $name ) = File::Temp->tempfile;
print $fh ...;
...;
} # file cleaned up
Some systems might automatically clean up temp files, although I haven't cared about that in years. Typically it was a batch thing (say, once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $fh, $name ) = File::Temp->tempfile(
    sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you want to collect output without ever exposing a filename. Opening it read-write means you can output some stuff then move around in that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;

my $fh = File::Temp->new;
$fh->autoflush;

system "ls /tmp/*.txt >> $fh" and die $!;
my @lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\@lines);

print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
my @lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\@lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH

Perl, find a match and read next line in perl

I would like to use
myscript.pl targetfolder/*
to read some number from ASCII files.
myscript.pl
@list = <@ARGV>;
# Is the whole file or only 1st line is loaded?
foreach $file ( @list ) {
    open (F, $file);
}
# is this correct to judge if there is still file to load?
while ( <F> ) {
    match_replace()
}
sub match_replace {
    # if I want to read the 5th line in downward, how to do that?
    # if I would like to read multi lines in multi array[row],
    # how to do that?
    if ( /^\sName\s+/ ) {
        $name = $1;
    }
}
I would recommend a thorough read of perlintro - it will give you a lot of the information you need. Additional comments:
Always use strict and warnings. The first will enforce some good coding practices (like for example declaring variables), the second will inform you about potential mistakes. For example, one warning produced by the code you showed would be readline() on unopened filehandle F, giving you the hint that F is not open at that point (more on that below).
@list = <@ARGV>;: This is a bit tricky, I wouldn't recommend it - you're essentially using glob, and expanding targetfolder/* is something your shell should be doing, and if you're on Windows, I'd recommend Win32::Autoglob instead of doing it manually.
foreach ... { open ... }: You're not doing anything with the files once you've opened them - the loop to read from the files needs to be inside the foreach.
"Is the whole file or only 1st line is loaded?" open doesn't read anything from the file, it just opens it and provides a filehandle (which you've named F) that you then need to read from.
I'd strongly recommend you use the more modern three-argument form of open and check it for errors, as well as use lexical filehandles since their scope is not global, as in open my $fh, '<', $file or die "$file: $!";.
"is this correct to judge if there is still file to load?" Yes, while (<$filehandle>) is a good way to read a file line-by-line, and the loop will end when everything has been read from the file. You may want to use the more explicit form while (my $line = <$filehandle>), so that your variable has a name, instead of the default $_ variable - it does make the code a bit more verbose, but if you're just starting out that may be a good thing.
match_replace(): You're not passing any parameters to the sub. Even though this code might still "work", it's passing the current line to the sub through the global $_ variable, which is not a good practice because it will be confusing and error-prone once the script starts getting longer.
if (/^\sName\s+/){$name = $1;}: Since you've named the sub match_replace, I'm guessing you want to do a search-and-replace operation. In Perl, that's called s/search/replacement/, and you can read about it in perlrequick and perlretut. As for the code you've shown, you're using $1, but you don't have any "capture groups" ((...)) in your regular expression - you can read about that in those two links as well.
"if I want to read the 5th line in downward , how to do that ?" As always in Perl, There Is More Than One Way To Do It (TIMTOWTDI). One way is with the range operator .. - you can skip the first through fourth lines by saying next if 1..4; at the beginning of the while loop, this will test those line numbers against the special $. variable that keeps track of the most recently read line number.
"and if I would like to read multi lines in multi array[row], how to do that ?" One way is to use push to add the current line to the end of an array. Since keeping the lines of a file in an array can use up more memory, especially with large files, I'd strongly recommend making sure you think through the algorithm you want to use here. You haven't explained why you would want to keep things in an array, so I can't be more specific here.
So, having said all that, here's how I might have written that code. I've added some debugging code using Data::Dumper - it's always helpful to see the data that your script is working with.
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
for my $file (@ARGV) {
    print Dumper($file); # debug
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        next if 1..4;
        chomp($line); # remove line ending
        match_replace($line);
    }
    close $fh;
}

sub match_replace {
    my ($line) = @_; # get argument(s) to sub
    my $name;
    if ( $line =~ /^\sName\s+(.*)$/ ) {
        $name = $1;
    }
    print Data::Dumper->Dump([$line,$name],['line','name']); # debug
    # ... do more here ...
}
The above code is explicitly looping over @ARGV and opening each file, and I did say above that more verbose code can be helpful in understanding what's going on. I just wanted to point out a nice feature of Perl, the "magic" <> operator (discussed in perlop under "I/O Operators"), which will automatically open the files in @ARGV and read lines from them. (There's just one small thing, if I want to use the $. variable and have it count the lines per file, I need to use the continue block I've shown below, this is explained in eof.) This would be a more "idiomatic" way of writing that first loop:
while (<>) { # reads line into $_
    next if 1..4;
    chomp; # automatically uses $_ variable
    match_replace($_);
} continue { close ARGV if eof } # needed for $. (and range operator)

How to use variable instead of a file handle

I have a big data file dump.all.lammpstrj which I need to split/categorize into a series of files, such as Z_1_filename, Z_2_filename, Z_3_filename etc. based on the coordinates in each record.
The coordinates are saved in a disordered way, so my program reads each line and determines which file this record should be sent to.
I use a variable, $filehandle = "Z_${i}_DUMP",
and I hope to open all of the possible files like this
for ( my $i = 1; $i <= 100; $i++ ) {
    $filehandle = "Z_${i}_DUMP";
    open $filehandle, '>', "Z_${i}_dump.all.lammpstrj.dat";
    ...
}
But when running my program, I get a message
Can't use string ("Z_90_DUMP") as a symbol ref while "strict refs" in use at ...
I don't want to scan all the data for each output file, because dump.all.lammpstrj is so big that a scan would take a long time.
Is there any way to use a defined variable as a file handle?
To give you an idea of how this might be done: put the file handles in a hash (or perhaps an array, if they are indexed by numbers).
use strict;
use warnings;
my %fh; # file handles
open $fh{$_}, '>', "Z_${_}_dump.all.lammpstrj.dat" for 1..100; # open 100 files

for (1..10000) { # write 10000 lines into 100 files
    my $random = int( 1 + rand(100) ); # pick a random file handle
    print {$fh{$random}} "something $_\n";
}

close $fh{$_} for 1..100;
Don't assign anything to $filehandle or set it to undef before you call open(). You get this error because you have assigned a string to $filehandle (which is of no use anyway).
Also see "open" in perldoc:
If FILEHANDLE is an undefined scalar variable (or array or hash element), a new filehandle is autovivified, meaning that the variable is assigned a reference to a newly allocated anonymous filehandle. Otherwise if FILEHANDLE is an expression, its value is the real filehandle. (This is considered a symbolic reference, so use strict "refs" should not be in effect.)
To have more file handles at a time and to conveniently map them to the file names consider using a hash with the file name (or whatever identifier suits you) as key to store them in. You can check if the key exists (see "exists") and the value is defined (see "defined") to avoid reopening the file unnecessarily.
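To sketch that last suggestion (a hash of handles keyed by filename, opened lazily; the helper name fh_for and the Z_*_DUMP.dat filenames are invented for illustration):

```perl
use strict;
use warnings;

my %fh_for;    # cache: filename => open filehandle

# return a cached handle for $name, opening the file on first use only
sub fh_for {
    my ($name) = @_;
    unless ( defined $fh_for{$name} ) {
        open $fh_for{$name}, '>>', $name or die "Can't open $name: $!";
    }
    return $fh_for{$name};
}

# records land in whichever file they belong to; each file opens at most once
print { fh_for("Z_${_}_DUMP.dat") } "record in zone $_\n" for 1, 2, 1;

close $_ for grep defined, values %fh_for;
```

With 100 output files this stays well under typical open-file limits; with thousands you would also need to close least-recently-used handles.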
I sincerely appreciate Kjetil S. and sticky bit. I tested their suggestions and they work well. I also noticed that there is another way to write data to different files without changing the file handle: I changed the file names while reusing the same handle.
....
for ( my $i = 0; $i <= $max_number; $i++ )
{
    $file = "$i\_foo.dat";
    open DAT, '>>', "$file";
    ......
}

Reading/dumping a perl hash from shell

I have a read-only Perl file with a huge hash defined in it. Is there any way for me to read this Perl file and dump out the hash contents?
This is the basic structure of the hash within the file:
%hash_name = {
    -files => [
        '<some_path>',
    ],
    -dirs => [
        '<some_path>',
        '<some_path>',
        '<some_path>',
        '<some_path>',
        '<some_path>',
    ],
};
Ideally you'd copy the file so that you can edit it, then turn it into a module so you can use it nicely.
But if for some reason this isn't feasible here are your options.
If that hash is the only thing in the file, "load" it using do† and assign to a hash
use warnings;
use strict;
my $file = './read_this.pl'; # the file has *only* that one hash
my %hash = do $file;
This form of do executes the file (runs it as a script), returning the last expression that is evaluated. With only the hash in the file that last expression is the hash definition, precisely what you need.
If the hash is undeclared, so a global variable (or declared with our), then declare as our a hash with the same name in your program and again load the file with do
our %hash_name; # same name as in the file
do $file; # file has "%hash_name = (...)" or "our %hash_name = (...)" (not "my %hash_name")
Here we "pick up" the hash that is evaluated as do runs the file, by virtue of our.
If the hash is "lexical", declared as my %hash (as it should be!) ... well, this is bad. Then you need to parse the text of the file to extract the lines with the hash. This is in general very hard to do, as it amounts to parsing Perl. (A hash can be built using map, returned from a sub as a reference or a flat list ...) Once that is done, you eval the variable which contains the text defining that hash.
However, if you know how the hash is built, as you imply, with no () anywhere inside
use warnings;
use strict;
my $file = './read_this.pl';
my $content = do { # "slurp" the file -- read it into a variable
    local $/;
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;
};

my ($hash_text) = $content =~ /\%hash_name\s*=\s*(\(.*?\))/s;

my %hash = eval $hash_text;
This simple shot leaves out a lot, assuming squarely that the hash is as shown. Also note that this form of eval carries real and serious security risks.
†
Files are also loaded using require. Apart from it doing a lot more than do, the important thing here is that even if it runs multiple times require still loads that file only once. This matters for modules in the first place, which shouldn't be loaded multiple times, and use indeed uses require.
On the other hand, do runs the file every time, which makes it suitable for loading files to be used as data, which presumably should be read every time. This is the recommended method. Note that require itself uses do to actually load the file.
Thanks to Schwern for a comment.
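To make the "do returns the last evaluated expression" behavior concrete, here is a small self-contained sketch; the data file is generated on the fly just for the demo, and its keys are invented:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# write a tiny "data" file whose last (only) expression is a hash in list form
my ($tmp_fh, $tmp_name) = tempfile( SUFFIX => '.pl', UNLINK => 1 );
print $tmp_fh "( answer => 42, items => 3 )\n";
close $tmp_fh;

# do runs the file and hands back that last expression -- and it re-reads
# the file on every call, unlike require
my %config = do $tmp_name;
print "answer = $config{answer}, items = $config{items}\n";
```

Because $tmp_name is an absolute path, do loads it directly instead of searching @INC.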

How to create a common array and be used between several perl scripts?

I have an application in which scripts will be run that need to be able access stored data. I want to run a script (main.pl) which will create an array. Later, if I run A.pl or B.pl, I want those scripts to be able access the previously created array and change values within it. What do I need to code in main.pl A.pl B.pl so I can achieve that?
Normally one Perl instance cannot access the variables of another instance. The question then becomes: "what can one do that is almost like sharing variables?"
One approach is to store data somewhere it can persist, such as in a database or a CSV file on disk. This means reading the data at the beginning of the program, and writing it or updating it, and naturally leads to questions about race conditions, locking, etc... and greatly expands the scope that any possible answer would need to cover.
Another approach is to write your programs to use CSV or YAML or some other format easily read and written by libraries from CPAN, and use STDIN and STDOUT for input and output. This allows decoupling of storage, and also chaining several tools together with a pipe from the shell prompt.
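As a minimal sketch of that second approach (the CSV layout and the doubling rule are invented for illustration; real data should go through Text::CSV):

```perl
use strict;
use warnings;

# hypothetical transformation: double the second CSV field of a record
sub transform_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line;     # naive CSV split, fine for the demo
    $fields[1] *= 2 if @fields > 1;
    return join ',', @fields;
}

# filter loop: read records on STDIN, emit updated ones on STDOUT,
# so tools chain from the shell:  main.pl | A.pl | B.pl
while ( my $line = <STDIN> ) {
    print transform_line($line), "\n";
}
```

Each script in the chain stays oblivious to where the data is stored, which is exactly the decoupling described above.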
For an in-memory solution that ties hashes to shared memory, you can check out IPC::Shareable:
http://metacpan.org/pod/IPC::Shareable
Perl memory structures can't be stored and then accessed later by other Perl scripts. However, you can write those memory structures out to a file, either by hand-rolling the serialization or by using one of a wide variety of Perl modules. Storable is a standard Perl module and has been around for quite a while.
Since all you're storing is an array, you could have one program write the array to a file, and then have the other program read the array back.
use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    ARRAY_FILE => "$ENV{HOME}/perl_arry.txt",
};

my @array;
[...] # Build the array

open my $output_fh, ">", ARRAY_FILE;
for my $item ( @array ) {
    say {$output_fh} $item;
}
close $output_fh;
Now, have your second program read in this array:
use strict;
use warnings;
use autodie;

use constant {
    ARRAY_FILE => "$ENV{HOME}/perl_arry.txt",
};

my @new_array;
open my $input_fh, "<", ARRAY_FILE;
while ( my $item = <$input_fh> ) {
    chomp $item;   # remove the newline added by say
    push @new_array, $item;
}
close $input_fh;
More complex data can be stored with Storable, but it's pretty much the same thing: you write the structure to a physical file and then reopen that file to pull your data back in.
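A minimal Storable round trip might look like this; writer and reader are shown in one file for brevity, and the file path and array contents are just examples:

```perl
use strict;
use warnings;
use File::Spec;
use Storable qw(store retrieve);

my $file = File::Spec->catfile( File::Spec->tmpdir, 'perl_array.stor' );

# writer side: serialize the array (nested structures work just as well)
my @array = ( 'alpha', 'beta', 'gamma' );
store \@array, $file;

# reader side (normally a separate script): get back an identical structure
my $loaded = retrieve $file;
print "$_\n" for @$loaded;

unlink $file;
```

Unlike the line-per-item text file above, Storable preserves nesting and data types, so the same two calls work unchanged if the array later grows references to hashes or other arrays.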