What is best module for parallel processing in Perl?

What is the best module for parallel processing in Perl? I have never done parallel processing in Perl before.
What is a good Perl module for parallel processing that will be used for DB access and mailing?
I have looked at the Parallel::ForkManager module. Any ideas appreciated.

Well, it all depends on the particular case. Depending on what exactly you need to achieve, you might be better off with one of the following:
Parallel::TaskManager
POE
AnyEvent
For some tasks there are also specialized modules, for example LWP::Parallel::UserAgent. In short, you will have to give much more detail about what you want to achieve to get the best possible answer.

Parallel::ForkManager, as the POD says, can limit the number of processes forked off. You could then use the children to do any work. I remember using the module for a downloader.
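For a concrete feel for it, here is a minimal sketch; the work inside the child is only a placeholder, while new/start/finish/wait_all_children are the module's actual calls:

use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(5);   # at most 5 children at a time

for my $job (1 .. 20) {                   # placeholder work items
    # In the parent, start() returns the child's PID and we skip ahead;
    # in the child it returns 0 and we fall through to do the work.
    $pm->start and next;

    # ... child work goes here: a DB query, sending one mail, etc. ...
    print "child $$ handling job $job\n";

    $pm->finish;                          # child exits here
}

$pm->wait_all_children;                   # parent waits for every child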

Many-core Engine for Perl (MCE) has been posted on CPAN.
http://code.google.com/p/many-core-engine-perl/
https://metacpan.org/module/MCE
MCE comes with various examples showing real-world use cases, from parallelizing something as small as cat (try with -n) to grepping for patterns and aggregating word counts; a minimal usage sketch follows the list.
barrier_sync.pl - A barrier sync demonstration.
cat.pl - Concatenation script, similar to the cat binary.
egrep.pl - Egrep script, similar to the egrep binary.
wc.pl - Word count script, similar to the wc binary.
findnull.pl - A parallel-driven script to report lines containing null fields. It is many times faster than the egrep binary. Try it against a large file containing very long lines.
flow_model.pl - Demonstrates MCE::Flow, MCE::Queue, and MCE->gather.
foreach.pl, forseq.pl, forchunk.pl - These take the same sqrt example from Parallel::Loops and measure the overhead of the engine. The number indicates the size of @input which can be submitted and results displayed in 1 second:
    Parallel::Loops: 600     (forking each @input is expensive)
    MCE foreach....: 34,000  (sends result after each @input)
    MCE forseq.....: 70,000  (loops through a sequence of numbers)
    MCE forchunk...: 480,000 (chunking reduces overhead)
interval.pl - Demonstration of the interval option appearing in MCE 1.5.
matmult/matmult_base.pl, matmult_mce.pl, strassen_mce.pl - Various matrix multiplication demonstrations benchmarking PDL, PDL + MCE, as well as a parallelized Strassen divide-and-conquer algorithm. Also included are two plain Perl examples.
scaling_pings.pl - Performs a ping test and reports failing IPs to standard output.
seq_demo.pl - A demonstration of the new sequence option appearing in MCE 1.3. Run with seq_demo.pl | sort.
tbray/wf_mce1.pl, wf_mce2.pl, wf_mce3.pl - An implementation of the Wide Finder benchmark utilizing MCE. As fast as MMAP IO when the file resides in the OS file-system cache; 2x ~ 3x faster when reading directly from disk.
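For a quick taste of the API itself, here is a minimal sketch using the MCE::Map front end (not one of the bundled examples above):

use strict;
use warnings;
use MCE::Map;

# mce_map behaves like Perl's built-in map, but spreads the work
# across worker processes; results come back in input order.
my @squares = mce_map { $_ * $_ } 1 .. 1000;

print scalar(@squares), " results; last one is $squares[-1]\n";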

You could also look at threads, Coro, Parallel::Iterator.

Related

Programmable arguments in perl pipes

I'm gradually working my way up the perl learning curve (with thanks to contributors to this REALLY helpful site), but am struggling with how to approach this particular issue.
I'm building a Perl utility which uses three third-party (C++) programs. Normally these are run as: A $file_list | B -args | C $file_out
where process A reads multiple files, process B modifies each individual file and process C collects all input files in the pipe and produces a single output file, with a null input file signifying the end of the input stream.
The input files are large(ish) at around 100 MB and around 10 in number. The processes are CPU intensive and the whole pipeline needs to be applied to thousands of groups of files each day, so the simple solution of reading and writing intermediate files to disk is simply too inefficient. In addition, the process above is only part of a processing sequence, where the input files are already in memory and the output file also needs to be in memory for further processing.
There are a number of solutions to this already well documented and I have a prototype version utilising IPC::Open3(). So far, so good. :)
However - when piping each file to process A through process B I need to modify the arguments in process B for each input file without interrupting the forward flow to process C. This is where I come unstuck and am looking for some suggestions.
As further background:
Running on Ubuntu 16.04 LTS (currently within VirtualBox) and Perl v5.22.1.
The program will run on (and within) a single machine by one user (me!), i.e. no external network communication, multi-user or public requirements, so simplicity of programming is preferred over strong security.
Since the process must run repeatedly without interruption, robust/reliable I/O handling is required.
I have access to the source code of each process, so that could be modified (although I'd prefer not to).
My apologies for the lack of "code to date", but I thought the question is more one of "How do I approach this?" rather than "How do I get my code to work?".
Any pointers or help would be very much appreciated.
You need a fourth program (call it D) that determines what the arguments to B should be and executes B with those arguments and with D's stdin and stdout connected to B's stdin and stdout. You can then replace B with D in your pipeline.
What language you use for D is up to you.
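As a rough sketch of that idea in Perl (how D learns the right arguments is up to you; the B_ARGS environment variable used here is purely illustrative):

#!/usr/bin/perl
# D: pick B's arguments, then become B. Because exec keeps D's
# stdin and stdout, the surrounding A | D | C pipeline is unaffected.
use strict;
use warnings;

# Illustrative only: take B's arguments from an environment variable
# set by the controlling Perl utility before the pipeline is built.
my @b_args = split ' ', ($ENV{B_ARGS} // '-default-args');

exec 'B', @b_args or die "cannot exec B: $!";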
If you're looking to feed output from different programs into the pipes, I'd suggest what you want to look at is ... well, pipe.
This lets you set up a pipe that works much like the ones you get from IPC::Open3, but gives you a bit more control over what you read from and write into it.
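A minimal sketch of that lower-level approach with pipe and fork (the program name B and its per-file argument are placeholders):

use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {                        # child: becomes B for this file
    close $reader;
    open STDOUT, '>&', $writer or die "cannot dup stdout: $!";
    exec 'B', '--arg-for-this-file' or die "cannot exec B: $!";
}

close $writer;                          # parent keeps only the read end
while (my $line = <$reader>) {
    # ... forward $line to process C, or post-process it here ...
}
waitpid $pid, 0;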

How to split long Perl code into several files without too much manual editing?

How do I split a long Perl script into two or more different files that can all access the same variables - without having to rename all shared variables from e.g. $count to $::count (or $main::count which is the same)?
In other words, what's the best and simplest way to split the Perl script into several files without having to import a lot of variables/functions and/or do a lot of manual editing?
I assume it has something to do with making the code part of the same package/scope/namespace, but my experiments so far have failed.
I am not sure it makes a difference, but the script is used for web/CGI purposes and will be running under mod_perl.
EDIT - Background:
I kind of knew I would get that response. The reason I want to split up the file is the following:
Currently I have a single very old and very long Perl file. I know it is not following Perl best practices but it works.
The problem is, I need to distribute the data files it uses between different web servers, first of all for performance reasons. There will be one "master" server and one or several "slaves".
About 20% of the mentioned Perl file contains shared functions, 40% is code that needs to run on the master server and 40% is code for the slave servers. Therefore, I would like to split the code into three files: 1. shared, 2. master-only, 3. slave-only. On the master server, 1 and 2 will be loaded; on the slaves, 1 and 3 will be loaded.
I assume this approach would use less process RAM and, more importantly, I would minimize the risk of not splitting the code correctly (e.g. a slave process calling a master data file). I don't see a great need for modularization, as the system works and the code does not need a lot of changes or exchanges with other projects.
EDIT 2 - Solution:
Found the solution I was looking for here:
http://www.perlmonks.org/?node_id=95813
In cases where the main package is in ownership of the variable, the
actual word 'main' can be omitted to yield something like: $::var
It is possible to get around having to fully qualify variable names
when strict is in use. Applying a simple use vars to your script, with
the variable names as its arguments, will get around explicit package
names.
Actually, I ended up repeating the our ($count, etc...) statement for the needed variables instead of use vars ();
Do let me know if I am missing something vital - apart from not going with modules! :)
@Axeman, Thanks, I will accept your answer, both for your effort and for sending me in the right direction.
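For anyone landing here later, a minimal sketch of the shape this solution ends up taking (file names are illustrative): the shared variables live in package main, each file declares them with our, and the pieces are pulled in with do.

# main.pl
use strict;
use warnings;

our $count;                            # a package variable in main::

do '/path/to/shared.pl' or die $@ || $!;
do '/path/to/master.pl' or die $@ || $!;   # slaves would load slave.pl instead

print "count is now $count\n";

# shared.pl
our $count;                            # the very same $main::count as above

$count = 0;

sub bump_count { $count++ }

1;   # keep the "do ... or die" check above happy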
Unless you put different package statements in their files, they will all be treated as if they had package main; at the top. So assuming that the scripts use package variables, you shouldn't have to do anything. If you have declared them with my (that is, if they are lexically scoped variables) then you would have to make sure that all references to the variables are in the same file.
But splitting scripts up for length is a rotten substitute for modularization. Yes, modularization helps keep code length down, but modularization is the proper way to keep code length down: for all the reasons you would want to keep code length down, modularization does it best.
If chopping the files by length could really work for you, then you could create a script like this:
do '/path/to/bin/part1.pl';
do '/path/to/bin/part2.pl';
do '/path/to/bin/part3.pl';
...
But I kind of suspect that if the organization of this code is as bad as you're (sort of) indicating, it might suffer from some of the same re-inventing of the wheel that I've seen in Perl-ignorant scripts. Just offhand (I might be wrong), I'm thinking you would be surprised how much could be chopped from the length by simply substituting better-tested Perl library idioms for all the for-looping and while-ing.

How expensive is: require "foo.pl";

I'm about to rewrite a large portion of a project that I have developed over the last 10 years while learning Perl. There is a lot of optimisation that can be gained.
A key part of the code is a large if/elsif block that requires xxx.cgi files depending on a POST value. E.g.:
if($FORM{'action'} eq "1"){require "1.cgi";}
elsif($FORM{'action'} eq "2"){require "2.cgi";}
elsif($FORM{'action'} eq "3"){require "3.cgi";}
elsif($FORM{'action'} eq "4"){require "4.cgi";}
It has many more irritations, but just how expensive is using "require" in Perl?
require itself has a relatively low cost in any case and, if you require the same file more than once within a single run of your program, it will detect that the file has already been loaded and not attempt to load it a second time. However, if you have a long and highly-populated search path (@INC) and you require (or use) a lot of files, it's possible that all of the directory searches could add up; this isn't common (and doesn't sound likely in your case), but it can be improved by reorganizing your module directories so that the things you're loading show up earlier in @INC.
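You can see that caching for yourself: after the first load the module's path is recorded in %INC, and a second require of the same file is little more than a hash lookup.

use strict;
use warnings;

require POSIX;                               # first time: search @INC, compile
print "POSIX.pm loaded from $INC{'POSIX.pm'}\n";

require POSIX;                               # second time: already in %INC, no reload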
The potentially major performance hit referred to by earlier answers is the cost of compiling the code in the files you require. Getting rid of the require by moving the code into your main program will not help with this, as the code will still need to be compiled. In your case, it would probably make things worse, as it would cause the code for all actions to be compiled on every request rather than only the code for the one action selected by the user.
As has been said, it really depends on the actual code in those files. Your best bet would be to do tests using Devel::NYTProf and/or Benchmark to see where the most time is being spent in your code if you are unhappy with its performance.
You can also read Profiling Perl on perl.com, but it is a bit outdated as it uses Devel::DProf.
Not an answer to your primary question, but still a good idea for a code refactor I read recently on Ovid's blog.
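The blog itself isn't quoted here, but a common refactor for a chain like the one above is a dispatch table keyed on the action value, for example:

use strict;
use warnings;

our %FORM;                      # assumed to be filled in by the form parser

# Map each action value to the file that handles it.
my %handler = (
    1 => '1.cgi',
    2 => '2.cgi',
    3 => '3.cgi',
    4 => '4.cgi',
);

my $action = $FORM{'action'} // '';
if (my $file = $handler{$action}) {
    require $file;
} else {
    die "unknown action '$action'\n";
}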
The first time, possibly expensive; Perl has to search a path to find the file and load it up. Subsequent times, it's cheap -- a table is consulted and the file isn't actually loaded a second time. If this is in a CGI that is run once per request and then exited, then this is not too good.
It's really going to depend on the size of the files you're pulling in. If you have massive CGI files, then it might hurt the performance of your software. If we're talking 6 or 7 lines of code each, then it's no issue. Try benchmarking your program's performance with and without, and make your own judgement.
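If you do want numbers, the core Benchmark module makes the comparison easy; the two subs below are placeholders for whatever two variants you are weighing up:

use strict;
use warnings;
use Benchmark qw(cmpthese);

sub variant_a { }                 # e.g. the current require-based dispatch
sub variant_b { }                 # e.g. the same code pulled inline

# Run each variant for roughly 3 CPU seconds and print a comparison table.
cmpthese(-3, {
    variant_a => \&variant_a,
    variant_b => \&variant_b,
});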

How can I do perl CGI performance measurements, benchmarks, time measurements at various stages of execution?

I would like to know techniques (coding, libraries, configurations) for measuring the duration of execution of CGI Perl code at various stages:
starting up the Perl interpreter
beginning running the Perl code
loading in local Perl .pm modules for routines
completed running the code
I'm particularly interested in 3) and 4). I don't believe there is much I can do about 1) or 2), as I wouldn't want to try to optimise the Perl interpreter; the only thing I can do here is upgrade the hardware to a faster machine and/or use mod_perl instead of classic CGI.
With 3) loading the local Perl modules I would like to measure how long it takes but I'm not sure how to code this as I don't know (or am not sure) how to get code to execute before loading these modules. If I did know, then I would record the time before they load, then record the time after they have loaded and calculate the difference.
4) should be the easiest to obtain as I would record the time (in a variable) at start of execution and then at end.
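For what it's worth, a minimal sketch of 4) with the core Time::HiRes module:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $start = [gettimeofday];

# ... the real work of the CGI script goes here ...

my $elapsed = tv_interval($start);           # seconds, with sub-second precision
warn "request took ${elapsed}s\n";           # ends up in the web server error log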
I've done a search at stackoverflow.com and found:
How can I speed up my Perl program? - which is what I expect to be using at some point. BUT I need to prove the reduced time (i.e. the speed improvement), so I need to be able to measure it in the first place. The tool http://search.cpan.org/dist/Devel-NYTProf looks useful for profiling my source code, but I'm not sure it covers 3), the loading of modules.
Does Perl language aim at producing fast programs at runtime? - more of a verbose discussion than succinct answers, but a good read for later when there is time.
Google search results included:
http://www.testingreflections.com/node/view/3622 - not enough information here
You can reduce 1) with FastCGI. It will reduce stages 2) and 3) too.
For measuring 3) you can use BEGIN blocks. Example:
use Benchmark ':hireswallclock';
my ($t0,$t1);
BEGIN {$t0 = Benchmark->new;}
use DBIx::Class;
BEGIN {$t1 = Benchmark->new;}
print "the loading took:",timestr(timediff($t1, $t0)),"\n";
Devel::NYTProf will help you with 4). Also there are specific modules like Template::Timer, CGI::Application::Plugin::DevPopup::Timing and DBIx::Class::Storage::Statistics.
In addition to FastCGI, there is also mod_perl and, more importantly, PSGI. With PSGI you can decouple your app from the concrete webserver backend.

Is there any current review of statistical modules for Perl?

I would like to know the current status of the statistical modules on CPAN. Does anyone know of a recent review, or could you comment on your likes/dislikes with those modules?
I have used the classical ones: Statistics::Descriptive, Statistics::Distributions, and some others contained in Bundle::Math::Statistics.
Some of the modules have not been updated for a long time. I don't know if this is because they are rock solid or because they have been overtaken by better modules.
Does someone know any current review similar to this old one:
Using Perl for Statistics: Data Processing and Statistical Computing
NB (for the people who will suggest using R ;-)):
All my code is mainly in Perl, but I use R a lot for statistics and plotting. I usually prepare the data frames with Perl, write the R scripts as templates inside the Perl modules, save them to a file and execute them from Perl. But sometimes you have small data sets where efficiency is not an issue (well, I am using Perl, aren't I ;-)) and you just want to add some statistics and histograms to a report produced with Perl.
PDL, the Perl Data Language, is alive and thriving, so it's worth taking a look at that.
And I think the other stats modules you mention are OK. For example, Statistics::Descriptive is up to date and has been used in answers to a few questions here on Stack Overflow.
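For reference, a minimal Statistics::Descriptive sketch (the data points are arbitrary):

use strict;
use warnings;
use Statistics::Descriptive;

my $stat = Statistics::Descriptive::Full->new;
$stat->add_data(2, 4, 4, 4, 5, 5, 7, 9);     # arbitrary sample

printf "mean:   %.2f\n", $stat->mean;
printf "stddev: %.2f\n", $stat->standard_deviation;
printf "median: %.2f\n", $stat->median;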
NB. There is also a Perl to R bridge called Statistics::R which looks interesting.
/I3az/