Efficient way to make thousands of curl requests - perl

I am using curl to make thousands of requests. In my code I set the cookie to a specific value and then read a value from the page. Here is my Perl code:
#!/usr/bin/perl
my $site        = "http://SITENAME/?id=";
my $cookie_name = "cookienum123";
print "#\t\tValue\n";
for my $i (1 .. 10000) {
    my $output = `curl -s -H "Cookie: $cookie_name=$i" -L $site$i | grep -Eo "[0-9]+"`;
    print "$i\t\t$output\n";
}
So from 1 to 10000, I am setting cookienum123 to that value and reading in the whole response from the page. Then I use grep to extract just the number. The code I have now works fine, but I am wondering if there is a faster or more efficient way to do this.
Please note this does not have to be done as a Perl script (I can also use Windows batch file, Unix shell script, etc).
Edit Jan 18: Added bounty with the note "The desired answer should include a way in Perl to run through several thousand curl requests simultaneously but it needs to be run faster than the rate it is currently running at. It has to write the output to a single file in the end but the order does not matter." Some of the below comments mention fork but I am not sure how to apply it to my code. I am very new to Perl as this is my first program in it.

What you have here is an embarrassingly parallel problem. These are great for parallelising, because there's no inter-thread dependency or communication needed.
There are two key ways of doing this in Perl: threading or forking. I would generally suggest thread-based parallel processing for the kind of thing you're doing. This is a matter of choice, but I think it's better suited for collating information.
#!/usr/bin/perl
use strict;
use warnings;

use threads;
use Thread::Queue;

my $numthreads  = 20;
my $site        = "http://SITENAME/?id=";
my $cookie_name = "cookienum123";

my $fetch_q   = Thread::Queue->new();
my $collate_q = Thread::Queue->new();

# fetch sub sits in a loop, takes items off 'fetch_q' and runs curl.
sub fetch {
    while ( my $target = $fetch_q->dequeue() ) {
        my $output =
            `curl -s -H "Cookie: $cookie_name=$target" -L $site$target | grep -Eo "[0-9]+"`;
        $collate_q->enqueue($output);
    }
}

# one instance of collate, which exists to serialise the output from fetch.
# writing files concurrently can get very messy and build in race conditions.
sub collate {
    open( my $output_fh, ">", "results.txt" ) or die $!;
    print {$output_fh} "#\t\tValue\n";
    while ( my $result = $collate_q->dequeue() ) {
        print {$output_fh} $result;
    }
    close($output_fh);
}

## main bit:

# start worker threads
my @workers = map { threads->create( \&fetch ) } 1 .. $numthreads;

# collates results.
my $collater = threads->create( \&collate );

$fetch_q->enqueue( '1' .. '10000' );
$fetch_q->end();

foreach my $thr (@workers) {
    $thr->join();
}

# end collate_q here, because we know all the fetchers are
# joined - so no more results will be generated.
# queue will then generate 'undef' when it's empty, and the thread will exit.
$collate_q->end;

# join will block until thread has exited, e.g. all results in the queue
# have been 'processed'.
$collater->join;
This will spawn 20 worker threads that run in parallel, plus a collator that writes their results to a file as they come in. As an alternative, you could do something similar with Parallel::ForkManager, but for data-oriented tasks I personally prefer threading.
You can use the 'collate' sub to postprocess any data, such as sorting it, counting it, whatever.
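If you'd rather take the forking route mentioned above, a minimal (untested) sketch with Parallel::ForkManager might look like the following. Here the parent collects each child's result through the module's run_on_finish callback instead of a separate collate thread; treat the details as a starting point rather than drop-in code.
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $site        = "http://SITENAME/?id=";
my $cookie_name = "cookienum123";

open my $output_fh, '>', 'results.txt' or die $!;
print {$output_fh} "#\t\tValue\n";

my $pm = Parallel::ForkManager->new(20);

# Collect each child's result in the parent as the child exits.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $signal, $core, $data) = @_;
    print {$output_fh} "$ident\t\t$$data\n" if defined $data;
});

for my $i (1 .. 10000) {
    $pm->start($i) and next;    # parent continues the loop
    my $output = `curl -s -H "Cookie: $cookie_name=$i" -L $site$i | grep -Eo "[0-9]+"`;
    chomp $output;
    $pm->finish(0, \$output);   # child exits, passing its result back to the parent
}
$pm->wait_all_children;
close $output_fh;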
I would also point out that using curl and grep via system calls isn't ideal. I've left them as they are, but I'd suggest looking at LWP and letting Perl handle the text processing, because it's pretty good at it.
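For instance, the fetch worker above could be rewritten along these lines (a rough sketch, assuming the page body contains the number you are after; it also tags each result with its id, since the output order will not match the input order):
use LWP::UserAgent;

sub fetch {
    my $ua = LWP::UserAgent->new;
    while ( my $target = $fetch_q->dequeue() ) {
        my $response = $ua->get(
            "$site$target",
            'Cookie' => "$cookie_name=$target",
        );
        next unless $response->is_success;
        # Pull the first run of digits out of the body, as the grep did.
        my ($value) = $response->decoded_content =~ /([0-9]+)/;
        $collate_q->enqueue("$target\t\t$value\n") if defined $value;
    }
}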

I'm pretty sure the following will do what you want; however, slamming a server with 10000 simultaneous requests is not very polite. In fact, harvesting a site's data by walking the IDs of a given URL doesn't sound very friendly either. I have NOT tested the following, but it should get you 99% of the way there (there might be a syntax/usage error somewhere).
For more info, see:
https://metacpan.org/pod/distribution/Mojolicious/lib/Mojolicious/Guides/Cookbook.pod#Non-blocking
https://metacpan.org/pod/Mojo::UserAgent#build_tx
https://metacpan.org/pod/Mojo::DOM
Good luck!
#!/usr/bin/perl
use warnings;
use strict;

use Mojo::UserAgent;
use Mojo::IOLoop;

my $site        = 'http://SITENAME/?id=';
my $cookie_name = 'cookienum123';

# open filehandle and write file header
open my $output_fh, q{>}, 'results.txt'
    or die $!;
print {$output_fh} "#\t\tValue\n";

# Use Mojo::UserAgent for concurrent non-blocking requests
my $ua = Mojo::UserAgent->new;

# create your requests
for my $i (1 .. 10000) {

    # build transaction
    my $tx = $ua->build_tx(GET => "$site$i");

    # add cookie header
    $tx->req->cookies({ name => $cookie_name, value => $i });

    # start "GET" with callback to write to file
    $tx = $ua->start(
        $tx => sub {
            my ($ua, $mojo) = @_;
            print {$output_fh} $i . "\t\t" . $mojo->res->dom->to_string;
        }
    );
}

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

# close filehandle
close $output_fh;

Related

Perl CGI produces unexpected output

I have a Perl CGI script for an online concordance application that searches for instances of a word in a text and prints the sorted output.
#!/usr/bin/perl -wT
# middle.pl - a simple concordance

# require
use strict;
use diagnostics;
use CGI;

# ensure all fatals go to browser during debugging and set-up
# comment this BEGIN block out on production code for security
BEGIN {
    $| = 1;
    print "Content-type: text/html\n\n";
    use CGI::Carp('fatalsToBrowser');
}

# sanity check
my $q       = new CGI;
my $target  = $q->param("keyword");
my $radius  = $q->param("span");
my $ordinal = $q->param("ord");
my $width   = 2 * $radius;
my $file    = 'concordanceText.txt';
if ( !$file or !$target ) {
    print "Usage: $0 <file> <target>\n";
    exit;
}

# initialize
my $count = 0;
my @lines = ();
$/ = "";    # Paragraph read mode

# open the file, and process each line in it
open(FILE, "<", $file) or die("Can not open $file ($!).\n");
while (<FILE>) {

    # re-initialize
    my $extract = '';

    # normalize the data
    chomp;
    s/\n/ /g;          # Replace newlines with spaces
    s/\b--\b/ -- /g;   # Add spaces around dashes

    # process each item if the target is found
    while ( $_ =~ /\b($target)\b/gi ) {

        # find start position
        my $match = $1;
        my $pos   = pos;
        my $start = $pos - $radius - length($match);

        # extract the snippets
        if ($start < 0) {
            $extract = substr($_, 0, $width + $start + length($match));
            $extract = (" " x -$start) . $extract;
        }
        else {
            $extract = substr($_, $start, $width + length($match));
            my $deficit = $width + length($match) - length($extract);
            if ($deficit > 0) {
                $extract .= (" " x $deficit);
            }
        }

        # add the extracted text to the list of lines, and increment
        $lines[$count] = $extract;
        ++$count;
    }
}
sub removePunctuation {
    my $string = $_[0];
    $string = lc($string);       # Convert to lowercase
    $string =~ s/[^-a-z ]//g;    # Remove non-alphabetic characters
    $string =~ s/--+/ /g;        # Replace runs of 2+ hyphens with a space
    $string =~ s/-//g;           # Remove hyphens
    $string =~ s/\s+/ /g;        # Collapse whitespace
    return ($string);
}

sub onLeft {
    # USAGE: $word = onLeft($string, $radius, $ordinal);
    my $left = substr($_[0], 0, $_[1]);
    $left = removePunctuation($left);
    my @word = split(/\s+/, $left);
    return ($word[ -$_[2] ]);
}

sub byLeftWords {
    my $left_a = onLeft($a, $radius, $ordinal);
    my $left_b = onLeft($b, $radius, $ordinal);
    lc($left_a) cmp lc($left_b);
}

# process each line in the list of lines
print "Content-type: text/plain\n\n";
my $line_number = 0;
foreach my $x (sort byLeftWords @lines) {
    ++$line_number;
    printf "%5d", $line_number;
    print " $x\n\n";
}

# done
exit;
The Perl script produces the expected result in the terminal (command line), but the CGI version produces unexpected output. I cannot figure out what mistake I am making in the CGI script. It should ideally produce the same output as the command-line run. Any suggestion would be very helpful.
Command Line Output
CGI Output
The BEGIN block executes before anything else and thus before
my $q = new CGI;
The output goes to the server process' stdout and not to the HTTP stream, so the default is text/plain as you can see in the CGI output.
After you solve that problem you'll find that the output still looks like a big ugly block because you need to format and send a valid HTML page, not just a big block of text. You cannot just dump a bunch of text to the browser and expect it to do anything intelligent with it. You must create a complete HTML page with tags to layout your content, probably with CSS as well.
In other words, the output required will be completely different from the output when writing only to the terminal. How to structure it is up to you, and explaining how to do that is out of scope for Stack Overflow.
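That said, to give a flavour of what that means, the smallest step in that direction is usually to wrap the plain text in a pre block and escape it. A rough sketch only, reusing $q, @lines and byLeftWords from the question's code and assuming the duplicate-header problem above has already been fixed:
print $q->header('text/html');
print "<!DOCTYPE html>\n<html><head><title>Concordance</title></head><body>\n<pre>\n";
my $line_number = 0;
foreach my $x (sort byLeftWords @lines) {
    ++$line_number;
    # escapeHTML keeps any <, > or & in the text from being treated as markup
    printf "%5d %s\n\n", $line_number, $q->escapeHTML($x);
}
print "</pre>\n</body></html>\n";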
As the other answers state, the BEGIN block is executed at the very start of your program.
BEGIN {
    $| = 1;
    print "Content-type: text/html\n\n";
    use CGI::Carp('fatalsToBrowser');
}
There, you output an HTTP header Content-type: text/html\n\n. The browser sees that first, and treats all your output as HTML. But you only have text. Whitespace in an HTML page is collapsed into single spaces, so all your \n line breaks disappear.
Later, you print another header, but the browser cannot see it as a header any more, because you have already sent one and finished it off with two newlines (\n\n). It's now too late to switch back to text/plain.
It is perfectly fine to have a CGI program return text/plain and just have text without markup be displayed in a browser when all you want is text, and no colors or links or tables. For certain use cases this makes a lot of sense, even if it doesn't have the hyper in Hypertext any more. But you're not really doing that.
Your BEGIN block serves a purpose, but you are overdoing it. You're trying to make sure that when an error occurs, it gets nicely printed in the browser, so you don't need to deal with the server log while developing.
The CGI::Carp module and its fatalsToBrowser functionality bring their own mechanism for that. You don't have to do it yourself.
You can safely remove the BEGIN block and just put your use CGI::Carp at the top of the script with all the other use statements. They all get run first anyway, because use gets run at compile time, while the rest of your code gets run at run time.
If you want, you can keep the $| = 1, which turns off buffering for your STDOUT handle. Output gets flushed immediately: every time you print something, it goes directly to the browser instead of being collected until there is enough of it or a newline comes along. If your process runs for a long time, this makes it easier for the user to see that stuff is happening, which is also useful in production.
The top of your program should look like this now.
#!/usr/bin/perl -T
# middle.pl - a simple concordance
use strict;
use warnings;
use diagnostics;
use CGI;
use CGI::Carp('fatalsToBrowser');
$|=1;
my $q = CGI->new;
Finally, a few quick words on the other parts I deleted from there.
Your # require comment above the use statements is misleading. Those are use, not require. As I said above, use gets run at compile time; require, on the other hand, gets run at run time and can be done conditionally. Misleading comments will make it harder for others (or you) to maintain your code later on.
I removed the -w flag from the shebang (#!/usr/bin/perl) and put the use warnings pragma in. That's a more modern way to turn on warnings, because sometimes the shebang can be ignored.
The use diagnostics pragma gives you extra long explanations when things go wrong. That's useful, but also extra slow. You can use it during development, but please remove it for production.
The # sanity check comment should be moved down below the CGI instantiation, next to the check it actually describes.
Please use the arrow form, CGI->new, to instantiate CGI and any other class. The -> syntax takes care of inheritance properly, while the old indirect notation new CGI cannot do that.
I ran your CGI. The BEGIN block runs regardless, and you print a Content-type header there: you have explicitly asked for HTML. Then later you attempt to print another header, for plain text. This is why you can see that header text (which has not taken effect) at the beginning of the text in the browser window.

Reading files recursively in parallel in Perl

I have 500 files to read, but reading each file recursively takes approximately 2 minutes. I want to do this operation in parallel using Perl. How can I do that?
You're talking about a massive amount of reading if it takes two minutes. You're basically spending your time waiting for the hard drive. Are the files on separate hard drives? If not, why do you think that trying to read a second file at the same time will be faster? In fact, it might make things slower by increasing the amount of seeking the hard drive has to do.
But if you want to try it anyway,
use threads;
use Thread::Queue qw( );

use constant NUM_WORKERS => 4;    # Twiddle this

sub run {
    my ($qfn) = @_;
    ...read file $qfn here...
}

my $q = Thread::Queue->new();

my @threads;
for (1 .. NUM_WORKERS) {
    push @threads, async {
        while (my $job = $q->dequeue()) {
            run($job);
        }
    };
}

$q->enqueue($_) for @qfns;
$q->enqueue(undef) for @threads;

$_->join() for @threads;
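One thing the snippet leaves open is where @qfns comes from. Since the question talks about reading recursively, one way to build that list might be File::Find (a sketch; the start directory is a placeholder):
use File::Find;

my @qfns;
find(sub { push @qfns, $File::Find::name if -f }, '/path/to/data');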
Create a Perl script to process a single file. Then create a shell script, batch-run.sh, that contains 500 lines (lines like perl perl-script.pl file001). Then create another shell script that launches the required number of background processes to execute lines from batch-run.sh. You may want to limit the number of background processes, though. Something like this:
NCPUS=32 # number of parallel processes
ISCRIPT=batch-run.sh
NTASKS=$(wc -l $ISCRIPT | cut -d' ' -f1)

runbatch() {
    OFFSET=$1
    while [ $OFFSET -le $NTASKS ]; do
        CMD=$(sed "${OFFSET}q;d" $ISCRIPT)
        echo "$CMD ..."
        eval $CMD
        let OFFSET+=$NCPUS
    done
}

for i in $(seq 1 $NCPUS); do
    runbatch $i &
done
wait
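The batch-run.sh mentioned above can itself be generated rather than written by hand, for example with something along these lines (a sketch; the glob pattern is a placeholder for wherever your 500 files live):
perl -e 'print "perl perl-script.pl $_\n" for glob "data/*.txt"' > batch-run.sh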

Problems with joining threads

I've got an issue with a part of my Perl script that has been bothering me for days now. To summarize: the purpose is to read in a large file in chunks and do some operation on the input stream (not relevant to my question). When I first implemented it, I just looped over the file and did some stuff on each chunk, like this:
while (read FILE, $buffer, $chunksize) {
    callSomeOperation($buffer);
    # Do some other stuff
}
Unfortunately the file is really big and the operation is fairly complex, with many function calls, so this led to steadily increasing memory usage until Perl couldn't allocate memory any more and the script failed. So I did some investigation and tried several things to minimize the memory overhead (defining variables outside the loop, setting them to undef, and so on), which made the allocated memory grow more slowly, but in the end it still failed. (And if I understand correctly, Perl giving memory back to the OS is something that won't happen in practice.)
So I decided to wrap the function call and all its definitions in a thread, wait for it to finish, join it, and then create the thread again with the next chunk:
while (read FILE, $buffer, $chunksize) {
    my $thr = threads->create(\&thrWorker, $buffer);
    $thr->join();
}

sub thrWorker {
    # Do the stuff here!
}
This might have been a solution, if the thread would join! But it actually does not. If I run it with $thr->detach(); everything works fine, except that I get hundreds of threads running at the same time, which is not a good idea, and in this case I need to run them consecutively.
So I investigated this join issue and saw some reports that there might be a problem with Perl 5.16.1, so I updated to 5.16.2, but it still never joins. Somewhere in a mailing list I can't remember, I read that somebody managed to get threads to join with the CPAN module Thread::Queue, but that didn't work for me either.
So I gave up on threads and tried to fork instead. But with fork it seems like the total number of "forks" is limited? Anyway, it went fine until somewhere between the 13th and 20th iteration and then gave up with the message that it couldn't fork any more.
my $pid = fork();
if ( $pid == 0 ) {
    thrWorker($buffer);
    exit 0;
}
I also tried it with CPAN modules Parallel::ForkManager and Proc::Fork but that didn't help.
So now I'm somehow stuck and can't help myself out. Maybe somebody else can! Any suggestions greatly appreciated!
How can I get this thing to work with threads or child processes?
Or at least, how can I force Perl to free memory so I can do this in the same process?
Some additional information on my system:
OS: Windows 7 64bit / Ubuntu Server 12.10
Perl on Windows: Strawberry Perl 5.16.2 64bit
One of my first posts on Stackoverflow. Hope I did it right :-)
I recommend reading: this
I usually use Thread::Queue to manage a thread's input.
Sample code:
my @threads;
my $Q = new Thread::Queue;

# Start the threads
for (my $i = 0; $i < NUM_THREADS; $i++) {
    $threads[$i] = threads->new(\&insert_1_thread, $Q);
}

# Get the list of sites and put them in the work queue
foreach $row ( @{$ref} ) {
    $Q->enqueue( $row->[0] );
    # sleep 1 while $Q->pending > 100;
} # foreach $row

# Signal we are done
for (my $i = 0; $i < NUM_THREADS; $i++) {
    $Q->enqueue(undef);
}

$count = 0;
# Now wait for the threads to complete before going on to the next step
for (my $i = 0; $i < NUM_THREADS; $i++) {
    $count += $threads[$i]->join();
}
And for the worker thread:
sub insert_1_thread {
    my ($Q) = @_;
    my $tid = threads->tid;
    my $count = 0;

    Log("Started thread #$tid");
    while ( my $row = $Q->dequeue ) {
        PROCESS ME...
        $count++;
    } # while
    Log("Thread#$tid, done");
    return $count;
} # sub insert_1_thread
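Applied to the original chunked read, the same pattern would mean keeping the read in the main thread and pushing each chunk onto the queue, with a crude back-pressure check so the queue itself does not eat all your memory. A rough, untested sketch (FILE, $chunksize and callSomeOperation are from the question's code; NUM_THREADS is an assumed constant):
use threads;
use Thread::Queue;
use constant NUM_THREADS => 4;

my $Q = Thread::Queue->new;
my @threads = map { threads->create(\&thrWorker, $Q) } 1 .. NUM_THREADS;

while (read FILE, my $buffer, $chunksize) {
    sleep 1 while $Q->pending > 2 * NUM_THREADS;   # crude back-pressure, keeps the queue (and memory) bounded
    $Q->enqueue($buffer);
}
$Q->enqueue(undef) for @threads;    # one undef per worker so each dequeue loop ends
$_->join() for @threads;

sub thrWorker {
    my ($Q) = @_;
    while (defined(my $chunk = $Q->dequeue)) {
        callSomeOperation($chunk);
    }
}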
I don't know if it is a solution for you, but you could create an array of chunk objects and process them in parallel like this:
#!/usr/bin/perl

package Object; {
    use threads;
    use threads::shared;

    sub new {
        my $class = shift;
        share(my %this);
        return(bless(\%this, $class));
    }

    sub set {
        my ($this, $value) = @_;
        lock($this);
        # $this->{"data"} = shared_clone($value);
        $this->{"data"} = $value;
    }

    sub get {
        my $this = shift;
        return $this->{"data"};
    }
}

package main; {
    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my @objs;

    foreach (0 .. 2) {
        my $o = Object->new();
        $o->set($_);
        push @objs, $o;
    }

    threads->create(\&run, (\@objs))->join();

    sub run {
        my ($obj) = @_;
        $$obj[$_]->get() foreach (0 .. 2);
    }
}

Capture the output of Perl's 'system()'

I need to run a shell command with system() in Perl. For example,
system('ls')
The system call will print to STDOUT, but I want to capture the output into a variable so that I can do future processing with my Perl code.
That's what backticks are for. From perldoc perlfaq8:
Why can't I get the output of a command with system()?
You're confusing the purpose of system() and backticks (``). system()
runs a command and returns exit status information (as a 16 bit value:
the low 7 bits are the signal the process died from, if any, and the
high 8 bits are the actual exit value). Backticks (``) run a command
and return what it sent to STDOUT.
my $exit_status = system("mail-users");
my $output_string = `ls`;
See perldoc perlop for more details.
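If you need both the captured output and the exit status, you can still read $? after the backticks, just as you would after system() (a small example; the 2>&1 folds stderr into the captured text):
my $output = `ls -l /tmp 2>&1`;   # capture the command's stdout (and stderr)
my $status = $? >> 8;             # the command's exit code
print "exit=$status\n$output";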
IPC::Run is my favourite module for this kind of task. Very powerful and flexible, and also trivially simple for small cases.
use IPC::Run 'run';
run [ "command", "arguments", "here" ], ">", \my $stdout;
# Now $stdout contains output
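It also lets you keep stdout and stderr separate and check the exit status with barely more code (a sketch):
use IPC::Run 'run';

# run() returns true if every child exited with status 0
my $ok = run [ 'ls', '-l', '/tmp' ], '>', \my $stdout, '2>', \my $stderr;
warn "command failed, stderr was: $stderr" unless $ok;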
Simply use backticks, similar to the Bash example:
$variable=`some_command some args`;
That's all. Note that you will not see anything printed to STDOUT, because the output is redirected into the variable.
This example is unusable for a command that interacts with the user, except when you have prepared the answers in advance. For that, you can use something like this, piping shell commands together:
$variable=`cat answers.txt|some_command some args`;
Inside the answers.txt file you should prepare all answers for some_command to work properly.
I know this isn't the best programming practice :) But it is the simplest way to achieve the goal, especially for Bash programmers.
Of course, if the output is bigger (ls on a directory tree, say), you shouldn't grab all of it at once. Read from the command the same way you would read a regular file:
open CMD, '-|', 'your_command some args' or die $!;
my $line;
while (defined($line = <CMD>)) {
    print $line;   # Or push @table, $line, or do whatever you want with it, line by line
}
close CMD;
An additional, extended solution for processing long command output without an extra shell call:
my @CommandCall = qw(find / -type d);   # Some example single command
my $commandSTDOUT;                      # File handle
my $pid = open($commandSTDOUT, '-|');   # There will be an implicit fork!
if ($pid) {
    # parent side
    my $singleLine;
    while (defined($singleLine = <$commandSTDOUT>)) {
        chomp $singleLine;              # Typically we don't need the EOL
        do_some_processing_with($singleLine);
    }
    close $commandSTDOUT;               # At this point $? will be set for capture
    my $exitcode = $? >> 8;
    do_something_with_exit_code($exitcode);
}
else {
    # Child side; this is where the command really gets called
    open STDERR, '>&', \*STDOUT;        # Redirect stderr to stdout if needed. It only affects the child - remember the fork
    exec(@CommandCall);                 # At this point the child process is replaced by the external command with its parameters
    die "Cannot call @CommandCall";     # Error handling if the exec fails
}
If you use a procedure like that, you capture all of the command's output and you can process it line by line. Good luck :)
I wanted to run system() instead of backticks because I wanted to see the output of rsync --progress. However, I also wanted to capture the output in case something goes wrong depending on the return value. (This is for a backup script). This is what I am using now:
use File::Temp qw(tempfile);
use Term::ANSIColor qw(colored colorstrip);

sub mysystem {
    my $cmd = shift;    # "rsync -avz --progress -h $fullfile $copyfile";
    my ($fh, $filename) = tempfile();

    # http://stackoverflow.com/a/6872163/2923406
    # I want to have rsync progress output on the terminal AND capture it in case of error.
    # Need to use pipefail because 'tee' would be the last cmd otherwise and hence $? would be wrong.
    my @cmd = ("bash", "-c", "set -o pipefail && $cmd 2>&1 | tee $filename");
    my $ret = system(@cmd);

    my $outerr = join('', <$fh>);
    if ($ret != 0) {
        logit(colored("ERROR: Could not execute command: $cmd", "red"));
        logit(colored("ERROR: stdout+stderr = $outerr", "red"));
        logit(colored("ERROR: \$? = $?, \$! = $!", "red"));
    }
    close $fh;
    unlink($filename);
    return $ret;
}

# And logit() is something like:
sub logit {
    my $s = shift;
    my ($logsec, $logmin, $loghour, $logmday, $logmon, $logyear, $logwday, $logyday, $logisdst) = localtime(time);
    $logyear += 1900;
    my $logtimestamp = sprintf("%4d-%02d-%02d %02d:%02d:%02d", $logyear, $logmon + 1, $logmday, $loghour, $logmin, $logsec);
    my $msg = "$logtimestamp $s\n";
    print $msg;
    open LOG, ">>$LOGFILE";
    print LOG colorstrip($msg);
    close LOG;
}

Making an IRC bot - how can I let people !eval perl/javascript code?

I'm working on a bot in Perl (based on POE), and so far so good, but I can't figure out how to add a !js or !perl command to evaluate the respective code and return one line of output to be printed into the channel. I found App::EvalServer but I don't get how to use it.
Thanks for any help!
The App::EvalServer module comes with a binary to run as a standalone application. You do not put it in your program but rather run it on its own. It opens a port where you can hand it code as a JSON string. This does not sound like a good idea to me either.
There is another module you might want to look at called Safe. I suggest you read through its complete documentation, as well as the documentation for Opcode (linked in the doc), before you do anything with this. YOU CAN DO SERIOUS DAMAGE IF YOU EVALUATE ARBITRARY CODE! Never forget that.
UPDATE:
Here's an example of how to capture the output of print or say from your evaled code. You can use open with a scalar reference to make printed output go into a variable. If you switch back afterwards, you can work with the captured output in that variable. This is called an in-memory file.
use strict; use warnings;
use feature 'say';
use Safe;
# Put our STDOUT into a variable
my $printBuffer;
open(my $buffer, '>', \$printBuffer);
# Everything we say and print will go into $printBuffer until we change it back
my $stdout = select($buffer);
# Create a new Safe
my $compartment = new Safe;
$compartment->permit(qw(print)); # for testing
# This is where the external code comes in:
my $external_code = qq~print "Hello World!\n"~;
# Execute the code
my $ret = $compartment->reval($external_code, 1);
# Go back to STDOUT
select($stdout);
printf "The return value of the reval is: %d\n", $ret;
say "The reval's output is:";
say $printBuffer;
# Now you can do whatever you want with your output
$printBuffer =~ s/World/Earth/;
say "After I change it:";
say $printBuffer;
Disclaimer: Use this code at your own risk!
Update 2: After a lengthy discussion in chat, here's what we came up with. It implements a kind of timeout to stop the execution if the reval is taking too long, e.g. because of an infinite loop.
#!/usr/bin/perl
use warnings;
use strict;
use Safe;
use Benchmark qw(:hireswallclock);

my ($t0, $t1);    # Benchmark
my $timedOut  = 0;
my $userError = 0;
my $printBuffer;
open(my $buffer, '>', \$printBuffer);
my $stdout = select($buffer);

my $cpmt = new Safe;
$cpmt->permit_only(qw(:default :base_io sleep));

eval
{
    local $SIG{'ALRM'} = sub { $timedOut = 1; die "alarm\n" };
    $t0 = Benchmark->new;
    alarm 2;
    $cpmt->reval('print "bla\n"; die "In the user-code!";');
    # $cpmt->reval('print "bla\n"; sleep 50;');
    alarm 0;
    $t1 = Benchmark->new;
    if ($@)
    {
        $userError = "The user-code died! $@\n";
    }
};
select($stdout);

if ($timedOut)
{
    print "Timeout!\n";
    my $td = timediff($t1, $t0);
    print timestr($td), "\n";
    print $printBuffer;
}
else
{
    print "There was no timeout...\n";
    if ($userError)
    {
        print "There was an error with your code!\n";
        print $userError;
        print "But here's your output anyway:\n";
        print $printBuffer;
    }
    else
    {
        print $printBuffer;
    }
}
Take a look at Perl's eval(): you can pass it a string and it will evaluate it as if it were Perl code. Likewise, JavaScript has an eval() function that behaves similarly.
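On the Perl side that boils down to a string eval plus a check of $@ for compile or runtime errors. A minimal sketch (with a hard-coded snippet standing in for whatever text follows !perl):
my $code   = 'join " ", map { $_ * 2 } 1 .. 5';   # stand-in for the user's code
my $result = eval $code;                          # string eval; errors land in $@
if ($@) {
    print "error: $@";
}
else {
    print "result: $result\n";
}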
However, DO NOT EVALUATE ARBITRARY CODE in either Perl or JavaScript unless you can run it in a completely closed environment (and even then, it's still a bad idea). Lots of people spend lots of time preventing exactly this from happening. So that's how you'd do it, but you really don't want to do it at all.