Perl: An asynchronous http proxy via mojolicious - perl

I made a simple http proxy, it's work fine, but not fast, because in the function handle_request, I use
my $tx = $ua->start( Mojo::Transaction::HTTP->new(req=>$request) );
to do the request, it's blocking.
I try to use a callback like:
$ua->start( Mojo::Transaction::HTTP->new(req=>$request) )=>sub{ ... }
to make it's non-blocking, and then, got a mistake:
'error' => { 'message' => 'Premature connection close'}
I guess that's because function handle_request return immediately, it does not wait the callback to be finished. If I use semaphore to wait the callback, that's mean it's blocking again.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Mojo::IOLoop::Server;
use Mojo::UserAgent;
use Mojo::Message::Response;
use Mojo::Message::Request;
use Mojo::Transaction::HTTP;
use Data::Dumper;
binmode STDOUT, ":encoding(UTF-8)";
my %buffer;
Mojo::IOLoop->server( {port => 3128} => sub {
my ($loop, $stream, $client) = #_;
$stream->on(
read => sub {
my ($stream, $chunk) = #_;
my $buffer = $buffer{$client}{read_buffer} .= $chunk;
if ($buffer =~ /^GET\s+|POST\s+|HEAD\s+(.*)\r\n\r\n$/i) {
$buffer{$client}{read_buffer} = '';
&handle_request($client,$stream,$buffer);
}
elsif ($buffer =~ /^CONNECT\s+(.*)\r\n\r\n$/i) {
$buffer{$client}{read_buffer} = '';
&handle_connect($stream,$buffer);
}
elsif($buffer{$client}{connection})
{
$buffer{$client}{read_buffer} = '';
Mojo::IOLoop->stream($buffer{$client}{connection})->write($chunk);
}
if(length($buffer)>= 20 *1024 * 1024) {
delete $buffer{$client};
Mojo::IOLoop->remove($client);
return;
}
});
});
sub handle_request{
my($client,$stream,$chunk) = #_;
my $request = Mojo::Message::Request->new;
$request = $request->parse($chunk);
my $ua = Mojo::UserAgent->new;
my $tx = $ua->start( Mojo::Transaction::HTTP->new(req=>$request) );
$stream->write( $tx->res->to_string );
}
sub handle_connect{
my ($stream, $chunk) = #_;
my $request = Mojo::Message::Request->new;
my $ua = Mojo::UserAgent->new;
$request = $request->parse($chunk);
print Dumper($request);
}
Mojo::IOLoop->start;
Hope to get some suggestions .

You have 2 problem:
You try to call nonblocking variant of $ua->start when your code have blocking style. Function handle_request must have callback as parameter.
If you have chain of callback then the best way to implement it is to use Mojo::IOLoop::Delay.
When you create variable $ua in non-blocking style in sub handle_request then your variable is destoyed by garbage collector because first execute exit of sub handle_request and $ua destroyed, because it is local variable and then get answer from $ua. So you get Premature connection close. You need to save instance of $ua elsewhere to prevent such error.
Upd.
I write bad variant of http/https proxy which work only via CONNECT method and have bug with not full first http message.
Upd.
I add another example of http/https proxy which correctly read first http message and work not only via CONNECT method.
Upd.
Oh, author of the Mojo wrote example of https proxy

Related

Skip if a request takes too much time

I have the following code to request a header from an URL:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP;
use Data::Dumper;
my $request = HTTP::Request -> new ( HEAD => 'http://www.vliruos.be/media/6352100/nss2015_annex_3_budget.xlsx' );
my $agent = LWP::UserAgent -> new;
my $response = $agent -> request ( $request );
print $response -> header ( 'Content-Length');
...
I don't know the reason, but the request seems very slow, it takes more than 10 seconds for me. I just want to implement a rule: if it does not return anything in 10 seconds, it should give up and resume the commands after the print.
Does anyone know how to implement this?
You could use SIGALRM.
$SIG{ALRM} = sub { die "timeout" };
eval {
alarm(10);
# long-time operations here
alarm(0);
};
if ($#) {
if ($# =~ /timeout/) {
# timed out; do what you will here
} else {
alarm(0); # clear the still-pending alarm
die; # propagate unexpected exception
}
}

Perl print the redirected url

I want to print the redirected url in perl.
Input url : http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv
output url : http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia
use LWP::UserAgent qw();
use CGI qw(:all);
print header();
my ($url) = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new;
my $req = new HTTP::Request(GET => $url);
my $res = $ua->request($req);
print $res->request;
How to get this done in perl?
You need to examine the HTTP response to find the URL. The documentation of HTTP::Response gives full details of how to do this, but to summarise, you should do the following:
use strict;
use warnings;
use feature ':5.10'; # enables "say"
use LWP::UserAgent;
my $url = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new;
my $req = new HTTP::Request(GET => $url);
my $res = $ua->request($req);
# you should add a check to ensure the response was actually successful:
if (! $res->is_success) {
say "GET failed! " . $res->status_line;
}
# show the base URI for the response:
say "Base URI: " . $res->base;
You can view redirects using HTTP::Response's redirects method:
if ($res->redirects) { # are there any redirects?
my #redirects = $res->redirects;
say join(", ", #redirects);
}
else {
say "No redirects.";
}
In this case, the base URI is the same as $url, and if you examine the contents of the page, you can see why.
# print out the contents of the response:
say $res->decoded_contents;
Right near the bottom of the page, there is the following code:
$(window).load(function() {
window.setTimeout(function() {
window.location = "http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia"
}, 300);
});
The redirect is handled by javascript, and so is not picked up by LWP::UserAgent. If you want to get this URL, you will need to extract it from the response contents (or use a different client that supports javascript).
On a different note, your script starts off like this:
use LWP::UserAgent qw();
The code following the module name, qw(), is used to import particular subroutines into your script so that you can use them by name (instead of having to refer to the module name and the subroutine name). If the qw() is empty, it's not doing anything, so you can just omit it.
To have LWP::UserAgent follow redirects, just set the max_redirects option:
use strict;
use warnings;
use LWP::UserAgent qw();
my $url = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new( max_redirect => 5 );
my $res = $ua->get($url);
if ( $res->is_success ) {
print $res->decoded_content; # or whatever
} else {
die $res->status_line;
}
However, that website is using a JavaScript redirect.
$(window).load(function() {
window.setTimeout(function() {
window.location = "http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia"
}, 300);
});
This will not work unless you use a framework that enables JavaScript, like WWW::Mechanize::Firefox.
It will throw you an error for the last line $res - > request since it is returning hash and content from the response. So below is the code:
use LWP::UserAgent qw();
use CGI qw(:all);
print header();
my ($url) = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new;
my $req = new HTTP::Request(GET => $url);
my $res = $ua->request($req);
print $res->content;

HTTP request not going through proxy

I have written this code to fire a http request through a proxy.
But the request does not seem to use proxy. Even though I give a wrong proxy, it is returning OK.
Is there any way I can check, whether the HTTP request went via proxy?
What is the issue in this code which makes it not use proxy?
sub fire_http_request_through_proxy()
{
my $proxy = $_;
my $ua = LWP::UserAgent->new;
$ENV{HTTP_PROXY} = $proxy;
$ua->env_proxy; # initialize from environment variables
$ua->timeout(20);
my $response = $ua->get('http://www.google.com');
delete $ENV{HTTP_PROXY};
if ($response->is_success)
{
print $response->decoded_content . "\n";
}
else
{
die $response->status_line;
}
}
Sebastian and oalders have already solved your problem, but I'd just like to note that you don't need to mess around with $ENV{HTTP_PROXY} anyway — you can just use $ua->proxy(), like this:
$ua->proxy( http => 'http://1.1.1.1' );
or even:
$ua->proxy( ['http', 'https', 'ftp'] => 'http://1.1.1.1' );
Ps. If you really want to check which proxy was used by LWP for a particular request, you can peek at $response->request->{proxy}, which should be a URI object. However, as far as I know, this property is undocumented (I found out about it by reading the source) and thus subject to change in later versions of LWP. Use at your own risk!
Are you sure that $_ has a true value? This dies appropriately for me:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;
fire_http_request_through_proxy();
sub fire_http_request_through_proxy {
my $ua = LWP::UserAgent->new;
local $ENV{HTTP_PROXY} = 'http://1.1.1.1';
$ua->env_proxy; # initialize from environment variables
$ua->timeout( 20 );
my $response = $ua->get( 'http://www.google.com' );
delete $ENV{HTTP_PROXY};
if ( $response->is_success ) {
print $response->decoded_content . "\n";
}
else {
die $response->status_line;
}
}
So, maybe $_ isn't what you think it is. If it's not defined, then no proxy will be used. Having said that, $_ is probably not the variable you want to use here. You could either declare a variable for use in this case, pass a variable right to the subroutine or actually set an ENV variable outside of the script.
One other point. Rather than setting and deleting the ENV var in your script, just declare the change with local and it will only take effect inside this block of code. That way you don't have to clean up after yourself and you don't risk overriding vars which may have been set elsewhere.
Take a look at your code sub fire_http_request_through_proxy(), especially the last two characters... This is a prototype. Basically you are saying "I don't take any arguments during compile-time".
I guess you are simply invoking the method before its declaration -> Always use warnings:
main::fire_http_request_through_proxy() called too early to check
prototype at test.pl line ...
So either change it to fire_http_request_through_proxy or change it to fire_http_request_through_proxy($) and invoke it after its declaration.
More about prototyping in perlsub.
Be sure
to read parameter as $_[0] or pop, not $_
to not include () in sub definition
Script:
sub fire_http_request_through_proxy {
my $proxy = $_[0];
my $timeout = 20;
my $url = 'http://www.google.com';
my $ua = LWP::UserAgent->new;
$ua->proxy(['http', 'https', 'ftp'] => $proxy);
$ua->timeout($timeout);
my $response = $ua->get($url);
if ($response->is_success) {
print $response->decoded_content . "\n";
}
else {
die $response->status_line;
}
}
Test:
To make it work, parameter of proxy has to be in correct format (http://host:port)
fire_http_request_through_proxy('http://176.34.248.142:9050');

How do I make parallel HTTP requests in Perl, and receive them back in order?

Using Perl, I'm looking for a simple way to perform a handful of HTTP requests in parallel, where I get responses back in the same order I sent them after they complete, e.g.:
my ($google, $perl) = foobar(GET => 'http://www.google.com/',
GET => 'http://www.perl.org/');
Is there a module I should be looking at?
I know I can do the bookkeeping by hand, but I feel spoiled after being able to do this using jQuery's when method, and I'd love to have as simple a solution using Perl.
Thanks for your help.
use threads;
use LWP::UserAgent qw( );
my $ua = LWP::UserAgent->new();
my #threads;
for my $url ('http://www.google.com/', 'http://www.perl.org/') {
push #threads, async { $ua->get($url) };
}
for my $thread (#threads) {
my $response = $thread->join;
...
}
The best part is that the parent doesn't wait for all requests to be completed. As soon as the right request is completed, the parent will unblock to process it.
If you used Parallel::ForkManager or something else where you can't wait for a specific child, you can use the following code to order the results:
for my $id (0..$#urls) {
create_task($id, $urls[$id]);
}
my %responses;
for my $id (0..$#urls) {
if (!exists($responses{$id})) {
my ($id, $response) = wait_for_a_child_to_complete();
$responses{$id} = $response;
redo;
}
my $response = delete($responses{$id});
...
}
I am a fan of Mojo!
From the Mojo::UserAgent documentation:
use Mojo;
use Mojo::UserAgent;
# Parallel requests
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(5);
my $delay = Mojo::IOLoop->delay;
for my $url ('http://www.google.com/', 'http://www.perl.org/') {
$delay->begin;
$ua->get($url => sub {
my ($ua, $tx) = #_;
$delay->end($tx->res->dom);
});
}
my #responses = $delay->wait;
print join "\n", #responses
Enjoy!
EDIT
Btw. you do not have to process the responses at the end, you may do it in between:
# ...
$ua->get($url => sub {
my ($ua, $tx) = #_;
$delay->end(1);
# process $tx->res here
});
# ...
$delay->wait;

True timeout on LWP::UserAgent request method

I am trying to implement a request to an unreliable server. The request is a nice to have, but not 100% required for my perl script to successfully complete. The problem is that the server will occasionally deadlock (we're trying to figure out why) and the request will never succeed. Since the server thinks it is live, it keeps the socket connection open thus LWP::UserAgent's timeout value does us no good what-so-ever. What is the best way to enforce an absolute timeout on a request?
FYI, this is not an DNS problem. The deadlock has something to do with a massive number of updates hitting our Postgres database at the same time. For testing purposes, we've essentially put a while(1) {} line in the servers response handler.
Currently, the code looks like so:
my $ua = LWP::UserAgent->new;
ua->timeout(5); $ua->cookie_jar({});
my $req = HTTP::Request->new(POST => "http://$host:$port/auth/login");
$req->content_type('application/x-www-form-urlencoded');
$req->content("login[user]=$username&login[password]=$password");
# This line never returns
$res = $ua->request($req);
I've tried using signals to trigger a timeout, but that does not seem to work.
eval {
local $SIG{ALRM} = sub { die "alarm\n" };
alarm(1);
$res = $ua->request($req);
alarm(0);
};
# This never runs
print "here\n";
The final answer I'm going to use was proposed by someone offline, but I'll mention it here. For some reason, SigAction works while $SIG(ALRM) does not. Still not sure why, but this has been tested to work. Here are two working versions:
# Takes a LWP::UserAgent, and a HTTP::Request, returns a HTTP::Request
sub ua_request_with_timeout {
my $ua = $_[0];
my $req = $_[1];
# Get whatever timeout is set for LWP and use that to
# enforce a maximum timeout per request in case of server
# deadlock. (This has happened.)
use Sys::SigAction qw( timeout_call );
our $res = undef;
if( timeout_call( 5, sub {$res = $ua->request($req);}) ) {
return HTTP::Response->new( 408 ); #408 is the HTTP timeout
} else {
return $res;
}
}
sub ua_request_with_timeout2 {
print "ua_request_with_timeout\n";
my $ua = $_[0];
my $req = $_[1];
# Get whatever timeout is set for LWP and use that to
# enforce a maximum timeout per request in case of server
# deadlock. (This has happened.)
my $timeout_for_client = $ua->timeout() - 2;
our $socket_has_timedout = 0;
use POSIX;
sigaction SIGALRM, new POSIX::SigAction(
sub {
$socket_has_timedout = 1;
die "alarm timeout";
}
) or die "Error setting SIGALRM handler: $!\n";
my $res = undef;
eval {
alarm ($timeout_for_client);
$res = $ua->request($req);
alarm(0);
};
if ( $socket_has_timedout ) {
return HTTP::Response->new( 408 ); #408 is the HTTP timeout
} else {
return $res;
}
}
You might try LWPx::ParanoidAgent, a subclass of LWP::UserAgent which is more cautious about how it interacts with remote webservers.
Among other things, it allows you to specify a global timeout. It was developed by Brad Fitzpatrick as part of the LiveJournal project.
You can make your own timeout like this:
use LWP::UserAgent;
use IO::Pipe;
my $agent = new LWP::UserAgent;
my $finished = 0;
my $timeout = 5;
$SIG{CHLD} = sub { wait, $finished = 1 };
my $pipe = new IO::Pipe;
my $pid = fork;
if($pid == 0) {
$pipe->writer;
my $response = $agent->get("http://stackoverflow.com/");
$pipe->print($response->content);
exit;
}
$pipe->reader;
sleep($timeout);
if($finished) {
print "Finished!\n";
my $content = join('', $pipe->getlines);
}
else {
kill(9, $pid);
print "Timed out.\n";
}
From what I understand, the timeout property doesn't take into account DNS timeouts. It's possible that you could make a DNS lookup separately, then make the request to the server if that works, with the correct timeout value set for the useragent.
Is this a DNS problem with the server, or something else?
EDIT: It could also be a problem with IO::Socket. Try updating your IO::Socket module, and see if that helps. I'm pretty sure there was a bug in there that was preventing LWP::UserAgent timeouts from working.
Alex
The following generalization of one of the original answers also restores the alarm signal handler to the previous handler and adds a second call to alarm(0) in case the call in the eval clock throws a non alarm exception and we want to cancel the alarm. Further $# inspection and handling can be added:
sub ua_request_with_timeout {
my $ua = $_[0];
my $request = $_[1];
# Get whatever timeout is set for LWP and use that to
# enforce a maximum timeout per request in case of server
# deadlock. (This has happened.)`enter code here`
my $timeout_for_client_sec = $ua->timeout();
our $res_has_timedout = 0;
use POSIX ':signal_h';
my $newaction = POSIX::SigAction->new(
sub { $res_has_timedout = 1; die "web request timeout"; },# the handler code ref
POSIX::SigSet->new(SIGALRM),
# not using (perl 5.8.2 and later) 'safe' switch or sa_flags
);
my $oldaction = POSIX::SigAction->new();
if(!sigaction(SIGALRM, $newaction, $oldaction)) {
log('warn',"Error setting SIGALRM handler: $!");
return $ua->request($request);
}
my $response = undef;
eval {
alarm ($timeout_for_client_sec);
$response = $ua->request($request);
alarm(0);
};
alarm(0);# cancel alarm (if eval failed because of non alarm cause)
if(!sigaction(SIGALRM, $oldaction )) {
log('warn', "Error resetting SIGALRM handler: $!");
};
if ( $res_has_timedout ) {
log('warn', "Timeout($timeout_for_client_sec sec) while waiting for a response from cred central");
return HTTP::Response->new(408); #408 is the HTTP timeout
} else {
return $response;
}
}