Set chunk size for LWP get callback - perl

The LWP::UserAgent module allows using a callback to read web content in chunks.
$r = $ua->get($url, ':content_cb' => sub
{
    my ($chunk, $resp, $proto) = @_;
    # ...
});
It seems that the chunk size is preset to 16384 bytes, which is very small for me.
How can I set the chunks to be of a different size, let's say 4 MB, i.e. 4*1024*1024?
I added :read_size_hint as follows, but saw no effect.
$r = $ua->get($url, ':read_size_hint' => 4*1024*1024, ':content_cb' => sub
{
    my ($chunk, $resp, $proto) = @_;
    # ...
});

:read_size_hint is only a hint given to sysread on the underlying socket. It does not mean that that much data will be read at once, only that it might be if that much is already available.
How much is actually available, and thus can be returned by sysread, depends on how fast the sender sends data (sysread will not wait for more but just returns what is available so far) and on how large the socket's read buffer is.
In the case of TLS it also depends on the TLS record size used by the sender, i.e. a single sysread will not return data spanning multiple TLS records. Since the maximum record size in TLS is 16384 bytes, any larger setting for :read_size_hint has no effect.
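This is easy to verify with a small sketch (the URL is a placeholder) that histograms the chunk sizes the callback actually receives; over HTTPS nothing above 16384 bytes should show up, whatever :read_size_hint is set to.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my %sizes;
$ua->get( 'https://example.com/big-file',      # placeholder URL
    ':read_size_hint' => 4*1024*1024,
    ':content_cb'     => sub {
        my ($chunk, $resp, $proto) = @_;
        $sizes{ length $chunk }++;             # histogram of chunk sizes
    },
);
printf "%8d bytes x %d\n", $_, $sizes{$_} for sort { $a <=> $b } keys %sizes;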

I'm afraid this is entirely at the sending server's discretion.
What I've done is accumulate the response in the response_data handler until some criterion is met, at which point further action is taken.
Another option is to let the object accumulate it and use other handlers.
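A rough sketch of that handler-based accumulation (handle_block and the 1 MB threshold are hypothetical); a response_data handler must return true to keep receiving chunks, and note that without a :content_cb the full response still accumulates in the response object as usual.
my $threshold = 1024 * 1024;   # example threshold
my $pending   = '';
$ua->add_handler( response_data => sub {
    my ( $response, $ua, $headers, $data ) = @_;
    $pending .= $data;
    if ( length($pending) >= $threshold ) {
        handle_block($pending);   # hypothetical processing routine
        $pending = '';
    }
    return 1;   # a true value keeps the handler active for further chunks
} );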

For fixed-size chunks:
my $msg_size = 4 * 1024 * 1024;

my $buf = '';
my $response = $ua->get( $url,
    ':content_cb' => sub {
        $buf .= shift;
        while ( length( $buf ) >= $msg_size ) {
            process( substr( $buf, 0, $msg_size, "" ), @_ );
        }
    },
);

!length( $buf )
    or die( "Premature EOF\n" );
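If a shorter final block is legitimate for your data, you could process the remainder instead of treating it as an error:
process($buf) if length $buf;   # handle a final block shorter than $msg_size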

How to get a user-configurable buffer for printing?

I'd like to have a print function supporting a user-configurable buffer, so that what is in the buffer is printed only when the buffer exceeds a threshold.
I need to write multiple files, so I have multiple filehandles to write to, and for this an object-oriented module might be handier.
I imagine something like this:
my $printer1 = Print::Buffer->new({ size => 1000, filehandle => \$OUT1 });

for (my $i=1; $i<1000; $i++) {
    $printer1->print("This string will be eventually printed ($i/1000)");
}

# and at the end print the remaining buffer
$printer1->flush();
Any recommendation? I probably didn't use the right keywords, as with print/buffer I didn't find clear matches on CPAN.
UPDATE:
Thanks everyone for the very useful comments. As some of you pointed out, the problem is more complex than I initially thought, and probably a bad idea. (This question arose as I was printing very large files [>100 GB] with a print statement at each loop iteration, and noticed that printing only every hundredth iteration gave a speedup, but that could depend on how the loop was changed...)
UPDATE 2:
I need/want to accept an answer. Both have been instructive to me, and both are useful. I tested both, and they both need further work before I can benchmark the improvement (if any, see the update above). The tied handle is a lesser-known feature that I loved, which is why I accepted that answer. They were both equally close to the desired answer in my opinion. Thank you all very much for the discussion and the insights.
I'd like to have a print function supporting a user-configurable buffer, [...]
I imagine something like this: [...]
It's not hard to write something like that. Here's a basic sketch.
File PrintBuffer.pm
package PrintBuffer;

use warnings;
use strict;

sub new {
    my ($class, %args) = @_;
    my $self = {
        _size => $args{size} // 64*1024,
        _fh   => $args{filehandle} // *STDOUT,
        _buf  => ''
    };
    $self->{_fh}->autoflush;  # want it out once it's printed
    bless $self, $class;
}

sub print {
    my ($self, $string) = @_;
    $self->{_buf} .= $string;
    if ( length($self->{_buf}) > $self->{_size} ) {
        print { $self->{_fh} } $self->{_buf};
        $self->{_buf} = '';
    }
    return $self;
}

sub DESTROY {
    my $self = shift;
    print { $self->{_fh} } $self->{_buf} if $self->{_buf} ne '';
    $self->{_buf} = '';
}

1;
There's a bit more to do here, and a whole lot that can be added, and since it relies only on basic tools one can add/change as desired.† For one, I can imagine a size method to manipulate the buffer size of an existing object (print if there's already more data than the new size), and flush.
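For instance, those two could look something like this (a sketch in the same spirit, not part of the original class):
sub flush {
    my $self = shift;
    print { $self->{_fh} } $self->{_buf} if length $self->{_buf};
    $self->{_buf} = '';
    return $self;
}

sub size {
    my ($self, $new_size) = @_;
    if (defined $new_size) {
        $self->{_size} = $new_size;
        # print right away if the buffer already exceeds the new size
        $self->flush if length($self->{_buf}) > $new_size;
    }
    return $self->{_size};
}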
Note that the DESTROY method ensures the buffer is printed as the object goes out of scope and is destroyed, which seems a reasonable thing to do.
A driver:
use warnings;
use strict;
use feature 'say';

use PrintBuffer;

my $fout = shift // die "Usage: $0 out-file\n";
open my $fh, '>', $fout or die "Can't open $fout: $!";

my $obj_file   = PrintBuffer->new(size => 100, filehandle => $fh);
my $obj_stdout = PrintBuffer->new(size => 100);

$obj_file->print('a little bit');
$obj_stdout->print('a little bit');
say "printed 'a little bit' ..."; sleep 10;

$obj_file->print('out'x30);    # push it over 100 chars
$obj_stdout->print('out'x30);
say "printed 'out'x30 ... "; sleep 10;

$obj_file->print('again...');  # check DESTROY
$obj_stdout->print('again');
say "printed 'again' (and we're done)";
Check the size of the output file in another terminal after each informational print.
I tried PerlIO::buffersize, brought up by Grinnz in a comment, and it seems to work as advertised. It doesn't allow you to do everything you may wish for, but it may be a ready solution for basic needs. Note that it doesn't work with an :encoding layer in use.
Thanks to ikegami for comments and tests (linked in comments).
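For reference, a minimal example of that layer; the buffer size is given at open time as part of the layer specification:
use PerlIO::buffersize;
open my $fh, '>:buffersize(1048576)', $fout or die "Can't open $fout: $!";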
† The print works with an autoflush-ed handle. Still, the first change could be to use syswrite instead, which is unbuffered and attempts to directly write all that's asked of it, via one write(2) call. But since there's no guarantee that all of it got written, we also need to check:
use Carp;  # for croak

WRITE: {
    my $bytes_written = 0;
    while ( $bytes_written < length $self->{_buf} ) {
        my $rv = syswrite(
            $self->{_fh},
            $self->{_buf},
            length($self->{_buf}) - $bytes_written,
            $bytes_written
        );
        croak "Error writing: $!" if not defined $rv;
        $bytes_written += $rv;
    }
    $self->{_buf} = '';
};
I've put this in a block only to limit the scope of $bytes_written and any other variables one may wish to introduce, so as to reduce the number of dereferences of $self (but note that $self->{_buf} may be quite large, and copying it "to optimize" dereferencing may end up slower).
Naively we'd only need syswrite(FH, SCALAR), but if not all of SCALAR gets written we need to continue writing from past what was already written, hence the form with length-to-write and offset.
Since this is unbuffered it mustn't be mixed with buffered IO (or that needs to be done very carefully); see the docs. Also, :encoding layers can't be used with it. Weigh these restrictions against the other capabilities wanted in this class.
I don't see a general solution on CPAN, either. But this is straightforward enough with tied filehandles. Something like
use Symbol;

sub Print::Buffer::new {
    my ($class, $mode, $file, @opts) = @_;
    my $x = Symbol::gensym;
    open my $fh, $mode, $file or die "failed to open '$file': $!";
    tie *$x, "Print::Buffer", fh => $fh, @opts;
    $x;
}

sub Print::Buffer::TIEHANDLE {
    my $pkg = shift;
    my $self = { @_ };
    $self->{bufsize} //= 16 * 1024 * 1024;
    $self->{buffer}  = "";
    bless $self, $pkg;
}

sub Print::Buffer::PRINT {
    my ($self, @msg) = @_;
    $self->{buffer} .= join($, // '', @msg);
    $self->_FLUSH if length($self->{buffer}) > $self->{bufsize};
}

sub Print::Buffer::_FLUSH {
    my $self = shift;
    print {$self->{fh}} $self->{buffer};
    $self->{buffer} = "";
}

sub Print::Buffer::CLOSE {
    my $self = shift;
    $self->_FLUSH;
    close( $self->{fh} );
}

sub Print::Buffer::DESTROY {
    my $self = shift;
    $self->_FLUSH;
}

# ----------------------------------------

my $fh1 = Print::Buffer->new(">", "/tmp/file1", bufsize => 16*1024*1024);
for (my $i = 1; $i < 1000; $i++) {
    print $fh1 "This string will be eventually printed ($i/1000)\n";
}

Flush INET Socket response data with BLOCKING enabled

I am making a program that interfaces with Teamspeak, and I have an issue where the responses received do not match the commands sent. I run the program multiple times, and each time I get different results when they should be the same, because the responses are out of sync.
use IO::Socket;

my $buf = '';
my $sock = IO::Socket::INET->new(
    PeerAddr  => 'localhost',
    PeerPort  => '10011',
    Proto     => 'tcp',
    Autoflush => 1,
    Blocking  => 1,
    Timeout   => 10,
);

sub ExecuteCommand {
    print $sock $_[0]."\n";
    $sock->sysread($buf, 1024*10);
    return $buf;
}
ExecuteCommand("login ${username} ${password}");
ExecuteCommand("use sid=1");
ExecuteCommand("clientupdate client_nickname=Idle\\sTimer");
my $client_list = ExecuteCommand("clientlist");
Each command is executed properly; however, the server likes to return extra lines, so a single sysread will not be enough and I have to execute another. The responses are at most 512 bytes, so they aren't being cut off. If I try to run sysread multiple times in an attempt to flush the socket, the program just hangs when there is nothing left to read.
Each response ends with "error id=0 msg=ok".
How would I be able to read all the data that comes out, even if it's multiple lines? Or just be able to flush it all out so I can move on to the next command without having to worry about old data?
So you want to read until you find a line starting with error. The following does that, and also buffers anything extra that was read, since it's part of the next response.
sub read_response {
    my ($conn) = @_;
    my $fh = $conn->{fh};
    our $buf; local *buf = \($conn->{buf});  # alias
    our $eof; local *eof = \($conn->{eof});  # alias

    $buf = '' if !defined($buf);
    return undef if $eof;

    while (1) {
        if ($buf =~ s/\A(.*?^error[^\n]*\n)//ms) {
            return $1;
        }

        my $rv = sysread($fh, $buf, 64*1024, length($buf));
        if (!$rv) {
            if (defined($rv)) {
                $eof = 1;
                return undef;
            } else {
                die "Can't read response: $!\n";
            }
        }
    }
}
my $conn = { fh => $sock };
... send command ...
my $response = read_response($conn);
...
... send command ...
my $response = read_response($conn);
...
I changed my ExecuteCommand subroutine to include a check for "error id=[0-9]+", which is always at the end of a response from Teamspeak 3 servers.
sub ExecuteCommand {
    print $sock $_[0]."\n";
    my $response = "";
    while (1) {
        $sock->sysread($buf, 1024*10);
        last if $buf =~ /error id=([0-9]+)/;
        $response .= $buf;
    }
    return $response;
}
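Since the protocol is line-oriented and every response ends with that error line, an alternative sketch is to read by lines rather than fixed-size sysread calls (this version also keeps the error line in the returned response). If you go this route, use buffered reads throughout; mixing <$sock> with sysread on the same handle can lose data.
sub ExecuteCommand {
    my ($command) = @_;
    print $sock "$command\n";
    my $response = '';
    while ( my $line = <$sock> ) {
        $response .= $line;
        last if $line =~ /^error id=\d+/;   # terminator line
    }
    return $response;
}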

Progress indicator for Perl LWP POST upload

I'm working on a Perl script which uploads big files with a POST request. My question is whether it's possible to have status output, because uploading big files can take some time with my internet connection.
I mean like a status bar with
$| = 1;
print "\r|----------> | 33%";
print "\r|--------------------> | 66%";
print "\r|------------------------------| 100%\n";
Here's my upload code:
my $ua = LWP::UserAgent->new();

$file = "my_big_holyday_vid.mp4";
$user = "username";
$pass = "password";

print "starting Upload...\n";

$res = $ua->post(
    "http://$server",
    Content_Type => 'form-data',
    Content => [
        fn       => ["$file" => $file],
        username => $user,
        password => $pass,
    ],
);

print "Upload complete!\n";
If you look at the documentation for HTTP::Request::Common you will see that, if you set $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD to a true value, then the request object's content method will provide a callback that is used to fetch the data in chunks.
Normally this is called each time more data is needed for upload, but you can wrap it in your own subroutine to monitor the progress of the upload.
The program below gives an example. As you can see, the HTTP::Request object is created (I have assumed that the fn field should be just [$file]) and the content method is used to fetch the callback subroutine.
The subroutine wrapper just calls $callback in the first line to fetch the next data chunk, and returns it in the last line, just as $callback itself would do. Between these two lines you can add what you like, as long as it doesn't interfere with passing the chunk back to LWP. In this case I have printed the size of each chunk together with the percentage upload so far on each call.
For the purpose of percentage calculations, the full size of the file is accessible as $req->header('content-length'), which is more correct than using -s on the file.
Also, the final iteration can be detected if necessary, as the callback will return a chunk of zero length.
Note that this is untested except as far as it compiles and does roughly the right thing, as I have no internet service available that expects a file upload.
use strict;
use warnings;

use LWP;
use HTTP::Request::Common;

$HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;

my $ua = LWP::UserAgent->new;

my $server = 'example.com';
my $file   = 'my_big_holyday_vid.mp4';
my ($user, $pass) = qw/ username password /;

print "Starting Upload...\n";

my $req = POST "http://$server",
    Content_Type => 'form-data',
    Content => [
        fn       => [$file],
        username => $user,
        password => $pass,
    ];

my $total;
my $callback = $req->content;
my $size     = $req->header('content-length');
$req->content(\&wrapper);

my $resp = $ua->request($req);

sub wrapper {
    my $chunk = $callback->();
    if ($chunk) {
        my $length = length $chunk;
        $total += $length;
        printf "%+5d = %5.1f%%\n", $length, $total / $size * 100;
    }
    else {
        print "Completed\n";
    }
    $chunk;
}
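To render the single-line bar from the question instead of printing a line per chunk, the printf inside wrapper could be replaced with something along these lines (an untested sketch; $total and $size as in the program above):
$| = 1;    # unbuffered output, so the bar redraws in place
my $width = 30;
my $done  = int( $width * $total / $size );
printf "\r|%s%s| %3.0f%%", '-' x $done, ' ' x ( $width - $done ),
    100 * $total / $size;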

Mechanize question regarding submit()

I tried looking around on the forum and googling for answers but cannot figure it out. After submitting a form on a webpage that needs time to do some computation, does Mechanize wait for all the computation to finish (even if it takes an hour)? It seems that it doesn't.
I am iterating through a subroutine that creates a Mechanize object, submits a form, and downloads the output file once the computation is done. However, it seems to jump to the next iteration of the loop without completing all those tasks, since the computation sometimes takes a long time. Does anyone have any suggestions? Thanks. This is the subroutine:
sub microinspector {
    my ($sequence, $folder) = @_;
    print STDOUT "subroutine sequence: $sequence\n";

    my $browser = WWW::Mechanize->new();
    $browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");
    $browser->form_number(1);
    $browser->field("target_sequence", $sequence);
    $browser->select("Choose an organism : ", "Mus musculus");
    $browser->submit();
    # print $browser->content();

    my @links = $browser->links();
    chdir($folder) or die "Cannot chdir to $folder";
    foreach my $link (@links) {
        # print $link->url();
        if ( $link->url() =~ /csv$/i ) {
            my $result = $browser->get( $link->url() );
            my $filename = ( $link->url() =~ /\/([^\/]+)$/ )[0];
            print "Saving $filename\n";
            open( OUT, ">$filename" ) or die "Cannot open $filename: $!";
            print OUT $result->content();
            close( OUT );
        }
    }
}
WWW::Mechanize can take an optional timeout parameter (specified in seconds) in its constructor, which is passed on to its parent class, LWP::UserAgent. I think the default is 180 seconds.
Try increasing it, like:
my $browser = WWW::Mechanize->new(
    timeout => 60 * 10,  # 10 minutes
);
See the LWP::UserAgent docs on the timeout method for the specific semantics of how this is treated. It's mostly as you expect, but just in case.

True timeout on LWP::UserAgent request method

I am trying to implement a request to an unreliable server. The request is nice to have, but not 100% required for my Perl script to complete successfully. The problem is that the server will occasionally deadlock (we're trying to figure out why) and the request will never succeed. Since the server thinks it is live, it keeps the socket connection open, so LWP::UserAgent's timeout value does us no good whatsoever. What is the best way to enforce an absolute timeout on a request?
FYI, this is not a DNS problem. The deadlock has something to do with a massive number of updates hitting our Postgres database at the same time. For testing purposes, we've essentially put a while(1) {} line in the server's response handler.
Currently, the code looks like so:
my $ua = LWP::UserAgent->new;
$ua->timeout(5);
$ua->cookie_jar({});

my $req = HTTP::Request->new(POST => "http://$host:$port/auth/login");
$req->content_type('application/x-www-form-urlencoded');
$req->content("login[user]=$username&login[password]=$password");

# This line never returns
$res = $ua->request($req);
I've tried using signals to trigger a timeout, but that does not seem to work.
eval {
    local $SIG{ALRM} = sub { die "alarm\n" };
    alarm(1);
    $res = $ua->request($req);
    alarm(0);
};

# This never runs
print "here\n";
The final answer I'm going to use was proposed by someone offline, but I'll mention it here. For some reason, SigAction works while $SIG{ALRM} does not. I'm still not sure why, but this has been tested to work. Here are two working versions:
# Takes an LWP::UserAgent and an HTTP::Request, returns an HTTP::Response
sub ua_request_with_timeout {
    my $ua  = $_[0];
    my $req = $_[1];
    # Get whatever timeout is set for LWP and use that to
    # enforce a maximum timeout per request in case of server
    # deadlock. (This has happened.)
    use Sys::SigAction qw( timeout_call );
    our $res = undef;
    if ( timeout_call( 5, sub { $res = $ua->request($req); } ) ) {
        return HTTP::Response->new( 408 );  # 408 is the HTTP timeout status
    } else {
        return $res;
    }
}
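For illustration, the helper might be used like this (the URL is a placeholder):
my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://unreliable.example.com/' );
my $res = ua_request_with_timeout( $ua, $req );
print $res->code == 408 ? "request timed out\n" : $res->status_line . "\n";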
sub ua_request_with_timeout2 {
    print "ua_request_with_timeout\n";
    my $ua  = $_[0];
    my $req = $_[1];
    # Get whatever timeout is set for LWP and use that to
    # enforce a maximum timeout per request in case of server
    # deadlock. (This has happened.)
    my $timeout_for_client = $ua->timeout() - 2;
    our $socket_has_timedout = 0;

    use POSIX;
    sigaction SIGALRM, POSIX::SigAction->new(
        sub {
            $socket_has_timedout = 1;
            die "alarm timeout";
        }
    ) or die "Error setting SIGALRM handler: $!\n";

    my $res = undef;
    eval {
        alarm($timeout_for_client);
        $res = $ua->request($req);
        alarm(0);
    };

    if ( $socket_has_timedout ) {
        return HTTP::Response->new( 408 );  # 408 is the HTTP timeout status
    } else {
        return $res;
    }
}
You might try LWPx::ParanoidAgent, a subclass of LWP::UserAgent which is more cautious about how it interacts with remote webservers.
Among other things, it allows you to specify a global timeout. It was developed by Brad Fitzpatrick as part of the LiveJournal project.
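A minimal sketch of that approach; unlike the per-read LWP timeout, LWPx::ParanoidAgent's timeout caps the total time allowed for the whole request:
use LWPx::ParanoidAgent;

my $ua = LWPx::ParanoidAgent->new;
$ua->timeout(10);   # overall limit in seconds for the entire request
my $res = $ua->get("http://$host:$port/auth/login");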
You can make your own timeout like this:
use LWP::UserAgent;
use IO::Pipe;

my $agent = LWP::UserAgent->new;

my $finished = 0;
my $timeout  = 5;
$SIG{CHLD} = sub { wait, $finished = 1 };

my $pipe = IO::Pipe->new;
my $pid  = fork;
if ($pid == 0) {
    $pipe->writer;
    my $response = $agent->get("http://stackoverflow.com/");
    $pipe->print($response->content);
    exit;
}

$pipe->reader;
sleep($timeout);

if ($finished) {
    print "Finished!\n";
    my $content = join('', $pipe->getlines);
}
else {
    kill(9, $pid);
    print "Timed out.\n";
}
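Note that the SIGCHLD handler is what makes this responsive: the signal interrupts sleep, so the parent wakes as soon as the child exits rather than always waiting the full $timeout.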
From what I understand, the timeout property doesn't take DNS timeouts into account. You could make a DNS lookup separately, then make the request to the server if that works, with the correct timeout value set for the user agent.
Is this a DNS problem with the server, or something else?
EDIT: It could also be a problem with IO::Socket. Try updating your IO::Socket module, and see if that helps. I'm pretty sure there was a bug in there that was preventing LWP::UserAgent timeouts from working.
The following generalization of one of the original answers also restores the alarm signal handler to the previous handler, and adds a second call to alarm(0) in case the call in the eval block throws a non-alarm exception and we want to cancel the alarm. Further $@ inspection and handling can be added:
sub ua_request_with_timeout {
    my $ua      = $_[0];
    my $request = $_[1];
    # Get whatever timeout is set for LWP and use that to
    # enforce a maximum timeout per request in case of server
    # deadlock. (This has happened.)
    my $timeout_for_client_sec = $ua->timeout();
    our $res_has_timedout = 0;

    use POSIX ':signal_h';
    my $newaction = POSIX::SigAction->new(
        sub { $res_has_timedout = 1; die "web request timeout"; },  # the handler code ref
        POSIX::SigSet->new(SIGALRM),
        # not using (perl 5.8.2 and later) 'safe' switch or sa_flags
    );
    my $oldaction = POSIX::SigAction->new();
    if ( !sigaction(SIGALRM, $newaction, $oldaction) ) {
        warn "Error setting SIGALRM handler: $!";
        return $ua->request($request);
    }

    my $response = undef;
    eval {
        alarm($timeout_for_client_sec);
        $response = $ua->request($request);
        alarm(0);
    };
    alarm(0);  # cancel the alarm (in case eval failed for a non-alarm cause)

    if ( !sigaction(SIGALRM, $oldaction) ) {
        warn "Error resetting SIGALRM handler: $!";
    }

    if ( $res_has_timedout ) {
        warn "Timeout ($timeout_for_client_sec sec) while waiting for a response from cred central";
        return HTTP::Response->new(408);  # 408 is the HTTP timeout status
    } else {
        return $response;
    }
}