Mechanize question regarding submit() - perl

I looked around the forum and googled for answers but cannot figure this out. After submitting a form to a web page that needs time to do some computation, does Mechanize wait for the computation to finish (even if it takes an hour)? It seems that it doesn't. I am looping over a subroutine that creates a Mechanize object, submits a form, and downloads the output file once the computation is done. However, it seems to jump to the next loop iteration without completing all of those tasks, since the computation sometimes takes a long time. Does anyone have any suggestions? Thanks. This is the subroutine:
sub microinspector {
my ($sequence, $folder) = @_;
print STDOUT "subroutine sequence: $sequence\n";
my $browser = WWW::Mechanize->new();
$browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");
$browser->form_number(1);
$browser->field("target_sequence", $sequence);
$browser->select("Choose an organism : ", "Mus musculus");
$browser->submit();
#print $browser->content();
my @links = $browser->links();
chdir($folder) or die "Cannot chdir to $folder";
foreach my $link (@links) {
#print $link->url();
if( $link->url() =~ /csv$/i ){
my $result = $browser->get( $link->url() );
my $filename = ( $link->url() =~ /\/([^\/]+)$/ )[0];
print "Saving $filename\n";
open( OUT, ">$filename" ) or die "Cannot open $filename: $!";
print OUT $result->content();
close( OUT );
}
}
}

WWW::Mechanize accepts an optional timeout parameter (in seconds) in its constructor, which it passes on to its parent class, LWP::UserAgent. The default is 180 seconds.
Try increasing it, for example:
my $browser = WWW::Mechanize->new(
timeout => 60 * 10, # 10 minutes
);
See the LWP::UserAgent docs on the timeout method for the exact semantics. It mostly behaves as you would expect, but it is worth checking.

Related

How to get a user-configurable buffer for printing?

I'd like to have a print function supporting a user-configurable buffer, so that what is in the buffer is printed only when the buffer exceeds a threshold.
I need to write multiple files, so I have multiple filehandles to write to, and for this an object oriented module might be handier.
I imagine something like this:
my $printer1 = Print::Buffer->new({ size => 1000, filehandle => \$OUT1 });
for (my $i=1; $i<1000; $i++) {
$printer1->print("This string will be eventually printed ($i/1000)");
}
# and at the end print the remaining buffer
$printer1->flush();
Any recommendations? I'm probably not using the right keywords, as searching CPAN for print/buffer didn't turn up any clear matches.
UPDATE:
Thanks everyone for the very useful comments. As some of you pointed out, the problem is more complex than I initially thought, and probably a bad idea. (This question arose as I was printing very large files [>100Gb] with a print statement at each loop iteration, and noted that if I printed only every hundredth iteration I got a speedup, but that could depend on how the loop was changed...)
UPDATE 2:
I need/want to accept an answer. To me both have been instructive and they are both useful. I tested both and they both need further work before being able to benchmark the improvement (if any, see update above). The tie handle is a less known feature that I loved, that's why I accepted that. They were both equally close to the desired answer in my opinion. Thank you all very much for the discussion and the insights.
I'd like to have a print function supporting a user-configurable buffer, [...]
I imagine something like this: [...]
It's not hard to write something like it. Here's a basic sketch
File PrintBuffer.pm
package PrintBuffer;
use warnings;
use strict;
sub new {
my ($class, %args) = @_;
my $self = {
_size => $args{size} // 64*1024, #//
_fh => $args{filehandle} // *STDOUT,
_buf => ''
};
$self->{_fh}->autoflush; # want it out once it's printed
bless $self, $class;
}
sub print {
my ($self, $string) = @_;
$self->{_buf} .= $string;
if ( length($self->{_buf}) > $self->{_size} ) {
print { $self->{_fh} } $self->{_buf};
$self->{_buf} = '';
}
return $self;
}
sub DESTROY {
my $self = shift;
print { $self->{_fh} } $self->{_buf} if $self->{_buf} ne '';
$self->{_buf} = '';
}
1;
There's a bit more to do here, and a whole lot that can be added, and since it relies only on basic tools one can add/change as desired.† For one, I can imagine a size method to manipulate the buffer size of an existing object (print if there's already more data than the new size), and flush.
Note that the DESTROY method ensures that the buffer is printed when the object goes out of scope and is destroyed, which seems reasonable to do.
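The size and flush methods mentioned above could look roughly like the following. This is only a sketch, assuming the object keeps the _size, _fh and _buf fields of the class above; the method names flush and size, and the package name PrintBufferSketch, are my own choices for illustration and not an existing API. It is written as a small stand-alone script so it can be run on its own.

```perl
use strict;
use warnings;
use IO::Handle;

{
    package PrintBufferSketch;   # hypothetical name, mirrors PrintBuffer above

    sub new {
        my ($class, %args) = @_;
        my $self = {
            _size => $args{size}       // 64*1024,
            _fh   => $args{filehandle} // \*STDOUT,
            _buf  => '',
        };
        return bless $self, $class;
    }

    # Write out whatever is buffered, then empty the buffer.
    sub flush {
        my ($self) = @_;
        if ( length $self->{_buf} ) {
            print { $self->{_fh} } $self->{_buf};
            $self->{_buf} = '';
        }
        return $self;
    }

    # Getter/setter for the threshold. When the buffer already holds
    # more than the new size, it is written out right away.
    sub size {
        my ($self, $new_size) = @_;
        return $self->{_size} if not defined $new_size;
        $self->{_size} = $new_size;
        $self->flush if length( $self->{_buf} ) > $new_size;
        return $self;
    }

    sub print {
        my ($self, $string) = @_;
        $self->{_buf} .= $string;
        $self->flush if length( $self->{_buf} ) > $self->{_size};
        return $self;
    }

    sub DESTROY { $_[0]->flush }
}

# Usage: an in-memory filehandle makes it easy to see when data goes out.
my $out = '';
open my $fh, '>', \$out or die "Can't open in-memory handle: $!";
$fh->autoflush(1);

my $pb = PrintBufferSketch->new( size => 100, filehandle => $fh );
$pb->print('a little bit');   # 12 chars, stays in the buffer
$pb->size(5);                 # buffer exceeds the new threshold: written out
$pb->print('x');              # 1 char, buffered again
$pb->flush;                   # forced out
```

Having print delegate to flush keeps the "write and empty the buffer" logic in one place, so DESTROY, size and an explicit flush all behave the same way.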
A driver
use warnings;
use strict;
use feature 'say';
use PrintBuffer;
my $fout = shift // die "Usage: $0 out-file\n";
open my $fh, '>', $fout or die "Can't open $fout: $!";
my $obj_file = PrintBuffer->new(size => 100, filehandle => $fh);
my $obj_stdout = PrintBuffer->new(size => 100);
$obj_file->print('a little bit');
$obj_stdout->print('a little bit');
say "printed 'a little bit' ..."; sleep 10;
$obj_file->print('out'x30); # push it over a 100 chars
$obj_stdout->print('out'x30);
say "printed 'out'x30 ... "; sleep 10;
$obj_file->print('again...'); # check DESTROY
$obj_stdout->print('again');
say "printed 'again' (and we're done)";
Check the size of output file in another terminal after each informational print.
I tried PerlIO::buffersize brought up by Grinnz in a comment and it seems to work "as advertised" as they say. It doesn't allow you to do all you may wish but it may be a ready solution for basic needs. Note that this doesn't work with :encoding layer in use.
Thanks to ikegami for comments and tests (linked in comments).
† The print works with an autoflush-ed handle. Still, the first change could be to use syswrite instead, which is unbuffered and attempts to directly write all that's asked of it, via one write(2) call. But since there's no guarantee that all got written we also need to check
use Carp; # for croak
WRITE: {
my $bytes_written = 0;
while ( $bytes_written < length $self->{_buf} ) {
my $rv = syswrite(
$self->{_fh},
$self->{_buf},
length($self->{_buf}) - $bytes_written,
$bytes_written
);
croak "Error writing: $!" if not defined $rv;
$bytes_written += $rv;
}
$self->{_buf} = '';
};
I've put this in a block only to limit the scope of $bytes_written and any other variables that one may wish to introduce so to reduce the number of dereferences of $self (but note that $self->{_buf} may be quite large and copying it "to optimize" dereferencing may end up slower).
Naively we'd only need syswrite(FH, SCALAR) but if it happens that not all of SCALAR gets written then we need to continue writing from past what was written, thus the need to use the form with length-to-write and offset as well.
Since this is unbuffered it mustn't be mixed with buffered IO (or that need be done very carefully); see the docs. Also, :encoding layers can't be used with it. Consider these restrictions against other capabilities that may be wanted in this class.
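For anyone who wants to try the partial-write loop outside of the class, here is a minimal stand-alone sketch. The helper name write_all is made up for this example, and the temporary file from File::Temp is only there to have something to write to. Note that the helper copies its data argument, which the in-class version above avoids.

```perl
use strict;
use warnings;
use Carp qw(croak);
use File::Temp qw(tempfile);

# Write all of $data to $fh with syswrite, continuing after partial writes.
# Returns the number of bytes written. (write_all is a hypothetical helper.)
sub write_all {
    my ($fh, $data) = @_;
    my $total   = length $data;
    my $written = 0;
    while ( $written < $total ) {
        # Length-to-write and offset let us resume past what was written.
        my $rv = syswrite( $fh, $data, $total - $written, $written );
        croak "Error writing: $!" if not defined $rv;
        $written += $rv;
    }
    return $written;
}

my ($fh, $filename) = tempfile( UNLINK => 1 );
binmode $fh;    # raw bytes; syswrite can't be used with :encoding layers
my $n = write_all( $fh, "some buffered text\n" x 3 );
close $fh or die "Can't close $filename: $!";
print "wrote $n bytes, file size is ", ( -s $filename ), "\n";
```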
I don't see a general solution on CPAN, either. But this is straightforward enough with tied filehandles. Something like
use Symbol;
sub Print::Buffer::new {
my ($class,$mode,$file,@opts) = @_;
my $x = Symbol::gensym;
open (my $fh, $mode, $file) or die "failed to open '$file': $!";
tie *$x, "Print::Buffer", fh => $fh, @opts;
$x;
}
sub Print::Buffer::TIEHANDLE {
my $pkg = shift;
my $self = { @_ };
$self->{bufsize} //= 16 * 1024 * 1024;
$self->{buffer} = "";
bless $self, $pkg;
}
sub Print::Buffer::PRINT {
my ($self,@msg) = @_;
$self->{buffer} .= join($, // '', @msg);
$self->_FLUSH if length($self->{buffer}) > $self->{bufsize};
}
sub Print::Buffer::_FLUSH {
my $self = shift;
print {$self->{fh}} $self->{buffer};
$self->{buffer} = "";
}
sub Print::Buffer::CLOSE {
my $self = shift;
$self->_FLUSH;
close( $self->{fh} );
}
sub Print::Buffer::DESTROY {
my $self = shift;
$self->_FLUSH;
}
# ----------------------------------------
my $fh1 = Print::Buffer->new(">", "/tmp/file1",
bufsize => 16*1024*1024);
for (my $i=1; $i<1000; $i++) {
print $fh1 "This string will be eventually printed ($i/1000)\n";
}

How to read from URL line-wise?

I'm looking for the "moral equivalent" of the (fictitious) openremote below:
my $handle = openremote( 'http://some.domain.org/huge.tsv' ) or die $!;
while ( <$handle> ) {
chomp;
# etc.
# do stuff with $_
}
close $handle;
IOW, I'm looking for a way to open a read handle to a remote file so that I can read from it line-by-line. (Typically this file will be larger than I want to read entirely into memory. This means that solutions based on stuffing the value returned by LWP::Simple::get (for example) into an IO::String are not suitable.)
I'm sure this is really basic stuff, but I have not been able to find it after a lot of searching.
Here's a "solution" much like the other responses but it cheats a bit by using IO::All
use IO::All ;
my $http_io = io->http("http://some.domain.org/huge.tsv");
while (my $line = $http_io->getline || $http_io->getline) {
print $line;
}
After you have an object with io->http you can use IO methods to look at it (like getline() etc.).
Cheers.
You can use LWP::UserAgent's parameter :content_file => $filename to save the big file to disk directly, without filling the memory with it, and then you can read that file in your program.
$ua->get( $url, ':content_file' => $filename );
Or you can use the parameter :content_cb => \&callback and in the callback subroutine you can process the data chunk by chunk as it is downloaded. This is probably the way you need.
$ua->get( $url, ':content_cb' => \&callback );
sub callback {
my ( $chunk, $response, $protocol ) = @_;
#Do whatever you like with $chunk
}
Read (a little) more about this with perldoc LWP::UserAgent.
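Since the question asks for line-wise reading, and the chunks passed to the callback do not end on line boundaries, the callback has to keep the incomplete tail of each chunk until the rest of the line arrives. Here is a rough sketch of that bookkeeping. The names make_line_reader, $feed and $finish are invented for this example and are not part of LWP; the chunks are simulated so the snippet runs without a network connection.

```perl
use strict;
use warnings;

# Returns two closures: one to feed with chunks, one to call at the end.
# (make_line_reader and $on_line are illustrative names, not part of LWP.)
sub make_line_reader {
    my ($on_line) = @_;
    my $pending = '';    # incomplete trailing line seen so far

    my $feed = sub {
        my ($chunk) = @_;
        $pending .= $chunk;
        # Hand over every complete line; keep the remainder for later.
        while ( $pending =~ s/\A([^\n]*)\n// ) {
            my $line = $1;
            $on_line->($line);
        }
        return;
    };

    my $finish = sub {
        $on_line->($pending) if length $pending;    # last line without "\n"
        $pending = '';
        return;
    };

    return ( $feed, $finish );
}

my @lines;
my ( $feed, $finish ) = make_line_reader( sub { push @lines, $_[0] } );

# Simulated chunks; with LWP this would be something like
#   $ua->get( $url, ':content_cb' => sub { $feed->( $_[0] ) } );
#   $finish->();
$feed->($_) for "a\tb\nc", "\td\ne\t", "f";
$finish->();
print "$_\n" for @lines;
```

Only the unfinished line is held in memory at any time, so this should be fine for a huge file. The sketch splits on "\n" only; a file with Windows line endings would leave a trailing "\r" on each line.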
Use LWP::Simple coupled with IO::String like so:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use IO::String;
my $handle = IO::String->new(get("http://stackoverflow.com"));
while (defined (my $line = <$handle>)) {
print $line;
}
close $handle;
Hope it works for you.
Paul

mojolicious script works three times, then crashes

The following script should demonstrate a problem I'm facing using Mojolicious on OpenBSD5.2 using mod_perl.
The script works fine 4 times being called as CGI under mod_perl. Additional runs of the script result in Mojolicious not returning the asynchronous posts. The subs that are usually called when data is arriving just don't seem to be called anymore. Running the script from command line works fine since perl is then completely started from scratch and everything is reinitialized, which is not the case under mod_perl. Stopping and starting Apache reinitializes mod_perl so that the script can be run another 4 times.
I only tested this on OpenBSD 5.2, using the Mojolicious version that's provided in OpenBSD's ports tree (2.76). This is kinda old I think, but that's what OpenBSD comes with.
Am I doing something completely wrong here? Or is it possible that Mojolicious has some circular reference or something which causes this issue?
I have no influence on the platform (OpenBSD) being used, so please don't suggest to "use Linux and install the latest Mojolicious version". However, if you are sure that running a later version of Mojolicious will solve the problem, I might get permission to install that (though I don't yet know how to do that).
Thanks in advance!
T.
Here's the script:
#!/usr/bin/perl
use diagnostics;
use warnings;
use strict;
use feature qw(switch);
use CGI qw/:param/;
use CGI qw/:url/;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use Mojo::IOLoop;
use Mojo::JSON;
use Mojo::UserAgent;
my ($activeconnections, $md5, $cgi);
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6); # Timeout 6 seconds of which 3 may be connecting
my $delay = Mojo::IOLoop->delay();
sub online{
my $url = "http://www.backgroundtask.eu/Systeemtaken/Search.php";
$delay->begin;
$activeconnections++;
my $response_bt = $ua->post_form($url, { 'ex' => $md5 }, sub {
my ($ua, $tx) = @_;
my $content=$tx->res->body;
$content =~ m/(http:\/\/www\.backgroundtask\.eu\/Systeemtaken\/taakinfo\/.*$md5\/)/;
if ($1){
print "getting $1\n";
my $response_bt2 = $ua->get($1, sub {
$delay->end();
$activeconnections--;
print "got result, ActiveConnections: $activeconnections\n";
($ua, $tx) = @_;
my $filename = $tx->res->dom->find('table.view')->[0]->find('tr.even')->[2]->td->[1]->all_text;
print "fn = " . $filename . "\n";
}
)
} else {
print "query did not return a result\n";
$activeconnections--;
$delay->end;
}
});
}
$cgi = new CGI;
print $cgi->header(-cache_control=>"no-cache, no-store, must-revalidate") . "\n";
$md5 = lc($cgi->param("md5") || ""); # read param
$md5 =~ s/[^a-f0-9]*//g if (length($md5) == 32); # custom input filter for md5 values only
if (length $md5 != 32) {
$md5=lc($ARGV[0]);
$md5=~ s/[^a-f0-9]*//g;
die "invalid MD5 $md5\n" if (length $md5 ne 32);
}
online;
if ($activeconnections) {
print "waiting..., activeconnections: $activeconnections\n" for $delay->wait;
}
print "all pending requests completed, activeconnections is " . $activeconnections . "\n";
print "script done.\n md5 was $md5\n";
exit 0;
Well I hate to say it, but there's a lot wrong here. The most glaring is your use of ... for $delay->wait which doesn't make much sense. Also you are comparing numbers with ne rather than !=. Not my-ing the arguments in the deeper callback seems problematic for async style code.
Then there are some code smells, like regexing for urls and closing over the $md5 variable unnecessarily.
Lastly, why use CGI.pm when Mojolicious can operate under CGI just fine? When you do that, the IOLoop is already running, so some things get easier. And yes I understand that you are using the system provided Mojolicious, however I feel I should mention that the current version is 3.93 :-)
Anyway, here is an example, which strips out a lot of things but still should do pretty much the same thing as the example. Of course I can't test it without a valid md5 for the site (and I also can't get rid of the url regex without sample data).
#!/usr/bin/perl
use Mojolicious::Lite;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6); # Timeout 6 seconds of which 3 may be connecting
any '/' => sub {
my $self = shift;
$self->res->headers->cache_control("no-cache, no-store, must-revalidate");
my $md5 = lc($self->param("md5") || ""); # read param
$md5 =~ s/[^a-f0-9]*//g if (length($md5) == 32); # custom input filter for md5 values only
if (length $md5 != 32) {
$md5=lc($ARGV[0]);
$md5=~ s/[^a-f0-9]*//g;
die "invalid MD5 $md5\n" if (length $md5 != 32);
}
$self->render_later; # wait for ua
my $url = "http://www.backgroundtask.eu/Systeemtaken/Search.php";
$ua->post_form($url, { 'ex' => $md5 }, sub {
my ($ua, $tx) = @_;
my $content=$tx->res->body;
$content =~ m{(http://www\.backgroundtask\.eu/Systeemtaken/taakinfo/.*$md5/)};
return $self->render( text => 'Failed' ) unless $1;
$ua->get($1, sub {
my ($ua, $tx) = @_;
my $filename = $tx->res->dom->find('table.view')->[0]->find('tr.even')->[2]->td->[1]->all_text;
$self->render( text => "md5 was $md5, filename was $filename" );
});
});
};
app->start;

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
That number is the count of white men born in 1923 who lived in San Diego County, California, in 1940. I am trying to do this in a loop so that it generalizes over multiple counties and birth years.
In the file locations.txt I put the list of counties, such as San Diego County.
The current code runs, but instead of the number 1975 it displays unknown. The number 1975 should end up in $val.
I would very much appreciate any help!
#!/usr/bin/perl
use strict;
use LWP::Simple;
open(L, "locations26.txt");
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
chomp($location);
$location =~ s/ /+/g;
foreach my $year (1923..1923) {
my $u = $url;
$u =~ s/%LOCATION%/$location/;
$u =~ s/%YEAR%/$year/;
#print "$u\n";
my $content = get($u);
my $val = 'unknown';
if ($content =~ / of .strong.([0-9,]+)..strong. /) {
$val = $1;
}
$val =~ s/,//g;
$location =~ s/\+/ /g;
print "'$location',$year,$val\n";
print O "'$location',$year,$val\n";
}
}
Update: The API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicable.
It would appear that your data is generated by Javascript and thus LWP cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/
I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.
You could use WWW::Mechanize::Firefox to process any site that could be loaded by Firefox.
http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples
You have to install the MozRepl plugin, and then you can process the web page content via this module. Basically, you will remotely control the browser.
Here is an example (untested, so it may or may not work as is):
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html');
my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@class="form-submit"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout" if 0 > $retries;
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.
This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.
use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;
# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";
# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
# This regular expression is like chomp, but removes both Windows and
# *nix line-endings, regardless of the system the script is running on.
$location =~ s/[\r\n]//g;
foreach my $year (1923..1923) {
# If you need to add quotes around the location, use "\"$location\"".
my %args = (LOCATION => $location, YEAR => $year);
my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';
# Note that values need to be doubly-escaped because of the
# weird way their website is set up (the "/proxy" URL we're
# requesting is subsequently loading some *other* URL which
# is provided to "/proxy" as a URL-encoded URL).
#
# This regular expression replaces any ^WHATEVER^ in the URL
# with the double-URL-encoded value of WHATEVER in %args.
# The /e flag causes the replacement to be evaluated as Perl
# code. This way I can look data up in a hash and do URL-encoding
# as part of the regular expression without an extra step.
$url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;
#print "$url\n";
# Create an HTTP request object for this URL.
my $request = HTTP::Request->new(GET => $url);
# This HTTP header is required. The server outputs garbage if
# it's not present.
$request->push_header('Content-Type' => 'application/json');
# Send the request and check for an error from the server.
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
# The response should be JSON.
my $obj = from_json($response->content);
my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
print O $str;
print $str;
}
}
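To make the double escaping above a bit more concrete, here is a small stand-alone illustration. The pct_encode helper is a made-up stand-in that behaves roughly like CGI's escape (or URI::Escape's uri_escape) for plain ASCII input; it is only here so the snippet runs without extra modules.

```perl
use strict;
use warnings;

# Percent-encode everything except the RFC 3986 "unreserved" characters.
# Assumes plain ASCII/byte input; pct_encode is a hypothetical stand-in
# for CGI::escape or URI::Escape::uri_escape.
sub pct_encode {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $s;
}

my $value = '"San Diego"';
my $once  = pct_encode($value);    # what the inner search URL needs
my $twice = pct_encode($once);     # what goes inside the uri= parameter
print "$once\n$twice\n";
# prints:
#   %22San%20Diego%22
#   %2522San%2520Diego%2522
```

The second pass only has to touch the percent signs produced by the first pass, which is why every %XX turns into %25XX.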
What about this simple script, without Firefox? I investigated the site a bit to understand how it works, and saw some JSON requests with the Firebug add-on for Firefox, so I know which URL to query to get the relevant data. Here is the code:
use strict; use warnings;
use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new();
open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;
# iterate over locations from locations2.txt file
while (my $place = <$fh>) {
# remove line ending
chomp $place;
# iterate over years
foreach my $year (1923..1925) {
# building URL with the variables
my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
my $request = HTTP::Request->new(GET => $url);
# faking referer (where we comes from)
$request->header('Referer', 'https://familysearch.org/search/collection/results');
# setting expected format header for response as JSON
$request->header('content_type', 'application/json');
my $response = $ua->request($request);
if ($response->code == 200) {
# this line convert a JSON to Perl HASH
my $hash = decode_json $response->content;
my $val = $hash->{totalHits};
print $fh2 "year $year, place $place : $val\n";
}
else {
die $response->status_line;
}
}
}
END{ close $fh; close $fh2; }
This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.
use 5.010;
use warnings;
use WWW::Mechanize::Firefox;
STDOUT->autoflush;
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);
$mech->get('about:blank');
$mech->get($url);
my $text;
while () {
sleep 1;
$text = $mech->xpath('//p[@class="num-search-results"]/text()', maybe => 1);
last if defined $text;
}
my $results = $text->{nodeValue};
say $results;
if ($results =~ /([\d,]+)\s+results/) {
(my $n = $1) =~ tr/,//d;
say $n;
}
output
1-20 of 1,975 results
1975
Update
This update is with special thanks to @nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.
Rather than making a request via the superfluous https://familysearch.org/proxy, this code accesses the server directly at https://familysearch.org/search/records, decodes the JSON and pulls the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second, more than ten times faster than the equivalent request from the basic web site) and stability (as you note, the site is very flaky; in contrast, I have never seen an error using this method).
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;
use autodie;
STDOUT->autoflush;
open my $fh, '<', 'locations26.txt';
my @locations = <$fh>;
chomp @locations;
open my $outfh, '>', 'out26.txt';
my $ua = LWP::UserAgent->new;
for my $county (@locations[36, 0..2]) {
for my $year (1923 .. 1926) {
my $total = familysearch_info($county, $year);
print STDOUT "$county,$year,$total\n";
print $outfh "$county,$year,$total\n";
}
print "\n";
}
sub familysearch_info {
my ($county, $year) = @_;
my $query = join ' ', (
'+event_place_level_1:California',
sprintf('+event_place_level_2:"%s"', $county),
sprintf('+birth_year:%1$d-%1$d~', $year),
'+gender:M',
'+race:White',
);
my $url = URI->new('https://familysearch.org/search/records');
$url->query_form(
collection_id => 2000219,
count => 20,
query => $query);
my $resp = $ua->get($url, 'Content-Type'=> 'application/json');
my $data = decode_json($resp->decoded_content);
return $data->{totalHits};
}
output
San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908
Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464
Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1
Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67
I do not know how to post revised code from the solution above.
This code does not (yet) work correctly. However, I have made some essential updates to head in that direction.
I would very much appreciate help with this updated code. I do not know how to post this code and this follow-up in a way that appeases the lords who run this site.
It gets stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html', synchronize => 0);
my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout while waiting for application" if 0 > $retries;
# Now the hourglass is not visible anymore
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
You should set the current form before accessing a field:
"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."
$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );
Sorry, I am not able to try this code out. Thanks for opening a new question for the new problem.

How can I add a progress bar to WWW::Mechanize?

I have the following code:
$mech->get($someurl, ":content_file" => "$i.flv");
So I'm getting the contents of a url and saving it as an flv file. I'd like to print out every second or so how much of the download is remaining. Is there any way to accomplish this in WWW::Mechanize?
WWW::Mechanize says that the get method is a "well-behaved" overload of LWP::UserAgent get. Looking at the docs for LWP::UserAgent, you can provide a content_cb key which is called with each chunk of the downloaded file:
$mech->get( $someurl, ":content_cb" => \&callback );
sub callback
{
my( $data, $response, $proto ) = @_;
# save $data to $i.flv
# print download notification
}
Many thanks to Peter Kovacs' answer for leading me to the correct answer. It turned out to be a bit more elaborate than I'd expected though so I decided to (horror) answer my own question.
As Peter showed, I can set a callback like so:
$m->get($u, ":content_cb" => \&callback);
But now I can't save the content using the :content_file value, because I can only choose one of the two. The callback function gets passed the data, and I ended up writing that to a file instead.
I also get a response object which contains the total size of the content as friedo pointed out. So by keeping a running total of content received so far and dividing it by the total content I can find out what percent of the content has been downloaded. Here's the full callback function:
use POSIX qw(floor); # floor() is not a builtin
open (VID,">$i.flv") or die "$!";
binmode VID; # the video is binary data
my $total = 0;
sub callback
{
my( $data, $response, $proto ) = @_;
print VID "$data"; # write data to file
$total+= length($data);
my $size = $response->header('Content-Length');
print floor(($total/$size)*100),"% downloaded\n"; # print percent downloaded
}
I hope that helps someone.
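One caveat with the percentage calculation: a server is not obliged to send a Content-Length header, in which case the division would fail. Below is a rough sketch of the same running-total idea with that case handled, kept free of any network code so it can be run as is. The name make_progress is invented for this example; in the real callback you would pass it the value of $response->header('Content-Length') and feed it length($data) for each chunk.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Returns a closure that is fed the length of each chunk and returns a
# progress message. $total_size may be undef or 0 when the server sent no
# Content-Length. (make_progress is a hypothetical helper name.)
sub make_progress {
    my ($total_size) = @_;
    my $received = 0;
    return sub {
        my ($chunk_length) = @_;
        $received += $chunk_length;
        return "$received bytes downloaded" if !$total_size;
        return floor( $received / $total_size * 100 ) . '% downloaded';
    };
}

# Simulated download of 2000 bytes arriving in three chunks.
my $progress = make_progress(2000);
print $progress->($_), "\n" for 500, 500, 1000;
```

Keeping the running total inside a closure also avoids the global $total, so several downloads can be tracked independently.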