Perl web scraping

I am a Perl beginner and I am passionate about web scraping using Perl. After spending a couple of hours I wrote the code below to scrape company names, addresses and telephone numbers from yell.com. The script works, and I have successfully collected one record (1 of 15 from page 1).
I need your valuable suggestions on how I can scrape all the companies on the first page in one go, so that I can move on to the other pages of data.
use strict;
use Data::Dumper;
use LWP::Simple; # from CPAN
use JSON qw( decode_json ); # from CPAN
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my $header = "company_name|Address|Telephone";
open (CH, ">output.csv");
print CH "$header\n";
my $url = "http://www.yell.com/ucs/UcsSearchAction.do?keywords=Engineering+consulatants&location=United+Kingdom&scrambleSeed=13724563&searchType=&M=&bandedclarifyResults=&ssm=1";
$mech->get($url);
my $con = $mech->content();
my $res = "";
############ for company name ##########
if ( $con =~ /<a data-omniture="LIST:COMPANYNAME" href="\/biz\/ross-davy-associates-grimsby-901271213\/" itemprop="name">(.*?)<\/a>/is ) {
$res = $1;
}
else {
$res = "Not_Match";
}
############### for address #########
my ($add1, $add2, $add3, $add4, $add) = ("", "", "", "", "");
if ( $con =~ /<span itemprop="streetAddress">(.*?)<\/span> <span itemprop="addressLocality">(.*?)<\/span> , <span itemprop="postalCode">(.*?)<\/span> , <span itemprop="addressRegion">(.*?)<\/span>/is ) {
$add1 = $1;
$add2 = $2;
$add3 = $3;
$add4 = $4;
$add = $1.$2.$3.$4;
}
else {
$add = "Not_Match";
}
########### telephone ##########
my $tel="";
if ( $con =~ /<li data-company-item="telephone" class="last"> Tel: <strong>(.*?)<\/strong> <\/li>/is ) {
$tel = $1;
}
else {
$tel = "Not_Match";
}
print "==$res===$add===$tel==\n";
print CH "$res|$add|$tel\n";

These points should help
Always use warnings as well as use strict
Always use the three-parameter form of open, test the success of every open call, and die with a string that includes the built-in variable $! so that you know why the open failed
Never use regular expressions for parsing HTML. There are several modules, such as HTML::TreeBuilder::XPath, that do the job properly and allow simple access to the contents of the data using XPath (see the sketch at the end of this answer)
Always make sure that extracting data like this is within the terms of service of the site in question.
With regard to the last point, the majority of sites prohibit any form of automated access and copying of their data. Yell.com is no different. Their conditions of use say this.
You cannot use the website ... using any automated means to monitor or copy the website or its content ...
So what you are doing opens you to the possibility of legal prosecution.
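That said, to illustrate the HTML::TreeBuilder::XPath point above, here is a rough sketch. The selectors are taken from the itemprop and data-* attributes visible in your own regexes; the listing container used here (//*[@itemscope]) is an assumption and would need checking against the real markup.
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

my $url  = "http://www.yell.com/ucs/UcsSearchAction.do?keywords=Engineering+consulatants&location=United+Kingdom&scrambleSeed=13724563&searchType=&M=&bandedclarifyResults=&ssm=1";
my $mech = WWW::Mechanize->new;
$mech->get($url);

my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->content);

# One iteration per listing, so every company on the page is captured
for my $listing ( $tree->findnodes('//*[@itemscope]') ) {
    my ($name) = map { $_->as_trimmed_text }
        $listing->findnodes('.//a[@itemprop="name"]');
    my $address = join ' ', map { $_->as_trimmed_text }
        $listing->findnodes('.//span[@itemprop="streetAddress" or @itemprop="addressLocality" or @itemprop="postalCode" or @itemprop="addressRegion"]');
    my ($tel) = map { $_->as_trimmed_text }
        $listing->findnodes('.//li[@data-company-item="telephone"]/strong');
    print join('|', $name // 'Not_Match', $address, $tel // 'Not_Match'), "\n";
}
$tree->delete;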

Related

Perl Web Login Script With CGI::Session

I have been stuck on the same problem for almost two weeks now.
I'm a newbie with Perl and the web :/
I followed the CGI::Session tutorial and Cookbook; the code seems to be good, but... it is not working.
index.cgi
#!/usr/bin/perl
use CGI;
use CGI::Cookie;
use HTML::Template;
use strict;
use warnings;
use CGI::Session;
use CGI::Carp qw(fatalsToBrowser);
require "cgi-bin/web_base.pl";
require "cgi-bin/login.pl";
my $cgi = new CGI;
my $session = new CGI::Session("driver:File", undef, {Directory=>'/tmp'}) or die CGI::Session->errstr;
my $CGISESSID = $session->id();
print header();
print $cgi->header();
print my_topbar();
login_attempt($session, $cgi);
if ( $session->param("~login-trials") >= 3 ) {
print error("You failed 3 times in a row.\n" .
"Your session is blocked. Please contact us with ".
"the details of your action");
exit(0);
}
unless ( $session->param("~logged-in") ) {
print login_form($cgi, $session);
exit(0);
}
print footer();
login.cgi
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use CGI::Cookie;
use HTML::Template;
use CGI::Session;
use CGI::Carp qw(fatalsToBrowser);
use Fcntl;
my $cgi = new CGI;
my $session = new CGI::Session(undef, $cgi, {Directory=>'/tmp'});
sub login_attempt {
my ($session, $cgi) = @_;
if ( $session->param("~logged-in") ) {
return 1; # User is already logged in.
}
my $username = $cgi->param("username") or return;
my $password=$cgi->param("password") or return;
# Form submitted. Try to load profile.
if ( my $profile = load_profile($username, $password) ) {
$session->param("~profile", $profile);
$session->param("~logged-in", 1);
print "YOUPIIIII";
$session->clear(["~login-trials"]);
$session->redirect("dashboard.cgi");
return 1;
}
# Failed to login, wrong credentials.
my $trials = $session->param("~login-trials") || 0;
return $session->param("~login-trials", ++$trials);
}
return 1;
sub load_profile {
my ($username, $password) = @_;
local $/ = "\n";
unless (sysopen(PROFILE, "profile.txt", O_RDONLY) ) {
die ("Couldn't open profile.txt: $!");
}
while ( <PROFILE> ) {
/^(\n|#)/ and next;
chomp;
my ($n, $p) = split /\s+/;
if ( ($n eq $username) && ($p eq $password) ) {
my $p_mask = "x" x length($p);
return {username=>$n, password=>$p_mask};
}
}
close(PROFILE);
return undef;
}
profile.txt
Formget 123
When I try to log in, nothing happens. Even when I try wrong credentials it should block me after 3 attempts, but it does not.
Can someone really help me with this? I can't take it anymore.
Feel free to ask any questions :)
EDIT :
-login_attempt() corrected
-load_profile() wasn't working; made a new one, but it still needs improvement.
-Last problem is the session initialisation.
Are you sure that you don't get any errors? Have you checked the web server error log?
You call login_attempt() with two parameters ($session and $cgi) but in login.cgi, that subroutine is defined like this:
sub login_attempt() {
...
}
You're (probably accidentally) using a prototype on that subroutine, telling Perl that it takes no parameters. So I'd be surprised if you don't get an error saying:
Too many arguments for main::login_attempt
Remove the parentheses from that definition.
sub login_attempt {
...
}
Update: I think you're missing one very important step here. From the CGI::Session Tutorial:
There is one small, but very important thing your application needs to perform after creating CGI::Session object as above. It needs to drop Session ID as an HTTP cookie into the user's computer. CGI::Session will use this cookie to identify the user at his/her next request and will be able to load his/her previously stored session data.
To make sure CGI::Session will be able to read your cookie at next request you need to consult its name() method for cookie's suggested name:
$cookie = $query->cookie( -name => $session->name,
-value => $session->id );
print $query->header( -cookie=>$cookie );
name() returns CGISESSID by default. If you prefer a different cookie name, you can change it as easily too, but you have to do it before CGI::Session object is created:
CGI::Session->name("SID");
$session = CGI::Session->new();
Baking the cookie wasn't too difficult, was it? But there is an even easier way to send a cookie using CGI::Session:
print $session->header();
The above will create the cookie using CGI::Cookie and will return proper HTTP headers using CGI.pm's header() method. Any arguments passed to CGI::Session's header() will be forwarded to CGI::header().
Without this, you'll be creating a brand new session for each request.
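To make that concrete, here is a minimal sketch of how the top of index.cgi could look. The constructor is passed $cgi rather than undef so that an existing CGISESSID cookie or parameter is picked up, and the two separate header calls from the question are replaced by a single $session->header():
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use CGI::Session;

my $cgi = CGI->new;

# Passing the CGI object lets CGI::Session find the session ID from the
# previous request's cookie (or query parameter) automatically.
my $session = CGI::Session->new("driver:File", $cgi, { Directory => '/tmp' })
    or die CGI::Session->errstr;

# Exactly one header for the response; this also bakes the session cookie.
print $session->header();

# ... my_topbar(), login_attempt(), the login form and footer() as before ...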

Perl Script stops running

I am trying to get a Perl script running (in a Windows cmd window), but it always just stops working at a certain point. How can I find out why it will not go on?
Here is the script. The last thing that I can see being executed is "get_html_source()" in line 37.
#!/usr/bin/perl
# Perl script that scrapes the members of the Hellenic Parliament
# Created by Kostas Ntonas, 03 May 2013 - http://ntonas.gr
# http://deixto.blogspot.gr/2013/05/scraping-members-of-greek-parliament.html
use strict;
use warnings;
use utf8;
use IO::File;
use POSIX qw(tmpnam);
use DEiXToBot;
use WWW::Selenium;
my $agent = DEiXToBot->new(); # create the DEiXToBot agent object
# launch a Firefox instance
my $sel = WWW::Selenium->new( host => "localhost",
port => 4444,
browser => "*firefox",
browser_url => "http://www.hellenicparliament.gr/"
);
$sel->start;
for my $i (1..30) {
my $url = "http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia?pageNo=$i";
$sel->open($url);
$sel->wait_for_page_to_load(5000);
$sel->pause(1);
print "$i) $url\n";
my $content = $sel->get_html_source();
my ($fh,$name); # create a temporary file containing the page's source code
do { $name = tmpnam() } until $fh = IO::File->new($name, O_RDWR|O_CREAT|O_EXCL);
binmode( $fh, ':utf8' );
print $fh $content;
close $fh;
$agent->get("file://$name"); # load the temporary file/page with the DEiXToBot agent using the file:// scheme
unlink $name; # delete the temporary file, it is not needed any more
if (! $agent->success) { die "Could not fetch the temp file!\n"; }
$agent->build_dom();
$agent->load_pattern('C:\Users\XXX\Documents\Privat\MyCase3\Deixto Patterns\parliament_CVs.xml');
$agent->extract_content();
if (! $agent->hits) {
die "Could not find any MPs/ records!\n";
}
else {
for my $record ($agent->records) {
my @rec = @$record;
my $party;
my $logo = $rec[0];
# deduce the party name from the logo in the first column of the table
if ($logo=~m#ND_Logo#) { $party = "N.D. (New Democracy)"; }
elsif ($logo=~m#COALITION#) { $party = "SYRIZA Unitary Social Front"; }
elsif ($logo=~m#PASOK#) { $party = "PA.SO.K. (Panhellenic Socialist Movement)"; }
elsif ($logo=~m#ANEKS_ELL#) { $party = "ANEXARTITOI ELLINES (Independent Hellenes)"; }
elsif ($logo=~m#xrisi#) { $party = "LAIKOS SYNDESMOS - CHRYSI AVGI (People's Association - Golden Dawn)"; }
elsif ($logo=~m#small#) { $party = "DHM.AR (Democratic Left)"; }
elsif ($logo=~m#KKE#) { $party = "K.K.E. (Communist Party of Greece)"; }
elsif ($logo=~m#INDEPENDENT#) { $party = "INDEPENDENT"; }
else { die "$logo => Unknown logo!\n"; }
$rec[0] = $party;
$rec[3]=~s#\s+# #g; # replace whitespace characters with a single space
# append the data in a tab delimited text file
open my $fh, ">>:utf8", "MPs.txt" or die "Cannot open MPs.txt: $!";
print $fh join("\t", @rec)."\n";
close $fh;
}
}
}
$sel->stop;
Do you know that the code is dying inside get_html_source, or is it actually dying immediately before or after (e.g. in the call to tmpnam, which seems to be missing a semi-colon)?
Another comment is that this seems like a lot of work just to scrape the list of MPs and their parties. If you look at the page source there is a huge block of base-64 encoded text that appears to have all the data you need. So you might find it quicker to load the page, decode the block and have everything you need.
The tmpnam function is provided by the POSIX Perl module. It should work fine on most variants of Unix/Linux but it seems to be broken under Windows.
I suggest replacing the "problematic" line containing the tmpnam call with the following:
use File::Temp qw/ tempfile /;
($fh,$name) = tempfile();
Hopefully this change will fix the issue and allow the script to complete.
This is also what the Perl tmpnam documentation (http://perldoc.perl.org/POSIX.html) suggests: "For security reasons, which are probably detailed in your system's documentation for the C library tmpnam() function, this interface should not be used; instead see File::Temp".
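For reference, here is a sketch of the temporary-file block rewritten with File::Temp, keeping the :utf8 layer from the original script:
use File::Temp qw(tempfile);

# tempfile() creates and opens a unique file in one step, so the
# do/until retry loop around tmpnam() is no longer needed.
my ($fh, $name) = tempfile();
binmode($fh, ':utf8');
print $fh $content;
close $fh;

$agent->get("file://$name");   # load the temporary page with the DEiXToBot agent
unlink $name;                  # the temporary file is not needed any more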

Web-crawler optimization

I am building a basic search engine using the vector-space model, and this is the crawler that retrieves 500 URLs and removes the SGML tags from the content. However, it is very slow (it takes more than 30 minutes to retrieve the URLs alone). How can I optimize the code? I have used wikipedia.org as an example starting URL.
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;
my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my @urls = ($starting_url);
my %alreadyvisited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);
my $url_count = 0;
while (@urls)
{
my $url = shift @urls;
next if $alreadyvisited{$url}; ## check if already visited
my $request = HTTP::Request->new(GET => $url);
my $response = $browser->request($request);
if ($response->is_error())
{
print $response->status_line, "\n"; ## check for bad URL
}
my $contents = $response->content(); ## get contents from URL
push @c, $contents;
my @text = &RemoveSGMLtags(\@c);
#print "@text\n";
$alreadyvisited{$url} = 1; ## store URL in hash for future reference
$url_count++;
print "$url\n";
if ($url_count == 500) ## exit if number of crawled pages exceed limit
{
exit 0;
}
my ($page_parser) = HTML::LinkExtor->new(undef, $url);
$page_parser->parse($contents)->eof; ## parse page contents
my @links = $page_parser->links;
foreach my $link (@links)
{
$test = $$link[2];
$test =~ s!^https?://(?:www\.)?!!i;
$test =~ s!/.*!!;
$test =~ s/[\?\#\:].*//;
if ($test eq "en.wikipedia.org") ## check if URL belongs to unt domain
{
next if ($$link[2] =~ m/^mailto/);
next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
push @urls, $$link[2];
}
}
sleep 1;
}
sub RemoveSGMLtags
{
my ($input) = @_;
my @INPUTFILEcontent = @$input;
my $j; my @raw_text;
for ($j=0; $j<=$#INPUTFILEcontent; $j++)
{
my $INPUTFILEvalue = $INPUTFILEcontent[$j];
use HTML::Parse;
use HTML::FormatText;
my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
push @raw_text, $plain_text;
}
return @raw_text;
}
Always use strict
Never use the ampersand & on subroutine calls
Use URI to manipulate URLs
You have a sleep 1 in there, which I assume is to avoid hammering the site too much, which is good. But the bottleneck in almost any web-based application is the internet itself, and you won't be able to make your program any faster without requesting more from the site. That means removing your sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a way you should be going?
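If you do decide that parallel requests are appropriate, here is a rough sketch of the idea. It uses Parallel::ForkManager rather than the LWP::Parallel::RobotUA mentioned above, purely because its interface is compact; @batch is assumed to hold the next few URLs taken off your @urls queue (here it is read from the command line so the snippet runs on its own):
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

my @batch = @ARGV;                        # e.g. the next few URLs spliced off the @urls queue
my $pm = Parallel::ForkManager->new(5);   # at most 5 requests in flight
my %content;                              # url => page content

# Collect each child's result as it finishes
$pm->run_on_finish(sub {
    my ($pid, $exit, $url, $signal, $core, $data) = @_;
    $content{$url} = $$data if $data;
});

for my $url (@batch) {
    $pm->start($url) and next;            # parent: move on to the next URL
    my $res = LWP::UserAgent->new(timeout => 5)->get($url);
    $pm->finish(0, $res->is_success ? \($res->decoded_content) : undef);
}
$pm->wait_all_children;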
Use WWW::Mechanize, which handles all the URL parsing and extraction for you. It is so much easier than all the link parsing you're dealing with. It was created specifically for the sort of thing you're doing, and it is a subclass of LWP::UserAgent, so you should be able to change every LWP::UserAgent in your code to WWW::Mechanize without changing anything else. The link extraction then becomes as simple as this:
my $mech = WWW::Mechanize->new();
$mech->get( 'someurl.com' );
my @links = $mech->links;
and then @links is an array of WWW::Mechanize::Link objects.
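Putting the URI and WWW::Mechanize advice together, the domain check in your crawler can also be done without the hand-rolled substitutions. A sketch, using the same Wikipedia starting URL:
use strict;
use warnings;
use WWW::Mechanize;
use URI;

my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get('http://en.wikipedia.org/wiki/Main_Page');

for my $link ($mech->links) {
    my $uri = URI->new_abs($link->url, $mech->uri);
    next unless $uri->scheme && $uri->scheme =~ /^https?$/;   # skips mailto:, javascript:, ...
    next unless $uri->host eq 'en.wikipedia.org';             # stay on the one domain
    print $uri->canonical, "\n";                              # or push onto the queue here
}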

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
That page shows the number of white men born in 1923 who lived in San Diego County, California, in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.
In the file, locations.txt, I put the list of counties, such as San Diego County.
The current code runs, but instead of the number 1975 it displays "unknown". The number 1975 should end up in $val.
I would very much appreciate any help!
#!/usr/bin/perl
use strict;
use LWP::Simple;
open(L, "locations26.txt");
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
chomp($location);
$location =~ s/ /+/g;
foreach my $year (1923..1923) {
my $u = $url;
$u =~ s/%LOCATION%/$location/;
$u =~ s/%YEAR%/$year/;
#print "$u\n";
my $content = get($u);
my $val = 'unknown';
if ($content =~ / of .strong.([0-9,]+)..strong. /) {
$val = $1;
}
$val =~ s/,//g;
$location =~ s/\+/ /g;
print "'$location',$year,$val\n";
print O "'$location',$year,$val\n";
}
}
Update: The API is not a viable solution. I have been in contact with the site developer: the API does not apply to that part of the web page. Hence, any solution pertaining to JSON will not be applicable.
It would appear that your data is generated by Javascript and thus LWP cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/
I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.
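Untested against the live site, but here is a sketch of how those pieces fit together, borrowing the /search/records endpoint and the totalHits field that appear in the answers further down; the query string is the same one the site builds:
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::URL;

# Build the query URL with Mojo::URL instead of string substitution
my $url = Mojo::URL->new('https://familysearch.org/search/records');
$url->query(
    collection_id => 2000219,
    count         => 20,
    query         => '+event_place_level_1:California +event_place_level_2:"San Diego" +birth_year:1923-1923~ +gender:M +race:White',
);

my $ua = Mojo::UserAgent->new;
my $tx = $ua->get($url => { 'Content-Type' => 'application/json' });

# Mojo::JSON decodes the response body for us
my $total = $tx->result->json->{totalHits};
print "$total\n";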
You could use WWW::Mechanize::Firefox to process any site that could be loaded by Firefox.
http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples
You have to install the MozRepl plugin, and then you will be able to process the web page content via this module. Basically you will "remotely control" the browser.
Here is an example (untested, so it may need tweaking):
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html');
my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@class="form-submit"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout" if 0 > $retries;
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.
This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.
use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;
# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";
# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
# This regular expression is like chomp, but removes both Windows and
# *nix line-endings, regardless of the system the script is running on.
$location =~ s/[\r\n]//g;
foreach my $year (1923..1923) {
# If you need to add quotes around the location, use "\"$location\"".
my %args = (LOCATION => $location, YEAR => $year);
my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';
# Note that values need to be doubly-escaped because of the
# weird way their website is set up (the "/proxy" URL we're
# requesting is subsequently loading some *other* URL which
# is provided to "/proxy" as a URL-encoded URL).
#
# This regular expression replaces any ^WHATEVER^ in the URL
# with the double-URL-encoded value of WHATEVER in %args.
# The /e flag causes the replacement to be evaluated as Perl
# code. This way I can look data up in a hash and do URL-encoding
# as part of the regular expression without an extra step.
$url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;
#print "$url\n";
# Create an HTTP request object for this URL.
my $request = HTTP::Request->new(GET => $url);
# This HTTP header is required. The server outputs garbage if
# it's not present.
$request->push_header('Content-Type' => 'application/json');
# Send the request and check for an error from the server.
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
# The response should be JSON.
my $obj = from_json($response->content);
my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
print O $str;
print $str;
}
}
What about this simple script without Firefox? I investigated the site a bit to understand how it works, and I saw some JSON requests with the Firebug Firefox add-on, so I know which URL to query to get the relevant data. Here is the code:
use strict; use warnings;
use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new();
open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;
# iterate over locations from locations2.txt file
while (my $place = <$fh>) {
# remove line ending
chomp $place;
# iterate over years
foreach my $year (1923..1925) {
# building URL with the variables
my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
my $request = HTTP::Request->new(GET => $url);
# faking referer (where we comes from)
$request->header('Referer', 'https://familysearch.org/search/collection/results');
# setting expected format header for response as JSON
$request->header('content_type', 'application/json');
my $response = $ua->request($request);
if ($response->code == 200) {
# this line convert a JSON to Perl HASH
my $hash = decode_json $response->content;
my $val = $hash->{totalHits};
print $fh2 "year $year, place $place : $val\n";
}
else {
die $response->status_line;
}
}
}
END{ close $fh; close $fh2; }
This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.
use 5.010;
use warnings;
use WWW::Mechanize::Firefox;
STDOUT->autoflush;
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);
$mech->get('about:blank');
$mech->get($url);
my $text;
while () {
sleep 1;
$text = $mech->xpath('//p[@class="num-search-results"]/text()', maybe => 1);
last if defined $text;
}
my $results = $text->{nodeValue};
say $results;
if ($results =~ /([\d,]+)\s+results/) {
(my $n = $1) =~ tr/,//d;
say $n;
}
output
1-20 of 1,975 results
1975
Update
This update is with special thanks to @nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.
Rather than making a request via the superfluous https://familysearch.org/proxy this code accesses the server directly at https://familysearch.org/search/records, decodes the JSON and dumps the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second - more than ten times faster than with the equivalent request from the basic web site) and stability (as you note, the site is very flaky - in contrast I have never seen an error using this method).
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;
use autodie;
STDOUT->autoflush;
open my $fh, '<', 'locations26.txt';
my @locations = <$fh>;
chomp @locations;
open my $outfh, '>', 'out26.txt';
my $ua = LWP::UserAgent->new;
for my $county (@locations[36, 0..2]) {
for my $year (1923 .. 1926) {
my $total = familysearch_info($county, $year);
print STDOUT "$county,$year,$total\n";
print $outfh "$county,$year,$total\n";
}
print "\n";
}
sub familysearch_info {
my ($county, $year) = @_;
my $query = join ' ', (
'+event_place_level_1:California',
sprintf('+event_place_level_2:"%s"', $county),
sprintf('+birth_year:%1$d-%1$d~', $year),
'+gender:M',
'+race:White',
);
my $url = URI->new('https://familysearch.org/search/records');
$url->query_form(
collection_id => 2000219,
count => 20,
query => $query);
my $resp = $ua->get($url, 'Content-Type'=> 'application/json');
my $data = decode_json($resp->decoded_content);
return $data->{totalHits};
}
output
San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908
Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464
Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1
Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67
I do not know how to post revised code from the solution above.
This code does not (yet) compile correctly. However, I have made some essential updates that definitely head in that direction.
I would very much appreciate help with this updated code. I do not know how to post this code and this follow-up in a way that appeases the lords who run this site.
It gets stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html', synchronize => 0);
my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout while waiting for application" if 0 > $retries;
# Now the hourglass is not visible anymore
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
You should set the current form before accessing a field:
"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."
$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );
Sorry, I am not able to try this code out, and thanks for opening a new question for the follow-up problem.

Downloading Text from Several Links using WWW::Mechanize

For an entire week I have been attempting to write code that will download the links from a web page and then loop through each link to dump the content of each link's page. The original web page I downloaded has 500 links to separate web pages, each of which contains important information for me. I only want to go one level down. However, I am having several issues.
RECAP: I want to download the links from a webpage and automatically have my program print off the text contained in those links. I would prefer to have them printed in a file.
1) When I download the links from the original website, the useful ones are not written out in full (i.e. they say "/festevents.nsf/all?openform", which is not a usable web address on its own).
2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.
#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;
#Download original webpage and acquire 500+ Links#
$url = "http://wx.toronto.ca/festevents.nsf/all?openform";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my $title = $mechanize->title;
print "<b>$title</b><br />";
my @links = $mechanize->links;
foreach my $link (@links) {
# Retrieve the link URL
my $href = $link->url_abs;
#
# $URL1= get("$link");
#
my $ua = LWP::UserAgent->new;
my $response = $ua->get($href);
unless($response->is_success) {
die $response->status_line;
}
my $URL1 = $response->decoded_content;
die Dumper($URL1);
#This part of the code is just to "clean up" the text
$Format=HTML::FormatText->new;
$TreeBuilder=HTML::TreeBuilder->new;
$TreeBuilder->parse($URL1);
$Parsed=$Format->format($TreeBuilder);
open(FILE, ">TorontoParties.txt");
print FILE "$Parsed";
close (FILE);
}
Please help me! I am desperate! If possible, please explain the logic behind each step. I have been frying my brain on this for a week and I want help seeing other people's logic behind these problems.
Too much work. Study the WWW::Mechanize API to realise that almost all of that functionality is already built-in. Untested:
use strictures;
use WWW::Mechanize qw();
use autodie qw(:all);
open my $h, '>:encoding(UTF-8)', 'TorontoParties.txt';
my $mechanize = WWW::Mechanize->new;
$mechanize->get('http://wx.toronto.ca/festevents.nsf/all?openform');
foreach my $link (
$mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
$mechanize->get($link->url_abs);
print {$h} $mechanize->content(format => 'text');
}
close $h;
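One note on that rewrite: $mechanize->content(format => 'text') relies on HTML::TreeBuilder being installed to do the HTML-to-text conversion. If you would rather keep the HTML::FormatText approach from your own script, the loop could look like this instead (an untested sketch):
use HTML::TreeBuilder;
use HTML::FormatText;

foreach my $link (
    $mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
    $mechanize->get($link->url_abs);
    # Same idea as content(format => 'text'), but done explicitly with the
    # modules the original script already loads.
    my $tree = HTML::TreeBuilder->new_from_content($mechanize->content);
    print {$h} HTML::FormatText->new->format($tree);
    $tree->delete;   # free the parse tree
}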