Web collection from Google data page

I am trying to collect the data from this page:
https://datastudio.google.com/reporting/c07bc5cf-5c09-4156-a903-3e7acd02721a/page/ql6IC
I usually use Perl with LWP to GET the page and then parse it, but this page does not return the visible elements, just the initial Google goo.
I am looking to grab the number of confirmed cases and the "date updated" value at the bottom of the page.
Thanks in advance!

Here is an example of how to access the data after JavaScript has modified the DOM, using Selenium::Chrome:
use strict;
use warnings;
use Selenium::Chrome;
# Enter your driver path here. See https://sites.google.com/a/chromium.org/chromedriver/
# for download instructions
my $driver_path = '/home/hakon/chromedriver/chromedriver';
my $driver = Selenium::Chrome->new( binary => $driver_path );
$driver->get("https://datastudio.google.com/reporting/c07bc5cf-5c09-4156-a903-3e7acd02721a/page/ql6IC");
sleep 5; # modify this sleep period such that the page is fully loaded before continuing
my $elem = $driver->find_element_by_class_name('tableBody');
# Do something with the table..
Update
To avoid specifying the sleep limit above you can use the wait_until() function from Selenium::Waiter, for example:
use feature qw(say);
use strict;
use warnings;
use Selenium::Chrome;
use Selenium::Waiter;
# Enter your driver path here. See https://sites.google.com/a/chromium.org/chromedriver/
# for download instructions
my $driver_path = '/home/hakon/chromedriver/chromedriver';
my $driver = Selenium::Chrome->new(
binary => $driver_path,
# avoid printing error from find_element_by_class_name() when class is not found,
# see the wait_until() call below. The error message will be of the form:
#
# "Error while executing command: no such element: no such element:
# Unable to locate element"
#
error_handler => sub { my $msg = $_[1]; die $msg if $msg !~ /\Qno such element\E/}
);
$driver->get("https://datastudio.google.com/reporting/c07bc5cf-5c09-4156-a903-3e7acd02721a/page/ql6IC");
my $timeouts = $driver->get_timeouts();
say "Current implicit timeout = ", $timeouts->{implicit};
$driver->set_implicit_wait_timeout(0);
say "Updated implicit wait timeout to 0 ms";
my $timeout = 30;
my $start_time = time;
my $elem = wait_until {
$driver->find_element_by_class_name('tableBody')
} timeout => $timeout, interval => 1;
if ( $elem ) {
my $elapsed = time - $start_time;
say "Found element after $elapsed seconds";
}
else {
say "Could not find tableBody element after $timeout seconds";
}

Related

Progress indicator for Perl LWP POST upload

I'm working on a Perl script which uploads big files with a POST request. My question is if it's possible to have a status output, because uploading big files can take some time with my internet connection.
I mean like a status bar with
$| = 1;
print "\r|----------> | 33%";
print "\r|--------------------> | 66%";
print "\r|------------------------------| 100%\n";
Here's my upload code:
my $ua=LWP::UserAgent->new();
$file = "my_big_holyday_vid.mp4";
$user = "username";
$pass = "password";
print "starting Upload...\n";
$res = $ua->post(
"http://$server",
Content_Type => 'form-data',
Content =>[
fn => ["$file" => $file],
username => $user,
password => $pass,
],
);
print "Upload complete!\n"

If you look at the documentation for HTTP::Request::Common you will see that, if you set $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD to a true value, then the request object's content method will provide a callback that is used to fetch the data in chunks.
Normally this is called each time more data is needed for upload, but you can wrap it in your own subroutine to monitor the progress of the upload.
The program below gives an example. As you can see, the HTTP::Request object is created (I have assumed that the fn field should be just [$file]) and the content method is used to fetch the callback subroutine.
The subroutine wrapper just calls $callback in the first line to fetch the next data chunk, and returns it in the last line, just as $callback itself would do. Between these two lines you can add what you like, as long as it doesn't interfere with passing the chunk back to LWP. In this case I have printed the size of each chunk together with the percentage upload so far on each call.
For the purpose of percentage calculations, the full size of the file is accessible as $req->header('content-length'), which is more correct than using -s on the file.
Also, the final iteration can be detected if necessary, as the callback will return a chunk with a size of zero.
Note that this is untested except as far as it compiles and does roughly the right thing, as I have no internet service available that expects a file upload.
use strict;
use warnings;
use LWP;
use HTTP::Request::Common;
$HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;
my $ua = LWP::UserAgent->new;
my $server = 'example.com';
my $file = 'my_big_holyday_vid.mp4';
my ($user, $pass) = qw/ username password /;
print "Starting Upload...\n";
my $req = POST "http://$server",
Content_Type => 'form-data',
Content => [
fn => [$file],
username => $user,
password => $pass,
];
my $total;
my $callback = $req->content;
my $size = $req->header('content-length');
$req->content(\&wrapper);
my $resp = $ua->request($req);
sub wrapper {
my $chunk = $callback->();
if ($chunk) {
my $length = length $chunk;
$total += $length;
printf "%+5d = %5.1f%%\n", $length, $total / $size * 100;
}
else {
print "Completed\n";
}
$chunk;
}
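To turn the per-chunk numbers above into the kind of status bar the question asks for, a small formatting helper is enough. The following is a minimal sketch and not part of the answer's code: the name progress_bar and its width parameter are assumptions made for illustration.

```perl
use strict;
use warnings;

# Build one line of a text progress bar, e.g. "|---->     |  50%".
# $done and $total are byte counts; $width is the number of characters
# between the two '|' delimiters (defaults to 30).
sub progress_bar {
    my ($done, $total, $width) = @_;
    $width //= 30;
    my $fraction = $total ? $done / $total : 1;
    $fraction = 1 if $fraction > 1;
    my $filled = int($fraction * $width);
    my $bar = $filled >= $width ? '-' x $width
            : $filled > 0       ? '-' x ($filled - 1) . '>' . ' ' x ($width - $filled)
            :                     ' ' x $width;
    return sprintf '|%s| %3d%%', $bar, int($fraction * 100);
}
```

Inside wrapper, the printf line could then be replaced by something like print "\r", progress_bar($total, $size); with $| = 1 set so that each update is flushed immediately.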

Web-crawler optimization

I am building a basic search engine using the vector-space model. This is the crawler, which returns 500 URLs and removes the SGML tags from the content. However, it is very slow (it takes more than 30 minutes just to retrieve the URLs). How can I optimize the code? I have inserted wikipedia.org as an example starting URL.
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;
my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my @urls = $starting_url;
my %alreadyvisited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);
my $url_count = 0;
while (@urls)
{
my $url = shift @urls;
next if $alreadyvisited{$url}; ## check if already visited
my $request = HTTP::Request->new(GET => $url);
my $response = $browser->request($request);
if ($response->is_error())
{
print $response->status_line, "\n"; ## check for bad URL
}
my $contents = $response->content(); ## get contents from URL
push @c, $contents;
my @text = &RemoveSGMLtags(\@c);
#print "@text\n";
$alreadyvisited{$url} = 1; ## store URL in hash for future reference
$url_count++;
print "$url\n";
if ($url_count == 500) ## exit if number of crawled pages exceed limit
{
exit 0;
}
my ($page_parser) = HTML::LinkExtor->new(undef, $url);
$page_parser->parse($contents)->eof; ## parse page contents
my @links = $page_parser->links;
foreach my $link (@links)
{
$test = $$link[2];
$test =~ s!^https?://(?:www\.)?!!i;
$test =~ s!/.*!!;
$test =~ s/[\?\#\:].*//;
if ($test eq "en.wikipedia.org") ## check if URL belongs to unt domain
{
next if ($$link[2] =~ m/^mailto/);
next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
push @urls, $$link[2];
}
}
sleep 1;
}
sub RemoveSGMLtags
{
my ($input) = @_;
my @INPUTFILEcontent = @$input;
my $j;my #raw_text;
for ($j=0; $j<$#INPUTFILEcontent; $j++)
{
my $INPUTFILEvalue = $INPUTFILEcontent[$j];
use HTML::Parse;
use HTML::FormatText;
my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
push @raw_text, ($plain_text);
}
return @raw_text;
}

Always use strict
Never use the ampersand & on subroutine calls
Use URI to manipulate URLs
You have a sleep 1 in there, which I assume is to avoid hammering the site too much, which is good. But the bottleneck in almost any web-based application is the internet itself, and you won't be able to make your program any faster without requesting more from the site. That means removing your sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a way you should be going?
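As an illustration of the "use URI" advice, the chain of regex substitutions that decides whether a link belongs to the target host can be replaced by a few URI method calls. This is a minimal sketch under stated assumptions: the helper name same_host_url and its argument order are invented for this example, and it only covers http and https links.

```perl
use strict;
use warnings;
use URI;

# Return a normalized absolute URL if $link (possibly relative to $base)
# points at $wanted_host over http or https; otherwise return nothing.
sub same_host_url {
    my ($link, $base, $wanted_host) = @_;
    my $uri    = URI->new_abs($link, $base);
    my $scheme = $uri->scheme // '';
    return unless $scheme eq 'http' || $scheme eq 'https';   # skips mailto: etc.
    return unless lc($uri->host // '') eq lc $wanted_host;
    $uri->fragment(undef);   # "#section" anchors do not name a new page
    return $uri->canonical->as_string;
}
```

Dropping the fragment before storing the URL also keeps the already-visited hash from treating anchors within one page as separate pages, which reduces duplicate fetches.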

Use WWW::Mechanize which handles all the URL parsing and extraction for you. So much easier than all the link parsing you're dealing with. It was created specifically for the sort of thing you're doing, and it's a subclass of LWP::UserAgent so you should just be able to change all your LWP::UserAgent to WWW::Mechanize without having to change any code, except for all the link extraction, so you can do this:
my $mech = WWW::Mechanize->new();
$mech->get( 'someurl.com' );
my @links = $mech->links;
and then @links is an array of WWW::Mechanize::Link objects.

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
That number is the count of white men born in 1923 who lived in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.
In the file, locations.txt, I put the list of counties, such as San Diego County.
The current code runs, but instead of the # 1975, it displays unknown. The number 1975 should be in $val\n.
I would very much appreciate any help!
#!/usr/bin/perl
use strict;
use LWP::Simple;
open(L, "locations26.txt");
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
chomp($location);
$location =~ s/ /+/g;
foreach my $year (1923..1923) {
my $u = $url;
$u =~ s/%LOCATION%/$location/;
$u =~ s/%YEAR%/$year/;
#print "$u\n";
my $content = get($u);
my $val = 'unknown';
if ($content =~ / of .strong.([0-9,]+)..strong. /) {
$val = $1;
}
$val =~ s/,//g;
$location =~ s/\+/ /g;
print "'$location',$year,$val\n";
print O "'$location',$year,$val\n";
}
}
Update: The API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicable.

It would appear that your data is generated by JavaScript, so LWP alone cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/
I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.

You could use WWW::Mechanize::Firefox to process any site that could be loaded by Firefox.
http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples
You have to install the MozRepl plugin, and then you will be able to process the web page content via this module. Basically, you will "remotely control" the browser.
Here is an example (maybe working)
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html');
my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@class="form-submit"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout" if 0 > $retries;
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});

If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.
This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.
use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;
# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";
# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
# This regular expression is like chomp, but removes both Windows and
# *nix line-endings, regardless of the system the script is running on.
$location =~ s/[\r\n]//g;
foreach my $year (1923..1923) {
# If you need to add quotes around the location, use "\"$location\"".
my %args = (LOCATION => $location, YEAR => $year);
my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';
# Note that values need to be doubly-escaped because of the
# weird way their website is set up (the "/proxy" URL we're
# requesting is subsequently loading some *other* URL which
# is provided to "/proxy" as a URL-encoded URL).
#
# This regular expression replaces any ^WHATEVER^ in the URL
# with the double-URL-encoded value of WHATEVER in %args.
# The /e flag causes the replacement to be evaluated as Perl
# code. This way I can look data up in a hash and do URL-encoding
# as part of the regular expression without an extra step.
$url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;
#print "$url\n";
# Create an HTTP request object for this URL.
my $request = HTTP::Request->new(GET => $url);
# This HTTP header is required. The server outputs garbage if
# it's not present.
$request->push_header('Content-Type' => 'application/json');
# Send the request and check for an error from the server.
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
# The response should be JSON.
my $obj = from_json($response->content);
my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
print O $str;
print $str;
}
}
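The double escaping described in the comments above can be checked in isolation. The sketch below uses uri_escape from URI::Escape rather than the CGI escape function used in the answer (a substitution made here for illustration); the helper name double_escape is likewise an assumption.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Escape a value twice: once for the inner search URL, and once more
# because that whole URL is itself passed as a query parameter to /proxy.
sub double_escape {
    my ($value) = @_;
    return uri_escape(uri_escape($value));
}

print double_escape('San Diego'), "\n";     # San%2520Diego
print double_escape('"San Diego"'), "\n";   # %2522San%2520Diego%2522
```

The second form, with the quotes escaped to %2522, matches the pattern that appears literally in the proxy URLs used on this page.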

What about this simple script, without Firefox? I investigated the site a bit to understand how it works and saw some JSON requests with the Firebug add-on for Firefox, so I know which URL to query to get the relevant data. Here is the code:
use strict; use warnings;
use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new();
open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;
# iterate over locations from locations2.txt file
while (my $place = <$fh>) {
# remove line ending
chomp $place;
# iterate over years
foreach my $year (1923..1925) {
# building URL with the variables
my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
my $request = HTTP::Request->new(GET => $url);
# faking the Referer header (where the request claims to come from)
$request->header('Referer', 'https://familysearch.org/search/collection/results');
# setting expected format header for response as JSON
$request->header('content_type', 'application/json');
my $response = $ua->request($request);
if ($response->code == 200) {
# this line converts the JSON text to a Perl hash reference
my $hash = decode_json $response->content;
my $val = $hash->{totalHits};
print $fh2 "year $year, place $place : $val\n";
}
else {
die $response->status_line;
}
}
}
END{ close $fh; close $fh2; }

This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.
use 5.010;
use warnings;
use WWW::Mechanize::Firefox;
STDOUT->autoflush;
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);
$mech->get('about:blank');
$mech->get($url);
my $text;
while () {
sleep 1;
$text = $mech->xpath('//p[@class="num-search-results"]/text()', maybe => 1);
last if defined $text;
}
my $results = $text->{nodeValue};
say $results;
if ($results =~ /([\d,]+)\s+results/) {
(my $n = $1) =~ tr/,//d;
say $n;
}
output
1-20 of 1,975 results
1975
Update
This update is with special thanks to @nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.
Rather than making a request via the superfluous https://familysearch.org/proxy, this code accesses the server directly at https://familysearch.org/search/records, decodes the JSON and pulls the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second, more than ten times faster than the equivalent request from the basic web site) and stability (as you note, the site is very flaky; in contrast, I have never seen an error using this method).
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;
use autodie;
STDOUT->autoflush;
open my $fh, '<', 'locations26.txt';
my @locations = <$fh>;
chomp @locations;
open my $outfh, '>', 'out26.txt';
my $ua = LWP::UserAgent->new;
for my $county (@locations[36, 0..2]) {
for my $year (1923 .. 1926) {
my $total = familysearch_info($county, $year);
print STDOUT "$county,$year,$total\n";
print $outfh "$county,$year,$total\n";
}
print "\n";
}
sub familysearch_info {
my ($county, $year) = @_;
my $query = join ' ', (
'+event_place_level_1:California',
sprintf('+event_place_level_2:"%s"', $county),
sprintf('+birth_year:%1$d-%1$d~', $year),
'+gender:M',
'+race:White',
);
my $url = URI->new('https://familysearch.org/search/records');
$url->query_form(
collection_id => 2000219,
count => 20,
query => $query);
my $resp = $ua->get($url, 'Content-Type'=> 'application/json');
my $data = decode_json($resp->decoded_content);
return $data->{totalHits};
}
output
San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908
Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464
Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1
Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67

I do not know how to post revised code based on the solution above.
This code does not (yet) work correctly. However, I have made some essential updates that head in that direction.
I would very much appreciate help with this updated code. I am not sure of the proper way to post this code and this follow-up on this site.
It gets stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html', synchronize => 0);
my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout while waiting for application" if 0 > $retries;
# Now the hourglass is not visible anymore
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});

You should set the current form before accessing a field:
"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."
$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );
Sorry, I am not able to try this code out. Thanks for opening a new question for this.

Why does my Perl script using WWW-Mechanize fail intermittently?

I am trying to write a Perl script using WWW-Mechanize.
Here is my code:
use DBI;
use JSON;
use WWW::Mechanize;
sub fetch_companies_list
{
my $url = shift;
my $browser = WWW::Mechanize->new( stack_depth => 0 );
my ($content, $json, $parsed_text, $company_name, $company_url);
eval
{
print "Getting the companies list...\n";
$browser->get( $url );
# die "Can't get the companies list.\n" unless( $browser->status );
$content = $browser->content();
# die "Can't get companies names.\n" unless( $browser->status );
$json = new JSON;
$parsed_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode( $content );
foreach(@$parsed_text)
{
$company_name = $_->{name};
fetch_company_info( $company_name, $browser );
}
}
}
fetch_companies_list( "http://api.crunchbase.com/v/1/companies.js" );
The problem is as follows:
I start the script and it finishes fine.
I restart the script, and it fails in "$browser->get()".
I have to wait some time (about 5 minutes) before it starts working again.
I am working on Linux and have WWW-Mechanize version 1.66.
Any idea what might be the problem? I don't have any firewall installed either on computer or on my router.
Moreover, uncommenting the "die ..." line does not help, as the script stops inside the get() call. I can try to upgrade to the latest version, which is 1.71, but I'd like to know whether anyone else has experienced this with this Perl module.

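A wait of a few minutes is consistent with a request timeout: LWP::UserAgent, which WWW::Mechanize subclasses, defaults to a timeout of 180 seconds. Exactly what timed out will be returned in the response's status line.
my $response = $browser->res; # $browser is the WWW::Mechanize object from the question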
if (!$response->is_success()) {
die($response->status_line());
}

This is an issue with the target site. Right now it returns:
503 Service Unavailable No server is available to handle this request.

Retry with a wait between attempts. Try this:
## set maximum no of tries
my $retries = 10;
## number of secs to sleep
my $sleep = 1;
do {
eval {
print "Getting the companies list...\n";
$browser->get($url);
# die "Can't get the companies list.\n" unless( $browser->status );
$content = $browser->content();
# die "Can't get companies names.\n" unless( $browser->status );
$json = new JSON;
$parsed_text = $json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode($content);
foreach (@$parsed_text) {
$company_name = $_->{name};
fetch_company_info( $company_name, $browser );
}
};
if ($@) {
warn $@;
## rest for some time
sleep($sleep);
## increase the value of $sleep exponentially
$sleep *= 2;
}
} while ( $@ && $retries-- );
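The same retry-and-back-off idea can be factored into a reusable helper so that the fetching code stays separate from the retry logic. This is a minimal sketch, not taken from the answer: the name retry_with_backoff and its max_tries, first_delay and sleeper options are assumptions made for illustration. The sleeper option exists only so the helper can be exercised without really waiting.

```perl
use strict;
use warnings;

# Call $code until it returns without dying, or until max_tries attempts
# have been made. The delay between attempts doubles after each failure.
sub retry_with_backoff {
    my ($code, %opt) = @_;
    my $max_tries = $opt{max_tries}   // 5;
    my $delay     = $opt{first_delay} // 1;
    my $sleeper   = $opt{sleeper}     // sub { sleep $_[0] };

    for my $try (1 .. $max_tries) {
        my $result;
        my $ok = eval { $result = $code->(); 1 };
        return $result if $ok;
        warn "Attempt $try failed: $@";
        last if $try == $max_tries;
        $sleeper->($delay);   # wait before the next attempt
        $delay *= 2;          # exponential backoff
    }
    die "Giving up after $max_tries attempts\n";
}
```

With the question's code, the call might look like retry_with_backoff(sub { $browser->get($url); $browser->content }), relying on WWW::Mechanize dying on a failed request when autocheck is enabled.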

How to use a perl module that you have written?

I've just written my first Perl module and am having trouble getting it to work with a script I produced also. Here is the error that the Perl interpreter displays when I attempt to run the script that is using my newly created module.
Error message:
scraper_tools_v1.pm did not return a true value at getYid.pl line 5.
BEGIN failed--compilation aborted at getYid.pl line 5.
scraper_tools_v1.pm is the Perl module which I have written and getYid.pl is the Perl script which attempts to utilize the scraper_tools_v1.pm module.
Here is the code for the scraper_tools_v1.pm file:
#!/usr/bin/perl
package scraper_tools_v1;
use strict;
use warnings;
use WWW::Curl::Easy;
# Note this function expects a single parameter which should be in the form of a URL
sub getWebPage($)
{
# Setting up the Curl parameters
my $curl = WWW::Curl::Easy->new; # create a variable to store the curl object
# A parameter set to 1 tells the library to include the header in the body output.
# This is only relevant for protocols that actually have headers preceding the data (like HTTP).
$curl->setopt(CURLOPT_HEADER, 1);
# Setting the target URL to retrieve with the passed parameter
$curl->setopt(CURLOPT_URL, @_);
# Declaring a variable to store the response from the Curl request
my $response_body = '';
# Creating a file handle for CURL to output to, then redirecting our output to the $response_body variable
open(my $fileb, ">",\$response_body) or die $!;
$curl->setopt(CURLOPT_WRITEDATA, $fileb);
# getting the return code from the header to see if the GET was successful
my $return_code = $curl->perform;
# capturing the response code from the GET request in the HTTP header, i.e... 200, 404, 500, etc...
# 200 is success
my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
# if the return code is zero than the request was a success
if ($return_code == 0)
{
# A little debug output to keep you informed
print ("Success ". $response_code.": ".@_."\n");
# return whatever was contained on the web page that we just got using a GET
return $response_body;
}
else
{
print ("Failure ". $response_code.": ".@_."\n");
}
close($fileb); # close the file-handle
}
And here is the getYid.pl script which attempts to use the above module
#!/usr/bin/perl
use strict;
use warnings;
use scraper_tools_v1;
my %cat_links; # Hash that stores categories and their numbers (ID's)
my $web_page = scraper_tools_v1->getWebPage("http://something.com/categoryindex.aspx");
my @lines = split(/\n/, $web_page);
foreach my $line (@lines)
{
chomp($line);
if ($line =~ /<option value=\"{1}(.+)\">(.+)<\/option>/)
{
my $num = $1;
my $desc = $2;
$desc =~ s/\s+&\s+/ & /;
$cat_links{$desc} = $num;
}
}
my @allTargetUrls; # make a new array to store all the links we need to extract listings from
$web_page = ''; # Reset this variable so we can reuse it.
my $totalNumberOfListings = 0;
foreach my $key (keys %cat_links)
{
my $target = "http://something.com/categorydetail.aspx?id=$cat_links{$key}&exact_phrase=0";
$web_page = scraper_tools_v1->getWebPage($target);
@lines = split(/\n/, $web_page);
foreach my $line (@lines)
{
my $pages;
chomp($line);
if ($line =~ /We found (\d) listings for your search\./)
{
my $listingsInCat = $1;
print ("$cat_links{$key}, $listingsInCat");
$totalNumberOfListings += $listingsInCat;
}
if ($line =~ /Page 1 of (\d)/)
{
$pages = $1;
}
for (my $i = 1; $i <= $pages; $i++)
{
#build the target urls
my $pageUrl = "http://something.com/categorydetail.aspx?id=$key&search=&exact_phrase=True&city=&state=&zipcode=&page=$i";
push(@allTargetUrls, $pageUrl);
}
}
print("Total number of listings = ".$totalNumberOfListings);
}
Any help in resolving this issue would greatly be appreciated and please note that I have tested both files independently for interpreter errors and found nothing. Thanks to all for taking a look.

When you write a Perl module, you should always end the file with the line
1;
Perl executes code at the module level when the module is imported. If you don't return a true value (1 is true), then you'll get the error you describe. Essentially, Perl is informing you that the initialisation code in your module didn't succeed.
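As a minimal sketch of the rule (the module name MyTools and the greet subroutine are hypothetical, chosen only for this example), a module file that loads cleanly looks like this:

```perl
# File: MyTools.pm (hypothetical module name)
package MyTools;
use strict;
use warnings;

# A trivial subroutine so that the module has something to offer.
sub greet {
    my ($name) = @_;
    return "Hello, $name";
}

1;  # the last statement of a module file must evaluate to a true value
```

A script that has the module's directory on @INC (for example via use lib) can then say use MyTools; and call MyTools::greet('world'). In the question's case, the fix is to add the same final 1; line to scraper_tools_v1.pm.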
