Web crawler using perl

Web crawler using perl - perl

I want to develop a web crawler which starts from a seed URL and then crawls 100 html pages it finds belonging to the same domain as the seed URL as well as keeps a record of the traversed URLs avoiding duplicates. I have written the following but the $url_count value does not seem to be incremented and the retrieved URLs contain links even from other domains. How do I solve this? Here I have inserted stackoverflow.com as my starting URL.
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
##open file to store links
open my $file1,">>", ("extracted_links.txt");
select($file1);
##starting URL
my #urls = 'http://stackoverflow.com/';
my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;
while (#urls)
{
my $url = shift #urls;
if (exists $visited{$url}) ##check if URL already exists
{
next;
}
else
{
$url_count++;
}
my $request = HTTP::Request->new(GET => $url);
my $response = $browser->request($request);
if ($response->is_error())
{
printf "%s\n", $response->status_line;
}
else
{
my $contents = $response->content();
$visited{$url} = 1;
#lines = split(/\n/,$contents);
foreach $line(#lines)
{
$line =~ m#(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])#g;
print "$1\n";
push #urls, $$line[2];
}
sleep 60;
if ($visited{$url} == 100)
{
last;
}
}
}
close $file1;

Several points, your URL parsing is fragile, you certainly won't get relative links. Also you don't test for 100 links but 100 matches of the current url, which almost certainly isn't what you mean. Finally, I'm not too familiar with LWP so I'm going to show an example using the Mojolicious suite of tools.
This seems to work, perhaps it will give you some ideas.
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::URL;
##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;
##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my #urls = $base;
my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;
while (#urls) {
my $url = shift #urls;
next if exists $visited{$url};
print "$url\n";
print $log "$url\n";
$visited{$url} = 1;
$url_count++;
# find all <a> tags and act on each
$ua->get($url)->res->dom('a')->each(sub{
my $url = Mojo::URL->new($_->{href});
if ( $url->is_abs ) {
return unless $url->host eq $base->host;
}
push #urls, $url;
});
last if $url_count == 100;
sleep 1;
}

Related

Getting Absolute URLs with module creating object outside loop

I have a doubt I've been trying to solve myself using CPAN modules documentation, but I'm a bit new and I'm confused with some terminology and sections within the different modules.
I'm trying to create the object in the code below, and get the absolute URL for relative links extracted from a website.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
exit(0);
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url );
my $content = $response->decoded_content();
my $links = URI->new($content);
my $abs = $links->abs('http:', $content);
my $abs_links = $links->abs($abs);
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {
$abs_links = $1;
print "$abs_links\n";
print "Digest for the above URL is " . md5_hex($abs_links) . "\n";
}
The problem is when I try to add that part outside the While loop (the 3-line block preceding the loop), it does not work, whereas if I add the same part in the While loop, it will work fine. This one just gets the relative URLs from a given website, but instead of printing "Http://..." it prints "//...".
The script that works fine for me is the following:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI::URL;
my $url = $ARGV[0]; ## Url passed in command
if ($url !~ m{^https?://[\w]+-?[\w]+\.com/?}i) {
exit(0); ## Program stops if not valid URL
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url ); ## Get response, not content
my $content = $response->decoded_content(); ## Now let's get the content
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) { ## All links
my $links = $1;
my $abs = new URI::URL "$links";
my $abs_url = $abs->abs('http:', $links);
print "$abs_url\n";
print "Digest for the above URL is " . md5_hex($abs_url) . "\n";
}
Any ideas? Much appreciated.

I don't understand your code. There are a few weird bits:
[^\W] is the same as \w
The regex allows an optional - before and an optional / after .com, i.e. http://bitwise.complement.biz matches but http://cool-beans.com doesn't.
URI->new($content) makes no sense: $content is random HTML, not a URI.
$links->abs('http:', $content) makes no sense: $content is simply ignored, and $links->abs('http:') tries to make $links an absolute URL relative to 'http:', but 'http:' is not a valid URL.
Here's what I think you're trying to do:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);
#ARGV == 1 or die "Usage: $0 URL\n";
my $url = $ARGV[0];
my $ua = LWP::UserAgent->new(timeout => 10);
my $response = $ua->get($url);
$response->is_success or die "$0: " . $response->request->uri . ": " . $response->status_line . "\n";
my $content = $response->decoded_content;
my $base = $response->base;
my #links;
my $p = HTML::LinkExtor->new(
sub {
my ($tag, %attrs) = #_;
if ($tag eq 'a' && $attrs{href}) {
push #links, "$attrs{href}"; # stringify
}
},
$base,
);
$p->parse($content);
$p->eof;
for my $link (#links) {
print "$link\n";
print "Digest for the above URL is " . md5_hex($link) . "\n";
}
I don't try to validate the URL passed in $ARGV[0]. Leave it to LWP::UserAgent. (If you don't like this, just add the check back in.)
I make sure $ua->get($url) was successful before proceeding.
I get the base URL for absolutifying relative links from $response->base.
I use HTML::LinkExtor for parsing the content, extracting links, and making them absolute.

I think your biggest mistake is trying to parse links out of HTML using a regular expression. You would be far better advised to use a CPAN module for this. I'd recommend WWW::Mechanize, which would make your code look something like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
exit(0);
}
my $ua = WWW::Mechanize->new;
$ua->timeout( 10 );
$ua->get( $url );
foreach ($ua->links) {
say $_->url;
say "Digest for the above URL is " . md5_hex($_->url) . "\n";
}
That looks a lot simpler to me.

Web-crawler optimization

I am building a basic search engine using vector-space model and this is the crawler for returning 500 URLs and removes the SGML tags from the content. However, it is very slow (takes more than 30mins for retrieving the URLs only). How can I optimize the code? I have inserted wikipedia.org as an example starting URL.
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;
my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my #urls = $starting_url;
my %alreadyvisited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);
my $url_count = 0;
while (#urls)
{
my $url = shift #urls;
next if $alreadyvisited{$url}; ## check if already visited
my $request = HTTP::Request->new(GET => $url);
my $response = $browser->request($request);
if ($response->is_error())
{
print $response->status_line, "\n"; ## check for bad URL
}
my $contents = $response->content(); ## get contents from URL
push #c, $contents;
my #text = &RemoveSGMLtags(\#c);
#print "#text\n";
$alreadyvisited{$url} = 1; ## store URL in hash for future reference
$url_count++;
print "$url\n";
if ($url_count == 500) ## exit if number of crawled pages exceed limit
{
exit 0;
}
my ($page_parser) = HTML::LinkExtor->new(undef, $url);
$page_parser->parse($contents)->eof; ## parse page contents
my #links = $page_parser->links;
foreach my $link (#links)
{
$test = $$link[2];
$test =~ s!^https?://(?:www\.)?!!i;
$test =~ s!/.*!!;
$test =~ s/[\?\#\:].*//;
if ($test eq "en.wikipedia.org") ## check if URL belongs to unt domain
{
next if ($$link[2] =~ m/^mailto/);
next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
push #urls, $$link[2];
}
}
sleep 1;
}
sub RemoveSGMLtags
{
my ($input) = #_;
my #INPUTFILEcontent = #$input;
my $j;my #raw_text;
for ($j=0; $j<$#INPUTFILEcontent; $j++)
{
my $INPUTFILEvalue = $INPUTFILEcontent[$j];
use HTML::Parse;
use HTML::FormatText;
my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
push #raw_text, ($plain_text);
}
return #raw_text;
}

Always use strict
Never use the ampersand & on subroutine calls
Use URI to manipulate URLs
You have a sleep 1 in there, which I assume is to avoid hammering the site too much, which is good. But the bottleneck in almost any web-based application is the internet itself, and you won't be able to make your program any faster without requesting more from the site. That means removing your sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a way you should be going?

Use WWW::Mechanize which handles all the URL parsing and extraction for you. So much easier than all the link parsing you're dealing with. It was created specifically for the sort of thing you're doing, and it's a subclass of LWP::UserAgent so you should just be able to change all your LWP::UserAgent to WWW::Mechanize without having to change any code, except for all the link extraction, so you can do this:
my $mech = WWW::Mechanize->new();
$mech->get( 'someurl.com' );
my #links = $mech->links;
and then #links is an array of WWW::Mechanize::Link objects.

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
That website is the amount of white men born in the year 1923 who live in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.
In the file, locations.txt, I put the list of counties, such as San Diego County.
The current code runs, but instead of the # 1975, it displays unknown. The number 1975 should be in $val\n.
I would very much appreciate any help!
#!/usr/bin/perl
use strict;
use LWP::Simple;
open(L, "locations26.txt");
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
chomp($location);
$location =~ s/ /+/g;
foreach my $year (1923..1923) {
my $u = $url;
$u =~ s/%LOCATION%/$location/;
$u =~ s/%YEAR%/$year/;
#print "$u\n";
my $content = get($u);
my $val = 'unknown';
if ($content =~ / of .strong.([0-9,]+)..strong. /) {
$val = $1;
}
$val =~ s/,//g;
$location =~ s/\+/ /g;
print "'$location',$year,$val\n";
print O "'$location',$year,$val\n";
}
}
Update: API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicbale.

It would appear that your data is generated by Javascript and thus LWP cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/
I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.

You could use WWW::Mechanize::Firefox to process any site that could be loaded by Firefox.
http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples
You have to install the Mozrepl plugin and you will be able to process the web page contant via this module. Basically you will "remotly control" the browser.
Here is an example (maybe working)
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html');
my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[#class="form-submit"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout" if 0 > $retries;
#fill out the search form
my #forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[#class="form-submit"]'});

If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.
This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.
use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;
# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";
# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
# This regular expression is like chomp, but removes both Windows and
# *nix line-endings, regardless of the system the script is running on.
$location =~ s/[\r\n]//g;
foreach my $year (1923..1923) {
# If you need to add quotes around the location, use "\"$location\"".
my %args = (LOCATION => $location, YEAR => $year);
my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';
# Note that values need to be doubly-escaped because of the
# weird way their website is set up (the "/proxy" URL we're
# requesting is subsequently loading some *other* URL which
# is provided to "/proxy" as a URL-encoded URL).
#
# This regular expression replaces any ^WHATEVER^ in the URL
# with the double-URL-encoded value of WHATEVER in %args.
# The /e flag causes the replacement to be evaluated as Perl
# code. This way I can look data up in a hash and do URL-encoding
# as part of the regular expression without an extra step.
$url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;
#print "$url\n";
# Create an HTTP request object for this URL.
my $request = HTTP::Request->new(GET => $url);
# This HTTP header is required. The server outputs garbage if
# it's not present.
$request->push_header('Content-Type' => 'application/json');
# Send the request and check for an error from the server.
my $response = $ua->request($request);
die "Error ".$response->code if !$response->is_success;
# The response should be JSON.
my $obj = from_json($response->content);
my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
print O $str;
print $str;
}
}

What about this simple script without firefox ? I had investigated the site a bit to understand how it works, and I saw some JSON requests with firebug firefox addon, so I know which URL to query to get the relevant stuff. Here is the code :
use strict; use warnings;
use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new();
open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;
# iterate over locations from locations2.txt file
while (my $place = <$fh>) {
# remove line ending
chomp $place;
# iterate over years
foreach my $year (1923..1925) {
# building URL with the variables
my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
my $request = HTTP::Request->new(GET => $url);
# faking referer (where we comes from)
$request->header('Referer', 'https://familysearch.org/search/collection/results');
# setting expected format header for response as JSON
$request->header('content_type', 'application/json');
my $response = $ua->request($request);
if ($response->code == 200) {
# this line convert a JSON to Perl HASH
my $hash = decode_json $response->content;
my $val = $hash->{totalHits};
print $fh2 "year $year, place $place : $val\n";
}
else {
die $response->status_line;
}
}
}
END{ close $fh; close $fh2; }

This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.
use 5.010;
use warnings;
use WWW::Mechanize::Firefox;
STDOUT->autoflush;
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);
$mech->get('about:blank');
$mech->get($url);
my $text;
while () {
sleep 1;
$text = $mech->xpath('//p[#class="num-search-results"]/text()', maybe => 1);
last if defined $text;
}
my $results = $text->{nodeValue};
say $results;
if ($results =~ /([\d,]+)\s+results/) {
(my $n = $1) =~ tr/,//d;
say $n;
}
output
1-20 of 1,975 results
1975
Update
This update is with special thanks to #nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.
Rather than making a request via the superfluous https://familysearch.org/proxy this code accesses the server directly at https://familysearch.org/search/records, reencodes the JSON and dumps the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second - more than ten times faster than with the equivalent request from the basic web site) and stability (as you note, the site is very flaky - in contrast I have never seen an error using this method).
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;
use autodie;
STDOUT->autoflush;
open my $fh, '<', 'locations26.txt';
my #locations = <$fh>;
chomp #locations;
open my $outfh, '>', 'out26.txt';
my $ua = LWP::UserAgent->new;
for my $county (#locations[36, 0..2]) {
for my $year (1923 .. 1926) {
my $total = familysearch_info($county, $year);
print STDOUT "$county,$year,$total\n";
print $outfh "$county,$year,$total\n";
}
print "\n";
}
sub familysearch_info {
my ($county, $year) = #_;
my $query = join ' ', (
'+event_place_level_1:California',
sprintf('+event_place_level_2:"%s"', $county),
sprintf('+birth_year:%1$d-%1$d~', $year),
'+gender:M',
'+race:White',
);
my $url = URI->new('https://familysearch.org/search/records');
$url->query_form(
collection_id => 2000219,
count => 20,
query => $query);
my $resp = $ua->get($url, 'Content-Type'=> 'application/json');
my $data = decode_json($resp->decoded_content);
return $data->{totalHits};
}
output
San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908
Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464
Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1
Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67

I do not know how to post revised code from the solution above.
This code does not (yet) compile correctly. However, I have made some essential update to definitely head in that direction.
I would very much appreciate help on this updated code. I do not know how to post this code and this follow up such that it appease the lords who run this sight.
It get stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html', synchronize => 0);
my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[#id="hourglass"]' )) {
print "Sleep until we find the thing\n";
sleep 2;
};
die "Timeout while waiting for application" if 0 > $retries;
# Now the hourglass is not visible anymore
#fill out the search form
my #forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[#class="form-submit"]'});

You should set the current form before accessing a field:
"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."
$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );
Sorry, I am not able too try this code out and thanks for open a question for a new question.

Download Specific Images

I'm trying to search and download specific images /front and back cover / of a website if found but whatever I do I always download only one of them. What should I change in my code to download both of them if found?
use strict;
use warnings;
use LWP;
use LWP::UserAgent;
use URI::Escape;
use HTTP::Status;
getCover(...);
sub getCover {
......
while ($title_found =~ /'(http:\/\/images.blu-ray.com\/movies\/covers\/\d+_.*?)'/gis) {
$url = getSite($1);
if ($title_found =~ /front/) {
$filename = 'front.jpg';
}
elsif ($title_found =~ /back/) {
$filename = 'back.jpg';
}
}
my $dir = 'somepath'.$filename;
open F, ">", $dir;
binmode F;
print F $url;
close F;
return 0;
}
sub getSite {
$url = shift;
print "URL: $url\n";
my $r;
my $ua = new LWP::UserAgent();
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030312");
my $req = new HTTP::Request GET => $url;
$req->push_header("Accept-Language", "en");
$req = $ua->prepare_request($req);
my $res = $ua->request($req);
my $rc = $res->code;
if(is_success($rc)){
$r = $res->as_string();
$r = $res->content();
}
else {
print "Failed\n";
}
return $r;
}

Try putting the part that saves to 'somepath'.$filename inside the while loop instead of outside it.
Also, it appears that $title_found is supposed to contain multiple URLs. In that case, you need to save $1 to a temporary variable, and look for front/back in that instead of in $title_found. Otherwise, you'll wind up saving both covers to front.jpg.

Copy and retain previous output for backup of transformed json data

I have a perl script that transforms json data to perl and saves output in files called teams.txt, backyard, and also a file called backup.txt, where the output of teams.txt is copied from. The following are two snippets from the script/the part of it that writes the data to the text files:
my %manager_to_directs;
my %user_to_manager;
my #users;
my $url = "https://xxxxxxxxxxxxxx.com/api/v1/reports/active/week";
my $useragent = LWP::UserAgent->new();
my $response = $useragent->get(($url));
if ($response->code !~ "200" || $response->code !~ "204" ){
while ($url && $url ne "Null") {
my $data = fetch_json($url);
last if !defined $data;
$url = $data->{next};
.
.
.
# write backyard.txt
open my $backyard_fh, ">", "backyard.txt";
foreach my $user (sort keys %user_to_management_chain) {
my $chain = join ',', #{$user_to_management_chain{$user}};
print $backyard_fh "$user:$chain\n";
}
close $backyard_fh;
# write teams.txt
open my $team_fh, ">", "teams.txt";
foreach my $user (sort #users) {
my $followers = $manager_to_followers{$user};
my $followers_joined = $followers ? join (',', sort #$followers) : "";
print $team_fh "$user:$followers_joined\n";
}
close $team_fh;
# write backup.txt, backup for teams.txt
open my $backup_fh, ">", "backup.txt";
copy("teams.txt", "backup.txt")
or die ("Can't copy teams.txt \n");
close $backup_fh;
This works almost exactly how I want it to, but now I've been testing with a negative scenario, where the .json url provided in the script is false/nonexistent, and I have to make sure that not another teams.txt file is created and the backup.txt file is still retained from the last execution.
I tested by replacing
my $url = "https://xxxxxxxxxxxxxx.com/api/v1/reports/active/week";
with
my $url = "https://fakeUrl.com/api/v1/reports/active/week";
And in this scenario, 404 would be passed and the program is supposed to fail. With this test, I noticed that the the contents of teams.txt and backyard.txt get wiped, but the backup.txt file gets wiped too...and that's not good.
I'm fine with teams.txt and backyard.txt being overwritten per each run of the script, but I need the backup.txt file to be retained no matter what, unless the program runs successfully and there's new content from teams.txt to be copied over to backup.txt.
Any help I can get is highly appreciated!

Following code snippets taken almost directly from documentation for modules.
May be you should try this approach.
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent ();
my $url = 'https://metacpan.org/pod/HTTP::Tiny';
$url = 'https://fakeUrl.com/api/v1/reports/active/week';
my $ua = LWP::UserAgent->new(timeout => 10);
$ua->env_proxy;
my $response = $ua->get($url);
my $data;
if ($response->is_success) {
$data = $response->decoded_content;
}
else {
die $response->status_line;
}
# Process further data
say $data;
Output
500 Can't connect to fakeUrl.com:443 (Bad file descriptor) at C:\....\http_lwp.pl line 19.
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $url = 'https://metacpan.org/pod/HTTP::Tiny';
$url = 'https://fakeUrl.com/api/v1/reports/active/week';
my $data;
my $response = HTTP::Tiny->new->get($url);
if( $response->{success} ) {
$data = $response->{content};
} else {
say "$response->{status} $response->{reason}";
exit 1;
}
# Process further data
say $data;
Output
403 Forbidden

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Web crawler using perl - perl

Related

Getting Absolute URLs with module creating object outside loop

Web-crawler optimization

using Perl to scrape a website

Download Specific Images

Copy and retain previous output for backup of transformed json data

Categories

Resources