program is terminated after else condition in perl - perl

After encountering the else condition(invalid url) loop is terminated and not processing further urls. 2. even if the node fails in xpath it is not printed in screen or file.I want to print that in both file and screen (node exception)
use LWP::Simple;
use File::Compare;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use Win32::Console::ANSI;
use Term::ANSIColor;
sub crawl_content{
{
open(FILE, "C:/Users/jeyakuma/Desktop/input/input.txt");
{
while(<FILE>){
chomp;
$url=$_;
foreach ($url){
$domain) = $url =~ m|www.([A-Z a-z 0-9]+.{3}).|x;
}
do 'C:/Users/jeyakuma/Desktop/perl/mainsub.pl';
&domain_check();
my $ua = LWP::UserAgent->new( agent => "Mozilla/5.0" );
my $req = HTTP::Request->new( GET => "$url" );
my $res = $ua->request($req);
if ( $res->is_success ){
print "working on $domain\n";
binmode ":utf8";
my $xp = HTML::TreeBuilder::XPath->new_from_url($url);
my #node = $xp->findnodes_as_string("$xpath") or print "couldn't find the node\n" ;
open HTML, '>:encoding(cp1252)',"C:/Users/jeyakuma/Desktop/ project/data_$date/$site.html";
foreach(<#node>){
print HTML #node;
close HTML ;
}
}
else{
print color("green"), "$domain Invalid url\n", color("reset") and open FILE,">C:/Users/jeyakuma/Desktop/log.txt"; print FILE " $domain Invalid URL";
}
}
}
}
}
do 'C:/Users/jeyakuma/Desktop/perl/comparefinal.pl';
compare_result();
}

The else condition reopens FILE for writing to another file. Thus, at the next iteration of the while (<FILE>) loop, Perl will attempt to read from FILE and fail (because it's now only available for writing, not reading), and the loop will end. You need to use a name other than FILE in the else condition.

Related

perl redirect stdout to lexical filehandle

I'm trying to write a helper function that runs a perl function in another process and returns a closure that produces a line of output at a time when called.
I figured out a way of doing this using pipe that mixes old and new-style filehandles. I used an old-style one for the sink in order to use the open(STDOUT, ">&thing") syntax and a new-style one for the source since it needs to be captured by a closure and I didn't want to burden the caller with providing a filehandle.
Is there a way of using a new-style filehandle in a construction with the same meaning as open(STDOUT, ">&thing")?
#!/usr/bin/env perl
# pipe.pl
# use pipe() to create a pair of fd's.
# write to one and read from the other.
#
# The source needs to be captured by the closure and can't be
# destructed at the end of get_reader(), so it has to be lexical.
#
# We need to be able to redirect stdout to sink in such a way that
# we actually dup the file descriptor (so shelling out works as intended).
# open(STDOUT, ">&FILEHANDLE") achieves this but appears to require an
# old-style filehandle.
use strict;
use warnings;
sub get_reader {
local *SINK;
my $source;
pipe($source, SINK) or die "can't open pipe!";
my $cpid = fork();
if ($cpid == -1) {
die 'failed to fork';
}
elsif ($cpid == 0) {
open STDOUT, ">&SINK" or die "can't open sink";
system("echo -n hi");
exit;
}
else {
return sub {
my $line = readline($source);
printf "from child (%s)\n", $line;
exit;
}
}
}
sub main {
my $reader = get_reader();
$reader->();
}
main();
When run, this produces
from child (hi)
as expected.
sub get_reader {
my ($cmd) = #_;
open(my $pipe, '-|', #$cmd);
return sub {
return undef if !$pipe;
my $line = <$pipe>;
if (!defined($line)) {
close($pipe);
$pipe = undef;
return undef;
}
chomp($line);
return $line;
};
}
If that's not good enough (e.g. because you also need to redirect the child's STDIN or STDERR), you can use IPC::Run instead.
use IPC::Run qw( start );
sub get_reader {
my ($cmd) = #_;
my $buf = '';
my $h = start($cmd, '>', \$buf);
return sub {
return undef if !$h;
while (1) {
if ($buf =~ s/^([^\n]*)\n//) {
return $1;
}
if (!$h->pump())) {
$h->finish();
$h = undef;
return substr($buf, 0, length($buf), '') if length($buf);
return undef;
}
}
};
}
Either way, you can now do
my $i = get_reader(['prog', 'arg', 'arg']);
while (defined( my $line = $i->() )) {
print "$line\n";
}
Either way, error handling left to you.

Perl error handling

how can i cache errors in perl? Is there try/cache like in JS? I would like if any error occurs to go to the start of the script.
And if anyone has an idea of improvement for the script below let me know because this is my first one in perl. The script just has to loop forever and never stop. :)
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON;
use HTTP::Request::Common qw(POST GET);
use Encode qw(encode);
use DBI;
use Time::Piece;
# Beware: we disable the SSL certificate check for this script.
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}=0;
# Debugging: off=0, medium=3, extensive=5
my $debuglevel=0;
my ($host,$username,$password)=('192.168.xxx.xxx','xxxx','xxxx');
# Define cms api key and nodeid.
my ($cmsapi,$cmsnode)=('xxxxxxxxx','1');
# Define all parameters to be logged each script's iteration.
# #parameterlist[x][$parameterid,$parameterlongtext,$parametershorttext,$data]
# which corresponds for FHEM's DbLog with:
# #parameterlist[x][$parameterid,$parameterlongtext,READING ,VALUE]
# $parameterlist[x][3] will be populated by the script, thus here undefined in each line (the last value is missing).
my #parameterlist=(
[3922,"Status TC","statusHeatPump"],
[3931,"Zunanja temperatura","outsideTemperature"],
[3924,"Status zalogovnika","statusBuffer"],
[3925,"Status bojlerja","statusBoiler"],
[3940,"Temperatura bojlerja","boilerTemperature"],
[3943,"Temperatura zalogovnika","bufferTemperature"],
[4331,"Temperatura nadstropja","floorTemperature"],
[3811,"Temperatura pritličja","groundTemperature"],
);
# We substitute the text for the burner's status with an integer, so plots are easier.
# Define which parameter holds the burner's status.
my $parameterstatusHeatPump=3922;
my #statusHeatPumpmatrix=(
["Off",0],
["Heating mode",50],
);
sub trim() {
my $str = $_[0];
$str =~ s/^\s+|\s+$//g;
return $str;
};
print "DEBUG: *** Script starting ***\n" if($debuglevel>0);
while (1) {
sleep 1;
my $ua=LWP::UserAgent->new;
my $request=HTTP::Request->new(GET=>'https://'.$host.'/api/auth/login.json?user='.$username.'&pwd='.$password);
my $response=$ua->request($request);
my $decoded=decode_json($response->decoded_content( charset => 'none'));
my $success=$decoded->{'Result'}{'Success'};
my $sessionid=$decoded->{'SessionId'};
print "DEBUG: ".$response->content."\n" if($debuglevel>4);
print "DEBUG: ".$success."\n" if($debuglevel>4);
my $i=0;
my $j=0;
my $parameterid;
my $dataValue;
my $rightnow;
my $data = "empty";
while (defined($parameterlist[$i][0])) {
$parameterid=$parameterlist[$i][0];
$request=HTTP::Request->new(GET=>'https://'.$host.'/api/menutree/read_datapoint.json?SessionId='.$sessionid.'&Id='.$parameterid);
$response=$ua->request($request);
$decoded=JSON->new->utf8->decode($response->decoded_content( charset => 'none'));
$success=$decoded->{'Result'}{'Success'};
$dataValue=encode('UTF-8', $decoded->{'Data'}{'Value'});
$parameterlist[$i][3]=&trim($dataValue);
if ($parameterlist[$i][0]==$parameterstatusHeatPump) {
$j=0;
while (defined($statusHeatPumpmatrix[$j][0])) {
if ($statusHeatPumpmatrix[$j][0] eq $parameterlist[$i][3]) {
$parameterlist[$i][3]=$statusHeatPumpmatrix[$j][1];
print "DEBUG: Substituting text of HeatPump\n" if($debuglevel>0);
};
$j++;
}
}
print "DEBUG: ".$response->content."\n" if($debuglevel>4);
print "DEBUG: ".$success."\n" if($debuglevel>4);
print "DEBUG: ".$parameterlist[$i][1]."=".$dataValue."\n" if($debuglevel>0);
$rightnow=localtime->strftime('%Y-%m-%d %H:%M:%S');
if ($data eq "empty"){
$data = $parameterlist[$i][2].':'.$parameterlist[$i][3];
}
else{
$data = $parameterlist[$i][2].':'.$parameterlist[$i][3].','.$data;
}
$i++;
}
print "JSON data = ".$data."\n" if($debuglevel>0);;
#Post data
my $req=HTTP::Request->new(POST=>'http://cms.org/input/post.json?apikey='.$cmsapi.'&node='.$cmsnode.'&json={'.$data.'}');
my $resp = $ua->request($req);
if ($resp->is_success) {
my $message = $resp->decoded_content;
print "Received reply: $message\n" if($debuglevel>0);
}
else {
print "HTTP POST error code: ", $resp->code, "\n" if($debuglevel>0);
print "HTTP POST error message: ", $resp->message, "\n" if($debuglevel>0);
}
}
print "DEBUG: *** Script ended ***\n\n" if($debuglevel>0);
I am answering the specific:
Is there try/cache like in JS?
Yes there is. Instead of
try {
possible evil code;
} catch (e) {
...
}
in perl you write
eval {
possible evil code;
};
if ($#) {
...
}
where $# is the message with which youre code died. BTW - don't vorget the ';' after the eval code.
HTH
Georg
In Perl you can use eval,
For Perl Script:
eval {
your code statement;
}
if($#){
print qq{Error: $#};
}
For CGI file use like below if you want to print the error:
eval {
your code statement || die "Error: $!";
}
if($#){
print qq{Error: $#};
}

Web crawler using perl

I want to develop a web crawler which starts from a seed URL and then crawls 100 html pages it finds belonging to the same domain as the seed URL as well as keeps a record of the traversed URLs avoiding duplicates. I have written the following but the $url_count value does not seem to be incremented and the retrieved URLs contain links even from other domains. How do I solve this? Here I have inserted stackoverflow.com as my starting URL.
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
##open file to store links
open my $file1,">>", ("extracted_links.txt");
select($file1);
##starting URL
my #urls = 'http://stackoverflow.com/';
my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;
while (#urls)
{
my $url = shift #urls;
if (exists $visited{$url}) ##check if URL already exists
{
next;
}
else
{
$url_count++;
}
my $request = HTTP::Request->new(GET => $url);
my $response = $browser->request($request);
if ($response->is_error())
{
printf "%s\n", $response->status_line;
}
else
{
my $contents = $response->content();
$visited{$url} = 1;
#lines = split(/\n/,$contents);
foreach $line(#lines)
{
$line =~ m#(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])#g;
print "$1\n";
push #urls, $$line[2];
}
sleep 60;
if ($visited{$url} == 100)
{
last;
}
}
}
close $file1;
Several points, your URL parsing is fragile, you certainly won't get relative links. Also you don't test for 100 links but 100 matches of the current url, which almost certainly isn't what you mean. Finally, I'm not too familiar with LWP so I'm going to show an example using the Mojolicious suite of tools.
This seems to work, perhaps it will give you some ideas.
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::URL;
##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;
##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my #urls = $base;
my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;
while (#urls) {
my $url = shift #urls;
next if exists $visited{$url};
print "$url\n";
print $log "$url\n";
$visited{$url} = 1;
$url_count++;
# find all <a> tags and act on each
$ua->get($url)->res->dom('a')->each(sub{
my $url = Mojo::URL->new($_->{href});
if ( $url->is_abs ) {
return unless $url->host eq $base->host;
}
push #urls, $url;
});
last if $url_count == 100;
sleep 1;
}

How to keep data marked as UTF-8 after parsing with HTML::Tree?

I wrote a script, where i slurp in UTF-8 encoded HTML-file and then parse it to tree with HTML::Tree. Problem is that after parsing the strings are not marked as UTF-8 anymore.
As _utf8_on() is not recommended way to set flag on, i am looking for proper way.
My simplified code-example:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8::all;
use autodie;
use HTML::Tree;
use Encode qw/is_utf8/;
my $file = shift;
my $tree;
if ($file) {
my $content = slurp_in( 'file' => $file );
$tree = html_tree('content' => $content);
} else {
die "no file";
}
my $title = $tree->look_down(_tag => 'title');
$title = $title->as_HTML('');
if ( is_utf8( $title ) ) {
say "OK: $title";
} else {
say "NOT OK: $title";
}
## SUBS
##
sub slurp_in {
my %v = #_;
open(my $fh, "<:utf8", $v{file}) || die "no $v{file}: $!";
local $/;
my $content = (<$fh>);
close $fh;
if ($content) {
return $content;
} else {
die "no content in $v{file} !";
}
}
sub html_tree {
my %v = #_;
my $tree = HTML::Tree->new();
$tree->utf8_mode(1); ## wrong call here, no such method, but no warnings on it!
$tree->parse( $v{content} );
if ($tree) {
return $tree;
} else {
die "no tree here";
}
}
Your code is overcomplicated, and you employ utf8::all and decode manually and call that strange method all at once. Rhetorically asking, what do you expect to achieve that way? I do not have the patience to find out the details what goes wrong and where, especially since you did not post any input with which your program fails to do the expected, so I drastically reduce it to a much simpler one. This works:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings FATAL => ':all';
use File::Slurp qw(read_file); # autodies on error
use HTML::Tree qw();
my $file = shift;
die 'no file' unless $file;
my $tree = HTML::Tree->new_from_content(
read_file($file, binmode => ':encoding(UTF-8)')
);
my $title = $tree->look_down(_tag => 'title');
$title->as_HTML(''); # returns a Perl string

Copy and retain previous output for backup of transformed json data

I have a perl script that transforms json data to perl and saves output in files called teams.txt, backyard, and also a file called backup.txt, where the output of teams.txt is copied from. The following are two snippets from the script/the part of it that writes the data to the text files:
my %manager_to_directs;
my %user_to_manager;
my #users;
my $url = "https://xxxxxxxxxxxxxx.com/api/v1/reports/active/week";
my $useragent = LWP::UserAgent->new();
my $response = $useragent->get(($url));
if ($response->code !~ "200" || $response->code !~ "204" ){
while ($url && $url ne "Null") {
my $data = fetch_json($url);
last if !defined $data;
$url = $data->{next};
.
.
.
# write backyard.txt
open my $backyard_fh, ">", "backyard.txt";
foreach my $user (sort keys %user_to_management_chain) {
my $chain = join ',', #{$user_to_management_chain{$user}};
print $backyard_fh "$user:$chain\n";
}
close $backyard_fh;
# write teams.txt
open my $team_fh, ">", "teams.txt";
foreach my $user (sort #users) {
my $followers = $manager_to_followers{$user};
my $followers_joined = $followers ? join (',', sort #$followers) : "";
print $team_fh "$user:$followers_joined\n";
}
close $team_fh;
# write backup.txt, backup for teams.txt
open my $backup_fh, ">", "backup.txt";
copy("teams.txt", "backup.txt")
or die ("Can't copy teams.txt \n");
close $backup_fh;
This works almost exactly how I want it to, but now I've been testing with a negative scenario, where the .json url provided in the script is false/nonexistent, and I have to make sure that not another teams.txt file is created and the backup.txt file is still retained from the last execution.
I tested by replacing
my $url = "https://xxxxxxxxxxxxxx.com/api/v1/reports/active/week";
with
my $url = "https://fakeUrl.com/api/v1/reports/active/week";
And in this scenario, 404 would be passed and the program is supposed to fail. With this test, I noticed that the the contents of teams.txt and backyard.txt get wiped, but the backup.txt file gets wiped too...and that's not good.
I'm fine with teams.txt and backyard.txt being overwritten per each run of the script, but I need the backup.txt file to be retained no matter what, unless the program runs successfully and there's new content from teams.txt to be copied over to backup.txt.
Any help I can get is highly appreciated!
Following code snippets taken almost directly from documentation for modules.
May be you should try this approach.
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent ();
my $url = 'https://metacpan.org/pod/HTTP::Tiny';
$url = 'https://fakeUrl.com/api/v1/reports/active/week';
my $ua = LWP::UserAgent->new(timeout => 10);
$ua->env_proxy;
my $response = $ua->get($url);
my $data;
if ($response->is_success) {
$data = $response->decoded_content;
}
else {
die $response->status_line;
}
# Process further data
say $data;
Output
500 Can't connect to fakeUrl.com:443 (Bad file descriptor) at C:\....\http_lwp.pl line 19.
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $url = 'https://metacpan.org/pod/HTTP::Tiny';
$url = 'https://fakeUrl.com/api/v1/reports/active/week';
my $data;
my $response = HTTP::Tiny->new->get($url);
if( $response->{success} ) {
$data = $response->{content};
} else {
say "$response->{status} $response->{reason}";
exit 1;
}
# Process further data
say $data;
Output
403 Forbidden