What's wrong with my IF statement and LWP::Simple? - perl

I am trying to create a simple scraper, and I am using getstore(), but the scirpt won't create the .txt file when used within an IF statement. What am I doing wrong there?
Thanks,
Carlos N.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $url;
my $content;
print "Enter URL:";
chomp($url = <STDIN>);
$content = get($url);
if ($content =~ s%<(style|script)[^<>]*>.*?</\1>|</?[a-z][a-z0-9]*[^<>]*>|<!--.*?-->%%g) {
$content = getstore($content,"../crawled_text.txt");
}
die "Couldn't get $url" unless defined $content;

From the LWP::Simple documentation:
my $code = getstore($url, $file)
Gets a document identified by a URL and stores it in the file. The
return value is the HTTP response code.
Your first parameter is a stripped HTML file and likely not a URL. You could use a debugger or print statements in your code to understand more about the contents of your variables and about whether your program goes into an if block.

getstore takes an URL as a parameter and stores it into a file. What you want to do is to just store content in a file, so use this instead
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Path::Tiny;
my $url = shift || "https://perl.org";
my $content = get($url) or die "Couldn't get $url" ;
if ($content =~ s%<(style|script)[^<>]*>.*?</\1>|</?[a-z][a-z0-9]*[^<>]*>|<!--.*?-->%%g) {
my $crawled_text = path("../crawled_text.txt");
$crawled_text->spew_utf8($content)
}
I have made also some small style changes and Path::Tiny to save content to a file. You can use the default open and print (or say) if you prefer to do so. Using shift allows also to take the URL as an argument from the command line, which is more idiomatic than prompting the user for it.

Related

Where does PERL LWP::Simple getstore save the image?

I am trying to use perl getstore to get a list of image from URL after reading a text file containing the file names, I created the code and able to run it successfully but I do not know where is the file saved, i checked the disk size it and shows that every time i run the code the hard disk free space decrease, so i assume there are file saved but I can't find it. So where does perl getstore save file and what is the correct way to save image from a link ?
use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
my $url = "https://labs.jamesooi.com/images/";
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36");
my $file = 'image-list.txt';
open (DATA, $file) or die "Could not open $file: $!";
while(<DATA>){
my $link = "$url" . "$_";
my $filename = "$_";
print $link;
print $filename;
my $req = HTTP::Request->new(GET => $link);
my $res = $ua->request($req);
if($res->is_success){
my $rc = getstore($link, $filename);
if(is_success($rc)){
print "Success\n";
}else{
print "Error\n";
}
} else {
print $res->status_line, "\n";
}
}
According to
the documentation,
getstore(url, file) takes the URL as the first argument and the second argument is the file name where the result is stored. If the file name is a relative path (it doesn't begin with a slash /) it will be relative to the current working directory.
But you read the name from a file and then treat the full line, including the newline character, as the file name. That's probably not what you want, so you should use chomp to remove the newline.
Apart from that:
You are doing first a GET request using LWP::UserAgent to retrieve the file but ignore the response and instead call getstore to retrieve and store the same resource if the first GET was successful. It would be simpler to either just save the result from the first GET or just skip it and use only getstore.
You are using DATA as a file handle. While this is not wrong, DATA is already an implicit file handle which points to the program file after the __DATA__ marker, so I recommend to use a different file handle.
When using a simplified version of the code the file gets successfully stored:
use strict;
use warnings;
use LWP::Simple;
my $url = "https://labs.jamesooi.com/images/";
my $file = 'image-list.txt';
open (my $fh, '<', $file) or die "Could not open $file: $!";
while ( <$fh> ) {
chomp; # remove the newline from the end of the line
my $link = $url . $_;
my $filename = $_;
my $rc = getstore($link, $filename);
if (is_success($rc)) {
print "Success\n";
}
else {
print "Error\n";
}
}

How to pass and read a File handle to a Perl subroutine?

I want a Perl module that reads from the special file handle, <STDIN>, and passes this to a subroutine. You will understand what I mean when you see my code.
Here is how it was before:
#!/usr/bin/perl
use strict; use warnings;
use lib '/usr/local/custom_pm'
package Read_FH
sub read_file {
my ($filein) = #_;
open FILEIN, $filein or die "could not open $filein for read\n";
# reads each line of the file text one by one
while(<FILEIN>){
# do something
}
close FILEIN;
Right now the subroutine takes a file name (stored in $filein) as an argument, opens the file with a file handle, and reads each line of the file one by one using the fine handle.
Instead, I want get the file name from <STDIN>, store it inside a variable, then pass this variable into a subroutine as an argument.
From the main program:
$file = <STDIN>;
$variable = read_file($file);
The subroutine for the module is below:
#!/usr/bin/perl
use strict; use warnings;
use lib '/usr/local/custom_pm'
package Read_FH
# subroutine that parses the file
sub read_file {
my ($file)= #_;
# !!! Should I open $file here with a file handle? !!!!
# read each line of the file
while($file){
# do something
}
Does anyone know how I can do this? I appreciate any suggestions.
It is a good idea in general to use lexical filehandlers. That is a lexical variable containing the file handler instead of a bareword.
You can pass it around like any other variables. If you use read_file from File::Slurp you do not need a seperate file handler, it slurps the content into a variable.
As it is also good practice to close opened file handles as soon as possible this should be the preferred way if you realy only need to get the complete file content.
With File::Slurp:
use strict;
use warnings;
use autodie;
use File::Slurp;
sub my_slurp {
my ($fname) = #_;
my $content = read_file($fname);
print $content; # or do something else with $content
return 1;
}
my $filename = <STDIN>;
my_slurp($filename);
exit 0;
Without extra modules:
use strict;
use warnings;
use autodie;
sub my_handle {
my ($handle) = #_;
my $content = '';
## slurp mode
{
local $/;
$content = <$handle>
}
## or line wise
#while (my $line = <$handle>){
# $content .= $line;
#}
print $content; # or do something else with $content
return 1;
}
my $filename = <STDIN>;
open my $fh, '<', $filename;
my_handle($fh); # pass the handle around
close $fh;
exit 0;
I agree with #mugen kenichi, his solution is a better way to do it than building your own. It's often a good idea to use stuff the community has tested. Anyway, here are the changes you can make to your own program to make it do what you want.
#/usr/bin/perl
use strict; use warnings;
package Read_FH;
sub read_file {
my $filein = <STDIN>;
chomp $filein; # Remove the newline at the end
open my $fh, '<', $filein or die "could not open $filein for read\n";
# reads each line of the file text one by one
my $content = '';
while (<$fh>) {
# do something
$content .= $_;
}
close $fh;
return $content;
}
# This part only for illustration
package main;
print Read_FH::read_file();
If I run it, it looks like this:
simbabque#geektour:~/scratch$ cat test
this is a
testfile
with blank lines.
simbabque#geektour:~/scratch$ perl test.pl
test
this is a
testfile
with blank lines.

How to Call .pl File inside .cgi Script

i am using getpdftext.pl from CAM::PDF to extract pdf and print it to text, but in my web application i want to call this getpdftext.pl inside .cgi script. Can u suggest me as to what to do or how to proceed ahead. i tried converting getpdftext.pl to getpdftext.cgi but it doesnt work.
Thanks all
this is a extract from my request_admin.cgi script
my $filename = $q->param('quote');
:
:
:
&parsePdf($filename);
#function to extract text from pdf ,save it in a text file and parse the required fields
sub parsePdf($)
{
my $i;
print $_[0];
$filein = "quote_uploads/$_[0]";
$fileout = 'output.txt';
print "inside parsePdf\n";
open OUT, ">$fileout" or die "error: $!";
open IN, '-|', "getpdftext.pl $filein" or die "error :$!" ;
while(<IN>)
{
print "$i";
$i++;
print OUT;
}
}
It's highly likely that
Your CGI script's environment isn't complete enough to locate
getpdftext.pl and/or
The web-server user doesn't have permission to execute it anyway
Have a look in your web-server's error-log and see if it is reporting any pointers as to why this doesn't work.
In your particular case, it might be simpler and more direct to use CAM::PDF directly, which should have been installed along with getpdftext.pl anyway.
I had a look at this script and I think that your parsePdf sub could just as easily be written as:
#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;
sub parsePdf {
my $filein = "quote_uploads/$_[0]";
my $fileout = 'output.txt';
open my $out_fh, ">$fileout" or die "error: $!";
my $doc = CAM::PDF->new($filein) || die "$CAM::PDF::errstr\n";
my $i = 0;
foreach my $p ($doc->rangeToArray(1,$doc->numPages()))
{
my $str = $doc->getPageText($p);
if (defined $str)
{
CAM::PDF->asciify(\$str);
print $i++;
print $out_fh $str;
}
}
}

wget not working properly inside a Perl program

I am trying to download some xml files from a given URL. Below is the code which I have used for the same-
use strict;
use warnings;
my $url ='https://givenurl.com/';
my $username ='scott';
my $password='tiger';
system("wget --user=$username --password=$password $url") == 0 or die "system execution failed ($?): $!";
local $/ = undef;
open(FILE, "<index.html") or die "not able to open $!";
my $index = <FILE>;
my #childs = map /<a\s+href\=\"(AAA.*\.xml)\">/g , $index;
for my $xml (#childs)
{
system("wget --user=$username --password=$password $url/$xml");
}
But when I am running this, it gets stuck in the for-loop wget command. It seems wget is not able to fetch the files properly? Any clue or suggestion?
Thank you.
Man
You shouldn't use an external command in the first place.
Ensure that WWW::Mechanize is available, then use code like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
...
$mech->credentials($username, $password);
$mech->get($url);
foreach my $link ($mech->find_all_links(url_regex=>qr/\bAAA/)) {
$mech->get($link);
...
}
If $url or $xml contains any shell metacharacters (? and & are common ones in URLs) then you may need to either quote them properly
system("wget --user=$username --password=$password '$url/$xml'");
system qq(wget --user=$username --password=$password "$url/$xml");
or use the LIST form of system that bypasses the shell
system( 'wget', "--user=$username", "--password=$password", "$url/$xml");
to get the command to work properly.
maybe it's because the path to wget, what if you use:
system("/usr/bin/wget --user=$username --password=$password $url")
or I guess it can be a problem with variables passed to system: ($username, $password, $url)

How do I read multiple directories and read the contents of subdirectories in Perl?

I have a folder and inside that I have many subfolders. In those subfolders I have many .html files to be read. I have written the following code to do that. It opens the parent folder and also the first subfolder and it prints only one .html file. It shows error:
NO SUCH FILE OR DIRECTORY
I dont want to change the entire code. Any modifications in the existing code will be good for me.
use FileHandle;
opendir PAR_DIR,"D:\\PERL\\perl_programes\\parent_directory";
while (our $sub_folders = readdir(PAR_DIR))
{
next if(-d $sub_folders);
opendir SUB_DIR,"D:\\PERL\\perl_programes\\parent_directory\\$sub_folders";
while(our $file = readdir(SUB_DIR))
{
next if($file !~ m/\.html/i);
print_file_names($file);
}
close(FUNC_MODEL1);
}
close(FUNC_MODEL);
sub print_file_names()
{
my $fh1 = FileHandle->new("D:\\PERL\\perl_programes\\parent_directory\\$file")
or die "ERROR: $!"; #ERROR HERE
print("$file\n");
}
Your posted code looks way overcomplicated. Check out File::Find::Rule and you could do most of that heavy lifting in very little code.
use File::Find::Rule;
my $finder = File::Find::Rule->new()->name(qr/\.html?$/i)->start("D:/PERL/perl_programes/parent_directory");
while( my $file = $finder->match() ){
print "$file\n";
}
I mean isn't that sexy?!
A user commented that you may be wishing to use only Depth=2 entries.
use File::Find::Rule;
my $finder = File::Find::Rule->new()->name(qr/\.html?$/i)->mindepth(2)->maxdepth(2)->start("D:/PERL/perl_programes/parent_directory");
while( my $file = $finder->match() ){
print "$file\n";
}
Will Apply this restriction.
You're not extracting the supplied $file parameter in the print_file_names() function.
It should be:
sub print_file_names()
{
my $file = shift;
...
}
Your -d test in the outer loop looks wrong too, BTW. You're saying next if -d ... which means that it'll skip the inner loop for directories, which appears to be the complete opposite of what you require. The only reason it's working at all is because you're testing $file which is only the filename relative to the path, and not the full path name.
Note also:
Perl on Windows copes fine with / as a path separator
Set your parent directory once, and then derive other paths from that
Use opendir($scalar, $path) instead of opendir(DIR, $path)
nb: untested code follows:
use strict;
use warnings;
use FileHandle;
my $parent = "D:/PERL/perl_programes/parent_directory";
my ($par_dir, $sub_dir);
opendir($par_dir, $parent);
while (my $sub_folders = readdir($par_dir)) {
next if ($sub_folders =~ /^..?$/); # skip . and ..
my $path = $parent . '/' . $sub_folders;
next unless (-d $path); # skip anything that isn't a directory
opendir($sub_dir, $path);
while (my $file = readdir($sub_dir)) {
next unless $file =~ /\.html?$/i;
my $full_path = $path . '/' . $file;
print_file_names($full_path);
}
closedir($sub_dir);
}
closedir($par_dir);
sub print_file_names()
{
my $file = shift;
my $fh1 = FileHandle->new($file)
or die "ERROR: $!"; #ERROR HERE
print("$file\n");
}
Please start putting:
use strict;
use warnings;
at the top of all your scripts, it will help you avoid problems like this and make your code much more readable.
You can read more about it here: Perlmonks
You are going to need to change the entire code to make it robust:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
my $top = $ENV{TEMP};
find( { wanted => \&wanted, no_chdir=> 1 }, $top );
sub wanted {
return unless -f and /\.html$/i;
print $_, "\n";
}
__END__
Have you considered using
File::Find
Here's one method which does not require to use File::Find:
First open the root directory, and store all the sub-folders' names in an array by using readdir;
Then, use foreach loop. For each sub-folder, open the new directory by linking the root directory and the folder's name. Still use readdir to store the file names in an array.
The last step is to write the codes for processing the files inside this foreach loop.
Special thanks to my teacher who has given me this idea :) It really worked well!