Perl Script, Web Scraper - perl

I am new to the Perl language and have this script which scrapes the Amazon website for reviews. Every time I run it I get a compilation error. I was wondering if someone could shed some light on what's wrong with it.
#!/usr/bin/perl
# get_reviews.pl
#
# A script to scrape Amazon, retrieve reviews, and write to a file
# Usage: perl get_reviews.pl <asin>
use strict;
use warnings;
use LWP::Simple;
# Take the asin from the command-line
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed asin.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
#Remove everything before the reviews
$content =~ s!.*?Number of Reviews:!!ms;
# Loop through the HTML looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)[RETURN]
\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {
    my($rating,$title,$date,$reviewer,$review) = [RETURN]
        ($1||'',$2||'',$3||'',$4||'',$5||'');
    $reviewer =~ s!<.+?>!!g;    # drop all HTML tags
    $reviewer =~ s!\(.+?\)!!g;  # remove anything in parentheses
    $reviewer =~ s!\n!!g;       # remove newlines
    $review =~ s!<.+?>!!g;      # drop all HTML tags
    $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.
    # Print the results
    print "$title\n" . "$date\n" . "by $reviewer\n" .
          "$rating stars.\n\n" . "$review\n\n";
}

The syntax errors seem to be caused by the "[RETURN]" that appears twice in your code. When I removed those, the code compiled without problems.
Amazon doesn't really like people scraping its web site, which is why it provides an API that gives you access to its content. And there's a Perl module for using that API: Net::Amazon. You should use that instead of fragile web-scraping techniques.
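For example, here is a minimal sketch along the lines of Net::Amazon's synopsis. The token is a placeholder for your own AWS credentials (newer module versions may also want a secret key), and the ASIN stands in for whatever you'd otherwise pass on the command line:
use strict;
use warnings;
use Net::Amazon;

# 'YOUR_AMZN_TOKEN' is a placeholder for your own AWS access token.
my $ua = Net::Amazon->new( token => 'YOUR_AMZN_TOKEN' );

# Look the product up by its ASIN instead of scraping the detail page.
my $response = $ua->search( asin => '0596004478' );

if ( $response->is_success ) {
    print $response->as_string, "\n";
}
else {
    print "Error: ", $response->message, "\n";
}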

Maybe you should try Web::Scraper (http://metacpan.org/pod/Web::Scraper).
It will get the job done in a much cleaner way.
[EDIT] Anyway, I checked the HTML code of a random review, and it appears that your pattern is outdated. The reviewer's name, for instance, is introduced by 'By', not by 'Reviewer'.
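For what it's worth, here is a minimal Web::Scraper sketch. The CSS selectors are hypothetical placeholders; inspect the live page and substitute the ones that match Amazon's current markup:
use strict;
use warnings;
use URI;
use Web::Scraper;

my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
my $url  = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# 'div.review', 'span.title', etc. are made-up selectors; replace them
# with the real ones from the page you are scraping.
my $scraper = scraper {
    process 'div.review', 'reviews[]' => scraper {
        process 'span.title',  title  => 'TEXT';
        process 'span.date',   date   => 'TEXT';
        process 'div.content', review => 'TEXT';
    };
};

my $res = $scraper->scrape( URI->new($url) );
for my $r ( @{ $res->{reviews} || [] } ) {
    print "$r->{title}\n$r->{date}\n$r->{review}\n\n";
}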

Related

... architecting & implementing help w/ Perl & REST

I really just need some basic software engineering architecting guidance... When I say 'architecting', I mean pseudo-code or statements I can go after in the Perl book I have, and hopefully some online examples...
I'm using Perl to parse Medline/PubMed file paths on a Unix system, in order to pass the PMID from each path to the pmid2doi conversion website (http://www.pmid2doi.org/).
The structure of each link is a filepath of the form...
/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx/UNC00000000000042/00223468/v45i3/S0022346809003820
|<-PMID->|
where 00223468 is the PMID.
$ ls
18507872 main.pdf main.raw main.xml
So far, I've got something that looks like this (in bash):
for doi in `find . -name "*.xml" | awk -F\/ '{print $2}'`  # this extracts the PMID
do
    echo $doi
    wget pmid2doi website
done
Now I need help translating it to Perl.
The pmid2doi website requires inputting the PMID in order to get back the DOI.
This is what I need to get running in Perl, and I need a little help in designing and implementing it...
Given your development background, I would be grateful for any insights, or recommended sites for generating regular expressions in Perl.
thanks very much!
Additionally:
I see that the pmid2doi website says the REST API expects the website plus the DOI value.
So I need to find some example Perl code that gets a REST value from a URL.
In REST I just prepare the URL as specified and then the returned result should be the value I want.
Something like this maybe?
use strict;
use warnings;
use Carp;
use File::Find;
use REST::Client;

my $base   = 'http://www.pmid2doi.org';    # base URL of the REST service
my $client = REST::Client->new();

File::Find::find( sub {
    return unless m/\.xml$/;
    open( my $fh, '<', $File::Find::name )
        or do { carp "Could not open $File::Find::name!"; return };
    my $doi;
    while ( <$fh> ) {
        # Pull the id out of the path-like line.
        next unless ( $doi ) = m{[^/]*/([^/]*)};
        $client->GET( join( '/', $base, $doi ) );
        # do_stuff_with_content() is a placeholder for your own handler.
        do_stuff_with_content( $client->responseContent );
    }
    close $fh;
} => '.' );
Am I misunderstanding things, or could you just do this in Perl locally and not bother with the website? No need to hit up a remote host and deal with latency, usage limits, etc...
https://metacpan.org/module/Bio::DB::EUtilities
Edit: I am, that's a CGI client for NCBI. From the docs: 'A separate project, currently present in BioPerl-Run, utilizes the SOAP-based interface for eUtils.'
https://bitbucket.org/reece/docdepot/src/ca32360f6fa4/archive/perl-version/bin/doi-to-pmid
#!/usr/bin/perl
# doi-to-pmid -- get the PubMed ids for a doi
use strict;
use warnings;
use Bio::DB::EUtilities;
my @pmids = doi_to_pmids( $ARGV[0] );
exit 1 unless (@pmids);
print( join("\n", @pmids), "\n" );
exit;
sub doi_to_pmids {
    my $doi = $_[0];
    my $f = Bio::DB::EUtilities->new( -eutil => 'esearch',
                                      -db    => 'pubmed',
                                      -term  => "$doi [doi]" );
    my @ids = $f->get_ids();
    return (@ids);
}

Strange behavior using POST data in perl scripts

Server is linux. I am having inexplicable problems when I send POST data to the script.
For example, I send the following POST data: choice=update
Here is the script:
#!/usr/bin/perl -w
print "Content-type: text/html\n\n";
if ( $ENV{'REQUEST_METHOD'} eq "GET" ) {
    $in = $ENV{'QUERY_STRING'};
} elsif ( $ENV{'REQUEST_METHOD'} eq "POST" ) {
    read( STDIN, $in, $ENV{'CONTENT_LENGTH'} );
}
@in = split( /&/, $in );
foreach $i ( 0 .. $#in ) {
    # Convert plus signs to spaces
    $in[$i] =~ s/\+/ /g;
    # Split into key and value.
    ( $key, $val ) = split( /=/, $in[$i], 2 );    # splits on the first =.
    # Convert %XX from hex numbers to alphanumeric
    $key =~ s/%(..)/pack("c",hex($1))/ge;
    $val =~ s/%(..)/pack("c",hex($1))/ge;
    # Associate key and value
    $in{$key} .= "\0" if ( defined( $in{$key} ) );    # \0 is the multiple separator
    $in{$key} .= $val;
}
print $in{'choice'};
The first time I access the script, it prints update
The second time I access it, it prints updateupdate
The third time, it prints updateupdateupdate
...and so on.
What on earth could be causing it to keep appending the string to itself between requests? I am sending exactly the same POST data every time by simply refreshing with my browser. Cookies are not being used. There is nothing else in the file that is not commented out.
Edit: Also, when I print <STDIN> it says choice=update every time. The other updates don't appear to be added to STDIN
My guess is that the script is kept running between requests. Since %in is a global variable, it is never cleared, so $in{$key} .= $val ends up making the string longer and longer. You can probably avoid the problem by using lexical variables.
This means you'll need to find out how the script is being run by the web server.
You'll also want to look at using modules to do all this parsing work for you, and learn about ways to write Perl code that avoid the problem you've encountered. I'd suggest taking a look at Modern Perl and working from there.
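For example, a minimal sketch using the CGI module, which parses GET and POST parameters for you and keeps no state between requests:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# A fresh CGI object is built for each request, so nothing can
# accumulate across requests the way the global %in does.
my $q = CGI->new;

print $q->header('text/html');
print $q->param('choice') // '';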
It sounds / looks like it's related to the web server's configuration and not the script itself.
However, at the beginning of the code, try adding:
my %in;
This would scope the variable you're printing.
Also, at the end of the code I would add: exit 0;
(Although usually not necessary).

Perl script to automate a website for bioinformatics

I would like to automate this website with a Perl script
http://bioinfo.uni-plovdiv.bg/microinspector/
This is what I have so far. I am not sure how to get to the output page after this; I know it has something to do with POST, redirect_ok, or response(), but I am not sure. I read through the documentation but am confused about some things. Thanks.
use strict;
use warnings;
use WWW::Mechanize;
# create object for browser
my $browser = WWW::Mechanize->new();
my ($sequence, $results);
open (DRG, "<microRNA_target_cspg_drg_output.fa") || die "cannot open microRNA_target_cspg_drg_output.fa";
while (<DRG>) {
    chomp;
    $sequence = $_;
    last;    # for testing purposes
}
close (DRG);
$browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");
$browser->form_number(1);
$browser->field("target_sequence", $sequence);
$browser->field("Choose an organism : ", "Mus musculus");
$browser->click_button( number => 1);
You should start with WWW::Mechanize. Its documentation page provides examples of submitting forms, along with anything else you will need.
EDIT: as a reply to your update, if you want to get the content of the page, use the content method, like in this example:
my $content = $browser->content();
See the WWW::Mechanize documentation for more info.
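Putting it together, here is a sketch of the submit-and-read flow. The field name target_sequence comes from your script and is an assumption; dump the forms first to confirm the real names (including the organism select):
use strict;
use warnings;
use WWW::Mechanize;

my $browser = WWW::Mechanize->new();
$browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");

# Print every form on the page to verify the actual field names.
print $_->dump for $browser->forms;

# 'target_sequence' is taken from the original script; adjust as needed,
# and add the organism field under whatever name the dump shows.
$browser->submit_form(
    form_number => 1,
    fields      => { target_sequence => 'ACGUACGUACGU' },
);

die "Submission failed: ", $browser->status, "\n" unless $browser->success;

# The HTML of the results page:
my $content = $browser->content();
print $content;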

Subroutines vs scripts in Perl

I'm fairly new to Perl and was wondering what the best practices regarding subroutines are with Perl. Can a subroutine be too big?
I'm working on a script right now, and it might need to call another script. Should I just integrate the old script into the new one in the form of a subroutine? I need to pass one argument to the script and need one return value.
I'm guessing I'd have to do some sort of black magic to get the output from the original script, so turning it into a subroutine makes sense, right?
Avoiding "black magic" is always a good idea when writing code. You never want to jump through hoops and come up with an unintuitive hack to solve a problem, especially if that code needs to be supported later. It happens, admittedly, and we're all guilty of it. Circumstances can weigh heavily on "just getting the darn thing to work."
The point is, the best practice is always to make the code clean and understandable. Remember, and this is especially true with Perl code in my experience, any code you wrote yourself more than a few months ago may as well have been written by someone else. So even if you're the only one who needs to support it, do yourself a favor and make it easy to read.
Don't cling to broad sweeping ideas like "favor more files over larger files" or "favor smaller methods/subroutines over larger ones" etc. Those are good guidelines to be sure, but apply the spirit of the guideline rather than the letter of it. Keep the code clean, understandable, and maintainable. If that means the occasional large file or large method/subroutine, so be it. As long as it makes sense.
A key design goal is separation of concerns. Ideally, each subroutine performs a single well-defined task. In this light, the main question revolves not around a subroutine's size but its focus. If your program requires multiple tasks, that implies multiple subroutines.
In more complex scenarios, you may end up with groups of subroutines that logically belong together. They can be organized into libraries or, even better, modules. If possible, you want to avoid a scenario where you end up with multiple scripts that need to communicate with each other, because the usual mechanism for one script to return data to another script is tedious: the first script writes to standard output and the second script must parse that output.
Several years ago I started work at a job requiring that I build a large number of command-line scripts (at least, that's how it turned out; in the beginning, it wasn't clear what we were building). I was quite inexperienced at the time and did not organize the code very well. In hindsight, I should have worked from the premise that I was writing modules rather than scripts. In other words, the real work would have been done by modules, and the scripts (the code executed by a user on the command line) would have remained very small front-ends to invoke the modules in various ways. This would have facilitated code reuse and all of that good stuff. Live and learn, right?
Another option that hasn't been mentioned yet for reusing the code in your scripts is to put common code in a module. If you put shared subroutines into a module or modules, you can keep your scripts short and focused on what they do that is special, while isolating the common code in an easy-to-access, reusable form.
For example, here is a module with a few subroutines. Put this in a file called MyModule.pm:
package MyModule;

# Always do this:
use strict;
use warnings;

use IO::Handle;           # For OOP filehandle stuff.
use Exporter qw(import);  # This lets us export subroutines to other scripts.

# These may be exported.
our @EXPORT_OK = qw( gather_data_from_fh open_data_file );

# Automatically export everything allowed.
# Generally best to leave empty, but in some cases it makes
# sense to export a small number of subroutines automatically.
our @EXPORT = @EXPORT_OK;

# Array of directories to search for files.
our @SEARCH_PATH;

# Parse the contents of an IO::Handle object and return structured data
sub gather_data_from_fh {
    my $fh = shift;

    my %data;

    while ( my $line = $fh->readline ) {
        # Parse the line: the first whitespace-separated field is the
        # key, the remaining fields are the values.
        chomp $line;
        my ( $key, @values ) = split ' ', $line;
        $data{$key} = \@values;
    }

    return \%data;
}

# Search a list of directories for a file with a matching name.
# Open it and return a handle if found.
# Die otherwise
sub open_data_file {
    my $file_name = shift;

    for my $path ( @SEARCH_PATH, '.' ) {
        my $file_path = "$path/$file_name";
        next unless -e $file_path;

        open my $fh, '<', $file_path
            or die "Error opening '$file_path' - $!\n";

        return $fh;
    }

    die "No matching file found in path\n";
}

1; # Need to have trailing TRUE value at end of module.
Now in script A, we take a filename to search for and process, and then print formatted output:
use strict;
use warnings;
use MyModule;
# Configure which directories to search
@MyModule::SEARCH_PATH = qw( /foo/foo/rah /bar/bar/bar /eeenie/meenie/mynie/moe );
#get file name from args.
my $name = shift;
my $fh = open_data_file($name);
my $data = gather_data_from_fh($fh);
for my $key ( sort keys %$data ) {
    print "$key -> ", join ', ', @{ $data->{$key} };
    print "\n";
}
Script B searches for a file, parses it, and then writes the parsed data structure into a YAML file.
use strict;
use warnings;
use MyModule;
use YAML qw( DumpFile );
# Configure which directories to search
@MyModule::SEARCH_PATH = qw( /da/da/da/dum /tutti/frutti/unruly /cheese/burger );
#get file names from args.
my $infile = shift;
my $outfile = shift;
my $fh = open_data_file($infile);
my $data = gather_data_from_fh($fh);
DumpFile( $outfile, $data );
Some related documentation:
perlmod - About Perl modules in general
perlmodstyle - Perl module style guide; this has very useful info.
perlnewmod - Starting a new module
Exporter - The module used to export functions in the sample code
use - the perlfunc article on use.
Some of these docs assume you will be sharing your code on CPAN. If you won't be publishing to CPAN, simply ignore the parts about signing up and uploading code.
Even if you aren't writing for CPAN, it is beneficial to use the standard tools and CPAN file structure for your module development. Following the standard allows you to use all of the tools CPAN authors use to simplify the development, testing and installation process.
I know that all this seems really complicated, but the standard tools make each step easy. Even adding unit tests to your module distribution is easy thanks to the great tools available. The payoff is huge, and well worth the time you will invest.
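For example, Module::Starter is one of those standard tools; it scaffolds a conventional distribution layout (the module name and author details below are placeholders):
$ module-starter --module=MyModule --author="Your Name" --email=you@example.com
That generates the lib/, t/, and Makefile.PL skeleton that the usual perl Makefile.PL && make test workflow expects, so testing and installation work the same way they do for CPAN modules.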
Sometimes it makes sense to have a separate script, sometimes it doesn't. The "black magic" isn't that complicated.
#!/usr/bin/perl
# square.pl
use strict;
use warnings;
my $input = shift;
print $input ** 2;
#!/usr/bin/perl
# sum_of_squares.pl
use strict;
use warnings;
my ($from, $to) = @ARGV;
my $sum;
for my $num ( $from .. $to ) {
    $sum += `square.pl $num` // die "square.pl failed: $? $!";
}
print $sum, "\n";
Easier and better error reporting on failure is automatic with IPC::System::Simple:
#!/usr/bin/perl
# sum_of_squares.pl
use strict;
use warnings;
use IPC::System::Simple 'capture';
my ($from, $to) = @ARGV;
my $sum;
for my $num ( $from .. $to ) {
    $sum += capture( "square.pl $num" );
}
print $sum, "\n";

Why does my Perl CGI program fail when I include a file?

I'm trying to create a base template which then loads data depending on what actions are taken. I included (required) some pages, which was fine, but when I included another file I got a 500 internal error. Pasting the code straight in instead works fine.
Here's what I've got;
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
require LWP::UserAgent;
use DBI;
#deal with post requests
require "perl/post-sort.pl";
#loading stylesheets and javascripts
require "header.pl";
# bring in login js
if ( $arg{REQUEST_KEY} eq "" ) {
    require "javascript/js-main-login.pl";
}
print "</head> \n";
print " \n";
...
...
perl/post-sort.pl
my %arg = ();
for ( split /&/, <STDIN> ) {
    my ( $key, $val ) = split /=/;
    $val =~ s/\+/ /g;
    $val =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    $arg{$key} = $val;
}
Any help much appreciated.
A 500 internal server error often indicates a bad or missing header. Make sure that in the included code, the first thing that gets printed (to the browser) is the header, or make sure that nothing gets printed and the original code will print out the right header.
Another possibility is that a file you are require'ing does not "return true as the last statement" (i.e., doesn't end with a 1;), which would cause your script to fail at compile time and produce a 500 error.
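For example, perl/post-sort.pl would need to end with a true value:
# perl/post-sort.pl
my %arg = ();
for ( split /&/, <STDIN> ) {
    my ( $key, $val ) = split /=/;
    $val =~ s/\+/ /g;
    $val =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;
    $arg{$key} = $val;
}

1;    # required files must end with a true value
Note also that %arg is declared there with my, which scopes it to the required file; under use strict, $arg{REQUEST_KEY} in the main script then refers to an undeclared variable and won't compile. Declaring it with our (or passing the data back another way) avoids that.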
Also see this apropos discussion on debugging CGI scripts from earlier today.