Architecting & implementing help w/ Perl & REST

I really just need some basic software engineering architecting guidance... When I say 'architecting', I mean pseudo-code or statements I can go after in the Perl book I have, and hopefully some online examples...
I'm using Perl, and I'm trying to parse Medline/Pubmed file paths on a Unix system so that I can eventually pass the PMID from each path to the pmid2doi conversion website (http://www.pmid2doi.org/).
The structure of each link is a filepath of the form...
/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx/UNC00000000000042/00223468/v45i3/S0022346809003820
                                                 |<-PMID->|
where 00223468 is the PMID.
$ ls
18507872 main.pdf main.raw main.xml
So far, I've got something that looks like this (in bash):
for doi in `find . -name "*.xml" | awk -F\/ '{print $2}'`   # this extracts the PMID
do
    echo $doi
    wget pmid2doi website
done
Now I need help translating it to Perl.
The pmid2doi website requires inputting the PMID in order to get back the DOI.
This is what I need to get running in Perl, and I need a little help in designing and implementing it...
Given your development background, I would be grateful for any insights, or recommended sites for generating regular expressions in Perl.
thanks very much!
Additionally:
I see that the pmid2doi website says its REST API expects the base URL plus the PMID value.
So I need to find some example Perl code that gets a REST value from a URL.
In REST I just prepare the URL as specified and then the returned result should be the value I want.
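As a first cut, the whole flow might look something like this minimal sketch in plain Perl (the endpoint path after the hostname is an assumption here; check the site's API documentation for the real one):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical REST endpoint; the real pmid2doi URL layout may differ.
my $base = 'http://www.pmid2doi.org/rest/json/doi';

for my $path (@ARGV) {
    # Pull the 8-digit PMID component (as in the sample path) out of the file path.
    my ($pmid) = $path =~ m{/(\d{8})/};
    next unless defined $pmid;
    my $response = get("$base/$pmid");
    print defined $response ? "$pmid => $response\n"
                            : "$pmid => lookup failed\n";
}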

Something like this maybe?
use strict;
use warnings;
use Carp;
use File::Find;
use REST::Client;

my $base   = 'http://www.pmid2doi.org';   # base URL for the service
my $client = REST::Client->new();

File::Find::find( sub {
    return unless m/\.xml$/;
    open( my $fh, '<', $File::Find::name )
        or do { carp "Could not open $File::Find::name!"; return };
    my $doi;
    while ( <$fh> ) {
        # $doi here is really the PMID component being sent to the service
        next unless ( $doi ) = m{[^/]*/([^/]*)};
        $client->GET( join( '/', $base, $doi ) );
        do_stuff_with_content( $client->responseContent );
    }
    close $fh;
} => '.' );

Am I misunderstanding things, or could you just do this in Perl locally and not bother with the website? No need to hit up a remote host and deal with latency, usage limits, etc...
https://metacpan.org/module/Bio::DB::EUtilities
Edit: I am, that's a CGI client for NCBI. From the docs: 'A separate project, currently present in BioPerl-Run, utilizes the SOAP-based interface for eUtils.'
https://bitbucket.org/reece/docdepot/src/ca32360f6fa4/archive/perl-version/bin/doi-to-pmid
#!/usr/bin/perl
# doi-to-pmid -- get the PubMed ids for a doi
use strict;
use warnings;
use Bio::DB::EUtilities;
my @pmids = doi_to_pmids( $ARGV[0] );
exit 1 unless (@pmids);
print( join("\n", @pmids), "\n" );
exit;

sub doi_to_pmids {
    my $doi = $_[0];
    my $f = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                     -db    => 'pubmed',
                                     -term  => "$doi [doi]");
    my @ids = $f->get_ids();
    return (@ids);
}
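Note that the script above goes in the doi-to-pmid direction, while the question needs the reverse. A minimal sketch of a pmid-to-doi lookup with the same module might look like this (untested; the 'DOI' item name in the DocSum is an assumption, so dump the esummary output to verify the exact field):
#!/usr/bin/perl
# pmid-to-doi -- look up the DOI for a PubMed id (sketch)
use strict;
use warnings;
use Bio::DB::EUtilities;

my $pmid = $ARGV[0] or die "Usage: pmid-to-doi <pmid>\n";

my $eutil = Bio::DB::EUtilities->new(-eutil => 'esummary',
                                     -db    => 'pubmed',
                                     -id    => $pmid);

while (my $docsum = $eutil->next_DocSum) {
    # 'DOI' is assumed to be the DocSum item name; verify against real output.
    my ($doi) = $docsum->get_contents_by_name('DOI');
    print "$doi\n" if defined $doi;
}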

Related

Perl Script, Web Scraper

I am new to the Perl language and have this script which scrapes the Amazon website for reviews. Every time I run it I get a compilation error. I was wondering if someone could shed some light as to what's wrong with it.
#!/usr/bin/perl
# get_reviews.pl
#
# A script to scrape Amazon, retrieve reviews, and write to a file
# Usage: perl get_reviews.pl <asin>
use strict;
use warnings;
use LWP::Simple;
# Take the asin from the command-line
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed asin.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
#Remove everything before the reviews
$content =~ s!.*?Number of Reviews:!!ms;
# Loop through the HTML looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)[RETURN]
\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {
my($rating,$title,$date,$reviewer,$review) = [RETURN]
($1||'',$2||'',$3||'',$4||'',$5||'');
$reviewer =~ s!<.+?>!!g; # drop all HTML tags
$reviewer =~ s!\(.+?\)!!g; # remove anything in parenthesis
$reviewer =~ s!\n!!g; # remove newlines
$review =~ s!<.+?>!!g; # drop all HTML tags
$review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.
# Print the results
print "$title\n" . "$date\n" . "by $reviewer\n" .
"$rating stars.\n\n" . "$review\n\n";
}
The syntax errors seem to be caused by the "[RETURN]" that appears twice in your code. When I removed those, the code compiled without problems.
Amazon don't really like people scraping their web site, which is why they provide an API that gives you access to their content. And there's a Perl module for using that API: Net::Amazon. You should use that instead of fragile web-scraping techniques.
Maybe you should try Web::Scraper (http://metacpan.org/pod/Web::Scraper).
It will get the job done in a much cleaner way.
[EDIT] Anyway, I checked the HTML code of a random review and it appears that your pattern is outdated. The reviewer's name, for instance, is introduced by 'By' and not by 'Reviewer'.
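For illustration, a Web::Scraper version might look like the sketch below; the CSS selectors are assumptions and would need to be matched against Amazon's current review markup:
use strict;
use warnings;
use URI;
use Web::Scraper;

# The selectors below are hypothetical; adjust them to the page's real markup.
my $reviews = scraper {
    process 'div.review', 'reviews[]' => scraper {
        process 'a.review-title',   title => 'TEXT';
        process 'span.review-date', date  => 'TEXT';
        process 'span.review-body', body  => 'TEXT';
    };
};

my $res = $reviews->scrape( URI->new("http://amazon.com/o/tg/detail/-/$ARGV[0]/") );
for my $r ( @{ $res->{reviews} || [] } ) {
    print "$r->{title} ($r->{date})\n$r->{body}\n\n";
}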

Copy the last modified Dir from one location to another using Perl

Fairly new to Perl, so this is most likely not the best code, which is why I am posting. I got this to work but was wondering if there is a better way. I do not have the ability to download modules. I am copying the last modified directory in a build folder from one server to another server. The argument lets me choose which build directory to copy from.
Thanks
#!C:\strawberry\perl
use warnings;
use strict;
use File::Copy::Recursive;
my $NewFolder = `(dir /o-d/ad/b \\\\myserver1.name.com\\builds\\$ARGV[0] | head -1)`;
chomp($NewFolder);
my $dir1 = "\\\\myserver1.name.com\\builds\\$ARGV[0]/$NewFolder";
my $dir2 = "\\\\myserver2.name.com\\builds\\$ARGV[0]/Backup/$NewFolder";
File::Copy::Recursive::dircopy $dir1, $dir2 or die "Copy failed: $!";
Use forward slashes. It just makes your code easier to read:
$dir1 = "\\\\myserver1.name.com\\builds\\$ARGV[0]/$NewFolder";
vs.
$dir1 = "//myserver1.name.com/builds/$ARGV[0]/$NewFolder";
Also, don't make system calls where Perl can do the job itself. For example, Perl can see the last modification date of a file via the stat function. Even better is the File::stat module, which makes stat much easier to use.
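A quick sketch of what File::stat buys you (the file name is just a placeholder):
use File::stat;

my $st = stat('some_file') or die "stat failed: $!";
print "last modified: ", scalar localtime $st->mtime, "\n";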
Don't use @ARGV throughout your programs. Instead, read the values from @ARGV into your own variables. It makes your program easier to understand, and your own variables have limited scope while @ARGV is global.
Use modern conventions. Variable names should be in all lower case, and use underscores to separate out words. That is $new_folder vs. $NewFolder. Is this arbitrary? Yes, but it's a convention followed by most Perl developers. It means not wondering if the variable is $newFolder, $NewFolder, or $newfolder because you know by these rules it is $new_folder.
And finally, use autodie;. This will kill your program whenever a file operation fails, turning Perl from a language where you must check each IO call for errors into one that raises exceptions for you. This way, you don't have to worry about whether you remembered to check for a failed IO operation.
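For instance:
use autodie;

# No need for "or die ..." -- a failed open now throws an exception
# with a descriptive message automatically.
open my $fh, '<', 'missing.txt';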
Here's a completely untested example:
use strict;
use warnings;
use autodie;
use File::Copy::Recursive qw(dircopy);  # Optional module
use File::stat;
use constant {
    ORIG_SERVER => '//myserver1.name.com/builds',
    TO_SERVER   => '//myserver2.name.com/builds',
};

my $from_directory = shift;

#
# Find newest directory
#
opendir my $dir_fh, ORIG_SERVER . "/$from_directory";
my $newest_directory;
while ( my $sub_directory = readdir $dir_fh ) {
    next if $sub_directory eq "." or $sub_directory eq "..";
    next unless -d ORIG_SERVER . "/$from_directory/$sub_directory";
    if ( not defined $newest_directory ) {
        $newest_directory = $sub_directory;
        next;
    }
    my $newest_stat = stat ORIG_SERVER . "/$from_directory/$newest_directory";
    my $sub_stat    = stat ORIG_SERVER . "/$from_directory/$sub_directory";
    if ( $sub_stat->mtime > $newest_stat->mtime ) {
        $newest_directory = $sub_directory;
    }
}
closedir $dir_fh;

dircopy( ORIG_SERVER . "/$from_directory/$newest_directory",
         TO_SERVER   . "/$from_directory/Backup/$newest_directory" )
    or die "Copy failed: $!";
My program is a lot longer than yours because your program depended upon various operating system commands, like dir and head (which I don't believe is a standard Windows command). Instead, I read each entry under that directory in my loop. Anything that's not a directory I toss (next unless -d ...), and I also toss out the special directories . and ..
After that, I use stat to find the newest directory, which to me means the one with the newest modification time. Note that Unix doesn't store creation time. However, according to perlport, ctime is creation time on Win32, so you might prefer that instead of mtime.
If I didn't use File::stat, instead of this:
my $newest_stat = stat ORIG_SERVER . "/$from_directory/$newest_directory";
my $sub_stat    = stat ORIG_SERVER . "/$from_directory/$sub_directory";
if ( $sub_stat->mtime > $newest_stat->mtime ) {
    $newest_directory = $sub_directory;
}
I could have done this:
my $newest  = ORIG_SERVER . "/$from_directory/$newest_directory";
my $sub_dir = ORIG_SERVER . "/$from_directory/$sub_directory";
if ( (stat $sub_dir)[9] > (stat $newest)[9] ) {
    $newest_directory = $sub_directory;
}
The stat command without File::stat returns an array of values, and I could have simply used element [9] of that array. However, what is 9? Even though it could have saved me a few lines of code and an extra Perl module, it's better to use File::stat.
One thing you'll notice is that constants don't interpolate, which means I have to keep doing things like this:
my $newest_stat = stat ORIG_SERVER . "/$from_directory/$newest_directory";
However, you can use this bit of Perlish black magic to interpolate constants inside quotes:
my $newest_stat = stat "@{[ORIG_SERVER]}/$from_directory/$newest_directory";
Hope that helps.

CGI script can't create file

I have the following CGI script that launches a module that creates a PNG file and then shows it.
#!/usr/bin/perl
use RRDs;
use CGI;

main:
{
    my $cgi = new CGI;
    my $filename = $cgi->param('filename');
    print "Content-type: text/html\n\n";
    my $curr_time  = time();
    my $start_time = $curr_time - 3600;
    RRDs::graph("$filename", "--start", "$start_time",
                "DEF:DiskC=c.rrd:DiskC:AVERAGE", "AREA:DiskC#0000FF");
    my $err = RRDs::error();
    print "<HTML><HEAD><TITLE>Disk C Utilization</TITLE></HEAD><BODY>" .
          "<H1>Disk C Utilization</H1><BR>$err" .
          "<img src=\"$filename\"/></BODY></HTML>";
}
The graph method says it can't create the PNG file. If I run this script on the command line it works fine, so I think it's a matter of permissions. I already set chmod 755 on the cgi-script folder. What do you suggest? Is this related to Apache2 settings?
Um, check the logs :) CGI Help Guide
$filename is not the filename that you want to use; it can be anything the browser sends, even F:/bar/bar/bar/bar/bar/bar/bar/UHOH.png.
It's unlikely that F:/bar/bar/bar/bar/bar/bar/bar/UHOH.png exists on your server.
You want to generate a filename, maybe like this
sub WashFilename {
    use File::Basename;
    my $basename = basename( shift );
    # untainted: only keep a-z A-Z 0-9 and dot
    $basename = join '', $basename =~ m/([.a-zA-Z0-9])/g;
    # basename is now, hopefully, file.ext
    ## so to ensure uniqueness, we adulterate it :)
    my $id = $$ . '-' . time;
    my( $file, $ext ) = split /\./, $basename, 2;
    return join '.', grep defined, $file, $id, $ext;
} ## end sub WashFilename
You also want to place the file in a directory of your webserver that will serve images (NOT cgi-bin)
And you also want to use File::Type::WebImages to determine web image file types using magic, and make sure the filename has the appropriate extension, so your server will send the appropriate headers.
And don't forget to chmod the file appropriately
chmod 777 solved my problem!
Don't store a document to file unless you have a reason to: it's slow and introduces all kinds of potential hazards.
You don't appear to reuse the same image on different requests, which might be such a reason.
Instead, teach your CGI script to output the image directly, depending on how it is called (or write a second one that does).
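As a sketch of that last suggestion: rrdtool can write the image to stdout when "-" is given as the output filename (an assumption worth verifying against your RRDs build), so the script can serve the PNG directly:
#!/usr/bin/perl
use strict;
use warnings;
use RRDs;

# Emit image headers, then let rrdtool write the PNG straight to stdout
# by passing "-" as the output filename.
print "Content-type: image/png\n\n";
binmode STDOUT;

my $start_time = time() - 3600;
RRDs::graph("-", "--start", "$start_time",
            "DEF:DiskC=c.rrd:DiskC:AVERAGE", "AREA:DiskC#0000FF");
my $err = RRDs::error();
warn "RRDs::graph failed: $err\n" if $err;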

How can I download link targets from a web site using Perl?

I just made a script that grabs links from a website and saves them into a text file.
Now I'm working on my regexes so it will grab links which contain php?dl= in the URL from the text file:
E.g.: www.example.com/site/admin/a_files.php?dl=33931
It's pretty much the address you get when you hover over the dl button on the site, from which you can click to download or "right click save".
I'm just wondering how to achieve this: fetching the content at each of those addresses, which will download a *.txt file, all from the script of course.
Make WWW::Mechanize your new best friend.
Here's why:
It can identify links on a webpage that match a specific regex (/php\?dl=/ in this case)
It can follow those links through the follow_link method
It can get the targets of those links and save them to file
All this without needing to save your wanted links in an intermediate file! Life's sweet when you have the right tool for the job...
Example
use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://www.example.com/';
my $mech = WWW::Mechanize->new();
$mech->get( $url );
my @linksOfInterest = $mech->find_all_links( url_regex => qr/php\?dl=/ );
my $fileNumber = 1;

foreach my $link (@linksOfInterest) {
    $mech->get( $link, ':content_file' => 'file' . $fileNumber++ . '.txt' );
    $mech->back();
}
You can download the file with LWP::UserAgent:
my $ua = LWP::UserAgent->new();
my $response = $ua->get($url, ':content_file' => 'file.txt');
Or if you need a filehandle:
open my $fh, '<', $response->content_ref or die $!;
Old question, but when I'm doing quick scripts, I often use "wget" or "curl" and pipe. This isn't cross-system portable, perhaps, but if I know my system has one or the other of these commands, it's generally good.
For example:
#!/usr/bin/env perl
use strict;
use warnings;

open my $fp, '-|', 'curl', 'http://www.example.com/' or die "curl failed: $!";
while (<$fp>) {
    print;
}
close $fp;

How can I hook into Perl's print?

Here's a scenario. You have a large amount of legacy scripts, all using a common library. Said scripts use the 'print' statement for diagnostic output. No changes are allowed to the scripts - they range far and wide, have their approvals, and have long since left the fruitful valleys of oversight and control.
Now a new need has arrived: logging must be added to the library. This must be done automatically and transparently, without users of the standard library needing to change their scripts. Common library methods can simply have logging calls added to them; that's the easy part. The hard part lies in the fact that diagnostic output from these scripts was always displayed using the 'print' statement. This diagnostic output must be stored, but just as importantly, processed.
As an example of this processing, the library should only record the printed lines that contain the words 'warning', 'error', 'notice', or 'attention'. The below Extremely Trivial and Contrived Example Code (tm) would record some of said output:
sub CheckPrintOutput
{
    my @output = @_;   # args passed to print eventually find their way here.
    foreach my $value (@output) {
        Log->log($value) if $value =~ /warning|error|notice|attention/i;
    }
}
(I'd like to avoid such issues as 'what should actually be logged', 'print shouldn't be used for diagnostics', 'perl sucks', or 'this example has the flaws x y and z'...this is greatly simplified for brevity and clarity. )
The basic problem comes down to capturing and processing data passed to print (or any perl builtin, along those lines of reasoning). Is it possible? Is there any way to do it cleanly? Are there any logging modules that have hooks to let you do it? Or is it something that should be avoided like the plague, and I should give up on ever capturing and processing the printed output?
Additional: This must run cross-platform - windows and *nix alike. The process of running the scripts must remain the same, as must the output from the script.
Additional additional: An interesting suggestion made in the comments of codelogic's answer:
You can subclass http://perldoc.perl.org/IO/Handle.html and create your
own file handle which will do the logging work. – Kamil Kisiel
This might do it, with two caveats:
1) I'd need a way to export this functionality to anyone who uses the common library. It would have to apply automatically to STDOUT and probably STDERR too.
2) the IO::Handle documentation says that you can't subclass it, and my attempts so far have been fruitless. Is there anything special needed to make subclassing IO::Handle work? The standard use base 'IO::Handle' and then overriding the new/print methods seems to do nothing.
Final edit: Looks like IO::Handle is a dead end, but Tie::Handle may do it. Thanks for all the suggestions; they're all really good. I'm going to give the Tie::Handle route a try. If it causes problems I'll be back!
Addendum: Note that after working with this a bit, I found that Tie::Handle will work, if you don't do anything tricky. If you use any of the features of IO::Handle with your tied STDOUT or STDERR, it's basically a crapshoot to get them working reliably - I could not find a way to get the autoflush method of IO::Handle to work on my tied handle. If I enabled autoflush before I tied the handle it would work. If that works for you, the Tie::Handle route may be acceptable.
There are a number of built-ins that you can override (see perlsub). However, print is one of the built-ins that doesn't work this way. The difficulties of overriding print are detailed in this PerlMonks thread.
However, you can
Create a package
Tie a handle
Select this handle.
Now, a couple of people have given the basic framework, but it works out kind of like this:
package IO::Override;
use base qw<Tie::Handle>;
use Symbol qw<geniosym>;

sub TIEHANDLE { return bless geniosym, __PACKAGE__ }

sub PRINT {
    shift;
    # You can do pretty much anything you want here.
    # And it's printing to what was STDOUT at the start.
    #
    print $OLD_STDOUT join( '', 'NOTICE: ', @_ );
}

tie *PRINTOUT, 'IO::Override';
our $OLD_STDOUT = select( *PRINTOUT );
You can override printf in the same manner:
sub PRINTF {
    shift;
    # You can do pretty much anything you want here.
    # And it's printing to what was STDOUT at the start.
    #
    my $format = shift;
    print $OLD_STDOUT join( '', 'NOTICE: ', sprintf( $format, @_ ) );
}
See Tie::Handle for what all you can override of STDOUT's behavior.
You can use Perl's select to redirect STDOUT.
open my $fh, '>', 'log.txt' or die $!;
print "test1\n";
my $current_fh = select $fh;
print "test2\n";
select $current_fh;
print "test3\n";
The file handle could be anything, even a pipe to another process that post processes your log messages.
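For example, a quick sketch that routes all default print output through an external filter (the grep command is just illustrative, and not portable to Windows):
# Route everything printed via the default handle through grep.
open my $pipe, '|-', 'grep', '-i', 'error' or die "Can't spawn grep: $!";
select $pipe;
print "this line only survives if it mentions an ERROR\n";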
PerlIO::tee in the PerlIO::Util module seems to allow you to 'tee' the output of a file handle to multiple destinations (e.g. log processor and STDOUT).
Lots of choices. Use select() to change the filehandle that print defaults to. Or tie STDOUT. Or reopen it. Or apply an IO layer to it.
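As a sketch of the IO-layer route, a minimal PerlIO::via layer might look like this (untested; the Log->log call is the hypothetical logger from the question, and the PerlIO::via docs describe the full interface):
package PerlIO::via::LogFilter;

# Minimal PerlIO::via layer: pass data through unchanged while logging
# anything that looks like a diagnostic.
sub PUSHED {
    my ($class) = @_;
    return bless {}, $class;
}

sub WRITE {
    my ($self, $buf, $fh) = @_;
    Log->log($buf) if $buf =~ /warning|error|notice|attention/i;
    print $fh $buf;        # pass through to the layer below
    return length $buf;
}

package main;
binmode STDOUT, ':via(LogFilter)';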
This isn't the answer to your issue but you should be able to adopt the logic for your own use. If not, maybe someone else will find it useful.
Catching malformed headers before they happen...
package PsychicSTDOUT;
use strict;

my $c = 0;
my $malformed_header = 0;

open(TRUE_STDOUT, '>', '/dev/stdout');
tie *STDOUT, __PACKAGE__, (*STDOUT);

sub TIEHANDLE {
    my $class = shift;
    my $handles = [@_];
    bless $handles, $class;
    return $handles;
}

sub PRINT {
    my $class = shift;
    if (!$c++ && $_[0] !~ /^content-type/i) {
        my (undef, $file, $line) = caller;
        print STDERR "Missing content-type in $file at line $line!!\n";
        $malformed_header = 1;
    }
    return 0 if ($malformed_header);
    return print TRUE_STDOUT @_;
}

1;
usage:
use PsychicSTDOUT;
print "content-type: text/html\n\n"; #try commenting out this line
print "<html>\n";
print "</html>\n";
You could run the script from a wrapper script that captures the original script's stdout and writes the output somewhere sensible.
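A minimal sketch of that wrapper approach (again with the hypothetical Log->log standing in for the real logger):
#!/usr/bin/perl
use strict;
use warnings;

# Run the legacy script and post-process everything it prints.
my $script = shift or die "Usage: $0 <script>\n";
open my $out, '-|', $^X, $script or die "Can't run $script: $!";

while (my $line = <$out>) {
    Log->log($line) if $line =~ /warning|error|notice|attention/i;
    print $line;           # the user still sees the original output
}
close $out;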