Perl WWW::Mechanize web spider: how to find all links

I am currently attempting to create a Perl webspider using WWW::Mechanize.
What I am trying to do is create a webspider that will crawl the whole site of the URL (entered by the user) and extract all of the links from every page on the site.
What I have so far:
use strict;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my $urlToSpider = $ARGV[0];
$mech->get($urlToSpider);
print "\nThe url that will be spidered is $urlToSpider\n";
print "\nThe links found on the url's starting page\n";
my @foundLinks = $mech->find_all_links();
foreach my $linkList (@foundLinks) {
    unless ($linkList->[0] =~ /^https?:\/\//i) {
        $linkList->[0] = $urlToSpider . $linkList->[0];
    }
    print "$linkList->[0]\n";
}
What it does:
1. At present it will extract and list all links on the starting page
2. If the links found are in /contact-us or /help format it will add 'http://www.thestartingurl.com' to the front of it so it becomes 'http://www.thestartingurl.com/contact-us'.
The problem:
At the moment it also finds links to external sites, which I do not want it to do. For example, if I spider 'http://www.tree.com' it will find internal links such as http://www.tree.com/find-us.
However it will also find links to other sites like http://www.hotwire.com.
How do I stop it finding these external urls?
After finding all the urls on the page I then also want to save this new list of internal-only links to a new array called @internalLinks but cannot seem to get it working.
Any help is much appreciated, thanks in advance.

This should do the trick:
my @internalLinks = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);
If you don't want CSS links, try:
my @internalLinks = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/, tag => 'a');
Also, the regex you're using to add the domain to any relative links can be replaced with:
print $linkList->url_abs();
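If you ever need to apply the same internal-only filter to URLs you have already collected, it is just a grep with the same anchored pattern. A minimal sketch (filter_internal is a hypothetical helper, not part of WWW::Mechanize, and it assumes the URLs are already absolute):

```perl
use strict;
use warnings;

# Keep only the URLs that start with the base URL, mirroring the
# url_abs_regex filter above. \Q...\E escapes regex metacharacters
# in the base URL so dots and slashes match literally.
sub filter_internal {
    my ($base, @urls) = @_;
    return grep { /^\Q$base\E/ } @urls;
}

my @internalLinks = filter_internal(
    'http://www.tree.com',
    'http://www.tree.com/find-us',
    'http://www.hotwire.com',
    'http://www.tree.com/contact-us',
);
print scalar(@internalLinks), "\n";   # prints 2
```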

Related

Downloading PDB files from BindingDB

I am trying to find a way to download PDB files of proteins using BindingDB. I have a file with different BindingDB IDs, and I want to download PDB files for every ligand that binds to the protein for each ID.
I was using a script to download specific PDB files from the RCSB PDB. Now I have to do the same, but with BindingDB IDs.
The previous script that I used looks like this:
#!/usr/bin/perl -w
open (NDX, 'file.txt');
@ndx_ar = <NDX>;
close NDX;
$ndx_sz = scalar @ndx_ar;
for ( $c = 0; $c < $ndx_sz; ++$c ) {
    chomp $ndx_ar[$c];
    if ( $ndx_ar[$c] =~ /pdb/ ) {
        $ndx_ar[$c] =~ s/.pdb//;
        `wget 'http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=$ndx_ar[$c]' -O $ndx_ar[$c].pdb`;
    }
    else {
        `wget 'http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=$ndx_ar[$c]' -O $ndx_ar[$c].pdb`;
    }
}
exit;
I assume you want to access the PDBBind database?
The welcome page says this
Accessibility. The basic information of each complex in PDBbind is completely open for access (see the [BROWSE] page). Users are required to register under a license agreement in order to utilize the searching functions provided on this web site or to download the contents of PDBbind in bulk. Registration is free of charge to all academic and industrial users. Please go to the [REGISTER] page and follow the instructions to complete registration.
I am no biochemist, but I suspect that you will need to register as the browse facilities are minimal. I can't help you to use the search facilities as I don't understand what it is that you need.

SugarCRM support links are broken within modules

I recently downloaded and installed SugarCRM Community Edition 6.5.17.
How can I fix the help links within the various modules? Or, how can I figure out how the help calls work? I can't find any reference within the code to how these links are generated.
When I click on a help link within any module, such as within opportunities:
Click here to learn more about the Opportunities module. In order to access more information, use the user menu drop down located on the main navigation bar to access Help.
Where the "click here" link looks like this when viewing the html page source:
href='?module=Administration&action=SupportPortal&view=documentation&version=6.5.17&edition=CE&lang=&help_module=Project&help_action=&key='
I get a "403 Forbidden" error with this link:
http://support.sugarcrm.com/02_Documentation/01_Sugar_Editions/05_Sugar_Community_Edition/Sugar_Community_Edition_6.5/Application_Guide/13_Opportunities/
I manually found the correct link to be:
http://support.sugarcrm.com/02_Documentation/01_Sugar_Editions/05_Sugar_Community_Edition/Sugar_Community_Edition_6.5/Sugar_Community_Edition_Application_Guide_6.5.0/13_Opportunities/
Open config_override.php and add a param for custom_help_url that points to a top-level page at the site that you found.
Here's how I found it:
In Sugar, nearly anything you read on the screen is kept in some language file for localization and translation purposes. So we can start by using grep to search the system for this bit of translated text:
[MKP01] [~/Sites/sugar] > grep -r 'to learn more about the' include/
include//language/en_us.lang.php: 'MSG_EMPTY_LIST_VIEW_NO_RESULTS_SUBMSG' => "<item4> to learn more about the <item1> module. In order to access more information, use the user menu drop down located on the main navigation bar to access Help.",
Now that we know the name of the label, we can search for instances of when it is used:
[MKP01] [~/Sites/nemrace] > grep -r 'MSG_EMPTY_LIST_VIEW_NO_RESULTS_SUBMSG' include/
include//language/en_us.lang.php: 'MSG_EMPTY_LIST_VIEW_NO_RESULTS_SUBMSG' => "<item4> to learn more about the <item1> module. In order to access more information, use the user menu drop down located on the main navigation bar to access Help.",
include//ListView/ListViewGeneric.tpl: {$APP.MSG_EMPTY_LIST_VIEW_NO_RESULTS_SUBMSG|replace:"<item1>":$moduleName|replace:"<item4>":$helpLink}
The first result we already know about, but the second is what you're after: it's the default Smarty template for all list views. If we open that file in a text editor, we can trace the definition of the variable $helpLink, which is only a few lines above where it's used:
{capture assign="helpLink"}<a target="_blank" href='?module=Administration&action=SupportPortal&view=documentation&version={$sugar_info.sugar_version}&edition={$sugar_info.sugar_flavor}&lang=&help_module={$currentModule}&help_action=&key='>{$APP.LBL_CLICK_HERE}</a>{/capture}
From there you can dig into modules/Administration/SupportPortal.php and, reading that logic, look for the definition of the function get_help_url. We find it in include/utils.php:
function get_help_url($send_edition = '', $send_version = '', $send_lang = '', $send_module = '', $send_action = '', $dev_status = '', $send_key = '', $send_anchor = '') {
    global $sugar_config;
    if (!empty($sugar_config['custom_help_url'])) {
        $sendUrl = $sugar_config['custom_help_url'];
    } else {
        if (!empty($sugar_config['custom_help_base_url'])) {
            $baseUrl = $sugar_config['custom_help_base_url'];
        } else {
            $baseUrl = "http://www.sugarcrm.com/crm/product_doc.php";
        }
        $sendUrl = $baseUrl . "?edition={$send_edition}&version={$send_version}&lang={$send_lang}&module={$send_module}&help_action={$send_action}&status={$dev_status}&key={$send_key}";
        if (!empty($send_anchor)) {
            $sendUrl .= "&anchor=" . $send_anchor;
        }
    }
    return $sendUrl;
}
This is great news: Sugar understands that some system administrators will wish to provide their own help URLs or base URLs. So my recommendation is to open config_override.php and add a custom_help_url param that points to a top-level page at the site you found.
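Concretely, the override is only a couple of lines. A sketch, using the working documentation URL found above (swap in whichever page you want as the landing page):

```php
<?php
// config_override.php
// Point all in-app help links at a documentation page that actually exists.
$sugar_config['custom_help_url'] = 'http://support.sugarcrm.com/02_Documentation/01_Sugar_Editions/05_Sugar_Community_Edition/Sugar_Community_Edition_6.5/Sugar_Community_Edition_Application_Guide_6.5.0/';
```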

What does this Lucene-related code actually do?

#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::HitCollector;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;
my $content = "I am the law";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text(content => $content));
$doc->add(Plucene::Document::Field->Text(author => "Philip Johnson"));
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
my $writer = Plucene::Index::Writer->new("my_index", $analyzer, 1);
$writer->add_document($doc);
undef $writer; # close
my $searcher = Plucene::Search::IndexSearcher->new("my_index");
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query => $hc);   # note: $query is never defined in this snippet
Try as I may, I don't understand what this code does. I understand the familiar Perl syntax and what's going on on that end...but what is a Lucene Document, Index::Writer - etc.? Most importantly, when I run this code I expect something to be generated...yet I see nothing.
I know what an Analyzer is...thanks to this doc linked to in CPAN: http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2. But I am just not getting why I run this code and it doesn't seem to DO anything...
Lucene is a search engine designed to search huge amounts of text very fast.
My Perl is not strong, but here is what I understand from the Lucene objects:
my $content = "I am the law";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text(content => $content));
$doc->add(Plucene::Document::Field->Text(author => "Philip Johnson"));
This part creates a new document object and adds two text fields to it, content and author, in preparation for adding it to a Lucene index as searchable data.
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
my $writer = Plucene::Index::Writer->new("my_index", $analyzer, 1);
$writer->add_document($doc);
undef $writer; # close
This part creates the index files and adds the previously created document to that index. At this point you should have a "my_index" folder in your application directory, with several index files in it containing the document's data as searchable text.
my $searcher = Plucene::Search::IndexSearcher->new("my_index");
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query => $hc);
This part attempts to search the index created above for the same document data you just used to build it. Presumably you'll have your search results in @docs at this point, which you might want to display to the user (though this sample does not).
This seems to be a "hello world" application for Lucene usage in Perl. In real-life applications, I don't see a scenario where you would create the index file and then search it from the same piece of code.
Where did you get this code from? It is a copy of the code in the Synopsis at the start of the Plucene POD documentation.
I guess it was an attempt by someone to begin learning about Plucene. The code in a module's synopsis isn't necessarily meant to achieve something useful on its own.
As the documentation you refer to says, Lucene is a Java library that adds text indexing and searching capabilities to an application. It is not a complete application that one can just download, install, and run.
Where did you get the idea that you should run the code you show?

Perl WWW::Mechanize/HTML::TokeParser and following/storing URL from href attr

I am making some good progress with Perl thanks to the help on this site, but I've run into a problem. One of the pages I was scraping from has changed, and I can't figure out how to get to it now. What I want to do is store a link to each page I want to reach. The problem is that these links are inside the href attributes of anchor tags in the source code, and I have no idea how to extract them. Could anyone help me?
The links I need are on lines 316 to 354 of the source of this page: http://www.soccerbase.com/teams/home.sd
I basically need to extract the links into variables for use in my other scripts. As mentioned, I am using WWW::Mechanize and HTML::TokeParser; hopefully there are methods within these that I can use, but I can't currently figure them out. Thanks in advance!
See method find_all_links in WWW::Mechanize. No need to bother manually with the parser. You probably want to relax the regex so that you get all ~1000 possible teams at once.
use WWW::Mechanize qw();
my $w = WWW::Mechanize->new;
$w->get('http://www.soccerbase.com/teams/home.sd');
for my $link ($w->find_all_links(url_regex => qr/comp_id=1\b/)) {
    # 20 instances of WWW::Mechanize::Link
    printf "URL=%s\tTeam=%s\n", $link->url_abs, $link->text;
}
URL=http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1 Team=Premier League
URL=http://www.soccerbase.com/teams/team.sd?team_id=142&comp_id=1 Team=Arsenal
URL=http://www.soccerbase.com/teams/team.sd?team_id=154&comp_id=1 Team=Aston Villa
URL=http://www.soccerbase.com/teams/team.sd?team_id=308&comp_id=1 Team=Blackburn
URL=http://www.soccerbase.com/teams/team.sd?team_id=354&comp_id=1 Team=Bolton
URL=http://www.soccerbase.com/teams/team.sd?team_id=536&comp_id=1 Team=Chelsea
URL=http://www.soccerbase.com/teams/team.sd?team_id=942&comp_id=1 Team=Everton
URL=http://www.soccerbase.com/teams/team.sd?team_id=1055&comp_id=1 Team=Fulham
URL=http://www.soccerbase.com/teams/team.sd?team_id=1563&comp_id=1 Team=Liverpool
URL=http://www.soccerbase.com/teams/team.sd?team_id=1718&comp_id=1 Team=Man City
URL=http://www.soccerbase.com/teams/team.sd?team_id=1724&comp_id=1 Team=Man Utd
URL=http://www.soccerbase.com/teams/team.sd?team_id=1823&comp_id=1 Team=Newcastle
URL=http://www.soccerbase.com/teams/team.sd?team_id=1855&comp_id=1 Team=Norwich
URL=http://www.soccerbase.com/teams/team.sd?team_id=2093&comp_id=1 Team=QPR
URL=http://www.soccerbase.com/teams/team.sd?team_id=2477&comp_id=1 Team=Stoke
URL=http://www.soccerbase.com/teams/team.sd?team_id=2493&comp_id=1 Team=Sunderland
URL=http://www.soccerbase.com/teams/team.sd?team_id=2513&comp_id=1 Team=Swansea
URL=http://www.soccerbase.com/teams/team.sd?team_id=2590&comp_id=1 Team=Tottenham
URL=http://www.soccerbase.com/teams/team.sd?team_id=2744&comp_id=1 Team=West Brom
URL=http://www.soccerbase.com/teams/team.sd?team_id=2783&comp_id=1 Team=Wigan
URL=http://www.soccerbase.com/teams/team.sd?team_id=2848&comp_id=1 Team=Wolves
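To get all ~1000 teams rather than only the Premier League, the regex can be relaxed to match any team_id instead of pinning comp_id=1. A sketch of the idea, tested against sample URLs taken from the output above (the URL scheme is an assumption based on that output):

```perl
use strict;
use warnings;

# Match any team page, regardless of which competition it belongs to.
my $team_re = qr/team\.sd\?team_id=\d+/;

my @sample = (
    'http://www.soccerbase.com/teams/team.sd?team_id=142&comp_id=1',
    'http://www.soccerbase.com/teams/team.sd?team_id=2848&comp_id=2',
    'http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1',
);
my @teams = grep { $_ =~ $team_re } @sample;
print scalar(@teams), "\n";   # prints 2: the tournament link is excluded
```

With WWW::Mechanize, the same pattern would be passed as find_all_links(url_regex => $team_re).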

What's the best method to generate Multi-Page PDFs with Perl and PDF::API2?

I have been using the PDF::API2 module to program a PDF. I work at a warehousing company, and we are trying to switch from text packing slips to PDF packing slips. Packing slips have a list of items needed on a single order. It works great, but I have run into a problem. Currently my program generates a single-page PDF, and it was all working fine. But now I realize that the PDF will need to be multiple pages if there are more than 30 items in an order. I was trying to think of an easy(ish) way to do that but couldn't come up with one. The only thing I could think of involves creating another page and having logic that redefines the coordinates of the line items if there are multiple pages. So I was trying to see if there was a different method or something I was missing that could help, but I wasn't really finding anything on CPAN.
Basically, I need to create a single-page PDF unless there are more than 30 items; then it will need to be multiple pages.
I hope that made sense and any help at all would be greatly appreciated as I am relatively new to programming.
Since you already have the code working for one-page PDFs, changing it to work for multi-page PDFs shouldn't be too hard.
Try something like this:
use PDF::API2;

sub create_packing_list_pdf {
    my @items = @_;
    my $pdf = PDF::API2->new();
    my $page = _add_pdf_page($pdf);
    my $max_items_per_page = 30;
    my $item_pos = 0;
    while (my $item = shift(@items)) {
        $item_pos++;
        # Create a new page, if needed
        if ($item_pos > $max_items_per_page) {
            $page = _add_pdf_page($pdf);
            $item_pos = 1;
        }
        # Add the item at the appropriate height for that position
        # (you'll need to declare $base_height and $line_height)
        my $y = $base_height - ($item_pos - 1) * $line_height;
        # Your code to display the line here, using $y as needed
        # to get the right coordinates
    }
    return $pdf;
}

sub _add_pdf_page {
    my $pdf = shift();
    my $page = $pdf->page();
    # Your code to display the page template here.
    #
    # Note: You can use a different template for additional pages by
    # looking at e.g. $pdf->pages(), which returns the page count.
    #
    # If you need to include a "Page 1 of 2", you can pass the total
    # number of pages in as an argument:
    # int((scalar(@items) - 1) / $max_items_per_page) + 1
    return $page;
}
The main thing is to split up the page template from the line items so you can easily start a new page without having to duplicate code.
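As a sanity check on the paging arithmetic, remember that the total page count has to round up, and that an exact multiple of the page size must not spill onto an empty extra page. A small sketch (the helper name and counts here are illustrative, not from the code above):

```perl
use strict;
use warnings;

# Round up without ceil(): 30 items at 30 per page is exactly 1 page,
# 31 items needs 2. Assumes $item_count >= 1.
sub total_pages {
    my ($item_count, $per_page) = @_;
    return int(($item_count - 1) / $per_page) + 1;
}

print total_pages(30, 30), "\n";   # prints 1
print total_pages(31, 30), "\n";   # prints 2
```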
PDF::API2 is low-level. It doesn't have most of what you would consider necessary for a document, things like margins, blocks, and paragraphs. Because of this, I'm afraid you're going to have to do things the hard way. You may want to look at PDF::API2::Simple. It might meet your criteria, and it's simple to use.
I use PDF::FromHTML for some similar work. Seems to be a reasonable choice, I guess I am not too big on positioning by hand.
The simplest method is to use PDF::API2::Simple:
my @content;
my $pdf = PDF::API2::Simple->new(file => $name);
$pdf->add_font('Courier');
$pdf->add_page();
foreach my $line (@content) {
    $pdf->text($line, autoflow => 'on');
}
$pdf->save();