Formatting parsed output perl - perl

I have this file, and all I need is the last five lines or so from it.
I know that I am not supposed to parse HTML without an HTML module, but this is not a strict
program: all I really need is the last five lines or so. Besides, I cannot download
any modules. I do have access to a proxy server that lets me curl files from the command line,
so maybe there is a way to use CPAN through the proxy, but that is another matter.
The matter at hand is that when I parse out the last few lines, I don't get the
"Names IN MY-DEPT that are restricted"
line, and I want it. It gets skipped.
new_guy@casper0170foo:~/hey/hit_BANK_restricted.$ cat restricted.html.bak
To:DL-BANK@big_business.com
From:dl-dept?g-gsd-stm@big_business.com
Subject: Restricted List for 25-Nov-2014
Content-Type: text/html;
Content-Transfer-Encoding: quoted-print HTMLFILEable>
<HTML>
<HEAD>
<STYLE type="text/css">
body { font-family: verdana; font-size: 10pt }
td { font-size: 8pt; vertical-align: top }
td.cat { color: 6699FF ; background: 666699; text-align: right; vertical-align: bottom; height: 30 }
td.ind { width: 20pt }
td.link { }
td.desc { color: a0a0a0 }
a:visited { color: 800080; text-decoration: none }
</STYLE>
<TITLE>TRADES</TITLE>
</HEAD><BODY><TABLE width="80%" border="0" cellpadding="0" cellspacing="0">
<tr>
<td colspan="3" align="center">Names IN MY-DEPT that are restricted</td>
</tr>
<tr>
<td><b>Restriction Code</b></td>
<td><b>Company</b></td>
<td><b>Ticker</b></td>
</tr><tr><td>RL5</td><td>First Trust Global Risk Managed Inc</td><td>ETP</td></tr><font color="red"><tr><td>RLMT</td><td>GT Advanced Technologies Inc</td><td nowrap>GTATQ (position only, not in MY-DEPT)</td></tr></font></TABLE></BODY</HTML>
new_guy@casper0170foo:~/hey/hit_BANK_restricted.$ cat parse_restrict2
#!/usr/bin/perl
use strict;
use warnings ;
my @restrict_codes = qw(RL3 RL5 RL5H RL6 REGM RAF RLMT RTCA RTCAH RTCB RTCBH RTCI RTCIH RLSI RLHK RLJP RPROP RLCB RLCS RLBZ RLBZH RLSUS);
my $rest_dir = "/home/new_guy/hey/hit_BANK_restricted./";
my $restrict_file = "restricted.html.bak" ;
open my $fh_rest_codes, '<', "$rest_dir$restrict_file" or die "cannot load $! " ;
while (<$fh_rest_codes>) {
next unless $_ =~ m/Names/;
my @lines = <$fh_rest_codes> ;
}
foreach(@lines) {
s/td/ /g ;
s/<[^>]*>/ /g ;
foreach $restrict(@restrict_codes) {
s/$restrict/\n$restrict/g;
}
print $_ ;
sleep 1 ;
}
print "\n" ;
These are the results that I get:
They are Ok but I would like to format them and I do not know how.
new_guy@casper0170foo:~/hey/hit_BANK_restricted.$ ./parse_restrict2
Restriction Code
Company
Ticker
RL5 First Trust Global Risk Managed Inc ETP
RLMT GT Advanced Technologies Inc GTATQ (position only, not in MY-DEPT)
Is there a way to get the lines in this kind of format?
Names IN MY-DEPT that are restricted
Restriction Code Company Ticker
RL5 First Trust Global Risk Managed Inc ETP
RLMT GT Advanced Technologies Inc GTATQ (position only, not in MY-DEPT)

Good question, you could try this workaround if you like:
my @lines;
while (<$fh_rest_codes>) {
next unless $_ =~ m/Names/;
push(@lines, $_);
push (@lines, <$fh_rest_codes>);
}
my $str=join ('',@lines);
$str=~m|<td.*?>(.*?)</td>|;
print "$1\n\n";
$str=~ m|<tr>(.*?)</tr>|msg;
my $fmt="%-24s%-40s%-40s\n";
printf ($fmt, $1=~ m{<td><b>(.*?)</b></td>}msg );
while ($str=~ m|<tr>(.*?)</tr>|msg) {
printf ($fmt, $1=~ m{<td.*?>(.*?)</td>}msg );
}
Output:
Names IN MY-DEPT that are restricted

Restriction Code        Company                                 Ticker
RL5                     First Trust Global Risk Managed Inc     ETP
RLMT                    GT Advanced Technologies Inc            GTATQ (position only, not in MY-DEPT)
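The column widths in the format string above (24, 40, 40) are hard-coded. As a minimal sketch, assuming the rows have already been extracted into arrays, the widths could instead be taken from the longest cell in each column. The @rows data below is copied from the sample output, and the names @rows and @width are only illustrative. It uses core Perl only, so no modules need to be installed.

```perl
use strict;
use warnings;

# Rows as they might look after extraction; copied from the sample output.
my @rows = (
    [ 'Restriction Code', 'Company', 'Ticker' ],
    [ 'RL5',  'First Trust Global Risk Managed Inc', 'ETP' ],
    [ 'RLMT', 'GT Advanced Technologies Inc', 'GTATQ (position only, not in MY-DEPT)' ],
);

# Find the widest cell in each column.
my @width;
for my $row (@rows) {
    for my $i (0 .. $#$row) {
        my $len = length $row->[$i];
        $width[$i] = $len if !defined $width[$i] || $len > $width[$i];
    }
}

# Build a left-justified format with two spaces between columns.
my $fmt = join('  ', map { "%-${_}s" } @width) . "\n";

printf($fmt, @$_) for @rows;
```

This way a longer company name in a later report widens the column rather than pushing the ticker out of line.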

Related

HTML parsing with HTML::TokeParser::Simple

I am parsing an HTML file with HTML::TokeParser::Simple. The HTML file has the content shown far below. My problem is, I am trying to ignore the JavaScript from showing up as text content. Example:
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'test.html' );
while ( my $token = $p->get_token ) {
next unless $token->is_text;
print $token->as_is, "\n";
}
This prints the output as seen below:
Test HTML
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
The actual HTML Document Content:
<html>
<span>Test HTML</span>
<script type="text/javascript">
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
</script>
</html>
How do I keep the contents of the JavaScript tag from showing up?
I get the desired result. Comments are (correctly) not considered text by the version I have. Looks like you need to upgrade the modules you are using. (I used HTML::Parser 3.69 and HTML::TokeParser::Simple 3.15.)
>perl a.pl
Test HTML
>
You'll still have to process HTML entities and format the text usefully, the latter being quite difficult since you removed all formatting instruction. Your approach seems fatally flawed.
I believe you only need to use the as_text method.
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse( $html );
$tree->eof();
$tree->elementify(); # just for safety
my $text = $tree->as_text();
$tree->delete;
I adapted this from the WWW::Mechanize module (http://search.cpan.org/dist/WWW-Mechanize/) which has tons of convenience methods that can help you. It basically acts as a web browser in an object.
Scan through the tokens and keep track of the opening and closing script tags. The code below is what I used to resolve the issue.
my $ignore=0;
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag('script') ) {
print $token->as_is, "\n";
$ignore = 1;
next;
}
if ( $token->is_end_tag('script') ) {
$ignore = 0;
print $token->as_is, "\n";
next;
}
if ($ignore) {
#Everything inside the script tag. Here you can ignore or print as is
print $token->as_is, "\n";
}
else
{
#Everything excluding scripts falls here handle as appropriate
next unless $token->is_text;
print $token->as_is, "\n";
}
}

Extracting links inside <div>'s with HTML::TokeParser & URI

I'm an old newbie in Perl, and I'm trying to create a subroutine using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within one div with the id "zone-extract".
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have two loops: the first uses get_tag('div') and then looks for id = 'zone-extract'. The second loop looks inside this div and retrieves all links (or that was my intention)...
The inner loop works; run standalone, it extracts all links correctly. But I think there is an issue in the first loop, where I look for my desired div 'zone-extract'... I'm using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have at the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find TokeParser a very crude tool that requires too much code; its fault is that it only supports the procedural style of programming.
A better alternative, which requires less code thanks to its declarative style, is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = @_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

Get <td> values with Perl

So I have a reporting tool that spits out job scheduling statistics in an HTML file, and I'm looking to consume this data using Perl. I don't know how to step through a HTML table though.
I know how to do this with jQuery using
$.find('<tr>').each(function(){
variable = $(this).find('<td>').text
});
But I don't know how to do this same logic with Perl. What should I do? Below is a sample of the HTML output. Each table row includes the three same stats: object name, status, and return code.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<meta name="GENERATOR" content="UC4 Reporting Tool V8.00A">
<Title></Title>
<style type="text/css">
th,td {
font-family: arial;
font-size: 0.8em;
}
th {
background: rgb(77,148,255);
color: white;
}
td {
border: 1px solid rgb(208,213,217);
}
table {
border: 1px solid grey;
background: white;
}
body {
background: rgb(208,213,217);
}
</style>
</HEAD>
<BODY>
<table>
<tr>
<th>Object name</th>
<th>Status</th>
<th>Return code</th>
</tr>
<tr>
<td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
<tr>
<td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
<tr>
<td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
The HTML::Query module is a wrapper around the HTML parser that provides a querying interface that is familiar to jQuery users. So you could write something like
use HTML::Query qw(Query);
my $docName = "test.html";
my $doc = Query(file => $docName);
for my $tr ($doc->query("tr")) {
for my $td (Query($tr)->query("td")) {
# $td is now an HTML::Element object for the td element
print $td->as_text, "\n";
}
}
Read the HTML::Query documentation to get a better idea of how to use it; the above is hardly the prettiest example.
You could use a RegExp but Perl already has modules built for this specific task. Check out HTML::TableContentParser
You would probably do something like this:
use HTML::TableContentParser;
my $tcp = HTML::TableContentParser->new;
my $tables = $tcp->parse($HTML); # $HTML holds the page source
foreach my $table (@$tables) {
foreach my $row (@{ $table->{rows} }) {
foreach my $cell (@{ $row->{cells} }) {
# each <td>
my $data = $cell->{data};
}
}
}
Here I use HTML::Parser; it is a little verbose, but guaranteed to work. I am using the diamond operator, so you can use it as a filter. If you call this Perl source extractTd, here are a couple of ways to call it.
$ extractTd test.html
or
$ extractTd < test.html
will both work, output will go on standard output and you can redirect it to a file.
#!/usr/bin/perl -w
use strict;
package ExtractTd;
use 5.010;
use base "HTML::Parser";
my $td_flag = 0;
sub start {
my ($self, $tag, $attr, $attrseq, $origtext) = @_;
if ($tag =~ /^td$/i) {
$td_flag = 1;
}
}
sub end {
my ($self, $tag, $origtext) = @_;
if ($tag =~ /^td$/i) {
$td_flag = 0;
}
}
sub text {
my ($self, $text) = @_;
if ($td_flag) {
say $text;
}
}
my $extractTd = new ExtractTd;
while (<>) {
$extractTd->parse($_);
}
$extractTd->eof;
Have you tried looking at CPAN for HTML libraries? This one seems to do what you want:
http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm
Also here is a whole page of different HTML related libraries to use
http://search.cpan.org/search?m=all&q=html+&s=1&n=100
Perl CPAN module HTML::TreeBuilder.
I use it extensively to parse a lot of HTML documents.
The concept is that you get an HTML::Element (the root node, for example).
From it, you can look for other nodes:
Get a list of children nodes with ->content_list()
Get the parent node with ->parent()
Disclaimer: The following code has not been tested, but it's the idea.
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->utf8_mode(1);
$root->parse($content);
$root->eof();
# This gets you an HTML::Element, of the root document
$root->elementify();
my @td = $root->look_down("_tag", "td");
foreach my $td_elem (@td)
{
printf "-> %s\n", $td_elem->as_trimmed_text();
}
If your table is more complex than that, you could first find the TABLE element,
then iterate over each TR children, and for each TR children, iterate over TD elements...
http://metacpan.org/pod/HTML::TreeBuilder

How do I extract information from a webpage using perl?

I need to extract the largest values (numbers) for specific names from a webpage. Consider the webpage
http://earth.wifi.com/isos/preFCS5.3/upgrade/
and the URL content is
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /isos/preFCS5.3/upgrade</title>
</head>
<body>
<h1>Index of /isos/preFCS5.3/upgrade</h1>
<table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th>Name</th><th>Last modified</th><th>Size</th><th>Description</th></tr><tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td>Parent Directory</td><td> </td><td align="right"> - </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>GTP-UPG-LATEST-5.3.0.160.iso</td><td align="right">29-Aug-2011 16:06 </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>GTP-UPG-LATEST-5.3.0.169.iso</td><td align="right">31-Aug-2011 16:26 </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>GTP-UPG-LATEST-5.3.0.172.iso</td><td align="right">01-Sep-2011 16:26 </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>PRE-UPG-LATEST-5.3.0.157.iso</td><td align="right">29-Aug-2011 16:05 </td><td align="right">1.5G</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>PRE-UPG-LATEST-5.3.0.165.iso</td><td align="right">31-Aug-2011 16:26 </td><td align="right">1.5G</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>PRE-UPG-LATEST-5.3.0.168.iso</td><td align="right">01-Sep-2011 16:26 </td><td align="right">1.5G</td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.2.3 (Red Hat) Server at earth.wifi.com Port 80</address>
</body></html>
In the above source you can see 172 is the largest for GTP-UPG-LATEST-5.3.0
and 168 is the largest for PRE-UPG-LATEST-5.3.0
How can I extract these values and put them into variables, say $gtp and $pre, in Perl?
Thanks so much in advance
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $upgrade = 'http://earth.wifi.com/isos/preFCS5.3/upgrade/';
my $website_content = get($upgrade);
if ( $website_content =~ /href=\"PRE-UPG-LATEST-5.3.0(.*?)\.iso\"/ )
{
my $preversion = ${1};
print $preversion;
}
This is the code I tried, but it is not getting the largest value. It gets the first PRE-UPG-LATEST version value that it encounters, but I need the largest one.
An if() executes only once. Since you want to get many, you need a loop
while ( m//g ) {
In your data it has "UPG" but your regex has "UGP", so it won't match
(you should copy/paste long strings rather than (attempt to) retype them!).
This will list the data you need, I'll leave it to you to figure out how to process it.
while ($website_content =~ /href="((?:PRE|GTP)-UPG-LATEST-.*?)\.(\d+)\.iso"/g) {
my($file, $version) = ($1, $2);
print "file=$file version=$version\n";
}
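One way to finish the job, as a sketch: keep the highest number seen for each file prefix in a hash. To stay self-contained, this uses a shortened inline copy of the listing instead of fetching the page with get(), and it assumes the real page carries href attributes in the form the regex above expects. The names %latest, $gtp and $pre are only illustrative.

```perl
use strict;
use warnings;

# Stand-in for get($upgrade); the href attributes are an assumption
# about what the real directory listing contains.
my $website_content = <<'HTML';
<a href="GTP-UPG-LATEST-5.3.0.160.iso">GTP-UPG-LATEST-5.3.0.160.iso</a>
<a href="GTP-UPG-LATEST-5.3.0.172.iso">GTP-UPG-LATEST-5.3.0.172.iso</a>
<a href="GTP-UPG-LATEST-5.3.0.169.iso">GTP-UPG-LATEST-5.3.0.169.iso</a>
<a href="PRE-UPG-LATEST-5.3.0.157.iso">PRE-UPG-LATEST-5.3.0.157.iso</a>
<a href="PRE-UPG-LATEST-5.3.0.168.iso">PRE-UPG-LATEST-5.3.0.168.iso</a>
<a href="PRE-UPG-LATEST-5.3.0.165.iso">PRE-UPG-LATEST-5.3.0.165.iso</a>
HTML

# Highest number seen so far, keyed by file prefix.
my %latest;
while ($website_content =~ /href="((?:PRE|GTP)-UPG-LATEST-.*?)\.(\d+)\.iso"/g) {
    my ($file, $version) = ($1, $2);
    $latest{$file} = $version
        if !exists $latest{$file} || $version > $latest{$file};
}

my $gtp = $latest{'GTP-UPG-LATEST-5.3.0'};
my $pre = $latest{'PRE-UPG-LATEST-5.3.0'};
print "gtp=$gtp pre=$pre\n";
```

The comparison is numeric (>), so it does not depend on the order in which the files appear in the listing.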
I would suggest that you not only use LWP::Simple, but XML::Simple too. This will allow you to examine the data as a hash of hashes. It'll be a lot easier to find the largest version.
You can't parse HTML or XML with simple regular expressions because the XML data structure is too free form. Large structures can legally be broken up on separate lines. Take a look at this example:
<a href="http://foo.com/bar/bar/">The Foobar Page</a>
It can also be expressed as:
<a
href="http://foo.com/bar/bar/">
The Foobar Page
</a>
If you were looking for a href, you'll never find it. Heck, you could even look for a\s+href and not find it.
There might be better modules to use for parsing HTML (I found HTML::Dom), but I've never used them and don't know which one is the best one to use.
As for finding the largest version number:
There's some difficulty because there are all sorts of strange and wacky rules with version numbering. For example:
2.2 < 2.10
Perl has something called V-Strings, but rumor has it that they've been deprecated. If this doesn't concern you, you can use Perl::Version.
Otherwise, here's a subroutine that does version comparison. Note that I also verify that each section is an integer via the /^\d+$/ regular expression. My subroutine can return four values:
0: Both are the same size
1: First Number is bigger
2: Second Number is bigger
undef: There is something wrong.
Here's the program:
my $minVersion = "10.3.1.3";
my $userVersion = "10.3.2";
# Create the version arrays
my $result = compare($minVersion, $userVersion);
if (not defined $result) {
print "Non-version string detected!\n";
}
elsif ($result == 0) {
print "$minVersion and $userVersion are the same\n";
}
elsif ($result == 1) {
print "$minVersion is bigger than $userVersion\n";
}
elsif ($result == 2) {
print "$userVersion is bigger than $minVersion\n";
}
else {
print "Something is wrong\n";
}
sub compare {
my $version1 = shift;
my $version2 = shift;
my @versionList1 = split /\./, $version1;
my @versionList2 = split /\./, $version2;
my $result;
while (1) {
# Shift off the first value for comparison
# Returns undef if there are no more values to parse
my $versionCompare1 = shift @versionList1;
my $versionCompare2 = shift @versionList2;
# If both are empty, Versions Matched
if (not defined $versionCompare1 and not defined $versionCompare2) {
return 0;
}
# If $versionCompare1 is empty $version2 is bigger
if (not defined $versionCompare1) {
return 2;
}
# If $versionCompare2 is empty $version1 is bigger
if (not defined $versionCompare2) {
return 1;
}
# Make sure both are numeric or else there's an error
if ($versionCompare1 !~ /^\d+$/ or $versionCompare2 !~ /^\d+$/) {
return;
}
if ($versionCompare1 > $versionCompare2) {
return 1;
}
if ($versionCompare2 > $versionCompare1) {
return 2;
}
}
}
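The same idea can also be written as a comparator that returns -1, 0 or 1, in the style of Perl's <=> operator, so that it plugs straight into sort. This is a sketch rather than the subroutine above: compare_versions is a made-up name, and it assumes every dot-separated part is a non-negative integer, dying otherwise instead of returning undef.

```perl
use strict;
use warnings;

# Compare two dotted version strings part by part.
# Returns -1, 0 or 1, like <=>. A missing part counts as smaller,
# so "10.3" sorts before "10.3.0", as in the subroutine above.
sub compare_versions {
    my ($version1, $version2) = @_;
    my @parts1 = split /\./, $version1;
    my @parts2 = split /\./, $version2;

    for my $part (@parts1, @parts2) {
        die "Non-version string detected: '$part'\n" unless $part =~ /^\d+$/;
    }

    # Walk both lists until one runs out or a part differs.
    while (@parts1 && @parts2) {
        my $cmp = shift(@parts1) <=> shift(@parts2);
        return $cmp if $cmp;
    }

    # Whichever list still has parts left over is the bigger version.
    return @parts1 <=> @parts2;
}

my @sorted = sort { compare_versions($a, $b) } qw(2.10 10.3.2 2.2 10.3.1.3);
print "@sorted\n";
```

With a comparator in this shape, the largest version is simply the last element of the sorted list.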

Perl Treebuilder HTML Parsing, can't seem to parse to DIV, getting error "Use of uninitialized value in pattern match "

I'm new to using the Perl treebuilder module for HTML parsing and can't figure out what the issue is with this.. I have spent a few hours trying to get this to work and looked at a few tutorials but am still getting this error: "Use of uninitialized value in pattern match ", referring to this line in my code:
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
This error prints out many times in the terminal. I have checked everything over and over, and it's definitely getting the input, as the $downloadedpage is a full HTML file that contains the string I give below... Any advice is greatly appreciated.
sample string, contained within the $downloadedpage variable
<div class='snap_preview'><p><img src="http://www.dishbase.com/recipe_images/large/chicken-enchiladas-12005010871.jpg" width="160" height="115" align="left" border="0" alt="Mexican dishes recipes" style="border:none;"><b>Mexican dishes recipes</b> <i></i><br />
Mexican cuisine is popular the world over for its intense flavor and colorful presentation. Traditional Mexican recipes such as tacos, quesadillas, enchiladas and barbacoa are consistently explored for options by some of the world’s foremost gourmet chefs. A celebration of spices and unique culinary trends, Mexican food is now dominating world cuisines.</p>
<div style="margin-top: 1em" class="possibly-related"><hr /><p><strong>Possibly related posts: (automatically generated)</strong></p><ul><li><a rel='related' href='http://vireja59.wordpress.com/2010/02/13/all-best-italian-dishes-recipes/' style='font-weight:bold'>All best Italian dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/05/24/liver-dishes-recipes/' style='font-weight:bold'>Liver dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/04/24/parsley-in-cooking/' style='font-weight:bold'>Parsley in cooking</a></li></ul></div>
my code:
my $tree = HTML::TreeBuilder->new();
$tree->parse($downloadedpage);
$tree->eof();
#the article is in the div with class "snap_preview"
@article = $tree->look_down(
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
Using the exact code and example you gave,
use warnings;
use strict;
use HTML::TreeBuilder;
my $downloadedpage=<<EOF;
<div class='snap_preview'><p><img src="http://www.dishbase.com/recipe_images/large/chicken-enchiladas-12005010871.jpg" width="160" height="115" align="left" border="0" alt="Mexican dishes recipes" style="border:none;"><b>Mexican dishes recipes</b> <i></i><br />
Mexican cuisine is popular the world over for its intense flavor and colorful presentation. Traditional Mexican recipes such as tacos, quesadillas, enchiladas and barbacoa are consistently explored for options by some of the world’s foremost gourmet chefs. A celebration of spices and unique culinary trends, Mexican food is now dominating world cuisines.</p>
<div style="margin-top: 1em" class="possibly-related"><hr /><p><strong>Possibly related posts: (automatically generated)</strong></p><ul><li><a rel='related' href='http://vireja59.wordpress.com/2010/02/13/all-best-italian-dishes-recipes/' style='font-weight:bold'>All best Italian dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/05/24/liver-dishes-recipes/' style='font-weight:bold'>Liver dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/04/24/parsley-in-cooking/' style='font-weight:bold'>Parsley in cooking</a></li></ul></div>
EOF
my $tree = HTML::TreeBuilder->new();
$tree->parse($downloadedpage);
$tree->eof();
#the article is in the div with class "snap_preview"
my @article = $tree->look_down(
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
I don't get any errors at all. My first guess would be that there are some <div>s in the HTML which don't have a class attribute.
Maybe you need to write
sub{
$_[0]-> tag() eq 'div' and
$_[0]->attr('class') and
($_[0]->attr('class') =~ /snap_preview/)
}
there?
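The warning itself is plain Perl behaviour: matching undef against a pattern warns under use warnings. Here is a sketch of the same guard written with the defined-or operator // (Perl 5.10 and later). It uses plain hash references as stand-ins for the HTML::Element objects, so the data and the attr_of helper are made up for illustration and are not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Stand-ins for parsed elements; the second <div> has no class attribute.
my @elements = (
    { tag => 'div', class => 'snap_preview' },
    { tag => 'div' },
    { tag => 'p',   class => 'snap_preview' },
);

# Hypothetical helper mimicking $element->attr($name): undef when missing.
sub attr_of {
    my ($element, $name) = @_;
    return $element->{$name};
}

# Collect warnings so we can check that none are raised.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# Fall back to an empty string when the attribute is undefined.
my @article = grep {
    $_->{tag} eq 'div' and (attr_of($_, 'class') // '') =~ /snap_preview/
} @elements;

printf "matched %d element(s), %d warning(s)\n", scalar @article, scalar @warnings;
```

In the look_down callback, the equivalent would be ($_[0]->attr('class') // '') =~ /snap_preview/.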