Good practice for parsing a specific HTML table with URLs in Perl

Given an HTML page with table data like the following:
<tr class=nbg1><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id9654&hl=antibio&lstid=026>Nadifloxacin</A></td><td>Aknetherapeutikum Antibiotikum (Gyrasehemmer)</td><td>WST</td><td></td></tr>
<tr class=nbg2><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id9728&hl=antibio&lstid=026>Ertapenem</A></td><td>Antibiotikum</td><td>WST</td><td></td></tr>
<tr class=nbg1><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id9761&hl=antibio&lstid=026>Panipenem</A></td><td>Beta-Lactam-Antibiotikum</td><td>WST</td><td></td></tr>
<tr class=nbg2><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id10302&hl=antibio&lstid=026>Prulifloxacin</A></td><td>Antibiotikum (Gyrasehemmer)</td><td>WST</td><td></td></tr>
</table></td>
<td width=15></td><td valign=top nowrap class=NBG1>
<TABLE width="200" border="0" cellspacing="0" cellpadding="2">
<TR><TD CLASS="NBG2">
</TD></TR></TABLE><BR>
I need to extract each URL and its description (the link text); the extracted URL will then be used to fetch and parse the subpage. What would be good practice for this, especially for getting the URL?
Current code:
use HTML::TableExtract;
my $te = HTML::TableExtract->new( depth => 3, count => 0 );
$te->parse($mainpage);
foreach my $ts ($te->tables) {
    foreach my $row ($ts->rows) {
        # first cell of each row (text only, the href is lost here)
        print $row->[0] . "\n";
    }
}

If you only want to extract the href attribute from each a element in that table, there is no need to use TableExtract; just use HTML::Query:
use HTML::Query;
my $qry = HTML::Query->new(text => $mainpage);
# map first, then grep: the filter has to see the href string, not the element object
my @hrefs = grep { defined && m/api\.dll/i } map { $_->attr('href') } $qry->query('tr > td > a')->get_elements();
Not tested, but you get the idea.

HTML::TableExtract is built specifically for working with tables.
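As a rough sketch of getting both the URL and its description in one pass, the following uses HTML::TreeBuilder, which is a swapped-in technique rather than either module named above. The helper name extract_links and the shortened sample rows are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return [href, link text] pairs for every <a>
# element whose href attribute matches $pattern.
sub extract_links {
    my ($html, $pattern) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @links;
    for my $anchor ($tree->look_down(_tag => 'a', href => $pattern)) {
        push @links, [ $anchor->attr('href'), $anchor->as_trimmed_text ];
    }
    $tree->delete;    # free the parse tree
    return @links;
}

# Assumed sample data, shaped like the rows in the question but shortened.
my $mainpage = <<'HTML';
<table>
<tr class=nbg1><td><A HREF=api.dll?cmd=w1d27d0id9654>Nadifloxacin</A></td><td>Aknetherapeutikum</td></tr>
<tr class=nbg2><td><A HREF=api.dll?cmd=w1d27d0id9728>Ertapenem</A></td><td>Antibiotikum</td></tr>
</table>
HTML

for my $link (extract_links($mainpage, qr/api\.dll/i)) {
    my ($href, $text) = @$link;
    print "$text => $href\n";
}
```

Each href returned here is relative, so it would still need to be resolved against the page's base URL before fetching the subpage.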

Related

Fixing code to fetch data from dom document (getElementby...)

URL: sayuri.co.jp/used-cars
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
$dom = new DOMDocument;
$dom->loadHTML($content);
Partial Source code :
<td colspan="4">
<h4 class="stk-title"><a href="/used-cars/B37753-Toyota-Wish-japanese-used-cars">Toyota Wish G</a></h4>
</td>
<td colspan="4">
I am trying to go through the source code and, for each block like the one above, save the URL, e.g. "/used-cars/B37753-Toyota-Wish-japanese-used-cars".
Here is the code I am using, unsuccessfully so far:
$p = $dom->getElementsByTagName("h4");
$titles = array();
foreach ($p as $node) {
if ($node->hasAttributes()) {
if($node->getAttribute('class') == "stk-title") {
foreach ($node->attributes as $attr) {
if ($attr->nodeName == "href") {
array_push($titles , $attr->nodeValue);
}
}
}
}
}
print_r($titles) ;
It should give me an array containing the URL of each car: ("/used-cars/B37753-Toyota-Wish-japanese-used-cars" , "" , "" ......)
but it returns an empty array. I guess I made a mistake in my code and it can't access the URLs.
I also need to save the car name in a variable, e.g. $car_name = "Toyota Wish G"
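Use XPath. The href attribute belongs to the <a> element inside the <h4>, not to the <h4> itself, which is why your loop never finds it:
$doc = new DOMDocument;
$doc->loadHTMLFile('http://www.sayuri.co.jp/used-cars/');
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//table[@class="itemlist-table"]//h4[@class="stk-title"]/a');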
$links = array();
foreach ($nodes as $node) {
$links[] = array(
'href' => $node->getAttribute('href'),
'text' => $node->textContent,
);
}
print_r($links);

Clicking on a div tag

I am using WWW::Scripter to grab a page written with JavaScript/Ajax. The "link" to the next page is a div tag. I can get the tag but cannot figure out how to click on it to get to the next page. Any suggestions?
my $w = new WWW::Scripter;
$w->use_plugin('Ajax');
$w->get($c->website);
my $loop = 1;
my $page = 1;
while ($loop) {
my $te = HTML::TableExtract->new();
$content = $w->content();
$te->parse($content);
$table = $te->first_table_found;
$str .= Dumper $table;
$page += 1;
$loop = $self->next_page($w);
}
sub next_page {
my $self = shift;
my $w = shift;
$div = $w->document->getElementById('example_next');
if (defined $div) {
# I want to click on the div and move to the next page here. Suggestions?
return 1;
} else {
return 0;
}
}
Example HTML. First there is a table holding the data:
<table class="display" id="example">
<thead>
headers
</thead>
<tbody>---DATA---</tbody>
</table>
Then comes the pagination used to go from "page" to "page"; the data is rewritten on each pagination click:
<div class="dataTables_paginate paging_two_button" id="example_paginate">
<div class="paginate_disabled_previous" title="Previous" id="example_previous"></div>
<div class="paginate_enabled_next" title="Next" id="example_next"></div>
</div>
This all uses DataTables (www.datatables.net).
You need to identify the JavaScript call that runs when that div is clicked, and then execute it yourself. Alternatively, you could use WWW::Mechanize::Firefox or WWW::Selenium, which drive a real browser.

Get <td> values with Perl

I have a reporting tool that writes job scheduling statistics to an HTML file, and I want to consume this data using Perl. I don't know how to step through an HTML table, though.
I know how to do this with jQuery using
$('tr').each(function () {
    variable = $(this).find('td').text();
});
But I don't know how to apply the same logic in Perl. What should I do? Below is a sample of the HTML output. Each table row includes the same three stats: object name, status, and return code.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<meta name="GENERATOR" content="UC4 Reporting Tool V8.00A">
<Title></Title>
<style type="text/css">
th,td {
font-family: arial;
font-size: 0.8em;
}
th {
background: rgb(77,148,255);
color: white;
}
td {
border: 1px solid rgb(208,213,217);
}
table {
border: 1px solid grey;
background: white;
}
body {
background: rgb(208,213,217);
}
</style>
</HEAD>
<BODY>
<table>
<tr>
<th>Object name</th>
<th>Status</th>
<th>Return code</th>
</tr>
<tr>
<td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
<tr>
<td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
<tr>
<td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
The HTML::Query module is a wrapper around the HTML parser that provides a querying interface familiar to jQuery users. So you could write something like
use HTML::Query qw(Query);
my $docName = "test.html";
my $doc = Query(file => $docName);
for my $tr ($doc->query("tr")) {
    for my $td (Query($tr)->query("td")) {
        # $td is now an HTML::Element object for the td element
        print $td->as_text, "\n";
    }
}
Read the HTML::Query documentation to get a better idea of how to use it; the above is hardly the prettiest example.
You could use a regular expression, but Perl already has modules built for this specific task. Check out HTML::TableContentParser.
You would probably do something like this:
use HTML::TableContentParser;
my $tcp    = HTML::TableContentParser->new;
my $tables = $tcp->parse($HTML);
foreach my $table (@$tables) {
    foreach my $row (@{ $table->{rows} }) {
        # the module keeps each row's <td> elements under the "cells" key
        foreach my $col (@{ $row->{cells} }) {
            # each <td>
            my $data = $col->{data};
        }
    }
}
Here I use HTML::Parser. It is a little verbose, but guaranteed to work. I am using the diamond operator, so you can use the script as a filter. If you call this Perl source extractTd, here are a couple of ways to call it.
$ extractTd test.html
or
$ extractTd < test.html
will both work, output will go on standard output and you can redirect it to a file.
#!/usr/bin/perl -w
use strict;
package ExtractTd;
use 5.010;
use base "HTML::Parser";
my $td_flag = 0;
# Called for every opening tag: remember that we are inside a <td>
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag =~ /^td$/i) {
        $td_flag = 1;
    }
}
# Called for every closing tag: leave the <td>
sub end {
    my ($self, $tag, $origtext) = @_;
    if ($tag =~ /^td$/i) {
        $td_flag = 0;
    }
}
# Called for every text chunk: print it only while inside a <td>
sub text {
    my ($self, $text) = @_;
    if ($td_flag) {
        say $text;
    }
}
my $extractTd = ExtractTd->new;
while (<>) {
    $extractTd->parse($_);
}
$extractTd->eof;
Have you tried looking on CPAN for HTML libraries? This one seems to do what you want:
http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm
Also here is a whole page of different HTML related libraries to use
http://search.cpan.org/search?m=all&q=html+&s=1&n=100
Another option is the CPAN module HTML::TreeBuilder.
I use it extensively to parse a lot of HTML documents.
The idea is that you get an HTML::Element (the root node, for example).
From it, you can look for other nodes:
Get a list of child nodes with ->content_list()
Get the parent node with ->parent()
Disclaimer: the following code has not been tested, but it shows the idea.
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->utf8_mode(1);
$root->parse($content);
$root->eof();
# This gets you an HTML::Element for the root of the document
$root->elementify();
my @td = $root->look_down("_tag", "td");
foreach my $td_elem (@td) {
    printf "-> %s\n", $td_elem->as_trimmed_text();
}
If your table is more complex than that, you could first find the TABLE element,
then iterate over each of its TR children and, for each TR, iterate over its TD elements.
http://metacpan.org/pod/HTML::TreeBuilder
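The table, then row, then cell walk described above could look roughly like this. It is a minimal sketch, not tested against the real report: the helper name table_rows and the two-row sample are assumptions, and it only looks at the first table in the document.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return one array ref per <tr> of the first table,
# holding the trimmed text of its <td> cells. Rows without <td> cells
# (such as the header row, which only has <th>) are skipped.
sub table_rows {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $table = $tree->look_down(_tag => 'table');    # first match only
    my @rows;
    if ($table) {
        for my $tr ($table->look_down(_tag => 'tr')) {
            my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
            push @rows, \@cells if @cells;
        }
    }
    $tree->delete;    # free the parse tree
    return @rows;
}

# Assumed sample, cut down from the report in the question.
my $content = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

for my $row (table_rows($content)) {
    my ($name, $status, $code) = @$row;
    print "$name | $status | $code\n";
}
```

Keeping the cells grouped per row is what makes it possible to treat object name, status and return code as one record. With nested tables, look_down would also return the inner rows, so this sketch assumes a flat table.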

How do I extract information from a webpage using perl?

I need to extract the largest value (number) for specific file names from a web page. Consider this page:
http://earth.wifi.com/isos/preFCS5.3/upgrade/
and the page content is
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /isos/preFCS5.3/upgrade</title>
</head>
<body>
<h1>Index of /isos/preFCS5.3/upgrade</h1>
<table><tr><th><img src="/icons/blank.gif" alt="[ICO]"></th><th>Name</th><th>Last modified</th><th>Size</th><th>Description</th></tr><tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[DIR]"></td><td>Parent Directory</td><td> </td><td align="right"> - </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>GTP-UPG-LATEST-5.3.0.160.iso</td><td align="right">29-Aug-2011 16:06 </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>GTP-UPG-LATEST-5.3.0.169.iso</td><td align="right">31-Aug-2011 16:26 </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>GTP-UPG-LATEST-5.3.0.172.iso</td><td align="right">01-Sep-2011 16:26 </td><td align="right">804M</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>PRE-UPG-LATEST-5.3.0.157.iso</td><td align="right">29-Aug-2011 16:05 </td><td align="right">1.5G</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>PRE-UPG-LATEST-5.3.0.165.iso</td><td align="right">31-Aug-2011 16:26 </td><td align="right">1.5G</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td>PRE-UPG-LATEST-5.3.0.168.iso</td><td align="right">01-Sep-2011 16:26 </td><td align="right">1.5G</td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.2.3 (Red Hat) Server at earth.wifi.com Port 80</address>
</body></html>
In the source above you can see that 172 is the largest value for GTP-UPG-LATEST-5.3.0
and 168 is the largest for PRE-UPG-LATEST-5.3.0.
How can I extract these values and put them into variables, say $gtp and $pre, in Perl?
Thanks so much in advance.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $upgrade = 'http://earth.wifi.com/isos/preFCS5.3/upgrade/';
my $website_content = get($upgrade);
if ( $website_content =~ /href=\"PRE-UPG-LATEST-5.3.0(.*?)\.iso\"/ )
{
my $preversion = ${1};
print $preversion;
}
This is the code I tried, but it does not get the largest value. It gets the first PRE-UPG-LATEST version that it encounters, but I need the largest one.
An if() executes only once. Since you want to get many matches, you need a loop:
while ( m//g ) {
In your data it has "UPG" but your regex has "UGP", so it won't match
(you should copy/paste long strings rather than (attempt to) retype them!).
This will list the data you need; I'll leave it to you to figure out how to process it.
while ($website_content =~ /href="((?:PRE|GTP)-UPG-LATEST-.*?)\.(\d+)\.iso"/g) {
my($file, $version) = ($1, $2);
print "file=$file version=$version\n";
}
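One way to finish the processing step is to keep a running maximum per prefix inside the same kind of loop. This is only a sketch: the helper name largest_builds is made up, and the sample input assumes each file name appears in an href attribute, as the regex above expects.

```perl
use strict;
use warnings;

# Hypothetical helper: map each prefix (PRE or GTP) to the largest
# trailing build number found in the href attributes of $html.
sub largest_builds {
    my ($html) = @_;
    my %max;
    while ($html =~ /href="(PRE|GTP)-UPG-LATEST-[\d.]+?\.(\d+)\.iso"/g) {
        my ($prefix, $build) = ($1, $2);
        $max{$prefix} = $build
            if !defined $max{$prefix} || $build > $max{$prefix};
    }
    return %max;
}

# Assumed listing: the file names from the question, wrapped in links.
my $website_content = join "\n", map { qq{<a href="$_">$_</a>} } qw(
    GTP-UPG-LATEST-5.3.0.160.iso
    GTP-UPG-LATEST-5.3.0.169.iso
    GTP-UPG-LATEST-5.3.0.172.iso
    PRE-UPG-LATEST-5.3.0.157.iso
    PRE-UPG-LATEST-5.3.0.165.iso
    PRE-UPG-LATEST-5.3.0.168.iso
);

my %max = largest_builds($website_content);
my ($gtp, $pre) = @max{qw(GTP PRE)};
print "gtp=$gtp pre=$pre\n";    # prints "gtp=172 pre=168"
```

Comparing with > treats the build numbers as integers, which is enough here because only the last field differs between files.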
I would suggest that you use not only LWP::Simple but XML::Simple too. This will allow you to examine the data as a hash of hashes, which makes it a lot easier to find the largest version.
You can't reliably parse HTML or XML with simple regular expressions because the markup is too free-form. Large structures can legally be broken up across separate lines. Take a look at this example:
<a href="http://foo.com/bar/bar/">The Foobar Page</a>
It can also be expressed as:
<a
href="http://foo.com/bar/bar/">
The Foobar Page
</a>
If you process the input line by line and look for a href, you'll never find it. Heck, you could even look for a\s+href and not find it.
There might be better modules to use for parsing HTML (I found HTML::DOM), but I've never used them and don't know which one is the best to use.
As for finding the largest version number:
There's some difficulty because there are all sorts of strange and wacky rules with version numbering. For example:
2.2 < 2.10
Perl has something called V-Strings, but rumor has it that they've been deprecated. If this doesn't concern you, you can use Perl::Version.
Otherwise, here's a subroutine that does version comparison. Note that I also verify that each section is an integer via the /^\d+$/ regular expression. My subroutine can return four values:
0: Both are the same size
1: First Number is bigger
2: Second Number is bigger
undef: There is something wrong.
Here's the program:
my $minVersion  = "10.3.1.3";
my $userVersion = "10.3.2";
# Compare the two version strings
my $result = compare($minVersion, $userVersion);
if (not defined $result) {
    print "Non-version string detected!\n";
}
elsif ($result == 0) {
    print "$minVersion and $userVersion are the same\n";
}
elsif ($result == 1) {
    print "$minVersion is bigger than $userVersion\n";
}
elsif ($result == 2) {
    print "$userVersion is bigger than $minVersion\n";
}
else {
    print "Something is wrong\n";
}
sub compare {
    my $version1 = shift;
    my $version2 = shift;
    # Create the version arrays
    my @versionList1 = split /\./, $version1;
    my @versionList2 = split /\./, $version2;
    while (1) {
        # Shift off the first value for comparison
        # Returns undef if there are no more values to parse
        my $versionCompare1 = shift @versionList1;
        my $versionCompare2 = shift @versionList2;
        # If both are empty, Versions Matched
        if (not defined $versionCompare1 and not defined $versionCompare2) {
            return 0;
        }
        # If $versionCompare1 is empty $version2 is bigger
        if (not defined $versionCompare1) {
            return 2;
        }
        # If $versionCompare2 is empty $version1 is bigger
        if (not defined $versionCompare2) {
            return 1;
        }
        # Make sure both are numeric or else there's an error
        if ($versionCompare1 !~ /^\d+$/ or $versionCompare2 !~ /^\d+$/) {
            return;
        }
        if ($versionCompare1 > $versionCompare2) {
            return 1;
        }
        if ($versionCompare2 > $versionCompare1) {
            return 2;
        }
    }
}
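If the goal is to pick the largest of several versions, the same field-by-field comparison can be written as a sort comparator. The following is a sketch under that assumption; compare_versions is a hypothetical name, and unlike the subroutine above it returns -1, 0 or 1 and dies on a non-numeric field.

```perl
use strict;
use warnings;

# Hypothetical comparator: compare two dotted version strings field by
# field. Returns -1, 0 or 1 like <=>; dies on a non-numeric field.
sub compare_versions {
    my ($v1, $v2) = @_;
    my @f1 = split /\./, $v1;
    my @f2 = split /\./, $v2;
    for my $field (@f1, @f2) {
        die "Non-version string: $field\n" unless $field =~ /^\d+$/;
    }
    while (@f1 || @f2) {
        # A missing field counts as smaller than any present field,
        # the same rule the subroutine above uses.
        my $x = shift @f1;
        my $y = shift @f2;
        return -1 unless defined $x;
        return  1 unless defined $y;
        my $cmp = $x <=> $y;
        return $cmp if $cmp;
    }
    return 0;
}

my @versions = qw(5.3.0.160 5.3.0.172 5.3.0.169 2.2 2.10);
# Sort in descending order and keep the first element.
my ($largest) = sort { compare_versions($b, $a) } @versions;
print "largest: $largest\n";    # prints "largest: 5.3.0.172"
```

Because each field is compared numerically, 2.2 sorts below 2.10, which is the case a plain string or floating-point comparison gets wrong.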

xpath query to parse html tags

I need to parse the following sample html using xpath query..
<td id="msgcontents">
<div class="user-data">Just seeing if I can post a link... please ignore post
http://finance.yahoo.com
</div>
</td>
<td id="msgcontents">
<div class="user-data">some text2...
http://abc.com
</div>
</td>
<td id="msgcontents">
<div class="user-data">some text3...
</div>
</td>
The above HTML may repeat any number of times in a page.
Also, sometimes the ..... portion may be absent, as in the last block above.
What I need is the xpath syntax so that I can get the parsed strings as
array1[0]= "Just seeing if I can post a link... please ignore post ttp://finance.yahoo.com"
array[1]="some text2 htp://abc.com"
array[2]="sometext3"
Maybe something like the following:
$remote = file_get_contents('http://www.sitename.com');
$dom = new DOMDocument();
// Suppress errors, because an invalid XHTML document triggers warnings.
$file = @$dom->loadHTML($remote);
$xpath = new DOMXpath($dom);
// Get all elements with the user-data class.
$userdata = $xpath->query('//*[contains(@class, \'user-data\')]');
// Get the links.
$links = $xpath->query('//a/@href');
To read the text of each matched node, use nodeValue:
$ret = array();
foreach($userdata as $data) {
$ret[] = $data->nodeValue;
}
Edit: I thought I'd mention that this will get all the links on a given page, I assume this is what you wanted?
Use:
concat(/td/div/text()[1], ' ', /td/div/a)
Instead of the ' ' above, you can use whatever delimiter you'd like to appear between the two strings.