Using HTML::TreeBuilder::XPath on nested information

Using HTML::TreeBuilder::XPath on nested information - perl

Imagine an HTML page that is a report with repetitive structure:
<html>
<body>
<h1>Big Hairy Report Page</h1>
<div class="customer">
<div class="customer_id">001</div>
<div class="customer_name">Joe Blough</div>
<div class="customer_addr">123 That Road</div>
<div class="customer_city">Smallville</div>
<div class="customer_state">Nebraska</div>
<div class="order_info">
<div class="shipping_details">
<ul>
<li>Large crate</li>
<li>Fragile</li>
<li>Express</li>
</ul>
</div>
<div class="order_item">Deluxe Hoodie</div>
<div class="payment">35.95</div>
<div class="order_id">000123456789</div>
</div>
<div class="comment">StackOverflow rocks!</div>
</div>
<div class="customer">
<div class="customer_id">002</div>
.... and so forth for a list of 150 customers
This kind of report page appears often. My goal is to extract each customer's related information into some reasonable data structure using HTML::TreeBuilder::XPath.
I know to do the basics and get the file read into $tree. But how can one concisely loop through that tree and get associated clusters of information per each customer? How, for example, would I create a list of address labels sorted by customer number based on this information? What if I want to sort all my customer information by state?
I'm not asking for the whole perl (I can read my file, output to file, etc). I just need help understanding how to ask HTML::TreeBuilder::XPath for those bundles of related data, and then how to dereference them. If it's easier to express this in terms of an output statement (i.e., Joe Blough ordered 1 Deluxe Hoodie and left 1 comment) then that's cool, too.
Thank you very much for those of you who tackle this one, it seems a bit overwhelming to me.

This will do what you need.
It starts by pulling all the <div class="customer"> elements into array #customers and extracting the information from there.
I have taken your example of the address label, sorted by the customer number (by which I assume you mean the field with class="customer_id"). All of the address values are pulled from the array into the hash %customers, keyed by the customer ID and the name of the element class. The information is then printed in the order of the ID.
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_file('html.html');
my #customers = $tree->findnodes('/html/body/div[#class="customer"');
my %customers;
for my $cust (#customers) {
my $id = $cust->findvalue('div[#class="customer_id"]');
for my $field (qw/ customer_name customer_addr customer_city customer_state /) {
my $xpath = "div[\#class='$field']";
my $val = $cust->findvalue($xpath);
$customers{$id}{$field} = $val;
}
}
for my $id (sort keys %customers) {
my $info = $customers{$id};
print "Customer ID $id\n";
print $info->{customer_name}, "\n";
print $info->{customer_addr}, "\n";
print $info->{customer_city}, "\n";
print $info->{customer_state}, "\n";
print "\n";
}
output
Customer ID 001
Joe Blough
123 That Road
Smallville
Nebraska

use HTML::TreeBuilder::XPath;
...
my #customers;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
foreach my $customer_section_node ( $tree->findnodes('//div[ #class = "customer" ]') ) {
my $customer = {};
$customer->{id} = find_customer_id($customer_section_node);
$customer->{name} = find_customer_name($customer_section_node);
...
push #customers, $customer;
}
$tree->delete();
sub find_customer_id {
my $node = shift;
my ($id) = $node->findvalues('.//div[ #class = "customer_id" ]');
return $id
}

I'll use XML::LibXML since it's faster and I'm familiar with it, but it should be pretty straightforward to convert what I post to from XML::LibXML to HTML::TreeBuilder::XPath if you so desire.
use XML::LibXML qw( );
sub get_text { defined($_[0]) ? $_[0]->textContent() : undef }
my $doc = XML::LibXML->load_html(...);
my #customers;
for my $cust_node ($doc->findnodes('/html/body/div[#class="customer"]')) {
my $id = get_text( $cust_node->findnodes('div[#class="customer_id"]') );
my $name = get_text( $cust_node->findnodes('div[#class="customer_name"]') );
...
push #customers, {
id => $id,
name => $name,
...
};
}
Actually, given the regularity of the data, you don't have to hardcode the field names.
use XML::LibXML qw( );
sub parse_list {
my ($node) = #_;
return [
map parse_field($_),
$node->findnodes('li')
];
}
sub parse_field {
my ($node) = #_;
my #children = $node->findnodes('*');
return $node->textContent() if !#children;
return parse_list($children[0]) if $children[0]->nodeName() eq 'ul';
return {
map { $_->getAttribute('class') => parse_field($_) }
#children
};
}
{
my $doc = XML::LibXML->load_html( ... );
my #customers =
map parse_field($_),
$doc->findnodes('/html/body/div[#class="customer"]');
...
}

Related

Formmail - make it show name and email in success response

I'm using formmail.pl to handle a form I'm using on my site. There are several fields which are sent to the script including name, email, phone and some text in a text area.
As it stands, the successful submission prints the phone and textarea data correctly but doesn't print the name and email which were entered. In the email it sends the name and email in the 'from' header and as with the success page only shows the phone and textarea data are shown in the email body.
I would like to show all data in both cases however I can't seem to find the section of code that handles this. I'd post up the formmail.pl script except its over 3000 lines of code so I'll just post the places I think are responsible and hopefully somebody can point me in the right direction. I'm fairly new to Perl and its a bit overwhelming reading and understanding a script of this size.
sub success_page {
my ($self, $date) = #_;
if ($self->{FormConfig}{'redirect'}) {
print $self->cgi_object->redirect( $self->{FormConfig}{'redirect'} );
}
elsif ( $self->{CFG}{'no_content'}) {
print $self->cgi_object->header(Status => 204);
}
else {
$self->output_cgi_html_header;
$self->success_page_html_preamble($date);
$self->success_page_fields;
$self->success_page_footer;
}
}
sub success_page_html_preamble {
my ($self, $date) = #_;
my $title = $self->escape_html( $self->{FormConfig}{'title'} || 'Success' );
my $torecipient = 'to ' . $self->escape_html($self->{FormConfig}{'recipient'});
$torecipient = '' if $self->{Hide_Recipient};
my $attr = $self->body_attributes;
print <<END;
<head>
<title>$title</title>
END
$self->output_style_element;
print <<END;
<link type="text/css" href="css/stylesheet.css" rel="stylesheet" /></script>
</head>
<body>
<p>Below is what you submitted $torecipient on $date</p>
END
}
sub success_page_fields {
my ($self) = #_;
foreach my $f (#{ $self->{Field_Order} }) {
my $val = (defined $self->{Form}{$f} ? $self->{Form}{$f} : '');
$self->success_page_field( $self->escape_html($f), $self->escape_html($val) );
}
}
sub success_page_field {
my ($self, $name, $value) = #_;
print "<p><b>$name:</b> $value</p>\n";
}
Okay that's getting a bit long. That stuff is mostly for the success page and not much to do with the email side of things but maybe if somebody can find what I need there I can apply it to the email section also.
If any further information is needed let me know
Thanks in Advance

I haven't really used NMS FormMail myself, but looking at the source, it seems that you should be able to achieve something like what you want by setting the following extra configuration options:
$more_config{include_config_email} = 1;
$more_config{include_config_realname} = 1;
This should cause FormMail.pl to treat the email and realname fields as normal form fields, in addition to their special meaning.

Clicking on a div tag

I am using WWW::Scripter to grab a page written with javascript/ajax, the "link" to the next page is a div tag, I can get the tag but cannot seem to figure out a way to click on it to get to the next page.. Any suggestions?
my $w = new WWW::Scripter;
$w->use_plugin('Ajax');
$w->get($c->website);
my $loop = 1;
my $page = 1;
while ($loop) {
my $te = HTML::TableExtract->new();
$content = $w->content();
$te->parse($content);
$table = $te->first_table_found;
$str .= Dumper $table;
$page += 1;
$loop = $self->next_page($w);
}
sub next_page {
my $self = shift;
my $w = shift;
$div = $w->document->getElementById('example_next');
if (defined $div) {
--I want to click on the div and move to the next page, suggestions?---
return 1;
} else {
return 0;
}
}
example html code... First there is a table holding the data...
<table class="display" id="example">
<thead>
headers
</thead>
<tbody>---DATA---</tbody>
</table>
Then pagination to go from "page" to "page" the data is rewritten with each pagination click..
<div class="dataTables_paginate paging_two_button" id="example_paginate">
<div class="paginate_disabled_previous" title="Previous" id="example_previous"></div>
<div class="paginate_enabled_next" title="Next" id="example_next"></div>
</div>
This is all using www.datatables.net

You need to identify the JavaScript call that occurs when that div's id is clicked, and then execute it. Alternatively you could use WWW::Mechanize::Firefox or WWW::Selenium.

Extracting links inside <div>'s with HTML::TokeParser & URI

I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>

I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = #_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

How to check the HTML element is a end node?

I am building a HTML parser in Perl. I would like to know if the HTML element is an element without any sibilings.
Here is the HTML, I would like to parse :
<span class="bold1">A:</span> ELementA<br />
<span class="bold1">B:</span> mailto:admin<br />
<span class="bold1">C </span> 01/12<br />
<span class="bold1">D:</span> ELementC<br />
<span class="bold1">E:</span> ElementD<br />
<span class="bold1">F:</span> ElementE<br />
How to check if the element is the end element.
I am getting the error :
Can't call method "as_text" without a package or object reference at
Any idea what could be wrong ?
Here is the code snippet in Perl,
my $mech = WWW::Mechanize->new( autocheck => 1 );
eval
{
$mech->get($url);
};
if ($#)
{
print "Error connecting to URL $url \n";
exit(0);
}
my $root = HTML::TreeBuilder->new_from_content(decode_utf8($mech->content));
my #PageSections = $root->look_down(
sub {
return (
($_[0]->tag() eq 'span' ) and
($_[0]->attr('class') =~ m/bold1/i) )
});
my $temp2;
my $temp3;
for my $ps (#PageSections)
{
# my $temp1= $ps->right()->as_text;
$temp2= $ps->as_text;
my $temp3=ref $ps->right();
#
print defined $temp3 ? "defined \n" : "not defined\n";
}
Thanks

It's hard to tell without knowing more of your code, but I'm guessing #PageSections contains objects of some home brewed module, and that something happens there to make $_ point to something completely different. I'd go with
for my $ps (#PageSections)
{
my $temp1= $ps->right()->as_text;
my $temp2= $ps->as_text;
print "$temp2 " . $temp1 . " \n";
}
instead.

How can I get the next immediate tag with HTML::Tokeparser?

I'm trying to get the tags that occur immediately after a particular div tag. For e.g., I have html code
<div id="example">
<h2>Example</h2>
<p>Hello !World</p>
</div>
I'm doing the following,
while ( $tag = $stream->get_tag('div') ) {
if( $tag->[1]{id} eq 'Example' ) {
$tag = $stream->get_tag;
$tag = $stream->get_tag;
if ( $tag->[0] eq 'div' ) {
...
}
}
}
But this throws the error
Can't use string ("</h2>") as a HASH ref while "strict refs" in use
It works fine if I say
$tag = $stream->get_tag('h2');
$tag = $stream->get_tag('p');
But I can't have that because I need to get the immediate two tags and verify if they are what i expect them to be.

It would be easier to tell if you posted a runnable example program, but it looks like the problem is you didn't realize that get_tag returns both start and end tags. End tags don't have attributes. Start tags are returned as [$tag, $attr, $attrseq, $text], and end tags are returned as ["/$tag", $text].

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Using HTML::TreeBuilder::XPath on nested information - perl

Related

Formmail - make it show name and email in success response

Clicking on a div tag

Extracting links inside <div>'s with HTML::TokeParser & URI

How to check the HTML element is a end node?

How can I get the next immediate tag with HTML::Tokeparser?

Categories

Resources