I am building a HTML parser in Perl. I would like to know if the HTML element is an element without any sibilings.
Here is the HTML, I would like to parse :
<span class="bold1">A:</span> ELementA<br />
<span class="bold1">B:</span> mailto:admin<br />
<span class="bold1">C </span> 01/12<br />
<span class="bold1">D:</span> ELementC<br />
<span class="bold1">E:</span> ElementD<br />
<span class="bold1">F:</span> ElementE<br />
How to check if the element is the end element.
I am getting the error :
Can't call method "as_text" without a package or object reference at
Any idea what could be wrong ?
Here is the code snippet in Perl,
my $mech = WWW::Mechanize->new( autocheck => 1 );
eval
{
$mech->get($url);
};
if ($#)
{
print "Error connecting to URL $url \n";
exit(0);
}
my $root = HTML::TreeBuilder->new_from_content(decode_utf8($mech->content));
my #PageSections = $root->look_down(
sub {
return (
($_[0]->tag() eq 'span' ) and
($_[0]->attr('class') =~ m/bold1/i) )
});
my $temp2;
my $temp3;
for my $ps (#PageSections)
{
# my $temp1= $ps->right()->as_text;
$temp2= $ps->as_text;
my $temp3=ref $ps->right();
#
print defined $temp3 ? "defined \n" : "not defined\n";
}
Thanks
It's hard to tell without knowing more of your code, but I'm guessing #PageSections contains objects of some home brewed module, and that something happens there to make $_ point to something completely different. I'd go with
for my $ps (#PageSections)
{
my $temp1= $ps->right()->as_text;
my $temp2= $ps->as_text;
print "$temp2 " . $temp1 . " \n";
}
instead.
Related
Is it possible to have more than one flash message at a time under Mojolicious?
I have a case where a form can have multiple errors, and I want to be able to list all the error messages at once, instead of finding one error, displaying one error, having the user fix one error, repeat for the other errors.
In this example where messages are set in one page, and then displayed in another, only the last message added is shown.
get '/' => sub {
my ($c) = #_;
$c->flash(msg => "This is message one.");
$c->flash(msg => "This is message two.");
$c->flash(msg => "This is message three.");
$c->flash(msg => "This is message four.");
$c->flash(msg => "This is message five.");
return $c->redirect_to('/second');
};
get '/second' => sub {
my ($c) = #_;
return $c->render(template => 'second');
};
app->secrets(["aren't important here"]);
app->start;
__DATA__
## second.html.ep
<!doctype html><html><head><title>Messages</title></head>
<body>
These are the flash messages:
<ul>
% if (my $msg = flash('msg')) {
<li><%= $msg %></li>
% }
</ul>
</body></html>
The output is
These are the flash messages:
This is message five.
I have also tried fetching the messages in list context, but still only the last message is displayed.
__DATA__
## second.html.ep
<!doctype html><html><head><title>Messages</title></head>
<body>
These are the flash messages:
<ul>
% if (my #msg = flash('msg')) {
% foreach my $m (#msg) {
<li><%= $m %></li>
% }
% }
</ul>
</body></html>
Thank you.
Flash data is stored in a hash. You are overwriting the value for key 'msg' each time you call flash().
You can use flash to store a data structure instead of a scalar value:
use Mojolicious::Lite -signatures;
get '/' => sub {
my ($c) = #_;
my #messages = map ("This is message $_", qw/one two three four/);
$c->flash(msg =>\#messages);
return $c->redirect_to('/second');
};
get '/second' => sub {
my ($c) = #_;
return $c->render(template => 'second');
};
app->secrets(["aren't important here"]);
app->start;
__DATA__
## second.html.ep
<!doctype html><html><head><title>Messages</title></head>
<body>
These are the flash messages:
<ul>
% for my $msg (#{flash('msg')//[]}) {
<li><%= $msg %></li>
% }
</ul>
</body></html>
I have a problem checking if the variables I retrieve with CGI are right.
HTML:
<label for="name">Nombre de Usuario:</label>
<input type="text" id="name" name="user_name" />
<label for="password">Contraseña:</label>
<input type="password" id="password" name="user_password" />
CGI/Perl code:
my $c = CGI->new;
.
.
.
my $user = $c->param('user_name');
my $password = $c->param('user_password');
if($user eq "" || $password eq "") {
printf "Error";
exit;
}
The main problem is that if I just do:
print $user;
The output is correct and it prints the value, but in apache error.log, it still says that the variable is uninitialized and then goes into the "if" and exits.
I also tried with "$user" and comparing the strings to undef but nothing.
To test whether a variable has a defined value, you should use the 'defined' function:
if (!defined($user) || !defined($password)) { ... }
If those parameters have not been passed in by the form then they will be undefined, and you will receive a warning if enabled by "use warnings;". You can ensure they are defined by providing fallback values for the assignments:
my $user = $c->param('user_name') // "";
my $password = $c->param('user_password') // "";
I am creating a simple perl script to create a web page to register users. This is just a learning program for me. It is very simple. I will display a page on the browser. The user enters name, user name, and password. After the user presses submit, I will check the user name against the database. If the user name exists in the database, I just want to display an error and bring up the register page again. I am using the cgi->redirect function. I am not sure if that is how I should use the redirection function. It does not work like I thought. It display "The document has moved here". Please point me to the right way. Thanks.
Here is the scripts
registeruser.pl
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print <<PAGE;
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1> Register New User</h1>
</div>
<div id="content">
<form action="adduser.pl" method="POST">
<b>Name:</b> <input type="text" name="name"><br>
<b>UserName:</b> <input type="text" name="username"><br>
<b>Password:</b> <input type="password" name="password"><br>
<input type="submit">
</div>
</body>
<html>
PAGE
adduser.pl
#!/usr/bin/perl
use CGI;
use DBI;
$cgiObj = CGI->new;
print $cgiObj->header ('text/html');
# get post data
$newUser = $cgiObj->param('username');
$newName = $cgiObj->param('name');
$newPass = $cgiObj->param('password');
# set up sql connection
$param = 'DBI:mysql:Tracker:localhost';
$user = 'madison';
$pass = 'qwerty';
$connect = DBI->connect ($param, $user, $pass);
$sql = 'select user from users where user = "' . $newUser . '"';
$query = $connect->prepare ($sql);
$query->execute;
$found = 0;
while (#row = $query->fetchrow_array)
{
$found = 1;
}
if ($found == 0)
{
# no user found add new user
$sql = 'insert into users (user, name, passwd) values (?, ?, ?)';
$insert = $connect->prepare ($sql);
$insert->execute ($newUser, $newName, $newPass);
}
else
{
# user already exists, get new user name
# What do I do here ????
print $cgiObj->redirect ("registerusr.pl");
}
One thing to look out for, SQL Injection. For an illustrated example, Little Bobby Tables.
As it stands your code is inescure, and can allow people to do bad things to your database. DBI provides placeholders as a secure way of querying a database with user input. Example http://bobby-tables.com/perl.html
Also, in this day and age even the CGI module warns you not to use it:
The rational for this decision is that CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
I suggest you use Dancer to make your life easier.
Three things
Include use strict; and use warnings; in EVERY perl script. No exceptions.
This is the #1 thing that you can do to be a better perl programmer. It will save you an incalculable amount of time during both development and testing.
Don't use redirects to switch between form processing and form display
Keep your form display and form processing in the same script. This enables you to display error messages in the form and only move on to a new step upon a successfully processed form.
You simply need to test the request_method to determine if the form is needing to be processed or just displayed.
CGI works for learning perl, but look at CGI::Alternatives for live code.
The following is your form refactored with the first 2 guidelines in mind:
register.pl:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $q = CGI->new;
my $name = $q->param('name') // '';
my $username = $q->param('username') // '';
my $password = $q->param('password') // '';
# Process Form
my #errors;
if ( $q->request_method() eq 'POST' ) {
if ( $username =~ /^\s*$/ ) {
push #errors, "No username specified.";
}
if ( $password =~ /^\s*$/ ) {
push #errors, "No password specified.";
}
# Successful Processing
if ( !#errors ) {
# Obfuscate for display
$password =~ s/./*/g;
print $q->header();
print <<"END_PAGE";
<html>
<head><title>Success</title></head>
<body>
<p>Name = $name</p>
<p>Username = $username</p>
<p>Password = $password</p>
</body>
</html>
END_PAGE
exit;
}
}
# Display Form
print $q->header();
print <<"END_PAGE";
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1>Register New User</h1>
</div>
#{[ #errors ? join("\n", map "<p>Error: $_</p>", #errors) : '' ]}
<div id="content">
<form action="register.pl" method="POST">
<b>Name:</b> #{[ $q->textfield( -name => 'name' ) ]}<br>
<b>UserName:</b> #{[ $q->textfield( -name => 'username' ) ]}<br>
<b>Password:</b> #{[ $q->password_field( -name => 'password' ) ]}<br>
<input type="submit">
</div>
</body>
<html>
END_PAGE
__DATA__
Imagine an HTML page that is a report with repetitive structure:
<html>
<body>
<h1>Big Hairy Report Page</h1>
<div class="customer">
<div class="customer_id">001</div>
<div class="customer_name">Joe Blough</div>
<div class="customer_addr">123 That Road</div>
<div class="customer_city">Smallville</div>
<div class="customer_state">Nebraska</div>
<div class="order_info">
<div class="shipping_details">
<ul>
<li>Large crate</li>
<li>Fragile</li>
<li>Express</li>
</ul>
</div>
<div class="order_item">Deluxe Hoodie</div>
<div class="payment">35.95</div>
<div class="order_id">000123456789</div>
</div>
<div class="comment">StackOverflow rocks!</div>
</div>
<div class="customer">
<div class="customer_id">002</div>
.... and so forth for a list of 150 customers
This kind of report page appears often. My goal is to extract each customer's related information into some reasonable data structure using HTML::TreeBuilder::XPath.
I know to do the basics and get the file read into $tree. But how can one concisely loop through that tree and get associated clusters of information per each customer? How, for example, would I create a list of address labels sorted by customer number based on this information? What if I want to sort all my customer information by state?
I'm not asking for the whole perl (I can read my file, output to file, etc). I just need help understanding how to ask HTML::TreeBuilder::XPath for those bundles of related data, and then how to dereference them. If it's easier to express this in terms of an output statement (i.e., Joe Blough ordered 1 Deluxe Hoodie and left 1 comment) then that's cool, too.
Thank you very much for those of you who tackle this one, it seems a bit overwhelming to me.
This will do what you need.
It starts by pulling all the <div class="customer"> elements into array #customers and extracting the information from there.
I have taken your example of the address label, sorted by the customer number (by which I assume you mean the field with class="customer_id"). All of the address values are pulled from the array into the hash %customers, keyed by the customer ID and the name of the element class. The information is then printed in the order of the ID.
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_file('html.html');
my #customers = $tree->findnodes('/html/body/div[#class="customer"');
my %customers;
for my $cust (#customers) {
my $id = $cust->findvalue('div[#class="customer_id"]');
for my $field (qw/ customer_name customer_addr customer_city customer_state /) {
my $xpath = "div[\#class='$field']";
my $val = $cust->findvalue($xpath);
$customers{$id}{$field} = $val;
}
}
for my $id (sort keys %customers) {
my $info = $customers{$id};
print "Customer ID $id\n";
print $info->{customer_name}, "\n";
print $info->{customer_addr}, "\n";
print $info->{customer_city}, "\n";
print $info->{customer_state}, "\n";
print "\n";
}
output
Customer ID 001
Joe Blough
123 That Road
Smallville
Nebraska
use HTML::TreeBuilder::XPath;
...
my #customers;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
foreach my $customer_section_node ( $tree->findnodes('//div[ #class = "customer" ]') ) {
my $customer = {};
$customer->{id} = find_customer_id($customer_section_node);
$customer->{name} = find_customer_name($customer_section_node);
...
push #customers, $customer;
}
$tree->delete();
sub find_customer_id {
my $node = shift;
my ($id) = $node->findvalues('.//div[ #class = "customer_id" ]');
return $id
}
I'll use XML::LibXML since it's faster and I'm familiar with it, but it should be pretty straightforward to convert what I post to from XML::LibXML to HTML::TreeBuilder::XPath if you so desire.
use XML::LibXML qw( );
sub get_text { defined($_[0]) ? $_[0]->textContent() : undef }
my $doc = XML::LibXML->load_html(...);
my #customers;
for my $cust_node ($doc->findnodes('/html/body/div[#class="customer"]')) {
my $id = get_text( $cust_node->findnodes('div[#class="customer_id"]') );
my $name = get_text( $cust_node->findnodes('div[#class="customer_name"]') );
...
push #customers, {
id => $id,
name => $name,
...
};
}
Actually, given the regularity of the data, you don't have to hardcode the field names.
use XML::LibXML qw( );
sub parse_list {
my ($node) = #_;
return [
map parse_field($_),
$node->findnodes('li')
];
}
sub parse_field {
my ($node) = #_;
my #children = $node->findnodes('*');
return $node->textContent() if !#children;
return parse_list($children[0]) if $children[0]->nodeName() eq 'ul';
return {
map { $_->getAttribute('class') => parse_field($_) }
#children
};
}
{
my $doc = XML::LibXML->load_html( ... );
my #customers =
map parse_field($_),
$doc->findnodes('/html/body/div[#class="customer"]');
...
}
I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = #_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.