Perl HTML::Element how to look_down to extract next tag after a matching tag - perl

I am using HTML::TreeBuilder to process HTML files. In those files I can have definition lists where there is term "Database" with definition "Database Name". Simulated html looks like this:
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;
use feature qw( say );
my $exampleContent = '<dl>
<dt data-auto="citation_field_label">
<span class="medium-bold">Language:</span>
</dt>
<dd data-auto="citation_field_value">
<span class="medium-normal">English</span>
</dd>
<dt data-auto="citation_field_label">
<span class="medium-bold">Database:</span>
</dt>
<dd data-auto="citation_field_value">
<span class="medium-normal">Data Archive</span>
</dd>
</dl>';
my $root = HTML::TreeBuilder->new_from_content($exampleContent);
my $dlist = $root->look_down("_tag" => "dl");
foreach my $e ($dlist->look_down("_tag" => 'dt', "data-auto" => "citation_field_label")) {
if ($e->as_text =~ m/Datab.*/) {
say $e->as_text; # I have found "Database:" 'dt' field
# now I need to go to the next field 'dd' and return the value of that
}
}
I need to identify which database the file has come from and return the value.
I would like to be able to say something like say $dlist->right()->as_text; when I have identified <dt> with "Database:" in it, but I do not know how. Your thoughts would be much appreciated.

You were almost there. Using
$e->right->as_text;
Gives me the "Data Archive".

Related

DOMXPath multiple contain selectors not working

I have the following XPath query that a kind user on SO helped me with:
$xpath->query(".//*[not(self::textarea or self::select or self::input) and contains(., '{{{')]/text()") as $node)
Its purpose is to replace certain placeholders with a value, and correctly catches occurences such as the below that should not be replaced:
<textarea id="testtextarea" name="testtextarea">{{{variable:test}}}</textarea>
And replaces correctly occurrences like this:
<div>{{{variable:test}}}</div>
Now I want to exclude elements that are of type <div> that contain the class name note-editable in that query, e.g., <div class="note-editable mayhaveanotherclasstoo">, in addition to textareas, selects or inputs.
I have tried:
$xpath->query(".//*[not(self::textarea or self::select or self::input) and not(contains(#class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
and:
$xpath->query(".//*[not(self::textarea or self::select or self::input or contains(#class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
I have followed the advice on some questions similar to this: PHP xpath contains class and does not contain class, and I do not get PHP errors, but the note-editable <div> tags are still having their placeholders replaced.
Any idea what's wrong with my attempted queries?
EDIT
Minimum reproducible DOM sample:
<div class="note-editing-area">
<textarea class="note-codable"></textarea>
<div class="note-editable panel-body" contenteditable="true" style="height: 350px;">{{{variable:system_url}}</div>
</div>
Code that does the replacement:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input or self::div[contains(#class,'note-editable')]) and contains(., '{{{')]/text()") as $node) {
$node->nodeValue = preg_replace_callback('~{{{([^:]+):([^}]+)}}}~', function($m) use ($placeholders) {
return $placeholders[$m[1]][$m[2]] ?? '';
},
$node->nodeValue);
}
$html = $dom->saveHTML();
echo html_entity_decode($html);
Use this below xpath.
.//*[not(self::textarea or self::select or self::input or self::div[contains(#class,'note-editable')]) and contains(., '{{{')]

Why do I get the error 'Attempt to get non existent parameter "count-1"' when I use HTML::Template?

I am trying to populate an HTML drop down list using HTML::Template:
use HTML::Template;
my #ARRAY = ("count1,count2,count3");
my $template = HTML::Template->new(filename => 'test.tmpl');
$template->param( COUNT => \#ARRAY );
print $template->output();
Here's my template file (test.tmpl):
<div class="input-field col s6">
<select><TMPL_LOOP NAME="COUNT"><option value="count1"><TMPL_VAR NAME=NAME><TMPL_VAR NAME=VALUES></option></TMPL_LOOP></select>
</div>
When I run this, I get the error:
Attempt to get nonexistent parameter 'count1,count2,count3' -
this parameter name doesn't match any declarations in the template file
How can I fix this?
You don't say what result you want from this, but each element of a TMPL_LOOP parameter must be a hash containing one or more values. Within the <TMPL_LOOP NAME="COUNT"> element you can call out elements of each hash using <TMPL_VAR NAME=FIELD> where FIELD is the name of the hash key that must appear in every element of the array
If I combine this template
<div class="input-field col s6">
<select>
<TMPL_LOOP NAME="COUNT">
<option value="<TMPL_VAR NAME=VALUE>"><TMPL_VAR NAME=NAME></option>
</TMPL_LOOP>
</select>
</div>
with this Perl code
use strict;
use warnings 'all';
use HTML::Template;
my $template = HTML::Template->new(filename => 'test.tmpl');
my #count = (
{ name => 'count1', value => 1 },
{ name => 'count2', value => 2 },
{ name => 'count3', value => 3 },
);
$template->param( COUNT => \#count );
print $template->output, "\n";
then this is the result. I assume you were hoping for something similar
<div class="input-field col s6">
<select>
<option value="1">count1</option>
<option value="2">count2</option>
<option value="3">count3</option>
</select>
</div>
According to the documentation, the loop variable needs to be a AoH. Based on your template, it specifically needs to loop like
my #ARRAY = (
{ NAME => 'count1', VALUES => 1 },
{ NAME => 'count2', VALUES => 2 },
{ NAME => 'count3', VALUES => 3 },
);
You can ignore these errors if there are fields in your result set that you want to skip in the template...
my $template = HTML::Template->new(filename => $filename, die_on_bad_params => 0);
Use with caution!

How to repost a webpage?

I am creating a simple perl script to create a web page to register users. This is just a learning program for me. It is very simple. I will display a page on the browser. The user enters name, user name, and password. After the user presses submit, I will check the user name against the database. If the user name exists in the database, I just want to display an error and bring up the register page again. I am using the cgi->redirect function. I am not sure if that is how I should use the redirection function. It does not work like I thought. It display "The document has moved here". Please point me to the right way. Thanks.
Here is the scripts
registeruser.pl
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print <<PAGE;
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1> Register New User</h1>
</div>
<div id="content">
<form action="adduser.pl" method="POST">
<b>Name:</b> <input type="text" name="name"><br>
<b>UserName:</b> <input type="text" name="username"><br>
<b>Password:</b> <input type="password" name="password"><br>
<input type="submit">
</div>
</body>
<html>
PAGE
adduser.pl
#!/usr/bin/perl
use CGI;
use DBI;
$cgiObj = CGI->new;
print $cgiObj->header ('text/html');
# get post data
$newUser = $cgiObj->param('username');
$newName = $cgiObj->param('name');
$newPass = $cgiObj->param('password');
# set up sql connection
$param = 'DBI:mysql:Tracker:localhost';
$user = 'madison';
$pass = 'qwerty';
$connect = DBI->connect ($param, $user, $pass);
$sql = 'select user from users where user = "' . $newUser . '"';
$query = $connect->prepare ($sql);
$query->execute;
$found = 0;
while (#row = $query->fetchrow_array)
{
$found = 1;
}
if ($found == 0)
{
# no user found add new user
$sql = 'insert into users (user, name, passwd) values (?, ?, ?)';
$insert = $connect->prepare ($sql);
$insert->execute ($newUser, $newName, $newPass);
}
else
{
# user already exists, get new user name
# What do I do here ????
print $cgiObj->redirect ("registerusr.pl");
}
One thing to look out for, SQL Injection. For an illustrated example, Little Bobby Tables.
As it stands your code is inescure, and can allow people to do bad things to your database. DBI provides placeholders as a secure way of querying a database with user input. Example http://bobby-tables.com/perl.html
Also, in this day and age even the CGI module warns you not to use it:
The rational for this decision is that CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
I suggest you use Dancer to make your life easier.
Three things
Include use strict; and use warnings; in EVERY perl script. No exceptions.
This is the #1 thing that you can do to be a better perl programmer. It will save you an incalculable amount of time during both development and testing.
Don't use redirects to switch between form processing and form display
Keep your form display and form processing in the same script. This enables you to display error messages in the form and only move on to a new step upon a successfully processed form.
You simply need to test the request_method to determine if the form is needing to be processed or just displayed.
CGI works for learning perl, but look at CGI::Alternatives for live code.
The following is your form refactored with the first 2 guidelines in mind:
register.pl:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $q = CGI->new;
my $name = $q->param('name') // '';
my $username = $q->param('username') // '';
my $password = $q->param('password') // '';
# Process Form
my #errors;
if ( $q->request_method() eq 'POST' ) {
if ( $username =~ /^\s*$/ ) {
push #errors, "No username specified.";
}
if ( $password =~ /^\s*$/ ) {
push #errors, "No password specified.";
}
# Successful Processing
if ( !#errors ) {
# Obfuscate for display
$password =~ s/./*/g;
print $q->header();
print <<"END_PAGE";
<html>
<head><title>Success</title></head>
<body>
<p>Name = $name</p>
<p>Username = $username</p>
<p>Password = $password</p>
</body>
</html>
END_PAGE
exit;
}
}
# Display Form
print $q->header();
print <<"END_PAGE";
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1>Register New User</h1>
</div>
#{[ #errors ? join("\n", map "<p>Error: $_</p>", #errors) : '' ]}
<div id="content">
<form action="register.pl" method="POST">
<b>Name:</b> #{[ $q->textfield( -name => 'name' ) ]}<br>
<b>UserName:</b> #{[ $q->textfield( -name => 'username' ) ]}<br>
<b>Password:</b> #{[ $q->password_field( -name => 'password' ) ]}<br>
<input type="submit">
</div>
</body>
<html>
END_PAGE
__DATA__

WWW::Mechanize::Firefox follow_link not working

I'm trying to follow a link in Perl.
My initial code:
use WWW::Mechanize::Firefox;
use Crypt::SSLeay;
use HTML::TagParser;
use URI::Fetch;
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}=0; #not verifying certificate
my $url = 'https://';
$url = $url.#ARGV[0];
my $mech = WWW::Mechanize::Firefox->new;
$mech->get($url);
$mech->follow_link(tag => 'a', text => '<span class=\"normalNode\">VSCs</span>');
$mech->reload();
I found here that the tag and text options work this way but I got the error MozRepl::RemoteObject: SyntaxError: The expression is not a legal expression. I tried to escape some characters in the text, but the error was still the same.
Then I changed my code adding:
my #list = $mech->find_all_links();
my $found = 0;
my $i=0;
while($i<=$#list && $found == 0){
print #list[$i]->url()."\n";
if(#list[$i]->text() =~ /VSCs/){
print #list[$i]->text()."\n";
my $follow =#list[$i]->url();
$mech->follow_link( url => $follow);
}
$i++;
}
But then again there's an error: No link found matching '//a[(#href = "https://... and a lot of more text that seems to be the link's description.
I hope I made myself clear, if not, please tell me what else to add. Thanks to all for your help.
Here's the part where the link I want to follow is:
<li id="1" class="liClosed"><span class="bullet clickable"> </span><b><span class="normalNode">VSCs</span></b>
<ul id="1.l1">
<li id="i1.i1" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">First</span></b></li>
<li id="i1.i2" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Second</span></b></li>
<li id="i1.i3" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Third</span></b></li>
<li id="i1.i4" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Fourth</span></b></li>
<li id="i1.i5" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">None</span></b></li>
</ul>
I'm working in Windows 7, MozRepl is version 1.1 and I'm using Strawberry perl 5.16.2.1 for 64 bits
After poking around with the given code I was able to make W::M::F to follow the links in a following manner:
use WWW::Mechanize::Firefox;
use Crypt::SSLeay;
use HTML::TagParser;
use URI::Fetch;
...
$mech->follow_link(xpath => '//a[text() = "<span class=\"normalNode\">VSCs</span>"]');
$mech->reload();
Note xpath parameter given instead of text.
I didn't take a long look into W::M::F sources, but under the hood it tries to translate given text parameter into XPath string, and if text contains number of XML/HTML tags, which is your case, it probably drives him crazy.
I recommend you to try :
$mech->follow_link( url_regex => qr/selector=All/ );

Perl Treebuilder HTML Parsing, can't seem to parse to DIV, getting error "Use of uninitialized value in pattern match "

I'm new to using the Perl treebuilder module for HTML parsing and can't figure out what the issue is with this.. I have spent a few hours trying to get this to work and looked at a few tutorials but am still getting this error: "Use of uninitialized value in pattern match ", referring to this line in my code:
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
This error prints out many times in the terminal, I have checked everything over and over and its definitely getting the input as the $downloaded page is a full HTML file that contains the string I give below... any advice is greatly appreciated.
sample string, contained within the $downloadedpage variable
<div class='snap_preview'><p><img src="http://www.dishbase.com/recipe_images/large/chicken-enchiladas-12005010871.jpg" width="160" height="115" align="left" border="0" alt="Mexican dishes recipes" style="border:none;"><b>Mexican dishes recipes</b> <i></i><br />
Mexican cuisine is popular the world over for its intense flavor and colorful presentation. Traditional Mexican recipes such as tacos, quesadillas, enchiladas and barbacoa are consistently explored for options by some of the world’s foremost gourmet chefs. A celebration of spices and unique culinary trends, Mexican food is now dominating world cuisines.</p>
<div style="margin-top: 1em" class="possibly-related"><hr /><p><strong>Possibly related posts: (automatically generated)</strong></p><ul><li><a rel='related' href='http://vireja59.wordpress.com/2010/02/13/all-best-italian-dishes-recipes/' style='font-weight:bold'>All best Italian dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/05/24/liver-dishes-recipes/' style='font-weight:bold'>Liver dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/04/24/parsley-in-cooking/' style='font-weight:bold'>Parsley in cooking</a></li></ul></div>
my code:
my $tree = HTML::TreeBuilder->new();
$tree->parse($downloadedpage);
$tree->eof();
#the article is in the div with class "snap_preview"
#article = $tree->look_down(
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
Using the exact code and example you gave,
use warnings;
use strict;
use HTML::TreeBuilder;
my $downloadedpage=<<EOF;
<div class='snap_preview'><p><img src="http://www.dishbase.com/recipe_images/large/chicken-enchiladas-12005010871.jpg" width="160" height="115" align="left" border="0" alt="Mexican dishes recipes" style="border:none;"><b>Mexican dishes recipes</b> <i></i><br />
Mexican cuisine is popular the world over for its intense flavor and colorful presentation. Traditional Mexican recipes such as tacos, quesadillas, enchiladas and barbacoa are consistently explored for options by some of the world’s foremost gourmet chefs. A celebration of spices and unique culinary trends, Mexican food is now dominating world cuisines.</p>
<div style="margin-top: 1em" class="possibly-related"><hr /><p><strong>Possibly related posts: (automatically generated)</strong></p><ul><li><a rel='related' href='http://vireja59.wordpress.com/2010/02/13/all-best-italian-dishes-recipes/' style='font-weight:bold'>All best Italian dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/05/24/liver-dishes-recipes/' style='font-weight:bold'>Liver dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/04/24/parsley-in-cooking/' style='font-weight:bold'>Parsley in cooking</a></li></ul></div>
EOF
my $tree = HTML::TreeBuilder->new();
$tree->parse($downloadedpage);
$tree->eof();
#the article is in the div with class "snap_preview"
my #article = $tree->look_down(
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
I don't get any errors at all. My first guess would be that there are some <div>s in the HTML which don't have a class attribute.
Maybe you need to write
sub{
$_[0]-> tag() eq 'div' and
$_[0]->attr('class') and
($_[0]->attr('class') =~ /snap_preview/)
}
there?