How to extract some text from an HTML doc using Web::Query

How to extract some text from an HTML doc using Web::Query - perl

I'm trying to extract the subject (between the h3 tags) in the following example using Web::Query. Find 'h3' returns the author text, but I want the h3 in the subject class instead. I tried .subject.div.h3 but it returns undef.
#!/usr/bin/perl
use strict;
use warnings;
use Web::Query; # libweb-query-perl
use Data::Dumper;
my $testhtml ='
<html><head></head>
<body>
<div class="author"
<div><h3>Neil Watson</h3></div>
</div>
<div class="subject">
<div><h3>#if version_after macro is illogical</h3></div>
</div>
</body>
</html>
';
my $parts = Web::Query->new_from_html( $testhtml );
my $subject = $parts->find( 'div.subject.div.h3' )->text;
print "subjectfinal ".Dumper( $subject );

The dot selector denotes class selections, which is not what you intend for the second div and h3. For these you want descendant. The correct syntax is;
my $subject = $parts->find( 'div.subject > div > h3' )->text;
# Which outputs
# subjectfinal $VAR1 = '#if version_after macro is illogical';
For more information on CSS selectors which is what Web::Query is loosely based off have a look at http://www.w3schools.com/cssref/css_selectors.asp

Related

DOMXPath multiple contain selectors not working

I have the following XPath query that a kind user on SO helped me with:
$xpath->query(".//*[not(self::textarea or self::select or self::input) and contains(., '{{{')]/text()") as $node)
Its purpose is to replace certain placeholders with a value, and correctly catches occurences such as the below that should not be replaced:
<textarea id="testtextarea" name="testtextarea">{{{variable:test}}}</textarea>
And replaces correctly occurrences like this:
<div>{{{variable:test}}}</div>
Now I want to exclude elements that are of type <div> that contain the class name note-editable in that query, e.g., <div class="note-editable mayhaveanotherclasstoo">, in addition to textareas, selects or inputs.
I have tried:
$xpath->query(".//*[not(self::textarea or self::select or self::input) and not(contains(#class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
and:
$xpath->query(".//*[not(self::textarea or self::select or self::input or contains(#class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
I have followed the advice on some questions similar to this: PHP xpath contains class and does not contain class, and I do not get PHP errors, but the note-editable <div> tags are still having their placeholders replaced.
Any idea what's wrong with my attempted queries?
EDIT
Minimum reproducible DOM sample:
<div class="note-editing-area">
<textarea class="note-codable"></textarea>
<div class="note-editable panel-body" contenteditable="true" style="height: 350px;">{{{variable:system_url}}</div>
</div>
Code that does the replacement:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input or self::div[contains(#class,'note-editable')]) and contains(., '{{{')]/text()") as $node) {
$node->nodeValue = preg_replace_callback('~{{{([^:]+):([^}]+)}}}~', function($m) use ($placeholders) {
return $placeholders[$m[1]][$m[2]] ?? '';
},
$node->nodeValue);
}
$html = $dom->saveHTML();
echo html_entity_decode($html);

Use this below xpath.
.//*[not(self::textarea or self::select or self::input or self::div[contains(#class,'note-editable')]) and contains(., '{{{')]

How to repost a webpage?

I am creating a simple perl script to create a web page to register users. This is just a learning program for me. It is very simple. I will display a page on the browser. The user enters name, user name, and password. After the user presses submit, I will check the user name against the database. If the user name exists in the database, I just want to display an error and bring up the register page again. I am using the cgi->redirect function. I am not sure if that is how I should use the redirection function. It does not work like I thought. It display "The document has moved here". Please point me to the right way. Thanks.
Here is the scripts
registeruser.pl
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print <<PAGE;
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1> Register New User</h1>
</div>
<div id="content">
<form action="adduser.pl" method="POST">
<b>Name:</b> <input type="text" name="name"><br>
<b>UserName:</b> <input type="text" name="username"><br>
<b>Password:</b> <input type="password" name="password"><br>
<input type="submit">
</div>
</body>
<html>
PAGE
adduser.pl
#!/usr/bin/perl
use CGI;
use DBI;
$cgiObj = CGI->new;
print $cgiObj->header ('text/html');
# get post data
$newUser = $cgiObj->param('username');
$newName = $cgiObj->param('name');
$newPass = $cgiObj->param('password');
# set up sql connection
$param = 'DBI:mysql:Tracker:localhost';
$user = 'madison';
$pass = 'qwerty';
$connect = DBI->connect ($param, $user, $pass);
$sql = 'select user from users where user = "' . $newUser . '"';
$query = $connect->prepare ($sql);
$query->execute;
$found = 0;
while (#row = $query->fetchrow_array)
{
$found = 1;
}
if ($found == 0)
{
# no user found add new user
$sql = 'insert into users (user, name, passwd) values (?, ?, ?)';
$insert = $connect->prepare ($sql);
$insert->execute ($newUser, $newName, $newPass);
}
else
{
# user already exists, get new user name
# What do I do here ????
print $cgiObj->redirect ("registerusr.pl");
}

One thing to look out for, SQL Injection. For an illustrated example, Little Bobby Tables.
As it stands your code is inescure, and can allow people to do bad things to your database. DBI provides placeholders as a secure way of querying a database with user input. Example http://bobby-tables.com/perl.html
Also, in this day and age even the CGI module warns you not to use it:
The rational for this decision is that CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
I suggest you use Dancer to make your life easier.

Three things
Include use strict; and use warnings; in EVERY perl script. No exceptions.
This is the #1 thing that you can do to be a better perl programmer. It will save you an incalculable amount of time during both development and testing.
Don't use redirects to switch between form processing and form display
Keep your form display and form processing in the same script. This enables you to display error messages in the form and only move on to a new step upon a successfully processed form.
You simply need to test the request_method to determine if the form is needing to be processed or just displayed.
CGI works for learning perl, but look at CGI::Alternatives for live code.
The following is your form refactored with the first 2 guidelines in mind:
register.pl:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $q = CGI->new;
my $name = $q->param('name') // '';
my $username = $q->param('username') // '';
my $password = $q->param('password') // '';
# Process Form
my #errors;
if ( $q->request_method() eq 'POST' ) {
if ( $username =~ /^\s*$/ ) {
push #errors, "No username specified.";
}
if ( $password =~ /^\s*$/ ) {
push #errors, "No password specified.";
}
# Successful Processing
if ( !#errors ) {
# Obfuscate for display
$password =~ s/./*/g;
print $q->header();
print <<"END_PAGE";
<html>
<head><title>Success</title></head>
<body>
<p>Name = $name</p>
<p>Username = $username</p>
<p>Password = $password</p>
</body>
</html>
END_PAGE
exit;
}
}
# Display Form
print $q->header();
print <<"END_PAGE";
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1>Register New User</h1>
</div>
#{[ #errors ? join("\n", map "<p>Error: $_</p>", #errors) : '' ]}
<div id="content">
<form action="register.pl" method="POST">
<b>Name:</b> #{[ $q->textfield( -name => 'name' ) ]}<br>
<b>UserName:</b> #{[ $q->textfield( -name => 'username' ) ]}<br>
<b>Password:</b> #{[ $q->password_field( -name => 'password' ) ]}<br>
<input type="submit">
</div>
</body>
<html>
END_PAGE
__DATA__

storing radio button value and presetting after refresh- cgi visiblity issue

I have a html table that has 2 radio buttons for every row and a save button. I want to store the value of the radio button when saved and preset the value when the page is revisited.This is the html code I have written
<form action='table_extract.cgi' method = 'get'>
<td><input type='radio' name='signoff' value = 'approve'>Approve<br>
<input type='radio' name='signoff' value='review'>Review</td>
<td><input type='submit' name='button' value='Save'/></td></form>
This is what is in table_extract.cgi
#!usr/local/bin/perl
use CGI qw(:standard);
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
use strict;
use warnings;
print <<END;
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
END
my $regfile = 'signoff.out';
my $sign;
$sign = param('signoff');
open(REG,">>$regfile") or fail();
print REG "$sign\n";
close(REG);
print "param value :", param('signoff');
print <<END;
<title>Thank you!</title>
<h1>Thank you!</h1>
<p>signoff preference:$sign </p>
END
sub fail {
print "<title>Error</title>",
"<p>Error: cannot record your registration!</p>";
exit; }
This is just first part of the problem. I was not able to find any output in console or in poll.out. Once I read the values, I need to preset the values to the radio buttons that was saved by the user in the previous visit.

The problem is in the HTML, your submit button has two type attributes. When I repaired this, the form worked for me.
You are doing too much work to save the form. See chapter SAVING THE STATE OF THE SCRIPT TO A FILE in the documentation.

Using Perl Mechanize to strip text from a web page

I am trying to scrape only the test information from a web page which is set up with a set of divs, tags etc. I want to only extract information from a specific div class, and again only the information within the tags.
<div class="col col60 moduledetail"><table cellspacing="2" cellpadding="0" border="0" id="moduleDetail"><tr><th class="moduleCode">test<small>CRN: 33413</small></th><th>test</th></tr><tr><td class="label"><nobr>Campus</nobr></td><td><a target="_blank" href="test/">test</a></td></tr><tr><td class="label">
above is a snippet of what is contained within the web page. My attempt at getting the page contents is doing exactly what it says, its getting everything from the web page, how can i narrow this down to this class and only the textual information within the tags.
thanks

Use a HTML parser. Here's an example using HTML::TreeBuilder:
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get($url);
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
if (my $div = $tree->look_down(_tag => "div", class => "col col60 moduledetail")) {
print $div->as_text(), "\n";
}
$tree->delete();

Extracting links inside <div>'s with HTML::TokeParser & URI

I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>

I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = #_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to extract some text from an HTML doc using Web::Query - perl

Related

DOMXPath multiple contain selectors not working

How to repost a webpage?

storing radio button value and presetting after refresh- cgi visiblity issue

Using Perl Mechanize to strip text from a web page

Extracting links inside <div>'s with HTML::TokeParser & URI

Categories

Resources