Extracting links inside <div>'s with HTML::TokeParser & URI

Extracting links inside <div>'s with HTML::TokeParser & URI - perl

I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>

I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = #_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

Related

Find and extract content of division of certain class using DomXPath

I am trying to extract and save into PHP string (or array) the content of a certain section of a remote page. That particular section looks like:
<section class="intro">
<div class="container">
<h1>Student Club</h1>
<h2>Subtitle</h2>
<p>Lore ipsum paragraph.</p>
</div>
</section>
And since I can't narrow down using class container because there are several other sections of class "container" on the same page and because there is the only section of class "intro", I use the following code to find the right division:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
#$doc->loadHTMLFile("https://www.remotesite.tld/remotepage.html");
$finder = new DomXPath($doc);
$intro = $finder->query("//*[contains(#class, 'intro')]");
And at this point, I'm hitting a problem - can't extract the content of $intro as PHP string.
Trying further the following code
foreach ($intro as $item) {
$string = $item->nodeValue;
echo $string;
}
gives only the text value, all the tags are stripped and I really need all those divs, h1 and h2 and p tags preserved for further manipulation needs.
Trying:
foreach ($intro->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo $name;
echo $value;
}
is giving the error:
Notice: Undefined property: DOMNodeList::$attributes in
So how could I extract the full HTML code of the found DOM elements?

I knew I was so close... I just needed to do:
foreach ($intro as $item) {
$h1= $item->getElementsByTagName('h1');
$h2= $item->getElementsByTagName('h2');
$p= $item->getElementsByTagName('p');
}

How to repost a webpage?

I am creating a simple perl script to create a web page to register users. This is just a learning program for me. It is very simple. I will display a page on the browser. The user enters name, user name, and password. After the user presses submit, I will check the user name against the database. If the user name exists in the database, I just want to display an error and bring up the register page again. I am using the cgi->redirect function. I am not sure if that is how I should use the redirection function. It does not work like I thought. It display "The document has moved here". Please point me to the right way. Thanks.
Here is the scripts
registeruser.pl
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print <<PAGE;
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1> Register New User</h1>
</div>
<div id="content">
<form action="adduser.pl" method="POST">
<b>Name:</b> <input type="text" name="name"><br>
<b>UserName:</b> <input type="text" name="username"><br>
<b>Password:</b> <input type="password" name="password"><br>
<input type="submit">
</div>
</body>
<html>
PAGE
adduser.pl
#!/usr/bin/perl
use CGI;
use DBI;
$cgiObj = CGI->new;
print $cgiObj->header ('text/html');
# get post data
$newUser = $cgiObj->param('username');
$newName = $cgiObj->param('name');
$newPass = $cgiObj->param('password');
# set up sql connection
$param = 'DBI:mysql:Tracker:localhost';
$user = 'madison';
$pass = 'qwerty';
$connect = DBI->connect ($param, $user, $pass);
$sql = 'select user from users where user = "' . $newUser . '"';
$query = $connect->prepare ($sql);
$query->execute;
$found = 0;
while (#row = $query->fetchrow_array)
{
$found = 1;
}
if ($found == 0)
{
# no user found add new user
$sql = 'insert into users (user, name, passwd) values (?, ?, ?)';
$insert = $connect->prepare ($sql);
$insert->execute ($newUser, $newName, $newPass);
}
else
{
# user already exists, get new user name
# What do I do here ????
print $cgiObj->redirect ("registerusr.pl");
}

One thing to look out for, SQL Injection. For an illustrated example, Little Bobby Tables.
As it stands your code is inescure, and can allow people to do bad things to your database. DBI provides placeholders as a secure way of querying a database with user input. Example http://bobby-tables.com/perl.html
Also, in this day and age even the CGI module warns you not to use it:
The rational for this decision is that CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
I suggest you use Dancer to make your life easier.

Three things
Include use strict; and use warnings; in EVERY perl script. No exceptions.
This is the #1 thing that you can do to be a better perl programmer. It will save you an incalculable amount of time during both development and testing.
Don't use redirects to switch between form processing and form display
Keep your form display and form processing in the same script. This enables you to display error messages in the form and only move on to a new step upon a successfully processed form.
You simply need to test the request_method to determine if the form is needing to be processed or just displayed.
CGI works for learning perl, but look at CGI::Alternatives for live code.
The following is your form refactored with the first 2 guidelines in mind:
register.pl:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $q = CGI->new;
my $name = $q->param('name') // '';
my $username = $q->param('username') // '';
my $password = $q->param('password') // '';
# Process Form
my #errors;
if ( $q->request_method() eq 'POST' ) {
if ( $username =~ /^\s*$/ ) {
push #errors, "No username specified.";
}
if ( $password =~ /^\s*$/ ) {
push #errors, "No password specified.";
}
# Successful Processing
if ( !#errors ) {
# Obfuscate for display
$password =~ s/./*/g;
print $q->header();
print <<"END_PAGE";
<html>
<head><title>Success</title></head>
<body>
<p>Name = $name</p>
<p>Username = $username</p>
<p>Password = $password</p>
</body>
</html>
END_PAGE
exit;
}
}
# Display Form
print $q->header();
print <<"END_PAGE";
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1>Register New User</h1>
</div>
#{[ #errors ? join("\n", map "<p>Error: $_</p>", #errors) : '' ]}
<div id="content">
<form action="register.pl" method="POST">
<b>Name:</b> #{[ $q->textfield( -name => 'name' ) ]}<br>
<b>UserName:</b> #{[ $q->textfield( -name => 'username' ) ]}<br>
<b>Password:</b> #{[ $q->password_field( -name => 'password' ) ]}<br>
<input type="submit">
</div>
</body>
<html>
END_PAGE
__DATA__

HTML parsing with HTML::TokeParser::Simple

I am parsing an HTML file with HTML::TokeParser::Simple. The HTML file has the content shown far below. My problem is, I am trying to ignore the JavaScript from showing up as text content. Example:
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'test.html' );
while ( my $token = $p->get_token ) {
next unless $token->is_text;
print $token->as_is, "\n";
}
This prints the output as seen below:
Test HTML
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
The actual HTML Document Content:
<html>
<span>Test HTML</span>
<script type="text/javascript">
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
</script>
</html>
How do I ignore the JavaScript tag contents from showing.

I get the desired result. Comments are (correctly) not considered text by the version I have. Looks like you need to upgrade the modules you are using. (I used HTML::Parser 3.69 and HTML::TokeParser::Simple 3.15.)
>perl a.pl
Test HTML
>
You'll still have to process HTML entities and format the text usefully, the latter being quite difficult since you removed all formatting instruction. Your approach seems fatally flawed.

I believe you only need to use the as_text method.
my $tree = HTML::TreeBuilder->new();
$tree->parse( $html );
$tree->eof();
$tree->elementify(); # just for safety
my $text = $tree->as_text();
$tree->delete;
I adapted this from the WWW::Mechanize module (http://search.cpan.org/dist/WWW-Mechanize/) which has tons of convenience methods that can help you. It basically acts as a web browser in an object.

Scan through the token to ignore all open and close script tags. See below as used to resolved the issue.
my $ignore=0;
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag('script') ) {
print $token->as_is, "\n";
$ignore = 1;
next;
}
if ( $token->is_end_tag('script') ) {
$ignore = 0;
print $token->as_is, "\n";
next;
}
if ($ignore) {
#Everything inside the script tag. Here you can ignore or print as is
print $token->as_is, "\n";
}
else
{
#Everything excluding scripts falls here handle as appropriate
next unless $token->is_text;
print $token->as_is, "\n";
}
}

Formmail - make it show name and email in success response

I'm using formmail.pl to handle a form I'm using on my site. There are several fields which are sent to the script including name, email, phone and some text in a text area.
As it stands, the successful submission prints the phone and textarea data correctly but doesn't print the name and email which were entered. In the email it sends the name and email in the 'from' header and as with the success page only shows the phone and textarea data are shown in the email body.
I would like to show all data in both cases however I can't seem to find the section of code that handles this. I'd post up the formmail.pl script except its over 3000 lines of code so I'll just post the places I think are responsible and hopefully somebody can point me in the right direction. I'm fairly new to Perl and its a bit overwhelming reading and understanding a script of this size.
sub success_page {
my ($self, $date) = #_;
if ($self->{FormConfig}{'redirect'}) {
print $self->cgi_object->redirect( $self->{FormConfig}{'redirect'} );
}
elsif ( $self->{CFG}{'no_content'}) {
print $self->cgi_object->header(Status => 204);
}
else {
$self->output_cgi_html_header;
$self->success_page_html_preamble($date);
$self->success_page_fields;
$self->success_page_footer;
}
}
sub success_page_html_preamble {
my ($self, $date) = #_;
my $title = $self->escape_html( $self->{FormConfig}{'title'} || 'Success' );
my $torecipient = 'to ' . $self->escape_html($self->{FormConfig}{'recipient'});
$torecipient = '' if $self->{Hide_Recipient};
my $attr = $self->body_attributes;
print <<END;
<head>
<title>$title</title>
END
$self->output_style_element;
print <<END;
<link type="text/css" href="css/stylesheet.css" rel="stylesheet" /></script>
</head>
<body>
<p>Below is what you submitted $torecipient on $date</p>
END
}
sub success_page_fields {
my ($self) = #_;
foreach my $f (#{ $self->{Field_Order} }) {
my $val = (defined $self->{Form}{$f} ? $self->{Form}{$f} : '');
$self->success_page_field( $self->escape_html($f), $self->escape_html($val) );
}
}
sub success_page_field {
my ($self, $name, $value) = #_;
print "<p><b>$name:</b> $value</p>\n";
}
Okay that's getting a bit long. That stuff is mostly for the success page and not much to do with the email side of things but maybe if somebody can find what I need there I can apply it to the email section also.
If any further information is needed let me know
Thanks in Advance

I haven't really used NMS FormMail myself, but looking at the source, it seems that you should be able to achieve something like what you want by setting the following extra configuration options:
$more_config{include_config_email} = 1;
$more_config{include_config_realname} = 1;
This should cause FormMail.pl to treat the email and realname fields as normal form fields, in addition to their special meaning.

How can I get the next immediate tag with HTML::Tokeparser?

I'm trying to get the tags that occur immediately after a particular div tag. For e.g., I have html code
<div id="example">
<h2>Example</h2>
<p>Hello !World</p>
</div>
I'm doing the following,
while ( $tag = $stream->get_tag('div') ) {
if( $tag->[1]{id} eq 'Example' ) {
$tag = $stream->get_tag;
$tag = $stream->get_tag;
if ( $tag->[0] eq 'div' ) {
...
}
}
}
But this throws the error
Can't use string ("</h2>") as a HASH ref while "strict refs" in use
It works fine if I say
$tag = $stream->get_tag('h2');
$tag = $stream->get_tag('p');
But I can't have that because I need to get the immediate two tags and verify if they are what i expect them to be.

It would be easier to tell if you posted a runnable example program, but it looks like the problem is you didn't realize that get_tag returns both start and end tags. End tags don't have attributes. Start tags are returned as [$tag, $attr, $attrseq, $text], and end tags are returned as ["/$tag", $text].

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Extracting links inside <div>'s with HTML::TokeParser & URI - perl

Related

Find and extract content of division of certain class using DomXPath

How to repost a webpage?

HTML parsing with HTML::TokeParser::Simple

Formmail - make it show name and email in success response

How can I get the next immediate tag with HTML::Tokeparser?

Categories

Resources