HTML parsing with HTML::TokeParser::Simple - perl

I am parsing an HTML file with HTML::TokeParser::Simple. The HTML file has the content shown further below. My problem is that I want to keep the JavaScript from showing up as text content. Example:
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'test.html' );
while ( my $token = $p->get_token ) {
next unless $token->is_text;
print $token->as_is, "\n";
}
This prints the output as seen below:
Test HTML
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
The actual HTML Document Content:
<html>
<span>Test HTML</span>
<script type="text/javascript">
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
</script>
</html>
How do I keep the contents of the script tag from showing up?

I get the desired result. Comments are (correctly) not considered text by the version I have. Looks like you need to upgrade the modules you are using. (I used HTML::Parser 3.69 and HTML::TokeParser::Simple 3.15.)
>perl a.pl
Test HTML
>
You'll still have to process HTML entities and format the text usefully, the latter being quite difficult since you removed all formatting instruction. Your approach seems fatally flawed.

I believe you only need to use the as_text method.
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse( $html );
$tree->eof();
$tree->elementify(); # just for safety
my $text = $tree->as_text();
$tree->delete;
I adapted this from the WWW::Mechanize module (http://search.cpan.org/dist/WWW-Mechanize/) which has tons of convenience methods that can help you. It basically acts as a web browser in an object.

Scan through the tokens and track the opening and closing script tags so that everything between them can be ignored. The code below is what I used to resolve the issue.
my $ignore=0;
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag('script') ) {
print $token->as_is, "\n";
$ignore = 1;
next;
}
if ( $token->is_end_tag('script') ) {
$ignore = 0;
print $token->as_is, "\n";
next;
}
if ($ignore) {
#Everything inside the script tag. Here you can ignore or print as is
print $token->as_is, "\n";
}
else
{
#Everything excluding scripts falls here handle as appropriate
next unless $token->is_text;
print $token->as_is, "\n";
}
}
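A condensed variant of the flag approach above, as a minimal sketch: it assumes the HTML is held in a string (the sample document and the names $in_script and @text are made up for illustration) and simply drops everything between the opening and closing script tags instead of printing it.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample document, modelled on the one in the question (assumption).
my $html = <<'HTML';
<html>
<span>Test HTML</span>
<script type="text/javascript">
var form_submitted = 0;
</script>
</html>
HTML

my $p = HTML::TokeParser::Simple->new( string => $html );

my $in_script = 0;    # true while we are between <script> and </script>
my @text;             # collected text content outside of scripts

while ( my $token = $p->get_token ) {
    if ( $token->is_start_tag('script') ) { $in_script = 1; next; }
    if ( $token->is_end_tag('script') )   { $in_script = 0; next; }
    next if $in_script;               # skip whatever the script contains
    next unless $token->is_text;      # only text tokens are of interest

    my $chunk = $token->as_is;
    next unless $chunk =~ /\S/;       # ignore whitespace-only text
    $chunk =~ s/^\s+|\s+$//g;         # trim surrounding whitespace
    push @text, $chunk;
}

print "$_\n" for @text;
```

The flag works whether the parser reports the script body as text or as a comment, so it does not depend on the module version.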

Related

Perl Read a file into a variable and add suffix to each lines

I'm very new to Perl and I'm having a hard time finding out how to do what I want.
I have a text file containing something like
text 2015-02-02:
- blabla1
- blabla2
text2 2014-12-12:
- blabla
- ...
I'm trying to read the file, put it in a variable, add </br> to the end of each line (of my variable) and send it to a web page.
This is what I have for the moment. It works except for the </br> part.
if (open (IN, "CHANGELOG.OLD")) {
local $/;
$oldchangelog = <IN>'</br>';
close (IN);
$tmplhtml{'CHANGELOG'} = $oldchangelog;
} else {
# changelog not available
$tmplhtml{'CHANGELOG'} = "Changelog not available";
}
thanks for the help!
As someone comments - this looks like YAML, so parsing as YAML is probably more appropriate.
However to address your scenario:
3 argument file opens are good.
you're using local $/; which means you're reading the whole file into a string. This is not suitable for line by line processing.
Looks like you're putting everything into one element of a hash. Is there any particular reason you're doing this?
Anyway:
if ( open ( my $input, "<", "CHANGELOG.OLD" ) ) {
while ( my $line = <$input> ) {
$tmplhtml{'CHANGELOG'} .= $line . " <BR/>\n";
}
}
else {
$tmplhtml{'CHANGELOG'} = "Changelog not available";
}
As an alternative - you can render text 'neatly' to HTML using <PRE> tags.
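The line-by-line approach can also be wrapped in a small function. This is only a sketch: escape_html and changelog_to_html are hypothetical helper names, the sample text stands in for CHANGELOG.OLD, and an in-memory filehandle is used so the example runs without a file on disk. Unlike the loop above, it chomps each line before adding the tag, and it escapes HTML special characters, which the question did not ask for but is usually wanted when text goes into a web page.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Read a filehandle line by line and append <br/> to every line.
sub changelog_to_html {
    my ($fh) = @_;
    my $html = '';
    while ( my $line = <$fh> ) {
        chomp $line;
        $html .= escape_html($line) . "<br/>\n";
    }
    return $html;
}

# Stand-in for the contents of CHANGELOG.OLD (assumption).
my $sample = "text 2015-02-02:\n- blabla1\n- blabla2\n";

open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my $changelog = changelog_to_html($in);
close $in;

print $changelog;
```

With a real file, replace the in-memory open with open my $in, '<', 'CHANGELOG.OLD' and keep the else branch from the answer for the "not available" case.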

Formmail - make it show name and email in success response

I'm using formmail.pl to handle a form I'm using on my site. There are several fields which are sent to the script including name, email, phone and some text in a text area.
As it stands, the success page prints the phone and textarea data correctly but doesn't print the name and email that were entered. The email puts the name and email in the 'from' header, and, as with the success page, only the phone and textarea data are shown in the email body.
I would like to show all data in both cases, but I can't seem to find the section of code that handles this. I'd post the formmail.pl script except it's over 3000 lines of code, so I'll just post the places I think are responsible and hopefully somebody can point me in the right direction. I'm fairly new to Perl and it's a bit overwhelming reading and understanding a script of this size.
sub success_page {
my ($self, $date) = @_;
if ($self->{FormConfig}{'redirect'}) {
print $self->cgi_object->redirect( $self->{FormConfig}{'redirect'} );
}
elsif ( $self->{CFG}{'no_content'}) {
print $self->cgi_object->header(Status => 204);
}
else {
$self->output_cgi_html_header;
$self->success_page_html_preamble($date);
$self->success_page_fields;
$self->success_page_footer;
}
}
sub success_page_html_preamble {
my ($self, $date) = @_;
my $title = $self->escape_html( $self->{FormConfig}{'title'} || 'Success' );
my $torecipient = 'to ' . $self->escape_html($self->{FormConfig}{'recipient'});
$torecipient = '' if $self->{Hide_Recipient};
my $attr = $self->body_attributes;
print <<END;
<head>
<title>$title</title>
END
$self->output_style_element;
print <<END;
<link type="text/css" href="css/stylesheet.css" rel="stylesheet" /></script>
</head>
<body>
<p>Below is what you submitted $torecipient on $date</p>
END
}
sub success_page_fields {
my ($self) = @_;
foreach my $f (@{ $self->{Field_Order} }) {
my $val = (defined $self->{Form}{$f} ? $self->{Form}{$f} : '');
$self->success_page_field( $self->escape_html($f), $self->escape_html($val) );
}
}
sub success_page_field {
my ($self, $name, $value) = @_;
print "<p><b>$name:</b> $value</p>\n";
}
Okay that's getting a bit long. That stuff is mostly for the success page and not much to do with the email side of things but maybe if somebody can find what I need there I can apply it to the email section also.
If any further information is needed let me know
Thanks in Advance
I haven't really used NMS FormMail myself, but looking at the source, it seems that you should be able to achieve something like what you want by setting the following extra configuration options:
$more_config{include_config_email} = 1;
$more_config{include_config_realname} = 1;
This should cause FormMail.pl to treat the email and realname fields as normal form fields, in addition to their special meaning.

Extracting links inside <div>'s with HTML::TokeParser & URI

I'm an old-newbie in Perl, and I'm trying to create a subroutine in Perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within one div called "zone-extract".
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works: run standalone, it extracts all links correctly. But I think there are some issues in the first loop, which looks for my desired div 'zone-extract'... I'm using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool that requires too much code; its weakness is that it only supports a procedural style of programming.
A better alternative, which requires less code thanks to its declarative style, is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = @_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

Cannot print on multiple pages using PDF::API2

I have been tinkering with PDF::API2 and have run into a problem. I can create a PDF file and add text to it without trouble. However, if the text runs over more than one page, the script does not continue printing on the next page. I have researched this without success. I would like each page to have exactly 50 lines of text. My script is below. It prints only on the first page; it creates the other pages but does not print anything on them. Does anyone have a solution?
#!/usr/bin/perl
use PDF::API2;
use POSIX qw(setsid strftime);
my $filename = scalar(strftime('%F', localtime));
my $pdf = PDF::API2->new(-file => "$filename.pdf");
$pdf->mediabox(595,842);
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
$txt->textstart;
$txt->font($fnt, 20);
$txt->translate(100,800);
$txt->text("Lines for $filename");
my $i=0;
my $line = 780;
while($i<310)
{
if(($i%50) == 0)
{
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
}
$txt->font($fnt, 10);
$txt->translate(100,$line);
$txt->text("$i This is the first line");
$line=$line-15;
$i++;
}
$txt->textend;
$pdf->save;
$pdf->end( );
The problem is that you are creating a new page, but the new variables are forgotten immediately:
if(($i%50) == 0)
{
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
}
Every variable you declare with my inside the block disappears at the closing brace. Just remove my and you will be assigning to the variables from the top-level scope.
Edit: You also probably want to reset $line variable when making new page.
The typeface, $fnt, does not have to be changed since it depends on the PDF, $pdf, and not the page, $page.
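The paging logic can be sketched without PDF::API2 at all. In the example below, $page_no and $y are stand-ins (assumptions, not part of the original script) for the $page/$txt objects and the $line coordinate. The point is that the variables are declared once outside the loop and only assigned inside the if block, and that the vertical position is reset on every new page.

```perl
use strict;
use warnings;

my $lines_per_page = 50;
my $top            = 780;    # y coordinate of the first line on a page
my $step           = 15;     # distance between two lines

# Declared ONCE, outside the loop, so the assignments below stay visible.
my $page_no = 0;
my $y       = $top;
my @placed;                  # [ page number, y ] for every line of text

for my $i ( 0 .. 119 ) {
    if ( $i % $lines_per_page == 0 ) {
        # No "my" here: in the real script this is where
        # $page = $pdf->page; $txt = $page->text; would go.
        $page_no++;
        $y = $top;           # start again at the top of the new page
    }
    push @placed, [ $page_no, $y ];
    $y -= $step;
}

printf "line %3d -> page %d, y = %d\n", $_, @{ $placed[$_] }
    for ( 0, 49, 50, 119 );
```

Writing my $page_no or my $y inside the if block would create fresh variables that vanish at the closing brace, which is exactly the behaviour described above.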
As much as I love Perl, I learned enough Python to use the ReportLab library for PDF generation. Creating PDFs is one of the weak spots of Perl compared with Python.

How can I get the next immediate tag with HTML::Tokeparser?

I'm trying to get the tags that occur immediately after a particular div tag. For e.g., I have html code
<div id="example">
<h2>Example</h2>
<p>Hello !World</p>
</div>
I'm doing the following,
while ( $tag = $stream->get_tag('div') ) {
if( $tag->[1]{id} eq 'Example' ) {
$tag = $stream->get_tag;
$tag = $stream->get_tag;
if ( $tag->[0] eq 'div' ) {
...
}
}
}
But this throws the error
Can't use string ("</h2>") as a HASH ref while "strict refs" in use
It works fine if I say
$tag = $stream->get_tag('h2');
$tag = $stream->get_tag('p');
But I can't do that, because I need to get the next two tags and verify that they are what I expect them to be.
It would be easier to tell if you posted a runnable example program, but it looks like the problem is you didn't realize that get_tag returns both start and end tags. End tags don't have attributes. Start tags are returned as [$tag, $attr, $attrseq, $text], and end tags are returned as ["/$tag", $text].
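One way to guard against this is to check the shape of the token before touching its attributes. The sketch below does not call the parser; it uses hand-built arrays in the two shapes just described (the sample tokens and the helper names is_end_tag and tag_id are made up for illustration), so the same checks can be applied to whatever get_tag returns.

```perl
use strict;
use warnings;

# Hand-built tokens mirroring the two shapes (assumption: sample data only).
my @tokens = (
    [ 'div', { id => 'example' }, ['id'], '<div id="example">' ],
    [ 'h2',  {},                  [],     '<h2>' ],
    [ '/h2', '</h2>' ],
    [ 'p',   {},                  [],     '<p>' ],
    [ '/p',  '</p>' ],
    [ '/div', '</div>' ],
);

# An end tag is recognisable by the leading slash in its name.
sub is_end_tag {
    my ($tag) = @_;
    return $tag->[0] =~ m{^/} ? 1 : 0;
}

# Only start tags carry an attribute hash in element 1.
sub tag_id {
    my ($tag) = @_;
    return undef if is_end_tag($tag);
    return $tag->[1]{id};
}

# Names of the start tags only, in document order.
my @start_names = map { $_->[0] } grep { !is_end_tag($_) } @tokens;

print "start tags: @start_names\n";
print "id of first tag: ", tag_id( $tokens[0] ), "\n";
```

In the loop from the question, this means skipping tokens for which is_end_tag is true until the next two start tags have been collected, and only then comparing their names.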