Trouble Replacing Text in HTML fragment using Mojo::DOM - perl

I need to scan through html fragments looking for certain strings in text (not within element attributes) and wrapping those matching strings with a <span></span>. Here's a sample attempt with output:
use v5.10;
use Mojo::DOM;
my $body = qq|
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not here.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
|;
my $dom = Mojo::DOM->new($body);
foreach my $e ($dom->find('*')->each) {
my $text = $e->text;
say "e text is: $text ";
if ($text =~ /Cool/) {
(my $newtext = $text ) =~ s/Cool/<span class="fun">Cool<\/span>/g;
$e->replace_content($newtext);
}
}
say $dom->root;
the output:
e text is:
e text is: Boring Text:
e text is: Highlight Cool whenever we see it. but not. And here is more Cool.
e text is: here
e text is: sub Cool { print "Foo "; }
<div>
<p>Boring Text:</p>
<p>Highlight <span class="fun">Cool</span> whenever we see it. but not. And here is more <span class="fun">Cool</span>.</p>
</div>
Close but what I really want to see is something like the following:
<div>
<p>Boring Text:</p>
<p>Highlight <span class="fun">Cool</span> whenever we see it. but not here.
<code>
sub <span class="fun">Cool<span> {
print "Foo\n";
}
</code>
And here is more <span class="fun">Cool</span>.</p>
</div>
Any help / pointers would be greatly appreciated.
Thanks,
Todd

Having looked into XML::Twig I'm not so sure it's the correct tool. It's surprising how awkward such a simple task can be.
This is a working program that uses HTML::TreeBuilder. Unfortunately it doesn't produce formatted output so I've added some whitespace myself.
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<__HTML__);
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not here.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
__HTML__
$html->objectify_text;
for my $text_node ($html->look_down(_tag => '~text')) {
my $text = $text_node->attr('text');
if (my #replacement = process_text($text)) {
my $old_node = $text_node->replace_with(#replacement);
$old_node->delete;
}
}
$html->deobjectify_text;
print $html->guts->as_XML;
sub process_text {
my #nodes = split /\bCool\b/, shift;
return unless #nodes > 1;
my $span = HTML::Element->new('span', class => 'fun');
$span->push_content('Cool');
for (my $i = 1; $i < #nodes; $i += 2) {
splice #nodes, $i, 0, $span->clone;
}
$span->delete;
#nodes;
}
output
<div>
<p>Boring Text:</p>
<p>
Highlight <span class="fun">Cool</span> whenever we see it.
but not here.
<code> sub <span class="fun">Cool</span> { print "Foo "; } </code>
And here is more <span class="fun">Cool</span>.
</p>
</div>

Here is a start using XML::Twig. One issue is the literal newline inside <code> tag. I guess that the parser cannot see the difference between it and a normal one. Perhaps it would help to encode it as &#10 or use CDATA sections. Otherwise I don't know how to handle it:
Content of script.pl:
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my $body = qq|
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not here.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
|;
XML::Twig::Elt::set_replaced_ents(q{});
my $elt = XML::Twig::Elt->new( 'span' => { class => 'fun' }, 'Cool' );
my $twig = XML::Twig->new( pretty_print => 'nice' )->parse( $body );
$twig->subs_text( 'Cool', $elt->sprint );
$twig->print;
Running it like:
perl script.pl
It yields:
<div>
<p>Boring Text:</p>
<p>
Highlight <span class="fun">Cool</span>
whenever we see it.
but not here.
<code>
sub <span class="fun">Cool</span>
{
print "Foo
";
}
</code>
And here is more <span class="fun">Cool</span>
.
</p>
</div>

Related

How to parse html with HTML::TreeBuilder?

This is the code I'd like to parse
[...]
<div class="item" style="clear:left;">
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
</div>
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard / Sorcerer 0
<br />Innate Level: 0
<br />School: Conjuration
<br />Descriptor(s): Acid
<br />Component(s): Verbal, Somatic
<br />Range: Medium
<br />Area of Effect / Target: Single
<br />Duration: Instant
<br />Save: None
<br />Spell Resistance: Yes
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
[...]
This is my algorithm:
my $text = '';
scan_child($spells);
print $text, "\n";
sub scan_child {
my $element = $_[0];
return if ($element->tag eq 'script' or
$element->tag eq 'a'); # prune!
foreach my $child ($element->content_list) {
if (ref $child) { # it's an element
scan_child($child); # recurse!
} else { # it's a text node!
$child =~ s/(.*)\:/\\item \[$1\]/; #itemize
$text .= $child;
$text .= "\n";
}
}
return;
}
It gets the pattern <key> : <value> and prunes garbage like <script> or <a>...</a>.
I'd like to improve it in order to get <h2>...</h2> header and all the <p>...<p> block so I can add some LaTeX tags.
Any clue?
Thanks in advance.
Because this may be an XY Problem...
Mojo::DOM is a somewhat more modern framework for parsing HTML using css selectors. The following pulls the P element that you want from the document:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $h2 ($dom->find('h2')->each) {
next unless $h2->all_text eq 'Acid Splash';
# Get following P
my $next_p = $h2;
while ($next_p = $next_p->next_sibling()) {
last if $next_p->node eq 'tag' and $next_p->type eq 'p';
}
print $next_p;
}
__DATA__
<html>
<body>
<div class="item" style="clear:left;">
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
</div>
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard / Sorcerer 0
<br />Innate Level: 0
<br />School: Conjuration
<br />Descriptor(s): Acid
<br />Component(s): Verbal, Somatic
<br />Range: Medium
<br />Area of Effect / Target: Single
<br />Duration: Instant
<br />Save: None
<br />Spell Resistance: Yes
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
</body>
</html>
Outputs:
<p>Caster Level(s): Wizard / Sorcerer 0
<br>Innate Level: 0
<br>School: Conjuration
<br>Descriptor(s): Acid
<br>Component(s): Verbal, Somatic
<br>Range: Medium
<br>Area of Effect / Target: Single
<br>Duration: Instant
<br>Save: None
<br>Spell Resistance: Yes
</p>
I use the look_down() method scan HTML.
Using look_down() I can return first get a list of all the divs of class="item".
Then I can iterate of them, and find and process the h2 and the p, which I would then split using // as my splitter.

how can I split div tag only without any moveing?

I am new XML-Twig... I want split para tag....
XML File:
<xml>
<p class="indent">text <i>text<i> incluce <div>text</div> ateas</p>
<p class="text">text text incluce <div>text</div> <b>ateas<b></p>
<p class="text">text <p>text</p> incluce <div>text</div> ateas</p>
</xml>
Here I want split Para tags. How can I split and How can assign para tag without inline para tag and div tag...
I need output as:
<xml>
<p class="indent">text <i>text</i> incluce</p>
<div>text</div>
<p class="indent">ateas</p>
<p class="text">text text incluce</p>
<div>text</div>
<p class="text"><b>ateas</b></p>
<p class="text">text</p>
<p>text</p>
<p class="text">incluce</p>
<div>text</div>
<p class="text">ateas</p>
</xml>
How can I split this....
Script:
#!/usr/bin/perl
use warnings;
use strict;
use XML::Twig;
open(my $output , '>', "output.xml") || die "can't open the Output $!\n";
my $xml = XML::Twig->new( twig_handlers => { p => \&split_tag } );
$xml->parsefile("sample.xml");
$xml->print($output);
sub split_tag {
my ($twig, $p) = #_;
$_->wrap_in('p', $p->atts) for $p->children('#TEXT');
$p->erase;
}
But I can't get extract output.. How can I do this?
This code seems to match your new requirements. If this doesn't work please try to fix it yourself before asking for more free code.
I have ignored the third line of your sample data as nested <p> elements are illegal in HTML.
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => { p => \&split },
pretty_print => 'indented',
);
$twig ->parsefile('sample.xml');
$twig->print_to_file('output.xml');
sub split{
my ($twig, $p) = #_;
return if $p->contains_only_text;
my #children = $p->cut_children;
my #newchildren;
my $newpara = $p->copy;
for my $child (#children) {
if ($child->is_elt and $child->tag eq 'div') {
push #newchildren, $newpara if $newpara->has_children;
push #newchildren, $child;
$newpara = $p->copy;
}
else {
$child->paste(last_child => $newpara);
}
}
push #newchildren, $newpara if $newpara->has_children;
$p->replace_with(#newchildren);
}
output
<xml>
<p class="indent">text <i>text</i> incluce </p>
<div>text</div>
<p class="indent"> ateas</p>
<p class="text">text text incluce </p>
<div>text</div>
<p class="text"> <b>ateas</b></p>
<p class="text">text <p>text</p> incluce </p>
<div>text</div>
<p class="text"> ateas</p>
</xml>

Perl Form Validation using CGI scripting

I'm trying to achieve one last task for my assignment is to validate the form before submit it to the another CGI program.
What happen is that I have a simple CGI program that will ask user to input the data
#!/usr/bin/perl -w
use CGI qw/:standard/;
# Standard HTTP header
print header();
# Write information to data file and produce a form
&printForm();
# Finish HTML page
print end_html();
# This sub will create a form to access the print_fortune.cgi script
sub printForm
{
print qq~
<html>
<head><title>My Search Engine</title>
</head>
<body>
<form action="b1.cgi" method="GET">
What is your e-msil address? <input type="text" name="passing" size=40>
<input type="submit" value="send address">
<input type="hidden" name="form" value="insert" />
</form>
<form method="get" action="b1.cgi" enctype="application/x-www-form-urlencoded">
<input type="text" name="search" value="" size="30" /><br />
<label><input type="radio" name="option" value="name" checked="checked" />name</label>
<label><input type="radio" name="option" value="author" />author</label><label>
<input type="radio" name="option" value="url" />url</label>
<label><input type="radio" name="option" value="keyword" />keyword</label>
<input type="submit" name=".submit" value="Search" />
<input type="hidden" name="passing" value="http://default.com" />
<div><input type="hidden" name="form" value="search" /></div></form>
</body>
So the above program contains two forms. One is to add new data to the database and the other one is to search from the database.
#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::Simple;
use CGI;
use HTML::HeadParser;
use DBI;
my $serverName = "";
my $serverPort = "";
my $serverUser = "";
my $serverPass = "";
my $serverDb = "";
my $serverTabl = "";
$cgi = CGI->new;
my $pass = $cgi->param('passing');
$URL = get ("$pass");
$head = HTML::HeadParser->new;
$head->parse("$URL");
my $methods = $cgi->param('form');
if ($methods eq "insert"){
insert_entry();
}
show_entries();
sub insert_entry {
my ($dbh, $success, $name, $author, $url,$temp);
$dbh = DBI->connect("DBI:mysql:database=$serverDb;host=$serverName;port=$serverPort",$serverUser,$serverPass);
$name = $head->header('X-Meta-Name');
$author = $head->header('X-Meta-Author');
$url = $cgi->param('passing');
$temp = $head->header('X-Meta-Keywords');
#keyword = split(/,/,$temp);
$success = $dbh->do("INSERT INTO $serverTabl(name,author,url,keyword1,keyword2,keyword3,keyword4,keyword5) VALUES(?,?,?,?,?,?,?,?)", undef,$name,$
author,$url,$keyword[0],$keyword[1],$keyword[2],$keyword[3],$keyword[4]);
$dbh->disconnect;
if($success != 1) {
return "Sorry, the database was unable to add your entry.
Please try again later.";
} else {
return;
}
}
sub show_entries {
my ($dbh, $sth, #row);
my $search = $cgi->param('search');
my $option = $cgi->param('option');
$dbh = DBI->connect("DBI:mysql:database=$serverDb;host=$serverName;port=$serverPort",$serverUser,$serverPass);
$sth = $dbh->prepare("SELECT *
FROM $serverTabl
WHERE $option LIKE '%$search%'");
$sth->execute;
print "Existing Entries",HR;
while(#row = $sth->fetchrow_array) {
$row[5] = scalar(localtime($row[5]));
print "<table border='2'><tr>";
print "<td>" . $row[0] . "</td>";
print "<td>Name" . $row[1] . "</td>";
print "<td>Author" . $row[2] . "</td>";
print "<td>URL" . $row[3] . "</td>";
print "<td>Keyword1" . $row[4] . "</td>";
print "<td>Keyword2" . $row[5] . "</td>";
print "<td>Keyword3" . $row[6] . "</td>";
print "<td>Keyword4" . $row[7] . "</td>";
print "<td>Keyword5" . $row[8] . "</td>";
print "</tr></table>";
}
$sth->finish;
$dbh->disconnect;
}
So now the question is how can I do a regular expression for the form submission before it goes to the second program?
I want to do validation for
name allows spaces but only alphabetical characters
author allows spaces but only alphabetical characters
keywords allows no spaces and only alphabetical characters
url only allows alphanumerical characters and the following :/.~?=+& No two periods can exist consecutively.
I'm really sorry but I'm really new to Perl. We are only been taught about PHP, but Perl almost nothing....
The perluniprops Perl document lists all the \p regular expression properties.
For a string that contains only letters, you want
/^[\p{Alpha}]+$/
For a string that contains only letters and spaces you want
/^[\p{Alpha}\x20]+$/
To match a URL the documentation of the URI module gives this as an official pattern to match a URL
m|^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$|
Be sure to cite the references in your work to get extra marks!

Using Web::Scraper

Im trying to parse some html tags using perl module Web::Scraper but seems Im an inept using perl. I wonder if anyone can look for mistakes in my code...:
This is my HTML to parse (2 urls inside li tags):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
Im trying to obtain:
ID:1 Link:URL_TO_EXTRACT_1
ID:2 Link:URL_TO_EXTRACT_2
With this perl code:
my $scraper = scraper {
process ".zone-extract > a[href]", urls => '#href', id => 'TEXT';
result 'urls';
};
my $links = $scraper->scrape($response);
This is one of the infinite process combinations I tried, with two different results: An empty return, or all the urls inside code (and I only need links inside zone-extract).
Resolved with mob's contribution... #zone-extract instead .zone-extract :)
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
my $html = q[
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
]; # / (turn off wrong syntax highlighting)
my $parser = scraper {
process '//div[#id="zone-extract"]//a', 'urls[]' => sub {
my $url = $_[0]->attr('href') ;
return $url;
};
};
my $ref = $parser->scrape(\$html);
print "$_\n" for #{ $ref->{urls} };

How do I get the entire inner content of an XML node including element tags?

Using XML::Twig, is there a way to get the entire HTML of a node? I do not want the text of the node, but the entire HTML with tags and all.
input XML
<content> <p>blah blah <b> bla bla </b> </p>
<p> line 2 <i> test </i? </p>
</content>
Code
my $twig = new XML::Twig(
TwigRoots => {'content' => 1},
TwigHandlers => $twig_handlers
);
my $twig_handlers = {'count/p' => \&count_ps};
sub count_ps {
my ($twig, $test) = #_;
$Data .= $test->text();
}
$data should show me the entire HTML.
Use the xml_string method.
my $data;
XML::Twig->new(
twig_handlers => {
content => sub {
my ($twig, $elt) = #_;
$data = $elt->xml_string;
$twig->purge;
}
}
)->parse(' <content> <p>blah blah <b> bla bla </b> </p>
<p> line 2 <i> test </i> </p>
</content>
')->purge;