How to parse html with HTML::TreeBuilder?

How to parse html with HTML::TreeBuilder? - perl

This is the code I'd like to parse
[...]
<div class="item" style="clear:left;">
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
</div>
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard / Sorcerer 0
<br />Innate Level: 0
<br />School: Conjuration
<br />Descriptor(s): Acid
<br />Component(s): Verbal, Somatic
<br />Range: Medium
<br />Area of Effect / Target: Single
<br />Duration: Instant
<br />Save: None
<br />Spell Resistance: Yes
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
[...]
This is my algorithm:
my $text = '';
scan_child($spells);
print $text, "\n";
sub scan_child {
my $element = $_[0];
return if ($element->tag eq 'script' or
$element->tag eq 'a'); # prune!
foreach my $child ($element->content_list) {
if (ref $child) { # it's an element
scan_child($child); # recurse!
} else { # it's a text node!
$child =~ s/(.*)\:/\\item \[$1\]/; #itemize
$text .= $child;
$text .= "\n";
}
}
return;
}
It gets the pattern <key> : <value> and prunes garbage like <script> or <a>...</a>.
I'd like to improve it in order to get <h2>...</h2> header and all the <p>...<p> block so I can add some LaTeX tags.
Any clue?
Thanks in advance.

Because this may be an XY Problem...
Mojo::DOM is a somewhat more modern framework for parsing HTML using css selectors. The following pulls the P element that you want from the document:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $h2 ($dom->find('h2')->each) {
next unless $h2->all_text eq 'Acid Splash';
# Get following P
my $next_p = $h2;
while ($next_p = $next_p->next_sibling()) {
last if $next_p->node eq 'tag' and $next_p->type eq 'p';
}
print $next_p;
}
__DATA__
<html>
<body>
<div class="item" style="clear:left;">
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
</div>
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard / Sorcerer 0
<br />Innate Level: 0
<br />School: Conjuration
<br />Descriptor(s): Acid
<br />Component(s): Verbal, Somatic
<br />Range: Medium
<br />Area of Effect / Target: Single
<br />Duration: Instant
<br />Save: None
<br />Spell Resistance: Yes
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
</body>
</html>
Outputs:
<p>Caster Level(s): Wizard / Sorcerer 0
<br>Innate Level: 0
<br>School: Conjuration
<br>Descriptor(s): Acid
<br>Component(s): Verbal, Somatic
<br>Range: Medium
<br>Area of Effect / Target: Single
<br>Duration: Instant
<br>Save: None
<br>Spell Resistance: Yes
</p>

I use the look_down() method scan HTML.
Using look_down() I can return first get a list of all the divs of class="item".
Then I can iterate of them, and find and process the h2 and the p, which I would then split using // as my splitter.

Related

Why am I getting this error when testing a Perl/CGI scipt on Padre?

Spent better part of 3-days writing a CGI script to handle the input form data from my HTML. Used Padre as an editor but now receive this error when running the script.
"uncaught exception from user code: "-T" is on the #! line, it must also be used on the command line at scipt.pl"
I'd like some pointers if anyone is willing to look over my code and offer guidance. This is my first endeavor into Perl and CGI. What I desire for endstate is a form a webuser enters data and then hits submit. After validation a page is sent back to the browser with information and errors if they exist. here is my code and thanks in advance!
HTML Code:
<html>
<head><title>My First CGI-PERL</title>
<!--Link to css for styling here-->
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<header>
<h1>Welcome to my first CGI-PERL Form submission</h1></header>
<br>
<p>Please Enter the following information in the fields and click
SUBMIT.</p>
<hr>
<br>
<div class="container1">
<form action="/cgi-bin/text.pl" method="post"> <!--script to process
this section-->
Item Number<input type="text" name="Item"><br> <!--Cannot be blank-->
Product Name<input type="text" name="Name"><br> <!--Cannot be
blank-->
Product Cost<input type="text" name="Cost"><br> <!--must be between .50 and $1000-->
Selling Price<input type="text" name="Price"><br> <!--price from 1.00 to 2000.00-->
Quantity on Hand<input type="text" name="Quantity"><br> <!--cannot be negative-->
</form></div>
<br>
<br>
<hr>
<br>
<h2>Choose A Product Category</h2> <!--must be one of the categories-->
<br>
<div class="container2">
<form action="/cgi-bin/radio.pl" method="post"> <!--name of script that handles this section-->
<input type="radio" name="letter" value="F">F<br>
<input type="radio" name="letter" value="H">H<br>
<input type="radio" name="letter" value="M">M<br>
<input type="radio" name="letter" value="C">C<br>
<input type="radio" name="letter" value="T">T<br>
<br>
</form></div>
<hr>
<br>
<div class="container4">
<form action="/cgi-bin/myfirstcgi.cgi" method="post"> <!--script to process submit and send a second page back-->
<input type="submit" name="submit" value="SUBMIT"><br>
</form></div> <!--close container 2-->
<!--Profit should be auto generated on the second page created by the script-->
</body>
</html>
Perl Script:
#!/usr/bin/perl
print "Content-type: text/html\n\n";
use strict;
use warnings;
use CGI qw(:standard);
print "<html><body>";
print "Thank you for submitting the form. <br>\n";
#return to html page with form
print "To go back to the form page click"; #how can i insert a link to the form page here
print "You chose item number ", param("Item")," \n";
print "The product name is ", param("Name"), " \n";
print "The Cost is ", param("Cost")," \n";
print "Selling Price ", param("Price")," \n";
print "Quantity on Hand is ", param("Quantity")," \n";
#scalar variables to hold form input data
my $item = param("Item");
my $name = param("Name");
my $cost = param("Cost");
my $price = param("Price");
my $category = param("Category"); #radio button chosen
my $quantity = param("Quantity");
my $profit = $cost - $price;
#radio buttons
my %categories = ("F", "H", "M", "C", "T");
#validation a category was chosen
my $categories = param("letter");
if (exists $categories{$category}) {
print "The product Category is $category";
}
else {
error ("You must select a category");
}
#validate input
if ($item eq "") {
error ("Field cannot be blank");
}
if ($name eq "") {
error ("Field cannot be blank");
}
if ($cost < .50 && $cost > 1000) {
error ("Invalid Entry Cost must be between $.50 and $1000");
}
if ($price < 1.00 && $cost > 2000) {
error ("Invalid Amount Please enter Price between $1.00 and $2000");
}
if ($quantity < 0) {
error ("Quantity cannot be negative number");
}
sub error {
my ($errormsg) = #_;
print "<h2>Error</h2>\n";
print "$errormsg<p>\n";
print "</body></html>\n";
exit;
}

Your homework should be done by you. But Since there is lot of syntax mistakes there I will try to help you.
First of all Bind all the elements inside the html into a single form with single submit button so that it can go to server in the form of query string in a single go.
Your html should be:
<html>
<head><title>My First CGI-PERL</title>
<!--Link to css for styling here-->
<!--<link rel="stylesheet" type="text/css" href="style.css">-->
</head>
<body>
<header>
<h1>Welcome to my first CGI-PERL Form submission</h1></header>
<br>
<p>Please Enter the following information in the fields and click
SUBMIT.</p>
<hr>
<br>
<div class="container1">
<form action="action.cgi" method="post"> <!--script to process
this section-->
Item Number<input type="text" name="Item"><br> <!--Cannot be blank-->
Product Name<input type="text" name="Name"><br> <!--Cannot be
blank-->
Product Cost<input type="text" name="Cost"><br> <!--must be between .50 and $1000-->
Selling Price<input type="text" name="Price"><br> <!--price from 1.00 to 2000.00-->
Quantity on Hand<input type="text" name="Quantity"><br> <!--cannot be negative-->
<br>
<br>
<hr>
<br>
<h2>Choose A Product Category</h2> <!--must be one of the categories-->
<br>
<div class="container2">
<input type="radio" name="letter" value="F">F<br>
<input type="radio" name="letter" value="H">H<br>
<input type="radio" name="letter" value="M">M<br>
<input type="radio" name="letter" value="C">C<br>
<input type="radio" name="letter" value="T">T<br>
<br>
<hr>
<br>
<div class="container4">
<input type="submit" name="submit" value="SUBMIT"><br>
</form></div> <!--close container 2-->
<!--Profit should be auto generated on the second page created by the script-->
</body>
</html>
Your cgi script I named as action.cgi and It should be like this for the above html from your approach. Whatever the error you had I tried to show it using commented lines.
#!/usr/bin/perl -T
print "Content-type: text/html\n\n";
use strict;
use warnings;
use CGI qw(:standard);
print "<html><body>";
print '<h1>"Thank you for submitting the form. <br>"\n';
#return to html page with form
print '<h2>To go back to the form page click Here.</h2>';#missing quotes
print "<hr><br><br>";
print "You chose item number ",param("Item")," <br>";
print "The product name is ", param("Name"), " <br>";print "The Cost is
", param("Cost")," \n";
print "Selling Price ", param("Price")," <br />";
print "Quantity on Hand is ", param("Quantity")," <br> ";
#scalar variables to hold form input data
my $item = param("Item");
my $name = param("Name");
my $cost = param("Cost");
my $price = param("Price");
my $category = param("Category"); #radio button chosen
my $quantity = param("Quantity");
my $profit = $cost - $price;
#radio buttons
my #categories = ("F", "H", "M", "C", "T");#better use array
#vali`dation a category was chosen
$category = param("letter");
if(grep { /$category/ } #categories) { # check if category exist in your predefined array
print "The product Category is $category \n";
}
else {
error ("You must select a category");
}
#validate input
if ($item eq "") {
error ("Field cannot be blank");
}
if ($name eq "" ){
error ("Field cannot be blank");
}
if ($cost < .50 && $cost > 1000) {
error ("Invalid Entry Cost must be between $.50 and $1000");
}
if ($price < 1.00 && $cost > 2000) {
error ("Invalid Amount Please enter Price between $1.00 and $2000");
}
if ($quantity < 0) {
error ("Quantity cannot be negative number");
}
#Error subroutine
sub error {
my $errormsg = shift;#you shouldn't assign array to variable
print "<h2>Error</h2>\n";
print "$errormsg<p>\n";
print "</body></html>\n";
}
If you really want to learn perl after basic understanding try to learn
perldsc,perlvar,perlref,perloo,perlobj

Padre runs Perl programs using a command like:
/usr/bin/perl <your_file.pl>
Taint checking requires deep changes to how the compiler works, so it needs to be turned on immediately after the compiler starts up. You have -T on the shebang line inside your program and that isn't parsed until after the compiler starts - too late for taint mode to be enabled. It would be confusing for Perl to start running your code not in taint mode when you think that taint mode has been turned on, so it halts execution with the error message that you have seen.
You can fix this by configuring Padre to run your code with as slightly different command:
/usr/bin/perl -T <your_file.pl>
In Padre choose Tools -> Preferences from the menu and then select the "Language - Perl 5" option. There is a text input labelled "Interpreter arguments". You can put your -T there and save the changes.
Also, I'll just reiterate my previous advice that using CGI in 2015 is ridiculous. Please take a look at CGI::Alternatives and switch to a more modern architecture.

Replace all the spaces in content of any tag with ` `

Task
Replace all the spaces in content of any tag with .
y.html (sample file)
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
What I tried
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
open (my $fh, "<", "y.html") or die $!;
my $dom = Mojo::DOM->new(do{local $/ = undef; <$fh>});
$dom->find("*")->each( sub { $_->content( $_->content =~ s/\s/\ /gr ) } );
print $dom;
Result from above script
<p class="MsoNormal" style="margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt"><span lang="nl-be" style="font-size:10.0pt; font-family:symbol;color:black;mso-ansi-language:nl-be">·<span class="grame"><span style="font-s
ize:7.0pt;color:black"> <span style="font-size:10.0pt;font-family:arial;color:black">Kit<span style="font-size:10.0pt;font-family:arial;color:black"> <span class="spelle"><i><span&nb
sp;style="font-size:10.0pt;font-family:arial">Strongyloides<i><span style="font-size:10.0pt;font-family:arial"> <span class="spelle">ratti<span style="font-size:10.0pt;font-family:arial"> (n
r. 9450) van <span class="spelle">Bordier Affinity Products. <span lang="nl-be" style="font-size:10.0pt;font-family: arial;mso-ansi-language:nl-be">Zie bijsluiter in bijlage: CLKB_B_030
6. Te bewaren bij 2 – 8 °C tot vervaldatum.<span lang="nl-be" style="mso-ansi-language: nl-be"><o:p></o:p></span lang="nl-be" style="mso-ansi-language: nl-be"></span lang
="nl-be" style="font-size:10.0pt;font-family: arial;mso-ansi-language:nl-be"></span class="spelle"></span style="font-size:10.0pt;font-family:arial"></span class="spelle"></span&nb
sp;style="font-size:10.0pt;font-family:arial"></i></span style="font-size:10.0pt;font-family:arial"></i></span class="spelle"></span style="font-size:10.0pt;font-family:arial;color:black"></
span style="font-size:10.0pt;font-family:arial;color:black"></span style="font-size:7.0pt;color:black"></span class="grame"></span lang="nl-be" style="font-size:10.0pt; font-f
amily:symbol;color:black;mso-ansi-language:nl-be"></p>
I'm not getting the desired output, it's adding in tag also (eg: </span ), I want that to be done only on the content.
PS: I tried it with Mojo::DOM, but it's not necessary to use it, you can try any other parser if you want, still I would like to know what's wrong with my code?

This is a job where tokenizing the input makes it easier to work with. I therefore advise using HTML::TokeParser
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::TokeParser;
my $data = do {local $/; <DATA>};
my $p = HTML::TokeParser->new(\$data);
while (my $token = $p->get_token) {
if ($token->[0] eq 'T') {
my $text = $token->[1];
$text =~ s/ / /g;
print $text;
} else {
print "$token->[-1]";
}
}
__DATA__
<html>
<body>
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
</body>
</html>
Outputs:
<html>
<body>
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
</body>
</html>

Trouble Replacing Text in HTML fragment using Mojo::DOM

I need to scan through html fragments looking for certain strings in text (not within element attributes) and wrapping those matching strings with a <span></span>. Here's a sample attempt with output:
use v5.10;
use Mojo::DOM;
my $body = qq|
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not here.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
|;
my $dom = Mojo::DOM->new($body);
foreach my $e ($dom->find('*')->each) {
my $text = $e->text;
say "e text is: $text ";
if ($text =~ /Cool/) {
(my $newtext = $text ) =~ s/Cool/<span class="fun">Cool<\/span>/g;
$e->replace_content($newtext);
}
}
say $dom->root;
the output:
e text is:
e text is: Boring Text:
e text is: Highlight Cool whenever we see it. but not. And here is more Cool.
e text is: here
e text is: sub Cool { print "Foo "; }
<div>
<p>Boring Text:</p>
<p>Highlight <span class="fun">Cool</span> whenever we see it. but not. And here is more <span class="fun">Cool</span>.</p>
</div>
Close but what I really want to see is something like the following:
<div>
<p>Boring Text:</p>
<p>Highlight <span class="fun">Cool</span> whenever we see it. but not here.
<code>
sub <span class="fun">Cool<span> {
print "Foo\n";
}
</code>
And here is more <span class="fun">Cool</span>.</p>
</div>
Any help / pointers would be greatly appreciated.
Thanks,
Todd

Having looked into XML::Twig I'm not so sure it's the correct tool. It's surprising how awkward such a simple task can be.
This is a working program that uses HTML::TreeBuilder. Unfortunately it doesn't produce formatted output so I've added some whitespace myself.
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<__HTML__);
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not here.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
__HTML__
$html->objectify_text;
for my $text_node ($html->look_down(_tag => '~text')) {
my $text = $text_node->attr('text');
if (my #replacement = process_text($text)) {
my $old_node = $text_node->replace_with(#replacement);
$old_node->delete;
}
}
$html->deobjectify_text;
print $html->guts->as_XML;
sub process_text {
my #nodes = split /\bCool\b/, shift;
return unless #nodes > 1;
my $span = HTML::Element->new('span', class => 'fun');
$span->push_content('Cool');
for (my $i = 1; $i < #nodes; $i += 2) {
splice #nodes, $i, 0, $span->clone;
}
$span->delete;
#nodes;
}
output
<div>
<p>Boring Text:</p>
<p>
Highlight <span class="fun">Cool</span> whenever we see it.
but not here.
<code> sub <span class="fun">Cool</span> { print "Foo "; } </code>
And here is more <span class="fun">Cool</span>.
</p>
</div>

Here is a start using XML::Twig. One issue is the literal newline inside <code> tag. I guess that the parser cannot see the difference between it and a normal one. Perhaps it would help to encode it as &#10 or use CDATA sections. Otherwise I don't know how to handle it:
Content of script.pl:
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my $body = qq|
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not here.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
|;
XML::Twig::Elt::set_replaced_ents(q{});
my $elt = XML::Twig::Elt->new( 'span' => { class => 'fun' }, 'Cool' );
my $twig = XML::Twig->new( pretty_print => 'nice' )->parse( $body );
$twig->subs_text( 'Cool', $elt->sprint );
$twig->print;
Running it like:
perl script.pl
It yields:
<div>
<p>Boring Text:</p>
<p>
Highlight <span class="fun">Cool</span>
whenever we see it.
but not here.
<code>
sub <span class="fun">Cool</span>
{
print "Foo
";
}
</code>
And here is more <span class="fun">Cool</span>
.
</p>
</div>

Web::Scrape with Xpath returns too many lines

Using Web::Scrape on some nasty nested tables, with no CSS style. Having to learn XPATH, and getting tripped up.
Updated: Fixed some XPATH issues, now just have one remaining question regarding attributes
#!perl
use warnings;
use Web::Scraper;
use Data::Dumper;
my $html = do { local $/; <DATA> };
my $scraper = scraper {
# Wrong! The 'tbody' element does not exist.
# process ".//[#id='cfg-surface-detail']/center/table/tbody/tr/td[2]/select",
# I used Chrome to get the XPath, and it inserts tbody elements when rendering bad HTML
# also, I changed the start of the XPATH from './/' to '//*'
# which I think means "relative to anywhere" or something.
process "//*[#id='cfg-surface-detail']/center/table/tr/td[2]/select",
'sensorType[]' => 'TEXT';
};
my $res = $scraper->scrape($html);
print Dumper($res);
__DATA__
<html><head><title>...</title></head>
<body>
<form action="/foo" method=post id=cfg-surface-detail name=cfg-surface-detail>
<center>
<table bgcolor="#FFFFFF">
<tr><td>Sensor Type</td><td>
<select name="cfg-sensor-type" >
<option value="1 Fred's Sensor" selected>Fred's Sensor
<option value="2 Other">Other Sensor
</select>
</td></tr>
</table>
</center>
</form>
</body>
</html>
This now outputs:
$VAR1 = {
'sensorType' => [
'Fred\'s Sensor Other Sensor '
]
};
So I'm getting close. How now do I specify the <option> that has the selected attribute?
Update: Solved. Xpath is //*[#id="cfg-surface-detail"]/center/table/tr/td[2]/select/option[#selected]
This helped: http://www.w3schools.com/xpath/xpath_syntax.asp

#!perl
use warnings;
use Web::Scraper;
use Data::Dumper;
my $html = do { local $/; <DATA> };
my $scraper = scraper {
process '#cfg-surface-detail//select',
'sensorType[]' => 'TEXT';
};
my $res = $scraper->scrape($html);
print Dumper($res);
__DATA__
<html><head><title>...</title></head>
<body>
<form action="/foo" method=post id=cfg-surface-detail name=cfg-surface-detail>
<center>
<table bgcolor="#FFFFFF">
<tr><td>Sensor Type</td><td>
<select name="cfg-sensor-type" >
<option value="1 Fred's Sensor" selected>Fred's Sensor
<option value="2 Other">Other Sensor
</select>
</td></tr>
</table>
</center>
</form>
</body>
</html>

If it were me, I'd go with css. Css solution for selected option is:
'select[name="cfg-sensor-type"] option[selected]'

The answer was a bit from both of the previous answers:
$scraper = scraper {
process '//select[#name="cfg-sensor-type"]/option[#selected]', 'SensorType' => 'TEXT';
};

Perl Form Validation using CGI scripting

I'm trying to achieve one last task for my assignment is to validate the form before submit it to the another CGI program.
What happen is that I have a simple CGI program that will ask user to input the data
#!/usr/bin/perl -w
use CGI qw/:standard/;
# Standard HTTP header
print header();
# Write information to data file and produce a form
&printForm();
# Finish HTML page
print end_html();
# This sub will create a form to access the print_fortune.cgi script
sub printForm
{
print qq~
<html>
<head><title>My Search Engine</title>
</head>
<body>
<form action="b1.cgi" method="GET">
What is your e-msil address? <input type="text" name="passing" size=40>
<input type="submit" value="send address">
<input type="hidden" name="form" value="insert" />
</form>
<form method="get" action="b1.cgi" enctype="application/x-www-form-urlencoded">
<input type="text" name="search" value="" size="30" /><br />
<label><input type="radio" name="option" value="name" checked="checked" />name</label>
<label><input type="radio" name="option" value="author" />author</label><label>
<input type="radio" name="option" value="url" />url</label>
<label><input type="radio" name="option" value="keyword" />keyword</label>
<input type="submit" name=".submit" value="Search" />
<input type="hidden" name="passing" value="http://default.com" />
<div><input type="hidden" name="form" value="search" /></div></form>
</body>
So the above program contains two forms. One is to add new data to the database and the other one is to search from the database.
#!/usr/bin/perl
print "Content-type: text/html\n\n";
use LWP::Simple;
use CGI;
use HTML::HeadParser;
use DBI;
my $serverName = "";
my $serverPort = "";
my $serverUser = "";
my $serverPass = "";
my $serverDb = "";
my $serverTabl = "";
$cgi = CGI->new;
my $pass = $cgi->param('passing');
$URL = get ("$pass");
$head = HTML::HeadParser->new;
$head->parse("$URL");
my $methods = $cgi->param('form');
if ($methods eq "insert"){
insert_entry();
}
show_entries();
sub insert_entry {
my ($dbh, $success, $name, $author, $url,$temp);
$dbh = DBI->connect("DBI:mysql:database=$serverDb;host=$serverName;port=$serverPort",$serverUser,$serverPass);
$name = $head->header('X-Meta-Name');
$author = $head->header('X-Meta-Author');
$url = $cgi->param('passing');
$temp = $head->header('X-Meta-Keywords');
#keyword = split(/,/,$temp);
$success = $dbh->do("INSERT INTO $serverTabl(name,author,url,keyword1,keyword2,keyword3,keyword4,keyword5) VALUES(?,?,?,?,?,?,?,?)", undef,$name,$
author,$url,$keyword[0],$keyword[1],$keyword[2],$keyword[3],$keyword[4]);
$dbh->disconnect;
if($success != 1) {
return "Sorry, the database was unable to add your entry.
Please try again later.";
} else {
return;
}
}
sub show_entries {
my ($dbh, $sth, #row);
my $search = $cgi->param('search');
my $option = $cgi->param('option');
$dbh = DBI->connect("DBI:mysql:database=$serverDb;host=$serverName;port=$serverPort",$serverUser,$serverPass);
$sth = $dbh->prepare("SELECT *
FROM $serverTabl
WHERE $option LIKE '%$search%'");
$sth->execute;
print "Existing Entries",HR;
while(#row = $sth->fetchrow_array) {
$row[5] = scalar(localtime($row[5]));
print "<table border='2'><tr>";
print "<td>" . $row[0] . "</td>";
print "<td>Name" . $row[1] . "</td>";
print "<td>Author" . $row[2] . "</td>";
print "<td>URL" . $row[3] . "</td>";
print "<td>Keyword1" . $row[4] . "</td>";
print "<td>Keyword2" . $row[5] . "</td>";
print "<td>Keyword3" . $row[6] . "</td>";
print "<td>Keyword4" . $row[7] . "</td>";
print "<td>Keyword5" . $row[8] . "</td>";
print "</tr></table>";
}
$sth->finish;
$dbh->disconnect;
}
So now the question is how can I do a regular expression for the form submission before it goes to the second program?
I want to do validation for
name allows spaces but only alphabetical characters
author allows spaces but only alphabetical characters
keywords allows no spaces and only alphabetical characters
url only allows alphanumerical characters and the following :/.~?=+& No two periods can exist consecutively.
I'm really sorry but I'm really new to Perl. We are only been taught about PHP, but Perl almost nothing....

The perluniprops Perl document lists all the \p regular expression properties.
For a string that contains only letters, you want
/^[\p{Alpha}]+$/
For a string that contains only letters and spaces you want
/^[\p{Alpha}\x20]+$/
To match a URL the documentation of the URI module gives this as an official pattern to match a URL
m|^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$|
Be sure to cite the references in your work to get extra marks!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to parse html with HTML::TreeBuilder? - perl

I use the look_down() method scan HTML. Using look_down() I can return first get a list of all the divs of class="item". Then I can iterate of them, and find and process the h2 and the p, which I would then split using // as my splitter.

Related

Why am I getting this error when testing a Perl/CGI scipt on Padre?

Replace all the spaces in content of any tag with ` `

Trouble Replacing Text in HTML fragment using Mojo::DOM

Web::Scrape with Xpath returns too many lines

Perl Form Validation using CGI scripting

Categories

Resources