How to parse a tricky HTML file, using HTML::TreeBuilder

How to parse a tricky HTML file, using HTML::TreeBuilder - perl

Suppose, we have the following HTML file:
test.htm
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<b>weight:</b> 120kg<br>
<b>length:</b> 10cm<br>
</body>
</html>
How can I get the following data from it?
{
'weight' => '120kg',
'length' => '10cm',
}
parser.pl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file('test.htm');
#what to do here?
$root->delete( );

This gets you very close to what you want (you'll need to tweak the text strings you're getting for the keys and values slightly).
But I think you'll find it far simpler using a tool like Web:Scraper.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);
my $data;
foreach my $elem ($root->find('b')) {
$data->{($elem->content_list)[0]} = $elem->right;
}
say Dumper $data;
__END__
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<b>weight:</b> 120kg<br>
<b>length:</b> 10cm<br>
</body>
</html>
Output:
$VAR1 = {
'length:' => ' 10cm',
'weight:' => ' 120kg'
};

Two solutions using Mojo::DOM:
use strict;
use warnings;
use Mojo::DOM;
use Data::Dump;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
my %hash = do {
my $text = $dom->find('body')->all_text();
split ' ', $text;
};
dd \%hash;
my %hash2 = map {
$_->all_text() => $_->next_sibling() =~ s{^\s+|\s+$}{}gr
} $dom->find('b')->each;
dd \%hash2;
__DATA__
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<b>weight:</b> 120kg<br>
<b>length:</b> 10cm<br>
</body>
</html>
Outputs:
{ "length:" => "10cm", "weight:" => "120kg" }
{ "length:" => "10cm", "weight:" => "120kg" }

Related

How to Insert lines at specific location in file using perl script

this is my problem I'm trying to read an HTML file(index.html) then search all links an put it on a second file named salida.html, I read this answer, I read this answer and I tried to do it, but it didn't work for me.
This is my perl code:
use strict;
use warnings;
use 5.010;
use Tie::File;
my $entrada='index.html';
my $salida='salida.html';
open(A,"<$entrada");
my #links;
foreach my $linea (<A>){
print "Renglon => $linea\n" if $linea =~ m/a href/;
#print $B $linea if $linea =~ m/a href/;
push #links, $linea if $linea =~ m/a href/;
}
tie my #resultado, 'Tie::File', 'salida.html' or die "Nelson";
for (#resultado) {
if ($_ =~ m/<main class="contenido">/){
foreach my $found (#links){
$_ .= '<br/>'.$found;
}
last;
}
}
close(A);
My Perl code runs without problems but in the for of my code I'm trying to write the links that I have in my variable $links in a specific part of my salida.html file:
<!DOCTYPE html>
<html lang="es-mx">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Resultados de la busqueda</title>
<link rel="stylesheet" href="style-salida.css">
</head>
<body>
<div class="contenedor">
<header class="header">
<h2>Resultados de la busqueda</h2>
</header>
*<main class="contenido">
</main>*
<footer class="footer">
<h4>
Gerardo Saucedo Arevalo - 15092087 - Topicos selectos de tecnologias web - Búsqueda de enlaces dentro de
una página web
</h4>
</footer>
</div>
</body>
</html>
But my code always add the lines at the end of the file, I ran this code once and it worked perfectly, but then I add some lines and when I tried to run one more time didn't work.
I restored my file at the moment when it worked but it does not work anymore.
What I'm doing wrong?

Always process HTML or XML with an appropriate parser and then implement your processing on the DOM. My solution uses HTML::TreeBuilder. As your question doesn't include the contents of index.html I have appended my own to the solution:
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
# Extract links from <DATA>
my $root1 = HTML::TreeBuilder->new->parse_file(\*DATA)
or die "HTML: $!\n";
my #links = $root1->look_down(_tag => 'a');
# Process salida.html from STDIN
my $root2 = HTML::TreeBuilder->new;
$root2->ignore_unknown(0);
$root2->parse_file(\*STDIN)
or die "HTML: $!\n";
# insert links in correct section
if (my #nodes = $root2->look_down(class => 'contenido')) {
$nodes[0]->push_content(#links);
}
print $root2->as_HTML(undef, ' '), "\n";
# IMPORTANT: must delete manually
$root2->delete;
$root1->delete;
exit 0;
__DATA__
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<div>
Link 1
Link 2
</div>
</body>
</html>
Test run:
$ perl dummy.pl <dummy.html
<!DOCTYPE html>
<html lang="es-mx">
...
<main class="contenido"> Link 1Link 2</main>
...
</html>

Perl - How to validate Chinese character input from web form?

My web page uses Charset UTF-8 to allow Chinese character input in a textarea form field. I want to test if the input contains a certain character. I've writtena test script to see how Perl is going to handle the Chinese input. It's not finding the match when there is a known match.
Here is my test form:
<!DOCTYPE html>
<head>
<meta charset="utf-8">
</head>
<body>
<form method="post" action="http://www.my_domain.com/cgi-bin/my_test_script.pl">
<textarea name="user_input" rows="" cols=""></textarea>
<input type="submit" name="submit" value="submit">
</form>
</body>
</html>
Here is my code:
#!/usr/bin/perl -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
use utf8;
print "Content-type: text/html; charset=UTF-8\n\n";
print "<meta http-equiv='content-type' content='text/html;charset=UTF-8'>";
my $query = new CGI;
my $msg = $query->param('user_input');
chomp $msg;
my $msg_code = ord($msg);
print "<p> Message was: ".$msg."\n";
print "<p> Message Code is: ".$msg_code."\n";
my $char_from_code_point = "\N{U+89C6}";
my $char_from_code_point_reverse_code = ord($char_from_code_point);
print "<p> char_from_code_point= ".$char_from_code_point."\n";
print "<p> char_from_code_point_reverse_code = ".$char_from_code_point_reverse_code."\n";
if ($msg =~ m/$char_from_code_point/) {
print "<p>Matched!\n";
}
else {
print "<p> NOT matched\n";
}
And here is the output from submitting the correct character:
Message was: 视
Message Code is: 232
char_from_code_point= 视
char_from_code_point_reverse_code = 35270
NOT matched
Could someone please point out what I'm doing wrong?
Thank you.

How to upload multiple files using Mojolicious?

I'm new to Mojolicious, trying to learn it. I'm trying to upload multiple files using form but only one file is uploaded at a time. Any suggestion?
#!perl -w
use Mojolicious::Lite;
use Mojo::Upload;
use v5.14;
get '/' => 'page';
post '/' => sub {
my $self = shift;
my #files;
for my $file ($self->req->upload('files')) {
my $size = $file->size;
my $name = $file->filename;
push #files, "$name ($size)";
$file->move_to("C:\\Program Files\\Apache Software Foundation\\Apache24\\htdocs\\ProcessingFolder\\".$name);
}
$self->render(text => "#files");
} => 'save';
app->start;
__DATA__
## page.html.ep
<!DOCTYPE html>
<html>
<body>
<form action="<%=/ProcessingFolder/%>" method="POST"
enctype="multipart/form-data">
<input name="files" type="file" enctype="multipart/form-data" multiple="multiple">
<button type="submit">Upload</button>
</form>
</body>
</html>

Found Answer. Used uploads() instead of upload() which returns Array reference.
#!perl -w
use Mojolicious::Lite;
use Mojo::Upload;
use v5.14;
get '/' => 'page';
post '/' => sub {
my $self = shift;
my #files;
for my $file (#{$self->req->uploads('files')}) {
my $size = $file->size;
my $name = $file->filename;
push #files, "$name ($size)";
$file->move_to("C:\\Program Files\\Apache Software Foundation\\Apache24\\htdocs\\ProcessingFolder\\".$name);
}
$self->render(text => "#files");
} => 'save';
app->start;
__DATA__
## page.html.ep
<!DOCTYPE html>
<html>
<body>
<form action="<%=/ProcessingFolder/%>" method="POST"
enctype="multipart/form-data">
<input name="files" type="file" enctype="multipart/form-data" multiple="multiple">
<button type="submit">Upload</button>
</form>
</body>
</html>

I think there is a small bug in the answer above - the variable ProcessingFolder is obsolete ...
#!perl -w
# install Mojolicious by : sudo -s 'curl -L cpanmin.us | perl - Mojolicious'
# run by : perl mojo-file-upload.pl daemon -m production -l http://*:8083
# point your browser #: http://127.0.0.1/upload-files
#
use Mojolicious::Lite;
use Mojo::Upload;
use v5.14;
get '/' => 'upload-files-page';
get '/upload-files' => 'upload-files-page';
post '/upload-files' => sub {
my $self = shift;
my #files;
for my $file (#{$self->req->uploads('files')}) {
my $size = $file->size;
my $name = $file->filename;
push #files, "$name ($size)";
$file->move_to("/tmp/".$name);
}
$self->render(text => "#files");
} => 'save';
app->start;
__DATA__
## upload-files-page.html.ep
<!DOCTYPE html>
<html>
<body>
<form action="/upload-files" method="POST"
enctype="multipart/form-data">
<input name="files" type="file"
enctype="multipart/form-data" multiple="multiple">
<button type="submit">Upload</button>
</form>
</body>
</html>

Search and replace the content between a specific tag

#!/usr/bin/perl
use strict;
use warnings;
my $html = q|
<html>
<head>
<style>
.classname{
color: red;
}
</style>
</head>
<body>
classname will have a color property.
</body>
</html>
|;
$html=~s/classname/NEW/g;
print $html;
This replaces classname in both places. How can I limit the replacement only to content of <body>? I'd like to see it done using HTML::Parser or HTML::TreeBuilder.

I believe this does what you want, replaces classname with your regexp on all children of body element, using HTML::TreeBuilder.
I added another dummy div to input to make sure it was being processed correctly.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = q|
<html>
<head>
<style>
.classname{
color: red;
}
</style>
</head>
<body>
classname will have a color property.
<div>more text with classname in it</div>
</body>
</html>
|;
my $tree = HTML::TreeBuilder->new_from_content($html);
replace_text( $tree->find_by_tag_name("body") );
print $tree->as_HTML."\n";
sub replace_text {
my $html_element = shift;
for my $el ( $html_element->content_refs_list ){
if ( ref( $$el ) ){
replace_text( $$el );
next;
}
$$el =~ s /classname/NEW/g;
}
return $html_element;
}

How to redirect a page in perl script

I am fairly new to Perl.
I have a form that reads into a script.pl and does the validation check and etc.
How can I make it so once its done showing the validation, loops back to the home page after a few seconds automatically?
I tried using the following and it didn't work:
use strict;
use warnings;
my $url = "http://google.com";
print "Location: $url\n\n";
An Example of HTML for this would be: <META HTTP-EQUIV="REFRESH" CONTENT="10;URL=index.htm">
here is what i have:
#!/usr/bin/perl
use strict;
use warnings;
my $url = "google.com";;
print "Location: $url\n\n";
print "Content-type: text/html\n\n";
%form=&parse_form();
etc....etc...

You could use the following alternative:
use strict;
use warnings;
my $url = "http://google.com";
print "Content-type: text/html\n\n";
print qq[
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Redirecting...</title>
<meta HTTP-EQUIV="REFRESH" CONTENT="10;URL=$url">
</head>
<body>
</body>
</html>
];
The following is valid and should work fine:
use strict;
use warnings;
my $url = "http://google.com";
print "Location: $url\n\n";

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to parse a tricky HTML file, using HTML::TreeBuilder - perl

Related

How to Insert lines at specific location in file using perl script

Perl - How to validate Chinese character input from web form?

How to upload multiple files using Mojolicious?

Search and replace the content between a specific tag

How to redirect a page in perl script

Categories

Resources