Extract style tag data using Perl - perl

Is there anyway that I can extract style tag data from a HTML page using Perl
#!/usr/bin/perl
use strict;
my $HTML = <<"EOF";
<HTML>
<head>
<style type='text/css'>
#yui-dt0-bdrow0 td{background:#CFF;}
#yui-dt0-bdrow1 td{background:#CFF;}
#yui-dt0-bdrow2 td{background:#CFF;}
</style>
</head>
</HTML>
EOF
I need to extract yui-dt0-bdrow0 td{background:#CFF;} information from the above HTML code.
I googled for lot of modules but didn't find the right one. Other than that I didn't try writing any code to extract the information
Any help is appreciated.

Use Mojo::DOM
Sample:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $HTML = <<"EOF";
<HTML>
<head>
<style type='text/css'>
#yui-dt0-bdrow0 td{background:#CFF;}
#yui-dt0-bdrow1 td{background:#CFF;}
#yui-dt0-bdrow2 td{background:#CFF;}
</style>
</head>
</HTML>
EOF
my $dom = Mojo::DOM->new( $HTML );
print $dom->find('style')->text;
Output
chankey#pathak:~/myscripts$ perl mojo.pl
#yui-dt0-bdrow0 td{background:#CFF;}
#yui-dt0-bdrow1 td{background:#CFF;}
#yui-dt0-bdrow2 td{background:#CFF;}
You can now filter out the desired data.
For a 8 minute video tutorial on Mojo::DOM and Mojo::UserAgent check out Mojocast Episode 5

Related

How to Insert lines at specific location in file using perl script

this is my problem I'm trying to read an HTML file(index.html) then search all links an put it on a second file named salida.html, I read this answer, I read this answer and I tried to do it, but it didn't work for me.
This is my perl code:
use strict;
use warnings;
use 5.010;
use Tie::File;
my $entrada='index.html';
my $salida='salida.html';
open(A,"<$entrada");
my #links;
foreach my $linea (<A>){
print "Renglon => $linea\n" if $linea =~ m/a href/;
#print $B $linea if $linea =~ m/a href/;
push #links, $linea if $linea =~ m/a href/;
}
tie my #resultado, 'Tie::File', 'salida.html' or die "Nelson";
for (#resultado) {
if ($_ =~ m/<main class="contenido">/){
foreach my $found (#links){
$_ .= '<br/>'.$found;
}
last;
}
}
close(A);
My Perl code runs without problems but in the for of my code I'm trying to write the links that I have in my variable $links in a specific part of my salida.html file:
<!DOCTYPE html>
<html lang="es-mx">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Resultados de la busqueda</title>
<link rel="stylesheet" href="style-salida.css">
</head>
<body>
<div class="contenedor">
<header class="header">
<h2>Resultados de la busqueda</h2>
</header>
*<main class="contenido">
</main>*
<footer class="footer">
<h4>
Gerardo Saucedo Arevalo - 15092087 - Topicos selectos de tecnologias web - Búsqueda de enlaces dentro de
una página web
</h4>
</footer>
</div>
</body>
</html>
But my code always add the lines at the end of the file, I ran this code once and it worked perfectly, but then I add some lines and when I tried to run one more time didn't work.
I restored my file at the moment when it worked but it does not work anymore.
What I'm doing wrong?
Always process HTML or XML with an appropriate parser and then implement your processing on the DOM. My solution uses HTML::TreeBuilder. As your question doesn't include the contents of index.html I have appended my own to the solution:
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
# Extract links from <DATA>
my $root1 = HTML::TreeBuilder->new->parse_file(\*DATA)
or die "HTML: $!\n";
my #links = $root1->look_down(_tag => 'a');
# Process salida.html from STDIN
my $root2 = HTML::TreeBuilder->new;
$root2->ignore_unknown(0);
$root2->parse_file(\*STDIN)
or die "HTML: $!\n";
# insert links in correct section
if (my #nodes = $root2->look_down(class => 'contenido')) {
$nodes[0]->push_content(#links);
}
print $root2->as_HTML(undef, ' '), "\n";
# IMPORTANT: must delete manually
$root2->delete;
$root1->delete;
exit 0;
__DATA__
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<div>
Link 1
Link 2
</div>
</body>
</html>
Test run:
$ perl dummy.pl <dummy.html
<!DOCTYPE html>
<html lang="es-mx">
...
<main class="contenido"> Link 1Link 2</main>
...
</html>

Replacing XML nodes using perl and Mojo::DOM

I would like to exchange node in an XML file using Mojo::DOM.
I'm pretty sure it is possible but I didn't find a way yet.
Given the following XML:
my $xml = q~
<html>
<div>
<p>1</p>
<p>2</p>
<img />
</div>
</html>
~;
I would like to remove the div and instead insert a body tag, so that the result looks like this:
my $xml = q~
<html>
<body>
<p>1</p>
<p>2</p>
<img />
</body>
</html>
~;
I thought about replace, but I didn't find an example where the replacement is the $dom of the replaced tag.
It's very simple to just find the <div> element and use the tag method to change its tag
This program demonstrates. The CSS selector html > div finds the (first) <div> element that is a child of an <html> element
use strict;
use warnings;
use Mojo::DOM;
my $xml = q~
<html>
<div>
<p>1</p>
<p>2</p>
<img />
</div>
</html>
~;
my $dom = Mojo::DOM->new($xml);
$dom->at('html > div')->tag('body');
print $dom, "\n";
output
<html>
<body>
<p>1</p>
<p>2</p>
<img>
</body>
</html>

Search and replace the content between a specific tag

#!/usr/bin/perl
use strict;
use warnings;
my $html = q|
<html>
<head>
<style>
.classname{
color: red;
}
</style>
</head>
<body>
classname will have a color property.
</body>
</html>
|;
$html=~s/classname/NEW/g;
print $html;
This replaces classname in both places. How can I limit the replacement only to content of <body>? I'd like to see it done using HTML::Parser or HTML::TreeBuilder.
I believe this does what you want, replaces classname with your regexp on all children of body element, using HTML::TreeBuilder.
I added another dummy div to input to make sure it was being processed correctly.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = q|
<html>
<head>
<style>
.classname{
color: red;
}
</style>
</head>
<body>
classname will have a color property.
<div>more text with classname in it</div>
</body>
</html>
|;
my $tree = HTML::TreeBuilder->new_from_content($html);
replace_text( $tree->find_by_tag_name("body") );
print $tree->as_HTML."\n";
sub replace_text {
my $html_element = shift;
for my $el ( $html_element->content_refs_list ){
if ( ref( $$el ) ){
replace_text( $$el );
next;
}
$$el =~ s /classname/NEW/g;
}
return $html_element;
}

How can I remove an attribute from all DOM elements with Mojolicious?

I want to remove the bgcolor attribute from all elements of a page I am scraping via Mojolicious.
My attempt has been the following:
$dom->all_contents->each(sub { $_->attr('bgcolor' => undef) });
but this seems not to work.
How do I do it right?
The following uses Mojo::DOM to delete the bgcolor attribute for every node:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $node ($dom->find('*')->each) {
delete $node->{bgcolor};
}
print $dom;
__DATA__
<html>
<head>
<title>Hello background color</title>
</head>
<body bgcolor="white">
<h1>Hello world</h1>
<table>
<tr><td bgcolor="blue">blue</td></tr>
<tr><td bgcolor="green">green</td></tr>
</table>
</body>
</html>
Outputs:
<html>
<head>
<title>Hello background color</title>
</head>
<body>
<h1>Hello world</h1>
<table>
<tr><td>blue</td></tr>
<tr><td>green</td></tr>
</table>
</body>
</html>
Notes:
It's possible to use CSS Selectors to limit the returned nodes to only those containing the specific attribute:
for my $node ($dom->find('[bgcolor]')->each) {
One can also let Mojo handle the iteration like the following:
$dom->find('*')->each(sub {
delete $_->{bgcolor};
});
As I understand it, the DOM attribute you're looking for isn't bgcolor but background-color, the css variety. bgcolor fell out of popularity a while ago, in favor of defining classes and using CSS to set the styling on an object (including its background color). Try background-color instead.

How to redirect a page in perl script

I am fairly new to Perl.
I have a form that reads into a script.pl and does the validation check and etc.
How can I make it so once its done showing the validation, loops back to the home page after a few seconds automatically?
I tried using the following and it didn't work:
use strict;
use warnings;
my $url = "http://google.com";
print "Location: $url\n\n";
An Example of HTML for this would be: <META HTTP-EQUIV="REFRESH" CONTENT="10;URL=index.htm">
here is what i have:
#!/usr/bin/perl
use strict;
use warnings;
my $url = "google.com";;
print "Location: $url\n\n";
print "Content-type: text/html\n\n";
%form=&parse_form();
etc....etc...
You could use the following alternative:
use strict;
use warnings;
my $url = "http://google.com";
print "Content-type: text/html\n\n";
print qq[
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Redirecting...</title>
<meta HTTP-EQUIV="REFRESH" CONTENT="10;URL=$url">
</head>
<body>
</body>
</html>
];
The following is valid and should work fine:
use strict;
use warnings;
my $url = "http://google.com";
print "Location: $url\n\n";