Ignore Text in HTML::TreeBuilder Output Perl - perl

I need to ignore or remove all text in between all HTML elements so I can generate a blank template from a given web page.
I am parsing using the perl module HTML::TreeBuilder and HTML::Element.
I have tried the ignore_text method noted in the documentation, but it doesn't give the correct results.
I have also tried using DOMXPath with PHP to do the same thing, but the results seemed too cumbersome to manage. Regexes might work, but they are a last resort for me.
This is part of my current code; it is very basic. The bottom part just writes the output to a file. All of the code is functional; I just need the formatting to work so I can generate template files.
my $url= "http://www.example.com";
my $page = get($url) or die $!;
my $tree = HTML::TreeBuilder->new_from_content($page);
$tree->parse_file($page);
$tree->ignore_text;
$tree->elementify;
open OUTPUT, "+>".$body;
my $output = $tree->as_HTML;
print OUTPUT $output;
close OUTPUT;
Thanks in advance for the help!
EDIT: I found the problem - the ignore text only works when you parse from a physical file. I had to save the page as a temp file to parse then output the way I wanted with no text then I just did unlink($tmp) at the bottom to delete the file. My script has since grown much more complicated with reading and writing to database and each time I need to create this temp file which is kind of annoying...
Thanks for the reply below!

You are very close.
It looks like you need to call ignore_text with a true value, $tree->ignore_text(1), and make sure it's set before calling parse_file.
Sorry this is a bit long, but I hope it helps.
Here is a quick pass at the new code; it's hard to test without an example page:
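my $tree = HTML::TreeBuilder->new;
$tree->ignore_text(1);        # set before parsing
$tree->parse_file( $page );   # $page must be a file name or filehandle here
$tree->elementify;            # only after parsing is complete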
Here is my quick test script using a local file:
use strict;
use warnings;
use HTML::TreeBuilder;
my $page = 'test.html';
my $tree = HTML::TreeBuilder->new();
$tree->ignore_text(1);
$tree->parse_file($page);
$tree->elementify;
print $tree->as_HTML;
Input test.html:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title text</title>
</head>
<body>
<h1>Heading 1</h1>
<p>paragraph text</p>
</body>
</html>
And output:
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body><h1></h1><p></body></html>
Good luck
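For a page that is already held in a string (for example the result of get($url) in the question), a temporary file should not be necessary. The point is the same as above: ignore_text has to be set before any parsing happens, and new_from_content parses immediately, so the option arrives too late. A minimal sketch, assuming HTML::TreeBuilder's parse_content method; strip_text is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# strip_text is a hypothetical helper name, not part of HTML::TreeBuilder.
sub strip_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;   # empty tree, nothing parsed yet
    $tree->ignore_text(1);               # must be set before parsing
    $tree->parse_content($html);         # parse from a string and finish
    my $out = $tree->as_HTML;
    $tree->delete;                       # free the tree
    return $out;
}

print strip_text('<html><head><title>title text</title></head>'
               . '<body><h1>Heading 1</h1><p>paragraph text</p></body></html>'), "\n";
```

Building the tree with new and feeding it afterwards keeps the order of operations explicit, which is the part the original script got wrong.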

Maybe you should use HTML::Parser for this task. It may be a little more code, but it should not be too complicated.
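A rough sketch of what that could look like, assuming HTML::Parser's version 3 handler API; tags_only is a hypothetical helper name. Only start and end tags get a handler, so everything else (text, comments, declarations) is silently skipped:

```perl
use strict;
use warnings;
use HTML::Parser;

# tags_only is a hypothetical helper name used for illustration.
sub tags_only {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Copy start and end tags through exactly as they appear in the source.
        start_h => [ sub { $out .= $_[0] }, 'text' ],
        end_h   => [ sub { $out .= $_[0] }, 'text' ],
        # No text handler is registered, so text between tags is dropped.
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print tags_only('<h1>Heading 1</h1><p>paragraph text</p>'), "\n";
```

Unlike the TreeBuilder approach, this does not repair or normalise the markup; tags come out in the order and form they had in the input.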

Related

WWW::Mechanize Extraction Help - PERL

I'm trying to automate the extraction of a transcript found on a website. The entire transcript is found between dl tags, since the site formatted the interview as a description list. The script I have below allows me to search the site and extract the text in a plain-text format, but I'm actually looking for it to include everything between the dl tags, meaning dd's, dt's, etc. This will allow us to develop our own CSS for the interview.
Something to note about the page is that there are break statements inserted at various points during the interview. Some tools we've found that extract information from webpages using pairings have found this to be a problem since it only grabs the information up until the break statement. Just something to keep in mind if you point me in a different direction. Here's what I have so far.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");
# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
print $_->as_text();
}
If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!
Your code is fine, just change the as_text() method to as_HTML() and it will show the content with HTML tags included.

First cgi script in perl and don't know what it does

I am new to Perl and I am looking into CGI programs.
I tried the following from Perl Monks and it works. But I have no idea what it does.
1) What is the END_HERE that is followed by HTML?
print <<END_HERE;
<html>
<head>
<title>My First CGI Script</title>
</head>
<body bgcolor="#FFFFCC">
<h1>This is a pretty lame Web page</h1>
<p>Who is this Ovid guy, anyway?</p>
</body>
</html>
END_HERE
2) I modified the sample script by adding:
my $query = new CGI;
my $p= $query->param('myparam');
I.e. the new script is:
#!C:\perl\bin\perl.exe -wT
use strict;
use CGI;
my $query = new CGI;
print $query->header( "text/html" );
my $time = $query->param('fromDate');
print <<END_HERE;
<html>
<head>
<title>My First CGI Script $time</title>
</head>
<body bgcolor="#FFFFCC">
<h1>This is a pretty lame Web page</h1>
<p>Who is this Ovid guy, anyway?</p>
</body>
</html>
END_HERE
# must have a line after "END_HERE" or Perl won't recognize
# the token
It stopped working. I get the following error message:
Undefined subroutine &main::param called at C:/.../test2.cgi line 10.
How can I get the parameters sent by the browser if not this way?
... <<END_HERE ...
foo
bar
END_HERE
means
... "foo
bar
" ...
The choice of terminator is up to you. You can use any bareword or any string if you add quotes. Both the following are equivalent to "foo\nbar\n":
<<MEOW
foo
bar
MEOW
<<"And they lived happily ever after."
foo
bar
And they lived happily ever after.
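The quoting of the terminator also controls interpolation, which matters for the script in the question because it puts $time inside the here-doc. A small self-contained sketch (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $name = 'world';

# Bare or double-quoted terminator: variables are interpolated.
my $interpolated = <<"END_HERE";
Hello, $name
END_HERE

# Single-quoted terminator: the text is taken literally.
my $literal = <<'END_HERE';
Hello, $name
END_HERE

print $interpolated;   # prints "Hello, world"
print $literal;        # prints "Hello, $name"
```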
The script you posted has two problems, neither of them resulting in the error you specified.
Perl can't find the end of the here-doc since no line contains solely END_HERE. You have one that contains END_HERE with a whole bunch of leading spaces, but that's not the same thing. Remove the leading spaces.
It allows an arbitrary string to be placed in the HTML. Do escape (using, say, HTML::Entities's encode_entities)! Consider what happens if someone passes the following to the fromDate parameter:
<script>alert("owned")</script>
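A minimal sketch of the escaping step, assuming HTML::Entities is installed. The hard-coded $time stands in for $query->param('fromDate') so the snippet runs outside a web server:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Stand-in for $query->param('fromDate'); treat the value as attacker-controlled.
my $time = '<script>alert("owned")</script>';

# Encode <, >, & and quotes so the browser shows the value instead of running it.
my $safe = encode_entities($time);

print <<END_HERE;
<title>My First CGI Script $safe</title>
END_HERE
```

Escaping at the point of output, rather than when the parameter is read, keeps the raw value available for any other use in the script.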

Perl CGI display image to browser from file

Friends,
I have been scouring the web for a solution to displaying images to a web browser with Perl and have found nothing that works for me.
I've tried possible solutions such as:
How To Display an Image with Perl
Outputting Image Data
Return an Image From a Script
and none of it works for what I'm doing. I want to deny client access to the image (Or even simply place the image file out of the www root) and dish it out server-side.
Here is an example of what I'm doing:
In my main perl file:
...
my $query = CGI->new();
sub main {
### This grabs page content from a module, depending on the page name.
### The module returns the HTML.
my $html = get_page('page', 'session');
### Perform any special conditionals here before printing the header...
print $query->header(some cookie/session data here);
print $html
}
In one of the modules:
sub return_page_content {
return <<HTML;
<html>
<body>
<img src="GET IMAGE HERE..." />
</body>
</html>
HTML
}
I've thought about just creating a copy of the image in a temp directory location, but that seems like it would defeat the entire purpose of keeping the image out of client-side access.
The probable solutions do not generate the image. I'm not sure where to go from here, so I am hoping someone here has an idea. Thank you so much!
Please let me know if I need to provide additional information. I feel like this could be beneficial to a lot of people. I hope at least ;-)
For a simple 'send the file to the browser solution', just send the browser the correct headers (to let the browser know what's coming), and then open your image and print the content to STDOUT.
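select(STDOUT); $| = 1; # unbuffer STDOUT
open(my $image, '<', '/image_outside_webroot/image.png') or die "Cannot open image: $!";
binmode $image;   # read raw bytes
print "Content-type: image/png\n\n";
binmode STDOUT;   # write raw bytes (matters on Windows)
print <$image>;
close $image;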
Once that is working, take a look at ImageMagick. There are all kinds of on-the-fly, fun image manipulations you can do (resizing, colorizing, etc.)
Your cgi script would contain code that looks something like this:
use Image::Magick;
select(STDOUT); $| = 1; #unbuffer STDOUT
my $image = Image::Magick->new();
my $x = $image->Read(filename =>"/images_outside_web_root/image1.png");
#some manipulation of the image here
print "Content-type: image/png\n\n";
binmode STDOUT;
$x = $image->Write('png:-');
You can read more about this on the site linked above.
Hope that helps.
Well, it turns out the reason this wasn't working is because I completely forgot I do not operate with CGI out of the standard cgi_bin directory. Instead, I use an .htaccess file to tell the server how and when to interpret files from the root directory as cgi.
So, this is what I ended up using in my image dishing script:
imagedish.pm
use CGI;
my $cgi = new CGI;
open (IMAGE, 'logo.png');
my $size = -s "logo.png";
my $data;
read IMAGE, $data, $size;
close (IMAGE);
print $cgi->header(-type=>'image/png'), $data;
This did the trick, as it should have been doing from the beginning, but in my .htaccess file, I added:
<files imagedish.pm>
SetHandler cgi-script
</files>
And that did the trick (Well, that, as well as going into Terminal and running chmod +x imagedish.pm to make it executable)! Of course the next steps are additional security measures, but at least it's working now! :-)
The full solution:
mainfile:
#!/usr/bin/perl
use CGI;
use CGI::Carp qw(fatalsToBrowser);
use CGI::Session qw/-ip-match/;
use DBI;
use strict;
use warnings;
# Variables
my $query = CGI->new();
my %vars = $query->Vars();
sub main {
my $p1 = $vars{p1};
$p1 = 'Home' if (!$p1);
my $html = get_page();
#I use this method in case we have multiple sessions
#I've omitted how I acquire the session, as this is not part of the solution ;-)
print $query->header(-cookie=>[$query->cookie($vars{p1}=>$session->id)]);
print $html;
}
sub get_page {
return <<HTML;
<!DOCTYPE HTML>
<html>
<head>
<title>Image Disher</title>
<link rel="shortcut icon" href="images/favicon.ico" />
<link href="css/style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div class="container">
<div class="addentry">
<div class="iaddentry">
<form name="client" action="" method="POST">
<div class="form-header" action="">
<center>
<img src="imagedish.pm" width="305" alt=""/><br>
</center>
<br>
</div>
</form>
</div>
</div>
</div>
</body>
</html>
HTML
}
main();
In here, the img tag looks for the source "imagedish.pm", and once it finds it, the .htaccess file tells it to execute as a CGI script. At that point, it dishes out information appropriately, not like before.
Please note, this is not the most-secure way to do it, but it gets me going in the right direction.
The links you found (except the first one) describe how to do it, I think you are just getting confused with the difference between delivering an image and delivering an html file with an img tag on it. Keep in mind that when your browser is parsing an html file and it encounters an img tag, it takes the src url in the tag and makes an additional get request for it.
Try capturing the raw output from a request for an image using curl, wireshark, etc. The result is what you want to try to create. It's just a matter of returning the content type http header, followed by the binary image data.
Have another look at this example, and get rid of the random_file sub, and replace this line:
my $image = random_file( IMAGE_DIRECTORY, '\\.(png|jpg|gif)$' );
with this:
my $image = "path to an image file accessible by www-user";
Hopefully, once you have that working and understand what it's doing, you'll know what you need to do next.
Yet another way is to embed the image using the <img src="data:image/...;base64,..."> format.
This defeats browser caching and isn't great for very large images, but it is useful if it's easier to construct the image as part of the initial page load or you don't want the hassle of managing and serving images via a file system.
#!/usr/bin/perl
use warnings; use strict;
use MIME::Base64 qw();
sub return_page_content {
my $image_type = shift;
my $image_data = shift;
my $image_base64 = MIME::Base64::encode($image_data);
$image_base64 =~ s{\n}{}g; # lose newlines
return <<HTML;
<html>
<body>
<img src="data:image/${image_type};base64,${image_base64}" />
</body>
</html>
HTML
}
my $image_path = "/tmp/test.jpg";
open(my $fh, '<', $image_path)
or die "unable to open file $image_path: $!";
binmode($fh);
my $image_data = do {local $/; <$fh>};
close($fh);
print return_page_content("jpeg", $image_data);
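As a small variation on the script above, encode_base64 accepts an end-of-line string as its second argument, so passing an empty string avoids the separate newline-stripping step. A sketch with a hypothetical data_uri helper:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# data_uri is a hypothetical helper name used for illustration.
sub data_uri {
    my ($image_type, $image_data) = @_;
    # The empty second argument asks for no line breaks in the output.
    return "data:image/${image_type};base64," . encode_base64($image_data, '');
}

# Placeholder bytes; a real caller would pass the contents of an image file.
print data_uri('png', 'abc'), "\n";
```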

Error while reading text out of a pdf using perl api pdf::api2

This is the code to read text of a pdf using perl
#!/usr/bin/perl
use PDF::API2;
$pdf = PDF::API2->new;
$pdf = PDF::API2->open('01443325.pdf');
$page = $pdf->page;
$pagenum=10;
$pdf->stringify;
$page = $pdf->openpage($pagenum);
print $page;
I don't get any output when I run this code. How can I fix this?
When you run $pdf->stringify above, it returns the content of the file as a string, but then you don't do anything with it. If you were to print it, though, it would not give you the text representation you are after as it is simply the original PDF bytes in a string.
Likewise, setting $pagenum to 10 has no consequences for the rest of the program as the variable is not linked to either the $pdf or $page object in any way.
I think the easiest option is to not try to do this with PDF::API2, but to look at whether you can run something like pdftotext from xpdf or poppler first and then read in the output.
If not, then there are some suggestions on the Perl Monks page http://www.perlmonks.org/?node_id=810721, and many more on Google under "perl extract text from pdf". There's even a previous SO question at How can I extract text from a PDF file in Perl?.
Good luck!
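The pdftotext route suggested above could look roughly like this. It is a sketch that assumes the pdftotext command from xpdf or poppler is installed and on the PATH, and that the platform supports the list form of a piped open (Unix-like systems do). The helper names pdftotext_command and page_text are hypothetical:

```perl
use strict;
use warnings;

# Build the argument list for pdftotext: -f and -l select the first and
# last page, and "-" as the output name sends the text to STDOUT.
sub pdftotext_command {
    my ($file, $page) = @_;
    return ('pdftotext', '-f', $page, '-l', $page, $file, '-');
}

# Return the text of one page by reading pdftotext's output.
sub page_text {
    my ($file, $page) = @_;
    open(my $fh, '-|', pdftotext_command($file, $page))
        or die "cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };   # slurp the whole output
    close($fh) or die "pdftotext failed";
    return $text;
}

# File name and page number are taken from the question.
print page_text('01443325.pdf', 10) if -e '01443325.pdf';
```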

Why doesn't my Perl CGI program work on Windows?

I have written following in index.pl which is the C:\xampp\htdocs\perl folder:
#!/usr/bin/perl
print "<html>";
print "<h2>PERL IT!</h2>";
print "this is some text that should get displayed in browser";
print "</html>";
When I browse to http://localhost:88/perl/ the above HTML doesn't get displayed (I have tried in IE FF and chrome).
What would be the reason?
I have xampp and apache2.2 installed on this Windows XP system.
See also How do I troubleshoot my Perl CGI Script?.
Your problem was due to the fact that your script did not send the appropriate headers.
A valid HTTP response consists of two sections: Headers and body.
You should make sure that you use a proper CGI processing module. CGI.pm is the de facto standard. However, it has a lot of historical baggage and CGI::Simple provides a cleaner alternative.
Using one of those modules, your script would have been:
#!/usr/bin/perl
use strict; use warnings;
use CGI::Simple;
my $cgi = CGI::Simple->new;
print $cgi->header, <<HTML;
<!DOCTYPE HTML>
<html>
<head><title>Test</title></head>
<body>
<h1>Perl CGI Script</h1>
<p>this is some text that should get displayed in browser</p>
</body>
</html>
HTML
Keep in mind that print has no problem with multiple arguments. There is no reason to learn to program like it's 1999.
Maybe it's because you didn't put your text between <body> tags. Also you have to specify the content type as text/html.
Try this:
print "Content-type: text/html\n\n";
print "<html>";
print "<body>";
print "<h2>PERL IT!</h2>";
print "this is some text that should get displayed in browser";
print "</body>";
print "</html>";
Also, from the link rics gave,
Perl:
Executable: \xampp\htdocs and \xampp\cgi-bin
Allowed endings: .pl
so you should be accessing your script like:
http://localhost/cgi-bin/index.pl
I am just guessing.
Have you started the apache server?
Is 88 the correct port for reaching your apache?
You may also try http://localhost:88/perl/index.pl (so adding the script name to the correct address).
Check this documentation for help.