I am trying to scrape only the text information from a web page that is built from a set of divs, tags, etc. I want to extract information only from a specific div class, and then only the text within its tags.
<div class="col col60 moduledetail"><table cellspacing="2" cellpadding="0" border="0" id="moduleDetail"><tr><th class="moduleCode">test<small>CRN: 33413</small></th><th>test</th></tr><tr><td class="label"><nobr>Campus</nobr></td><td><a target="_blank" href="test/">test</a></td></tr><tr><td class="label">
Above is a snippet of what the web page contains. My attempt at getting the page contents does exactly what it says: it gets everything from the web page. How can I narrow this down to this class, and to only the textual information within the tags?
Thanks.
Use an HTML parser. Here's an example using HTML::TreeBuilder:
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get($url);
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
if (my $div = $tree->look_down(_tag => "div", class => "col col60 moduledetail")) {
print $div->as_text(), "\n";
}
$tree->delete();
I am creating a simple Perl script that serves a web page for registering users. This is just a learning program for me, and it is very simple. I display a page in the browser where the user enters a name, user name, and password. After the user presses submit, I check the user name against the database. If the user name already exists in the database, I just want to display an error and bring up the register page again. I am using the cgi->redirect function, but I am not sure that is how redirection should be done, because it does not work as I expected: it displays "The document has moved here". Please point me in the right direction. Thanks.
Here are the scripts:
registeruser.pl
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print <<PAGE;
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1> Register New User</h1>
</div>
<div id="content">
<form action="adduser.pl" method="POST">
<b>Name:</b> <input type="text" name="name"><br>
<b>UserName:</b> <input type="text" name="username"><br>
<b>Password:</b> <input type="password" name="password"><br>
<input type="submit">
</div>
</body>
<html>
PAGE
adduser.pl
#!/usr/bin/perl
use CGI;
use DBI;
$cgiObj = CGI->new;
print $cgiObj->header ('text/html');
# get post data
$newUser = $cgiObj->param('username');
$newName = $cgiObj->param('name');
$newPass = $cgiObj->param('password');
# set up sql connection
$param = 'DBI:mysql:Tracker:localhost';
$user = 'madison';
$pass = 'qwerty';
$connect = DBI->connect ($param, $user, $pass);
$sql = 'select user from users where user = "' . $newUser . '"';
$query = $connect->prepare ($sql);
$query->execute;
$found = 0;
while (@row = $query->fetchrow_array)
{
$found = 1;
}
if ($found == 0)
{
# no user found add new user
$sql = 'insert into users (user, name, passwd) values (?, ?, ?)';
$insert = $connect->prepare ($sql);
$insert->execute ($newUser, $newName, $newPass);
}
else
{
# user already exists, get new user name
# What do I do here ????
print $cgiObj->redirect ("registerusr.pl");
}
One thing to look out for is SQL injection. For an illustrated example, see Little Bobby Tables.
As it stands, your code is insecure and can allow people to do bad things to your database. DBI provides placeholders as a secure way of querying a database with user input. For an example, see http://bobby-tables.com/perl.html
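As a rough sketch of what that could look like for the lookup in adduser.pl: the helper name user_exists is made up for illustration, and it assumes $dbh is an already connected DBI handle and that the table and column names match the question (users.user).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a row with this user name exists, else 0.
# Assumes $dbh is a connected DBI database handle.
sub user_exists {
    my ($dbh, $username) = @_;

    # The ? placeholder keeps user input out of the SQL text. DBI hands the
    # value to the driver separately, so quotes in it cannot change the query.
    my ($count) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM users WHERE user = ?',
        undef,        # no statement attributes
        $username,    # value bound to the placeholder
    );

    return $count ? 1 : 0;
}
```

In the question's script this would replace the string-concatenated select, for example: if (user_exists($connect, $newUser)) { ... }. The insert statement in the question already uses placeholders in the same way.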
Also, in this day and age even the CGI module warns you not to use it:
The rationale for this decision is that CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
I suggest you use Dancer to make your life easier.
Three things:
1. Include use strict; and use warnings; in EVERY Perl script. No exceptions.
   This is the #1 thing that you can do to be a better Perl programmer. It will save you an incalculable amount of time during both development and testing.
2. Don't use redirects to switch between form processing and form display.
   Keep your form display and form processing in the same script. This enables you to display error messages in the form and only move on to a new step once the form has been processed successfully.
   You simply need to test the request_method to determine whether the form needs to be processed or just displayed.
3. CGI works for learning Perl, but look at CGI::Alternatives for live code.
The following is your form refactored with the first 2 guidelines in mind:
register.pl:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $q = CGI->new;
my $name = $q->param('name') // '';
my $username = $q->param('username') // '';
my $password = $q->param('password') // '';
# Process Form
my @errors;
if ( $q->request_method() eq 'POST' ) {
    if ( $username =~ /^\s*$/ ) {
        push @errors, "No username specified.";
    }
    if ( $password =~ /^\s*$/ ) {
        push @errors, "No password specified.";
    }
    # Successful Processing
    if ( !@errors ) {
        # Obfuscate for display
        $password =~ s/./*/g;
        print $q->header();
        print <<"END_PAGE";
<html>
<head><title>Success</title></head>
<body>
<p>Name = $name</p>
<p>Username = $username</p>
<p>Password = $password</p>
</body>
</html>
END_PAGE
        exit;
    }
}
# Display Form
print $q->header();
print <<"END_PAGE";
<html>
<head>
<link rel="stylesheet" type="text/css" href="tracker.css"/>
</head>
<body>
<div id="header">
<h1>Register New User</h1>
</div>
@{[ @errors ? join("\n", map "<p>Error: $_</p>", @errors) : '' ]}
<div id="content">
<form action="register.pl" method="POST">
<b>Name:</b> @{[ $q->textfield( -name => 'name' ) ]}<br>
<b>UserName:</b> @{[ $q->textfield( -name => 'username' ) ]}<br>
<b>Password:</b> @{[ $q->password_field( -name => 'password' ) ]}<br>
<input type="submit">
</form>
</div>
</body>
</html>
END_PAGE
__DATA__
I'm trying to follow a link in Perl.
My initial code:
use WWW::Mechanize::Firefox;
use Crypt::SSLeay;
use HTML::TagParser;
use URI::Fetch;
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}=0; #not verifying certificate
my $url = 'https://';
$url = $url . $ARGV[0];
my $mech = WWW::Mechanize::Firefox->new;
$mech->get($url);
$mech->follow_link(tag => 'a', text => '<span class=\"normalNode\">VSCs</span>');
$mech->reload();
I found here that the tag and text options work this way but I got the error MozRepl::RemoteObject: SyntaxError: The expression is not a legal expression. I tried to escape some characters in the text, but the error was still the same.
Then I changed my code adding:
my @list = $mech->find_all_links();
my $found = 0;
my $i = 0;
while ($i <= $#list && $found == 0) {
    print $list[$i]->url() . "\n";
    if ($list[$i]->text() =~ /VSCs/) {
        print $list[$i]->text() . "\n";
        my $follow = $list[$i]->url();
        $mech->follow_link( url => $follow );
    }
    $i++;
}
But then I get another error: No link found matching '//a[(@href = "https://... followed by a lot more text that seems to be the link's description.
I hope I made myself clear, if not, please tell me what else to add. Thanks to all for your help.
Here's the part where the link I want to follow is:
<li id="1" class="liClosed"><span class="bullet clickable"> </span><b><span class="normalNode">VSCs</span></b>
<ul id="1.l1">
<li id="i1.i1" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">First</span></b></li>
<li id="i1.i2" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Second</span></b></li>
<li id="i1.i3" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Third</span></b></li>
<li id="i1.i4" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Fourth</span></b></li>
<li id="i1.i5" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">None</span></b></li>
</ul>
I'm working in Windows 7, MozRepl is version 1.1 and I'm using Strawberry perl 5.16.2.1 for 64 bits
After poking around with the given code, I was able to make W::M::F follow the link in the following manner:
use WWW::Mechanize::Firefox;
use Crypt::SSLeay;
use HTML::TagParser;
use URI::Fetch;
...
$mech->follow_link(xpath => '//a[text() = "<span class=\"normalNode\">VSCs</span>"]');
$mech->reload();
Note that the xpath parameter is given instead of text.
I didn't take a long look at the W::M::F sources, but under the hood it tries to translate the given text parameter into an XPath string, and if the text contains a number of XML/HTML tags, which is your case, that probably confuses it.
I recommend you try:
$mech->follow_link( url_regex => qr/selector=All/ );
I'm trying to create a website with forms for people to fill out. When the user presses the submit button, the text in each form field should be concatenated into a single string that is used to make a QR code. How could I do this, and what language would be best for compatibility with most browsers?
In addition, I would like each text field to be followed by a newline (\n), to make the format a little prettier when the user scans the QR code.
Please let me know. Thanks in advance. Could you include sample code for a website that has three text areas to concatenate?
The Imager::QRCode module makes this easy. I just knocked the following up in 5 minutes.
#!/Users/quentin/perl5/perlbrew/perls/perl-5.14.2/bin/perl
use v5.12;
use CGI; # This is a quick demo. I recommend Plack/PSGI for production.
use Imager::QRCode;
my $q = CGI->new;
my $text = $q->param('text');
if (defined $text) {
my $qrcode = Imager::QRCode->new(
size => 5,
margin => 5,
version => 1,
level => 'M',
casesensitive => 1,
lightcolor => Imager::Color->new(255, 255, 255),
darkcolor => Imager::Color->new(0, 0, 0),
);
my $img = $qrcode->plot($text);
print $q->header('image/gif');
$img->write(fh => \*STDOUT, type => 'gif')
or die $img->errstr;
} else {
print $q->header('text/html');
print <<END_HTML;
<!DOCTYPE html>
<meta charset="utf-8">
<title>QR me</title>
<h1>QR me</h1>
<form>
<div>
<label>
What text should be in the QR code?
<textarea name="text"></textarea>
</label>
<input type="submit">
</div>
</form>
END_HTML
}
How could I do this and what language would be the best for most browsers to be compatible.
If it runs on the server then you just need to make sure the output is compatible across browsers; so use GIF or PNG.
could you include a sample code of a website that has three text areas to concatenate?
Just use a . to concatenate string variables in Perl.
my $img = $qrcode->plot($foo . $bar . $baz);
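To get each field on its own line when the code is scanned, as the question asks, the values can be joined with "\n" instead of plain concatenation. This is only a sketch: the helper name build_qr_text and the sample values are made up for illustration, and in the CGI script the arguments would come from calls such as $q->param(...).

```perl
use strict;
use warnings;

# Hypothetical helper: joins the submitted fields with newlines so that each
# value appears on its own line in the decoded QR text.
sub build_qr_text {
    my @fields = @_;
    # Treat missing form fields as empty strings to avoid
    # "uninitialized value" warnings.
    return join "\n", map { defined $_ ? $_ : '' } @fields;
}

# Example with made-up values standing in for three form fields.
my $text = build_qr_text('Jane Doe', '555-0100', 'jane@example.com');
print "$text\n";
```

The result would then be passed to plot in place of the concatenated string, for example $qrcode->plot($text).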
Add binmode on STDOUT before writing the image so that the QR code's binary data is output unmodified, for example:
print $q->header('image/png');
binmode STDOUT;
$img->write(fh => \*STDOUT, type => 'png');
I'm an old-newbie in Perl, and I'm trying to create a subroutine using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within a div called "zone-extract".
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have two loops. The first uses get_tag 'div' and then looks for id = 'zone-extract'. The second loop looks inside this div and retrieves all the links (or that was my intention).
The inner loop works: run standalone, it extracts all links correctly. I think the problem is in the first loop, which looks for my desired div 'zone-extract'. I'm using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have at the moment is this error:
Can't call method "get_attr" on unblessed reference
Any ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool that requires too much code; its main fault is that it only supports a procedural style of programming.
A better alternative, which requires less code thanks to its declarative style, is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = @_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.
I need to parse the following sample HTML using an XPath query:
<td id="msgcontents">
<div class="user-data">Just seeing if I can post a link... please ignore post
http://finance.yahoo.com
</div>
</td>
<td id="msgcontents">
<div class="user-data">some text2...
http://abc.com
</div>
</td>
<td id="msgcontents">
<div class="user-data">some text3...
</div>
</td>
The above HTML may repeat any number of times in a page.
Also sometimes the ..... portion may be absent as shown in the above html blocks.
What I need is the xpath syntax so that I can get the parsed strings as
array1[0]= "Just seeing if I can post a link... please ignore post ttp://finance.yahoo.com"
array[1]="some text2 htp://abc.com"
array[2]="sometext3"
Maybe something like the following:
$remote = file_get_contents('http://www.sitename.com');
$dom = new DOMDocument();
//Error suppression unfortunately, as an invalid xhtml document throws up warnings.
$file = @$dom->loadHTML($remote);
$xpath = new DOMXpath($dom);
//Get all data with the user-data class.
$userdata = $xpath->query('//*[contains(@class, \'user-data\')]');
//get links
$links = $xpath->query('//a/@href');
So to access one of these variables, you need to use nodeValue:
$ret = array();
foreach($userdata as $data) {
$ret[] = $data->nodeValue;
}
Edit: I thought I'd mention that this will get all the links on a given page, I assume this is what you wanted?
Use:
concat(/td/div/text()[1], ' ', /td/div/a)
Instead of the ' ' above, you can use whatever delimiter you'd like to appear between the two strings.