I'm trying to follow a link in Perl.
My initial code:
use WWW::Mechanize::Firefox;
use Crypt::SSLeay;
use HTML::TagParser;
use URI::Fetch;
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME}=0; #not verifying certificate
my $url = 'https://';
$url = $url . $ARGV[0];
my $mech = WWW::Mechanize::Firefox->new;
$mech->get($url);
$mech->follow_link(tag => 'a', text => '<span class=\"normalNode\">VSCs</span>');
$mech->reload();
I found here that the tag and text options work this way but I got the error MozRepl::RemoteObject: SyntaxError: The expression is not a legal expression. I tried to escape some characters in the text, but the error was still the same.
Then I changed my code adding:
my @list = $mech->find_all_links();
my $found = 0;
my $i=0;
while($i<=$#list && $found == 0){
print $list[$i]->url()."\n";
if($list[$i]->text() =~ /VSCs/){
print $list[$i]->text()."\n";
my $follow = $list[$i]->url();
$mech->follow_link( url => $follow);
}
$i++;
}
But then again there's an error: No link found matching '//a[(@href = "https://... followed by a lot more text that seems to be the link's description.
I hope I made myself clear, if not, please tell me what else to add. Thanks to all for your help.
Here's the part where the link I want to follow is:
<li id="1" class="liClosed"><span class="bullet clickable"> </span><b><span class="normalNode">VSCs</span></b>
<ul id="1.l1">
<li id="i1.i1" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">First</span></b></li>
<li id="i1.i2" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Second</span></b></li>
<li id="i1.i3" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Third</span></b></li>
<li id="i1.i4" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">Fourth</span></b></li>
<li id="i1.i5" class="liBullet"><span class="bullet"> </span><b><span class="normalNode">None</span></b></li>
</ul>
I'm working on Windows 7, MozRepl is version 1.1, and I'm using Strawberry Perl 5.16.2.1 (64-bit).
After poking around with the given code, I was able to make W::M::F follow the link in the following manner:
use WWW::Mechanize::Firefox;
use Crypt::SSLeay;
use HTML::TagParser;
use URI::Fetch;
...
$mech->follow_link(xpath => '//a[text() = "<span class=\"normalNode\">VSCs</span>"]');
$mech->reload();
Note the xpath parameter given instead of text.
I didn't look closely at the W::M::F source, but under the hood it appears to translate the given text parameter into an XPath expression, and when the text contains XML/HTML tags, as yours does, the generated expression is probably invalid.
I recommend trying:
$mech->follow_link( url_regex => qr/selector=All/ );
I am using HTML::TreeBuilder to process HTML files. In those files I can have definition lists where there is term "Database" with definition "Database Name". Simulated html looks like this:
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;
use feature qw( say );
my $exampleContent = '<dl>
<dt data-auto="citation_field_label">
<span class="medium-bold">Language:</span>
</dt>
<dd data-auto="citation_field_value">
<span class="medium-normal">English</span>
</dd>
<dt data-auto="citation_field_label">
<span class="medium-bold">Database:</span>
</dt>
<dd data-auto="citation_field_value">
<span class="medium-normal">Data Archive</span>
</dd>
</dl>';
my $root = HTML::TreeBuilder->new_from_content($exampleContent);
my $dlist = $root->look_down("_tag" => "dl");
foreach my $e ($dlist->look_down("_tag" => 'dt', "data-auto" => "citation_field_label")) {
if ($e->as_text =~ m/Datab.*/) {
say $e->as_text; # I have found "Database:" 'dt' field
# now I need to go to the next field 'dd' and return the value of that
}
}
I need to identify which database the file has come from and return the value.
I would like to be able to say something like say $dlist->right()->as_text; when I have identified <dt> with "Database:" in it, but I do not know how. Your thoughts would be much appreciated.
You were almost there. Using
$e->right->as_text;
gives me "Data Archive".
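As a minimal sketch of the same idea applied to every term, the snippet below pairs each <dt> with the <dd> that follows it. The compact HTML string and the %field hash are illustrative assumptions, not part of the original files. In scalar context, right returns only the immediate right sibling, which is why it is assigned to a scalar first.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative, compacted version of the definition list from the question.
my $html = '<dl><dt>Language:</dt><dd>English</dd>'
         . '<dt>Database:</dt><dd>Data Archive</dd></dl>';

my $root = HTML::TreeBuilder->new_from_content($html);

my %field;    # hypothetical name: maps each term to its definition
for my $dt ($root->look_down(_tag => 'dt')) {
    my $dd = $dt->right;    # scalar context: the next sibling only
    # Skip terms that are not followed by a <dd> element.
    next unless ref $dd && $dd->tag eq 'dd';
    (my $label = $dt->as_text) =~ s/:\s*$//;    # drop the trailing colon
    $field{$label} = $dd->as_text;
}

print "$field{Database}\n";
$root->delete;
```

Checking that the sibling really is a <dd> guards against terms that have no definition.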
I'm trying to create a website with a form for people to fill out. When the user presses the submit button, the text in each form field should be concatenated into a single string that is used to make a QR code. How could I do this, and which language would be most compatible across browsers?
In addition, I would like each text field to be followed by a newline (\n) so that the output is formatted a little more nicely when the user scans the QR code.
Please let me know. Thanks in advance. Could you include sample code for a website that has three text areas to concatenate?
The Imager::QRCode module makes this easy. I just knocked the following up in 5 minutes.
#!/Users/quentin/perl5/perlbrew/perls/perl-5.14.2/bin/perl
use v5.12;
use CGI; # This is a quick demo. I recommend Plack/PSGI for production.
use Imager::QRCode;
my $q = CGI->new;
my $text = $q->param('text');
if (defined $text) {
my $qrcode = Imager::QRCode->new(
size => 5,
margin => 5,
version => 1,
level => 'M',
casesensitive => 1,
lightcolor => Imager::Color->new(255, 255, 255),
darkcolor => Imager::Color->new(0, 0, 0),
);
my $img = $qrcode->plot($text);
print $q->header('image/gif');
$img->write(fh => \*STDOUT, type => 'gif')
or die $img->errstr;
} else {
print $q->header('text/html');
print <<END_HTML;
<!DOCTYPE html>
<meta charset="utf-8">
<title>QR me</title>
<h1>QR me</h1>
<form>
<div>
<label>
What text should be in the QR code?
<textarea name="text"></textarea>
</label>
<input type="submit">
</div>
</form>
END_HTML
}
"How could I do this and what language would be the best for most browsers to be compatible?"
If it runs on the server then you just need to make sure the output is compatible across browsers; so use GIF or PNG.
"Could you include a sample code of a website that has three text areas to concatenate?"
Just use a . to concatenate string variables in Perl.
my $img = $qrcode->plot($foo . $bar . $baz);
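To get the newline between fields that the question asks for, join can be used instead of the . operator. This is only a sketch: the three values below are made up, and in the CGI script they would presumably come from three separate $q->param(...) calls.

```perl
use strict;
use warnings;

# Hypothetical field values standing in for three form parameters.
my @fields = ('Jane Doe', '555-0100', 'jane@example.com');

# Skip empty or missing fields, then put one newline between the rest.
my $payload = join "\n", grep { defined && length } @fields;

print $payload, "\n";
```

The resulting $payload string would then be passed to $qrcode->plot in place of the concatenated variables.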
Call binmode on STDOUT before writing the image so that the binary data is not corrupted by newline translation (this matters on Windows), for example:
print $q->header('image/png');
binmode STDOUT;
$img->write(fh => \*STDOUT, type => 'png');
I am trying to scrape only the text information from a web page which is set up with a set of divs, tags, etc. I want to extract information only from a specific div class, and then only the text within the tags.
<div class="col col60 moduledetail"><table cellspacing="2" cellpadding="0" border="0" id="moduleDetail"><tr><th class="moduleCode">test<small>CRN: 33413</small></th><th>test</th></tr><tr><td class="label"><nobr>Campus</nobr></td><td><a target="_blank" href="test/">test</a></td></tr><tr><td class="label">
Above is a snippet of what is contained within the web page. My attempt at getting the page contents does exactly what it says: it gets everything from the web page. How can I narrow this down to this class and only the textual information within the tags?
Thanks.
Use an HTML parser. Here's an example using HTML::TreeBuilder:
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get($url);
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
if (my $div = $tree->look_down(_tag => "div", class => "col col60 moduledetail")) {
print $div->as_text(), "\n";
}
$tree->delete();
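If the text of each table cell is wanted separately rather than as one run-together string, the cells can be walked row by row. The following is a sketch that works offline on a string assembled from the snippet in the question (with closing tags added as an assumption, since the original snippet is truncated); the " | " separator is arbitrary.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Snippet from the question, closed off so that it parses on its own.
my $html = '<div class="col col60 moduledetail"><table id="moduleDetail">'
         . '<tr><th class="moduleCode">test<small>CRN: 33413</small></th><th>test</th></tr>'
         . '<tr><td class="label"><nobr>Campus</nobr></td>'
         . '<td><a target="_blank" href="test/">test</a></td></tr>'
         . '</table></div>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
if (my $div = $tree->look_down(_tag => 'div', class => 'col col60 moduledetail')) {
    for my $tr ($div->look_down(_tag => 'tr')) {
        # Collect the trimmed text of every header or data cell in this row.
        my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => qr/^t[hd]$/);
        push @rows, join ' | ', @cells;
    }
}

print "$_\n" for @rows;
$tree->delete;
```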
I'm an old-newbie in Perl, and I'm trying to create a subroutine using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within one div called "zone-extract".
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have two loops: the first uses get_tag 'div' and then looks for id = 'zone-extract'. The second loop looks inside this div and retrieves all links (or that was my intention)...
The inner loop works: run standalone, it extracts all links correctly. But I think there are some issues in the first loop, which looks for my desired div 'zone-extract'... I'm using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have at the moment is this error:
Can't call method "get_attr" on unblessed reference
Any ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool that requires too much code; its drawback is that it only supports a procedural style of programming.
A better alternative, which requires less code thanks to its declarative style, is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = @_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.
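For completeness, the error in the question arises because get_tag returns a plain array reference of the form [$tag, $attr, $attrseq, $text], not an object, so the id has to be read as $div->[1]{id}. Below is a sketch of the two-loop approach with that fix applied; it also stops at the closing </div> so that links after the zone are not collected. The sample HTML and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample document with links before, inside and after the zone.
my $html = <<'HTML';
<div id="other"><a href="http://example.com/skip">skip</a></div>
<div id="zone-extract">
<li><a href="http://example.com/one">1</a></li>
<li><a href="http://example.com/two">2</a></li>
</div>
<a href="http://example.com/after">after</a>
HTML

my $stream = HTML::TokeParser->new(\$html);

my @links;
while (my $div = $stream->get_tag('div')) {
    # get_tag returns [$tag, $attr, $attrseq, $text] for a start tag.
    my $id = $div->[1]{id};
    next unless defined $id && $id eq 'zone-extract';

    # Track nesting so the loop ends at the matching </div>.
    my $depth = 1;
    while (my $tag = $stream->get_tag('a', 'div', '/div')) {
        if    ($tag->[0] eq 'div')      { $depth++ }
        elsif ($tag->[0] eq '/div')     { last if --$depth == 0 }
        elsif (defined $tag->[1]{href}) { push @links, $tag->[1]{href} }
    }
    last;    # only the first matching zone is processed
}

print "$_\n" for @links;
```

The URL checks from the question (whitespace, scheme, fragment) would go where the href is pushed onto @links.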
I'm new to using the Perl HTML::TreeBuilder module for HTML parsing and can't figure out what the issue is here. I have spent a few hours trying to get this to work and looked at a few tutorials, but I am still getting this error: "Use of uninitialized value in pattern match", referring to this line in my code:
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
This error prints out many times in the terminal. I have checked everything over and over, and it's definitely getting the input, as $downloadedpage is a full HTML file that contains the string I give below. Any advice is greatly appreciated.
Sample string, contained within the $downloadedpage variable:
<div class='snap_preview'><p><img src="http://www.dishbase.com/recipe_images/large/chicken-enchiladas-12005010871.jpg" width="160" height="115" align="left" border="0" alt="Mexican dishes recipes" style="border:none;"><b>Mexican dishes recipes</b> <i></i><br />
Mexican cuisine is popular the world over for its intense flavor and colorful presentation. Traditional Mexican recipes such as tacos, quesadillas, enchiladas and barbacoa are consistently explored for options by some of the world’s foremost gourmet chefs. A celebration of spices and unique culinary trends, Mexican food is now dominating world cuisines.</p>
<div style="margin-top: 1em" class="possibly-related"><hr /><p><strong>Possibly related posts: (automatically generated)</strong></p><ul><li><a rel='related' href='http://vireja59.wordpress.com/2010/02/13/all-best-italian-dishes-recipes/' style='font-weight:bold'>All best Italian dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/05/24/liver-dishes-recipes/' style='font-weight:bold'>Liver dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/04/24/parsley-in-cooking/' style='font-weight:bold'>Parsley in cooking</a></li></ul></div>
My code:
my $tree = HTML::TreeBuilder->new();
$tree->parse($downloadedpage);
$tree->eof();
#the article is in the div with class "snap_preview"
@article = $tree->look_down(
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
Using the exact code and example you gave,
use warnings;
use strict;
use HTML::TreeBuilder;
my $downloadedpage=<<EOF;
<div class='snap_preview'><p><img src="http://www.dishbase.com/recipe_images/large/chicken-enchiladas-12005010871.jpg" width="160" height="115" align="left" border="0" alt="Mexican dishes recipes" style="border:none;"><b>Mexican dishes recipes</b> <i></i><br />
Mexican cuisine is popular the world over for its intense flavor and colorful presentation. Traditional Mexican recipes such as tacos, quesadillas, enchiladas and barbacoa are consistently explored for options by some of the world’s foremost gourmet chefs. A celebration of spices and unique culinary trends, Mexican food is now dominating world cuisines.</p>
<div style="margin-top: 1em" class="possibly-related"><hr /><p><strong>Possibly related posts: (automatically generated)</strong></p><ul><li><a rel='related' href='http://vireja59.wordpress.com/2010/02/13/all-best-italian-dishes-recipes/' style='font-weight:bold'>All best Italian dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/05/24/liver-dishes-recipes/' style='font-weight:bold'>Liver dishes recipes</a></li><li><a rel='related' href='http://vireja59.wordpress.com/2010/04/24/parsley-in-cooking/' style='font-weight:bold'>Parsley in cooking</a></li></ul></div>
EOF
my $tree = HTML::TreeBuilder->new();
$tree->parse($downloadedpage);
$tree->eof();
#the article is in the div with class "snap_preview"
my @article = $tree->look_down(
sub{ $_[0]-> tag() eq 'div' and ($_[0]->attr('class') =~ /snap_preview/)}
);
I don't get any errors at all. My first guess would be that there are some <div>s in the HTML which don't have a class attribute.
Maybe you need to write
sub{
$_[0]-> tag() eq 'div' and
$_[0]->attr('class') and
($_[0]->attr('class') =~ /snap_preview/)
}
there?
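An alternative sketch that avoids the explicit guard: look_down also accepts an attribute name paired with a compiled regex, and such a criterion only matches elements whose attribute is defined, so divs without a class are skipped without a warning. The short HTML string here is an illustrative stand-in for $downloadedpage.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: one div without a class, one with the wanted class.
my $html = q{<div>no class here</div>}
         . q{<div class="snap_preview"><p>Article text</p></div>};

my $tree = HTML::TreeBuilder->new_from_content($html);

# A regex criterion is only tested against defined attribute values.
my @article = $tree->look_down(_tag => 'div', class => qr/\bsnap_preview\b/);

print scalar(@article), " matching div(s)\n";
print $article[0]->as_text, "\n" if @article;
$tree->delete;
```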