Get HTML within an <a> element using WWW::Mechanize - perl

I'm extracting special links within an HTML page by using WWW::Mechanize.
my $mech = WWW::Mechanize->new();
$mech->get( $uri );
my @links = $mech->find_all_links(url_regex => qr/cgi-bin/);
for my $link ( @links ) {
# try to get everything between HERE
}
The links look like this
<div><div><span>foo bar</span> I WANT THIS TEXT</div></div>
By using $link->text I get foo bar I WANT THIS TEXT without knowing which text was inside the <span> element.
Is there any way to get the raw HTML code instead of the stripped text?
In other words I need to find a way to only get I WANT THIS TEXT without knowing the exact text within the <span> tag.

As simbabque has said, you can't do that with WWW::Mechanize.
In fact there's very little point in using WWW::Mechanize if you don't want any of its features. If all you're using it for is to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with lots of additional stuff that you don't want.
Here's an example that uses HTML::TreeBuilder to construct a parse tree of the HTML and locate the links that you want. I've used HTML::TreeBuilder because it's pretty good at tolerating malformed HTML in a way similar to modern browsers.
I've been unable to test it as you haven't provided proper sample data and I'm not inclined to create my own.
use strict;
use warnings 'all';
use feature 'say';
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
for my $link ( @{ $tree->extract_links('a') } ) {
my ($href, $elem, $attr, $tag) = @$link;
# Exclude non-CGI links (test the URL, not the array reference)
next unless $href =~ /cgi-bin/;
# Find all immediate child text nodes and concatenate them
# References are non-text children
my $text = join ' ', grep { not ref } $elem->content_list;
next unless $text =~ /\S/;
# Trim and consolidate spaces
$text =~ s/\A\s+|\s+\z//g;
$text =~ s/\s+/ /g;
say $text;
}
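If the text you want is not a direct child of the link, one option is to copy the element, drop every <span> from the copy, and read what is left. This is only a sketch: the sample markup and the helper name text_without_spans are made up for illustration, and it assumes the unwanted text always sits inside a <span>.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical helper: text of an element with all <span> descendants removed
sub text_without_spans {
    my ($elem) = @_;
    my $copy = $elem->clone;                          # leave the original tree intact
    $_->delete for $copy->look_down(_tag => 'span');  # drop every <span> in the copy
    my $text = $copy->as_trimmed_text;                # remaining text, whitespace trimmed
    $copy->delete;
    return $text;
}

# Made-up sample markup, not taken from the question
my $tree = HTML::TreeBuilder->new_from_content(
    '<a href="/cgi-bin/x"><span>foo bar</span> I WANT THIS TEXT</a>'
);

for my $link ( @{ $tree->extract_links('a') } ) {
    my ($href, $elem) = @$link;
    next unless $href =~ /cgi-bin/;
    say text_without_spans($elem);
}
$tree->delete;
```

Working on a clone keeps the parse tree usable for other lookups afterwards.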

Related

Find Favicons in HTML using Perl

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.
I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.
For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."
my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;
I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.
The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:
my $results = $mech->find_all_links( );
foreach my $result (@{ $results }) {
my $attrs = $result->attrs();
#'tag' => "apple-touch-icon"
foreach my $attr (sort keys %{ $attrs }) {
if ($attrs->{'rel'} =~ /^apple-touch-icon.*$/) {
say STDERR "I found it:" . $result->url();
}
# Add tests for other types of icons here.
# E.g. "mask-icon" and "shortcut icon."
}
}
That works, but it seems messy. Is there a better way?
Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:
link[rel*=icon i][href]
This CSS selector looks for link tags that have both the rel and href attributes. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
my $results = $ua->get( shift )
->result
->dom
->find( 'link[rel*=icon i][href]' )
->map( attr => 'href' )
->to_array
;
say join "\n", @$results;
That works pretty well so far:
$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico
$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2
$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:
link[rel*=icon][href], link[rel*=ICON][href]
or different values in rel:
link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
Line up as many of those as you like.
But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:
my %Interesting = ...;
my $results = $ua->get( shift )
->result
->dom
->find( '...' )
->grep( sub { exists $Interesting{ $_->attr('rel') } } )
->map( attr => 'href' )
->to_array
;
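For illustration only, here is one way the grep approach could look when filled in and run against an inline HTML string instead of a live page. The rel values in %Interesting, the sample markup, and the helper name icon_hrefs are all assumptions of this sketch, not part of the code above.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical set of rel values to keep (lower case)
my %Interesting = map { $_ => 1 } ('icon', 'shortcut icon', 'apple-touch-icon');

# Return an array reference of href values for the interesting <link> tags
sub icon_hrefs {
    my ($html) = @_;
    return Mojo::DOM->new($html)
        ->find('link[rel][href]')
        ->grep(sub { exists $Interesting{ lc $_->attr('rel') } })
        ->map(attr => 'href')
        ->to_array;
}

# Made-up sample page
my $html = <<'HTML';
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="icon" href="/favicon.ico">
<link rel="Apple-Touch-Icon" href="/touch.png">
</head></html>
HTML

say for @{ icon_hrefs($html) };
```

Lower-casing the rel value before the hash lookup sidesteps the case-insensitivity flag in the selector.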
I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.
The problem is easy to solve in three steps:
1) load the web page with any module that can fetch it
2) define a $regex that covers all the favicon variations of interest
3) look for <link rel="$regex" href="icon_address" ...>
Note:
The script has default YouTube url embedded in the code
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.youtube.com/';
my $icons = get_favicon($url);
say for @{$icons};
sub get_favicon {
my $url = shift;
my @lookup = (
'shortcut icon',
'apple-touch-icon',
'image_src',
'icon',
'alternative icon'
);
my $re = join('|',@lookup);
my $html = load_page($url);
my @icons = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);
return \@icons;
}
sub load_page {
my $url = shift;
my $response = HTTP::Tiny->new->get($url);
my $html;
if ($response->{success}) {
$html = $response->{content};
} else {
say 'ERROR: Could not extract webpage';
say 'Status: ' . $response->{status};
say 'Reason: ' . $response->{reason};
exit;
}
return $html;
}
Run as script.pl
https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png
Run as script.pl "http://www.microsoft.com/"
https://c.s-microsoft.com/favicon.ico?v2
Run as script.pl "http://finance.yahoo.com/"
https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg
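Note that the pattern in get_favicon only matches when rel comes directly before href and both use double quotes. A rough sketch of a more tolerant variant, still regex based and using only core Perl, might look like this. The name find_icon_hrefs and the sample markup are made up, and it remains a heuristic rather than a real HTML parser.

```perl
use strict;
use warnings;
use feature 'say';

# Return the href of every <link> tag whose rel is one of @rels,
# regardless of attribute order or quote style.
sub find_icon_hrefs {
    my ($html, @rels) = @_;
    my $re = join '|', map { quotemeta } @rels;
    my @icons;
    while ($html =~ /<link\b([^>]*)>/gi) {
        my $attrs = $1;
        # the whole quoted rel value must be one of the wanted names
        next unless $attrs =~ /\brel\s*=\s*(["'])(?:$re)\1/i;
        push @icons, $2 if $attrs =~ /\bhref\s*=\s*(["'])(.*?)\1/i;
    }
    return \@icons;
}

# Made-up sample markup
my $html = <<'HTML';
<link href='/a.png' rel='apple-touch-icon'>
<link rel="icon" type="image/x-icon" href="/favicon.ico">
<link rel="stylesheet" href="/s.css">
HTML

say for @{ find_icon_hrefs($html, 'icon', 'shortcut icon', 'apple-touch-icon') };
```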

Using Mojo::DOM to extract untagged text after heading

I'm trying to extract some text without tags from a HTML file using Mojo::DOM (I'm new at this). In particular, the description text after the H2 heading (there are other headings in the file).
<h2>Description</h2>This text is the description<div class="footer">[Edit description
I've been able to find the heading, but I don't know how to access the text after it, since I have no tag to jump to...
my $dom = Mojo::DOM->new( $htmlfile );
my $desc = $dom
->find('h2')
->grep(sub { $_->all_text =~ /Description/ })
->first;
Can anyone recommend to me a way how to grab the "This text is the description" string?
One can go through all nodes, which also catches those that aren't inside an HTML element (tag). Then use the fact that the node you need is the one that follows the h2 tag.
More precisely, it follows the text-node which is the child of the (identifiable) h2 tag-node.
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $html = q(<h2>Description</h2> This text is the description <p>More...</p>);
my $dom = Mojo::DOM->new($html);
my $is_next = 0;
foreach my $node ($dom->descendant_nodes->each) {
my $par = $node->parent;
if ($node->type eq 'text' and $par->type eq 'tag' and $par->tag eq 'h2') {
$is_next = 1;
}
elsif ($is_next) {
say $node; #--> This text is the description
$is_next = 0;
}
}
More criteria for exactly which h2 nodes are of interest can be added (unless it's really all such nodes), by interrogating either the previous text-node (text of the h2 tag) or its parent (the tag).
The node itself should likely be checked as well, for example to see whether it's indeed just loose text and not actually a next tag.
I've tested with far more complex HTML; the above is a near-minimal testable markup.
In this simple example just $dom->text catches the needed text. However, that won't be the case in more complex fragments where the sought text doesn't come after the very first element.
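Another possibility, sketched here on the assumption that your Mojolicious version provides Mojo::DOM's next_node method, is to start from the h2 element found in the question and ask for its next sibling node directly. The helper name text_after_heading and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: loose text that directly follows the <h2> with the given title
sub text_after_heading {
    my ($html, $title) = @_;
    my $h2 = Mojo::DOM->new($html)->find('h2')->first(sub { $_->text eq $title });
    return unless $h2;

    my $next = $h2->next_node;                    # next sibling node, text or tag
    return unless $next && $next->type eq 'text'; # only accept loose text

    my $text = $next->content;
    $text =~ s/\A\s+|\s+\z//g;                    # trim surrounding whitespace
    return $text;
}

# Made-up sample with more than one heading
my $html = q(<h2>Other</h2>skip me<h2>Description</h2> This text is the description <div class="footer">[Edit description</div>);
say text_after_heading($html, 'Description');
```

The type check guards against the case where the heading is followed immediately by another tag.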
Try this code, I have just added a parent element:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $html = q{<div class="container"><h2>Description</h2>This text is the description<div class="footer">[Edit description
</div></div>};
my $dom = Mojo::DOM->new($html);
print $dom->at('div.container')->text();
Alternatively, using your HTML snippet, the following can be done:
print $dom->text();

Getting html content between specific <div> tag only

I wrote the following code to scrape text content between <div id=aaa-bbb> and the next </div> tag, but it only prints out the whole HTML source.
use LWP::Simple;
$url = 'http://domain.com/?xxxxxxx';
my $content = get($url);
$data =~ m/<div id="aaa-bbb">(.*?)<\/div>/g;
if (is_success(getprint($url))) {
print $_;
}
# or using the following line directly without if statement
print $data;
The HTML piece that I'm interested in looks like this:
<div id="aaa-bbb">
<p>text text text text text text text text text</p><p>text text text</p>
</div>
That specific div tag id appears only once in the whole HTML document.
I'm also looking to strip out <p></p> tags or tidy the output by line breaks for storing as a text file later or reusing.
After reading your valuable comments I tried using
WWW::Mechanize
and
WWW::Mechanize::TreeBuilder
instead, like this
use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new;
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get( 'domain.com/?xxxxxx' );
my @list = $mech->find('div id="aaa-bbb"'); # or <div id="aaa-bbb"> or "<div id="aaa-bbb">"
foreach (@list) {
print $_->as_text();
}
It works for simple tags, but I can't get it to work with <div id="aaaa">. The script just exits without printing anything. I tried both double and single quotes, since the tag id already has double quotes inside it.
This type of parsing is much easier with a DOM parser. My parser of choice is Mojo::DOM which is part of the Mojolicious suite.
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get( 'domain.com/?xxxxxx' )->res->dom;
my $text = $dom->at('#aaa-bbb')->all_text;
The at method is a special case of the find method, which finds all the instances; at finds the first (or in your case, only). The # is the CSS selector syntax for ids.
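To also get one line per <p>, as the question asks, something along these lines might work. It is a sketch on an inline copy of the HTML from the question rather than a fetched page, and the helper name paragraphs_as_lines is made up.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: text of each <p> inside the element with the given id,
# joined with newlines
sub paragraphs_as_lines {
    my ($html, $id) = @_;
    my $div = Mojo::DOM->new($html)->at("#$id") or return '';
    return $div->find('p')->map('all_text')->join("\n")->to_string;
}

my $html = <<'HTML';
<div id="aaa-bbb">
<p>text text text text text text text text text</p><p>text text text</p>
</div>
HTML

say paragraphs_as_lines($html, 'aaa-bbb');
```

The returned string has no <p> tags left and can be written to a text file as it is.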

Using Perl, how can I remove the contents of a specific tag in an HTML document?

I'm new to Perl. I have to find and replace across multiple lines.
$content =~ s/<picture[^>]*>(.*?)<\/picture>//gis;
I tried this code, but all the tags are replaced in my file. What is my mistake? Can anyone help me?
My tag is,
<picture width='960' height='705' baseline='360'>
<pict-header>
</pict-header>
</picture>
now replaced by
<picture></picture>
Based on what I think you want to do, here is what you need to change. Your expression matches any <picture> tag up to the first </picture> tag, no matter whether it actually closes the first tag you match or not.
Assuming that it's illegal to nest <picture> tags, all you need to do is add <picture></picture> in the replacement section of the substitution, e.g.:
$content =~ s/<picture[^>]*>(.*?)<\/picture>/<picture><\/picture>/gis;
TIP: When dealing with expressions that have slashes in them, do something like this so you don't need to escape your slashes:
$content =~ s#<picture[^>]*>(.*?)</picture>#<picture></picture>#gis;
This is still not perfect! For example, this:
<picture stuff="adfgerth"><picture stuff="235wefw45"><somejunk /></picture></picture>
will be replaced with:
<picture></picture></picture>
but now you've gotten to the point where a regular expression is not enough, and you probably want an XML parser.
You can save a lot of headache by using an HTML parser to parse HTML:
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TokeParser::Simple;
die "Need filename\n" unless @ARGV == 1;
my ($filename) = @ARGV;
my $parser = HTML::TokeParser::Simple->new(file => $filename);
while (my $token = $parser->get_token) {
if ($token->is_start_tag('picture')) {
$parser->get_tag('/picture');
print "<picture></picture>";
}
else {
print $token->as_is;
}
}
Sinan's answer works, but an even better solution might be to use a DOM parser, such as Mojo::DOM (which is part of the Mojolicious framework). Then you can do some very simple manipulations like
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
my $content = <<'END';
<picture width='960' height='705' baseline='360'>
<pict-header>
</pict-header>
</picture>
END
my $dom = Mojo::DOM->new($content);
$dom->at('picture')->replace('<picture></picture>')->root;
print $dom;

Downloading Text from Several Links using WWW::Mechanize

For an entire week I have been attempting to write a code that will download links from a webpage and then loop through each link to dump the content written on each link's page. The original webpage I downloaded has 500 links to separate web pages that each contain important information for me. I only want to go one level down. However I am having several issues.
RECAP: I want to download the links from a webpage and automatically have my program print off the text contained in those links. I would prefer to have them printed in a file.
1) When I download the links from the original website, the useful ones are not written out fully. (ie they say "/festevents.nsf/all?openform" which is not a usable webpage)
2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.
#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;
#Download original webpage and acquire 500+ Links#
$url = "http://wx.toronto.ca/festevents.nsf/all?openform";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my $title = $mechanize->title;
print "<b>$title</b><br />";
my @links = $mechanize->links;
foreach my $link (@links) {
# Retrieve the link URL
my $href = $link->url_abs;
#
# $URL1= get("$link");
#
my $ua = LWP::UserAgent->new;
my $response = $ua->get($href);
unless($response->is_success) {
die $response->status_line;
}
my $URL1 = $response->decoded_content;
die Dumper($URL1);
#This part of the code is just to "clean up" the text
$Format=HTML::FormatText->new;
$TreeBuilder=HTML::TreeBuilder->new;
$TreeBuilder->parse($URL1);
$Parsed=$Format->format($TreeBuilder);
open(FILE, ">TorontoParties.txt");
print FILE "$Parsed";
close (FILE);
}
Please help me! I am desperate! If possible, please explain the logic behind each step. I have been frying my brain on this for a week, and I want help seeing other people's logic behind the problems.
Too much work. Study the WWW::Mechanize API to realise that almost all of that functionality is already built-in. Untested:
use strictures;
use WWW::Mechanize qw();
use autodie qw(:all);
open my $h, '>:encoding(UTF-8)', 'TorontoParties.txt';
my $mechanize = WWW::Mechanize->new;
$mechanize->get('http://wx.toronto.ca/festevents.nsf/all?openform');
foreach my $link (
$mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
$mechanize->get($link->url_abs);
print {$h} $mechanize->content(format => 'text');
}
close $h;
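On point 1) of the question: links such as "/festevents.nsf/all?openform" are relative URLs, and url_abs resolves them against the address of the page they came from. Roughly the same kind of resolution can be tried on its own with the URI module, as in this small sketch. The helper name absolute_url is made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use URI;

# Hypothetical helper: resolve a possibly relative link against the page URL
sub absolute_url {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

my $base = 'http://wx.toronto.ca/festevents.nsf/all?openform';

# A relative link, as printed by the question's code
say absolute_url('/festevents.nsf/all?openform', $base);

# An already absolute link is returned unchanged
say absolute_url('http://www.example.com/page', $base);
```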