Find Favicons in HTML using Perl - perl

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.
I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.
For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."
my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;
I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.
The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:
my $results = $mech->find_all_links( );
foreach my $result (#{ $results }) {
my $attrs = $result->attrs();
#'tag' => "apple-touch-icon"
foreach my $attr (sort keys %{ $attrs }) {
if ($attrs->{'rel'} =~ /^apple-touch-icon.*$/) {
say STDERR "I found it:" . $result->url();
}
# Add tests for other types of icons here.
# E.g. "mask-icon" and "shortcut icon."
}
}
That works, but it seems messy. Is there a better way?

Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:
link[rel*=icon i][href]
This CSS selector looks for link tags that have the rel and href tags at the same time. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
my $results = $ua->get( shift )
->result
->dom
->find( 'link[rel*=icon i][href]' )
->map( attr => 'href' )
->to_array
;
say join "\n", #$results;
That works pretty well so far:
$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico
$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2
$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:
link[rel*=icon][href], link[rel*=ICON][href]
or different values in rel:
link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
Line up as many of those as you like.
But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:
my %Interesting = ...;
my $results = $ua->get( shift )
->result
->dom
->find( '...' )
->grep( sub { exists $Interesting{ $_->attr('rel') } } )
->map( attr => 'href' )
->to_array
;
I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.

The problem is very easy to solve with:
assistance of any module allowing to load webpage
define $regex for all possible favicon variations
look for <link rel="$regex" href="icon_address" ...>
Note:
The script has default YouTube url embedded in the code
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.youtube.com/';
my $icons = get_favicon($url);
say for #{$icons};
sub get_favicon {
my $url = shift;
my #lookup = (
'shortcut icon',
'apple-touch-icon',
'image_src',
'icon',
'alternative icon'
);
my $re = join('|',#lookup);
my $html = load_page($url);
my #icons = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);
return \#icons;
}
sub load_page {
my $url = shift;
my $response = HTTP::Tiny->new->get($url);
my $html;
if ($response->{success}) {
$html = $response->{content};
} else {
say 'ERROR: Could not extract webpage';
say 'Status: ' . $response->{status};
say 'Reason: ' . $response->{reason};
exit;
}
return $html;
}
Run as script.pl
https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png
Run as script.pl "http://www.microsoft.com/"
https://c.s-microsoft.com/favicon.ico?v2
Run as script.pl "http://finance.yahoo.com/"
https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg

Related

How can I extract these table rows from HTML? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
Improve this question
I want to find a team score by entering team after matching then show next number or word after team name to be able to display score.
And I want this to apply on left and right side teams (left side: FT Crystal Palace 1 - right side: 1 Leicester)
The main issue is to find team score from our matching entered team.
#!/usr/bin/perl -wT
use strict;
use warnings;
use CGI;
use HTML::TableExtract;
use LWP::UserAgent;
my $cgi = CGI->new;
my $ua = LWP::UserAgent->new;
my $input = $cgi->param("team");
my $response ='';
if ($input) {
my $req = HTTP::Request->new(GET => "");
my $res = $ua->request($req);
my $html = $res->decoded_content;
$html =~ s/<span.*?<\/span>//gs;
$html =~ s/<script.*?<\/script>//gs;
$html =~ s/<td(?=[^>]*class="events-button button first-occur")[^>]*>//gs;
my $table = HTML::TableExtract->new();
$table->parse($html);
print "Content-type: text/html\n\n";
foreach my $row ($table->rows) {
my $output = join("", #$row);
$output =~ s/\R//g;
print $output;
# Find matching team (Its a must to find team after word FT)
if ($output =~ m/$input/) {
# After that team the next is score (Gets team scores)
my $score = $output =~ /$input\s*?(\S+)/;
$response ="Found Team And Score is: $score";
} else {
$response ="Can't find team";
}
}
}
print <<EOF;
<!DOCTYPE html>
<html>
<body>
<h1>Test</h1>
<form method="post">
<label for="team">Enter Team:</label>
<input type="text" name="team"><br><br>
<input type="submit" value="Find team">
</form>
<h4>$response</h4>
</body>
</html>
EOF
I like using Mojolicious for these things because everything I need is built in. With a little CSS Selector magic, you can easily zero in on the data you want without manual string processing.
Looking at the source, I see that the page has results in a table with class matches_new and within that table, the interesting rows have the class match. Within those rows, the interesting table cells are in classes team-a, team-b, and score-time (although the "score" may also be a game status).
Note that many of the rows don't load their data until you click on the team name. That's JavaScript at work and a Perl web scraping library isn't going to help you there. I don't know why the English Premier League already has data filled in for me (is it that way for everyone?) or if we can rely on that always being the case.
But let's extract that data.
The basic Mojo HTML-parsing process is to make a request then get the DOM (Document Object Model). In that DOM, find and at locate particular parts. find gets everything that matches its CSS Selector, and at gets the next match.
So, I find all the right rows, and each row is another (smaller) DOM object that I can explore further to get the teams and score.
The find returns a Mojo::Collection (fancy interface to an array) and I call map to process each row in turn and extract all the text from each table cell (and trim removes the leading and trailing whitespace). The all_text gets everything, including text in child nodes.
#!perl
use v5.10;
use Mojo::UserAgent;
use Mojo::Util qw(dumper trim);
my $url = 'https://int.soccerway.com/matches/2020/12/28/';
my $ua = Mojo::UserAgent->new;
my #results = $ua
->get( $url )
->result
->dom
->find( 'table.matches_new tr.match' )
->map( sub {
my $row = $_;
my #results =
map { trim( $row->at( $_ )->all_text ) }
qw( td.team-a td.team-b td.score-time );
return \#results;
} )
->to_array;
say dumper( #results );
Here is the output, with just the soccer matches :
[
[
"Everton",
"Manchester City",
"PSTP"
],
[
"Crystal Palace",
"Leicester City",
"1 - 1"
],
[
"Chelsea",
"Aston Villa",
"1 - 1"
]
]
You aren't strictly "wrong" for using LWP or HTTP::Request, but they don't come with powerful tools to handle the data. The HTML::TableExtract is fine, but that table it gets is long and you don't want most of its rows. In my career, I've written many, many programs like the one you presented: grab the source, remove are much irrelevant HTML as you can, and process what's left. Having all the tools in one package where everything is designed to work together in the same milieu is much nicer.
Targeting exactly what you want with selectors is much easier, less fragile, and much less code. You can see this in the Mojo docs and I wrote about in it detail with lots of examples in Mojo Web Clients.
Also, you can just use the Mojo::DOM stuff as long as you have the HTML, no matter the source. I write about this in Extracting from HTML with Mojo::DOM for Perl.com. For you, take your decoded content and give it to Mojo::DOM and do the same processing I've already done:
my $html = HTTP::Request->new( ... )->request( ... )->decoded_content;
my $dom = Mojo::DOM->new( $html );
my #results = $dom->find( ... )->...;

Using Mojo::DOM to extract untagged text after heading

I'm trying to extract some text without tags from a HTML file using Mojo::DOM (I'm new at this). In particular, the description text after the H2 heading (there are other headings in the file).
<h2>Description</h2>This text is the description<div class="footer">[Edit description
I've been able to find the heading, but don't know how to access the text after is, since I have not tag to jump to...
my $dom = Mojo::DOM->new( $htmlfile );
my $desc = $dom
->find('h2')
->grep(sub { $_->all_text =~ /Description/ })
->first;
Can anyone recommend to me a way how to grab the "This text is the description" string?
One can go through all nodes, what also catches those which aren't inside an HTML element (tag). Then use the fact that you need the node that follows the h2 tag.
More precisely, it follows the text-node which is the child of the (identifiable) h2 tag-node.
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $html = q(<h2>Description</h2> This text is the description <p>More...</p>);
my $dom = Mojo::DOM->new($html);
my $is_next = 0;
foreach my $node ($dom->descendant_nodes->each) {
my $par = $node->parent;
if ($node->type eq 'text' and $par->type eq 'tag' and $par->tag eq 'h2') {
$is_next = 1;
}
elsif ($is_next) {
say $node; #--> This text is the description
$is_next = 0;
}
}
More criteria for exactly which h2 nodes are of interest can be added (unless it's really all such nodes), by interrogating either the previous text-node (text of the h2 tag) or its parent (the tag).
The node itself should likely be checked as well, for example to see whether it's indeed just loose text and not actually a next tag.
I've tested with far more complex HTML; the above is a near-minimal testable markup.
In this simple example just $dom->text catches the needed text. However, that won't be the case in more complex fragments where the sought text doesn't come after the very first element.
Try this code, I have just added a parent element:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $html = q{<div class="container"><h2>Description</h2>This text is the description<div class="footer">[Edit description
</div></div>};
my $dom = Mojo::DOM->new($html);
print $dom->at('div.container')->text();
Alternatively, using your HTML snippet, the following can be done:
print $dom->text();

Get HTML within an <a> element using WWW::Mechanize

I'm extracting special links within an HTML page by using WWW::Mechanize.
my $mech = WWW::Mechanize->new();
$mech->get( $uri );
my #links = $mech->find_all_links(url_regex => qr/cgi-bin/);
for my $link ( #links ) {
# try to get everything between HERE
}
The links look like this
<div><div><span>foo bar</span> I WANT THIS TEXT</div></div>
By using $link->text I get foo bar I WANT THIS TEXT without knowing which text was inside the <span> element.
Is there any way to get the raw HTML code instead of the stripped text?
In other words I need to find a way to only get I WANT THIS TEXT without knowing the exact text within the <span> tag.
As simbabque has said you can't do that with WWW::Mechanize
In fact there's very little point in using WWW::Mechanize if you don't want any of its features. If all you're using it for is to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with lots of additional stuff that you don't want
Here's an example that uses HTML::TreeBuilder to construct a parse tree of the HTML and locate the links that you want. I've used HTML::TreeBuilder because it's pretty good at tolerating malformed HTML in a way similar to modern browsers
I've been unable to test it as you haven't provided proper sample data and I'm not inclined to create my own
use strict;
use warnings 'all';
use feature 'say';
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
for my $link ( #{ $tree->extract_links('a') } ) {
my ($href, $elem, $attr, $tag) = #$link;
# Exclude non-CGI links
next unless $link =~ /cgi-bin/;
# Find all immediate child text nodes and concatenate them
# References are non-text children
my $text = join ' ', grep { not ref } $elem->content_list;
next unless $text =~ /\S/;
# Trim and consolidate spaces
$text =~ s/\A\s+|\s+\z//g;
$text =~ s/\s+/ /g;
say $text;
}

How to define multiple subsections for methods with Pod::Weaver?

I have some Moose classes that define several small groups of related methods. I would like to make these groups obvious in the package POD.
I use Dist::Zilla and Pod::Weaver with the =method command. Is it possible to insert some =head2-like commands between my =method commands to achieve the desired effect?
I wrote a post on how I did it for Redis::Client here: Falling in Love with Pod::Weaver.
The simplest thing to do is add custom Collect directives to your weaver.ini and organize your methods by giving each type a different custom POD command, like so:
[Collect / FOO METHODS]
command = foo_method
[Collect / BAR METHODS]
command = bar_method
[Collect / BAZ METHODS]
command = baz_method
Then write your POD like this
=foo_method blah blah
and Weaver will automatically collect them under their own =head1.
If you want to do something more complicated than that, you can write your own Pod::Weaver plugin. The gist is to search through the parsed POD for a custom command name and transform them by returning Pod::Elemental objects. Here's the plugin I wrote:
package Pod::Weaver::Plugin::RedisLinks;
# ABSTRACT: Add links to Redis documentation
use Moose;
with 'Pod::Weaver::Role::Transformer';
use Data::Dumper;
use Scalar::Util 'blessed';
use aliased 'Pod::Elemental::Element::Pod5::Ordinary';
sub transform_document {
my ( $self, $doc ) = #_;
my #children = $doc->children;
my #new_children;
foreach my $child( #{ $children[0] } ) {
if ( $child->can( 'command' )
&& $child->command =~ /^(?:key|str|list|hash|set|zset|conn|serv)_method/ ) {
my $meth_name = $child->content;
$meth_name =~ s/^\s*?(\S+)\s*$/$1/;
my $cmd_name = uc $meth_name;
$cmd_name =~ tr/_/ /;
my $link_name = $meth_name;
$link_name =~ tr/_/-/;
my $new_para = Ordinary->new(
content => sprintf 'Redis L<%s|%s> command.',
$cmd_name, 'http://redis.io/commands/' . $link_name );
push #new_children, $child, $new_para;
next;
}
push #new_children, $child;
}
$doc->children( \#new_children );
}
__PACKAGE__->meta->make_immutable;
1;
The transform_document method gets passed the parsed document as a parameter. It then goes through the top-level commands looking for elements labeled /^(?:key|str|list|hash|set|zset|conn|serv)_method/, munges the name a bit, and then builds a new POD paragraph containing the formatted POD content that I want.

perl html parsing lib/tool

Is there some powerful tools/libs for perl like BeautifulSoup to python?
Thanks
HTML::TreeBuilder::XPath is a decent solution for most problems.
I never used BeautifulSoup, but from quick skim over its documentation you might want HTML::TreeBuilder. It can process even broken documents well and allows traverse over parsed tree or query items - look at look_down method in HTML::Element.
If you like/know XPath, see daxim's recommendation. If you like to pick items via CSS selector, have a look at Web::Scraper or Mojo::DOM.
As you're looking for power, you can use XML::LibXML to parse HTML. The advantage then is that you have all the power of the fastest and best XML toolchain (excecpt MSXML, which is MS only) available to Perl to process your document, including XPath and XSLT (which would require a re-parse if you used another parser than XML::LibXML).
use strict;
use warnings;
use XML::LibXML;
# In 1.70, the recover and suppress_warnings options won't shup up the
# warnings. Hence, a workaround is needed to keep the messages away from
# the screen.
sub shutup_stderr {
my( $subref, $bufref ) = #_;
open my $fhbuf, '>', $bufref;
local *STDERR = $fhbuf;
$subref->(); # execute code that needs to be shut up
return;
}
# ==== main ============================================================
my $url = shift || 'http://www.google.de';
my $parser = XML::LibXML->new( recover => 2 ); # suppress_warnings => 1
# Note that "recover" and "suppress_warnings" might not work - see above.
# https://rt.cpan.org/Public/Bug/Display.html?id=58024
my $dom; # receive document
shutup_stderr
sub { $dom = $parser->load_html( location => $url ) }, # code
\my $errmsg; # buffer
# Now process document as XML.
my #nodes = $dom->getElementsByLocalName( 'title' );
printf "Document title: %s\n", $_->textContent for #nodes;
printf "Lenght of error messages: %u\n", length $errmsg;
print '-' x 72, "\n";
print $dom->toString( 1 );