How can I extract these table rows from HTML? [closed] - perl

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
Improve this question
I want to find a team score by entering team after matching then show next number or word after team name to be able to display score.
And I want this to apply on left and right side teams (left side: FT Crystal Palace 1 - right side: 1 Leicester)
The main issue is to find team score from our matching entered team.
#!/usr/bin/perl -wT
use strict;
use warnings;
use CGI;
use HTML::TableExtract;
use LWP::UserAgent;
my $cgi = CGI->new;
my $ua = LWP::UserAgent->new;
my $input = $cgi->param("team");
my $response ='';
if ($input) {
my $req = HTTP::Request->new(GET => "");
my $res = $ua->request($req);
my $html = $res->decoded_content;
$html =~ s/<span.*?<\/span>//gs;
$html =~ s/<script.*?<\/script>//gs;
$html =~ s/<td(?=[^>]*class="events-button button first-occur")[^>]*>//gs;
my $table = HTML::TableExtract->new();
$table->parse($html);
print "Content-type: text/html\n\n";
foreach my $row ($table->rows) {
my $output = join("", #$row);
$output =~ s/\R//g;
print $output;
# Find matching team (Its a must to find team after word FT)
if ($output =~ m/$input/) {
# After that team the next is score (Gets team scores)
my $score = $output =~ /$input\s*?(\S+)/;
$response ="Found Team And Score is: $score";
} else {
$response ="Can't find team";
}
}
}
print <<EOF;
<!DOCTYPE html>
<html>
<body>
<h1>Test</h1>
<form method="post">
<label for="team">Enter Team:</label>
<input type="text" name="team"><br><br>
<input type="submit" value="Find team">
</form>
<h4>$response</h4>
</body>
</html>
EOF

I like using Mojolicious for these things because everything I need is built in. With a little CSS Selector magic, you can easily zero in on the data you want without manual string processing.
Looking at the source, I see that the page has results in a table with class matches_new and within that table, the interesting rows have the class match. Within those rows, the interesting table cells are in classes team-a, team-b, and score-time (although the "score" may also be a game status).
Note that many of the rows don't load their data until you click on the team name. That's JavaScript at work and a Perl web scraping library isn't going to help you there. I don't know why the English Premier League already has data filled in for me (is it that way for everyone?) or if we can rely on that always being the case.
But let's extract that data.
The basic Mojo HTML-parsing process is to make a request then get the DOM (Document Object Model). In that DOM, find and at locate particular parts. find gets everything that matches its CSS Selector, and at gets the next match.
So, I find all the right rows, and each row is another (smaller) DOM object that I can explore further to get the teams and score.
The find returns a Mojo::Collection (fancy interface to an array) and I call map to process each row in turn and extract all the text from each table cell (and trim removes the leading and trailing whitespace). The all_text gets everything, including text in child nodes.
#!perl
use v5.10;
use Mojo::UserAgent;
use Mojo::Util qw(dumper trim);
my $url = 'https://int.soccerway.com/matches/2020/12/28/';
my $ua = Mojo::UserAgent->new;
my #results = $ua
->get( $url )
->result
->dom
->find( 'table.matches_new tr.match' )
->map( sub {
my $row = $_;
my #results =
map { trim( $row->at( $_ )->all_text ) }
qw( td.team-a td.team-b td.score-time );
return \#results;
} )
->to_array;
say dumper( #results );
Here is the output, with just the soccer matches :
[
[
"Everton",
"Manchester City",
"PSTP"
],
[
"Crystal Palace",
"Leicester City",
"1 - 1"
],
[
"Chelsea",
"Aston Villa",
"1 - 1"
]
]
You aren't strictly "wrong" for using LWP or HTTP::Request, but they don't come with powerful tools to handle the data. The HTML::TableExtract is fine, but that table it gets is long and you don't want most of its rows. In my career, I've written many, many programs like the one you presented: grab the source, remove are much irrelevant HTML as you can, and process what's left. Having all the tools in one package where everything is designed to work together in the same milieu is much nicer.
Targeting exactly what you want with selectors is much easier, less fragile, and much less code. You can see this in the Mojo docs and I wrote about in it detail with lots of examples in Mojo Web Clients.
Also, you can just use the Mojo::DOM stuff as long as you have the HTML, no matter the source. I write about this in Extracting from HTML with Mojo::DOM for Perl.com. For you, take your decoded content and give it to Mojo::DOM and do the same processing I've already done:
my $html = HTTP::Request->new( ... )->request( ... )->decoded_content;
my $dom = Mojo::DOM->new( $html );
my #results = $dom->find( ... )->...;

Related

Find Favicons in HTML using Perl

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.
I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.
For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."
my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;
I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.
The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:
my $results = $mech->find_all_links( );
foreach my $result (#{ $results }) {
my $attrs = $result->attrs();
#'tag' => "apple-touch-icon"
foreach my $attr (sort keys %{ $attrs }) {
if ($attrs->{'rel'} =~ /^apple-touch-icon.*$/) {
say STDERR "I found it:" . $result->url();
}
# Add tests for other types of icons here.
# E.g. "mask-icon" and "shortcut icon."
}
}
That works, but it seems messy. Is there a better way?
Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:
link[rel*=icon i][href]
This CSS selector looks for link tags that have the rel and href tags at the same time. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
my $results = $ua->get( shift )
->result
->dom
->find( 'link[rel*=icon i][href]' )
->map( attr => 'href' )
->to_array
;
say join "\n", #$results;
That works pretty well so far:
$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico
$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2
$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:
link[rel*=icon][href], link[rel*=ICON][href]
or different values in rel:
link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
Line up as many of those as you like.
But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:
my %Interesting = ...;
my $results = $ua->get( shift )
->result
->dom
->find( '...' )
->grep( sub { exists $Interesting{ $_->attr('rel') } } )
->map( attr => 'href' )
->to_array
;
I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.
The problem is very easy to solve with:
assistance of any module allowing to load webpage
define $regex for all possible favicon variations
look for <link rel="$regex" href="icon_address" ...>
Note:
The script has default YouTube url embedded in the code
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.youtube.com/';
my $icons = get_favicon($url);
say for #{$icons};
sub get_favicon {
my $url = shift;
my #lookup = (
'shortcut icon',
'apple-touch-icon',
'image_src',
'icon',
'alternative icon'
);
my $re = join('|',#lookup);
my $html = load_page($url);
my #icons = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);
return \#icons;
}
sub load_page {
my $url = shift;
my $response = HTTP::Tiny->new->get($url);
my $html;
if ($response->{success}) {
$html = $response->{content};
} else {
say 'ERROR: Could not extract webpage';
say 'Status: ' . $response->{status};
say 'Reason: ' . $response->{reason};
exit;
}
return $html;
}
Run as script.pl
https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png
Run as script.pl "http://www.microsoft.com/"
https://c.s-microsoft.com/favicon.ico?v2
Run as script.pl "http://finance.yahoo.com/"
https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg

Using Mojo::DOM to extract untagged text after heading

I'm trying to extract some text without tags from a HTML file using Mojo::DOM (I'm new at this). In particular, the description text after the H2 heading (there are other headings in the file).
<h2>Description</h2>This text is the description<div class="footer">[Edit description
I've been able to find the heading, but don't know how to access the text after is, since I have not tag to jump to...
my $dom = Mojo::DOM->new( $htmlfile );
my $desc = $dom
->find('h2')
->grep(sub { $_->all_text =~ /Description/ })
->first;
Can anyone recommend to me a way how to grab the "This text is the description" string?
One can go through all nodes, what also catches those which aren't inside an HTML element (tag). Then use the fact that you need the node that follows the h2 tag.
More precisely, it follows the text-node which is the child of the (identifiable) h2 tag-node.
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $html = q(<h2>Description</h2> This text is the description <p>More...</p>);
my $dom = Mojo::DOM->new($html);
my $is_next = 0;
foreach my $node ($dom->descendant_nodes->each) {
my $par = $node->parent;
if ($node->type eq 'text' and $par->type eq 'tag' and $par->tag eq 'h2') {
$is_next = 1;
}
elsif ($is_next) {
say $node; #--> This text is the description
$is_next = 0;
}
}
More criteria for exactly which h2 nodes are of interest can be added (unless it's really all such nodes), by interrogating either the previous text-node (text of the h2 tag) or its parent (the tag).
The node itself should likely be checked as well, for example to see whether it's indeed just loose text and not actually a next tag.
I've tested with far more complex HTML; the above is a near-minimal testable markup.
In this simple example just $dom->text catches the needed text. However, that won't be the case in more complex fragments where the sought text doesn't come after the very first element.
Try this code, I have just added a parent element:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $html = q{<div class="container"><h2>Description</h2>This text is the description<div class="footer">[Edit description
</div></div>};
my $dom = Mojo::DOM->new($html);
print $dom->at('div.container')->text();
Alternatively, using your HTML snippet, the following can be done:
print $dom->text();

Bypass optional field in pattern match

I'm trying to pull out the names of the players and totals, but in some cases there is an extra html tag following the number of the player in the list. So how can I bypass that extra field when it appears. I can't put parenthesis around it because it will try to match it, correct?
<tr><td>10<td>MANNY MACHADO - FA</td><td>37</td></tr>
<tr><td>107</td><td>ALEDMYS DIAZ - HOU</td><td>18</td></tr>
while($content =~ /<tr><td>\d+?\S+?<td>(.*?)\s-.*?<\/td><td>(\d+?)</g) {
my $player = $1;
my $total = $2;
print "\nPlayer => $player Total => $total\n";
}
I tried using the '\S+?' to bypass it, but in this case it doesn't print out anything where the number of the player is less than 10.
It is generally a bad idea to use regexes for HTML, XML, etc.
Instead you should use an appropriate parser to convert it to a DOM and then implement your algorithm in the DOM domain. Using your example:
parse the HTML from file or string
(find the correct table in the document - left out in the example as I don't have the complete HTML)
loop over the rows in the table
extract the information you are looking for from the columns of a row
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $parser = new HTML::TreeBuilder;
my $root = $parser->parse_file(\*DATA)
or die "HTML\n";
foreach my $row ($root->look_down(_tag => 'tr')) {
if (my #columns = $row->look_down(_tag => 'td')) {
my $player = $columns[1]->as_text();
my $total = $columns[2]->as_text();
print "Player => $player Total => $total\n";
}
}
exit 0;
__DATA__
<body>
<tr><td>10<td>MANNY MACHADO - FA</td><td>37</td></tr>
<tr><td>107</td><td>ALEDMYS DIAZ - HOU</td><td>18</td></tr>
</body>
Test run:
$ perl dummy.pl
Player => MANNY MACHADO - FA Total => 37
Player => ALEDMYS DIAZ - HOU Total => 18
With Mojo::DOM:
use strict;
use warnings;
use Mojo::DOM;
my $html = <<'EOD';
<tr><td>10<td>MANNY MACHADO - FA</td><td>37</td></tr>
<tr><td>107</td><td>ALEDMYS DIAZ - HOU</td><td>18</td></tr>
EOD
my $dom = Mojo::DOM->new($html);
foreach my $tr ($dom->find('tr')->each) {
my #cells = $tr->children('td')->each;
my $player = $cells[1]->all_text;
my $total = $cells[2]->all_text;
# or alternatively
my $player = $tr->at('td:nth-of-type(2)')->all_text;
my $total = $tr->at('td:nth-of-type(3)')->all_text;
print "\nPlayer => $player Total => $total\n";
}
You need to match an optional </tr>, so You can do that with the following (?:<\/tr>)? in your regex. This makes a non capturing group, because of the ?: at the start, that matches 0 or 1 times. So your new regex is
/<tr><td>\d+(?:<\/td>)?<td>(.*?)\s-.*?<\/td><td>(\d+?)</g
Normally I'd add a bit about not using regex to parse HTML, but as this is not well formed HTML I'll let it pass. However if you can exercise some control over what is creating the HTML try and fix it so that the <td> and </td> tags are balanced.
I'm also someone who would go for a proper HTML or XML modul to extract information like others above already said. So I will not elaborate on that.
If I would have to extract from the wrongly formatted html you showed, I'd stick with a multi step aproach.
cleanup
extract
more cleanup
For cleanup I'd first check what's common. In this case every line starts with <tr> so I'd settle for that to find my lines, skipping those not starting with <tr>, after some optional whitespace:
while (<>) {
next unless /^\s*<tr>/;
The next common thing I noticed is that every interesting field starts with td. So I'd replace it with something more easy like a tab. Assuming there could be tabs already, I'd first replace them with spaces:
tr/\t/ /;
s/<td>/\t/g;
Now what I have is some tags sprinkled around the data I really need. And the data I really need is prepended with a tab. So let's delete the tags:
s/<.*?>//g;
Finally I can extract my data:
my($dummy, $number, $player, $total)= split /\t/;
But since the player has some stuff appended (after -) let's remove that as well
$player=~ s/\s-.*//;
print "\nPlayer => $player Total => $total\n";
}
Putting it together and using DATA:
while (<DATA>) {
next unless /^\s*<tr>/;
tr/\t/ /;
s/<td>/\t/g;
s/<.*?>//g;
my($dummy, $number, $player, $total)= split /\t/;
$player=~ s/\s-.*//;
print "\nPlayer => $player Total => $total\n";
}
__DATA__
<tr><td>10<td>MANNY MACHADO - FA</td><td>37</td></tr>
<tr><td>107</td><td>ALEDMYS DIAZ - HOU</td><td>18</td></tr>
Please be prepared that you might come across data with more whitespace and the approach will fail.
Example:
<tr>
<td>10
<td>MANNY MACHADO - FA</td>
<td>37</td>
</tr>
<tr><td>107</td>
<td>ALEDMYS DIAZ - HOU</td>
<td>18</td>
</tr>

Get HTML within an <a> element using WWW::Mechanize

I'm extracting special links within an HTML page by using WWW::Mechanize.
my $mech = WWW::Mechanize->new();
$mech->get( $uri );
my #links = $mech->find_all_links(url_regex => qr/cgi-bin/);
for my $link ( #links ) {
# try to get everything between HERE
}
The links look like this
<div><div><span>foo bar</span> I WANT THIS TEXT</div></div>
By using $link->text I get foo bar I WANT THIS TEXT without knowing which text was inside the <span> element.
Is there any way to get the raw HTML code instead of the stripped text?
In other words I need to find a way to only get I WANT THIS TEXT without knowing the exact text within the <span> tag.
As simbabque has said you can't do that with WWW::Mechanize
In fact there's very little point in using WWW::Mechanize if you don't want any of its features. If all you're using it for is to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with lots of additional stuff that you don't want
Here's an example that uses HTML::TreeBuilder to construct a parse tree of the HTML and locate the links that you want. I've used HTML::TreeBuilder because it's pretty good at tolerating malformed HTML in a way similar to modern browsers
I've been unable to test it as you haven't provided proper sample data and I'm not inclined to create my own
use strict;
use warnings 'all';
use feature 'say';
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
for my $link ( #{ $tree->extract_links('a') } ) {
my ($href, $elem, $attr, $tag) = #$link;
# Exclude non-CGI links
next unless $link =~ /cgi-bin/;
# Find all immediate child text nodes and concatenate them
# References are non-text children
my $text = join ' ', grep { not ref } $elem->content_list;
next unless $text =~ /\S/;
# Trim and consolidate spaces
$text =~ s/\A\s+|\s+\z//g;
$text =~ s/\s+/ /g;
say $text;
}

Downloading Text from Several Links using WWW::Mechanize

For an entire week I have been attempting to write a code that will download links from a webpage and then loop through each link to dump the content written on each link's page. The original webpage I downloaded has 500 links to separate web pages that each contain important information for me. I only want to go one level down. However I am having several issues.
RECAP: I want to download the links from a webpage and automatically have my program print off the text contained in those links. I would prefer to have them printed in a file.
1) When I download the links from the original website, the useful ones are not written out fully. (ie they say "/festevents.nsf/all?openform" which is not a usable webpage)
2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.
#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;
#Download original webpage and acquire 500+ Links#
$url = "http://wx.toronto.ca/festevents.nsf/all?openform";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my $title = $mechanize->title;
print "<b>$title</b><br />";
my #links = $mechanize->links;
foreach my $link (#links) {
# Retrieve the link URL
my $href = $link->url_abs;
#
# $URL1= get("$link");
#
my $ua = LWP::UserAgent->new;
my $response = $ua->get($href);
unless($response->is_success) {
die $response->status_line;
}
my $URL1 = $response->decoded_content;
die Dumper($URL1);
#This part of the code is just to "clean up" the text
$Format=HTML::FormatText->new;
$TreeBuilder=HTML::TreeBuilder->new;
$TreeBuilder->parse($URL1);
$Parsed=$Format->format($TreeBuilder);
open(FILE, ">TorontoParties.txt");
print FILE "$Parsed";
close (FILE);
}
Please help me! I am desperate! If possible please explain to me the logic behind each step? I have been frying my brain on this for a week and I want help seeing other peoples logic behind the problems.
Too much work. Study the WWW::Mechanize API to realise that almost all of that functionality is already built-in. Untested:
use strictures;
use WWW::Mechanize qw();
use autodie qw(:all);
open my $h, '>:encoding(UTF-8)', 'TorontoParties.txt';
my $mechanize = WWW::Mechanize->new;
$mechanize->get('http://wx.toronto.ca/festevents.nsf/all?openform');
foreach my $link (
$mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
$mechanize->get($link->url_abs);
print {$h} $mechanize->content(format => 'text');
}
close $h;