Is there any simple way to convert an HTML file into a Perl hash? For example, a working Perl module or something?
I searched on cpan.org but didn't find anything that does what I want. I want to do something like this:
use Example::Module;
my $hashref = Example::Module->new('/path/to/mydoc.html');
After this I want to refer to the second div element, something like this:
my $second_div = $hashref->{'body'}->{'div'}[1];
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findByClass('.myclassname');
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findById('#myid');
Is there any working solution for this?
HTML::TreeBuilder::XPath gives you a lot more power than a simple hash would.
From the synopsis:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( "mypage.html");
my $nb=$tree->findvalue('/html/body//p[@class="section_title"]/span[@class="nb"]');
my $id=$tree->findvalue('/html/body//p[@class="section_title"]/@id');
my $p= $tree->findnodes('//p[@id="toto"]')->[0];
my $link_texts= $p->findvalue( './a'); # the texts of all a elements in $p
$tree->delete; # to avoid memory leaks, if you parse many HTML documents
More on XPath.
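Applied to the structure from the question, that might look roughly like this (a sketch only; the file path, class name and id are the ones from the question, not tested against a real document):
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file('/path/to/mydoc.html');
# XPath positions are 1-based, so div[2] is the second div under body
my $second_div = $tree->findnodes('//body/div[2]')->[0];
my $by_class   = $tree->findnodes('//div[@class="myclassname"]')->[0];
my $by_id      = $tree->findnodes('//div[@id="myid"]')->[0];
print $second_div->as_text if $second_div;
$tree->delete;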
Mojo::DOM builds a simple DOM that can be accessed in a CSS-selector style:
# Find
say $dom->at('#b')->text;
say $dom->find('p')->pluck('text');
say $dom->find('[id]')->pluck(attr => 'id');
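For the structure in the question, a rough Mojo::DOM equivalent might look like this (a sketch assuming a reasonably recent Mojolicious; the path, class and id are taken from the question):
use Mojo::DOM;
use Mojo::File qw(path);
my $dom = Mojo::DOM->new( path('/path/to/mydoc.html')->slurp );
my $second_div = $dom->find('body > div')->[1];   # collections are 0-based
my $by_class   = $dom->at('div.myclassname');
my $by_id      = $dom->at('#myid');
print $second_div->text if $second_div;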
In case you're using XHTML, you could also use XML::Simple, which produces a data structure similar to the one you describe.
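A minimal sketch of that approach, assuming the file really is well-formed XHTML (XML::Simple will refuse ordinary, non-XML HTML):
use XML::Simple qw(XMLin);
my $hashref = XMLin(
    '/path/to/mydoc.html',
    ForceArray => ['div'],   # keep {div} an arrayref even when there is only one
    KeyAttr    => [],        # don't fold elements into hashes keyed by id/name
);
my $second_div = $hashref->{body}{div}[1];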
Related
I'm a beginner. I want to know how to fetch one table from the HTML source of a page using the LWP module. Is it possible to use a regex with LWP?
You can use LWP to get the HTML source of a web page. Most easily, by using the get() function from LWP::Simple.
my $html = get('http://example.com/');
Now, in $html you have a text string (potentially a very long text string) which contains HTML. You can use any techniques you want to extract data from that string.
(Hint: Using a regex to do this is likely to be a very bad idea. It will be far harder than you expect and probably very fragile. Perhaps use a better tool - like HTML::TableExtract instead.)
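If the table has recognizable column headers, a hedged sketch with HTML::TableExtract might look like this (the URL and header names are placeholders, not taken from the question):
use LWP::Simple qw(get);
use HTML::TableExtract;
my $html = get('http://example.com/') or die "download failed";
# pick out only the table whose header row contains these (hypothetical) labels
my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
$te->parse($html);
for my $table ($te->tables) {
    for my $row ($table->rows) {
        print join( "\t", map { $_ // '' } @$row ), "\n";
    }
}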
use Web::Query::LibXML 'wq';
wq('https://www.december.com/html/demo/table.html')
->find('table th')
->each(sub {
my (undef, $e) = @_;
print $e->text . "\n";
});
__END__
Outer Table
Inner Table
CORNER
Head1
Head2
Head3
Head4
Head5
Head6
Little
I want to write a little script to grab and play around with the results of the autocomplete of an input field.
I found a basic approach using Selenium::Firefox, which is based on the module Selenium::Remote::Driver, but the description of its methods comes without any examples.
I have this basic example, which is able to open Google and insert a search string.
Then you can see that a result list is suggested, and I want to get this list.
But I have no idea how this can be obtained.
Here is my code so far:
#!/usr/bin/perl
use strict;
use warnings;
use Selenium::Firefox;
my $mech = Selenium::Firefox->new(
    startup_timeout    => 20,
    firefox_binary     => '/srv/bin/firefox.62.0/firefox',
    binary             => '/usr/local/bin/geckodriver',
    marionette_enabled => 1,
);
my $search = "perl";
my $url = "https://www.google.com/";
$mech->get($url);
$mech->find_element_by_name("q");
sleep(3);
my $result = $mech->get_active_element();
$result->send_keys($search);
sleep (10);
$mech->shutdown_binary;
exit 0;
I could find no examples of how to use this Perl module, and there are more questions about it.
For instance: find_element.
How can I turn errors into warnings instead of having them kill the script?
Or how can I step through the objects of the web page?
Is it possible to connect to an already opened browser?
The description of the module is hard to follow for people who are not experts, and the authors have not answered questions so far.
But my hope is that the experts here can give me a hint.
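One way to get at the suggestion list is to wait for it to appear and then collect its elements with find_elements on the driver object from the code above. This is only a sketch: the CSS selector 'ul[role="listbox"] li' is a guess at Google's current markup and will probably need adjusting in the browser's inspector:
sleep(3);   # crude wait; better would be polling until the list shows up
my @suggestions = $mech->find_elements('ul[role="listbox"] li', 'css');
for my $s (@suggestions) {
    print $s->get_text, "\n";
}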
I'm trying to scrape some data from the metacriti* website using Mechanize, but I'm getting no output.
Here's my code, with an example URL:
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
my $metaURL = "http://www.metacriti*.com/game/pc/dota-2";
my $mech = WWW::Mechanize->new();
$mech->get($metaURL) or die "unable to get $metaURL";
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($mech->content);
my @nodes = $tree->findnodes(q{//*[@id="main"]//a[contains(./@href, "user-reviews")]/span[@class="score_value"]});
print $_->string_value, "\n" foreach (@nodes); # text
The @nodes array seems to be empty. My XPath seems good, and since I'm using the same syntax in another working script, I really couldn't figure out what is wrong with this one...
Also, since this is just the beginning, maybe you can suggest another easy way to scrape/parse websites, if there's a better one :)
Thank you in advance.
The HTML seems to be really bad. If you search for $tree->findnodes('//div[@id="main"]')->[0]->as_HTML you get a very bare div:
<div class="col main_col" id="main"><div itemscope="itemscope" itemtype="http://schema.org/SoftwareApplication"></div></div>
This indeed does not contain any a elements, which explains the result you got.
I tried using tidy to pretty print the HTML, but it barfed on the file.
If you forget about the div and use q{//a[contains(./@href, "user-reviews")]/span[@class="score_value"]} you will get a result though, 7.9 in this case.
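Putting that together with the question's code, a working version might look like this (a sketch based on the markup as it was at the time; the page structure may of course have changed since):
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
my $mech = WWW::Mechanize->new();
$mech->get($metaURL);   # same $metaURL as in the question
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($mech->content);
# drop the #main restriction and match the link by its href instead
my @nodes = $tree->findnodes(q{//a[contains(./@href, "user-reviews")]/span[@class="score_value"]});
print $_->string_value, "\n" for @nodes;
$tree->delete;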
I feel like I must be missing something obvious. I know the XPath to a WebElement and I want to compare the text in that element to some string.
Say the XPath for the element is /html/body/div/a/strong and I want to compare it to $str.
So it shows up in source like ...<strong>find this string</strong>
So I say
use strict;
use warnings;
use Selenium::Remote::Driver;
# Fire up a Selenium object, get to the page, etc..
Test::More::ok($sel->find_element("/html/body/div/a") eq $str, "Text matched");
When I run this, the test fails when it should pass. When I try to print the value of find_element($xpath) I get some hash reference. I've googled around some and found examples telling me to try find_element($xpath)->get_text(), but get_text() isn't even a method in the original package. Is it an obsolete method that used to actually work?
Most of the examples online for this module say "It's easy!" then show me how to get_title() but not how to check the text at an XPath. I might be going crazy.
Following up on our comment thread, I'm posting this as an answer:
Selenium::Remote::WebDriver definitely has a method named find_element(), and Selenium::Remote::WebElement definitely has a method named get_text().
Something like this...
my $text = $sel->find_element(...)->get_text();
...works fine on my end, though it looks like it'll error out if the element isn't found.
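For completeness, a small self-contained sketch of that pattern (the page URL and expected string are placeholders, and the eval guards against the element not being found):
use strict;
use warnings;
use Test::More tests => 1;
use Selenium::Remote::Driver;
my $sel = Selenium::Remote::Driver->new;    # remote_server_addr/port as needed
$sel->get('http://example.com/');           # placeholder page
my $str  = 'find this string';
my $text = eval { $sel->find_element('/html/body/div/a/strong')->get_text() };
is( $text, $str, 'Text matched' );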
Suppose I have:
my $a = "http://site.com";
my $part = "index.html";
my $full = join($a,$part);
print $full;
>> http://site.com/index.html
What do I have to use as join, in order to get my snippet to work?
EDIT: I'm looking for something more general. What if $a ends with a slash, and $part starts with one? I'm sure some module has this covered.
I believe what you're looking for is URI::Split, e.g.:
use URI::Split qw(uri_join);
my $uri = uri_join('http', 'site.com', 'index.html');
use URI;
URI->new("index.html")->abs("http://site.com")
will produce
"http://site.com/index.html"
URI->abs will take care of merging the paths properly, following the URI specification,
so
URI->new("/bah")->abs("http://site.com/bar")
will produce
"http://site.com/bah"
and
URI->new("index.html")->abs("http://site.com/barf")
will produce
"http://site.com/barf/index.html"
and
URI->new("../uplevel/foo")->abs("http://site.com/foo/bar/barf")
will produce
"http://site.com/foo/uplevel/foo"
Alternatively, there's a shortcut sub in the URI namespace that I just noticed:
URI->new_abs($url, $base_url)
so
URI->new_abs("index.html", "http://site.com")
will produce
"http://site.com/index.html"
and so on.
No need for join(), just use string interpolation.
my $a = "http://site.com";
my $part = "index.html";
my $full = "$a/$part";
print $full;
>> http://site.com/index.html
Update:
Not everything requires a module. CPAN is wonderful, but restraint is needed.
The simple approach above works very well if you have clean inputs. If you need to handle unevenly formatted strings, you will need to normalize them somehow. Using a library in the URI namespace that meets your needs is probably the best course of action if you need to handle user input. If the variance is minor, File::Spec or a little manual clean-up may be good enough for your needs.
my $a = 'http://site.com';
my @paths = qw( /foo/bar foo //foo/bar );
# bad paths don't work:
print join "\n", "Bad URIs:", map "$a/$_", @paths;
# strip leading slashes before joining:
my @cleaned = map { ( my $p = $_ ) =~ s{^/+}{}; $p } @paths;
print join "\n", "Cleaned URIs:", map "$a/$_", @cleaned;
When you have to handle bad stuff like $path = '/./foo/.././foo/../foo/bar'; is when you definitely want to use a library. Of course, this could largely be sorted out using File::Spec's canonpath function (though note that canonpath deliberately does not collapse x/../y into y).
If you are worried about bad/bizarre stuff in the URI rather than just path issues (usernames, passwords, bizarre protocol specifiers) or URL encoding of strings, then using a URI library is really important, and is indisputably not overkill.
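As a small illustration of the encoding point, URI's query_form method takes care of escaping parameter values for you (a sketch with made-up parameters):
use URI;
my $u = URI->new('http://site.com/search');
$u->query_form( q => 'foo bar/baz', lang => 'en' );
print "$u\n";   # http://site.com/search?q=foo+bar%2Fbaz&lang=en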
You might want to take a look at this, an implementation of a function similar to Python's urljoin, but in Perl:
http://sveinbjorn.org/urljoin_function_implemented_using_Perl
As I am used to Java's java.net.URL methods, I was looking for a similar way to concatenate URIs without any assumption about scheme, host or port (in my case, it is for possibly complex Subversion URLs):
http://site.com/page/index.html
+ images/background.jpg
=> http://site.com/page/images/background.jpg
Here is the way to do it in Perl:
use URI;
my $base = URI->new("http://site.com/page/index.html");
my $result = URI->new_abs("images/background.jpg", $base);