I'm trying to scrape some data from metacriti* website using mechanize, but I'm getting no output
Here's my code with a url example:
my $metaURL = "http://www.metacriti*.com/game/pc/dota-2";
my $mech = WWW::Mechanize->new();
$mech->get($metaURL) or die "unable to get $metaURL";
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($mech->content);
my #nodes = $tree->findnodes(q{//*[#id="main"]//a[contains(./#href, "user-reviews")]/span[#class="score_value"]});
print $_->string_value, "\n" foreach(#nodes); # text
#nodes array seems to be empty, my xpath seems good and since i'm using the same syntax in another working script, I really couldnt figure out what is wrong with this one...
Also since this is just the begining, maybe you can suggest me another easy way to scrape/parse websites... If there's any better one :)
Thank you in advance
The HTML seems to be really bad, if you search for $tree->findnodes( '//div[#id="main"]')->[0]->as_HTML you get a very bare div:
<div class="col main_col" id="main"><div itemscope="itemscope" itemtype="http://schema.org/SoftwareApplication"></div></div>
this indeed does not contain any a, which explains the result you got.
I tried using tidy to pretty print the HTML, but it barfed on the file.
If you forget about the div and use q{//a[contains(./#href, "user-reviews")]/span[#class="score_value"]} you will get a result though, 7.9 in this case.
Related
I'm beginner. I want to know how to fetch one table form the source HTML file using LWP module? Is it possible to use Regex with LWP?
You can use LWP to get the HTML source of a web page. Most easily, by using the get() function from LWP::Simple.
my $html = get('http://example.com/');
Now, in $html you have a text string (potentially a very long text string) which contains HTML. You can use any techniques you want to extract data from that string.
(Hint: Using a regex to do this is likely to be a very bad idea. It will be far harder than you expect and probably very fragile. Perhaps use a better tool - like HTML::TableExtract instead.)
use Web::Query::LibXML 'wq';
wq('https://www.december.com/html/demo/table.html')
->find('table th')
->each(sub {
my (undef, $e) = #_;
print $e->text . "\n";
});
__END__
Outer Table
Inner Table
CORNER
Head1
Head2
Head3
Head4
Head5
Head6
Little
i want to write a little script to grab and play around with the results of the autocomplete of an input field.
So I find a principal solution with Selenium::Firefox, that base on module Selenium::Remote::Driver, but the description of the methods is without any examples.
I have this basic example, that is able to open google and insert a search string.
Then you can see that a result list is suggested and i want to get this list.
But i have no idea how this can be obtained?
Here is my code so far:
#!/usr/bin/perl
use strict;
use warnings;
use Selenium::Firefox;
my $mech = Selenium::Firefox->new(
startup_timeout => 20,
firefox_binary => '/srv/bin/firefox.62.0/firefox',
binary => '/usr/local/bin/geckodriver',
marionette_enabled => 1
);
my $search = "perl";
my $url = "https://www.google.com/";
$mech->get($url);
$mech->find_element_by_name("q");
sleep(3);
my $result = $mech->get_active_element();
$result->send_keys($search);
sleep (10);
$mech->shutdown_binary;
exit 0;
I could find no examples to use this Perl module - and there are more questions for it.
As for instance: find_element
How can i turn on the warnings instead of killing the script?
Or how can i step through the objects of the wep page?
Is it possible to connect to an already opened browser?
The description of the module is not understandable for people who are not experts and the authors did not answer to questions so far.
But my hope is that experts here can give me a hint?
I feel like I must be missing something obvious. I know the XPath to a WebElement and I want to compare the text in that element to some string.
Say the XPath for the element is /html/body/div/a/strong and I want to compare it to $str.
So it shows up in source like ...<strong>find this string</strong>
So I say
use strict;
use warnings;
use Selenium::Remote::Driver;
# Fire up a Selenium object, get to the page, etc..
Test::More::ok($sel->find_element("/html/body/div/a") eq $str, "Text matched");
When I run this the test fails when it should pass. When I try to print the value of find_element($xpath) I get some hash reference. I've googled around some and find examples telling me to try find_element($xpath)->get_text() but get_text() isn't even a method in the original package. Is it an obsolete method that used to actually work?
Most of the examples online for this module say "It's easy!" then show me how to get_title() but not how to check the text at an XPath. I might be going crazy.
Following up on our comment thread, I'm posting this as an answer:
Selenium::Remote::WebDriver definitely has a method named find_element(), and Selenium::Remote::WebElement definitely has a method named get_text().
Something like this...
my $text = $sel->find_element(...)->get_text();
...works fine on my end, though it looks like it'll error out if the element isn't found.
Is there any simple way to convert a HTML file into a Perl hash? For example a working Perl modules or something?
I was search on cpan.org but did'nt find anything what can do what I want. I wanna do something like this:
use Example::Module;
my $hashref = Example::Module->new('/path/to/mydoc.html');
After this I want to refer to second div element something like this:
my $second_div = $hashref->{'body'}->{'div'}[1];
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findByClass('.myclassname');
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findById('#myid');
Is there any working solution for this?
HTML::TreeBuilder::XPath gives you a lot more power than a simple hash would.
From the synopsis:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( "mypage.html");
my $nb=$tree->findvalue('/html/body//p[#class="section_title"]/span[#class="nb"]');
my $id=$tree->findvalue('/html/body//p[#class="section_title"]/#id');
my $p= $html->findnodes('//p[#id="toto"]')->[0];
my $link_texts= $p->findvalue( './a'); # the texts of all a elements in $p
$tree->delete; # to avoid memory leaks, if you parse many HTML documents
More on XPath.
Mojo::DOM (docs found here) builds a simple DOM, that can be accessed in a CSS-selector style:
# Find
say $dom->at('#b')->text;
say $dom->find('p')->pluck('text');
say $dom->find('[id]')->pluck(attr => 'id');
In case you're using xhtml you could also use XML::Simple, which produces a data structure similar to the one you describe.
I would like to write a script that lets me use this website
http://proteinmodel.org/AS2TS/LGA/lga.html
(I need to use it a few hundred times, and I don't feel like doing that manually)
I have searched the internet for ways how this could be done using Perl, and I came across WWW::Mechanize, which seemed to be just what I was looking for. But now I have discovered that the form on that website which I want to use has no name - its declaration line simply reads
<FORM METHOD="POST" ACTION="./lga-form.cgi" ENCTYPE=multipart/form-data>
At first I tried simply not setting my WWW::Mechanize object's form_name property, which gave me this error message when I provided a value for the form's email address field:
Argument "my_email#address.com" isn't numeric in numeric gt (>) at /usr/share/perl5/WWW/Mechanize.pm line 1618.
I then tried setting form_name to '' and later ' ', but it was to no avail, I simply got this message:
There is no form named " " at ./automate_LGA.pl line 40
What way is there to deal with forms that have no names? It would be most helpful if someone on here could answer this question - even if the answer points away from using WWW::Mechanize, as I just want to get the job done, (more or less) no matter how.
Thanks a lot in advance!
An easy and more robust way is to use the $mech->form_with_fields() method from WWW::Mechanize to select the form you want based on the fields it contains.
Easier still, use the submit_form method with the with_fields option.
For instance, to locate a form which has fields named 'username' and 'password', complete them and submit the form, it's as easy as:
$mech->submit_form(
with_fields => { username => $username, password => $password }
);
Doing it this way has the advantage that if they shuffle their HTML around, changing the order of the forms in the HTML, or adding a new form before the one you're interested in, your code will continue to work.
I don't know about WWW::Mechanize, but its Python equivalent, mechanize, gives you an array of forms that you can iterate even if you don't know their names.
Example (taken from its homepage):
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
for form in br.forms():
print form
EDIT: searching in the docs of WWW::Mechanize I found the $mech->forms() method, that could be what you need. But since I don't know perl or WWW::Mechanize, I'll leave there my python answer.
Okay, I have found the answer. I can address the nameless form by its number (there's just one form on the webpage, so I guessed it would be number 1, and it worked). Here's part of my code:
my $lga = WWW::Mechanize->new();
my $address = 'my_email#address.com';
my $options = '-3 -o0 -d:4.0';
my $pdb_2 = "${pdb_id}_1 ${pdb_id}_2";
$lga->get('http://proteinmodel.org/AS2TS/LGA/lga.html');
$lga->success or die "LGA GET fail\n";
$lga->form_number(1);
$lga->field('Address', $address);
$lga->field('Options', $options);
$lga->field('PDB_2', $pdb_2);
$lga->submit();
$lga->success or die "LGA POST fail\n";