How to fetch one table from an HTML source file using the LWP module?

I'm a beginner. I want to know how to fetch one table from an HTML source file using the LWP module. Is it possible to use a regex with LWP?

You can use LWP to get the HTML source of a web page. The easiest way is the get() function from LWP::Simple.
use LWP::Simple;
my $html = get('http://example.com/');
Now, in $html you have a text string (potentially a very long text string) which contains HTML. You can use any techniques you want to extract data from that string.
(Hint: Using a regex to do this is likely to be a very bad idea. It will be far harder than you expect and probably very fragile. Perhaps use a better tool - like HTML::TableExtract instead.)
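As a rough illustration of the HTML::TableExtract suggestion, here is a minimal sketch. The HTML string is invented and stands in for whatever get() returns, and selecting the table by its header text ('Name' and 'Price') is just one of the ways the module can pick a table.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Stand-in for the string returned by LWP::Simple's get(); the markup is made up.
my $html = <<'HTML';
<html><body>
<table><tr><td>layout</td></tr></table>
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.90</td></tr>
</table>
</body></html>
HTML

# Select the table by its column headers rather than by its position in the page.
my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
$te->parse($html);

my @rows;
for my $table ( $te->tables ) {
    push @rows, $table->rows;    # each row is an array ref of cell texts
}

print join( "\t", @$_ ), "\n" for @rows;
```

Only the table whose header row contains both headings is matched, so the layout table is ignored without any regex work.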

use Web::Query::LibXML 'wq';
wq('https://www.december.com/html/demo/table.html')
->find('table th')
->each(sub {
my (undef, $e) = @_;
print $e->text . "\n";
});
__END__
Outer Table
Inner Table
CORNER
Head1
Head2
Head3
Head4
Head5
Head6
Little

Related

How to convert HTML file into a hash in Perl?

Is there any simple way to convert a HTML file into a Perl hash? For example a working Perl modules or something?
I searched cpan.org but didn't find anything that does what I want. I want to do something like this:
use Example::Module;
my $hashref = Example::Module->new('/path/to/mydoc.html');
After this I want to refer to second div element something like this:
my $second_div = $hashref->{'body'}->{'div'}[1];
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findByClass('.myclassname');
# or like this:
my $second_div = $hashref->{'body'}->{'div'}->findById('#myid');
Is there any working solution for this?

HTML::TreeBuilder::XPath gives you a lot more power than a simple hash would.
From the synopsis:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( "mypage.html");
my $nb=$tree->findvalue('/html/body//p[@class="section_title"]/span[@class="nb"]');
my $id=$tree->findvalue('/html/body//p[@class="section_title"]/@id');
my $p= $tree->findnodes('//p[@id="toto"]')->[0];
my $link_texts= $p->findvalue( './a'); # the texts of all a elements in $p
$tree->delete; # to avoid memory leaks, if you parse many HTML documents
More on XPath.

Mojo::DOM builds a simple DOM that can be accessed in a CSS-selector style:
# Find
say $dom->at('#b')->text;
say $dom->find('p')->pluck('text');
say $dom->find('[id]')->pluck(attr => 'id');
If your input is XHTML, you could also use XML::Simple, which produces a data structure similar to the one you describe.

Perl XML::SAX - character() method error

I'm new to Perl's XML::SAX and have run into a problem with the characters event. I'm trying to parse a very large XML file.
My goal is to get the content of each tag. I do not know the tag names in advance: given any XML file, I should be able to work out the record pattern and return every record with its tag and data, like Tag:Data.
With small files everything works. On a large file, however, the characters() event sometimes delivers only part of the content. There is no obvious pattern to where it cuts off: sometimes I get the first few characters of the data, sometimes the last few, and sometimes just a single letter.
The Sax Parser is:
$myhandler = MyFilter->new();
$parser = XML::SAX::ParserFactory->parser(Handler => $myhandler);
$parser->parse_file($filename);
I have written my own handler, MyFilter, which overrides the characters method:
sub characters {
my ($self, $element) = @_;
$globalvar = $element->{Data};
print "content is: $globalvar \n";
}
Even this print statement sometimes shows only part of the value.
I also tried setting the parser package before calling $parser->parse():
$XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
It still doesn't work. Could anyone help me out here? Thanks in advance!

Sounds like you need XML::Filter::BufferText.
http://search.cpan.org/dist/XML-Filter-BufferText/BufferText.pm
From the description "One common cause of grief (and programmer error) is that XML parsers aren't required to provide character events in one chunk. They can, but are not forced to, and most don't. This filter does the trivial but oft-repeated task of putting all characters into a single event."
It's very easy to use once you have it installed and will solve your partial character data problem.
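To show what such buffering amounts to, here is a hand-rolled sketch of the same idea, not the filter module itself: text chunks are accumulated per element and only reported when the element closes. The package name BufferingHandler, the stack and records keys, and the sample XML are all invented for illustration, and mixed content is handled only crudely (a parent's record contains just its own direct text).

```perl
use strict;
use warnings;

package BufferingHandler {
    use parent 'XML::SAX::Base';

    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{stack} }, '';    # fresh text buffer for this element
    }

    sub characters {
        my ($self, $chars) = @_;
        # May be called several times for one text node: append, never assign.
        $self->{stack}[-1] .= $chars->{Data} if @{ $self->{stack} || [] };
    }

    sub end_element {
        my ($self, $el) = @_;
        my $text = pop @{ $self->{stack} };
        # Only now is the text complete; skip whitespace-only buffers.
        push @{ $self->{records} }, "$el->{Name}:$text" if $text =~ /\S/;
    }
}

package main;
use XML::SAX::ParserFactory;

my $handler = BufferingHandler->new;
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_string('<rec><name>Alice</name><city>Oslo</city></rec>');

print "$_\n" for @{ $handler->{records} };
```

The key change from the handler in the question is that characters() appends to a buffer instead of treating each call as the complete value.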

Perl DBI Query -> JSON -> JQuery AutoComplete

I've been reading up on how to implement a JSON back end for jQuery UI's autocomplete. I want to use autocomplete to search a database for a name and, after a selection is made, store the matching ID in a hidden field. I've seen a lot of examples around the web, but haven't found the best way to implement this. The database doesn't change very often, so I'm not sure how best to approach this performance-wise.
Backend:
#!/usr/bin/perl
use CGI;
use DBI;
use strict;
use warnings;
my $cgi = CGI->new;
my $dbh = DBI->connect('dbi:mysql:hostname=localhost;database=test',"test","test") or die $DBI::errstr;
my $sth = $dbh->prepare(qq{select id, name from test;}) or die
$dbh->errstr;
$sth->execute() or die $sth->errstr;
my $json = undef;
while(my @user = $sth->fetchrow_array()) {
$json .= qq{{"$user[0]" : "$user[1]"}};
}
print $cgi->header(-type => "application/json", -charset => "utf-8");
print $json;

The jQuery autocomplete needs a "value" or "label" field returned with the json result. If you do not include it, the jquery autocomplete will not work:
The basic functionality of the autocomplete works with the results of the query assigned to the ‘label’ and ‘value’ fields. Explanation on the ‘label’ and ‘value’ fields from the jQuery UI site:
“The local data can be a simple Array of Strings, or it contains Objects for each item in the array, with either a label or value property or both. The label property is displayed in the suggestion menu. The value will be inserted into the input element after the user selected something from the menu. If just one property is specified, it will be used for both, eg. if you provide only value-properties, the value will also be used as the label.”
Link to full example: http://www.jensbits.com/2011/05/09/jquery-ui-autocomplete-widget-with-perl-and-mysql/

You need to grab the JSON package from CPAN instead of doing this:
my $json = undef;
while(my @user = $sth->fetchrow_array()) {
$json .= qq{{"$user[0]" : "$user[1]"}};
}
For example, with JSON it'd look like this:
use JSON;
my $json = {};
while(my @user = $sth->fetchrow_array()) {
$json->{$user[0]} = $user[1];
}
print JSON::to_json($json);
The JSON package will automatically construct a valid JSON string from any Perl data structure you provide it. We use it all over the place on Melody and it's proved to be a real life saver for sanely converting a structure into valid JSON.
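Putting the two answers together, the output probably wants to be an array of objects with label and value keys rather than a single id-to-name hash. The sketch below assumes that shape. It uses the core JSON::PP module in place of JSON, hard-codes two rows as a stand-in for the fetchrow_array loop, and adds an extra id key on the assumption that the page will read the ID from the selected item.

```perl
use strict;
use warnings;
use JSON::PP;

# Stand-in for the (id, name) rows that $sth->fetchrow_array would yield.
my @rows = ( [ 1, 'Alice' ], [ 2, 'Bob' ] );

# One object per row: label/value drive the widget, id is an extra
# (hypothetical) property carrying the database key.
my @items = map { +{ label => $_->[1], value => $_->[1], id => $_->[0] } } @rows;

# canonical() sorts the keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode( \@items );
print "$json\n";
```

In the CGI script, @rows would be filled inside the while loop and the print would follow the application/json header.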

This answer is about performance.
There are a few settings you can use to improve it. On the client side, you can set the minimum number of characters required before a request is sent.
You can also set the delay after the last keystroke before the request is sent.
If your database table is really huge, I suggest you put a LIMIT on the results you retrieve.
This avoids long request processing, and it also helps because some clients, like IE6, aren't very fast at handling more than a hundred results (and a list that long isn't very user-friendly anyway).
On a project using IE6, we limited the elements returned to 100. If the user can't narrow the search down to 100 elements, we presume they don't know what they are looking for.
Hope it helps a bit.

How to deal with nameless forms on websites?

I would like to write a script that lets me use this website
http://proteinmodel.org/AS2TS/LGA/lga.html
(I need to use it a few hundred times, and I don't feel like doing that manually)
I have searched the internet for ways how this could be done using Perl, and I came across WWW::Mechanize, which seemed to be just what I was looking for. But now I have discovered that the form on that website which I want to use has no name - its declaration line simply reads
<FORM METHOD="POST" ACTION="./lga-form.cgi" ENCTYPE=multipart/form-data>
At first I tried simply not setting my WWW::Mechanize object's form_name property, which gave me this error message when I provided a value for the form's email address field:
Argument "my_email@address.com" isn't numeric in numeric gt (>) at /usr/share/perl5/WWW/Mechanize.pm line 1618.
I then tried setting form_name to '' and later ' ', but it was to no avail, I simply got this message:
There is no form named " " at ./automate_LGA.pl line 40
What way is there to deal with forms that have no names? It would be most helpful if someone on here could answer this question - even if the answer points away from using WWW::Mechanize, as I just want to get the job done, (more or less) no matter how.
Thanks a lot in advance!

An easy and more robust way is to use the $mech->form_with_fields() method from WWW::Mechanize to select the form you want based on the fields it contains.
Easier still, use the submit_form method with the with_fields option.
For instance, to locate a form which has fields named 'username' and 'password', complete them and submit the form, it's as easy as:
$mech->submit_form(
with_fields => { username => $username, password => $password }
);
Doing it this way has the advantage that if they shuffle their HTML around, changing the order of the forms in the HTML, or adding a new form before the one you're interested in, your code will continue to work.

I don't know about WWW::Mechanize, but its Python equivalent, mechanize, gives you an array of forms that you can iterate even if you don't know their names.
Example (taken from its homepage):
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
for form in br.forms():
    print form
EDIT: searching in the docs of WWW::Mechanize I found the $mech->forms() method, that could be what you need. But since I don't know perl or WWW::Mechanize, I'll leave there my python answer.
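For a Perl counterpart to the loop above, the sketch below lists the forms in a page and their fields. It uses HTML::Form directly so that it runs without a network connection; $mech->forms() in WWW::Mechanize returns the same kind of HTML::Form objects. The HTML snippet and the example.com base URL are assumptions, with the form tag and field names borrowed from the question.

```perl
use strict;
use warnings;
use HTML::Form;

# Made-up page containing a form with no name attribute.
my $html = <<'HTML';
<html><body>
<FORM METHOD="POST" ACTION="./lga-form.cgi" ENCTYPE=multipart/form-data>
<input type="text" name="Address">
<input type="text" name="Options">
<input type="submit" value="Send">
</FORM>
</body></html>
HTML

# The second argument is the base URL used to resolve the form's action.
my @forms = HTML::Form->parse( $html, 'http://example.com/dir/page.html' );

my $n = 0;
for my $form (@forms) {
    $n++;    # forms are numbered from 1, as form_number() expects
    my $name = $form->attr('name') // '(no name)';
    printf "form %d: name=%s method=%s action=%s\n",
        $n, $name, $form->method(), $form->action();
    for my $input ( $form->inputs ) {
        printf "  %-8s %s\n", $input->type(), $input->name() // '(unnamed)';
    }
}
```

Listing the forms this way shows which number or which field names to use when selecting a form that has no name.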

Okay, I have found the answer. I can address the nameless form by its number (there's just one form on the webpage, so I guessed it would be number 1, and it worked). Here's part of my code:
my $lga = WWW::Mechanize->new();
my $address = 'my_email@address.com';
my $options = '-3 -o0 -d:4.0';
my $pdb_2 = "${pdb_id}_1 ${pdb_id}_2";
$lga->get('http://proteinmodel.org/AS2TS/LGA/lga.html');
$lga->success or die "LGA GET fail\n";
$lga->form_number(1);
$lga->field('Address', $address);
$lga->field('Options', $options);
$lga->field('PDB_2', $pdb_2);
$lga->submit();
$lga->success or die "LGA POST fail\n";

How do I create an absolute URL from two components, in Perl?

Suppose I have:
my $a = "http://site.com";
my $part = "index.html";
my $full = join($a,$part);
print $full;
>> http://site.com/index.html
What do I have to use as join, in order to get my snippet to work?
EDIT: I'm looking for something more general. What if a ends with a slash, and part starts with one? I'm sure in some module, someone has this covered.

I believe what you're looking for is URI::Split, e.g.:
use URI::Split qw(uri_join);
my $uri = uri_join('http', 'site.com', 'index.html');

use URI;
URI->new("index.html")->abs("http://site.com")
will produce
"http://site.com/index.html"
URI->abs will take care of merging the paths properly following your uri specification,
so
URI->new("/bah")->abs("http://site.com/bar")
will produce
"http://site.com/bah"
and
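URI->new("index.html")->abs("http://site.com/barf/")
will produce
"http://site.com/barf/index.html"
(note the trailing slash on the base; without it, barf is treated as a file name and the result is "http://site.com/index.html")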
and
URI->new("../uplevel/foo")->abs("http://site.com/foo/bar/barf")
will produce
"http://site.com/foo/uplevel/foo"
Alternatively, there's a shortcut constructor in the URI namespace that I just noticed:
URI->new_abs($url, $base_url)
so
URI->new_abs("index.html", "http://site.com")
will produce
"http://site.com/index.html"
and so on.
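As a quick check of the slash handling asked about in the question's edit, the sketch below resolves the same relative part against a few bases. The bases and relative parts are only illustrative; the point is that a trailing slash on the base decides whether its last path segment is kept.

```perl
use strict;
use warnings;
use URI;

# [ relative part, base ] pairs covering the slash combinations.
my @cases = (
    [ 'index.html',  'http://site.com'       ],
    [ '/index.html', 'http://site.com/'      ],
    [ 'index.html',  'http://site.com/barf'  ],
    [ 'index.html',  'http://site.com/barf/' ],
);

my @resolved = map { URI->new_abs( $_->[0], $_->[1] )->as_string } @cases;

printf "%-22s + %-12s => %s\n", $cases[$_][1], $cases[$_][0], $resolved[$_]
    for 0 .. $#cases;
```

The first three cases all resolve to http://site.com/index.html; only the base ending in a slash yields http://site.com/barf/index.html.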

No need for 'join', just use string interpolation.
my $a = "http://site.com";
my $part = "index.html";
my $full = "$a/$part";
print $full;
>> http://site.com/index.html
Update:
Not everything requires a module. CPAN is wonderful, but restraint is needed.
The simple approach above works very well if you have clean inputs. If you need to handle unevenly formatted strings, you will need to normalize them somehow. Using a library in the URI namespace that meets your needs is probably the best course of action if you need to handle user input. If the variance is minor File::Spec or a little manual clean-up may be good enough for your needs.
my $a = 'http://site.com';
my @paths = qw( /foo/bar foo //foo/bar );
# bad paths don't work:
print join("\n", "Bad URIs:", map "$a/$_", @paths), "\n";
# strip leading slashes from a copy of each path
my @cleaned = map { (my $p = $_) =~ s{^/+}{}; $p } @paths;
print join("\n", "Cleaned URIs:", map "$a/$_", @cleaned), "\n";
When you have to handle bad stuff like $path = /./foo/.././foo/../foo/bar; you definitely want to use a library. File::Spec's canonpath function can tidy some of this, although it deliberately leaves .. segments alone.
If you are worried about bad/bizarre stuff in the URI rather than just path issues (usernames, passwords, bizarre protocol specifiers) or URL encoding of strings, then using a URI library is really important, and is indisputably not overkill.

You might want to take a look at this, an implementation of a function similar to Python's urljoin, but in Perl:
http://sveinbjorn.org/urljoin_function_implemented_using_Perl

Being used to Java's java.net.URL methods, I was looking for a similar way to concatenate URIs without any assumption about scheme, host or port (in my case, for possibly complex Subversion URLs):
http://site.com/page/index.html
+ images/background.jpg
=> http://site.com/page/images/background.jpg
Here is the way to do it in Perl:
use URI;
my $base = URI->new("http://site.com/page/index.html");
my $result = URI->new_abs("images/background.jpg", $base);