WWW::Mechanize Extraction Help - Perl

I'm trying to automate the extraction of a transcript found on a website. The entire transcript sits between dl tags, since the site formatted the interview as a description list. The script I have below lets me search the site and extract the text in plain-text format, but I actually want it to include everything between the dl tags, meaning the dd's, dt's, etc. This will allow us to develop our own CSS for the interview.
Something to note about the page is that there are break (br) tags inserted at various points during the interview. Some tools we've found that extract information from web pages using tag pairings have had trouble with this, since they only grab the information up until the break tag. Just something to keep in mind if you point me in a different direction. Here's what I have so far.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);

$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");

# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
    print $_->as_text();
}
If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!

Your code is fine; just change the as_text() method to as_HTML() and it will show the content with the HTML tags included.
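For example, a minimal tweak to the loop from the question (same @list as above):

foreach ( @list ) {
    print $_->as_HTML();   # emit the <dl> with its <dt>/<dd> markup intact
}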

Related

Fetch a URL 100 times using Perl

The problem is that I need to fetch one URL (I can't share the exact link; it performs a request and looks like http://link.com/?name=name&password=password& and so on).
I need to fetch this URL 100 times in a row. I can't do this manually in a browser; it would take too much time.
Is there any way to run this link (just run it, the way you'd put the link in a browser and press Enter) 100 times in a row using a Perl script?
I haven't worked with Perl before, which is why I'm asking for help directly. I googled for some information and put together a little script, but it seems I'm missing something:
#!/usr/bin/perl -w
use LWP::Simple;
my $uri = 'http://my link here';
my $content = get $uri;
Could you please advise me on how to finish this script?
Use a (simple) for loop.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $uri = 'http://my link here';
get $uri for 1 .. 100;
Update: Just read in a comment that you don't care about the returned data, so I've edited my answer to remove the unnecessary $content variable.
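If you later decide you do care whether each request succeeded, here is a minimal variant (the URL is still a placeholder): LWP::Simple's get() returns undef on failure, so you can warn on each miss.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $uri = 'http://my link here';
for my $i (1 .. 100) {
    # get() returns undef if the request fails
    defined get($uri) or warn "request $i failed\n";
}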

Automatic Search Using WWW::Mechanize

I am trying to write a Perl script which will automatically key in search variables on this LexisNexis search page and retrieve the search results.
I am using the WWW::Mechanize module, but I am not sure how to figure out the field name of the search bar itself. This is the script I have so far:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $m = WWW::Mechanize->new();
my $url = "http://www.lexisnexis.com/hottopics/lnacademic/?verb=sr&csi=379740";
$m->get($url);
$m->form_name('f');
$m->field('q', 'Test');
my $response = $m->submit();
print $response->content();
However, I think the name of the search box on this website is not "q". I am getting the following error: "Can't call method "value" on an undefined value at site/lib/WWW/Mechanize.pm line 1442." Any help is much appreciated. Thank you!
If you disable JavaScript in your browser, you will notice that the search form doesn't load: it's being built by JavaScript, which is why you can't handle it with plain WWW::Mechanize. Have a look at WWW::Mechanize::Firefox; it might help you with your task. Check out its example scripts, cookbook, and FAQ.
You can also do the same using Selenium, see Gabor's tutorial on Selenium.
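For a rough sketch of the WWW::Mechanize::Firefox route (this assumes a running Firefox with the MozRepl extension, and that the JavaScript-built search field really ends up named 'q', which you'd need to confirm in the browser's inspector):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# drives a real Firefox instance, so JavaScript-built forms exist
# by the time we interact with them
my $m = WWW::Mechanize::Firefox->new();
$m->get('http://www.lexisnexis.com/hottopics/lnacademic/?verb=sr&csi=379740');

$m->field('q', 'Test');   # 'q' is an assumption; inspect the real page first
$m->submit();
print $m->content();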

Perl CGI mailto does not work

I have a web report written in Perl CGI. It pulls some constantly changing data from a flat-file DB and displays the current status in a table on the web page. I want to be able to click a link that will push all of that data into an email that can be edited before sending.
This is what I have as my last chunk of HTML on the page. The "Go To Status" link works, but the mailto:xxx@xx.com link causes server errors. Does "mailto" not work in a CGI script for some reason? It gets rendered as HTML, so I'm not sure why it wouldn't.
sub EndHtml {
    print "<P align=right> <a href='http://www.xxx.com/~a0868183/cgi-bin/xxx.cgi'>Go to Status</a> </p>\n";
    print "<p align=right> <a href='mailto:xxx@xx.com'></a>Send EOS</p>\n";
    print "</BODY></HTML>\n";
}
(Once I figure this out I will then put the variables with the data into the email)
Thanks,
Jared
@ has special meaning in a double-quote delimited string.
Always start your script with:
use strict;
use warnings;
Then you will get alerted (if you read your log files):
Possible unintended interpolation of @xx in string
Then you can escape it:
mailto:xxx\@xx.com
Or use a single quoted string:
print q{<p align=right> <a href='mailto:xxx@xx.com'></a>Send EOS</p>}, "\n";
Or don't embed your HTML in the middle of your Perl and use a Template language (like Template Toolkit).
You probably want to put some content in the anchor too.
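And for completeness, a minimal sketch of the Template Toolkit route (page.tt and the email variable are made up for illustration; the template would contain something like <a href='mailto:[% email %]'>Send EOS</a>):

use strict;
use warnings;
use Template;

my $tt = Template->new() or die Template->error();

# process() fills in the [% email %] placeholder and prints to STDOUT
$tt->process('page.tt', { email => 'xxx@xx.com' })
    or die $tt->error();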

Can I rate a song in iTunes (on a Mac) using Perl?

I've tried searching CPAN. I found Mac::iTunes, but not a way to assign a rating to a particular track.
If you're not excited by Mac::AppleScript, which just takes a big blob of AppleScript text and runs it, you might prefer Mac::AppleScript::Glue, which provides a more object-oriented interface. Here's the equivalent to Iamamac's sample code:
#!/usr/bin/env perl
use Modern::Perl;
use Mac::AppleScript::Glue;
use Data::Dumper;
my $itunes = Mac::AppleScript::Glue::Application->new('iTunes');
# might crash if iTunes isn't playing anything yet
my $track = $itunes->current_track;
# for expository purposes, let's see what we're dealing with
say Dumper \$itunes, \$track;
say $track->rating; # initially undef
$track->set(rating => 100);
say $track->rating; # should print 100
All that module does is build a big blob of AppleScript, run it, and then break the result apart into another AppleScript expression that it can use in your next command. You can see that in the _ref value of the track object when you run the above script. Because all it's doing is pasting and parsing AppleScript, this module won't be any faster than any other AppleScript-based approach, but it does let you intersperse other Perl commands within your script, and it keeps your code looking a little more like Perl, for what that's worth.
You can write AppleScript to fully control iTunes, and there is a Perl binding Mac::AppleScript.
EDIT: Code sample:
use Mac::AppleScript qw(RunAppleScript);

my $r = 100;  # rating on iTunes' 0-100 scale
RunAppleScript(qq(tell application "iTunes" \n set rating of current track to $r \n end tell));
Have a look at itunes-perl; it seems to be able to rate tracks.

How can I extract links from an HTML file with Perl?

I have some input containing a link and I want to open that link. For instance, I have an HTML file and want to find all the links in it and open their contents in an Excel spreadsheet.
It sounds like you want the linktractor script from my HTML::SimpleLinkExtor module.
You might also be interested in my webreaper script. I wrote that a long, long time ago to do something close to this same task. I don't really recommend it because other tools are much better now, but you can at least look at the code.
CPAN and Google are your friends. :)
Mojo::UserAgent is quite nice for this, too:
use Mojo::UserAgent;

print Mojo::UserAgent
    ->new
    ->get( $ARGV[0] )
    ->res
    ->dom->find( "a" )
    ->map( attr => "href" )
    ->join( "\n" );
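If I remember right, the mojo command that ships with Mojolicious can do the same from the shell (worth checking against your installed version):

$ mojo get http://example.com a attr href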
That sounds like a job for WWW::Mechanize. It provides a fairly high level interface to fetching and studying web pages.
Once you've read the docs, I think you'll have a good idea how to go about it.
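For instance, a minimal sketch (URL taken from the command line, as in the Mojo example above):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $ARGV[0] );

# links() returns WWW::Mechanize::Link objects for <a>, <area>, <frame>, etc.
print $_->url, "\n" for $mech->links;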
There is also Web::Query:
#!/usr/bin/env perl
use 5.10.0;
use strict;
use warnings;
use Web::Query;
say for wq( shift )->find('a')->attr('href');
Or, from the cli:
$ perl -MWeb::Query -E'say for wq(shift)->find("a")->attr("href")' \
http://techblog.babyl.ca
I've used URI::Find for this in the past (for when the file is not HTML).
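A short sketch of that approach (the sample text is invented): URI::Find calls your callback with each URI it finds plus the original matched text, which you return to leave the input unchanged.

use strict;
use warnings;
use URI::Find;

my $text = 'See http://example.com and ftp://ftp.example.org for details.';

my $finder = URI::Find->new(sub {
    my ( $uri, $orig_text ) = @_;
    print "$uri\n";
    return $orig_text;   # put the original text back unchanged
});
$finder->find( \$text );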