Web::Scraper and Perl

Web::Scraper and Perl - perl

I have the following script that scrapes my schools CS department to get a list of all the courses. I want to be able to extract the CRN (course number) and other important information to put into a database which I can let users browse through a web app.
Here is an example URL:
http://courses.illinois.edu/cis/2011/spring/schedule/CS/411.html
I would like to extract info from pages like this. The first level of the scraper just constructs the individual sites from a list of all of the courses. Once I'm at a course specific catalog page, I use the second scraper to attempt to get all of this info i want. For some reason, although the CRN's and Course Instructors are all 'td' elements. My scraper seems to be returning nothing when scraping. I tried to scrape specifically for 'div' instead and I get a bunch of info for each relevant page. So somehow I'm failing to get the 'td' element, but I'm scraping from the right page.
my $tweets = scraper {
# Parse all LIs with the class "status", store them into a resulting
# array 'tweets'. We embed another scraper for each tweet.
# process "h4.ws-ds-name.detail-title", "array[]" => 'TEXT';
process "div.ws-row", "array[]" => 'TEXT';
};
my $res = $tweets->scrape( URI- >new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169") );
foreach my $elem (#{$res->{array}}){
my $coursenum = substr($elem,2,4);
my $secondLevel = scraper{
process "td.ws-row", "array2[]" => 'TEXT';
};
my $res2 = $secondLevel->scrape(URI- >new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/$coursenum.html"));
my $num = #{$res2->{array2}};
print $num;
print "---------------------", "\n";
my #curr = #{$res2->{array2}};
foreach my $elem2 (#curr){
$num++;
print $elem2, " ", "\n";
}
print "---------------------", "\n";
}
Any ideas?
Thanks

Looks to me like
my $coursenum = substr($elem,2,4)
should be
my $coursenum = substr($elem,3,3)

The easiest way to go in this case is use
HTML::TableExtract
In case you are looking for data from the table only.

I played a bit with your problem. You can get course id, title and link to individual course page within initial scraper:
my $courses = scraper {
process 'div.ws-row',
'course[]' => scraper {
process 'div.ws-course-number', 'id' => 'TEXT';
process 'div.ws-course-title', 'title' => 'TEXT';
process 'div.ws-course-title a', 'link' => '#href';
};
result 'course';
};
The result of scraping is arrayref with hashrefs like this:
{ id => "CS 103",
title => "Introduction to Programming",
link => bless(do{\(my $o = "http://courses.illinois.edu/cis/2011/spring/schedule/CS/103.html?skinId=2169")}, "URI::http"),
},
....
Then you can do additional scraping for each course from their individual pages and add such information into original structure:
for my $course (#$res) {
my $crs_scraper = scraper {
process 'div.ws-description', 'desc' => 'TEXT';
# ... add more items here
};
my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
# slice assignment to add them into course definition
#{$course}{ keys %$additional_data } = values %$additional_data;
}
Source combined together is as follows:
use strict; use warnings;
use URI;
use Web::Scraper;
use Data::Dump qw(dump);
my $url = 'http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169';
my $courses = scraper {
process 'div.ws-row',
'course[]' => scraper {
process 'div.ws-course-number', 'id' => 'TEXT';
process 'div.ws-course-title', 'title' => 'TEXT';
process 'div.ws-course-title a', 'link' => '#href';
};
result 'course';
};
my $res = $courses->scrape(URI->new($url));
for my $course (#$res) {
my $crs_scraper = scraper {
process 'div.ws-description', 'desc' => 'TEXT';
# ... add more items here
};
my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
# slice assignment to add them into course definition
#{$course}{ keys %$additional_data } = values %$additional_data;
}
dump $res;

Related

In mojolicious, how to protect images from public view

Somebody can help me, please. I have app in mojolicious Lite
I wanna block images from all without session login
when i type http://m.y.i.p:3000/images/imageX.jpg
i wanna show images only with session login.
my html code is
shwo image 1

Same way as any other content. Set up a handler for requests, render (or don't render) the content.
get '/images/*img' => sub {
my $c = shift;
if (!$c->session("is_authenticated")) {
return $c->render( text => "Forbidden", status => 403 );
}
my $file = $c->param("img");
if (!open(my $fh, '<', $IMAGE_DIR/$file)) {
return $c->render( text => "Not found", status => 404 );
}
my $data = do { local $/; <$fh> };
close $fh;
$c->render( data => $data, format => 'jpg' );
};
Your request handler will take priority over the default handler that serves content from public folders, but once you have this handler in place, you don't need to store the files it serves in a public folder.

another sololution is
get '/pays/*img' => sub {
my $self = shift;
my $img = $self->param('img');
plugin 'RenderFile';
my #imgext = split(/\./, $img);
my $ext = $imgext[-1];
$self->render_file(
'filepath' => "/directiry/$img",
'format' => "$ext", # will change Content-Type "application/x-download" to "application/pdf"
'content_disposition' => 'inline', # will change Content-Disposition from "attachment" to "inline"
# delete file after completed
);
It is use plugin 'RenderFile';

How to loop through subarrays of a SOAP::Lite response in Perl?

I have a Perl script that is successfully getting a response from my ShoreTel Phone server. The server provides information on what calls are currently connected for the extension entered. However I am having issues looping through the sub arrays to get more than one response when there are multiple items. In this case I want to get each of the caller IDs that is currently connected.
My SOAP:LITE request is successfully pulling data from the server using the following code:
use strict;
use warnings;
use SOAP::Lite;
use CGI;
use Data::Dumper;
my $myWebService = SOAP::Lite
-> uri('http://www.ShoreTel.com/ProServices/SDK/Web')
-> proxy('http://10.1.##.##:8070/ShoreTelWebSDK/WebService')
-> on_action(sub {sprintf '%s/ShoreTelWebService/%s', $_[0], $_[1]});
my $query = new CGI;
my $ip = $query->remote_host; # IP address of remote party...use later as unique identifier
my $myClientID = $query->param('MyClientID'); # Possible client ID from previous script passed into us.
my $extnNr = $query->param('MyExtn'); # Has to be at least an extension number so we know who to status.
my $url = CGI::url(-path_info=>1); # What is my URL?
# There should be an extension number given, else what would we status.
if (defined($refreshNr) && defined($extnNr) && ($extnNr ne '') && ($refreshNr ne ''))
{
# If there is a client ID defined, use it...otherwise registering and getting a client ID
# is the first thing we need to do when using our web service.
unless (defined($myClientID))
{
# To use our service, we need to register ourselves as a client...use remote IP address
# as a unique name for association to this session.
my $regClientResult = $myWebService->RegisterClient(SOAP::Data->name('clientName' => $ip));
if ($regClientResult->fault)
{
print '<p>FAULT', $myClientID->faultcode, ', ', $myClientID->faultstring;
}
else
{
# Retrieve client ID which we will be using for subsequent communication.
$myClientID = $regClientResult->valueof('//RegisterClientResponse/RegisterClientResult/');
}
}
if (defined($myClientID))
{
# Use our web service to open the line. This is necessary to get a line ID.
# print '<br>Client ID ', $myClientID, ' has been registered.<br>';
my $openResult = $myWebService->OpenLine(SOAP::Data->name('clientHandle' => $myClientID), SOAP::Data->name('lineAddress' => $extnNr));
my $lineID = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineID/');
my $lineType = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineType/');
my $lineName = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineName/');
my $lineState = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineState/');
# Call GetActiveCalls to see if anything is going on with this line.
my $result = $myWebService->GetActiveCalls(SOAP::Data->name('clientHandle' => $myClientID), SOAP::Data->name('lineID' => $lineID));
my $callID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callID/');
if ($callID ne '')
{
# print '<br>Call ID is ', $callID;
my $isExternal = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/isExternal/');
my $isInbound = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/isInbound/');
my $callReason = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callReason/');
my $connectedID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/connectedID/');
my $connectedIDName = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/connectedIDName/');
my $callerID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
my $callerIDName = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerIDName/');
my $calledID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/calledID/');
my $calledIDName = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/calledIDName/');
my $callState = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callState/');
my $callStateDetail = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callStateDetail/');
# Print call information.
print <<EndOfCallInfo;
HTML CODE
EndOfCallInfo
}
else
{
print <<EndOfCallInfo2;
HTML CODE
EndOfCallInfo2
}
}
}
But I am only able to access the first result in the multidimensional array.
I have tried looping through the results using
for my $t ($result->result({ShoreTelCallStateInfo}{callInfo}')) {
print $t->{callerID} . "\n";}
But I am getting absolutely no results. It appears that the the loop is not even entered.
The following code I have works fine, but only pulls the first caller ID, in this case 1955.
my $callerID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
What can I do to make my loop work?
So that you can see what I am receiving from the server I have included the response from the SOAP Server using DUMP :
$VAR1 = { 'ShoreTelCallStateInfo' => [
{ 'callStateDetail' => 'Active',
'callState' => 'OnHold',
'callInfo' =>
{ 'callerIDName' => 'Joel LASTNAME',
'callID' => '69105', 'lineID' => '3947',
'connectedIDName' => 'VM-Forward',
'calledID' => '2105',
'callerID' => '1955',
'isInbound' => 'false',
'calledIDName' => 'VM-Forward',
'callReason' => 'None',
'callUniqueID' => '1369702515',
'connectedID' => '2105',
'isExternal' => 'false',
'callGUID' => '{00030000-66C2-537E-3FD8-0010492377D9}'
}
},
{ 'callStateDetail' => 'Active',
'callState' => 'Connected',
'callInfo' =>
{ 'callerIDName' => 'LASTNAME Joel ',
'callID' => '71649',
'lineID' => '3947',
'connectedIDName' => 'LASTNAME Joel ',
'calledID' => '1955',
'callerID' => '+1385#######',
'isInbound' => 'true',
'calledIDName' => 'Joel LASTNAME',
'callReason' => 'None',
'callUniqueID' => '1117287558',
'connectedID' => '+1385#######',
'isExternal' => 'true',
'callGUID' => '{00030000-66C5-537E-3FD8-0010492377D9}'
}
}
]
};

Just a guess...
The following code I have works fine, but only pulls the first caller
ID, in this case 1955.
my $callerID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
What can I do to make my loop work?
SOAP::Lite docs say:
valueof()
Returns the value of a (previously) matched node. It accepts a node
path. In this case, it returns the value of matched node, but does not
change the current node. Suitable when you want to match a node and
then navigate through node children:
$som->match('/Envelope/Body/[1]'); # match method
$som->valueof('[1]'); # result
$som->valueof('[2]'); # first out parameter (if present)
The returned value depends on the context. In a scalar context it will
return the first element from matched nodeset. In an array context it
will return all matched elements.
Does this give the behavior you expect? It imposes list context on the valueof method.
for my $callerID ($result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/')) {
...
# do something with each callerID
}
or
my #callerIDs = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');

Trouble combining certain arrays in Perl(web::scraper)

So, I'm trying to learn Perl in my spare time, and I've watched a lot of tutorials and read some material on it.
I'm trying to combine these arrays,(Not that I cut out some of the other scraper blocks for brevity, they are tested and work, and resemble the PCMAG scraper given)
#PCMAG
my $pcmag1_items = scraper {
process ".news-content", "pcmag1_items[]" => scraper {
process "h3", title => 'TEXT';
process "p", body => 'TEXT';
process ".time-stamp", date => 'TEXT';
process 'a', link => '#href';
};
};
my $pcmag2_items = scraper {
process ".news-reviews > ul > li", "pcmag2_items[]" => scraper {
process "h3", title => 'TEXT';
process "p", body => 'TEXT';
process ".time-stamp", date => 'TEXT';
process 'a', link => '#href';
};
};
my $pcmag1_res = $pcmag1_items->scrape( URI->new("http://www.pcmag.com/news"));
my $pcmag2_res = $pcmag2_items->scrape( URI->new("http://www.pcmag.com/news"));
#MAIN
my #items = (#{$pcmag1_res->{pcmag1_items}}, #{$pcmag2_res->{pcmag2_items}}, #{$cnn_res->{cnn_items}}, #{$cnet_res->{cnet_items}});
for my $item (#items) {
print "$item->{title} $item->{body} $item->{date} (link: $item->{link})\n \n";
}
the error I get is:
greg#Linux:~/Desktop/Perl$ perl scraper2.pl
Use of uninitialized value $xpath in concatenation (.) or string at /usr/local/share/perl/5.14.2/Web/Scraper.pm line 136.
'' doesn't look like a valid XPath expression: Not a _primary_expr at
greg#Linux:~/Desktop/Perl$

How to post array form using LWP

I am having problems creating an array that I can pass as a form using LWP. Basic code is
my $ua = LWP::UserAgent->new();
my %form = { };
$form->{'Submit'} = '1';
$form->{'Action'} = 'check';
for (my $i=0; $i<1; $i++) {
$form->{'file_'.($i+1)} = [ './test.txt' ];
$form->{'desc_'.($i+1)} = '';
}
$resp = $ua->post('http://someurl/test.php', 'Content_Type' => 'multipart/form-data'
, 'Content => [ \%form ]');
if ($resp->is_success()) {
print "OK: ", $resp->content;
}
} else {
print $claimid->as_string;
}
I guess I am not creating the form array correctly or using the wrong type as when I check the _POST variables in test.php nothing has been set :(

The problem is that for some reason you've enclosed your form values in single quotes. You want to send the data structure. E.g.:
$resp = $ua->post('http://someurl/test.php',
'Content_Type' => 'multipart/form-data',
'Content' => \%form);
You want to either send the hash reference of %form, not the has reference contained within an array reference as you had ([ \%form ]). If you had wanted to send the data as an array reference, then you'd just use[ %form ]` which populates the array with the key/value pairs from the hash.
I'd suggest that you read the documentation for HTTP::Request::Common, the POST section in particular for a cleaner way of doing this.

how to query eXist using XPath?

I decided to use eXist as a database for an application that I am writing in Perl and
I am experimenting with it. The problem is that I have stored a .xml document with the following structure
<foo-bar00>
<perfdata datum="GigabitEthernet3_0_18">
<cli cmd="whatsup" detail="GigabitEthernet3/0/18" find="" given="">
<input_rate>3</input_rate>
<output_rate>3</output_rate>
</cli>
</perfdata>
<timeline>2011-5-23T11:15:33</timeline>
</foo-bar00>
and it is located in the "/db/LAB/foo-bar00/2011/5/23/11_15_33.xml" collection.
I can successfully query it, like
my $xquery = 'doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")' ;
or $xquery can be equal to
= doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")/foo-bar00/perfdata/cli/data(output_rate)
or
= doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")/foo-bar00/data(timeline)
my ($rc1, $set) = $eXist->executeQuery($xquery) ;
my ($rc2, $count) = $eXist->numberOfResults($set) ;
my ($rc3, #data) = $eXist->retrieveResults($set) ;
$eXist->releaseResultSet($set) ;
print Dumper(#data) ;
And the result is :
$VAR1 = {
'hitCount' => 1,
'foo-bar00' => {
'perfdata' => {
'cli' => {
'given' => '',
'detail' => 'GigabitEthernet3/0/18',
'input_rate' => '3',
'cmd' => 'whatsup',
'output_rate' => '3',
'find' => ''
},
'datum' => 'GigabitEthernet3_0_18'
},
'timeline' => '2011-5-23T11:15:33'
}
};
---> Given that I know the xml document that I want to retrieve info from.
---> Given that I want to retrieve the timeline information.
When I am writing :
my $db_xml_doc = "/db/LAB/foo-bar00/2011/5/23/11_15_33.xml" ;
my ($db_rc, $db_datum) = $eXist->queryXPath("/foo-bar00/timeline", $db_xml_doc, "") ;
print Dumper($db_datum) ;
The result is :
$VAR1 = {
'hash' => 1717362942,
'id' => 3,
'results' => [
{
'node_id' => '1.2',
'document' => '/db/LAB/foo-bar00/2011/5/23/11_15_33.xml'
}
]
};
The question is : How can I retrieve the "timeline" info ? Seems that the "node_id" variable (=1.2) can points to the "timeline" info, but how can I use it ?
Thank you.

use XML::LibXML qw( );
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file('a.xml');
my $root = $doc->documentElement();
my ($timeline) = $root->findnodes('timeline');
if ($timeline) {
print("Exists: ", $timeline->textContent(), "\n");
}
or
my ($timeline) = $root->findnodes('timeline/text()');
if ($timeline) {
print("Exists: ", $timeline->getValue(), "\n");
}
I could have used /foo-bar00/timeline instead of timeline, but I didn't see the need.

Don't know if you're still interested, but you could either retrieve the doc as DOM and apply an xquery to the DOM, or, probably better, only pull out the info you want in the query that you submit to the server.
Something like this:
for $p in doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")//output_rate
return
<vlaue>$p</value>

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Web::Scraper and Perl - perl

Looks to me like my $coursenum = substr($elem,2,4) should be my $coursenum = substr($elem,3,3)

The easiest way to go in this case is use HTML::TableExtract In case you are looking for data from the table only.

Related

In mojolicious, how to protect images from public view

How to loop through subarrays of a SOAP::Lite response in Perl?

Trouble combining certain arrays in Perl(web::scraper)

How to post array form using LWP

how to query eXist using XPath?

Categories

Resources