Trouble combining certain arrays in Perl (Web::Scraper) - perl

So, I'm trying to learn Perl in my spare time, and I've watched a lot of tutorials and read some material on it.
I'm trying to combine these arrays. (Note that I cut out some of the other scraper blocks for brevity; they are tested and work, and resemble the PCMAG scraper given.)
#PCMAG
my $pcmag1_items = scraper {
    process ".news-content", "pcmag1_items[]" => scraper {
        process "h3", title => 'TEXT';
        process "p", body => 'TEXT';
        process ".time-stamp", date => 'TEXT';
        process 'a', link => '@href';
    };
};
my $pcmag2_items = scraper {
    process ".news-reviews > ul > li", "pcmag2_items[]" => scraper {
        process "h3", title => 'TEXT';
        process "p", body => 'TEXT';
        process ".time-stamp", date => 'TEXT';
        process 'a', link => '@href';
    };
};
my $pcmag1_res = $pcmag1_items->scrape( URI->new("http://www.pcmag.com/news"));
my $pcmag2_res = $pcmag2_items->scrape( URI->new("http://www.pcmag.com/news"));
#MAIN
my @items = (@{$pcmag1_res->{pcmag1_items}}, @{$pcmag2_res->{pcmag2_items}}, @{$cnn_res->{cnn_items}}, @{$cnet_res->{cnet_items}});
for my $item (@items) {
    print "$item->{title} $item->{body} $item->{date} (link: $item->{link})\n \n";
}
The error I get is:
greg@Linux:~/Desktop/Perl$ perl scraper2.pl
Use of uninitialized value $xpath in concatenation (.) or string at /usr/local/share/perl/5.14.2/Web/Scraper.pm line 136.
'' doesn't look like a valid XPath expression: Not a _primary_expr at
greg@Linux:~/Desktop/Perl$
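Separately from the XPath error (which is reported from inside Web::Scraper, not from the merging line), the merge itself can be written so that a scraper with no matches does not break it. The following is only a minimal sketch: the three result hashes are hypothetical stand-ins for real scrape results, and the empty `$cnn_res` simulates a scraper that matched nothing.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for scraper results; the key names mirror the question.
my $pcmag1_res = { pcmag1_items => [ { title => 'A', link => 'http://example.com/a' } ] };
my $pcmag2_res = { pcmag2_items => [ { title => 'B', link => 'http://example.com/b' } ] };
my $cnn_res    = {};    # pretend this scraper matched nothing

# Pair each result with its key, and treat a missing list as an empty one.
my @items = map { @{ $_->[0]{ $_->[1] } // [] } } (
    [ $pcmag1_res, 'pcmag1_items' ],
    [ $pcmag2_res, 'pcmag2_items' ],
    [ $cnn_res,    'cnn_items' ],
);

for my $item (@items) {
    print "$item->{title} (link: $item->{link})\n";
}
```

Under `use strict`, dereferencing a missing key directly with `@{ ... }` would die, so the `// []` fallback (Perl 5.10 or later) keeps the combined list usable even when one source is empty.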

Related

Invalid Content-Type 'multipart/; boundary

I am using Perl and Email::MIME to access an email account via IMAP. When I try to download the attachment, I get this error:
Invalid Content-Type 'multipart/;
boundary="===============6113972194662902815=="' at
/path/to/class/Reader.pm line 99.
Here is my code:
my $env = $self->env();
my $imap = $self->imap();
my $logger = $self->logger();
my ($OUT, $out_file);
# parse attachment
my $message_string = $imap->message_string($imap_id);
Email::MIME->new($message_string)->walk_parts(sub {
    my ($part) = @_;
    if ($part->content_type =~ /octet\-stream/) {
        ($OUT, $out_file) = tempfile();
        binmode $OUT;
        print $OUT $part->body;
        close $OUT;
        $logger->info("downloaded attached report: [$out_file]");
    }
});
Here is the end of the dump of the $part:
MixOLE4sTiwwLE4NCiI9IiIyMDE4LTA0LTE5IDA5OjU5OjI5LjQ5NCIiIixhZTJhNDEzMy1hYWRj
LTQ4ZjgtYWY0My1jYjdhMGEzYzQzMzIsIkFwcGxpYW5jZSBwYXJ0cywgaG91c2Vob2xkIixBcHBs
aWFuY2UgU2FsZXMsTUlETE9USElBTixWQSxWYWN1dW0gU3lzdGVtcyw4MDQ1NDg0MTgxLCAsYTEx
OGUxODIyMTAwNzJkYSxEVVJBVElPTiw0NSwyLjQ3ODQsTixOLE4sMCxODQoiPSIiMjAxOC0wNC0x
OSAwOTo1OToyOS45MSIiIiwxZTNlZGQzZi02NGM0LTQ3M2UtODk2Yy00MTI3ZTVhYzIwYWUsRklO
QU5DRXxGSU5BTkNJQUwgSU5TVElUVVRJT05TfEJBTktTLEJhbmtzLEdSRUVOU0JVUkcsUEEsRmly
c3QgRmVkZXJhbCBTYXZpbmdzICYgTG9hbiBBc3NvY2lhdGlvbiBPZiAgR3JlZW4sODQ0MjU1MTk0
MSwgLDgzNDA0NmQ2MWEyYzRkY2UsRFVSQVRJT04sNDUsNS4wLE4sTixOLDAsTg0KIj0iIjIwMTgt
MDQtMTkgMDk6NTk6MjkuOTE1IiIiLGY1YTBmOTc1LTg1MmUtNDAwNC05YTY1LWEzYzgyNDJlYTQy
NywiaW5zdXJhbmNlLCBsZW5kZXJzLCBtb3J0Z2FnZXMiLExvYW5zICYgTW9ydGdhZ2VzLE1hbmFz
cXVhbixOSixDb21tZXJjaWFsIE1vcnRnYWdlIEFzc29jaWF0ZXMgSW5jLiw3MzI0NTEzMzU0LCAs
ZGQ5YTQ4OGVmYjJiY2NlMSxEVVJBVElPTiw0NSwzLjQ0MzUsTixOLE4sMCxODQo=
--===============6113972194662902815==--
",
ct => {
attributes => {
charset => "us-ascii"
},
composite => "plain",
discrete => "text",
subtype => "plain",
type => "text"
},
encode_check => 1,
header => Email::MIME::Header,
mycrlf => "
",
parts => []
}
}
I have tried several different methods of downloading this file and I keep getting that error. Let me know if there are any questions.

error when sending email using Dancer2::Plugin::Email;

I am sending email using Dancer2 via the Dancer2::Plugin::Email package. The main code that I have for this is:
sub sendEmail {
    my ($params,$email_address,$template) = @_;
    my $text = '';
    my $tt = Template->new({
        INCLUDE_PATH => config->{views},
        INTERPOLATE => 1,
        OUTPUT => \$text
    }) || die "$Template::ERROR\n";
    my $out = $tt->process($template,$params);
    my $email = email {
        from => XXXXX,
        to => $email_address,
        subject => XXXXX,
        body => $text,
        'Content-Type' => 'text/html'
    };
}
where I have hidden a couple of the fields. I have gotten the following error:
Route exception: open body: Invalid argument at
/usr/local/share/perl/5.22.1/MIME/Entity.pm line 1878. in
/usr/local/share/perl/5.22.1/Dancer2/Core/App.pm l. 1454
It is not occurring all of the time and I haven't been able to find a consistent piece of code that always fails.
I have set the host parameter of the mail server that I am using in the configuration as explained here: https://metacpan.org/pod/Dancer2::Plugin::Email. Simple tests show it works, but I get sporadic errors that I can't track down.

How to loop through subarrays of a SOAP::Lite response in Perl?

I have a Perl script that is successfully getting a response from my ShoreTel Phone server. The server provides information on what calls are currently connected for the extension entered. However I am having issues looping through the sub arrays to get more than one response when there are multiple items. In this case I want to get each of the caller IDs that is currently connected.
My SOAP::Lite request is successfully pulling data from the server using the following code:
use strict;
use warnings;
use SOAP::Lite;
use CGI;
use Data::Dumper;
my $myWebService = SOAP::Lite
-> uri('http://www.ShoreTel.com/ProServices/SDK/Web')
-> proxy('http://10.1.##.##:8070/ShoreTelWebSDK/WebService')
-> on_action(sub {sprintf '%s/ShoreTelWebService/%s', $_[0], $_[1]});
my $query = new CGI;
my $ip = $query->remote_host; # IP address of remote party...use later as unique identifier
my $myClientID = $query->param('MyClientID'); # Possible client ID from previous script passed into us.
my $extnNr = $query->param('MyExtn'); # Has to be at least an extension number so we know who to status.
my $url = CGI::url(-path_info=>1); # What is my URL?
# There should be an extension number given, else what would we status.
if (defined($refreshNr) && defined($extnNr) && ($extnNr ne '') && ($refreshNr ne ''))
{
# If there is a client ID defined, use it...otherwise registering and getting a client ID
# is the first thing we need to do when using our web service.
unless (defined($myClientID))
{
# To use our service, we need to register ourselves as a client...use remote IP address
# as a unique name for association to this session.
my $regClientResult = $myWebService->RegisterClient(SOAP::Data->name('clientName' => $ip));
if ($regClientResult->fault)
{
print '<p>FAULT', $regClientResult->faultcode, ', ', $regClientResult->faultstring;
}
else
{
# Retrieve client ID which we will be using for subsequent communication.
$myClientID = $regClientResult->valueof('//RegisterClientResponse/RegisterClientResult/');
}
}
if (defined($myClientID))
{
# Use our web service to open the line. This is necessary to get a line ID.
# print '<br>Client ID ', $myClientID, ' has been registered.<br>';
my $openResult = $myWebService->OpenLine(SOAP::Data->name('clientHandle' => $myClientID), SOAP::Data->name('lineAddress' => $extnNr));
my $lineID = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineID/');
my $lineType = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineType/');
my $lineName = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineName/');
my $lineState = $openResult->valueof('//OpenLineResponse/OpenLineResult/lineState/');
# Call GetActiveCalls to see if anything is going on with this line.
my $result = $myWebService->GetActiveCalls(SOAP::Data->name('clientHandle' => $myClientID), SOAP::Data->name('lineID' => $lineID));
my $callID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callID/');
if ($callID ne '')
{
# print '<br>Call ID is ', $callID;
my $isExternal = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/isExternal/');
my $isInbound = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/isInbound/');
my $callReason = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callReason/');
my $connectedID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/connectedID/');
my $connectedIDName = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/connectedIDName/');
my $callerID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
my $callerIDName = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerIDName/');
my $calledID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/calledID/');
my $calledIDName = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/calledIDName/');
my $callState = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callState/');
my $callStateDetail = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callStateDetail/');
# Print call information.
print <<EndOfCallInfo;
HTML CODE
EndOfCallInfo
}
else
{
print <<EndOfCallInfo2;
HTML CODE
EndOfCallInfo2
}
}
}
But I am only able to access the first result in the multidimensional array.
I have tried looping through the results using
for my $t ($result->result({ShoreTelCallStateInfo}{callInfo}')) {
print $t->{callerID} . "\n";}
But I am getting absolutely no results. It appears that the loop is not even entered.
The following code I have works fine, but only pulls the first caller ID, in this case 1955.
my $callerID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
What can I do to make my loop work?
So that you can see what I am receiving from the server, here is the response from the SOAP server, printed with Data::Dumper:
$VAR1 = { 'ShoreTelCallStateInfo' => [
{ 'callStateDetail' => 'Active',
'callState' => 'OnHold',
'callInfo' =>
{ 'callerIDName' => 'Joel LASTNAME',
'callID' => '69105', 'lineID' => '3947',
'connectedIDName' => 'VM-Forward',
'calledID' => '2105',
'callerID' => '1955',
'isInbound' => 'false',
'calledIDName' => 'VM-Forward',
'callReason' => 'None',
'callUniqueID' => '1369702515',
'connectedID' => '2105',
'isExternal' => 'false',
'callGUID' => '{00030000-66C2-537E-3FD8-0010492377D9}'
}
},
{ 'callStateDetail' => 'Active',
'callState' => 'Connected',
'callInfo' =>
{ 'callerIDName' => 'LASTNAME Joel ',
'callID' => '71649',
'lineID' => '3947',
'connectedIDName' => 'LASTNAME Joel ',
'calledID' => '1955',
'callerID' => '+1385#######',
'isInbound' => 'true',
'calledIDName' => 'Joel LASTNAME',
'callReason' => 'None',
'callUniqueID' => '1117287558',
'connectedID' => '+1385#######',
'isExternal' => 'true',
'callGUID' => '{00030000-66C5-537E-3FD8-0010492377D9}'
}
}
]
};
Just a guess...
The following code I have works fine, but only pulls the first caller
ID, in this case 1955.
my $callerID = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
What can I do to make my loop work?
SOAP::Lite docs say:
valueof()
Returns the value of a (previously) matched node. It accepts a node
path. In this case, it returns the value of matched node, but does not
change the current node. Suitable when you want to match a node and
then navigate through node children:
$som->match('/Envelope/Body/[1]'); # match method
$som->valueof('[1]'); # result
$som->valueof('[2]'); # first out parameter (if present)
The returned value depends on the context. In a scalar context it will
return the first element from matched nodeset. In an array context it
will return all matched elements.
Does this give the behavior you expect? It imposes list context on the valueof method.
for my $callerID ($result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/')) {
...
# do something with each callerID
}
or
my @callerIDs = $result->valueof('//GetActiveCallsResponse/GetActiveCallsResult/ShoreTelCallStateInfo/callInfo/callerID/');
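If you already hold the hash shown in the question's dump, you can also walk the array of calls directly. This is a minimal sketch, not tested against a ShoreTel server: the variable name `$data` is hypothetical, the structure is a trimmed copy of the dump, and the single-call normalisation is an assumption about how the response might be deserialised.

```perl
use strict;
use warnings;

# Trimmed copy of the structure shown in the question's dump.
my $data = {
    ShoreTelCallStateInfo => [
        { callState => 'OnHold',    callInfo => { callID => '69105', callerID => '1955' } },
        { callState => 'Connected', callInfo => { callID => '71649', callerID => '+1385#######' } },
    ],
};

my $calls = $data->{ShoreTelCallStateInfo};
# Assumption: a single active call might arrive as a lone hashref rather than a list.
$calls = [$calls] if ref $calls eq 'HASH';

# One caller ID per active call, in the order the server returned them.
my @caller_ids = map { $_->{callInfo}{callerID} } @$calls;
print "$_\n" for @caller_ids;
```

Iterating over whole call records this way keeps each callerID together with the rest of its callInfo, which is harder to do when each field is fetched with a separate valueof call.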

how to query eXist using XPath?

I decided to use eXist as a database for an application that I am writing in Perl and
I am experimenting with it. The problem is that I have stored a .xml document with the following structure
<foo-bar00>
<perfdata datum="GigabitEthernet3_0_18">
<cli cmd="whatsup" detail="GigabitEthernet3/0/18" find="" given="">
<input_rate>3</input_rate>
<output_rate>3</output_rate>
</cli>
</perfdata>
<timeline>2011-5-23T11:15:33</timeline>
</foo-bar00>
and it is located in the "/db/LAB/foo-bar00/2011/5/23/11_15_33.xml" collection.
I can successfully query it, like
my $xquery = 'doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")' ;
or $xquery can be equal to
= doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")/foo-bar00/perfdata/cli/data(output_rate)
or
= doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")/foo-bar00/data(timeline)
my ($rc1, $set) = $eXist->executeQuery($xquery) ;
my ($rc2, $count) = $eXist->numberOfResults($set) ;
my ($rc3, @data) = $eXist->retrieveResults($set) ;
$eXist->releaseResultSet($set) ;
print Dumper(@data) ;
And the result is :
$VAR1 = {
'hitCount' => 1,
'foo-bar00' => {
'perfdata' => {
'cli' => {
'given' => '',
'detail' => 'GigabitEthernet3/0/18',
'input_rate' => '3',
'cmd' => 'whatsup',
'output_rate' => '3',
'find' => ''
},
'datum' => 'GigabitEthernet3_0_18'
},
'timeline' => '2011-5-23T11:15:33'
}
};
---> Given that I know the xml document that I want to retrieve info from.
---> Given that I want to retrieve the timeline information.
When I am writing :
my $db_xml_doc = "/db/LAB/foo-bar00/2011/5/23/11_15_33.xml" ;
my ($db_rc, $db_datum) = $eXist->queryXPath("/foo-bar00/timeline", $db_xml_doc, "") ;
print Dumper($db_datum) ;
The result is :
$VAR1 = {
'hash' => 1717362942,
'id' => 3,
'results' => [
{
'node_id' => '1.2',
'document' => '/db/LAB/foo-bar00/2011/5/23/11_15_33.xml'
}
]
};
The question is: how can I retrieve the "timeline" info? It seems that the "node_id" value (1.2) points to the "timeline" node, but how can I use it?
Thank you.
use XML::LibXML qw( );
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file('a.xml');
my $root = $doc->documentElement();
my ($timeline) = $root->findnodes('timeline');
if ($timeline) {
print("Exists: ", $timeline->textContent(), "\n");
}
or
my ($timeline) = $root->findnodes('timeline/text()');
if ($timeline) {
print("Exists: ", $timeline->getValue(), "\n");
}
I could have used /foo-bar00/timeline instead of timeline, but I didn't see the need.
Don't know if you're still interested, but you could either retrieve the doc as DOM and apply an xquery to the DOM, or, probably better, only pull out the info you want in the query that you submit to the server.
Something like this:
for $p in doc("/db/LAB/foo-bar00/2011/5/23/11_15_33.xml")//output_rate
return
<value>{$p}</value>

Web::Scraper and Perl

I have the following script that scrapes my school's CS department site to get a list of all the courses. I want to extract the CRN (course number) and other important information and put it into a database that users can browse through a web app.
Here is an example URL:
http://courses.illinois.edu/cis/2011/spring/schedule/CS/411.html
I would like to extract info from pages like this. The first level of the scraper just constructs the individual URLs from a list of all of the courses. Once I'm at a course-specific catalog page, I use the second scraper to attempt to get all of the info I want. For some reason, although the CRNs and course instructors are all 'td' elements, my scraper seems to return nothing. When I scrape for 'div' instead, I get a bunch of info for each relevant page. So I'm scraping the right page, but somehow failing to get the 'td' elements.
my $tweets = scraper {
    # Parse all LIs with the class "status", store them into a resulting
    # array 'tweets'. We embed another scraper for each tweet.
    # process "h4.ws-ds-name.detail-title", "array[]" => 'TEXT';
    process "div.ws-row", "array[]" => 'TEXT';
};
my $res = $tweets->scrape( URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169") );
foreach my $elem (@{$res->{array}}) {
    my $coursenum = substr($elem,2,4);
    my $secondLevel = scraper {
        process "td.ws-row", "array2[]" => 'TEXT';
    };
    my $res2 = $secondLevel->scrape(URI->new("http://courses.illinois.edu/cis/2011/spring/schedule/CS/$coursenum.html"));
    my $num = @{$res2->{array2}};
    print $num;
    print "---------------------", "\n";
    my @curr = @{$res2->{array2}};
    foreach my $elem2 (@curr) {
        $num++;
        print $elem2, " ", "\n";
    }
    print "---------------------", "\n";
}
Any ideas?
Thanks
Looks to me like
my $coursenum = substr($elem,2,4)
should be
my $coursenum = substr($elem,3,3)
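To see why the offsets matter, here is a small sketch. The row text is hypothetical (the real text of div.ws-row may have different leading characters), so treat the offsets as an illustration rather than a verified fix.

```perl
use strict;
use warnings;

# Hypothetical row text; the real text of div.ws-row may differ.
my $elem = 'CS 103 Introduction to Programming';

my $with_space = substr($elem, 2, 4);    # " 103" (starts at the space)
my $digits     = substr($elem, 3, 3);    # "103"

# A regex does not depend on fixed offsets: take the first 3-digit word.
my ($coursenum) = $elem =~ /\b(\d{3})\b/;

print "[$with_space] [$digits] [$coursenum]\n";
```

With a leading space in the course number, the second-level URL would contain " 103.html" instead of "103.html". The regex variant is an alternative that tolerates extra whitespace before the department code.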
The easiest way to go in this case is to use HTML::TableExtract, in case you are looking for data from the table only.
I played a bit with your problem. You can get course id, title and link to individual course page within initial scraper:
my $courses = scraper {
    process 'div.ws-row',
        'course[]' => scraper {
            process 'div.ws-course-number', 'id' => 'TEXT';
            process 'div.ws-course-title', 'title' => 'TEXT';
            process 'div.ws-course-title a', 'link' => '@href';
        };
    result 'course';
};
The result of scraping is an arrayref of hashrefs like this:
{ id => "CS 103",
title => "Introduction to Programming",
link => bless(do{\(my $o = "http://courses.illinois.edu/cis/2011/spring/schedule/CS/103.html?skinId=2169")}, "URI::http"),
},
....
Then you can do additional scraping for each course from their individual pages and add such information into original structure:
for my $course (@$res) {
    my $crs_scraper = scraper {
        process 'div.ws-description', 'desc' => 'TEXT';
        # ... add more items here
    };
    my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
    # slice assignment to add them into course definition
    @{$course}{ keys %$additional_data } = values %$additional_data;
}
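The slice assignment in that loop can be tried on its own, without any scraping. A minimal sketch with made-up data (the `desc` and `credit` values are hypothetical):

```perl
use strict;
use warnings;

my $course          = { id => 'CS 103', title => 'Introduction to Programming' };
my $additional_data = { desc => 'Hypothetical description', credit => '4' };

# Hash slice on a hashref: assign several keys at once.
# keys and values return entries in the same order for an unmodified hash.
@{$course}{ keys %$additional_data } = values %$additional_data;

print "$_ => $course->{$_}\n" for sort keys %$course;
```

Any key present in both hashes is overwritten by the value from `$additional_data`, so keep the per-course field names distinct from `id`, `title` and `link` if the originals should survive.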
Source combined together is as follows:
use strict; use warnings;
use URI;
use Web::Scraper;
use Data::Dump qw(dump);
my $url = 'http://courses.illinois.edu/cis/2011/spring/schedule/CS/index.html?skinId=2169';
my $courses = scraper {
    process 'div.ws-row',
        'course[]' => scraper {
            process 'div.ws-course-number', 'id' => 'TEXT';
            process 'div.ws-course-title', 'title' => 'TEXT';
            process 'div.ws-course-title a', 'link' => '@href';
        };
    result 'course';
};
my $res = $courses->scrape(URI->new($url));
for my $course (@$res) {
    my $crs_scraper = scraper {
        process 'div.ws-description', 'desc' => 'TEXT';
        # ... add more items here
    };
    my $additional_data = $crs_scraper->scrape(URI->new($course->{link}));
    # slice assignment to add them into course definition
    @{$course}{ keys %$additional_data } = values %$additional_data;
}
dump $res;