Web::Scraper nested structures & elements only containing spesific data

Web::Scraper nested structures & elements only containing spesific data - perl

I have the following code to scrape a form for inputs and get the attributes id and name.
#!/usr/bin/perl
use warnings;
use strict;
use URI;
use Data::Dumper::Simple;
use Web::Scraper;
my $urlToScrape = "http://digitalarkivet.arkivverket.no/finn_kilde";
my $scrap = scraper {
process 'div.listGroup.open > ul.grouped > li.expandable', 'data[]' => scraper {
process 'input', 'id' => '#id', name => '#name';
process 'label', 'label_for' => '#for';
process 'span.listExpander ', 'Text' => 'TEXT';
process 'ul.sublist1', 'sublist[]' => scraper {
process 'input', 'id' => '#id', name => '#name';
process 'label', 'label_for' => '#for';
process 'span', 'label' => 'TEXT';
};
};
};
my $res = $scrap->scrape(URI->new($urlToScrape));
print Dumper($res);
which gives me (shortend $res to fit screen better)
$res = {
'data' => [
{
'label_for' => 'ka0',
'sublist' => [
{
'label' => 'Statlig folketelling',
'label_for' => 'ka0kt0',
'name' => 'kt[]',
'id' => 'ka0kt0'
}
],
'name' => 'ka[]',
'id' => 'ka0>',
'Text' => 'Folketellinger'
},
{
'sublist' => [
{
'label' => 'Manntall',
'name' => 'kt[]',
'label_for' => 'ka1kt0',
'id' => 'ka1kt0'
}
],
'label_for' => 'ka1',
'id' => 'ka1>',
'name' => 'ka[]',
'Text' => 'Manntall'
},
....
{
'label_for' => 'r0',
'sublist' => [
{
'label_for' => 'r0f0',
'id' => 'r0f0',
'name' => 'f[]',
'label' => "01 Østfold"
}
],
'id' => 'r0',
'name' => 'r[]',
'Text' => "Østlandet"
},
{
'Text' => "Sørlandet",
'id' => 'r1',
'sublist' => [
{
'label_for' => 'r1f0',
'name' => 'f[]',
'id' => 'r1f0',
'label' => '09 Aust-Agder'
}
],
'label_for' => 'r1',
'name' => 'r[]'
}
]
};
I' have 2 issues I need to fix. First, I only want to get data for inputs having 'name' = ka[] (at top level).
Second, I only get data for first ul.sublist1 (If you study the page I'm scraping you can see that several "Kildekategori" have subsets of data, which are revealed if expanded/ clicked upon. Putting brackets on Text[] only gets me the sublist textnames, but not their attributes.
I'm thinking I might have to grab data in 2 scrapes instead, since nested values are revealed by id and label_for.

Solved it by scraping three times, foreach "level"
#!/usr/bin/perl
use strict;
use warnings;
use URI;
use Web::Scraper;
use Data::Dumper::Simple;
my %site;
my #res;
my $i;
my $j;
my $label_for;
my #scrape;
$site{'siteID'} = 1;
$site{'url'} = "http://digitalarkivet.arkivverket.no/finn_kilde";
$site{'name'} = "finn_kilde";
open FIL, ">$site{'name'}.csv" or die $!;
my $seperator=";";
$scrape[0] = scraper {
process 'div.listGroup.open > ul.grouped > li.expandable', 'data[]' => scraper {
process 'input',
'id' => '#id',
'value' => '#value',
'type' => '#type',
'name' => '#name';
process 'label', 'label_for' => '#for';
process 'span.listExpander ', 'text' => 'TEXT';
};
};
$scrape[1] = scraper {
process 'ul.sublist1 > li', 'data[]' => scraper {
process 'input',
'id' => '#id',
'value' => '#value',
'type' => '#type',
'name' => '#name';
process 'label', 'label_for' => '#for';
process 'span', 'text' => 'TEXT';
}
};
$scrape[2] = scraper {
process 'ul.sublist2 > li', 'data[]' => scraper {
process 'input',
'id' => '#id',
'value' => '#value',
'type' => '#type',
'name' => '#name';
process 'label', 'label_for' => '#for';
process 'span', 'text' => 'TEXT';
}
};
for $i (0 .. $#scrape){
$res[$i] = $scrape[$i]->scrape(URI->new($site{'url'}));
unless ($i) {
print FIL join($seperator,"label_for","text","name","value","id","type")."\n";
}
for $j (0 .. $#{$res[$i]->{data}}) {
if (defined($res[$i]->{data}[$j]->{label_for})){
$label_for=$res[$i]->{data}[$j]->{label_for};
} else {
$label_for="";
}
if (length($label_for)>0) {
my $name=$res[$i]->{data}[$j]->{name};
my $text=$res[$i]->{data}[$j]->{text};
my $value=$res[$i]->{data}[$j]->{value};
my $id=$res[$i]->{data}[$j]->{id};
my $type=$res[$i]->{data}[$j]->{type};
my #row=($label_for,$text,$name,$value,$id,$type);
print FIL join($seperator,#row);
print FIL "\n";
}
}
sleep(2);
}
close FIL;
print Dumper(\#res);
1;

Related

Accessing a nested data structure

I have an array of hashes nested to multiple levels. I need to extract a value from all deeply-nested hashes that have a given value for a different key in the same hash
This is a collection of entities from our database, and the data represents contacts within each entity and all of their contact values.
There is a hash key contact_method_type_id which refers to an integer defining the type of contact method. The contact_method_type_id that I care about is 1, which is email.
The first contact has three different contact_methods. The first is 4 which is an office phone, the second is a 2 which is a home phone, and the third is a 1 which is email.
Within the same hash is there is a 'contact_method_value', which is the string representation of their email address.
I need a way to extract just these values into a new array
Here are the contents of the first element of the array
$VAR1 = [
{ 'total' => '2',
'results' => [
{ 'contact_type_name' => 'Primary Technical Contact',
'street' => undef,
'state_id' => undef,
'state_name' => undef,
'last_name' => 'Barb',
'entities' => [
{ 'entity_name' => 'XXXXX',
'entity_id' => 'XXXXX'
}
],
'state_abbr_name' => undef,
'city' => undef,
'country_id' => undef,
'latitude' => undef,
'contact_id' => 'XXXXXX',
'contact_type_id' => '1',
'roles' => [],
'contact_methods' => [
{ 'entity_name' => undef,
'contact_method_value' => 'XXXXXXX',
'contact_method_type_id' => '4',
'contact_method_id' => '24041',
'entity_id' => undef,
'contact_method_type_name' => 'Cell Phone'
},
{ 'entity_name' => undef,
'contact_method_value' => 'XXXXXX',
'contact_method_type_id' => '2',
'contact_method_id' => '24051',
'entity_id' => undef,
'contact_method_type_name' => 'Office Phone'
},
{ 'entity_name' => undef,
'contact_method_value' => 'example#example.com',
'contact_method_type_id' => '1',
'contact_method_id' => '24061',
'entity_id' => undef,
'contact_method_type_name' => 'Email'
}
],
'country_name' => undef,
'longitude' => undef,
'country_abbr_name' => undef,
'full_name' => 'NAME',
'networks' => [
{ 'network_name' => 'NET',
'network_id' => 'X'
}
],
'timezone_id' => undef,
'zip' => undef,
'timezone_name' => undef,
'title' => 'MAC/Network Specialist',
'first_name' => 'Terri'
},
{ 'contact_type_name' => 'Primary Technical Contact',
'street' => 'STREET',
'state_id' => undef,
'state_name' => undef,
'last_name' => 'NAME',
'entities' => [
{ 'entity_name' => 'NAME',
'entity_id' => '2679'
}
],
'state_abbr_name' => undef,
'city' => 'CITY',
'country_id' => undef,
'latitude' => undef,
'contact_id' => '7896',
'contact_type_id' => '1',
'roles' => [],
'contact_methods' => [
{ 'entity_name' => undef,
'contact_method_value' => 'example#example.com',
'contact_method_type_id' => '1',
'contact_method_id' => '16796',
'entity_id' => undef,
'contact_method_type_name' => 'Email'
},
{ 'entity_name' => undef,
'contact_method_value' => 'number',
'contact_method_type_id' => '2',
'contact_method_id' => '16797',
'entity_id' => undef,
'contact_method_type_name' => 'Office Phone'
}
],
'country_name' => undef,
'longitude' => undef,
'country_abbr_name' => undef,
'full_name' => 'NAME',
'networks' => [
{ 'network_name' => 'net',
'network_id' => '17'
}
],
'timezone_id' => undef,
'zip' => 'zip',
'timezone_name' => undef,
'title' => 'Infrastructure Manager',
'first_name' => 'name'
}
],
'offset' => '0'
},
...

This looks suspiciously like something that XML::Simple would have generated.
Assuming this is the case, then I would suggest that you've fallen for the classic mistake of assuming XML::Simple actually helps.
Under that assumption, if you instead use XML::Twig:
Taking your $VAR1. Although - ideally you'll just parse the original source with parse or parsefile:
use XML::Twig;
use XML::Simple;
my $twig = XML::Twig->parse( XMLout($VAR1) );
print $_->att('contact_method_value'), "\n" for $twig->findnodes('//contact_methods[#contact_method_type_name="Email"]');
Which given your sample (as $VAR1):
example#example.com
example#example.com
Edit: Because you've commented that it's JSON then I wouldn't necessarily do this (Although - it does actually work, despite that).

If the data structures are all of the same kind, this is very trivial. You just need to iterate all the outer hashrefs (I called those resultsets). Inside those, you need to look at all results, and in each result you need to look at all the contact methods. If one of them has a contact_method_type_id of 1, you take the contact_method_value. And that's it.
my #email_addresses;
foreach my $resultset ( #{$data} ) {
foreach my $result ( #{ $resultset->{results} } ) {
foreach my $contact ( #{ $result->{contact_methods} } ) {
push #email_addresses , [ $contact->{contact_method_value} ]
if $contact->{contact_method_type_id} == 1;
}
}
}
This code assumes your structure is called $data. #email_addresses looks like this when output.
[
[ 'EMAIL' ],
[ 'EMAIL' ]
];

If you have this on a database then you should use an SQL query to retrieve it, rather than fetching everything into memory and processing what you have
The output from Data::Dumper shows the contents of your data, but it doesn't explain what you're dealing with in your code. Specifically, you don't have a $VAR1 in your code, but I have no idea what you do have
In the end, I think I wouldn't start from here. But since it's the only starting point I have to work with, it's a simple matter of recursing through the data structure
I've assumed that you want
$VAR1->[*]{results}[*]{contact_methods}[*]{contact_method_value}
where
$VAR1->[*]{results}[*]{contact_methods}[*]{contact_method_type_name} eq 'Email'
Update
Since your comments I've altered my code to select the same values where
$VAR1->[*]{results}[*]{contact_methods}[*]{contact_method_type_id} == 1
Since you said nothing about your code at all, I've had to assume a variable $data which contains a reference to the array that you show in your question
for my $item ( #$data ) {
my $results = $item->{results};
for my $result ( #$results ) {
my $methods = $result->{contact_methods} or die;
for my $method ( #$methods ) {
#my $type_name = $method->{contact_method_type_name};
#next unless $type_name eq 'Email';
my $type_id = $method->{contact_method_type_id};
next unless $type_id == 1; ## Email
my $value = $method->{contact_method_value};
print "$value\n";
}
}
}
output
example#example.com
example#example.com

ElasticSearch (search_context_missing_exception) with Search::ElasticSearch::Scroll

I'm using Search::Elasticsearch and Search::Elasticsearch::Scroll for search and scroll into my elasticsearch server.
In scrolling process, for some querys, I'm seeing the next errors while I'm scrolling the search results:
2016/03/22 11:03:38 - 265885 FATAL: [Daemon.pm][8221]: Something gone wrong, error $VAR1 = bless( {
'msg' => '[Missing] ** [http://localhost:9200]-[404] Not Found, called from sub Search::Elasticsearch::Scroll::next at searcher.pl line 92. With vars: {\'body\' => {\'hits\' => {\'hits\' => [],\'max_score\' => \'0\',\'total\' => 5215},\'timed_out\' => bless( do{\\(my $o = 0)}, \'JSON::XS::Boolean\' ),\'_shards\' => {\'failures\' => [{\'index\' => undef,\'reason\' => {\'reason\' => \'No search context found for id [4920053]\',\'type\' => \'search_context_missing_exception\'},\'shard\' => -1},{\'index\' => undef,\'reason\' => {\'reason\' => \'No search context found for id [5051485]\',\'type\' => \'search_context_missing_exception\'},\'shard\' => -1},{\'index\' => undef,\'reason\' => {\'reason\' => \'No search context found for id [4920059]\',\'type\' => \'search_context_missing_exception\'},\'shard\' => -1},{\'index\' => undef,\'reason\' => {\'reason\' => \'No search context found for id [5051496]\',\'type\' => \'search_context_missing_exception\'},\'shard\' => -1},{\'index\' => undef,\'reason\' => {\'reason\' => \'No search context found for id [5051500]\',\'type\' => \'search_context_missing_exception\'},\'shard\' => -1}],\'failed\' => 5,\'successful\' => 0,\'total\' => 5},\'_scroll_id\' => \'c2NhbjswOzE7dG90YWxfaGl0czo1MjE1Ow==\',\'took\' => 2},\'request\' => {\'serialize\' => \'std\',\'path\' => \'/_search/scroll\',\'ignore\' => [],\'mime_type\' => \'application/json\',\'body\' => \'c2Nhbjs1OzQ5MjAwNTM6bHExbENzRDVReEc0OV9UMUgzd3Vkdzs1MDUxNDg1OnJrQ3lsUkRKVHRxRWRWeURoOTB4WVE7NDkyMDA1OTpscTFsQ3NENVF4RzQ5X1QxSDN3dWR3OzUwNTE0OTY6cmtDeWxSREpUdHFFZFZ5RGg5MHhZUTs1MDUxNTAwOnJrQ3lsUkRKVHRxRWRWeURoOTB4WVE7MTt0b3RhbF9oaXRzOjUyMTU7\',\'qs\' => {\'scroll\' => \'1m\'},\'method\' => \'GET\'},\'status_code\' => 404}
',
'stack' => [
[
'searcher.pl',
92,
'Search::Elasticsearch::Scroll::next'
]
],
'text' => '[http://localhost:9200]-[404] Not Found',
'vars' => {
'body' => {
'hits' => {
'hits' => [],
'max_score' => '0',
'total' => 5215
},
'timed_out' => bless( do{\(my $o = 0)}, 'JSON::XS::Boolean' ),
'_shards' => {
'failures' => [
{
'index' => undef,
'reason' => {
'reason' => 'No search context found for id [4920053]',
'type' => 'search_context_missing_exception'
},
'shard' => -1
},
{
'index' => undef,
'reason' => {
'reason' => 'No search context found for id [5051485]',
'type' => 'search_context_missing_exception'
},
'shard' => -1
},
{
'index' => undef,
'reason' => {
'reason' => 'No search context found for id [4920059]',
'type' => 'search_context_missing_exception'
},
'shard' => -1
},
{
'index' => undef,
'reason' => {
'reason' => 'No search context found for id [5051496]',
'type' => 'search_context_missing_exception'
},
'shard' => -1
},
{
'index' => undef,
'reason' => {
'reason' => 'No search context found for id [5051500]',
'type' => 'search_context_missing_exception'
},
'shard' => -1
}
],
'failed' => 5,
'successful' => 0,
'total' => 5
},
'_scroll_id' => 'c2NhbjswOzE7dG90YWxfaGl0czo1MjE1Ow==',
'took' => 2
},
'request' => {
'serialize' => 'std',
'path' => '/_search/scroll',
'ignore' => [],
'mime_type' => 'application/json',
'body' => 'c2Nhbjs1OzQ5MjAwNTM6bHExbENzRDVReEc0OV9UMUgzd3Vkdzs1MDUxNDg1OnJrQ3lsUkRKVHRxRWRWeURoOTB4WVE7NDkyMDA1OTpscTFsQ3NENVF4RzQ5X1QxSDN3dWR3OzUwNTE0OTY6cmtDeWxSREpUdHFFZFZ5RGg5MHhZUTs1MDUxNTAwOnJrQ3lsUkRKVHRxRWRWeURoOTB4WVE7MTt0b3RhbF9oaXRzOjUyMTU7',
'qs' => {
'scroll' => '1m'
},
'method' => 'GET'
},
'status_code' => 404
},
'type' => 'Missing'
}, 'Search::Elasticsearch::Error::Missing' );
The code I'm using is the next one (simplified) :
# Retrieve scroll
my $scroll = $self->getScrollBySignature($item);
# Retrieve all affected documents ids
while (my #docs = $scroll->next(500)) {
# Do stuff with #docs
}
The function getScrollBySignature have the next code in order to call to elasticSearch
my $scroll = $self->{ELASTIC}->scroll_helper(
index => $self->{INDEXES},
search_type => 'scan',
ignore_unavailable => 1,
body => {
size => $self->{PAGINATION},
query => {
filtered => {
filter => {
bool => {
must => [{term => {signature_id => $item->{profileId}}}, {terms => {channel_type_id => $type}}]
}
}
}
}
}
);
As you can see, I'm doing the scroll without passing scroll parameter then as documentation says, the time that scroll is alive is 1 min.
The elasticSearch is a cluster of 3 servers, and the query that ends with that error retrieves a bit more than 5000 docs.
My first solution was to update the life time for scroll to 5 minutes and the error didn't appear.
The question is, as I understand every time I'm calling $scroll->next() the life time off scroll affected is upgraded 1m more, then how is possible to receive those context related errors?
I'm doing something in a bad manner?
Thank you all.

The first thing that comes to mind is that the timer is not updated. Have you checked this? You can do a query every 10 seconds for example and see if at the 6th query it gives you the error ...

Well, a good rule of thumb is inside a ->next() block, don't stay by iteration more than time that you've configured in scroll.
Between each call of ->next() you cannot stay more than that time configured. If you stay more, the scroll may be not be there and the error earch_context_missing_exception will appear.
My solution for this problem was inside next block only store data into array/hash structure and once the scroll process ended work with all data.
The solution of the question example:
# Retrieve scroll
my $scroll = $self->getScrollBySignature($item);
# Retrieve all affected documents ids
my #allDocs;
while (my #docs = $scroll->next(500)) {
push #allDocs, map {$_->{_id}} #docs
}
foreach (#allDocs) {
# Do stuff with doc
}

Perl extracting data from a hash

I am trying to extract data from a hash value.
my $triggers = $zabbix->raw('trigger','get', $options);
print Dumper($triggers);
foreach my $trigger (#{$triggers} )
{
push #triggerid,$trigger->{'triggerid'};
my #t=$trigger->{'hosts'};
my $lt = localtime($trigger->{'lastchange'});
print "$trigger->{'description'} $lt \n";
}
Output of Dumper is
[
{
'hosts' => [
{
'hostid' => '19914',
'host' => 'pc10bcf18.syd.sf.priv'
}
],
'priority' => '2',
'status' => '0',
'templateid' => '10652913',
'comments' => '',
'state' => '0',
'triggerid' => '10653191',
'expression' => '{15070357}#1',
'error' => '',
'url' => '',
'flags' => '0',
'value' => '1',
'name' => 'pc10_BizX_A_CF',
'description' => 'pc10bcf18.syd.sf.priv: Core Path not \'/dumps/java/core\' (Path=/export/home/jboss/j...)',
'value_flags' => '0',
'lastchange' => '1429181103',
'type' => '0'
},
]
From my above code, i was able to print 'description'. How do i access and print the value of 'host' value?

To maintain the for / push pattern that you have already coded, you can write this
my $triggers = $zabbix->raw('trigger', 'get', $options);
my #triggerid;
for my $trigger ( #$triggers ) {
push #triggerid, $trigger->{triggerid};
my #hosts;
my $hosts = $trigger->{hosts};
for my $host ( #$hosts ) {
push #hosts, $host->{host};
}
my $lt = localtime($trigger->{lastchange});
print "$trigger->{description} $lt\n";
}

Looks like there can be more than one host, so
my #hosts =
map { $_->{host} }
#{ $trigger->{hosts} };
To get the first one (assuming there will always be at least one),
my $first_host = $trigger->{hosts}[0]{host};

$triggers->{'hosts'}->[0]->{'host'}

Not an array reference message is thrown when using XML:Simple

I am new to perl and trying to parse XML using XML:Simple
My xml is
<suite suiteId="45" instanceId="3485">
<project>Test project</project>
<testcase id="2346" name="abc" suite="TEst1" priority="1" severity="1" owner="domain" category="BAT" timeout="10">
<description>Checking Test1</description>
<testExecTimeInMins>2</testExecTimeInMins>
<status>Failed</status>
<testServer id="86" name="host1" ip="1.2.3.4" platform="Linux" database="MySQL" buildNo="" />
<error></error>
</testcase>
<testcase id="2346456" name="abc123" suite="TEst2" priority="1" severity="1" owner="domain" category="BAT" timeout="10">
<description>Checking Test2</description>
<testExecTimeInMins>6</testExecTimeInMins>
<status>Passed</status>
<testServer id="86" name="host1" ip="1.2.3.4" platform="Linux" database="MySQL" buildNo="" />
<error />
</testcase>
</suite>
How do I get the values of testcase id, name, suite?
How to get the values for testServer, id, name?
I tried to access it as shown below but it throws "Not an Array reference at"
$XMLData = XMLin($targetFile);
foreach my $testcases (#{$XMLData->{testcase}}){
$logger->info("$testcases->{id}");
}

If you examine the actual data that is coming out of XMLin, you would see that you don't have an array reference (just like the error message states), but instead have a hashref that is keyed on the name of the testcase:
use strict;
use warnings;
use Data::Dumper;
use XML::Simple;
my $xml = XMLin("/Users/mcmillhj/temp.xml");
print Dumper $xml;
__DATA__
{
'instanceId' => '3485',
'project' => 'Test project',
'testcase' => {
'abc123' => {
'owner' => 'domain',
'priority' => '1',
'status' => 'Passed',
'suite' => 'TEst2',
'testExecTimeInMins' => '6',
'description' => 'Checking Test2',
'timeout' => '10',
'error' => {},
'category' => 'BAT',
'id' => '2346456',
'severity' => '1',
'testServer' => {
'database' => 'MySQL',
'buildNo' => '',
'ip' => '1.2.3.4',
'name' => 'host1',
'platform' => 'Linux',
'id' => '86'
}
},
'abc' => {
'owner' => 'domain',
'priority' => '1',
'status' => 'Failed',
'suite' => 'TEst1',
'testExecTimeInMins' => '2',
'description' => 'Checking Test1',
'timeout' => '10',
'error' => {},
'category' => 'BAT',
'id' => '2346',
'severity' => '1',
'testServer' => {
'database' => 'MySQL',
'buildNo' => '',
'ip' => '1.2.3.4',
'name' => 'host1',
'platform' => 'Linux',
'id' => '86'
}
}
},
'suiteId' => '45'
};
To get all testcases, you just need to iterate over the keys:
foreach my $testcasename ( keys %{ $XMLData->{testcase} } ){
$logger->info($XMLdata->{testcase}->{$testcasename}->{id});
}

trying to parse text and html from email over IMAP using MAIL::IMAPClient but text is hidden in parts and multi-parts

I can connect to the IMAP mail server easy enough:
use Mail::IMAPClient;
use MIME::Base64;
use MIME::Parser;
my $imap = Mail::IMAPClient->new(
Server => '192.168.2.2',
User => 'xxxxxx',
Password => 'yyyyyy',
Ssl => 1,
Uid => 1,
);
my $folders = $imap->folders
or die "List folders error: ", $imap->LastError, "\n";
print "Folders: #$folders\n";
$sfolder="INBOX.2012";
$imap->select( $sfolder )
or die "Select '$Opt{sfolder}' error: ", $imap->LastError, "\n";
my #msgs = $imap->messages or die "Could not messages: $#\n";
However, the text and html I want is not easily parsed due to codes like this:
Content-Transfer-Encoding:base64
Content-Type:text/html; charset=utf-8
Content-Transfer-Encoding:base64
Content-Type:text/html; charset=utf-8
Content-Transfer-Encoding:
Content-Type:multipart/mixed; boundary="----------=_4F0F4830.7079357A"
Multipart
Content-Transfer-Encoding:
Content-Type:multipart/mixed; boundary="----=_Part_4487195_1184536749.1326753403034"
Multipart
Content-Transfer-Encoding:
Content-Type:multipart/alternative; boundary=--boundary_164442_d184e417-739f-
46d6-824a-6ea1846e79de
Multipart
Content-Transfer-Encoding:
Content-Type:multipart/mixed; boundary="----=_Part_3882878_23916831.1326509484032"
Multipart
Content-Transfer-Encoding:
I tried this but it only works on a tiny number of different encodings.
if ($imap->get_header($msg,"Content-Transfer-Encoding")=~ /base64/i) {
print "\nMatch base64";
if ($imap->get_header($msg,"Content-Type")=~m/text/i ) {
push(#mail,decode_base64($imap->body_string($msg)));
}
elsif ($imap->get_header($msg,"Content-Type")=~m/image/i )
{ print "\nImage detected"; }
elsif ($imap->get_header($msg,"Content-Type")=~m/application/i )
{ print "\nApplication detected"; }
There are 7bit and 8bit variants and other encoding methods that contain the html or text I want for later use. I successfully use decode_base64() to decode base64. The worse ones to decode are the ones that contain multi-part codes. I feel like I am re-inventing the wheel and there must be a library or module that can do all the heavy lifting for me.
Other content types such as .jpg,.gif, and .pdf should simply be ignored. The multi-part emails contain at least 1 part that I an interested but many that are useless to me.
After further research this structure has some of the information I need but don't know how to get it out efficiently is another matter.
Dumping:$VAR1 = bless( {
'bodyparms' => {
'boundary' => '----=_NextPart_002_BC64_7D688C1F.A2FF9BE0'
},
'bodyextra' => undef,
'_top' => 1,
'bodydisp' => 'NIL',
'_id' => 'HEAD',
'bodysubtype' => 'mixed',
'PartsIndex' => {
'1.3' => bless( {
'bodyparms' => 'NIL',
'bodyid' => '<d9e26cc0-019c-4ac0-9b1e-9c9ac8424f52>',
'bodyextra' => 'NIL',
'bodydisp' => 'NIL',
'_id' => '1.3',
'bodysubtype' => 'jpeg',
'_prefix' => '1.3',
'bodysize' => '4808',
'bodytype' => 'image',
'bodyMD5' => 'NIL',
'bodylang' => 'NIL',
'bodydesc' => 'NIL',
'bodyenc' => 'base64'
}, 'Mail::IMAPClient::BodyStructure' ),
'1.1' => bless( {
'bodyparms' => {
'boundary' => '----=_NextPart_000_36AE_880DDD08.0A776E35'
},
'bodyextra' => undef,
'bodydisp' => 'NIL',
'_id' => '1.1',
'bodysubtype' => 'alternative',
'_prefix' => '1.1',
'bodytype' => 'MULTIPART',
'bodystructure' => [
bless( {
'bodyparms' => {
'charset' => 'utf-8'
},
'bodyextra' => 'NIL',
'bodyid' => 'NIL',
'bodydisp' => 'NIL',
'_id' => '1.1.1',
'bodysubtype' => 'PLAIN',
'_prefix' => '1.1.1',
'bodysize' => '1971',
'bodytype' => 'TEXT',
'bodyMD5' => 'NIL',
'textlines' => '74',
'bodylang' => 'NIL',
'bodydesc' => 'NIL',
'bodyenc' => 'quoted-printable'
}, 'Mail::IMAPClient::BodyStructure' ),
bless( {
'bodyparms' => {
'charset' => 'utf-8'
},
'bodyextra' => 'NIL',
'bodyid' => 'NIL',
'bodydisp' => 'NIL',
'_id' => '1.1.2',
'bodysubtype' => 'HTML',
'_prefix' => '1.1.2',
'bodysize' => '23364',
'bodytype' => 'TEXT',
'bodyMD5' => 'NIL',
'textlines' => '331',
'bodylang' => 'NIL',
'bodydesc' => 'NIL',
'bodyenc' => 'quoted-printable'
}, 'Mail::IMAPClient::BodyStructure' )
],
'bodyloc' => 'NIL',
'bodylang' => 'NIL'
}, 'Mail::IMAPClient::BodyStructure' ),
'1' => bless( {
'bodyparms' => {
'boundary' => '----=_NextPart_001_EA96_2BF8DEDE.32622D51'
},
'bodyextra' => undef,
'bodydisp' => 'NIL',
'_id' => 1,
'bodysubtype' => 'related',
'_prefix' => 1,
'bodytype' => 'MULTIPART',
'bodystructure' => [
$VAR1->{'PartsIndex'}{'1.1'},
bless( {
'bodyparms' => 'NIL',
'bodyid' => '<5dff39db-e81c-4410-be75-8662564fd328>',
'bodyextra' => 'NIL',
'bodydisp' => 'NIL',
'_id' => '1.2',
'bodysubtype' => 'jpeg',
'_prefix' => '1.2',
'bodysize' => '14406',
'bodytype' => 'image',
'bodyMD5' => 'NIL',
'bodylang' => 'NIL',
'bodydesc' => 'NIL',
'bodyenc' => 'base64'
}, 'Mail::IMAPClient::BodyStructure' ),
$VAR1->{'PartsIndex'}{'1.3'},
bless( {
'bodyparms' => 'NIL',
'bodyid' => '<717f2ef4-f795-4d1c-87cc-283c9b0a59b0>',
'bodyextra' => 'NIL',
'bodydisp' => 'NIL',
'_id' => '1.4',
'bodysubtype' => 'gif',
'_prefix' => '1.4',
'bodysize' => '2912',
'bodytype' => 'image',
'bodyMD5' => 'NIL',
'bodylang' => 'NIL',
'bodydesc' => 'NIL',
'bodyenc' => 'base64'
}, 'Mail::IMAPClient::BodyStructure' )
],
'bodyloc' => 'NIL',
'bodylang' => 'NIL'
}, 'Mail::IMAPClient::BodyStructure' ),
'1.2' => $VAR1->{'PartsIndex'}{'1'}{'bodystructure'}[1],
'1.1.2' => $VAR1->{'PartsIndex'}{'1.1'}{'bodystructure'}[1],
'2' => bless( {
'bodyparms' => {
'name' => 'BKD-7361945220.pdf'
},
'bodyid' => 'NIL',
'bodyextra' => 'NIL',
'bodydisp' => {
'attachment' => {
'filename' => 'BKD-7361945220.pdf'
}
},
'_id' => 2,
'bodysubtype' => 'octetstream',
'_prefix' => 2,
'bodysize' => '47540',
'bodytype' => 'application',
'bodyMD5' => 'NIL',
'bodystructure' => [],
'bodylang' => 'NIL',
'bodydesc' => 'NIL',
'bodyenc' => 'base64'
}, 'Mail::IMAPClient::BodyStructure' ),
'1.4' => $VAR1->{'PartsIndex'}{'1'}{'bodystructure'}[3],
'1.1.1' => $VAR1->{'PartsIndex'}{'1.1'}{'bodystructure'}[0]
},
'_prefix' => 'HEAD',
'PartsList' => [
1,
'1.1',
'1.1.1',
'1.1.2',
'1.2',
'1.3',
'1.4',
2
],
'bodytype' => 'MULTIPART',
'bodystructure' => [
$VAR1->{'PartsIndex'}{'1'},
$VAR1->{'PartsIndex'}{'2'}
],
'bodyloc' => 'NIL',
'bodylang' => 'NIL'
}, 'Mail::IMAPClient::BodyStructure' );
As you can see none of the values are guaranteed to be part of every part on the PartsIndex and some them are nested.
variable of interest for each PartsIndex item:
bodytype
bodysubtype
bodyenc

Parse mail messages with Courriel:
use strictures;
use Mail::IMAPClient qw();
use Courriel qw();
sub walk_parts {
my ($obj, $callback) = #_;
if ($obj->is_multipart) {
for my $part ($obj->parts) {
walk_parts($part, $callback);
}
} else {
$callback->($obj);
}
}
my $imap = Mail::IMAPClient->new(
…
) or die $#;
my $folders = $imap->folders
or die $imap->LastError;
$imap->select('INBOX')
or die $imap->LastError;
my #messages = $imap->messages
or die $imap->LastError;
for my $id (#messages) {
my $raw = $imap->message_string($id)
or die $imap->LastError;
my $email = Courriel->parse(text => $raw);
walk_parts $email, sub {
my ($part) = #_;
my $content = $part->content;
my $type = $part->mime_type;
}
}

I tried using a couple of prebuilt modules but they had too many dependencies and was hard to work with. This solution adds no dependencies beyond the original. I also had issues with the dependencies for libMagic, see above, and I did not want anyone who uses my program to have to deal with that issue either.
You have to call decode twice once for the main parent, and again for each child. Since this $imap->get_bodystructure($msg); contains all the information you need why add dependencies where none are needed. It took many many hours to figure out how to decode it manually, but it was worth it.
You can add whatever decoders you want to the decode() subroutine. I only need to decode the text/html and base64 encoded versions there of. The IMAPClient functions give you a list of all parents and children so you don't have to go making a list by yourself. The tricky part is you can have any number of parent each with any number of children, but only the children contain useful data. The parents can be ignored, since many of their values are blank,undef, or 'NIL' (literally). In fact a vast number of variables have the value of 'NIL'. Even ones that the email client could have answered for the user like bodyMD5 and bodylang are USUALLY equal 'NIL'. Due to the overwhelming use of 'NIL' parsing and using other fields may prove futile. Depend on your imap server and the people you recieve email from you mileage may vary.
If you have further questions leave a comment.
use Mail::IMAPClient;
use MIME::Base64;
use MIME::Parser;
sub decode {
($process, $imap) =#_;
if ($process->bodytype eq "TEXT") {
print "\n Text SubType:".$process->bodysubtype;
if ($process->bodyenc eq "base64") {
return decode_base64($imap->bodypart_string($msg,$process->id));
}
elsif (index(" -7bit- -8bit- -quoted-printable- ",lc($process->bodyenc)) !=-1 ) {
return $imap->bodypart_string($msg,$process->id);
}
print "\n==========Insert new decoder here============";
print "\n".$imap->bodypart_string($msg,$process->id);
print "\n=================================================";
}
return "";
}
#insert your login code with credentials here
$imap->select( $sfolder )
or die "Select '$Opt{sfolder}' error: ", $imap->LastError, "\n";
my #msgs = $imap->messages or die "Could not messages: $#\n";
foreach $msg (#msgs) {
my $raw = $imap->message_string($msg)
or die $imap->LastError;
$struct = $imap->get_bodystructure($msg);
#MULTIPART is a container designation and does not contain anything useful by itself.
#However it will still process all of the children that have content
if ($struct->bodytype ne "MULTIPART") { print "\n BodyEnc:".$struct->bodyenc();}
$rDecode=decode($struct,$imap);
#do not insert blanks.
if ($rDecode ne "" && (length($rDecode)>2)) {push(#mail,$rDecode); }
foreach $dumpme ($struct->bodystructure()) {
if ($dumpme->bodytype() eq "MULTIPART") {next;}
$rDecode="";
$rDecode=decode($dumpme,$imap);
#do not insert blanks.
if (($rDecode ne "") && (length($rDecode)>2) ) {
push(#mail,$rDecode); }
}
}

You need a MIME parser. Unfortunately, even then, you will need some normalization of your own, because there are multiple ways to represent the same data in MIME.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Web::Scraper nested structures & elements only containing spesific data - perl

Related

Accessing a nested data structure

ElasticSearch (search_context_missing_exception) with Search::ElasticSearch::Scroll

Perl extracting data from a hash

Not an array reference message is thrown when using XML:Simple

trying to parse text and html from email over IMAP using MAIL::IMAPClient but text is hidden in parts and multi-parts

Categories

Resources