how to check xml file line by line with perl script - perl

i would like to compare the two file one is user's input file txt file and another file is config file which is xml file. if user's input file value is match with config file then show matched function.
this is user's input file
L84A:FIP:70:155:15:18:
L83A:55FIP:70:155:15:
In the above file: L84A is Design_ID, FIP is Process_ID, and 70 to 18 is register_ID.
this is config file
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Sigma>
<Run>
<DESIGN_ID>L83A</DESIGN_ID>
<PROCESS_ID>55FIP</PROCESS_ID>
<RegisterList>
<Register>70</Register>
<Register>155</Register>
</RegisterList>
</Run>
<Run>
<DESIGN_ID>L83A</DESIGN_ID>
<PROCESS_ID>FRP</PROCESS_ID>
<RegisterList>
<Register>141</Register>
<Register>149</Register>
<Register>151</Register>
</RegisterList>
</Run>
<Run>
<DESIGN_ID>L84A</DESIGN_ID>
<PROCESS_ID>55FIP</PROCESS_ID>
<RegisterList>
<Register>70</Register>
<Register>155</Register>
</RegisterList>
</Run>
</Sigma>
so in this case output should show:
L84A: doesn't has FIP process ID in config file.
L83A:
55FIP
70 - existing register ID
155 - existing register ID
15 - no existing register ID.
my code doesn't check respective process ID and register ID .it shows below.
L84A
FIP
70 - existing register ID
155 - existing register ID
15 - existing register ID
18 - no existing register ID
L83A
55FIP
70 - existing register ID
155 - existing register ID
15 - existing register ID
below is my code:
use strict;
use warnings;
use vars qw($file1 $file1cnt #output);
use XML::Simple;
use Data::Dumper;
# create object
my $xml = new XML::Simple;
# read XML file
my $data = $xml->XMLin("sigma_loader.xml");
my $file1 = "userinput.txt";
readFileinString($file1, \$file1cnt);
while($file1cnt=~m/^((\w){4})\:([^\n]+)$/mig)
{
my $DID = $1;
my $reqconfig = $3;
while($reqconfig=~m/^((\w){5})\:([^\n]+)$/mig) #Each line from user request
{
my $example1 = $1; #check for FPP/QBPP process
my $example2 = $3; #display bin full lists.
if(Dumper($data) =~ $DID)
{
print"$DID\n";
if(Dumper($data) =~ $example1)
{
print"$example1\n";
my #second_values = split /\:/, $example2;
foreach my $sngletter(#second_values)
{
if( Dumper($data) =~ $sngletter)
{
print"$sngletter - existing register ID\n";
}
else
{
print"$sngletter - no existing register ID\n";
}
}
}
else
{
print"$DID doesn't has $example1 process ID in config file\n";
}
}
else
{
print"new Design ID deteced\n";
}
}
while($reqconfig=~m/^((\w){3})\:([^\n]+)$/mig) #Each line from user request
{
my $example1 = $1; #check for FPP/QBPP process
my $example2 = $3; #display bin full lists.
if(Dumper($data) =~ $DID)
{
print"$DID\n";
if(Dumper($data) =~ $example1)
{
print"$example1\n";
my #second_values = split /\:/, $example2;
foreach my $sngletter(#second_values)
{
if( Dumper($data) =~ $sngletter)
{
print"$sngletter - existing register ID\n";
}
else
{
print"$sngletter - no existing register ID\n";
}
}
}
else
{
print"$DID doesn't has $example1 process ID in config file\n";
}
}
else
{
print"new Design ID deteced\n";
}
}
}
sub readFileinString
#------------------>
{
my $File = shift;
my $string = shift;
use File::Basename;
my $filenames = basename($File);
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}

There are a couple of things in your code that do not really make sense, like using Data::Dumper and parsing the output with a regular expression. I'm not going to review your code as that is off-topic on Stack Overflow, but instead going to give you an alternate solution and walk you through it.
Please note that XML::Simple is not a great tool. Its use is discouraged because it is very bad at handling certain cases. But for your very simple XML structure it will work, so I have kept it.
use strict;
use warnings;
use XML::Simple;
use feature 'say';
# read XML file and reorganise it for easier use
my $data;
foreach my $run (#{XMLin(\*DATA)->{Run}}) {
$data->{$run->{DESIGN_ID}}->{$run->{PROCESS_ID}} =
{map { $_ => 1 } #{$run->{RegisterList}->{Register}}};
}
# read the text file - I've skipped the read
my #user_input = qw(
L84A:FIP:70:155:15:18:
L83A:55FIP:70:155:15:
);
foreach my $line (#user_input) {
chomp $line
; # we don't need this in my example, but you do when you read from a file
my ($design_id, $process_id, #register_ids) = split /:/, $line;
# extra error checking just in case
if (not exists $data->{$design_id}) {
say "$design_id does't exist in data";
next;
}
if (not exists $data->{$design_id}->{$process_id}) {
say "$design_id: doesn't have $process_id";
next;
}
say "$design_id:";
say " $process_id";
foreach my $register_id (#register_ids) {
if (exists $data->{$design_id}->{$process_id}->{$register_id}) {
say " $register_id - existing register ID";
}
else {
say " $register_id - no existing register ID";
}
}
}
__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Sigma>
<Run>
<DESIGN_ID>L83A</DESIGN_ID>
<PROCESS_ID>55FIP</PROCESS_ID>
<RegisterList>
<Register>70</Register>
<Register>155</Register>
</RegisterList>
</Run>
<Run>
<DESIGN_ID>L83A</DESIGN_ID>
<PROCESS_ID>FRP</PROCESS_ID>
<RegisterList>
<Register>141</Register>
<Register>149</Register>
<Register>151</Register>
</RegisterList>
</Run>
<Run>
<DESIGN_ID>L84A</DESIGN_ID>
<PROCESS_ID>55FIP</PROCESS_ID>
<RegisterList>
<Register>70</Register>
<Register>155</Register>
</RegisterList>
</Run>
</Sigma>
I've made a few assumptions.
You already know how to read the text file, so I've stuck that into an array line by line. Your file reading code has some issues though, you should be using three-arg open and lexical filehandles. Your call to open should look like this:
open my $fh, '<', $filename or die "$!: error...";
Alternatively, consider using Path::Tiny.
I'm taking the XML file from the __DATA__ section. This is like a filehandle.
So let's look at my code.
When we read the XML structure, it looks like this straight out of XMLin.
\ {
Run [
[0] {
DESIGN_ID "L83A",
PROCESS_ID "55FIP",
RegisterList {
Register [
[0] 70,
[1] 155
]
}
},
[1] {
DESIGN_ID "L83A",
PROCESS_ID "FRP",
RegisterList {
Register [
[0] 141,
[1] 149,
[2] 151
]
}
},
[2] {
DESIGN_ID "L84A",
PROCESS_ID "55FIP",
RegisterList {
Register [
[0] 70,
[1] 155
]
}
}
]
}
This is not very useful for what we plan to do, so we have to rearrange it. I want to use exists on hash references later, to make it easier to look up if there are matches for the IDs we are looking at. This is called a lookup hash. We can through away the ->{Run} key as XML::Simple combines all <Run> elements into an array reference, and the <Sigma> tag is just skipped because it's the root element.
Every Design ID can have multiple Processes, so we organise these two hierarchically, and we put in another lookup hash, where every register is a key, and we just use 1 as a key. The key does not matter.
This gives us a different data structure:
\ {
L83A {
55FIP {
70 1,
155 1
},
FRP {
141 1,
149 1,
151 1
}
},
L84A {
55FIP {
70 1,
155 1
}
}
}
That's much easier to understand and use later on.
Now we parse the user input, and iterate over each line. The format seems clear. It's a bit like a CSV file, but using colons :, so we can split. This gives us the two IDs, and all following values are registers, so we stick them in an array.
Your example doesn't have a case where the Design ID does not exist in the XML file, but given this is based on user input, we should check anyway. In the real world data is always dirty.
We can then check if the $process_id exists inside the $design_id in our data. If it does not, we tell the user and skip to the next line.
Then we have to iterate all the Register IDs. Either the $register_id exists in our second lookup hash, or it doesn't.
This gives us the exact output you're expecting.
L84A: doesn't have FIP
L83A:
55FIP
70 - existing register ID
155 - existing register ID
15 - no existing register ID
This code is much shorter, easier to read and runs faster. I've used Data::Printer to show the data structures.

Related

Perl: should the function TreeBuilder be adapted when it is in a loop foreach?

My code is to enter an actor name and the program, via the given actor's filmography in IMDB, lists on a hash table all the cinematic genres of the movies he has acted in as well as their frequency. However, I have a problem: When I type a name like "brad pitt" or "bruce willis" after running the program at the prompt, execution takes indefinitely. How do you know what the problem is?
Another problem: when I type "nicolas bedos" (an actor name that I entered from the beginning), it works but it seems that the index is only made for a single movie selected in the #url_links list. Should the look_down function of the TreeBuilder module within a foreach loop be adapted? I was telling myself that the #genres list was overwritten on each iteration so I added a push () but the result remains the same.
use LWP::Simple;
use PerlIO::locale;
use HTML::TreeBuilder;
use WWW::Mechanize;
binmode STDOUT, ':locale';
use strict;
use warnings;
print "Enter the actor's name:";
my $acteur1 = <STDIN>; # the user enters the name of the actor
print "We will analyze the filmography of the actor $actor1 by genre\n";
#we put the link with the given actor in Mechanize variable in order to browse the internet links
my $lien1 = "https://www.imdb.com/find?s=nm&q=$acteur1";
my $mech = WWW::Mechanize->new();
$mech->get($lien1); #we access the search page with the get function
$mech->follow_link( url_regex => qr/nm0/i ); #we access the first result using the follow_link function and the regular expression nm0 which is in the URL
my #url_links= $mech->find_all_links( url_regex => qr/title\/tt/i ); #owe insert in an array all the links having as regular expression "title" in their URL
my $nb_links = #url_links; #we record the number of links in the list in this variable
my $tree = HTML::TreeBuilder->new(); #we create the TreeBuilder module to access a specific text on the page via the tags
my %index; #we create a hashing table
my #genres = (); #we create the genre list to insert all the genres encountered
foreach (#url_links) { #we make a loop to browse all the saved links
my $mech2 = WWW::Mechanize->new();
my $html = $_->url(); #we take the url of the link
if ($html =~ m=^/title=) { #if the url starts with "/title"
$mech2 ->get("https://www.imdb.com$html"); #we complete the link
my $content = $mech2->content; #we take the content of the page
$tree->parse($content); #we access the url and we use the tree to find the strings that interest us
#genres = $tree->look_down ('class', 'see-more inline canwrap', #We have as criterion to access the class = "see-more .."
sub {
my $link = $_[0]->look_down('_tag','a'); #new conditions: <a> tags
$link->attr('href') =~ m{genres=}; #autres conditions: "genres" must be in the URL
}
);
}
}
my #genres1 = (); #we create a new list to insert the words found (the genres of films)
foreach my $e (#genres){ #we create a loop to browse the list
my $genre = $e->as_text; #the text of the list element is inserted into the variable
#genres1 = split(/[à| ]/,$genre); #we remove the unnecessary characters that are spaces, at and | which allow to keep that the terms of genre cine
}
foreach my $e (#genres1){ #another loop to filter listing errors (Genres: etc ..) and add the correct words to the hash table
if ($e ne ("Genres:" or "") ) {
$index{$e}++;
}
}
$tree->delete; #we delete the tree as we no longer need it
foreach my $cle (sort{$index{$b} <=> $index{$a}} keys %index){
print "$cle : $index{$cle}\n"; #we display the hash table with the genres and the number of times that appear in the filmography of the given actor
}
Thank you in advance for your help,
wobot
 
The IMDB Conditions of Use say this:
Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
So you might want to reconsider what you're doing. Perhaps you could look at the OMDB API instead.

Use Archive::Zip to determine if a member is a text file or not

I'm working on a script that will grep the contents of members of zip archives when the member name matches a pattern, using a given search string.
I have the following sub that processes a single archive (the script can take more than one archive on the command line):
sub processArchive($$$$) {
my ($zip, $searchstr, $match, $zipName) = #_;
print "zip[$zip] searchstr[$searchstr] match[$match] zipName[$zipName]\n";
my #matchingList = $zip->membersMatching($match);
my $len = #matchingList;
if ($len > 0) {
print $zipName . ":\n";
for my $member (#matchingList) {
print "member[$member]\n";
print "textfile[" . $member->isTextFile() . "] contents[" . $member->contents() . "]\n";
if ($member->isTextFile()) {
print "Is a text file.\n";
}
else {
print "Is not a text file.\n";
}
my #matchingLines = grep /$searchstr/, $member->contents();
my $len = #matchingLines;
if ($len > 0) {
print #matchingLines;
}
}
}
}
The logic isn't even complete yet. I'm first experimenting with calling "isTextFile()" to see what it does. I must be doing something wrong, because I get "Is not a text file" for at least one member that is clearly a text file.
I also note that when I print the value of the return from "isTextFile()", it's always an empty string. Is that what I should expect from printing a "true" or "false" value, or is something else wrong here?
The "text file" status is read from a flag in the ZIP file. Many archiving tools do not set this flag properly, as it is rarely used and has no impact on normal use.
If you actually need to check whether a file contains text, you will need to extract it and see for yourself.

Perl parse email and attachments from Outlook inbox

I'm using Mail::IMAPClient to connect to our Outlook mail server. I can get the mail just fine and print the text version of that mail to a file. But I'm having trouble using MIME::Parser to parse through the email.
I've tried giving the parser a file handle to the text file that I wrote the email to. I've tried giving the parser just the text of the email but it won't work how I'm expecting it to work. The entity parts always equals 0.
When I dump the entity skeleton I get
Content-type: text/plain
Effective-type: text/plain
Body-file: NONE
--
I can see all of the parts of the email in the file. The two PDFs that are attached are there, encoded in base64, so I know that the script is actually retrieving the email and the attachments. I've also tried parse and parse_data.
my $msgCount = 0;
$msgCount = $imap->message_count();
#or abortMission("", "Could not get message count: ". $imap->LastError );
if ( $msgCount > 0 ) {
#get all the messages from the inbox folder
my #msgseqnos = $imap->messages
or abortMission("", "Could not retreive messages:". $imap->LastError);
my ($x, $bh, $attachment, $attachmentName);
foreach my $seqno ( #msgseqnos ) {
my $input_file;
my $parser = new MIME::Parser;
my $emailText = $imap->body_string($seqno) # should be the entire email as text.
or abortMission("", "Could not get message string: " . $imap->LastError);
$parser->ignore_errors(1);
$parser->output_to_core(1);
open my $emailFileHandle, ">", "invoiceText.txt";
print $emailFileHandle $emailText;
#$imap->message_to_file($emailFileHandle, $seqno);
my $entity = $parser->parse_data($emailText);
$entity->dump_skeleton;
if ( $entity->parts > 0 ) {
for ( my $i = 0; $i < $entity->parts; $i++ ) {
my $subentity = $entity->parts($i);
# grab attachment name and contents
foreach $x ( #attypes ) {
if ( $subentity->mime_type =~ m/$x/i ) {
$bh = $subentity->bodyhandle;
$attachment = $bh->as_string;
$attachmentName = $subentity->head->mime_attr('content-disposition.filename');
open FH, ">$attachmentName";
print FH $attachment;
close FH;
#push #attachment, $attachment;
#push #attname, $subentity->head->mime_attr('content-disposition.filename');
}
}
}
}
else {
stillAGo("eData VehicleInvoices problem", "Perl can't find an attachment in an email in the VehicleInvoices folder of eData email address");
}
close $emailFileHandle;
# say $emailText;
# next;
#open OUT_FILE, ">invoiceText.txt";
#print OUT_FILE $emailText;
#print OUT_FILE $imap->bodypart_string($seqno,1);
#close OUT_FILE;
#print $emailText;
}
}
I'm trying to retrieve the attachments from emails automatically and save them to disk to be processed by another job.
I'd like to include the invoiceText.txt file so people can see the actual output but it's 1200 lines long. I'm not sure where to upload a file to link in here.
The body_string method doesn't return the entire email. As the documentation describes, and the name implies, it returns the body of the message, excluding the headers. That is why dump_skeleton shows no headers apart from the defaults
What you probably want, although I haven't tried it, is message_string, which does return the entire email
I see you've used message_to_file but commented it out. That would probably have worked if you got MIME::Parse to read from the file

Perl Read a file into a variable and add suffix to each lines

I'm very new to Perl and I'm having a hard time find out what I want.
I have a text file containing something like
text 2015-02-02:
- blabla1
- blabla2
text2 2014-12-12:
- blabla
- ...
I'm trying to read the file, put it in var, add to end of each line (of my var) and use it to send it to a web page.
This is what I have for the moment. It works except for the part.
if (open (IN, "CHANGELOG.OLD")) {
local $/;
$oldchangelog = <IN>'</br>';
close (IN);
$tmplhtml{'CHANGELOG'} = $oldchangelog;
} else {
# changelog not available
$tmplhtml{'CHANGELOG'} = "Changelog not available";
}
thanks for the help!
As someone comments - this looks like YAML, so parsing as YAML is probably more appropriate.
However to address your scenario:
3 argument file opens are good.
you're using local $/; which means you're reading the whole file into a string. This is not suitable for line by line processing.
Looks like you're putting everything into one element of a hash. Is there any particular reason you're doing this?
Anyway:
if ( open ( my $input, "<", "CHANGELOG.OLD" ) ) {
while ( my $line = <$input> ) {
$tmplhtml{'CHANGELOG'} .= $line . " <BR/>\n";
}
}
else {
$tmplhtml{'CHANGELOG'} = "Changelog not available";
}
As an alternative - you can render text 'neatly' to HTML using <PRE> tags.

How can my previously untainted data become tainted again?

I have a bit of a mystery here that I am not quite understanding the root cause of. I am getting an 'Insecure dependency in unlink while running with -T switch' when trying to invoke unlink from a script. That is not the mystery, as I realize that this means Perl is saying I am trying to use tainted data. The mystery is that this data was previously untainted in another script that saved it to disk without any problems.
Here's how it goes... The first script creates a binary file name using the following
# For the binary file upload
my $extensioncheck = '';
my $safe_filename_characters = "a-zA-Z0-9_.";
if ( $item_photo )
{
# Allowable File Type Check
my ( $name, $path, $extension ) = fileparse ( $item_photo, '\..*' );
$extensioncheck = lc($extension);
if (( $extensioncheck ne ".jpg" ) && ( $extensioncheck ne ".jpeg" ) &&
( $extensioncheck ne ".png" ) && ( $extensioncheck ne ".gif" ))
{
die "Your photo file is in a prohibited file format.";
}
# Rename file to Ad ID for adphoto directory use and untaint
$item_photo = join "", $adID, $extensioncheck;
$item_photo =~ tr/ /_/;
$item_photo =~ s/[^$safe_filename_characters]//g;
if ( $item_photo =~ /^([$safe_filename_characters]+)$/ ) { $item_photo = $1; }
else { die "Filename contains invalid characters"; }
}
$adID is generated by the script itself using a localtime(time) function, so it should not be tainted. $item_photo is reassigned using $adID and $extensioncheck BEFORE the taint check, so the new $item_photo is now untainted. I know this because $item_photo itself has no problem with unlink itself latter in the script. $item_photo is only used long enough to create three other image files using ImageMagick before it's tossed using the unlink function. The three filenames created from the ImageMagick processing of $item_photo are created simply like so.
$largepicfilename = $adID . "_large.jpg";
$adpagepicfilename = $adID . "_adpage.jpg";
$thumbnailfilename = $adID . "_thumbnail.jpg";
The paths are prepended to the new filenames to create the URLs, and are defined at the top of the script, so they can't be tainted as well. The URLs for these files are generated like so.
my $adpageURL = join "", $adpages_dir_URL, $adID, '.html';
my $largepicURL = join "", $adphotos_dir_URL, $largepicfilename;
my $adpagepicURL = join "", $adphotos_dir_URL, $adpagepicfilename;
my $thumbnailURL = join "", $adphotos_dir_URL, $thumbnailfilename;
Then I write them to the record, knowing everything is untainted.
Now comes the screwy part. In a second script I read these files in to be deleted using the unlink function, and this is where I am getting my 'Insecue dependency' flag.
# Read in the current Ad Records Database
open (ADRECORDS, $adrecords_db) || die("Unable to Read Ad Records Database");
flock(ADRECORDS, LOCK_SH);
seek (ADRECORDS, 0, SEEK_SET);
my #adrecords_data = <ADRECORDS>;
close(ADRECORDS);
# Find the Ad in the Ad Records Database
ADRECORD1:foreach $AdRecord(#adrecords_data)
{
chomp($AdRecord);
my($adID_In, $adpageURL_In, $largepicURL_In, $adpagepicURL_In, $thumbnailURL_In)=split(/\|/,$AdRecord);
if ($flagadAdID ne $adID_In) { $AdRecordArrayNum++; next ADRECORD1 }
else
{
#Delete the Ad Page and Ad Page Images
unlink ("$adpageURL_In");
unlink ("$largepicURL_In");
unlink ("$adpagepicURL_In");
unlink ("$thumbnailURL_In");
last ADRECORD1;
}
}
I know I can just untaint them again, or even just blow them on through knowing that the data is safe, but that is not the point. What I want is to understand WHY this is happening in the first place, as I am not understanding how this previously untainted data is now being seen as tainted. Any help to enlighten where I am missing this connection would be truly appreciated, because I really want to understand this rather than just write the hack to fix it.
Saving data to a file doesn't save any "tainted" bit with the data. It's just data, coming from an external source, so when Perl reads it it becomes automatically tainted. In your second script, you will have to explicitly untaint the data.
After all, some other malicious program could have changed the data in the file before the second script has a chance to read it.