Parsing a structured text file in Perl - perl

I'm quite new to Perl and I'm having immense difficulty writing a Perl script that will successfully parse a structured text file.
I have a collection of files that look like this:
name:
	John Smith
occupation:
	Electrician
date of birth:
	2/6/1961
hobbies:
	Boating
	Camping
	Fishing
And so on. The field name is always followed by a colon, and all the data associated with those fields is always indented by a single tab (\t).
I would like to create a hash that will directly associate the field contents with the field name, like this:
$contents{$name} = "John Smith"
$contents{$hobbies} = "Boating, Camping, Fishing"
Or something along those lines.
So far I've been able to get all the field names into a hash by themselves, but I've not had any luck wrangling the field data into a form that can be nicely stored in a hash. Clearly substituting/splitting newlines followed by tabs won't work (I've tried, somewhat naively). I've also tried a crude lookahead where I create a duplicate array of lines from the file and using that to figure out where the field boundaries are, but it's not that great in terms of memory consumption.
FWIW, currently I'm going through the file line by line, but I'm not entirely convinced that this is the best solution. Is there any way to do this parsing in a straightforward manner?

Reading the file line by line is a good way to go. Here I am creating a hash of array references. This is how you would read just one file; you could read each file this way and store each resulting hash of arrays in an outer hash (for example keyed by file name), giving a hash of hashes of arrays.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %contents;
my $key;
while (<DATA>) {
    chomp;
    if ( s/:\s*$// ) {
        $key = $_;
    } else {
        s/^\s+//g; # remove extra whitespace
        push @{ $contents{$key} }, $_;
    }
}
print Dumper \%contents;
__DATA__
name:
	John Smith
occupation:
	Electrician
date of birth:
	2/6/1961
hobbies:
	Boating
	Camping
	Fishing
Output:
$VAR1 = {
    'occupation' => [
        'Electrician'
    ],
    'hobbies' => [
        'Boating',
        'Camping',
        'Fishing'
    ],
    'name' => [
        'John Smith'
    ],
    'date of birth' => [
        '2/6/1961'
    ]
};
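The question asked for plain strings such as "Boating, Camping, Fishing" rather than array references. A minimal sketch of that last step, assuming the tab-indented layout described in the question; the sample text is held in a string so the example runs on its own, and %fields is just an illustrative name:

```perl
use strict;
use warnings;

# Sample input held in a string; real code would open a file instead.
my $text = "name:\n\tJohn Smith\noccupation:\n\tElectrician\n"
         . "date of birth:\n\t2/6/1961\nhobbies:\n\tBoating\n\tCamping\n\tFishing\n";

open my $fh, '<', \$text or die "Cannot read sample data: $!";

my ( %fields, $key );
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^(\S.*):\s*$/ ) {
        # an unindented line ending in a colon is a field name
        $key = $1;
    }
    elsif ( defined $key && $line =~ /^\t(.*)$/ ) {
        # a tab-indented line is data for the current field
        push @{ $fields{$key} }, $1;
    }
}
close $fh;

# Collapse each list of values into one comma-separated string.
my %contents = map { $_ => join( ', ', @{ $fields{$_} } ) } keys %fields;

print "$_ => $contents{$_}\n" for sort keys %contents;
```

Keeping the arrays until the end and joining once avoids having to special-case the first value of each field.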

This text file is actually quite close to YAML, and it's not difficult to convert it into a valid YAML file: each indented data line only needs to become a list item.
Once you have a YAML file, you can use YAML::Tiny or another module to parse it, which leads to cleaner code:
#!/usr/bin/perl
use strict;
use warnings;
use YAML::Tiny;
use Data::Dumper;
convert( './data.yaml', 'output.yaml' );
parse('output.yaml');
sub parse {
    my $yaml = shift;
    my $yamlobj = YAML::Tiny->read($yaml);
    my $name = $yamlobj->[0]{name}[0];
    my $occ = $yamlobj->[0]{occupation}[0];
    my $birth = $yamlobj->[0]{'date of birth'}[0];
    my $hobbies = $yamlobj->[0]{hobbies};
    my $hobbiestring = join ", ", @$hobbies;
    my $contents = {
        name => $name,
        occupation => $occ,
        birth => $birth,
        hobbies => $hobbiestring,
    };
    print "#RESULT:\n\n";
    print Dumper($contents);
}
sub convert {
    my ( $input, $output ) = @_;
    open my $infh, '<', $input or die "$!";
    open my $outfh, '>', $output or die "$!";
    while ( my $line = <$infh> ) {
        # turn each indented data line into a YAML list item
        # (YAML does not allow tabs for indentation, so use spaces)
        $line =~ s/^[ \t]+/  - /;
        print $outfh ($line);
    }
    close $outfh or die "$!";
}

Related

Perl Creating hash reference and looping through one element from each branch at a time

As a beginner I have what I think is a rather complicated problem I am hoping someone could help with.
I have the following text file (tab delimited)...
FILE1.txt
Dog Big
Dog Medium
Dog Small
Rabbit Huge
Rabbit Tiny
Rabbit Middle
Donkey Massive
Donkey Little
Donkey Gigantic
I need to read FILE1.txt into a hash reference to get something like the following... (using Data::Dumper)
$VAR1 = {
    'Dog' => {
        'Big',
        'Medium',
        'Small'
    },
    'Rabbit' => {
        'Huge',
        'Tiny',
        'Middle'
    },
    'Donkey' => {
        'Massive',
        'Little',
        'Gigantic'
    },
};
The problem I am having:
I then need to loop through each branch of the hash reference one at a time. I will use the value from the hash reference to check whether it matches my keyword; if so, it should return its corresponding key. For example...
What I need it to do:
my $keyword = "Little";
Dog->Big
if 'Big' matches my keyword then return $found = Dog
else go to the next branch
Rabbit->Huge
if 'Huge' matches my keyword then return $found = Rabbit
else go to the next branch
Donkey->Massive
if 'Massive' matches my keyword then return $found = Donkey
else go to the next branch (which is Dog again, but the second element this time)
Dog->Medium
if 'Medium' matches my keyword then return $found = Dog
else go to the next branch
Rabbit->Tiny
if 'Tiny' matches my keyword then return $found = Rabbit
else go to the next branch
Donkey->Little
if 'Little' matches my keyword then return $found = Donkey
..... and so on until the keyword is found or we reach the end of the hash reference
This is the kind of thing I am trying to achieve but don't know how to go about doing this, or whether a hash reference is the best way to do this, or if it can even be done with a hash/hash reference?
Your help with this is much appreciated, thanks.
Choosing the proper data structure is often the key step to a solution, but first of all you should define what you are trying to achieve. What is the overall goal? For example: "I have this data file, and my application frequently needs to ask it for this information." It is crucial to ask the right question, because if you don't need to look up keywords frequently, it doesn't make sense to build a hash at all.
perl -anE'say $F[0] if $F[1] eq "Little"' FILE1.txt
Yes, it is that simple. Look in the perlrun manpage to see what the switches mean and how to do the same thing in a bigger application.
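As a rough sketch of what those switches do: -n wraps the code in a loop that reads each line, and -a splits the line on whitespace into @F. The sub name matching_animals and the in-memory sample data are made up for illustration, so that the example runs without FILE1.txt:

```perl
use strict;
use warnings;
use feature 'say';

# Roughly what "perl -anE '...'" does: -n supplies the line-reading loop and
# -a splits each line on whitespace into @F.
sub matching_animals {
    my ( $fh, $keyword ) = @_;
    my @found;
    while ( my $line = <$fh> ) {
        my @F = split ' ', $line;
        push @found, $F[0] if defined $F[1] && $F[1] eq $keyword;
    }
    return @found;
}

# In-memory stand-in for FILE1.txt so the sketch runs on its own.
my $sample = "Dog\tBig\nRabbit\tTiny\nDonkey\tLittle\n";
open my $fh, '<', \$sample or die "Cannot read sample data: $!";
my @animals = matching_animals( $fh, 'Little' );
close $fh;

say for @animals;
```

In a real program the filehandle would come from opening the data file instead of a string.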
If you need to ask this question frequently, you should arrange your data in a way that helps you, not in a way you have to battle with.
use strict;
use warnings;
use feature qw(say);
use autodie;
open my $f, '<', 'FILE1.txt';
my %h;
while (<$f>) {
    chomp;
    my ($animal, $keyword) = split ' ';
    $h{$keyword} = $animal unless exists $h{$keyword};
}
close $f;
for my $keyword (qw(Little Awkward Small Tiny)) {
    say $h{$keyword} ? "$keyword $h{$keyword}" : "keyword $keyword not found";
}
But if you still insist on traversing the hash, you can do it, but you have been warned.
open my $f, '<', 'FILE1.txt';
my %h;
while (<$f>) {
    chomp;
    my ( $animal, $keyword ) = split ' ';
    push @{ $h{$animal} }, $keyword;
}
close $f;
KEYWORD:
for my $keyword (qw(Little Awkward Small Tiny)) {
    for my $animal (keys %h) {
        for my $k (@{ $h{$animal} }) {
            if ($k eq $keyword) {
                say "$keyword $animal";
                next KEYWORD;
            }
        }
    }
    say "keyword $keyword not found";
}
To critique my own answer: the part that does the search could be structured better. And maybe it is pointless even using an ordered hash, as the search is through a linear list. Maybe it should be an array of arrays.
use strict;
use warnings;
use Tie::IxHash;
# open file
open(my $fh, "ani.txt") || die $!;
# make an ordered hash
tie my %sizes, 'Tie::IxHash';
# read file into hash of arrays
while (<$fh>) {
    (my $animal, my $size) = split(/\s+/);
    if (!exists($sizes{$animal})) {
        $sizes{$animal} = [$size];
    } else {
        push @{$sizes{$animal}}, $size;
    }
}
my $keyword = "Little";
my $running = 1;
my $depth = 0;
while ($running) {
    $running = 0;
    for my $search (keys %sizes) {
        # skip animals that have no entry at this depth
        next if ($depth >= @{$sizes{$search}});
        $running = 1;
        if ($keyword eq $sizes{$search}[$depth]) {
            print "FOUND!!!!!! $search $depth\n";
            exit(0);
        }
    }
    $depth++;
}
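The array-of-arrays idea could look something like the sketch below. It is only one possible layout: each row is [ $animal, @sizes ] in file order, and the names @rows, %index and find_animal are invented for the example. The sample data is inlined so the code runs without ani.txt.

```perl
use strict;
use warnings;

# In-memory stand-in for the tab-delimited file.
my $sample = join '', map { "$_\n" }
    "Dog\tBig", "Dog\tMedium", "Dog\tSmall",
    "Rabbit\tHuge", "Rabbit\tTiny", "Rabbit\tMiddle",
    "Donkey\tMassive", "Donkey\tLittle", "Donkey\tGigantic";

open my $fh, '<', \$sample or die "Cannot read sample data: $!";

# @rows holds one [ $animal, @sizes ] entry per animal, in file order;
# %index maps an animal name to its position in @rows.
my ( @rows, %index );
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $animal, $size ) = split /\t/, $line;
    if ( !exists $index{$animal} ) {
        $index{$animal} = @rows;    # next free position
        push @rows, [$animal];
    }
    push @{ $rows[ $index{$animal} ] }, $size;
}
close $fh;

# Check the first size of every animal, then the second, and so on.
sub find_animal {
    my ( $keyword, @rows ) = @_;
    my $max = 0;
    for my $row (@rows) { $max = $#$row if $#$row > $max }
    for my $depth ( 1 .. $max ) {
        for my $row (@rows) {
            next if $depth > $#$row;    # this animal has fewer sizes
            return $row->[0] if $row->[$depth] eq $keyword;
        }
    }
    return;    # not found
}

my $found = find_animal( 'Little', @rows );
print defined $found ? "Found: $found\n" : "Not found\n";
```

This keeps the file order without needing Tie::IxHash, at the cost of a small lookup hash while loading.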
Here is another version of a solution to the stated problem. To solve the actual problem given, there is no need to store anything except the first animal seen for each "size" key in a hash.
This hash can then be trivially used to look up the animal:
use strict;
use warnings;
open(my $fh, "ani.txt") || die $!;
my %animals;
# read file into hash
while (<$fh>) {
    (my $animal, my $size) = split(/\s+/);
    # only add the animal the first time the size is found
    if (!exists($animals{$size})) {
        $animals{$size} = $animal;
    }
}
my $keyword = "Little";
print "animal is ", $animals{$keyword};

Push into end of hash in Perl

So what I am trying to do with the following code is push a string, let's say "this string" onto the end of each key in a hash. I'm completely stumped on how to do this. Here's my code:
use warnings;
use strict;
use File::Find;
my @name;
my $filename;
my $line;
my @severity = ();
my @files;
my @info = ();
my $key;
my %hoa;
my $xmlfile;
my $comment;
my @comments;
open( OUTPUT, "> $ARGV[0]" );
my $dir = 'c:/programs/TEST/Test';
while ( defined( $input = glob( $dir . "\\*.txt" ) ) ) {
    open( INPUT, "< $input" );
    while (<INPUT>) {
        chomp;
        if (/File/) {
            my @line = split /:/;
            $key = $line[1];
            push @{ $hoa{$key} }, "Filename\n";
        }
        if ( /XML/ ... /File/ ) {
            $xmlfile = $1;
            push @{ $hoa{$key} }, "XML file is $xmlfile\n";
        }
        if (/Important/) {
            push @{ $hoa{$key} }, "Severity is $_\n";
        }
        if (/^\D/) {
            next if /Important/;
            push @{ $hoa{$key} }, "Given comment is $_\n";
        }
        push @{ $hoa{$key} }, "this string\n";
    }
}
foreach my $k ( keys %hoa ) {
    my @list = @{ $hoa{$k} };
    foreach my $l (@list) {
        print OUTPUT $l, "\n";
    }
}
}
close INPUT;
close OUTPUT;
Where I have "this string" is where I was trying to push that string onto the end of the array. However, what ended up happening was that it ended up printing "this string" three times, and not at the end of every key like I wanted. When I tried to put it outside the while() loop, it said that the value of $key was not initialized. So please, any help? And if you need any clarification on what I'm asking, just let me know. Thank you!
No offence, but there are so many issues in this code I don't even know where to start...
First, the 'initialization block' (all these my $something; my @somethings lines at the beginning of this script) is not required in Perl. In fact, it's not just 'redundant' - it's actually confusing: I had to move my focus back and forth every time I encountered a new variable just to check its type. Besides, even with all this, the $input var is still not declared as local; it's either missing in comments, or the code given has omissions.
Second, why do you declare your intention to use File::Find (good) - but then do not use it at all? It could greatly simplify all this while(glob) { while(<FH>) { ... } } routine.
Third, I'm not sure why you assign something to $key only when the line read is matched by /File/ - but then use its value as a key in all the other cases. Is this an attempt to read a file organized in sections? Then it can be done a bit more simply, either by slurping and splitting or by localizing the $/ variable...
Anyway, the point is that if the first line of the file scanned is not matched by /File/, the previous (i.e., from the previous file!) value is used - and I'm not quite sure that it's intended. And if the very first line of the first file is not /File/-matched, then an empty string is used as a key - again, it smells like a bug...
Could you please describe your task in more details? Give some test input/output results, perhaps... It'd be great to proceed in short tasks, organizing your code in process.
Your program is ill-conceived and breaks a lot of good practice rules. Rather than enumerate them all, here is an equivalent program with a better structure.
I wonder if you are aware that all of the if statements will be tested and possibly executed? Perhaps you need to make use of elsif?
Aside from the possibility that $key is undefined when it is used, you are also setting $xmlfile to $1 which will never be defined as there are no captures in any of your regular expressions.
It is impossible to tell from your code what you are trying to do, so we can help you only if you show us your output, input and say how to derive one from the other.
use strict;
use warnings;
use File::Find;
my ($outfile) = @ARGV;
my $dir = 'c:/programs/TEST/Test';
my %hoa;
my $key;
while (my $input = glob "$dir/*.txt") {
    open my $in, '<', $input or die $!;
    while (<$in>) {
        chomp;
        if (/File/) {
            # assign to the outer $key so the later pushes can see it
            $key = (split /:/)[1];
            push @{ $hoa{$key} }, "Filename\n";
        }
        if (/XML/ ... /File/) {
            my $xmlfile = $1;
            push @{ $hoa{$key} }, "XML file is $xmlfile\n";
        }
        if (/Important/) {
            push @{ $hoa{$key} }, "Severity is $_\n";
        }
        if (/^\D/) {
            next if /Important/;
            push @{ $hoa{$key} }, "Given comment is $_\n";
        }
        push @{ $hoa{$key} }, "this string\n";
    }
    close $in;
}
open my $out, '>', $outfile or die $!;
foreach my $k (keys %hoa) {
    foreach my $l (@{ $hoa{$k} }) {
        print $out $l, "\n";
    }
}
close $out;
I suspect based on your code, that the line where $key is set is not called each time through the loop, and that you do not trigger any of the other if statements.
This would append "this string" to the end of the array. Given that you are getting three copies of "this string" at the end of the array, I would suspect that two lines match neither if (/File/) nor any of the other if statements. This would leave the $key value unchanged, so "this string" is appended to the array for whatever value $key last had.
This will append the string "this string" to every element of the hash %hoa, whose elements are array refs:
for (values(%hoa)) { push @{$_}, "this string"; }
Put that outside your while loop, and you'll print "this string" at the end of each element of %hoa.
It will autovivify array refs where it finds undefined elements. It will also choke if it cannot dereference an element as an array, and will manipulate arrays by symbolic reference if it finds a simple scalar and is not running under strict:
my %autoviv = ( a => ['foo'], b => undef );
push @$_, "PUSH" for values %autoviv; # ( a => ['foo', 'PUSH'], b => ['PUSH'] )
my %fatal = ( a => {} );
push @$_, "PUSH" for values %fatal; # FATAL: "Not an ARRAY reference at..."
my %dangerous = (a => "foo");
push @$_, "PUSH" for values %dangerous; # Yikes! @foo is now ("PUSH")
use strict;
my %kablam = (a => "foo");
push @$_, "PUSH" for values %kablam; # "Can't use string ("foo") as an ARRAY ref ..."
As I understand it, traverse the hash with a map command to modify its keys. An example:
EDIT: I've edited because I realised that the map command can be assigned to the same hash. No need to create a new one.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my %hash = qw|
key1 value1
key2 value2
key3 value3
|;
%hash = map { $_ . "this string" => $hash{ $_ } } keys %hash;
print Dumper \%hash;
Run it like:
perl script.pl
With following output:
$VAR1 = {
'key3this string' => 'value3',
'key2this string' => 'value2',
'key1this string' => 'value1'
};

help in parsing

I have an XML file as shown below:
<message1>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message1>
<message2>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message2>
I have to parse the values (val) and I cannot use the XML::Simple module. The parsing should start at <message1>, and I have to put the values in an array until </message1>; then I have to repeat this for <message2> until </message2>.
Pictorially it is like
<message1>
----100
----200
----300
----400
</message1>
<message2>
----100
----200
----300
----400
</message2>
Can any one help me .. I am struggling a lot
Thanks
Senthil kumar
Since we're back in 1999, I think I would forget about strict and warnings, use symbolic references and string eval, and be done with it:
#!/usr/bin/perl
while( <DATA>)
  { s{<(message\d)>}{\@$1=(};  # @message1=(
    s{<val\d>}{};              #
    s{<\/val\d>}{,};           # ,
    s{</message\d>}{);};       # );
    $s.=$_;
  };
eval $s;
$,= ", "; $\= "\n";
foreach (1..2) { print "\@message$_: ", @{"message$_"}; }
__DATA__
<message1>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message1>
<message2>
<val1>100</val1>
<val2>200</val2>
<val3>300</val3>
<val4>400</val4>
</message2>
(in case that's not clear: that's a joke! As they say "Have you tried using an XML parser instead?")
Assuming your input is completely regular as you show, the following should work.
But you are far better off getting a real XML parser to work, by wrapping a root element around all your content or by parsing each message separately.
use strict;
use warnings;
my %data;
while (<>) {
    # skip blank lines
    next unless /\S/;
    my ($tag) = /^<(.*)>$/
        or warn("expected tag, got $_ "), next;
    $data{$tag} ||= [];
    while (<>) {
        last if /^<\/\Q$tag\E>$/;
        my (undef, $value) = /^<val(\d+)>(.*)<\/val\1>$/
            or warn("expected val, got $_ "), next;
        push @{ $data{$tag} }, $value;
    }
}
use Data::Dumper;
print Dumper \%data;

Dealing with multiple capture groups in multiple records

Data Format:
attribname: data
Data Example:
cheese: good
pizza: good
bagel: good
fire: bad
Code:
my $subFilter='(.+?): (.+)';
my @attrib = ($dataSet=~/$subFilter/g);
for (@attrib)
{
    print "$_\n";
}
The code spits out:
cheese
good
pizza
good
[etc...]
I was wondering what an easy Perly way to do this is? I am parsing the data from a log; the data above is dummy data for simplicity. I am newer to Perl. I suspect I could do this by finagling indexes, but I was wondering if there is a shorter way of implementing this? Is there any way to have the capture groups put into two different variables instead of serially appended to the list along with all matches?
Edit: I want the attribute and its associated value together so I can then do what I need to with them. For example, within my for loop I would like to access both the attribute name and the attribute value.
Edit:
I tried
my %attribs;
while (my $line = <$data>)
{
my ($attrib, $value) = ($line=~m/$subFilter/);
print $attribs{$attrib}," : ", $value,"\n";
}
and no luck :( I don't get any output with this. My data is in a variable, not a file, because it was parsed out of a set of parent data which is in a file. It would be convenient if the my variable worked so that my (@attrib, @value) = ($line=~/$subFilter/g); filled the lists appropriately with the multiple matches.
Solution:
my @line = ($7 =~/(.+?)\n/g);
for (@line)
{
    my ($attrib, $value) = ($_=~m/$subFilter/);
    if ($attrib ne "")
    {
        print $attrib," : ", $value,"\n";
    }
}
I'm not really clear on what you actually want to store, but here's how you could store the data in a hash table, with '1' indicating good and '0' indicating 'bad':
use strict;
use warnings;
use Data::Dumper;
my %foods;
while (my $line = <DATA>)
{
chomp $line;
my ($food, $good) = ($line =~ m/^(.+?): (.+)$/);
$foods{$food} = ($good eq 'good' ? 1 : 0);
}
print Dumper(\%foods);
__DATA__
cheese: good
pizza: good
bagel: good
fire: bad
This prints:
$VAR1 = {
'bagel' => 1,
'cheese' => 1,
'fire' => 0,
'pizza' => 1
};
A sensible approach would be to make use of the split function:
my %attrib;
open my $data, '<', 'fileName' or die "Unable to open file: $!";
while ( my $line = <$data> ) {
    chomp $line;    # drop the newline so it does not end up in the value
    my ( $attrib, $value ) = split /:\s*/, $line, 2;
    $attrib{$attrib} = $value;
}
close $data;
foreach my $attrib ( keys %attrib ) {
    print "$attrib: $attrib{$attrib}\n";
}
If you're into one-liners, the following would achieve the same:
$ perl -F'/:\s*/' -ane '$attrib{$F[0]} = $F[1]; } END { print $_,"\t",$attrib{$_},"\n" foreach keys %attrib;' fileName
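Since the question says the data sits in a variable rather than a file, here is a small sketch of two ways to keep each name with its value when matching against a string. It assumes the "attribname: data" format shown in the question, one pair per line; @pairs is just an illustrative name.

```perl
use strict;
use warnings;

my $dataSet = "cheese: good\npizza: good\nbagel: good\nfire: bad\n";

# In list context, a /g match returns every captured group in order, so the
# flat (name, value, name, value, ...) list can initialise a hash directly.
my %attrib = $dataSet =~ /^(.+?): (.+)$/mg;

# Alternatively, in scalar context /g steps through one match at a time,
# which keeps each pair together and preserves the original order.
my @pairs;
while ( $dataSet =~ /^(.+?): (.+)$/mg ) {
    push @pairs, [ $1, $2 ];
}

print "$_->[0] : $_->[1]\n" for @pairs;
```

The hash form is the shortest, but it loses the input order and collapses repeated attribute names; the loop form avoids both.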

How do I convert Data::Dumper output back into a Perl data structure?

I was wondering if you could shed some lights regarding the code I've been doing for a couple of days.
I've been trying to convert a Perl-parsed hash back to XML using the XMLout() and XMLin() method and it has been quite successful with this format.
#!/usr/bin/perl -w
use strict;
# use module
use IO::File;
use XML::Simple;
use XML::Dumper;
use Data::Dumper;
my $dump = new XML::Dumper;
my ( $data, $VAR1 );
Topology:$VAR1 = {
'device' => {
'FOC1047Z2SZ' => {
'ChassisID' => '2009-09',
'Error' => undef,
'Group' => {
'ID' => 'A1',
'Type' => 'Base'
},
'Model' => 'CATALYST',
'Name' => 'CISCO-SW1',
'Neighbor' => {},
'ProbedIP' => 'TEST',
'isDerived' => 0
}
},
'issues' => [
'TEST'
]
};
# create object
my $xml = new XML::Simple (NoAttr=>1,
RootName=>'data',
SuppressEmpty => 'true');
# convert Perl array ref into XML document
$data = $xml->XMLout($VAR1);
#reads an XML file
my $X_out = $xml->XMLin($data);
# access XML data
print Dumper($data);
print "STATUS: $X_out->{issues}\n";
print "CHASSIS ID: $X_out->{device}{ChassisID}\n";
print "GROUP ID: $X_out->{device}{Group}{ID}\n";
print "DEVICE NAME: $X_out->{device}{Name}\n";
print "DEVICE NAME: $X_out->{device}{name}\n";
print "ERROR: $X_out->{device}{error}\n";
I can access all the element in the XML with no problem.
But when I try to create a file that will house the parsed hash, problem arises because I can't seem to access all the XML elements. I guess, I wasn't able to unparse the file with the following code.
#!/usr/bin/perl -w
use strict;
#!/usr/bin/perl
# use module
use IO::File;
use XML::Simple;
use XML::Dumper;
use Data::Dumper;
my $dump = new XML::Dumper;
my ( $data, $VAR1, $line_Holder );
#this is the file that contains the parsed hash
my $saveOut = "C:/parsed_hash.txt";
my $result_Holder = IO::File->new($saveOut, 'r');
while ($line_Holder = $result_Holder->getline){
print $line_Holder;
}
# create object
my $xml = new XML::Simple (NoAttr=>1, RootName=>'data', SuppressEmpty => 'true');
# convert Perl array ref into XML document
$data = $xml->XMLout($line_Holder);
#reads an XML file
my $X_out = $xml->XMLin($data);
# access XML data
print Dumper($data);
print "STATUS: $X_out->{issues}\n";
print "CHASSIS ID: $X_out->{device}{ChassisID}\n";
print "GROUP ID: $X_out->{device}{Group}{ID}\n";
print "DEVICE NAME: $X_out->{device}{Name}\n";
print "DEVICE NAME: $X_out->{device}{name}\n";
print "ERROR: $X_out->{device}{error}\n";
Do you have any idea how I could access the $VAR1 inside the text file?
Regards,
newbee_me
$data = $xml->XMLout($line_Holder);
$line_Holder has only the last line of your file, not the whole file, and not the perl hashref that would result from evaling the file. Try something like this:
my $ref = do $saveOut;
The do function loads and evals a file for you. You may want to do it in separate steps, like:
use File::Slurp "read_file";
my $fileContents = read_file( $saveOut );
my $ref = eval( $fileContents );
You might want to look at the Data::Dump module as a replacement for Data::Dumper; its output is already ready to re-eval back.
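A minimal round-trip sketch of the do approach with some error checking. The temporary file stands in for C:/parsed_hash.txt, and the data is a cut-down version of the structure in the question:

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

my $original = { device => { Name => 'CISCO-SW1' }, issues => ['TEST'] };

# Write the dump to a temporary file (a stand-in for C:/parsed_hash.txt).
my ( $out, $file ) = tempfile( UNLINK => 1 );
print {$out} Dumper($original);
close $out or die "Cannot close $file: $!";

# do FILE returns the value of the last statement in the file,
# which here is the hash reference assigned to $VAR1.
my $restored = do $file;
die "Could not parse $file: $@" if $@;
die "Could not read $file: $!"  unless defined $restored;

print "DEVICE NAME: $restored->{device}{Name}\n";
print "STATUS: $restored->{issues}[0]\n";
```

Note that do executes whatever code the file contains, so this is only appropriate for files you trust. On Perl 5.26 and later, a relative file name needs a leading ./ because the current directory is no longer searched.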
Basically to load Dumper data you eval() it:
use strict;
use Data::Dumper;
my $x = {"a" => "b", "c"=>[1,2,3],};
my $q = Dumper($x);
$q =~ s{\A\$VAR\d+\s*=\s*}{};
my $w = eval $q;
print $w->{"a"}, "\n";
The regexp (s{\A\$VAR\d+\s*=\s*}{}) is used to remove $VAR1= from the beginning of string.
On the other hand - if you need a way to store a complex data structure and load it again, it's much better to use the Storable module and its store() and retrieve() functions.
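A short sketch of the Storable route, using a temporary file purely for demonstration:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

my $data = { a => 'b', c => [ 1, 2, 3 ] };

# Temporary file used only for this demonstration.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

# Serialise the structure to disk, then read it back as a new reference.
store( $data, $file ) or die "Cannot store to $file: $!";
my $copy = retrieve($file);

print $copy->{a}, "\n";
print scalar @{ $copy->{c} }, "\n";
```

Unlike the eval approach, this does not execute the file as Perl code, but the file is binary and not human-readable.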
This has worked for me for hashes of hashes. It perhaps won't work so well with structures that contain references to other structures, but it works well enough for simple structures like arrays, hashes, or hashes of hashes.
open(DATA,">",$file);
print DATA Dumper(\%g_write_hash);
close(DATA);
my %g_read_hash = %{ do $file };
Please use the Data::Dump module as a replacement for Data::Dumper.
You can configure the variable name used in Data::Dumper's output with $Data::Dumper::Varname.
Example
use Data::Dumper;
$Data::Dumper::Varname = "foo";
my $string = Dumper($object);
eval($string);
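...will create the variable $foo1 (Data::Dumper appends a counter to the Varname prefix), which should contain the same data as $object. Under use strict, $foo1 must be declared (for example with our) before the eval.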
If your data structure is complicated and you have strange results, you may want to consider Storable's freeze() and thaw() methods.