Parse::RecDescent - getting information from it - perl

I'm working with the Parse::RecDescent parser in Perl, and I seem to have the most terrible time getting information from it. The information readily available online does not seem to have non-trivial examples.
Here is the code:
event_function: object_list ':' event_list ';'
<defer:
{ #item is a special character with Parse::Recdescent.
print Dumper($item{object_list});
$return = $item[1];
}
>
| object_list ':' ';'
<defer:
{
print Dumper($item{object_list});
$return = $item[1];
}
>
Here is the output
PS W:\developers\paulnathan\rd_dir> perl parser.pl testfile
$VAR1 = 4;
$VAR1 = 8;
PS W:\developers\paulnathan\rd_dir>
The input file parses correctly.
stuff, stuff2: pre-operation event = {foo1, foo2};
It should be outputting a hash keyed by "stuff", "stuff2".
Thoughts?
edit:
object_list :
object ',' object_list
<defer:
{
my $retval = ();
$retval = ::merge_hash_refs($item[1], $item[3]);
$return = $retval;
}
>
| object
<defer:
{
#print Dumper($item{object});
$return = $item{object};
}
>
object :
'/' /[a-z0-9_][a-z0-9_]*/ '/' '...'
<defer:
{
$::objects->{$item[2]} = "stuff";
$return = $::objects;
}
>
| /[a-z0-9_][a-z0-9_]*/
<defer:
{
$::objects->{$item[1]} = "stuff";
$return = $::objects;
}
>
edit2:
Merge_hash_refs, just in case. :-)
#takes two hash references.
sub merge_hash_refs {
my($ref1, $ref2) = #_;
my $retref = ();
while( my ($k, $v) = each %$ref1 ) {
$retref->{$k} = $v;
}
while( my ($k, $v) = each %$ref2 ) {
$retref->{$k} = $v;
}
return $retref;
}

If you add a use strict to your script you'll get the fatal error Can't use string ("1") as a HASH ref while "strict refs" in use at [the call to merge_hash_refs]. It appears that the closures created by the <defer> directives are causing the contents of #item to be the ones when the production matched instead of the hashrefs eventually returned by the subrules. Removing the <defer> directives gives me this output:
$VAR1 = {
'stuff2' => 'stuff',
'stuff' => 'stuff'
};
Of course, this has the side effect that $::object is updated by successful object productions even if the higher level rules fail (including backtracking). I'd write it this way:
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;
my $parser = Parse::RecDescent->new(<<'EOT');
event_function: object_list ':' event_list(?) ';'
{
$return = $item[1];
}
object_list : <leftop: object ',' object>
{
$return = { map { %$_ } #{$item[1]} };
}
object :
'/' /[a-z0-9_][a-z0-9_]*/ '/' '...'
{
$return = { $item[2] => 'stuff' };
}
| /[a-z0-9_][a-z0-9_]*/
{
$return = { $item[1] => 'stuff' };
}
# stub, don't know what this should be
event_list : /[^;]+/
EOT
my %object;
while (<DATA>) {
my $x = $parser->event_function($_);
next unless $x;
# merge objects into master list
while (my ($k, $v) = each %$x) {
$object{$k} = $v;
}
}
print Dumper \%object;
__DATA__
stuff, stuff2: pre-operation event = {foo1, foo2};
stuff3, stuff4: ;
The output is:
$VAR1 = {
'stuff2' => 'stuff',
'stuff3' => 'stuff',
'stuff' => 'stuff',
'stuff4' => 'stuff'
};

Probably not an answer to your question, but when you start an each() loop through a hash, if each() had previously been used on the hash it just starts from wherever the iterator was pointing. To be safe, put a void-context keys() (e.g. keys(%$ref1);) before the while loop to reset the iterator. Older versions of Data::Dumper had a cute little bug of leaving the iterator pointing just after the last element sometimes, making the hash appear to be empty to an unsafe while(...each...) loop :)

Related

Split string into a hash of hashes (perl)

at the moment im a little confused..
I am looking for a way to write a string with an indefinite number of words (separated by a slash) in a recursive hash.
These "strings" are output from a text database.
Given is for example
"office/1/hardware/mouse/count/200"
the next one can be longer or shorter..
This must be created from it:
{
office {
1{
hardware {
mouse {
count => 200
}
}
}
}
}
Any idea ?
Work backwards. Split the string. Use the last two elements to make the inner-most hash. While more words exist, make each one the key of a new hash, with the inner hash as its value.
my $s = "office/1/hardware/mouse/count/200";
my #word = split(/\//, $s);
# Bottom level taken explicitly
my $val = pop #word;
my $key = pop #word;
my $h = { $key => $val };
while ( my $key = pop #word )
{
$h = { $key => $h };
}
Simple recursive function should do
use strict;
use warnings;
use Data::Dumper;
sub foo {
my $str = shift;
my ($key, $rest) = split m|/|, $str, 2;
if (defined $rest) {
return { $key => foo($rest) };
} else {
return $key;
}
}
my $hash = foo("foo/bar/baz/2");
print Dumper $hash;
Gives output
$VAR1 = {
'foo' => {
'bar' => {
'baz' => '2'
}
}
};
But like I said in the comment: What do you intend to use this for? It is not a terribly useful structure.
If there are many lines to be read into a single hash and the lines have a variable number of fields, you have big problems and the other two answers will clobber data by either smashing sibling keys or overwriting final values. I'm supposing this because there is no rational reason to convert a single line into a hash.
You will have to walk down the hash with each field. This will also give you the most control over the process.
our $hash = {};
our $eolmark = "\000";
while (my $line = <...>) {
chomp $line;
my #fields = split /\//, $line;
my $count = #fields;
my $h = $hash;
my $i = 0;
map { (++$i == $count) ?
($h->{$_}{$eolmark} = 1) :
($h = $h->{$_} ||= {});
} #fields;
}
$h->{$_}{$eolmark} = 1 You need the special "end of line" key so that you can recognize the end of a record and still permit longer records to coexist. If you had two records
foo/bar/baz foo/bar/baz/quux, the second would overwrite the final value of the first.
$h = $h->{$_} ||= {} This statement is a very handy idiom to both create and populate a cache in one step and then take a shortcut reference to it. Never do a hash lookup more than once.
HTH

functional Perl: Filter, Iterator

I have to write Perl although I'm much more comfortable with Java, Python and functional languages. I'd like to know if there's some idiomatic way to parse a simple file like
# comment line - ignore
# ignore also empty lines
key1 = value
key2 = value1, value2, value3
I want a function that I pass an iterator over the lines of the files and that returns a map from keys to list of values. But to be functional and structured I'd like to:
use a filter that wraps the given iterator and returns an iterator without empty lines or comment lines
The mentioned filter(s) should be defined outside of the function for reusability by other functions.
use another function that is given the line and returns a tuple of key and values string
use another function that breaks the comma separated values into a list of values.
What is the most modern, idiomatic, cleanest and still functional way to do this? The different parts of the code should be separately testable and reusable.
For reference, here is (a quick hack) how I might do it in Python:
re_is_comment_line = re.compile(r"^\s*#")
re_key_values = re.compile(r"^\s*(\w+)\s*=\s*(.*)$")
re_splitter = re.compile(r"\s*,\s*")
is_interesting_line = lambda line: not ("" == line or re_is_comment_line.match(line))
and re_key_values.match(line)
def parse(lines):
interesting_lines = ifilter(is_interesting_line, imap(strip, lines))
key_values = imap(lambda x: re_key_values.match(x).groups(), interesting_lines)
splitted_values = imap(lambda (k,v): (k, re_splitter.split(v)), key_values)
return dict(splitted_values)
A direct translation of your Python would be
my $re_is_comment_line = qr/^\s*#/;
my $re_key_values = qr/^\s*(\w+)\s*=\s*(.*)$/;
my $re_splitter = qr/\s*,\s*/;
my $is_interesting_line= sub {
my $_ = shift;
length($_) and not /$re_is_comment_line/ and /$re_key_values/;
};
sub parse {
my #lines = #_;
my #interesting_lines = grep $is_interesting_line->($_), #lines;
my #key_values = map [/$re_key_values/], #interesting_lines;
my %splitted_values = map { $_->[0], [split $re_splitter, $_->[1]] } #key_values;
return %splitted_values;
}
Differences are:
ifilter is called grep, and can take an expression instead of a block as first argument. These are roughly equivalent to a lambda. The current item is given in the $_ variable. The same applies to map.
Perl doesn't emphazise laziness, and seldomly uses iterators. There are instances where this is required, but usually the whole list is evaluated at once.
In the next example, the following will be added:
Regexes don't have to be precompiled, Perl is very good with regex optimizations.
Instead of extracting key/values with regexes, we use split. It takes an optional third argument that limits the number of resulting fragments.
The whole map/filter stuff can be written in one expression. This doesn't make it more efficient, but emphazises the flow of data. Read the map-map-grep from bottom upwards (actually right to left, think of APL).
.
sub parse {
my %splitted_values =
map { $_->[0], [split /\s*,\s*/, $_->[1]] }
map {[split /\s*=\s*/, $_, 2]}
grep{ length and !/^\s*#/ and /^\s*\w+\s*=\s*\S/ }
#_;
return \%splitted_values; # returning a reference improves efficiency
}
But I think a more elegant solution here is to use a traditional loop:
sub parse {
my %splitted_values;
LINE: for (#_) {
next LINE if !length or /^\s*#/;
s/\A\s*|\s*\z//g; # Trimming the string—omitted in previous examples
my ($key, $vals) = split /\s*=\s*/, $_, 2;
defined $vals or next LINE; # check if $vals was assigned
#{ $splitted_values{$key} } = split /\s*,\s*/, $vals; # Automatically create array in $splitted_values{$key}
}
return \%splitted_values
}
If we decide to pass a filehandle instead, the loop would be replaced with
my $fh = shift;
LOOP: while (<$fh>) {
chomp;
...;
}
which would use an actual iterator.
You could now go and add function parameters, but do this only iff you are optimizing for flexibility and nothing else. I already used a code reference in the first example. You can invoke them with the $code->(#args) syntax.
use Carp; # Error handling for writing APIs
sub parse {
my $args = shift;
my $interesting = $args->{interesting} or croak qq("interesting" callback required);
my $kv_splitter = $args->{kv_splitter} or croak qq("kv_splitter" callback required);
my $val_transform= $args->{val_transform} || sub { $_[0] }; # identity by default
my %splitted_values;
LINE: for (#_) {
next LINE unless $interesting->($_);
s/\A\s*|\s*\z//g;
my ($key, $vals) = $kv_splitter->($_);
defined $vals or next LINE;
$splitted_values{$key} = $val_transform->($vals);
}
return \%splitted_values;
}
This could then be called like
my $data = parse {
interesting => sub { length($_[0]) and not $_[0] =~ /^\s*#/ },
kv_splitter => sub { split /\s*=\s*/, $_[0], 2 },
val_transform => sub { [ split /\s*,\s*/, $_[0] ] }, # returns anonymous arrayref
}, #lines;
I think the most modern approach consists in taking advantage of the CPAN modules. In your example, Config::Properties may helps:
use strict;
use warnings;
use Config::Properties;
my $config = Config::Properties->new(file => 'example.properties') or die $!;
my $value = $config->getProperty('key');
As indicated in the posts linked to by #collapsar, Higher-Order Perl is a great read for exploring functional techniques in Perl.
Here is an example that hits your bullet points:
use strict;
use warnings;
use Data::Dumper;
my #filt_rx = ( qr{^\s*\#},
qr{^[\r\n]+$} );
my $kv_rx = qr{^\s*(\w+)\s*=\s*([^\r\n]*)};
my $spl_rx = qr{\s*,\s*};
my $iterator = sub {
my ($fh) = #_;
return sub {
my $line = readline($fh);
return $line;
};
};
my $filter = sub {
my ($it,#r) = #_;
return sub {
my $line;
do {
$line = $it->();
} while ( defined $line
&& grep { $line =~ m/$_/} #r );
return $line;
};
};
my $kv = sub {
my ($line,$rx) = #_;
return ($line =~ m/$rx/);
};
my $spl = sub {
my ($values,$rx) = #_;
return split $rx, $values;
};
my $it = $iterator->( \*DATA );
my $f = $filter->($it,#filt_rx);
my %map;
while ( my $line = $f->() ) {
my ($k,$v) = $kv->($line,$kv_rx);
$map{$k} = [ $spl->($v,$spl_rx) ];
}
print Dumper \%map;
__DATA__
# comment line - ignore
# ignore also empty lines
key1 = value
key2 = value1, value2, value3
It produces the following hash on the provided input:
$VAR1 = {
'key2' => [
'value1',
'value2',
'value3'
],
'key1' => [
'value'
]
};
you might be interested in this SO question as well as this one.
the following code is a self-contained perl script destined to give you an idea of how to implement in perl (only partially in a functional style; in case you don't revulse seeing the particular coding style and/or language construct, i can refine the solution somewhat).
Miguel Prz is right that in most cases you'd search CPAN for solutions to match your requirements.
my (
$is_interesting_line
, $re_is_comment_line
, $re_key_values
, $re_splitter
);
$re_is_comment_line = qr(^\s*#);
$re_key_values = qr(^\s*(\w+)\s*=\s*(.*)$);
$re_splitter = qr(\s*,\s*);
$is_interesting_line = sub {
my $line = shift;
return (
(!(
!defined($line)
|| ($line eq '')
))
&& ($line =~ /$re_key_values/)
);
};
sub strip {
my $line = shift;
# your implementation goes here
return $line;
}
sub parse {
my #lines = #_;
#
my (
$dict
, $interesting_lines
, $k
, $v
);
#
#$interesting_lines =
grep {
&{$is_interesting_line} ( $_ );
} ( map { strip($_); } #lines )
;
$dict = {};
map {
if ($_ =~ /$re_key_values/) {
($k, $v) = ($1, [split(/$re_splitter/, $2)]);
$$dict{$k} = $v;
}
} #$interesting_lines;
return $dict;
} # parse
#
# sample execution goes here
#
my $parse =<<EOL;
# comment
what = is, this, you, wonder
it = is, perl
EOL
parse ( split (/[\r\n]+/, $parse) );

Adressing a hash of hashes with an array

This is my problem:
I have a file-system like data-structure:
%fs = (
"home" => {
"test.file" => {
type => "file",
owner => 1000,
content => "Hello World!",
},
},
"etc" => {
"passwd" => {
type => "file",
owner => 0,
content => "testuser:testusershash",
},
"conf" => {
"test.file" => {
type => "file",
owner => 1000,
content => "Hello World!",
},
},
},
);
Now, to get the content of /etc/conf/test.file I need $fs{"etc"}{"conf"}{"test.file"}{"content"}, but my input is an array and looks like this: ("etc","conf","test.file").
So, because the length of the input is varied, I don't know how to access the values of the hash. Any ideas?
You can use a loop. In each step, you proceed one level deeper into the structure.
my #path = qw/etc conf test.file/;
my %result = %fs;
while (#path) {
%result = %{ $result{shift #path} };
}
print $result{content};
You can also use Data::Diver.
my #a = ("etc","conf","test.file");
my $h = \%fs;
while (my $v = shift #a) {
$h = $h->{$v};
}
print $h->{type};
Same logic as what others given, but uses foreach
#keys = qw(etc conf test.file content);
$r = \%fs ;
$r = $r->{$_} foreach (#keys);
print $r;
$pname = '/etc/conf/test.file';
#names = split '/', $pname;
$fh = \%fs;
for (#names) {
$fh = $fh->{"$_"} if $_;
}
print $fh->{'content'};
Path::Class accepts an array. It also gives you an object with helper methods and handles cross platform slash issues.
https://metacpan.org/module/Path::Class
You can just build the hash element expression and call eval. This is tidier if it is wrapped in a subroutine
my #path = qw/ etc conf test.file /;
print hash_at(\%fs, \#path)->{content}, "\n";
sub hash_at {
my ($hash, $path) = #_;
$path = sprintf q($hash->{'%s'}), join q('}{'), #$path;
return eval $path;
}

In Perl, can refer to an array using its name?

I'm new to Perl and I understand you can call functions by name, like this:
&$functionName();. However, I'd like to use an array by name. Is this possible?
Long code:
sub print_species_names {
my $species = shift(#_);
my #cats = ("Jeffry", "Owen");
my #dogs = ("Duke", "Lassie");
switch ($species) {
case "cats" {
foreach (#cats) {
print $_ . "\n";
}
}
case "dogs" {
foreach (#dogs) {
print $_ . "\n";
}
}
}
}
Seeking shorter code similar to this:
sub print_species_names {
my $species = shift(#_);
my #cats = ("Jeffry", "Owen");
my #dogs = ("Duke", "Lassie");
foreach (#<$species>) {
print $_ . "\n";
}
}
Possible? Yes. Recommended? No. In general, using symbolic references is bad practice. Instead, use a hash to hold your arrays. That way you can look them up by name:
sub print_species_names {
my $species = shift;
my %animals = (
cats => [qw(Jeffry Owen)],
dogs => [qw(Duke Lassie)],
);
if (my $array = $animals{$species}) {
print "$_\n" for #$array
}
else {
die "species '$species' not found"
}
}
If you want to reduce that even more, you could replace the if/else block with:
print "$_\n" for #{ $animals{$species}
or die "species $species not found" };
You can achieve something close by using a hash of array references:
%hash = ( 'cats' => [ "Jeffry", "Owen"],
'dogs' => [ "Duke", "Lassie" ] );
$arrayRef = $hash{cats};
You could also use eval here:
foreach (eval("#$species")) {
print $_ . "\n";
}
I should have made it clear that you need to turn off strict refs for this to work. So surrounding the code with use "nostrict" and use "strict" works.
This is whats known as a soft reference in perl.

Where can I find an array of the (un)assigned Unicode code points for a particular block?

At the moment, I'm writing these arrays by hand.
For example, the Miscellaneous Mathematical Symbols-A block has an entry in hash like this:
my %symbols = (
...
miscellaneous_mathematical_symbols_a => [(0x27C0..0x27CA), 0x27CC,
(0x27D0..0x27EF)],
...
)
The simpler, 'continuous' array
miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF]
doesn't work because Unicode blocks have holes in them. For example, there's nothing at 0x27CB. Take a look at the code chart [PDF].
Writing these arrays by hand is tedious, error-prone and a bit fun. And I get the feeling that someone has already tackled this in Perl!
Perhaps you want Unicode::UCD? Use its charblock routine to get the range of any named block. If you want to get those names, you can use charblocks.
This module is really just an interface to the Unicode databases that come with Perl already, so if you have to do something fancier, you can look at the lib/5.x.y/unicore/UnicodeData.txt or the various other files in that same directory to get what you need.
Here's what I came up with to create your %symbols. I go through all the blocks (although in this sample I skip that ones without "Math" in their name. I get the starting and ending code points and check which ones are assigned. From that, I create a custom property that I can use to check if a character is in the range and assigned.
use strict;
use warnings;
digest_blocks();
my $property = 'My::InMiscellaneousMathematicalSymbolsA';
foreach ( 0x27BA..0x27F3 )
{
my $in = chr =~ m/\p{$property}/;
printf "%X is %sin $property\n",
$_, $in ? '' : ' not ';
}
sub digest_blocks {
use Unicode::UCD qw(charblocks);
my $blocks = charblocks();
foreach my $block ( keys %$blocks )
{
next unless $block =~ /Math/; # just to make the output small
my( $start, $stop ) = #{ $blocks->{$block}[0] };
$blocks->{$block} = {
assigned => [ grep { chr =~ /\A\p{Assigned}\z/ } $start .. $stop ],
unassigned => [ grep { chr !~ /\A\p{Assigned}\z/ } $start .. $stop ],
start => $start,
stop => $stop,
name => $block,
};
define_my_property( $blocks->{$block} );
}
}
sub define_my_property {
my $block = shift;
(my $subname = $block->{name}) =~ s/\W//g;
$block->{my_property} = "My::In$subname"; # needs In or Is
no strict 'refs';
my $string = join "\n", # can do ranges here too
map { sprintf "%X", $_ }
#{ $block->{assigned} };
*{"My::In$subname"} = sub { $string };
}
If I were going to do this a lot, I'd use the same thing to create a Perl source file that has the custom properties already defined so I can just use them right away in any of my work. None of the data should change until you update your Unicode data.
sub define_my_property {
my $block = shift;
(my $subname = $block->{name}) =~ s/\W//g;
$block->{my_property} = "My::In$subname"; # needs In or Is
no strict 'refs';
my $string = num2range( #{ $block->{assigned} } );
print <<"HERE";
sub My::In$subname {
return <<'CODEPOINTS';
$string
CODEPOINTS
}
HERE
}
# http://www.perlmonks.org/?node_id=87538
sub num2range {
local $_ = join ',' => sort { $a <=> $b } #_;
s/(?<!\d)(\d+)(?:,((??{$++1})))+(?!\d)/$1\t$+/g;
s/(\d+)/ sprintf "%X", $1/eg;
s/,/\n/g;
return $_;
}
That gives me output suitable for a Perl library:
sub My::InMiscellaneousMathematicalSymbolsA {
return <<'CODEPOINTS';
27C0 27CA
27CC
27D0 27EF
CODEPOINTS
}
sub My::InSupplementalMathematicalOperators {
return <<'CODEPOINTS';
2A00 2AFF
CODEPOINTS
}
sub My::InMathematicalAlphanumericSymbols {
return <<'CODEPOINTS';
1D400 1D454
1D456 1D49C
1D49E 1D49F
1D4A2
1D4A5 1D4A6
1D4A9 1D4AC
1D4AE 1D4B9
1D4BB
1D4BD 1D4C3
1D4C5 1D505
1D507 1D50A
1D50D 1D514
1D516 1D51C
1D51E 1D539
1D53B 1D53E
1D540 1D544
1D546
1D54A 1D550
1D552 1D6A5
1D6A8 1D7CB
1D7CE 1D7FF
CODEPOINTS
}
sub My::InMiscellaneousMathematicalSymbolsB {
return <<'CODEPOINTS';
2980 29FF
CODEPOINTS
}
sub My::InMathematicalOperators {
return <<'CODEPOINTS';
2200 22FF
CODEPOINTS
}
Maybe this?
my #list =
grep {chr ($_) =~ /^\p{Assigned}$/}
0x27C0..0x27EF;
#list = map { $_ = sprintf ("%X", $_ )} #list;
print "#list\n";
Gives me
27C0 27C1 27C2 27C3 27C4 27C5 27C6 27C7 27C8 27C9 27CA 27D0 27D1 27D2 27D3
27D4 27D5 27D6 27D7 27D8 27D9 27DA 27DB 27DC 27DD 27DE 27DF 27E0 27E1 27E2
27E3 27E4 27E5 27E6 27E7 27E8 27E9 27EA 27EB
I don't know why you wouldn't say miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF], because that's how the Unicode standard is defined according to the PDF.
What do you mean when you say it doesn't "work"? If it's giving you some sort of error when you check the existence of the character in the block, then why not just weed them out of the block when your checker comes across an error?