Matching arbitrary delimiters - perl

I've had good success parsing complicated and silly old text formats with Marpa before and I'm trying to do it again.
This particular format has hundred and hundreds of different kinds of 'Begin' and 'End' blocks that look like this:
Begin BlahBlah
asdf qwer 123
987 xxxx
End BlahBlah
Begin FooFoo
Begin BarBar
some stuff (1,2,3)
End BarBar
whatever x
End FooFoo
How do I make a single rule that will match all of BlahBlah, BarBar, and FooFoo in the stuff above? I don't see in any examples how to dynamically capture the token and re-use it to terminate the rule, at least not with the standard scanless grammar examples. I don't want to enumerate all the different kinds of blocks because new kinds will break things, and I don't think it should be necessary.
The contents of the Begin/End blocks are immaterial to the question. In reality that stuff is a complicated mess, but nothing I don't know how to slog through. I'm hand-waving over other complicating details that make Marpa a good tool for this, such that I don't want to resort to regex.
At a bare minimum all I'm trying to achieve is a key-value map of the block type (i.e. "BlahBlah") to its contents as a string.

This doesn't exactly answer my original question because I ultimately arrived at simply ignoring the repeated string following the "End" token. I will probably follow the comment suggestion above of simply checking that the begin/end names match in a post-processing step. Operating under the assumption that the token is redundant, this seems to work OK, as a rough first cut. Critique welcome:
#!/usr/bin/perl
use warnings;
use strict;
use v5.18;
use utf8;
use feature 'unicode_strings';
use autodie;
use Marpa::R2;
use Data::Dumper;
my $g = Marpa::R2::Scanless::G->new({
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:default ::= action => ::array
:start ::= beginend_blocks
:discard ~ <ws>
beginend_blocks ::= beginend_block+
beginend_block ::= beginend_block_header beginend_block_contents
beginend_block_header ::= ('Begin') beginend_block_name action => ::first
beginend_block_name ::= <word>
beginend_block_contents ::= beginend_block_content_elems (beginend_block_terminator) (<word>)
beginend_block_content_elems ::= beginend_block_content_elem+
beginend_block_content_elem ::= word action => ::first
| beginend_block action => ::first
beginend_block_terminator ::= ('End')
<word> ~ <wordchar>+
<wordchar> ~ [\S]
<ws> ~ [\s]+
END_OF_SOURCE
});
my $test_str = <<THEDATA;
Begin BlahBlah
asdf qwer 123
987 xxxx
End BlahBlah
Begin FooFoo
something else
Begin BazBaz
some stuff (1,2,3)
End BazBaz
whatever x
Begin BarBar
some stuff (1,2,3)
End BarBar
whatever y
End FooFoo
THEDATA
MAIN: {
my $re = Marpa::R2::Scanless::R->new({ grammar => $g, trace_terminals => 0 });
for (my $pos = $re->read(\$test_str); $pos < length $test_str; $pos = $re->resume) {
my ($pause_start, undef) = $re->pause_span;
}
say Dumper $re->value;
}

Related

Perl s/$array[1]\b/$array[1] won't replace

I would like to replace the string $array[1] by the actual variable value.
the \b doesn't seem to work
Does anyone know how the replace the array variable ? What's the delimiter ?
s/$array[1]\b/$array[1]
The [ ... ] has a special meaning in regular expressions (it defines a "character class"). If you want to use [ to mean a [, then you need to escape it with a \.
s/\$array\[1]/$array[1]/
Update: Added escape to $. Removed \b.
I would recommend a real templating engine to perform such a substitution, this will allow you to extend it to things that don't look exactly like $array[1] without making it more complicated, but you will need to alter your input to what the templating engine expects. One option is Text::Template.
use strict;
use warnings;
use Text::Template 'fill_in_string';
my $input = 'foo {$array[1]} bar';
my #array = 1..10;
my $rendered = fill_in_string $input, HASH => {array => \#array};
print $rendered, "\n"; # foo 2 bar

Matching text not enclosed by parenthesis

I am still learning Perl, so apologies if this is an obvious question.
Is there a way to match text that is NOT enclosed by parenthesis?
For example, searching for foo would match the second line only.
(bar foo bar)
bar foo (
bar foo
(bar) (foo)
)
Regex patterns have an implicit leading \G(?s:.)*? ("skip characters until a match is found"). The following expands that definition to consider nested parens to be a character to skip.
while (
$string =~ m{
\G (?&MEGA_DOT)*?
( foo )
(?(DEFINE)
(?<MEGA_DOT> [^()] | \( (?&MEGA_DOT)*+ \) )
)
}xg
) {
say "Found a match at pos $-[1].";
}
This is very far from "obvious"; on the contrary. There is no direct way to say "don't match" for a complex pattern (there is good support at a character level, with [^a], \S etc). Regex is firstly about matching things, not about not-matching them.
One approach is to match those (possibly nested) delimiters and get everything other than that.
A good tool for finding nested delimiters is the core module Text::Balanced. As it matches it can also give us the substring before the match and the rest of the string after the match.
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw(extract_bracketed);
my $text = <<'END';
(bar foo bar)
bar foo (
bar foo
(bar) (foo)
)
END
my ($match, $before);
my $remainder = $text;
while (1) {
($match, $remainder, $before) = extract_bracketed($remainder, '(', '[^(]*');
print $before // $remainder;
last if not defined $match;
}
The extract_bracketed returns the match, the remainder substring ($remainder), and the substring before the match ($before); so we keep matching in the remainder.
Taken from this post, where there are more details and another way, using Regexp::Common.

Marpa: Can I explicitly disallow keywords as identifiers?

I'm implementing a new DSL in Marpa and (coming from Regexp::Grammars) I'm more than satisfied. My language supports a bunch of unary and binary operators, objects with C-style identifiers and method calls using the familiar dot notation. For example:
foo.has(bar == 42 AND baz == 23)
I found the prioritized rules feature offered by Marpa's grammar description language and have come to rely on that a lot, so I have nearly only one G1 rule Expression. Excerpt (many alternatives, and semantic actions omitted for brevity):
Expression ::=
NumLiteral
| '(' Expression ')' assoc => group
|| Expression ('.') Identifier
|| Expression ('.') Identifier Args
| Expression ('==') Expression
|| Expression ('AND') Expression
Args ::= ('(') ArgsList (')')
ArgsList ::= Expression+ separator => [,]
Identifier ~ IdentifierHeadChar IdentifierBody
IdentifierBody ~ IdentifierBodyChar*
IdentifierHeadChar ~ [a-zA-Z_]
IdentifierBodyChar ~ [a-zA-Z0-9_]
NumLiteral ~ [0-9]+
As you can see, I'm using the Scanless interface (SLIF). My problem is that this also parses, for example:
foo.AND(5)
Marpa knows that there can only be an identifier after a dot, so it doesn't even consider the fact that AND might be a keyword. I know that I can avoid that problem by doing a separate lexing stage that identifies AND as a keyword explicitly, but that tiny papercut is not quite worth the effort.
Is there a way in SLIF to restrict the Identifier rule to non-keyword identifiers only?
I don't know how to express such a thing in the grammar. You can introduce an intermediate non-terminal for Identifier which would check the condition, though:
#!/usr/bin/perl
use warnings;
use strict;
use Syntax::Construct qw{ // };
use Marpa::R2;
my %reserved = map { $_ => 1 } qw( AND );
my $grammar = 'Marpa::R2::Scanless::G'->new(
{ bless_package => 'main',
source => \( << '__GRAMMAR__'),
:default ::= action => store
:start ::= S
S ::= Id
| Id NumLiteral
Id ::= Identifier action => allowed
Identifier ~ IdentifierHeadChar IdentifierBody
IdentifierBody ~ IdentifierBodyChar*
IdentifierHeadChar ~ [a-zA-Z_]
IdentifierBodyChar ~ [a-zA-Z0-9_]
NumLiteral ~ [0-9]+
:discard ~ whitespace
whitespace ~ [\s]+
__GRAMMAR__
});
for my $value ('ABC', 'ABC 42', 'AND 1') {
my $value = $grammar->parse(\$value, 'main');
print $$value, "\n";
}
sub store {
my (undef, $id, $arg) = #_;
$arg //= 'null';
return "$id $arg";
}
sub allowed {
my (undef, $id) = #_;
die "Reserved keyword $id" if $reserved{$id};
return $id
}
You can use lexeme priorities intended just for such kind of thing, the example is here in Marpa::R2 test suite.
Basically, you declare <AND keyword> ~ 'AND' lexeme and give it priority 1 so that it's preferred over Identifier. That must do the trick.
P.S. I modified the above script slightly to give an example — code, output.

Trouble separating G0 and G1 rules in grammar

I'm trying to get what seems like a very basic Marpa grammar working. The code I use is below:
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{
source => \(<<'END_OF_SOURCE'),
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr ~ word
AndExpr ~ word*
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
}
);
my $reader = Marpa::R2::Scanless::R->new(
{
grammar => $grammar,
}
);
my $input = 'foo';
$reader->read(\$input);
my $value = $reader->value;
print Dumper $value;
This prints $VAR1 = \'foo';. So it recognizes one word just fine. But I want it to recognize a string of words
my $input='foo bar'
Now the script prints:
Error in SLIF G1 read: Parse exhausted, but lexemes remain, at position 4
I think this is because ExprSingle uses the ~ (match) operator, which makes it part of the tokenizing level, G0, instead of the structural level G1; the :discard rule allows space between G1 rules, not G0 ones. So I change the grammar like so:
ExprSingle ::= Expr AndExpr
Now no warning is printed, but the resulting value is undef instead of something containing 'foo' and 'bar'. I'm honestly not sure what that means, since, before, the failed parse threw an actual error.
I tried changing the grammar to separate what I think are G0 and G1 rules further, but still no luck:
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr ::= token
AndExpr ::= token*
token ~ word
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
The final value is still undef. trace_terminals shows both 'foo' and 'bar' being accepted as tokens. What do I need to do to fix this grammar (by which I mean get a value containing the strings 'foo' and 'bar' instead of just undef)?
Rules by default return a value of undef, so in your case a return of \undef from $reader->value() means your parse succeeded. That is, a return of undef means failure, while a return of \undef means success where the parse evaluated to undef.
A good, fast way to start with a more helpful semantics is to add the following line:
:default ::= action => ::array
This causes the parse to generate an AST.

Perl Map Function

I'm new to the map and grep functions and I'm trying to make an existing script more concise.
I can "grep" the #tracknames successfully but I'm having a problem with "map". I want #trackartist to return true if two consecutive "--" are found in a line and take the value of $1, otherwise false, but it returns the whole line if the upper condition is not met.
What am I doing wrong?
my #tracknames = grep /^\d\d\..*?(\.(?:flac|wv))$/, <*.*>;
my #trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; } <*.*>;
Sample of files
01. some track artist 1 -- some track name 1.(flac or wv)
02. some track artist 2 -- some track name 2.(flac or wv)
03. some track artist 3 -- some track name 3.(flac or wv)
etc.
Remember that grep is for filtering a list and map is for transforming a list. Right now, your map statement returns $_ for every item in the list. If $_ matches the pattern in your substitution, it will be modified and replaced with the first match. Otherwise, it's not modified and the original $_ is returned.
It sounds like you want to filter out items that don't match the pattern. One way would be to combine a map and a grep:
my #trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; }
grep { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ } <*.*>;
Of course, this means you're doing the same pattern match twice. Another approach is to do a transform with map, but transform anything that doesn't match the pattern into an empty list.
my #trackartist = map { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ ? $1 : ( ) } <*.*>
This uses the ternary conditional operator (?:) to check if the regex matches (returning a true value). If it does, $1 is returned from the map block, if not, an empty list ( ) is returned, which adds nothing to the list resulting from the map.
As a side note, you might want to look into using the glob function rather than <>, which has some disadvantages.
I like map and grep as much as the next guy, but your task seems more suited to a divide-and-conquer parsing approach. I say this because your comments suggest that your interest in map is leading you down a road where you'll end up with a data model consisting of parallel arrays -- #tracks, #artists, etc. -- which is often difficult to maintain in the long run. Here's a sketch of what I mean:
my #tracks;
while (my $file_name = <DATA>){ # You'll use glob() or <*.*>
# Filter out unwanted files.
my ($num, $artist_title, $ext) = $file_name =~ /
^ (\d\d) \. \s*
(.*)
\. (flac|wv) $
/x;
next unless $ext;
# Try to parse the artist and title. Adjust as needed.
my ($artist, $title) = split /\s+--\s+/, $artist_title, 2;
($artist, $title) = ('UNKNOWN', $artist) unless $title;
# Store all info as a hash ref. No need for parallel arrays.
push #tracks, {
file_name => $file_name,
ext => $ext,
artist => $artist,
title => $title,
};
}
__DATA__
01. Perl Jam -- Open or die.wv
02. Perl Jam -- Map to nowhere.flac
03. Perl Jam -- What the #$#!?.wv
04. Perl Jam -- Regex blues.wv
05. Perl Jam -- Use my package, baby.wv
06. Perl Jam -- No warnings.wv
07. Perl Jam -- Laziness ISA virtue.wv
08. Guido and the Pythons -- Home on the xrange.flac
09. Guido and the Pythons -- You gotta keep em generated.flac
10. StackOverflow medley.wv
foo.txt