How do I make Marpa's sequence rules greedy?

I am working on a Marpa::R2 grammar that groups items in a text. Each group can only contain items of a certain kind, but is not explicitly delimited. This causes problems, because x...x (where . represents an item that can be part of a group) can be grouped as x(...)x, x(..)(.)x, x(.)(..)x, x(.)(.)(.)x. In other words, the grammar is highly ambiguous.
How can I remove this ambiguity if I only want the x(...)x parse, i.e. if I want to force a + quantifier to only behave “greedy” (as it does in Perl regexes)?
In the below grammar, I tried adding rank adverbs to the sequence rules in order to prioritize Group over Sequence, but that doesn't seem to work.
Below is a test case that exercises this behaviour.
use strict;
use warnings;
use Marpa::R2;
use Test::More;
my $grammar_source = <<'END_GRAMMAR';
inaccessible is fatal by default
:discard ~ space
:start ::= Sequence
Sequence
::= SequenceItem+ action => ::array
SequenceItem
::= WORD action => ::first
| Group action => ::first
Group
::= GroupItem+ action => [name, values]
GroupItem
::= ('[') Sequence (']') action => ::first
WORD ~ [a-z]+
space ~ [\s]+
END_GRAMMAR
my $input = "foo [a] [b] bar";
diag "perl $^V";
diag "Marpa::R2 " . Marpa::R2->VERSION;
my $grammar = Marpa::R2::Scanless::G->new({ source => \$grammar_source });
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
$recce->read(\$input);
my $parse_count = 0;
while (my $value = $recce->value) {
is_deeply $$value, ['foo', [Group => ['a'], ['b']], 'bar'], 'expected structure'
or diag explain $$value;
$parse_count++;
}
is $parse_count, 1, 'expected number of parses';
done_testing;
Output of the test case (FAIL):
# perl v5.18.2
# Marpa::R2 2.09
ok 1 - expected structure
not ok 2 - expected structure
# Failed test 'expected structure'
# at - line 38.
# Structures begin differing at:
# $got->[1][2] = Does not exist
# $expected->[1][2] = ARRAY(0x981bd68)
# [
# 'foo',
# [
# 'Group',
# [
# 'a'
# ]
# ],
# [
# ${\$VAR1->[1][0]},
# [
# 'b'
# ]
# ],
# 'bar'
# ]
not ok 3 - expected number of parses
# Failed test 'expected number of parses'
# at - line 41.
# got: '2'
# expected: '1'
1..3
# Looks like you failed 2 tests of 3.

Sequence rules are designed for non-tricky cases. Sequence rules can always be rewritten as BNF rules when the going gets tricky, and that is what I suggest here. The following makes your test work:
use strict;
use warnings;
use Marpa::R2;
use Test::More;
my $grammar_source = <<'END_GRAMMAR';
inaccessible is fatal by default
:discard ~ space
# Three cases
# 1.) Just one group.
# 2.) Group followed by alternating words and groups.
# 3.) Alternating words and groups, starting with words
Sequence ::= Group action => ::first
Sequence ::= Group Subsequence action => [values]
Sequence ::= Subsequence action => ::first
Subsequence ::= Words action => ::first
# "action => [values]" makes the test work unchanged.
# The action for the next rule probably should be
# action => [name, values] in order to handle the general case.
Subsequence ::= Subsequence Group Words action => [values]
Words ::= WORD+ action => ::first
Group
::= GroupItem+ action => [name, values]
GroupItem
::= ('[') Sequence (']') action => [value]
WORD ~ [a-z]+
space ~ [\s]+
END_GRAMMAR
my $input = "foo [a] [b] bar";
diag "perl $^V";
diag "Marpa::R2 " . Marpa::R2->VERSION;
my $grammar = Marpa::R2::Scanless::G->new( { source => \$grammar_source } );
my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
$recce->read( \$input );
my $parse_count = 0;
while ( my $value = $recce->value ) {
is_deeply $$value, [ 'foo', [ Group => ['a'], ['b'] ], 'bar' ],
'expected structure'
or diag explain $$value;
$parse_count++;
} ## end while ( my $value = $recce->value )
is $parse_count, 1, 'expected number of parses';
done_testing;

Unambiguous grammar:
Sequence : WORD+ SequenceAfterWords
| Group SequenceAfterGroup
SequenceAfterWords : Group SequenceAfterGroup
|
SequenceAfterGroup : WORD+ SequenceAfterWords
|
Jeffrey Kegler says that leading with the recursion is handled more efficiently in Marpa. The same approach as above can be applied back to front to produce this:
Sequence : SequenceBeforeWords WORD+
| SequenceBeforeGroup Group
SequenceBeforeWords : SequenceBeforeGroup Group
|
SequenceBeforeGroup : SequenceBeforeWords WORD+
|
In both cases,
Group : GroupItem+
GroupItem : '[' Sequence ']'
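For completeness, here is the left-recursive variant written out as a runnable SLIF grammar. This is a sketch of my own, not code from the original answer: the symbol names, actions and the Dumper-based driver are assumptions, and nulled prefixes show up as undef inside the [values] arrays, so some post-processing would still be needed to flatten the result into the shape the test above expects.
#!/usr/bin/perl
# Sketch only: the left-recursive "unambiguous grammar" above as a full SLIF grammar.
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;
my $grammar_source = <<'END_GRAMMAR';
:discard ~ space
:start ::= Sequence
Sequence            ::= SequenceBeforeWords Words action => [values]
                      | SequenceBeforeGroup Group action => [values]
SequenceBeforeWords ::= SequenceBeforeGroup Group action => [values]
SequenceBeforeWords ::=
SequenceBeforeGroup ::= SequenceBeforeWords Words action => [values]
SequenceBeforeGroup ::=
Words               ::= WORD+      action => [values]
Group               ::= GroupItem+ action => [name, values]
GroupItem           ::= ('[') Sequence (']') action => ::first
WORD  ~ [a-z]+
space ~ [\s]+
END_GRAMMAR
my $grammar = Marpa::R2::Scanless::G->new({ source => \$grammar_source });
my $recce   = Marpa::R2::Scanless::R->new({ grammar => $grammar });
my $input   = 'foo [a] [b] bar';
$recce->read(\$input);
my $parse_count = 0;
while ( my $value = $recce->value ) {
    print Dumper $$value;    # left-leaning nested arrays; empty prefixes are undef
    $parse_count++;
}
print "parse count: $parse_count\n";    # expect 1: the grammar is unambiguous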

Parse single quoted string using Marpa::R2

How to parse single quoted string using Marpa::R2?
In my code below, single-quoted strings get a '\' appended when parsed.
Code:
use strict;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{ default_action => '[values]',
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:start ::= Expression
# include begin
Expression ::= Param
Param ::= Unquoted
| ('"') Quoted ('"')
| (') Quoted (')
:discard ~ whitespace
whitespace ~ [\s]+
Unquoted ~ [^\s\/\(\),&:\"~]+
Quoted ~ [^\s&:\"~]+
END_OF_SOURCE
});
my $input1 = 'foo';
#my $input2 = '"foo"';
#my $input3 = '\'foo\'';
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
print "Trying to parse:\n$input1\n\n";
$recce->read(\$input1);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
Outputs:
Trying to parse:
foo
Output:
$VAR1 = [
[
'foo'
]
];
Trying to parse:
"foo"
Output:
$VAR1 = [
[
'foo'
]
];
Trying to parse:
'foo'
Output:
$VAR1 = [
[
'\'foo\''
]
]; (don't want it to be parsed like this)
Above are the outputs for all the inputs. I don't want the 3rd one to have '\' and the single quotes appended; I want it to be parsed like output 2. Please advise.
Ideally, it should just pick the content between single quotes according to Param ::= (') Quoted (')
The other answer regarding Data::Dumper output is correct. However, your grammar does not work the way you expect it to.
When you parse the input 'foo', Marpa will consider the three Param alternatives. The predicted lexemes at that position are:
Unquoted ~ [^\s\/\(\),&:\"~]+
'"'
') Quoted ('
Yes, the last is literally ) Quoted (, not anything containing a single quote.
Even if it were ([']) Quoted ([']): Due to longest token matching, the Unquoted lexeme will match the entire input, including the single quote.
What would happen for an input like " foo " (with double quotes)? Now, only the '"' lexeme would match, then any whitespace would be discarded, then the Quoted lexeme matches, then any whitespace is discarded, then closing " is matched.
To prevent this whitespace-skipping behaviour and to prevent the Unquoted rule from being preferred due to LATM, it makes sense to describe quoted strings as lexemes. For example:
Param ::= Unquoted | Quoted
Unquoted ~ [^'"]+
Quoted ~ DQ | SQ
DQ ~ '"' DQ_Body '"' DQ_Body ~ [^"]*
SQ ~ ['] SQ_Body ['] SQ_Body ~ [^']*
These lexemes will then include any quotes and escapes, so you need to post-process the lexeme contents. You can either do this using the event system (which is conceptually clean, but a bit cumbersome to implement), or adding an action that performs this processing during parse evaluation.
Since lexemes cannot have actions, it is usually best to add a proxy production:
Param ::= Unquoted | Quoted
Unquoted ~ [^'"]+
Quoted ::= Quoted_Lexeme action => process_quoted
Quoted_Lexeme ~ DQ | SQ
DQ ~ '"' DQ_Body '"' DQ_Body ~ [^"]*
SQ ~ ['] SQ_Body ['] SQ_Body ~ [^']*
The action could then do something like:
sub process_quoted {
my (undef, $s) = @_;
# remove delimiters from double-quoted string
return $1 if $s =~ /^"(.*)"$/s;
# remove delimiters from single-quoted string
return $1 if $s =~ /^'(.*)'$/s;
die "String was not delimited with single or double quotes";
}
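Putting those pieces together, here is a self-contained sketch (my own assembly, not code from the answer). It assumes a semantics package named My_Actions so the process_quoted action name resolves, and it tightens Unquoted to exclude whitespace so the :discard rule still applies between tokens.
#!/usr/bin/perl
# Sketch only: quoted strings as single lexemes, with a proxy production that
# strips the delimiters during parse evaluation.
use strict;
use warnings;
use Marpa::R2;
my $grammar = Marpa::R2::Scanless::G->new({
    source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:default ::= action => ::first
:start ::= Param
Param         ::= Unquoted | Quoted
Quoted        ::= Quoted_Lexeme action => process_quoted
Unquoted      ~ [^'"\s]+
Quoted_Lexeme ~ DQ | SQ
DQ            ~ '"' DQ_Body '"'
DQ_Body       ~ [^"]*
SQ            ~ ['] SQ_Body [']
SQ_Body       ~ [^']*
:discard      ~ whitespace
whitespace    ~ [\s]+
END_OF_SOURCE
});
sub My_Actions::process_quoted {
    my (undef, $s) = @_;
    return $1 if $s =~ /^"(.*)"$/s;    # strip double-quote delimiters
    return $1 if $s =~ /^'(.*)'$/s;    # strip single-quote delimiters
    die "String was not delimited with single or double quotes";
}
for my $input ('foo', '"foo"', "'foo'") {
    my $recce = Marpa::R2::Scanless::R->new({
        grammar           => $grammar,
        semantics_package => 'My_Actions',
    });
    $recce->read(\$input);
    print ${ $recce->value }, "\n";    # prints plain "foo" for all three inputs
}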
Your result doesn't contain \', it contains '. Dumper merely formats the result like that so it's clear what's inside the string and what isn't.
You can test this behavior for yourself:
use Data::Dumper;
my $tick = chr(39);
my $back = chr(92);
print "Tick Dumper: " . Dumper($tick);
print "Tick Print: " . $tick . "\n";
print "Backslash Dumper: " . Dumper($back);
print "Backslash Print: " . $back . "\n";
You can see a demo here: https://ideone.com/d1V8OE
If you don't want the output to contain single quotes, you'll probably need to remove them from the input yourself.
I am not so familiar with Marpa::R2, but could you try to use an action on the Expression rule:
Expression ::= Param action => strip_quotes
Then, implement a simple quote stripper like:
sub MyActions::strip_quotes {
@{$_[1]}[0] =~ s/^'|'$//gr;
}
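Note that the unqualified action name has to resolve to that package; one way (as far as I know) is to point the recognizer at it:
my $recce = Marpa::R2::Scanless::R->new({
    grammar           => $grammar,
    semantics_package => 'MyActions',   # so "strip_quotes" resolves to MyActions::strip_quotes
});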

Matching arbitrary delimiters

I've had good success parsing complicated and silly old text formats with Marpa before and I'm trying to do it again.
This particular format has hundreds and hundreds of different kinds of 'Begin' and 'End' blocks that look like this:
Begin BlahBlah
asdf qwer 123
987 xxxx
End BlahBlah
Begin FooFoo
Begin BarBar
some stuff (1,2,3)
End BarBar
whatever x
End FooFoo
How do I make a single rule that will match all of BlahBlah, BarBar, and FooFoo in the stuff above? I don't see in any examples how to dynamically capture the token and re-use it to terminate the rule, at least not with the standard scanless grammar examples. I don't want to enumerate all the different kinds of blocks because new kinds will break things, and I don't think it should be necessary.
The contents of the Begin/End blocks are immaterial to the question. In reality that stuff is a complicated mess, but nothing I don't know how to slog through. I'm hand-waving over other complicating details that make Marpa a good tool for this, which is why I don't want to resort to regexes.
At a bare minimum all I'm trying to achieve is a key-value map of the block type (i.e. "BlahBlah") to its contents as a string.
This doesn't exactly answer my original question because I ultimately arrived at simply ignoring the repeated string following the "End" token. I will probably follow the comment suggestion above of simply checking that the begin/end names match in a post-processing step. Operating under the assumption that the token is redundant, this seems to work OK, as a rough first cut. Critique welcome:
#!/usr/bin/perl
use warnings;
use strict;
use v5.18;
use utf8;
use feature 'unicode_strings';
use autodie;
use Marpa::R2;
use Data::Dumper;
my $g = Marpa::R2::Scanless::G->new({
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:default ::= action => ::array
:start ::= beginend_blocks
:discard ~ <ws>
beginend_blocks ::= beginend_block+
beginend_block ::= beginend_block_header beginend_block_contents
beginend_block_header ::= ('Begin') beginend_block_name action => ::first
beginend_block_name ::= <word>
beginend_block_contents ::= beginend_block_content_elems (beginend_block_terminator) (<word>)
beginend_block_content_elems ::= beginend_block_content_elem+
beginend_block_content_elem ::= word action => ::first
| beginend_block action => ::first
beginend_block_terminator ::= ('End')
<word> ~ <wordchar>+
<wordchar> ~ [\S]
<ws> ~ [\s]+
END_OF_SOURCE
});
my $test_str = <<THEDATA;
Begin BlahBlah
asdf qwer 123
987 xxxx
End BlahBlah
Begin FooFoo
something else
Begin BazBaz
some stuff (1,2,3)
End BazBaz
whatever x
Begin BarBar
some stuff (1,2,3)
End BarBar
whatever y
End FooFoo
THEDATA
MAIN: {
my $re = Marpa::R2::Scanless::R->new({ grammar => $g, trace_terminals => 0 });
for (my $pos = $re->read(\$test_str); $pos < length $test_str; $pos = $re->resume) {
my ($pause_start, undef) = $re->pause_span;
}
say Dumper $re->value;
}
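As a follow-up to the script above, here is a sketch of the post-processing check mentioned earlier (my own addition, not from the original post). It assumes one change to the grammar: dropping the parentheses around the trailing <word> in the beginend_block_contents rule so the closing name is kept in the tree.
# Grammar change assumed (the closing name is no longer hidden):
#   beginend_block_contents ::= beginend_block_content_elems (beginend_block_terminator) <word>
#
# With ":default ::= action => ::array" each block then comes back roughly as
#   [ [ $open_name ], [ [ @elems ], $close_name ] ]
# where nested blocks appear as array refs among @elems and plain words as strings.
sub check_block_names {
    my ($block) = @_;
    my ($header, $contents)  = @$block;
    my $open_name            = $header->[0];
    my ($elems, $close_name) = @$contents;
    die "Begin $open_name closed by End $close_name\n"
        if $open_name ne $close_name;
    check_block_names($_) for grep { ref } @$elems;    # recurse into nested blocks
    return 1;
}
# Usage, given the $re recognizer from the script above:
#   my $blocks = ${ $re->value };
#   check_block_names($_) for @$blocks;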

Marpa: Can I explicitly disallow keywords as identifiers?

I'm implementing a new DSL in Marpa and (coming from Regexp::Grammars) I'm more than satisfied. My language supports a bunch of unary and binary operators, objects with C-style identifiers and method calls using the familiar dot notation. For example:
foo.has(bar == 42 AND baz == 23)
I found the prioritized rules feature offered by Marpa's grammar description language and have come to rely on it a lot, so I have essentially just one G1 rule, Expression. Excerpt (many alternatives and semantic actions omitted for brevity):
Expression ::=
NumLiteral
| '(' Expression ')' assoc => group
|| Expression ('.') Identifier
|| Expression ('.') Identifier Args
| Expression ('==') Expression
|| Expression ('AND') Expression
Args ::= ('(') ArgsList (')')
ArgsList ::= Expression+ separator => [,]
Identifier ~ IdentifierHeadChar IdentifierBody
IdentifierBody ~ IdentifierBodyChar*
IdentifierHeadChar ~ [a-zA-Z_]
IdentifierBodyChar ~ [a-zA-Z0-9_]
NumLiteral ~ [0-9]+
As you can see, I'm using the Scanless interface (SLIF). My problem is that this also parses, for example:
foo.AND(5)
Marpa knows that there can only be an identifier after a dot, so it doesn't even consider the fact that AND might be a keyword. I know that I can avoid that problem by doing a separate lexing stage that identifies AND as a keyword explicitly, but that tiny papercut is not quite worth the effort.
Is there a way in SLIF to restrict the Identifier rule to non-keyword identifiers only?
I don't know how to express such a thing in the grammar. You can introduce an intermediate non-terminal for Identifier which would check the condition, though:
#!/usr/bin/perl
use warnings;
use strict;
use Syntax::Construct qw{ // };
use Marpa::R2;
my %reserved = map { $_ => 1 } qw( AND );
my $grammar = 'Marpa::R2::Scanless::G'->new(
{ bless_package => 'main',
source => \( << '__GRAMMAR__'),
:default ::= action => store
:start ::= S
S ::= Id
| Id NumLiteral
Id ::= Identifier action => allowed
Identifier ~ IdentifierHeadChar IdentifierBody
IdentifierBody ~ IdentifierBodyChar*
IdentifierHeadChar ~ [a-zA-Z_]
IdentifierBodyChar ~ [a-zA-Z0-9_]
NumLiteral ~ [0-9]+
:discard ~ whitespace
whitespace ~ [\s]+
__GRAMMAR__
});
for my $value ('ABC', 'ABC 42', 'AND 1') {
my $value = $grammar->parse(\$value, 'main');
print $$value, "\n";
}
sub store {
my (undef, $id, $arg) = @_;
$arg //= 'null';
return "$id $arg";
}
sub allowed {
my (undef, $id) = @_;
die "Reserved keyword $id" if $reserved{$id};
return $id
}
You can use lexeme priorities, which are intended for just this kind of thing; there is an example in the Marpa::R2 test suite.
Basically, you declare an <AND keyword> ~ 'AND' lexeme and give it priority 1 so that it is preferred over Identifier. That should do the trick.
P.S. I modified the above script slightly to give an example — code, output.

Trouble separating G0 and G1 rules in grammar

I'm trying to get what seems like a very basic Marpa grammar working. The code I use is below:
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{
source => \(<<'END_OF_SOURCE'),
:start ::= ExprSingle
ExprSingle ~ Expr AndExpr
Expr ~ word
AndExpr ~ word*
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
}
);
my $reader = Marpa::R2::Scanless::R->new(
{
grammar => $grammar,
}
);
my $input = 'foo';
$reader->read(\$input);
my $value = $reader->value;
print Dumper $value;
This prints $VAR1 = \'foo';, so it recognizes one word just fine. But I want it to recognize a string of words:
my $input = 'foo bar';
Now the script prints:
Error in SLIF G1 read: Parse exhausted, but lexemes remain, at position 4
I think this is because ExprSingle uses the ~ (match) operator, which makes it part of the tokenizing level, G0, instead of the structural level G1; the :discard rule allows space between G1 rules, not G0 ones. So I change the grammar like so:
ExprSingle ::= Expr AndExpr
Now no warning is printed, but the resulting value is undef instead of something containing 'foo' and 'bar'. I'm honestly not sure what that means, since, before, the failed parse threw an actual error.
I tried changing the grammar to separate what I think are G0 and G1 rules further, but still no luck:
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr ::= token
AndExpr ::= token*
token ~ word
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
The final value is still undef. trace_terminals shows both 'foo' and 'bar' being accepted as tokens. What do I need to do to fix this grammar (by which I mean get a value containing the strings 'foo' and 'bar' instead of just undef)?
Rules by default return a value of undef, so in your case a return of \undef from $reader->value() means your parse succeeded. That is, a return of undef means failure, while a return of \undef means success where the parse evaluated to undef.
A good, fast way to start with a more helpful semantics is to add the following line:
:default ::= action => ::array
This causes the parse to generate an AST.
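Here is a minimal sketch of the question's last grammar with that one line added (my own assembly); under ::array every rule returns an array of its children's values, so both words now show up in the result.
#!/usr/bin/perl
# Sketch only: the corrected grammar from the question plus a ::array default,
# so the parse value contains the words instead of evaluating to undef.
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new({
    source => \(<<'END_OF_SOURCE'),
:default ::= action => ::array
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr       ::= token
AndExpr    ::= token*
token      ~ word
word       ~ [\w]+
:discard   ~ ws
ws         ~ [\s]+
END_OF_SOURCE
});
my $reader = Marpa::R2::Scanless::R->new({ grammar => $grammar });
my $input  = 'foo bar';
$reader->read(\$input);
print Dumper ${ $reader->value };
# Prints a structure like [ [ 'foo' ], [ 'bar' ] ]:
# 'foo' from Expr, 'bar' from the AndExpr sequence.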

Scripting: Read condition written in words and converting into C - ternary operator

I am new to Perl scripting. I am writing a script to read an Excel file and write its contents to a text file in C programming syntax.
In the Excel sheet I have a string like the one below:
If ((Myvalue.xyz == 1) Or (Frmae_1.signal_1 == 1)) Then a = 1
else a = 0;
This I have to convert into:
a = (((Myvalue.xyz == 1) || (Frmae_1.signal_1 == 1))?1:0)
How can this be handled in Perl?
I do not think that throwing a regex at the code string would be an especially good idea. The syntax of your input doesn't look too extraordinary, so we could just parse it with Marpa, using a grammar like:
:default ::= action => [values]
:start ::= StatementList
:discard ~ ws
StatementList ::= <Expression>+ separator => <op semicolon> bless => Block
Expression ::=
('(') Expression (')') assoc => group action => ::first
| Number bless => Number
|| Ident bless => Var
|| Expression ('==') Expression bless => Numeric_eq
|| Expression ('=' ) Expression bless => Assign
|| Expression ('Or') Expression bless => Logical_or
|| Conditional
Conditional ::=
('If') Expression ('Then') Expression
bless => Cond
| ('If') Expression ('Then') Expression ('Else') Expression
bless => Cond
Ident ~ ident
Number ~ <number int> | <number rat>
word ~ [\w]+
ident ~ word | ident '.' word
<number int> ~ [\d]+
<number rat> ~ <number int> '.' <number int>
ws ~ [\s]+
<op semicolon> ~ ';'
Then:
use Marpa::R2;
my $grammar = Marpa::R2::Scanless::G->new({
bless_package => 'Ast',
source => \$the_grammar,
});
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
$recce->read(\$the_string);
my $val = $recce->value // die "No parse found";
my $ast = $$val;
As soon as we have the AST, compiling it down to the C-like representation isn't overly complex. Factoring out the common assignment with an “optimization” pass can be done with a bit of thinking.
However, showing how this can be done is rather lengthy, so I put all the in-depth stuff into this blogpost. We can then define a method that recurses through the tree and emits the C-like code, e.g.
package Ast::Var;
...;
sub compile { my $self = shift; $self->name } # no modification needed
package Ast::Logical_or;
...;
sub compile {
my $self = shift;
# C's "||" operator, plus parens to specify precedence
"(" . $self->l->compile . "||" . $self->r->compile . ")";
}
package Ast::Cond;
...;
sub compile {
my $self = shift;
return sprintf '(%s ? %s : %s)',
$self->cond->compile,
$self->then->compile,
$self->else->compile;
}
etc. for all the other AST node types.
The same expression is valid in Perl (modulo the access operator, which is -> in Perl, not the dot). You can also do
a = $my_value->xyz == 1 || $frmae_1->signal_1 == 1 ? 1 : 0;
the ? 1 : 0 part is not even necessary since $my_value->xyz == 1 || $frmae_1->signal_1 == 1 will return the Perl true or false values (numerically 1 and 0, string '1' and '')...
Say, your string is stored in $str. You can do the following to extract stuff from it:
my ($cond, $set, $then, $else) = ($str =~ /^If (.*) Then (.*?=\s+)(.*) else \2(.*);$/);
Now you have your condition in $cond, "a = " in $set, and the values that should go into that variable in $then and $else.
Replace "Or" and "And" in your condition:
$cond =~ s/\sOr\s/ || /g;
$cond =~ s/\sAnd\s/ && /g;
and print your desired output
print "$set($cond ? $then : $else);
Those regular expressions work with your string: I got
a = (((Myvalue.xyz == 1) || (Frmae_1.signal_1 == 1)) ? 1 : 0)
but they can fail if your actual strings have "Then" written as "then", and will break if "Myvalue.xyz" in some string is "Myvalue.And" or something like that. This also will not work if spaces are missing around = or Or. The code can easily be modified to handle such inputs, though. Be careful with regular expressions; they are powerful.
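For reference, here is the regex approach assembled into a runnable script (my own wrapper around the snippets above). It assumes the whole statement arrives as a single line; the multi-line form shown in the question would need the pattern adjusted (for example with the /s modifier and a more permissive separator before "else").
#!/usr/bin/perl
# Sketch only: the regex-based translation, assuming a single-line statement.
use strict;
use warnings;
my $str = 'If ((Myvalue.xyz == 1) Or (Frmae_1.signal_1 == 1)) Then a = 1 else a = 0;';
my ($cond, $set, $then, $else) =
    $str =~ /^If (.*) Then (.*?=\s+)(.*) else \2(.*);$/
    or die "Input did not match the expected If/Then/else shape\n";
# Translate the word operators into their C equivalents.
$cond =~ s/\sOr\s/ || /g;
$cond =~ s/\sAnd\s/ && /g;
print "$set($cond ? $then : $else)\n";
# a = (((Myvalue.xyz == 1) || (Frmae_1.signal_1 == 1)) ? 1 : 0)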