I have implemented a parser using Marpa::R2. The code looks like the listing below.
I have a large number of test cases in a .t file, which I run to test my parser. If an exception arises for any input expression, testing shouldn't stop midway: it should report a proper error message for the failing case (using exception handling), and the rest of the test cases should still run.
I want to add exception handling to this parser. If any exception arises, even while tokenizing the input expression, I want to show the user an appropriate message with the position, the string, or any other details that show where the error occurred. Please help.
use strict;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new({
default_action => '[values]',
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:start ::= expression
expression ::= expression OP expression
expression ::= expression COMMA expression
expression ::= func LPAREN PARAM RPAREN
expression ::= PARAM
PARAM ::= STRING | REGEX_STRING
REGEX_STRING ::= '"' QUOTED_STRING '"'
:discard ~ sp
sp ~ [\s]+
COMMA ~ [,]
STRING ~ [^ \/\(\),&:\"~]+
QUOTED_STRING ~ [^ ,&:\"~]+
OP ~ ' - ' | '&'
LPAREN ~ '('
RPAREN ~ ')'
func ~ 'func'
END_OF_SOURCE
});
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
my $input = "func(\"foo\")";   # declared before its first use
print "Trying to parse:\n$input\n\n";
$recce->read(\$input);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
I want to do proper error handling like this: http://blogs.perl.org/users/jeffrey_kegler/2012/10/a-marpa-dsl-tutorial-error-reporting-made-easy.html
I don't know how to put all of this in place.
Wrap the lines that can fail in an exception handler:
use Try::Tiny;
⋮
try {
$recce->read(\$input);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
} catch {
warn $_;
};
The full error message from Marpa will be in $_; it is a single long string with embedded newlines. I chose to print it to STDERR with warn, and the program continues to run. As the example error message below shows, it contains the position where the parsing failed:
Error in SLIF parse: No lexeme found at line 1, column 5
* String before error: "fo\s
* The error was at line 1, column 5, and at character 0x006f 'o', ...
* here: o"
Marpa::R2 exception at so49932329.pl line 41.
If you need to, you could reformat it so it looks better to the user.
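One way to do that reformatting, as a minimal sketch: the helper name summarize_marpa_error is made up for this example, and the pattern is based only on the sample message above, so other Marpa messages may need additional cases. It pulls the reason, line, and column out of the exception string and falls back to the first line when the pattern does not match.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Marpa::R2): condense an exception string
# into one line. The pattern is derived only from the sample message above.
sub summarize_marpa_error {
    my ($message) = @_;
    if ($message =~ /^Error in SLIF parse: (.+?) at line (\d+), column (\d+)/m) {
        return "Parse failed at line $2, column $3: $1";
    }
    # Unknown message shape: fall back to its first line
    my ($first_line) = split /\n/, $message;
    return 'Parse failed: ' . ($first_line // 'unknown error');
}

# The sample message from above, used as test input
my $sample = <<'END_OF_MESSAGE';
Error in SLIF parse: No lexeme found at line 1, column 5
* String before error: "fo\s
* The error was at line 1, column 5, and at character 0x006f 'o', ...
* here: o"
Marpa::R2 exception at so49932329.pl line 41.
END_OF_MESSAGE

print summarize_marpa_error($sample), "\n";
```

Inside the catch block, warn summarize_marpa_error($_) . "\n"; would then take the place of warn $_;.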
How to parse a single-quoted string using Marpa::R2?
In my code below, single-quoted strings come back from the parse with '\' added.
Code:
use strict;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{ default_action => '[values]',
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:start ::= Expression
# include begin
Expression ::= Param
Param ::= Unquoted
| ('"') Quoted ('"')
| (') Quoted (')
:discard ~ whitespace
whitespace ~ [\s]+
Unquoted ~ [^\s\/\(\),&:\"~]+
Quoted ~ [^\s&:\"~]+
END_OF_SOURCE
});
my $input1 = 'foo';
#my $input2 = '"foo"';
#my $input3 = '\'foo\'';
my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });
print "Trying to parse:\n$input1\n\n";
$recce->read(\$input1);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
Outputs:
Trying to parse:
foo
Output:
$VAR1 = [
[
'foo'
]
];
Trying to parse:
"foo"
Output:
$VAR1 = [
[
'foo'
]
];
Trying to parse:
'foo'
Output:
$VAR1 = [
[
'\'foo\''
]
]; (don't want it to be parsed like this)
Above are the outputs for all three inputs. I don't want the third one to come back with the '\' and the single quotes added; I want it to be parsed like the second output. Please advise.
Ideally, it should just pick up the content between the single quotes, according to Param ::= (') Quoted (')
The other answer regarding Data::Dumper output is correct. However, your grammar does not work the way you expect it to.
When you parse the input 'foo', Marpa will consider the three Param alternatives. The predicted lexemes at that position are:
Unquoted ~ [^\s\/\(\),&:\"~]+
'"'
') Quoted ('
Yes, the last is literally ) Quoted (, not anything containing a single quote.
Even if it were ([']) Quoted ([']), longest token matching means the Unquoted lexeme would match the entire input, including the single quotes.
What would happen for an input like " foo " (with double quotes)? Now, only the '"' lexeme would match, then any whitespace would be discarded, then the Quoted lexeme matches, then any whitespace is discarded, then closing " is matched.
To prevent this whitespace-skipping behaviour and to prevent the Unquoted rule from being preferred due to LATM, it makes sense to describe quoted strings as lexemes. For example:
Param ::= Unquoted | Quoted
Unquoted ~ [^'"]+
Quoted ~ DQ | SQ
DQ ~ '"' DQ_Body '"'
DQ_Body ~ [^"]*
SQ ~ ['] SQ_Body [']
SQ_Body ~ [^']*
These lexemes will then include any quotes and escapes, so you need to post-process the lexeme contents. You can either do this using the event system (which is conceptually clean, but a bit cumbersome to implement), or adding an action that performs this processing during parse evaluation.
Since lexemes cannot have actions, it is usually best to add a proxy production:
Param ::= Unquoted | Quoted
Unquoted ~ [^'"]+
Quoted ::= Quoted_Lexeme action => process_quoted
Quoted_Lexeme ~ DQ | SQ
DQ ~ '"' DQ_Body '"'
DQ_Body ~ [^"]*
SQ ~ ['] SQ_Body [']
SQ_Body ~ [^']*
The action could then do something like:
sub process_quoted {
my (undef, $s) = @_;
# remove delimiters from double-quoted string
return $1 if $s =~ /^"(.*)"$/s;
# remove delimiters from single-quoted string
return $1 if $s =~ /^'(.*)'$/s;
die "String was not delimited with single or double quotes";
}
Your result doesn't contain \', it contains '. Dumper merely formats the result like that so it's clear what's inside the string and what isn't.
You can test this behavior for yourself:
use Data::Dumper;
my $tick = chr(39);
my $back = chr(92);
print "Tick Dumper: " . Dumper($tick);
print "Tick Print: " . $tick . "\n";
print "Backslash Dumper: " . Dumper($back);
print "Backslash Print: " . $back . "\n";
You can see a demo here: https://ideone.com/d1V8OE
If you don't want the output to contain single quotes, you'll probably need to remove them from the input yourself.
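Another way to convince yourself, as a small sketch using only Data::Dumper's documented Useqq and Terse settings: check the string directly, then ask Dumper for double-quoted output, where the single quotes need no escaping. The variable $parsed stands in for the value from the third output above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for the parsed value: quote, f, o, o, quote
my $parsed = "'foo'";

# Five characters, none of them a backslash
print 'Length: ', length($parsed), "\n";
print 'Contains a backslash: ', ($parsed =~ /\\/ ? 'yes' : 'no'), "\n";

# Useqq switches Dumper to double-quoted output, so the single quotes
# are shown unescaped; Terse drops the "$VAR1 = " prefix.
my $dumped = do {
    local $Data::Dumper::Useqq = 1;
    local $Data::Dumper::Terse = 1;
    Dumper($parsed);
};
print "Dumper with Useqq: $dumped";
```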
I am not so familiar with Marpa::R2, but could you try using an action on the Expression rule:
Expression ::= Param action => strip_quotes
Then, implement a simple quote stripper like:
sub MyActions::strip_quotes {
# $_[1] is the array ref of child values; strip quotes from its first element
$_[1][0] =~ s/^'|'$//gr;
}
I've had good success parsing complicated and silly old text formats with Marpa before and I'm trying to do it again.
This particular format has hundreds and hundreds of different kinds of 'Begin' and 'End' blocks that look like this:
Begin BlahBlah
asdf qwer 123
987 xxxx
End BlahBlah
Begin FooFoo
Begin BarBar
some stuff (1,2,3)
End BarBar
whatever x
End FooFoo
How do I make a single rule that will match all of BlahBlah, BarBar, and FooFoo in the stuff above? I don't see in any examples how to dynamically capture the token and re-use it to terminate the rule, at least not with the standard scanless grammar examples. I don't want to enumerate all the different kinds of blocks because new kinds will break things, and I don't think it should be necessary.
The contents of the Begin/End blocks are immaterial to the question. In reality that stuff is a complicated mess, but nothing I don't know how to slog through. I'm hand-waving over other complicating details that make Marpa a good tool for this, such that I don't want to resort to regex.
At a bare minimum all I'm trying to achieve is a key-value map of the block type (i.e. "BlahBlah") to its contents as a string.
This doesn't exactly answer my original question because I ultimately arrived at simply ignoring the repeated string following the "End" token. I will probably follow the comment suggestion above of simply checking that the begin/end names match in a post-processing step. Operating under the assumption that the token is redundant, this seems to work OK, as a rough first cut. Critique welcome:
#!/usr/bin/perl
use warnings;
use strict;
use v5.18;
use utf8;
use feature 'unicode_strings';
use autodie;
use Marpa::R2;
use Data::Dumper;
my $g = Marpa::R2::Scanless::G->new({
source => \(<<'END_OF_SOURCE'),
lexeme default = latm => 1
:default ::= action => ::array
:start ::= beginend_blocks
:discard ~ <ws>
beginend_blocks ::= beginend_block+
beginend_block ::= beginend_block_header beginend_block_contents
beginend_block_header ::= ('Begin') beginend_block_name action => ::first
beginend_block_name ::= <word>
beginend_block_contents ::= beginend_block_content_elems (beginend_block_terminator) (<word>)
beginend_block_content_elems ::= beginend_block_content_elem+
beginend_block_content_elem ::= word action => ::first
| beginend_block action => ::first
beginend_block_terminator ::= ('End')
<word> ~ <wordchar>+
<wordchar> ~ [\S]
<ws> ~ [\s]+
END_OF_SOURCE
});
my $test_str = <<THEDATA;
Begin BlahBlah
asdf qwer 123
987 xxxx
End BlahBlah
Begin FooFoo
something else
Begin BazBaz
some stuff (1,2,3)
End BazBaz
whatever x
Begin BarBar
some stuff (1,2,3)
End BarBar
whatever y
End FooFoo
THEDATA
MAIN: {
my $re = Marpa::R2::Scanless::R->new({ grammar => $g, trace_terminals => 0 });
for (my $pos = $re->read(\$test_str); $pos < length $test_str; $pos = $re->resume) {
my ($pause_start, undef) = $re->pause_span;
}
say Dumper $re->value;
}
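The name check deferred to post-processing could also be done as a separate pass over the raw text. A minimal sketch, assuming each Begin/End marker sits on its own line as in the sample data; check_block_names is a hypothetical helper, not part of Marpa::R2, and the two input strings are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problem descriptions,
# or an empty list when every End name matches its Begin.
sub check_block_names {
    my ($text) = @_;
    my (@open, @problems);
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        if ($line =~ /^\s*Begin\s+(\S+)\s*$/) {
            push @open, $1;              # remember the innermost open block
        }
        elsif ($line =~ /^\s*End\s+(\S+)\s*$/) {
            my $found    = $1;
            my $expected = pop @open;    # innermost block still open, if any
            if (!defined $expected) {
                push @problems, "line $line_no: 'End $found' without a matching Begin";
            }
            elsif ($expected ne $found) {
                push @problems, "line $line_no: expected 'End $expected', found 'End $found'";
            }
        }
    }
    push @problems, "unterminated block '$_'" for reverse @open;
    return @problems;
}

my $good = "Begin FooFoo\nBegin BarBar\nsome stuff (1,2,3)\nEnd BarBar\nwhatever x\nEnd FooFoo\n";
my $bad  = "Begin FooFoo\nBegin BarBar\nEnd FooFoo\n";

my @good_problems = check_block_names($good);
my @bad_problems  = check_block_names($bad);
print 'good input: ', scalar(@good_problems), " problem(s)\n";
print "bad input: $_\n" for @bad_problems;
```

Running this before the Marpa parse keeps the grammar itself free of the name-matching concern.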
I'm implementing a new DSL in Marpa and (coming from Regexp::Grammars) I'm more than satisfied. My language supports a bunch of unary and binary operators, objects with C-style identifiers and method calls using the familiar dot notation. For example:
foo.has(bar == 42 AND baz == 23)
I found the prioritized rules feature offered by Marpa's grammar description language and have come to rely on that a lot, so I have nearly only one G1 rule Expression. Excerpt (many alternatives, and semantic actions omitted for brevity):
Expression ::=
NumLiteral
| '(' Expression ')' assoc => group
|| Expression ('.') Identifier
|| Expression ('.') Identifier Args
| Expression ('==') Expression
|| Expression ('AND') Expression
Args ::= ('(') ArgsList (')')
ArgsList ::= Expression+ separator => [,]
Identifier ~ IdentifierHeadChar IdentifierBody
IdentifierBody ~ IdentifierBodyChar*
IdentifierHeadChar ~ [a-zA-Z_]
IdentifierBodyChar ~ [a-zA-Z0-9_]
NumLiteral ~ [0-9]+
As you can see, I'm using the Scanless interface (SLIF). My problem is that this also parses, for example:
foo.AND(5)
Marpa knows that there can only be an identifier after a dot, so it doesn't even consider the fact that AND might be a keyword. I know that I can avoid that problem by doing a separate lexing stage that identifies AND as a keyword explicitly, but that tiny papercut is not quite worth the effort.
Is there a way in SLIF to restrict the Identifier rule to non-keyword identifiers only?
I don't know how to express such a thing in the grammar. You can introduce an intermediate non-terminal for Identifier which would check the condition, though:
#!/usr/bin/perl
use warnings;
use strict;
use Syntax::Construct qw{ // };
use Marpa::R2;
my %reserved = map { $_ => 1 } qw( AND );
my $grammar = 'Marpa::R2::Scanless::G'->new(
{ bless_package => 'main',
source => \( << '__GRAMMAR__'),
:default ::= action => store
:start ::= S
S ::= Id
| Id NumLiteral
Id ::= Identifier action => allowed
Identifier ~ IdentifierHeadChar IdentifierBody
IdentifierBody ~ IdentifierBodyChar*
IdentifierHeadChar ~ [a-zA-Z_]
IdentifierBodyChar ~ [a-zA-Z0-9_]
NumLiteral ~ [0-9]+
:discard ~ whitespace
whitespace ~ [\s]+
__GRAMMAR__
});
for my $value ('ABC', 'ABC 42', 'AND 1') {
my $value = $grammar->parse(\$value, 'main');
print $$value, "\n";
}
sub store {
my (undef, $id, $arg) = @_;
$arg //= 'null';
return "$id $arg";
}
sub allowed {
my (undef, $id) = @_;
die "Reserved keyword $id" if $reserved{$id};
return $id
}
You can use lexeme priorities, which are intended for just this kind of thing; there is an example in the Marpa::R2 test suite.
Basically, you declare an <AND keyword> ~ 'AND' lexeme and give it priority 1 so that it's preferred over Identifier. That should do the trick.
P.S. I modified the above script slightly to give an example (code, output).
I am trying to check whether the message after "5:16:51:209|INFO| " starts with "Marker". If it does, I need to add the string "|ICD" after the timestamp.
The input is: " 05:16:51:209|INFO|Markerprocedure Magnet "
I tried this regex, but it's not working. Please help me get it correct.
if ( $lines[$i] =~ m/(\d{2}:\d{2}:\d{2}:\d{3})|(\w+)|^Marker/)
{
$lines[$i] =~ s/(\d{2}:\d{2}:\d{2}:\d{3})(.*)/$1|ICD$2/ ;
}
I am trying to check if message after "5:16:51:209|INFO| " starts with "Marker"
What it seems to me you're trying to check is whether Marker immediately follows 5:16:51:209|INFO| so it isn't correct to use the ^ regex character because that checks to see whether the start of the string occurs in that position (which, of course, it doesn't). So remove the ^ character and Perl will check whether Marker immediately follows.
Also, you need to escape the | characters like this: \| to prevent it being treated as an alternation command in the regex. Then you can do the test and replace in a single substitution command:
if ( $lines[$i] =~ s/(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/ )
{
# Line contained "Marker" and "|ICD" inserted
}
Example:
$ echo '15:16:51:209|INFO|Marker blah' | perl -ple 's/(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/'
Output is:
15:16:51:209|ICD|INFO|Marker blah
Edit: @Prix has pointed out in the comments that if the timestamp is meant to appear at the start of the string, then the ^ start-marker should be at the start of the regex to prevent accidental matches in other parts of the string (and for performance):
s/^(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/
  ↑
Use ^ here to anchor the search to the beginning of the string.
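To see the anchored substitution applied to a whole array, here is a small sketch. Only the first sample line comes from the question (without its leading space); the other two lines are invented for illustration.

```perl
use strict;
use warnings;

# First line is from the question; the other two are made-up samples
my @lines = (
    '05:16:51:209|INFO|Markerprocedure Magnet',
    '05:16:52:001|INFO|Other message',
    '05:16:53:777|WARN|Marker end',
);

my $changed = 0;
for my $line (@lines) {
    # $line aliases the array element, so the edit happens in place;
    # s/// is true only when a substitution was made
    $changed++ if $line =~ s/^(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/;
}

print "$_\n" for @lines;
print "Lines changed: $changed\n";
```

If the real input can start with whitespace, as the quoted input in the question does, the anchor would need to allow for it, for example with ^\s* before the timestamp group.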
I'm trying to get what seems like a very basic Marpa grammar working. The code I use is below:
use strict;
use warnings;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new(
{
source => \(<<'END_OF_SOURCE'),
:start ::= ExprSingle
ExprSingle ~ Expr AndExpr
Expr ~ word
AndExpr ~ word*
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
END_OF_SOURCE
}
);
my $reader = Marpa::R2::Scanless::R->new(
{
grammar => $grammar,
}
);
my $input = 'foo';
$reader->read(\$input);
my $value = $reader->value;
print Dumper $value;
This prints $VAR1 = \'foo';. So it recognizes one word just fine. But I want it to recognize a string of words:
my $input = 'foo bar';
Now the script prints:
Error in SLIF G1 read: Parse exhausted, but lexemes remain, at position 4
I think this is because ExprSingle uses the ~ (match) operator, which makes it part of the tokenizing level, G0, instead of the structural level G1; the :discard rule allows space between G1 rules, not G0 ones. So I change the grammar like so:
ExprSingle ::= Expr AndExpr
Now no warning is printed, but the resulting value is undef instead of something containing 'foo' and 'bar'. I'm honestly not sure what that means, since, before, the failed parse threw an actual error.
I tried changing the grammar to separate what I think are G0 and G1 rules further, but still no luck:
:start ::= ExprSingle
ExprSingle ::= Expr AndExpr
Expr ::= token
AndExpr ::= token*
token ~ word
word ~ [\w]+
:discard ~ ws
ws ~ [\s]+
The final value is still undef. trace_terminals shows both 'foo' and 'bar' being accepted as tokens. What do I need to do to fix this grammar (by which I mean get a value containing the strings 'foo' and 'bar' instead of just undef)?
Rules by default return a value of undef, so in your case a return of \undef from $reader->value() means your parse succeeded. That is, a return of undef means failure, while a return of \undef means success where the parse evaluated to undef.
A good, fast way to start with a more helpful semantics is to add the following line:
:default ::= action => ::array
This causes the parse to generate an AST.