Parser Combinators - Ordered Choice and Left-Recursion - scala

What is the significance of ordered choice? Does it simply mean that you put the longest pattern match first?
Let's say you had this expression:
val expr = "eat" ~ "more" ~ "beans" |
"eat" ~ "more" ~ "beans" ~ "and" ~ "fruit"
Since parser combinators use Ordered Choice, the string eat more beans and soup ... would result in matching on the first line? The val expr uses Ordered Choice poorly since it includes a less-specific expression first?
Also, what is left recursion?

Scala parser combinators implement parsing expression grammars (PEGs). A PEG assumes unlimited lookahead and backtracking, which makes grammars easier to express because the parser never has to commit irrevocably to a decision at any single point in the parse.
Ordered choice (alternation) is the primary enabler of this behavior: the alternatives are tried in sequence, and the first one that matches the input is accepted. In your example the second alternative will never be matched, because any input that matches it also begins with "eat more beans" and is therefore accepted by the first alternative.
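To make the effect of ordering concrete, here is a minimal sketch in plain Scala. It is a toy model of ordered choice, not the parser combinator library itself: the names OrderedChoice, choose, shorter and longer are made up for illustration, and an "alternative" is simply a list of words that must appear at the start of the input.

```scala
object OrderedChoice {
  // The two alternatives from the question, as word sequences.
  val shorter: List[String] = List("eat", "more", "beans")
  val longer: List[String]  = List("eat", "more", "beans", "and", "fruit")

  // Try the alternatives in the given order and return the first one
  // that is a prefix of the input; later alternatives are never consulted.
  def choose(alternatives: List[List[String]], input: List[String]): Option[List[String]] =
    alternatives.find(alt => input.startsWith(alt))
}
```

For the input "eat more beans and fruit", choose(List(shorter, longer), input) returns the shorter alternative, while choose(List(longer, shorter), input) returns the longer one. That is why the more specific alternative should be listed first.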
Left recursion occurs when, given a production of the form a = b, some expansion of b begins with a again. Consider:
def a = b ~ c
def b = a ~ c
Expansion (matching) of the production a proceeds as follows:
b ~ c
(a ~ c) ~ c // substituting b
((b ~ c) ~ c) ~ c // substituting a
(((a ~ c) ~ c) ~ c) ~ c // substituting b
This is effectively infinite, unterminated recursion.

Rearrange Boolean expression by replacing brackets

I was checking online whether there is an equivalent expression to (1) (A AND B) OR C that removes the brackets by rearranging the operands and Boolean operators.
For example, (2) (A OR B) AND C = A AND C OR B AND C.
If I apply the same logic to the first expression, the result, A OR C AND B OR C, does not seem logically equivalent. My main aim is to remove the brackets from the expression.
What do you mean by "remove the brackets"?
When you write
(A OR B) AND C = A AND C OR B AND C
what does "A AND C OR B AND C" mean?
Does it mean
A AND (C OR B) AND C
or
(A AND C) OR (B AND C)
Probably the latter, but it is only because you see implicit brackets around the AND expressions.
One generally considers that "AND" has a higher priority (precedence, in programming-language terms) than "OR", just as "×" has a higher priority than "+", so that
a×b+c
is generally not ambiguous in terms of interpretation.
It is true that "×" distributes over a sum, while "+" does not distribute over a product.
Boolean algebra has no such asymmetry: "AND" and "OR" have dual properties,
and each distributes over the other.
So if your question is
"can we distribute an "OR" over an "AND" expression?",
the answer is yes.
(A AND B) OR C = (A OR C) AND (B OR C)
(just consider what happens when C=0 and when C=1 to verify that it is true)
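The check suggested above can also be done exhaustively, since there are only eight assignments of A, B and C. The following is a small sketch in Scala; the object and value names are made up for illustration. It also tests the form from the question, A OR C AND B OR C, read with AND binding tighter than OR.

```scala
object Distributivity {
  private val bools = List(false, true)

  // All eight (A, B, C) truth assignments.
  val assignments: List[(Boolean, Boolean, Boolean)] =
    for { a <- bools; b <- bools; c <- bools } yield (a, b, c)

  // (A AND B) OR C = (A OR C) AND (B OR C)
  val orOverAnd: Boolean = assignments.forall { case (a, b, c) =>
    ((a && b) || c) == ((a || c) && (b || c))
  }

  // (A OR B) AND C = (A AND C) OR (B AND C)
  val andOverOr: Boolean = assignments.forall { case (a, b, c) =>
    ((a || b) && c) == ((a && c) || (b && c))
  }

  // The question's attempt without brackets: A OR (C AND B) OR C
  val askersAttempt: Boolean = assignments.forall { case (a, b, c) =>
    ((a && b) || c) == (a || (c && b) || c)
  }
}
```

Both distributive laws hold for every assignment, while the bracket-free attempt fails, for example with A true and B, C false.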
If your question is
"can such an expression be written without parentheses, under the rule that AND expressions are implicitly surrounded by parentheses?",
the answer is also yes. Just write
A AND B OR C
With the precedence rule, it is interpreted as
(A AND B) OR C
You can also write
A.B+C
to make clearer that "AND" is generally considered as equivalent to "×" and "OR" equivalent to "+" in terms of precedence.
And in programming languages like C, when one writes
A & B | C
it is clearly interpreted as
(A & B) | C
and the same is true for
A && B || C
So with the "implicit parentheses around AND expressions" rule, you can write
A AND B OR C = (A OR C) AND (B OR C)
(A OR B) AND C = A AND C OR B AND C
Note that in either case, you need parentheses on one side of the equality.

Parsing a list of 0 or more idents followed by ident

I want to parse a part of my DSL formed like this:
configSignal: sticky Config
Semantically this is:
argument_name: 0_or_more_modifiers argument_type
I tried implementing the following parser:
def parser = ident ~ ":" ~ rep(ident) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
Thing is, the rep(ident) part is applied until there are no more tokens and the parser fails, because the last ~ ident doesn't match. How should I do this properly?
Edit
In the meantime I realized that the modifiers will be reserved words (keywords), so now I have:
def parser = ident ~ ":" ~ rep(modifier) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
def modifier = "sticky" | "control" | "count"
Nevertheless, I'm curious whether it would be possible to write such a parser if the modifiers weren't defined up front.
"0 or more idents followed by ident" is equivalent to "1 or more idents", so just use rep1.
Its docs:
def rep1[T](p: ⇒ Parser[T]): Parser[List[T]]
A parser generator for non-empty repetitions.
rep1(p) repeatedly uses p to parse the input until p fails -- p must succeed at least once (the result is a List of the consecutive results of p)
p a Parser that is to be applied successively to the input
returns A parser that returns a list of results produced by repeatedly applying p to the input (and that only succeeds if p matches at least once).
edit in response to OP's comment:
I don't think there's a built-in way to do what you described, but it would still be relatively easy to map to your custom data types by using regular List methods:
def parser = ident ~ ":" ~ rep1(ident) ^^ {
case name ~ ":" ~ idents => Arg(name, idents.last, idents.dropRight(1))
}
In this particular case, you wouldn't have to worry about idents being Nil, since the rep1 parser only succeeds with a non-empty list.
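The list handling on its own can be sketched without the parser library. In the snippet below, Arg is a stand-in for the asker's type (its field order follows the Arg(name, returnType, modifiers) call in the question), and toArg is a hypothetical helper that assumes a non-empty list, as rep1 would deliver. It uses init, which is equivalent to dropRight(1).

```scala
object SplitIdents {
  // Stand-in for the asker's Arg type; the real definition is not shown in the question.
  final case class Arg(name: String, returnType: String, modifiers: List[String])

  // Assumes idents is non-empty (rep1 only succeeds with at least one element).
  // The last ident is the type; everything before it is a modifier.
  def toArg(name: String, idents: List[String]): Arg =
    Arg(name, idents.last, idents.init)
}
```

For "configSignal: sticky Config" the idents are List("sticky", "Config"), giving Arg("configSignal", "Config", List("sticky")); with no modifiers the modifier list is simply empty.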

Meaning of exclamation mark in zipAll(s).takeWhile(!_._2.isEmpty)

What is the exclamation mark doing in (!_._2.isEmpty)?
As in:
def startsWith[A](s: Stream[A]): Boolean =
zipAll(s).takeWhile(!_._2.isEmpty) forAll {
case (h,h2) => h == h2
}
taken from Stream.
Is it just negation?
If yes, why is no space required between ! and _?
Is !_ not interpreted as a method name?
Can method names contain or start with !?
It is just negation. Expanding the definition by replacing the _ with a more verbose name might make this more obvious.
def startsWith[A](s: Stream[A]): Boolean =
zipAll(s).takeWhile(!_._2.isEmpty) forAll {
case (h,h2) => h == h2
}
can be rewritten as
def startsWith[A](s: Stream[A]): Boolean =
zipAll(s).takeWhile( element => !element._2.isEmpty) forAll {
case (h,h2) => h == h2
}
._2 is just the second item in a tuple. Here each element of the zipped stream is a pair (its components are later referred to as h and h2), so you could also rewrite this by unpacking the element into a pair of values:
def startsWith[A](s: Stream[A]): Boolean =
zipAll(s).takeWhile{ element =>
val (h, h2) = element
!h2.isEmpty
} forAll {
case (h,h2) => h == h2
}
Is it just negation?
Yes
If yes, why is no space required between ! and _?
Because the grammar allows it
Is !_ not interpreted as a method name?
No, because . binds more tightly than !, so the expression is parsed as !(_._2.isEmpty).
Moreover, !_ is not even a valid method name (again, as specified in the grammar, see below).
Can method names contain or start with !?
Yes, but not freely. Here are the rules for identifier naming, straight from the language specification:
There are three ways to form an identifier. First, an identifier can start with a letter which can be followed by an arbitrary sequence of letters and digits. This may be followed by underscore ‘_’ characters and another string composed of either letters and digits or of operator characters. Second, an identifier can start with an operator character followed by an arbitrary sequence of operator characters. The preceding two forms are called plain identifiers. Finally, an identifier may also be formed by an arbitrary string between back-quotes (host systems may impose some restrictions on which strings are legal for identifiers). The identifier then is composed of all characters excluding the backquotes themselves.
As usual, a longest match rule applies.
(The Scala Language Specification, Version 2.9, Chapter 1.1)
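A short sketch may help tie these rules together. The names BangExamples, secondNonEmpty and Flag are made up for illustration. The first value uses ! as plain negation of a placeholder expression; the case class then shows two legal ways for ! to appear in a method name: the prefix form unary_! and an identifier made only of operator characters.

```scala
object BangExamples {
  // `!` negates the whole placeholder expression: pair => !pair._2.isEmpty
  val secondNonEmpty: ((Int, Option[Int])) => Boolean = !_._2.isEmpty

  final case class Flag(on: Boolean) {
    // Defines the prefix operator, so `!flag` calls this method.
    def unary_! : Flag = Flag(!on)

    // An identifier consisting solely of operator characters.
    def !!(other: Flag): Flag = Flag(on != other.on)
  }
}
```

With these definitions, !Flag(true) evaluates to Flag(false), and Flag(true) !! Flag(false) evaluates to Flag(true).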

Parser combinator grammar not yielding correct associativity

I am working on a simple expression parser. However, with the parser combinator declarations below I can't get my tests to pass, and a right-associative tree keeps popping up.
def EXPR:Parser[E] = FACTOR ~ rep(SUM|MINUS) ^^ {case a~b => (a /: b)((acc,f) => f(acc))}
def SUM:Parser[E => E] = "+" ~ EXPR ^^ {case "+" ~ b => Sum(_, b)}
def MINUS:Parser[E => E] = "-" ~ EXPR ^^ {case "-" ~ b => Diff(_, b)}
I've been debugging this for hours. I hope someone can help me figure out why it's not coming out right.
"5-4-3" would yield a tree that evaluates to 4 instead of the expected -2.
What is wrong with the grammar above?
I don't work with Scala but do work with F# parser combinators and also needed associativity with infix operators. While I am sure you can do 5-4 or 2+3, the problem comes in with a sequence of two or more such operators of the same precedence and operator, i.e. 5-4-2 or 2+3+5. The problem won't show up with addition as (2+3)+5 = 2+(3+5) but (5-4)-2 <> 5-(4-2) as you know.
See: Monadic Parser Combinators 4.3 Repetition with meaningful separators. Note: The separators are the operators such as "+" and "*" and not whitespace or commas.
See: Functional Parsers Look for the chainl and chainr parsers in section 7. More parser combinators.
For example, an arithmetical expressions, where the operators that
separate the subexpressions have to be part of the parse tree. For
this case we will develop the functions chainr and chainl. These
functions expect that the parser for the separators yields a function
(!);
The function f should operate on an element and a list of tuples, each
containing an operator and an element. For example, f(e0, [(⊕1, e1),
(⊕2, e2), (⊕3, e3)]) should return ((e0 ⊕1 e1) ⊕2 e2) ⊕3 e3. You may
recognize a version of foldl in this (albeit an uncurried one), where
a tuple (⊕, y) from the list and intermediate result x are combined
applying x ⊕ y.
You need a fold function in the semantic parser, i.e. the part that converts the tokens from the syntactic parser into the output of the parser. In your code I believe it is this part.
{case a~b => (a /: b)((acc,f) => f(acc))}
Sorry I can't do better as I don't use Scala.
"-" ~ EXPR ^^ {case "-" ~ b => Diff(_, b)}
for 5-4-3, it expands to
Diff(5, 4-3)
which is
Diff(5, Diff(4, 3))
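However, what you need is:
Diff(Diff(5, 4), 3)
// for 5 + 4 - 3 it should be
Diff(Sum(5, 4), 3)
You need to accumulate the operands from left to right (for example with a fold or an explicit stack) instead of recursing into EXPR on the right-hand side.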
It seems that using "+" ~ EXPR is what made the result incorrect; it should have been "+" ~ FACTOR instead (and likewise for "-").
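The left fold that the answers describe can be shown in isolation, without the parser library. The sketch below assumes the parser has already produced a first operand and a list of (operator, operand) pairs, which is what FACTOR ~ rep(...) yields once each operator is followed by FACTOR rather than EXPR. The names LeftAssoc, Num, chainLeft and eval are made up for illustration; Sum and Diff mirror the constructors in the question.

```scala
object LeftAssoc {
  sealed trait E
  final case class Num(n: Int) extends E
  final case class Sum(l: E, r: E) extends E
  final case class Diff(l: E, r: E) extends E

  // Fold the (operator, operand) pairs from the left, so the tree built so far
  // always becomes the LEFT child of the next node.
  def chainLeft(first: E, rest: List[(Char, E)]): E =
    rest.foldLeft(first) {
      case (acc, ('+', rhs)) => Sum(acc, rhs)
      case (acc, ('-', rhs)) => Diff(acc, rhs)
      case (_, (op, _))      => sys.error(s"unknown operator: $op")
    }

  // Evaluate a tree to check the associativity.
  def eval(e: E): Int = e match {
    case Num(n)     => n
    case Sum(l, r)  => eval(l) + eval(r)
    case Diff(l, r) => eval(l) - eval(r)
  }
}
```

For 5-4-3, chainLeft(Num(5), List(('-', Num(4)), ('-', Num(3)))) builds Diff(Diff(Num(5), Num(4)), Num(3)), which evaluates to -2 rather than 4.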

Datatype for terms over a signature in SML

I want to implement an arbitrary signature in SML. How can I define a datatype for terms over that signature? I need it to write functions that check whether the terms are well formed.
In my view, there are two major ways of representing an AST: either as a series of (possibly mutually recursive) datatypes, or as one big datatype. There are pros and cons to both.
If we define the following BNF (extracted from the SML definition and slightly simplified)
<exp> ::= <exp> andalso <exp>
| <exp> orelse <exp>
| raise <exp>
| <appexp>
<appexp> ::= <atexp> | <appexp> <atexp>
<atexp> ::= ( <exp>, ..., <exp> )
| [ <exp>, ..., <exp> ]
| ( <exp> ; ... ; <exp> )
| ( <exp> )
| ()
As stated, this is simplified and much of the atexp is left out.
1. A series of possibly mutually recursive datatypes
Here you would for example create a datatype for expressions, declarations, patterns, etc.
Basically you would create a datatype for each of the non-terminals in your BNF.
We would most likely create the following datatypes
datatype exp = Andalso of exp * exp
| Orelse of exp * exp
| Raise of exp
| App of exp * atexp
| Atexp of atexp
and atexp = Tuple of exp list
| List of exp list
| Seq of exp list
| Par of exp
| Unit
Notice that the <appexp> non-terminal has been folded into the exp datatype instead of getting a datatype of its own. That would just clutter up the AST for no reason. You have to remember that a BNF is often written in such a way that it also defines precedence and associativity (e.g., for arithmetic). In such cases you can often simplify the BNF by merging multiple non-terminals into one datatype.
The good thing about defining multiple datatypes is that you get some well-formedness of your AST for free. If, for example, we also had a non-terminal for declarations, we would know that the AST will never contain a declaration inside a list (as only expressions can be there). Because of this, most of your well-formedness checks are not necessary.
This is, however, not always a good thing. Often you need to do some checking on the AST anyway, for example type checking. In many cases the BNF is quite large, and thus the number of datatypes necessary to model the AST is also quite large. Keeping this in mind, you have to create one function for each of your datatypes, for every kind of modification you want to do on your AST. In many cases you only want to change a small part of your AST, but you will (most likely) still need to define a function for each datatype. Most of these functions will basically be the identity, and only in a few cases will you do the desired work.
If, for example, we want to count how many units there are in a given AST, we could define the following functions
fun countUnitexp (Andalso (e1, e2)) = countUnitexp e1 + countUnitexp e2
| countUnitexp (Orelse (e1, e2)) = countUnitexp e1 + countUnitexp e2
| countUnitexp (Raise e1) = countUnitexp e1
| countUnitexp (App (e1, atexp)) = countUnitexp e1 + countUnitatexp atexp
| countUnitexp (Atexp atexp) = countUnitatexp atexp
and countUnitatexp (Tuple exps) = sumUnit exps
| countUnitatexp (List exps) = sumUnit exps
| countUnitatexp (Seq exps) = sumUnit exps
| countUnitatexp (Par exp) = countUnitexp exp
| countUnitatexp Unit = 1
and sumUnit exps = foldl (fn (exp,b) => b + countUnitexp exp) 0 exps
As you may see we are doing a lot of work, just for this simple task. Imagine a bigger grammar and a more complicated task.
2. One (big) datatype (nodes) -- and a Tree of these nodes
Let's combine the datatypes from before, but change them so that they don't themselves contain their children, because in this approach we build a tree structure that has a node and some children of that node. Obviously, if you have an identifier node, it still needs to contain the actual string representation (e.g., the variable name).
So let's start by defining the nodes for the tree structure.
(* The comment is the kind of children and possibly specific number of children
that the BNF defines to be valid *)
datatype node = Exp_Andalso (* [exp, exp] *)
| Exp_Orelse (* [exp, exp] *)
| Exp_Raise (* [exp] *)
| Exp_App (* [exp, atexp] *)
(* Superfluous:| Exp_Atexp (* [atexp] *) *)
| Atexp_Tuple (* exp list *)
| Atexp_List (* exp list *)
| Atexp_Seq (* exp list *)
| Atexp_Par (* [exp] *)
| Atexp_Unit (* [] *)
See how the Atexp from the type now becomes superfluous, so we remove it. Personally I think it is nice to have the comment alongside, stating which children (in the tree structure) we can expect.
(* Note this is a non-empty tree. That is, you have to pack it in an option type
if you want to represent an empty tree *)
datatype 'a tree = T of 'a * 'a tree list
(* Define the ast as trees of our node datatype *)
type ast = node tree
We then define a generic tree and define the type ast to be a "tree of nodes".
If you use some library, there is a good chance that such a tree structure is already present. It might also be handy later on to extend this tree structure to contain more than just the node as data; however, we keep it simple here.
fun foldTree f b (T (n, [])) = f (n, b)
| foldTree f b (T (n, ts)) = foldl (fn (t, b') => foldTree f b' t)
(f (n, b)) ts
For this example we define a fold function over the tree; again, if you are using a library then all these functions for folding, mapping, etc. are most likely already defined.
fun countUnit (Atexp_Unit) = 1
| countUnit _ = 0
If we then take the example from before, where we want to count the number of occurrences of unit, we can just fold the above function over the tree.
val someAST = T(Atexp_Tuple, [ T (Atexp_Unit, [])
, T (Exp_Raise, [])
, T (Atexp_Unit, [])
]
)
A simple AST could look like the above (note that this is actually not valid, as we have an Exp_Raise with no children). And we could then do the counting by
foldTree (fn (a,b) => (countUnit a) + b) 0 someAST
The downside of this approach is that you have to write a check function that verifies that your AST is well formed, as there are no restrictions when you create the AST. This includes checking that the children are of the correct "type" (e.g., only Exp_* as children of an Exp_Andalso) and that there is the correct number of children (e.g., exactly two children in Exp_Andalso).
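As a rough illustration of such a check, here is a sketch written in Scala rather than SML, so it is a transliteration and not part of the answer's code. The node names mirror the node datatype above, Tree is a non-generic stand-in for the 'a tree type, and wellFormed verifies only the number of children given in the comments on each constructor; checking the kind of each child is left out.

```scala
object AstCheck {
  sealed trait Node
  case object ExpAndalso extends Node
  case object ExpOrelse  extends Node
  case object ExpRaise   extends Node
  case object ExpApp     extends Node
  case object AtexpTuple extends Node
  case object AtexpList  extends Node
  case object AtexpSeq   extends Node
  case object AtexpPar   extends Node
  case object AtexpUnit  extends Node

  // A node together with its children (never empty as a whole tree).
  final case class Tree(node: Node, children: List[Tree])

  // Arity check only: every node must have the number of children
  // that the grammar allows, and so must all of its descendants.
  def wellFormed(t: Tree): Boolean = {
    val arityOk = t.node match {
      case ExpAndalso | ExpOrelse | ExpApp   => t.children.length == 2
      case ExpRaise | AtexpPar               => t.children.length == 1
      case AtexpUnit                         => t.children.isEmpty
      case AtexpTuple | AtexpList | AtexpSeq => true
    }
    arityOk && t.children.forall(wellFormed)
  }
}
```

Applied to the equivalent of someAST, the check fails because the Exp_Raise node has no children; giving that node a single unit child makes the tree pass.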
This approach also requires a bit of up-front work to get started, given that you don't use some library that has a tree defined (including auxiliary functions for modifying the tree). However, in the long run it pays off.