Parser Alternative Operator

Parser Alternative Operator | Fails - scala

Using an extension of the Scala parser, I wish to parse two different types of information – patterns and filters. These patterns and filters may appear in any order.
With the format for patterns and filters defined in the variables pattern and filter, respectively, I want to unite them so that filters and patterns may be supplied in any order:
val patternRepetition: Parser[List[ParsedPattern]] = {
rep1sep(pattern, ".") <~ opt(".")
}
val patternOrFilter: Parser[List[ParsedPattern]] = {
patternRepetition | rep1(filterSpec)
}
val patternList: Parser[List[ParsedPattern]] = {
rep1(patternOrFilter) ^^ {
_.flatten
}
}
For reference, the code for pattern and filter are shown at the end of this post.
Some examples of what patternList is supposed to match:
?A ?C ?B .
?A ?C ?B .
FILTER(?A > ?B)
?A ?C ?B .
?A ?C ?B
FILTER(?A > ?B)
?A ?C ?B
FILTER(?A > ?B)
?A ?C ?B
As soon as a filter appears, the parser complains about an unexpected ( character. However, changing | to ~ in patternOrFilter will successfully parse a list of patterns followed by a filter (such as in the second and third example), so I believe there is an issue with my use of the alternative operator |.
Therefore, my question is: why does the | operator fail to recognize filters?
A pattern is currently defined as the following:
val pattern: Parser[ParsedPattern] = {
"?A" ~! "?C" ~! "?B" ^^ {
case s ~ p ~ o =>
ParsedPattern(s, p, o)
}
}
This obviously matches:
?A ?C ?B
A filter is defined with the following code:
val filter: Parser[ParsedPattern] = {
"FILTER(" ~> "?A" ~ ">" ~ "?B") <~ ")" ^^ {
case lhs ~ comp ~ rhs =>
ParsedPattern(lhs, comp, rhs)
}
}
This matches the following line:
FILTER(?A > ?B)

| parses ONE side only (patterns OR filter but not both).
Try a new rule like (change types, I used _):
def patternOrFilterAll: Parser[_] = {
rep1(patternOrFilter)
}
You may need to change patternRepetition and patternOrFilter to get the return value you want.
edit
Your Parser does not work because of the usage of ~!. If you use ~ instead the parser then your example will work (e.g. http://scalafiddle.net/console/97ea3cfb64eeaa1edba65501d0bb3c86 ).
The reason: The parser uses backtracking and ~! will disable that. The parser needs backtracking because the url regex may fail.

Related

How to make my parser support logic operations and word case-insensitive?

Recently, I am learning the Scala parser combinator. I would like to parse the key in a given string. For instance,
val expr1 = "local_province != $province_name$ or city=$city_name$ or people_number<>$some_digit$"
// ==> List("local_province", "city", "people_number")
val expr2 = "(local_province=$province_name$)"
// ==> List("local_province")
val expr3 = "(local_province=$province_name$ or city=$city_name$) and (lib_name=$akka$ or lib_author=$martin$)"
// ==> List("local_province", "city", "lib_name", "lib_author")
Trial
import scala.util.parsing.combinator.JavaTokenParsers
class KeyParser extends JavaTokenParsers {
lazy val key = """[a-zA-Z_]+""".r
lazy val value = "$" ~ key ~ "$"
lazy val logicOps = ">" | "<" | "=" | ">=" | "<=" | "!=" | "<>"
lazy val elem: Parser[String] = key <~ (logicOps ~ value)
lazy val expr: Parser[List[String]] =
"(" ~> repsep(elem, "and" | "or") <~ ")" | repsep(elem, "and" | "or")
lazy val multiExpr: Parser[List[String]] =
repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
}
object KeyParser extends KeyParser {
def parse(input: String) = parseAll(multiExpr, input)
}
Here is my test in Scala REPL
KeyParser.parse(expr1)
[1.72] failure: $' expected but >' found
KeyParser.parse(expr2)
[1.33] parsed: List(local_province)
KeyParser.parse(expr3)
[1.98] parsed: List(local_province, city, lib_name, lib_author)
I notice that the KeyParser only works for "=" and it doesn't support the case like "(local_province<>$province_name$ AND city!=$city_name$)" which contains "<> | !=" and "AND".
So I would like to know how to revise it.

I notice that the KeyParser only works for "="
This isn't quite true. It also works for !=, < and >. The ones it doesn't work for are >=, <= and <>.
More generally it does not work for those operators which have a prefix of them appear in the list of alternatives before them. That is >= is not matched because > appears before it and is a prefix of it.
So why does this happen? The | operator creates a parser that produces the result of the left parser if it succeeds or of the right parser otherwise. So if you have a chain of |s, you'll get the result of the first parser in that chain which can match the current input. So if the current input is <>$some_digit$, the parser logicOps will match < and leave you with >$some_digit$ as the remaining input. So now it tries to match value against that input and fails.
Why doesn't backtracking help here? Because the logicOps parser already succeeded, so there's nowhere to backtrack to. If the parser were structured like this:
lazy val logicOpAndValue = ">" ~ value | "<" ~ value | "=" ~ value |
">=" ~ value | "<=" ~ value | "!=" ~ value |
"<>" ~ value
lazy val elem: Parser[String] = key <~ logicOpAndValue
Then the following would happen (with the current input being <>$some_digit$):
">" does not match the current input, so go to next alternative
"<" does match the current input, so try the right operand of the ~ (i.e. value) with the current input >$some_digit$. This fails, so continue with the next alternative.
... bunch of alternatives that don't match ...
"<>" does match the current input, so try the right operand of the ~. This matches as well. Success!
However in your code the ~ value is outside of the list of alternatives, not inside each alternative. So when the parser fails, we're no longer inside any alternative, so there's no next alternative to try and it just fails.
Of course moving the ~ value inside the alternatives isn't really a satisfying solution as it's ugly as hell and not very maintainable in the general case.
One solution is simply to move the longer operators at the beginning of the alternatives (i.e. ">=" | "<=" | "<>" | ">" | "<" | ...). This way ">" and "<" will only be tried if ">=", "<=" and "<>" have already failed.
A still better solution, which does not rely on the order of alternatives and is thus less error-prone, is to use ||| instead of. ||| works like | except that it tries all of the alternatives and then returns the longest successful result - not the first.
PS: This isn't related to your problem but you're currently limiting the nesting depth of parentheses because your grammar is not recursive. To allow unlimited nesting, you'll want your expr and multiExpr rules to look like this:
lazy val expr: Parser[List[String]] =
"(" ~> multiExpr <~ ")" | elem
lazy val multiExpr: Parser[List[String]] =
repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
However I recommend renaming expr to something like primaryExpr and multiExpr to expr.
_.foldLeft(List.empty[String])(_ ++ _) can also be more succinctly expressed as _.flatten.

Parsing a list of 0 or more idents followed by ident

I want to parse a part of my DSL formed like this:
configSignal: sticky Config
Semantically this is:
argument_name: 0_or_more_modifiers argument_type
I tried implementing the following parser:
def parser = ident ~ ":" ~ rep(ident) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
Thing is, the rep(ident) part is applied until there are no more tokens and the parser fails, because the last ~ ident doesn't match. How should I do this properly?
Edit
In the meantime I realized, that the modifiers will be reserved words (keywords), so now I have:
def parser = ident ~ ":" ~ rep(modifier) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
def modifier = "sticky" | "control" | "count"
Nevertheless, I'm curious if it would be possible to write a parser if the modifiers weren't defined up front.

"0 or more idents followed by ident" is equivalent to "1 or more idents", so just use rep1
Its docs:
def rep1[T](p: ⇒ Parser[T]): Parser[List[T]]
A parser generator for non-empty repetitions.
rep1(p) repeatedly uses p to parse the input until p fails -- p must succeed at least once (the result is a List of the consecutive results of p)
p a Parser that is to be applied successively to the input
returns A parser that returns a list of results produced by repeatedly applying p to the input (and that only succeeds if p matches at least once).
edit in response to OP's comment:
I don't think there's a built-in way to do what you described, but it would still be relatively easy to map to your custom data types by using regular List methods:
def parser = ident ~ ":" ~ rep1(ident) ^^ {
case name ~ ":" ~ idents => Arg(name, idents.last, idents.dropRight(1))
}
In this particular case, you wouldn't have to worry about idents being Nil, since the rep1 parser only succeeds with a non-empty list.

Parser combinator grammar not yielding correct associativity

I am working on a simple expression parser, however given the following parser combinator declarations below, I can't seem to pass my tests and a right associative tree keeps on popping up.
def EXPR:Parser[E] = FACTOR ~ rep(SUM|MINUS) ^^ {case a~b => (a /: b)((acc,f) => f(acc))}
def SUM:Parser[E => E] = "+" ~ EXPR ^^ {case "+" ~ b => Sum(_, b)}
def MINUS:Parser[E => E] = "-" ~ EXPR ^^ {case "-" ~ b => Diff(_, b)}
I've been debugging hours for this. I hope someone can help me figure it out it's not coming out right.
"5-4-3" would yield a tree that evaluates to 4 instead of the expected -2.
What is wrong with the grammar above?

I don't work with Scala but do work with F# parser combinators and also needed associativity with infix operators. While I am sure you can do 5-4 or 2+3, the problem comes in with a sequence of two or more such operators of the same precedence and operator, i.e. 5-4-2 or 2+3+5. The problem won't show up with addition as (2+3)+5 = 2+(3+5) but (5-4)-2 <> 5-(4-2) as you know.
See: Monadic Parser Combinators 4.3 Repetition with meaningful separators. Note: The separators are the operators such as "+" and "*" and not whitespace or commas.
See: Functional Parsers Look for the chainl and chainr parsers in section 7. More parser combinators.
For example, an arithmetical expressions, where the operators that
separate the subexpressions have to be part of the parse tree. For
this case we will develop the functions chainr and chainl. These
functions expect that the parser for the separators yields a function
(!);
The function f should operate on an element and a list of tuples, each
containing an operator and an element. For example, f(e0; [(1; e1);
(2; e2); (3; e3)]) should return ((eo 1 e1) 2 e2) 3 e3. You may
recognize a version of foldl in this (albeit an uncurried one), where
a tuple (; y) from the list and intermediate result x are combined
applying x y.
You need a fold function in the semantic parser, i.e. the part that converts the tokens from the syntactic parser into the output of the parser. In your code I believe it is this part.
{case a~b => (a /: b)((acc,f) => f(acc))}
Sorry I can't do better as I don't use Scala.

"-" ~ EXPR ^^ {case "-" ~ b => Diff(_, b)}
for 5-4-3, it expands to
Diff(5, 4-3)
which is
Diff(5, Diff(4, 3))
however, what you need is:
Diff(Diff(5, 4), 3))
// for 5 + 4 - 3 it should be
Diff(Sum(5, 4), 3)
you need to involve stack.

It seems using "+" ~ EXPR made the answer incorrect. It should have been FACTOR instead.

Scala rep() and associativity

I want to express this grammar in scala StdTokenParsers:
expr -> expr ("+"|"-") ~ muldivexpr | muldivexpr
"+" and "-" is left associative.
The grammar is left recursive so it caused infinite recursions. I can rewrite to remove left recursion, but it will change to right associativity.
Now I am planning to use scala rep() to rewrite it as:
expr -> rep(muldivexpr ("+"|"-")) ~ muldivexpr
but will rep() that change the associativity? How does rep() work in this case?
I am asking this question because I have to output the AST in the future.

You are most likely looking for:
chainl1[T](p: => Parser[T], q: => Parser[(T, T) => T]): Parser[T]
The general idea is p is an operand, and q is a separator yielding a function which can combine two operands, e.g.
chainl1(muldivexpr,
"+" ^^^ { (l: Expr, r: Expr) => Addition(l, r) }
| "-" ^^^ { (l: Expr, r: Expr) => Subtraction(l, r) }
)

Scala combinator parsers - distinguish between number strings and variable strings

I'm doing Cay Horstmann's combinator parser exercises, I wonder about the best way to distinguish between strings that represent numbers and strings that represent variables in a match statement:
def factor: Parser[ExprTree] = (wholeNumber | "(" ~ expr ~ ")" | ident) ^^ {
case a: wholeNumber => Number(a.toInt)
case a: String => Variable(a)
}
The second line there, "case a: wholeNumber" is not legal. I thought about a regexp, but haven't found a way to get it to work with "case".

I would split it up a bit and push the case analysis into the |. This is one of the advantages of combinators and really LL(*) parsing in general:
def factor: Parser[ExprTree] = ( wholeNumber ^^ { Number(_.toInt) }
| "(" ~> expr <~ ")"
| ident ^^ { Variable(_) } )
I apologize if you're not familiar with the underscore syntax. Basically it just means "substitute the nth parameter to the enclosing function value". Thus { Variable(_) } is equivalent to { x => Variable(x) }.
Another bit of syntax magic here is the ~> and <~ operators in place of ~. These operators mean that the parsing of that term should include the syntax of both the parens, but the result should be solely determined by the result of expr. Thus, the "(" ~> expr <~ ")" matches exactly the same thing as "(" ~ expr ~ ")", but it doesn't require the extra case analysis to retrieve the inner result value from expr.