Why doesn't the parser combinator backtrack? (Scala)

Consider
import util.parsing.combinator._

object TreeParser extends JavaTokenParsers {
  lazy val expr: Parser[String] = decimalNumber | sum
  //> expr: => TreeParser.Parser[String]
  lazy val sum: Parser[String] = expr ~ "+" ~ expr ^^ { case a ~ plus ~ b => s"($a)+($b)" }
  //> sum: => TreeParser.Parser[String]
  println(parseAll(expr, "1 + 1")) //> TreeParser.ParseResult[String] = [1.3] failure: string matching regex
                                   //| `\z' expected but `+' found
}
The same story with fastparse
import fastparse.all._
val expr: P[Any] = P("1" | sum)
val sum: P[Any] = expr ~ "+" ~ expr
val top = expr ~ End
println(top.parse("1+1")) // Failure(End:1:2 ..."+1")
The parsers do discover that taking just the first literal is a dead end, but they do not fall back and try the sum production. Why?
I understand that the parser takes the first branch that can successfully eat up part of the input string and exits. Here, the "1" of expr matches the first input character and parsing completes. In order to grab more, we need to make sum the first alternative. However, the naive
lazy val expr: Parser[String] = sum | "1"
ends up with a stack overflow. The library authors therefore approach it from another side:
val sum: P[Any] = P( num ~ ("+".! ~/ num).rep )
val top: P[Any] = P( sum ~ End )
Here, we start sum with a terminal, which is fine, but this syntax is more verbose and, furthermore, it produces a terminal followed by a list. That is good for a reduction operator like sum, but it is difficult to map onto a series of binary operators.
What if your language defines an expression that admits a binary operator? You want to match every occurrence of expr op expr and map it to a corresponding tree node
expr ~ "op" ~ expr ^^ {case a ~ _ ~ b => BinOp(a,b)}
How do you do that? In short, I want a greedy parser that consumes the whole string. That is what I mean by 'greedy', as opposed to a greedy algorithm that jumps into the first wagon and ends up in a dead end.

As I have found here, we need to replace the | alternative operator with the lesser-known |||:
//lazy val expr: Parser[String] = decimalNumber | sum
lazy val backtrackGreedy: Parser[String] = decimalNumber ||| sum
lazy val sum: Parser[String] = decimalNumber ~ "+" ~ backtrackGreedy ^^ {case a ~ plus ~ b => s"($a)+($b)"}
println(parseAll(backtrackGreedy, "1 + 1")) // [1.6] parsed: (1)+(1)
The order of alternatives does not matter with this operator. To stop the stack overflow, we also need to eliminate the left recursion: sum = expr + expr becomes sum = number + expr.
Another answer says that we need to normalize the grammar; that is, instead of
def foo = "foo" | "fo"
def obar = "obar"
def foobar = foo ~ obar
we need to use
def workingFooBar = ("foo" ~ obar) | ("fo" ~ obar)
But the first solution is more appealing.

The parser does backtrack. Try val expr: P[String] = P(("1" | "1" ~ "+" ~ "1").!) and expr.parse("1+1") for example.
The problem is in your grammar. expr parses 1 and it is a successful parsing by your definition. Then sum fails and now you want to blame the dutiful expr for what happened?
There are plenty of examples on how to deal with binary operators. For example, the first example here: http://lihaoyi.github.io/fastparse/
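To make that concrete: the "terminal followed by a list" shape can still be mapped onto nested binary nodes with a fold. A minimal sketch using scala-parser-combinators; Num and BinOp are hypothetical AST case classes, not part of either library:
import scala.util.parsing.combinator._

object BinOpParser extends JavaTokenParsers {
  sealed trait Expr
  case class Num(value: String) extends Expr
  case class BinOp(op: String, left: Expr, right: Expr) extends Expr

  lazy val num: Parser[Expr] = decimalNumber ^^ (Num(_))
  // a terminal followed by a list of (operator, operand) pairs, folded into left-nested BinOp nodes
  lazy val sum: Parser[Expr] = num ~ rep(("+" | "-") ~ num) ^^ {
    case first ~ rest => rest.foldLeft(first) { case (acc, op ~ rhs) => BinOp(op, acc, rhs) }
  }
}
// BinOpParser.parseAll(BinOpParser.sum, "1 + 2 - 3").get
// => BinOp(-,BinOp(+,Num(1),Num(2)),Num(3))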

Related

Why can the expr parser only parse its first alternative?

I have a parser
package app

import scala.util.parsing.combinator._

class MyParser extends JavaTokenParsers {
  import MyParser._

  def expr =
    plus | sub | multi | divide | num

  def num = floatingPointNumber ^^ (x => Value(x.toDouble).e)

  def plus = num ~ rep("+" ~> num) ^^ {
    case num ~ nums => nums.foldLeft(num.e) {
      (x, y) => Operation("+", x, y)
    }
  }

  def sub = num ~ rep("-" ~> num) ^^ {
    case num ~ nums => nums.foldLeft(num.e) {
      (x, y) => Operation("-", x, y)
    }
  }

  def multi = num ~ rep("*" ~> num) ^^ {
    case num ~ nums => nums.foldLeft(num.e) {
      (x, y) => Operation("*", x, y)
    }
  }

  def divide = num ~ rep("/" ~> num) ^^ {
    case num ~ nums => nums.foldLeft(num.e) {
      (x, y) => Operation("/", x, y)
    }
  }
}

object MyParser {
  sealed trait Expr {
    def e = this.asInstanceOf[Expr]
    def compute: Double = this match {
      case Value(x) => x
      case Operation(op, left, right) => (op: @unchecked) match {
        case "+" => left.compute + right.compute
        case "-" => left.compute - right.compute
        case "*" => left.compute * right.compute
        case "/" => left.compute / right.compute
      }
    }
  }
  case class Value(x: Double) extends Expr
  case class Operation(op: String, left: Expr, right: Expr) extends Expr
}
and I use it to parse an expression:
package app

object Runner extends App {
  val p = new MyParser
  println(p.parseAll(p.expr, "1 * 11"))
}
it prints
[1.3] failure: end of input expected
1 * 11
^
but if I parse the expression 1 + 11, it succeeds:
[1.7] parsed: Operation(+,Value(1.0),Value(11.0))
I can parse input through the plus, multi, divide, num and sub combinators individually, but the expr combinator can only parse with the first item of the | combinator.
So why can expr only parse its first alternative? And how can I change the definitions of the parsers to make parsing succeed?
The problem is that you're using rep which matches zero or more times.
def rep[T](p: => Parser[T]): Parser[List[T]] = rep1(p) | success(List())
You need to use rep1 instead, which requires at least one match.
If you replace all rep with rep1, your code works.
Check out the changes on scastie
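For reference, a sketch of what the change looks like for the plus rule inside MyParser (the same rep-to-rep1 substitution applies to sub, multi and divide):
def plus = num ~ rep1("+" ~> num) ^^ {
  case num ~ nums => nums.foldLeft(num.e) {
    (x, y) => Operation("+", x, y)
  }
}
With rep1, plus now fails on "1 * 11" because no "+" follows the first number, so | backtracks and tries the remaining alternatives until multi succeeds.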
Run an experiment:
println(p.parseAll(p.expr, "1 + 11"))
println(p.parseAll(p.expr, "1 - 11"))
println(p.parseAll(p.expr, "1 * 11"))
println(p.parseAll(p.expr, "1 / 11"))
What will happen?
[1.7] parsed: Operation(+,Value(1.0),Value(11.0))
[1.3] failure: end of input expected
1 - 11
^
[1.3] failure: end of input expected
1 * 11
^
[1.3] failure: end of input expected
1 / 11
+ is consumed, but everything else fails. Let's change the expr definition:
def expr =
multi | plus | sub | divide | num
[1.3] failure: end of input expected
1 + 11
^
[1.3] failure: end of input expected
1 - 11
^
[1.7] parsed: Operation(*,Value(1.0),Value(11.0))
[1.3] failure: end of input expected
1 / 11
^
By moving multi to the beginning, the * case now passes, but + fails.
def expr =
num | multi | plus | sub | divide
[1.3] failure: end of input expected
1 + 11
^
[1.3] failure: end of input expected
1 - 11
^
[1.3] failure: end of input expected
1 * 11
^
[1.3] failure: end of input expected
1 / 11
With num as the first case everything fails. It is apparent now that this code
num | multi | plus | sub | divide
does NOT succeed whenever any one of its alternatives could parse the whole input; it commits to the first alternative that succeeds, even if that alternative leaves part of the input unconsumed.
What do the docs say about it?
/** A parser combinator for alternative composition.
 *
 *  `p | q` succeeds if `p` succeeds or `q` succeeds.
 *  Note that `q` is only tried if `p`s failure is non-fatal (i.e., back-tracking is allowed).
 *
 *  @param q a parser that will be executed if `p` (this parser) fails (and allows back-tracking)
 *  @return a `Parser` that returns the result of the first parser to succeed (out of `p` and `q`)
 *          The resulting parser succeeds if (and only if)
 *          - `p` succeeds, ''or''
 *          - if `p` fails allowing back-tracking and `q` succeeds.
 */
def | [U >: T](q: => Parser[U]): Parser[U] = append(q).named("|")
Important note: backtracking has to be allowed. If it isn't, then failure to match the first parser will result in the whole alternative failing, without the second parser being tried at all.
How do you make your parser backtrack? You would have to use PackratParsers, as this is the only parser in the library that supports backtracking, or rewrite your code so that it does not rely on backtracking in the first place.
Personally, I recommend not using Scala Parser Combinators and instead using a library where you explicitly decide when backtracking is still allowed and when it should not be, e.g. fastparse.
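For illustration only, a minimal sketch in the old fastparse.all style already used in this thread; num is just a hypothetical digit parser, and ~/ marks the cut after which backtracking is no longer allowed:
import fastparse.all._

val num: P[String] = P( CharIn('0' to '9').rep(1).! )
// once an operator has been parsed, the ~/ cut commits: rep may not backtrack out of this iteration
val sum: P[Any] = P( num ~ (("+" | "-").! ~/ num).rep )
val top: P[Any] = P( sum ~ End )

// top.parse("1+2-3")  // Success
// top.parse("1+")     // Failure at the missing right operand rather than a misleading End error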

How to make my parser support logic operators and case-insensitive keywords?

Recently, I have been learning Scala parser combinators. I would like to parse the keys in a given string. For instance,
val expr1 = "local_province != $province_name$ or city=$city_name$ or people_number<>$some_digit$"
// ==> List("local_province", "city", "people_number")
val expr2 = "(local_province=$province_name$)"
// ==> List("local_province")
val expr3 = "(local_province=$province_name$ or city=$city_name$) and (lib_name=$akka$ or lib_author=$martin$)"
// ==> List("local_province", "city", "lib_name", "lib_author")
Trial
import scala.util.parsing.combinator.JavaTokenParsers

class KeyParser extends JavaTokenParsers {
  lazy val key = """[a-zA-Z_]+""".r
  lazy val value = "$" ~ key ~ "$"
  lazy val logicOps = ">" | "<" | "=" | ">=" | "<=" | "!=" | "<>"
  lazy val elem: Parser[String] = key <~ (logicOps ~ value)
  lazy val expr: Parser[List[String]] =
    "(" ~> repsep(elem, "and" | "or") <~ ")" | repsep(elem, "and" | "or")
  lazy val multiExpr: Parser[List[String]] =
    repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
}

object KeyParser extends KeyParser {
  def parse(input: String) = parseAll(multiExpr, input)
}
Here is my test in Scala REPL
KeyParser.parse(expr1)
[1.72] failure: `$' expected but `>' found
KeyParser.parse(expr2)
[1.33] parsed: List(local_province)
KeyParser.parse(expr3)
[1.98] parsed: List(local_province, city, lib_name, lib_author)
I notice that the KeyParser only works for "=" and it doesn't support cases like "(local_province<>$province_name$ AND city!=$city_name$)", which contain "<>", "!=" and "AND".
So I would like to know how to revise it.
I notice that the KeyParser only works for "="
This isn't quite true. It also works for !=, < and >. The ones it doesn't work for are >=, <= and <>.
More generally, it does not work for any operator that has one of its own prefixes listed earlier among the alternatives. That is, >= is not matched because > appears before it and is a prefix of it.
So why does this happen? The | operator creates a parser that produces the result of the left parser if it succeeds or of the right parser otherwise. So if you have a chain of |s, you'll get the result of the first parser in that chain which can match the current input. So if the current input is <>$some_digit$, the parser logicOps will match < and leave you with >$some_digit$ as the remaining input. So now it tries to match value against that input and fails.
Why doesn't backtracking help here? Because the logicOps parser already succeeded, so there's nowhere to backtrack to. If the parser were structured like this:
lazy val logicOpAndValue = ">" ~ value | "<" ~ value | "=" ~ value |
">=" ~ value | "<=" ~ value | "!=" ~ value |
"<>" ~ value
lazy val elem: Parser[String] = key <~ logicOpAndValue
Then the following would happen (with the current input being <>$some_digit$):
">" does not match the current input, so go to next alternative
"<" does match the current input, so try the right operand of the ~ (i.e. value) with the current input >$some_digit$. This fails, so continue with the next alternative.
... bunch of alternatives that don't match ...
"<>" does match the current input, so try the right operand of the ~. This matches as well. Success!
However in your code the ~ value is outside of the list of alternatives, not inside each alternative. So when the parser fails, we're no longer inside any alternative, so there's no next alternative to try and it just fails.
Of course moving the ~ value inside the alternatives isn't really a satisfying solution as it's ugly as hell and not very maintainable in the general case.
One solution is simply to move the longer operators at the beginning of the alternatives (i.e. ">=" | "<=" | "<>" | ">" | "<" | ...). This way ">" and "<" will only be tried if ">=", "<=" and "<>" have already failed.
A still better solution, which does not rely on the order of alternatives and is thus less error-prone, is to use ||| instead of |. ||| works like | except that it tries all of the alternatives and then returns the longest successful result - not the first one.
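A sketch of the two variants for the logicOps rule inside KeyParser from the question; pick one, everything else stays unchanged:
// longest operators first, so ">" and "<" are only tried after ">=", "<=" and "<>" have failed
lazy val logicOps = ">=" | "<=" | "<>" | "!=" | ">" | "<" | "="
// or, order-independent: ||| tries the alternatives and keeps the longest successful match
lazy val logicOps = ">" ||| "<" ||| "=" ||| ">=" ||| "<=" ||| "!=" ||| "<>"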
PS: This isn't related to your problem but you're currently limiting the nesting depth of parentheses because your grammar is not recursive. To allow unlimited nesting, you'll want your expr and multiExpr rules to look like this:
lazy val expr: Parser[List[String]] =
  "(" ~> multiExpr <~ ")" | (elem ^^ (List(_))) // lift elem into a List so both alternatives have type List[String]
lazy val multiExpr: Parser[List[String]] =
  repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
However I recommend renaming expr to something like primaryExpr and multiExpr to expr.
_.foldLeft(List.empty[String])(_ ++ _) can also be more succinctly expressed as _.flatten.

Operator Precedence with Scala Parser Combinators

I am working on a Parsing logic that needs to take operator precedence into consideration. My needs are not too complex. To start with I need multiplication and division to take higher precedence than addition and subtraction.
For example: 1 + 2 * 3 should be treated as 1 + (2 * 3). This is a simple example but you get the point!
[There are couple more custom tokens that I need to add to the precedence logic, which I may be able to add based on the suggestions I receive here.]
Here is one example of dealing with operator precedence: http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html#precedencerevisited.
Are there any other ideas?
This is a bit simpler than Jim McBeath's example, but it does what you say you need, i.e. correct arithmetic precedence, and it also allows for parentheses. I adapted the example from Programming in Scala to get it to actually do the calculation and provide the answer.
It should be quite self-explanatory. There is a hierarchy formed by saying that an expr consists of terms interspersed with operators, terms consist of factors with operators, and factors are floating-point numbers or expressions in parentheses.
import scala.util.parsing.combinator.JavaTokenParsers

class Arith extends JavaTokenParsers {
  type D = Double
  def expr: Parser[D] = term ~ rep(plus | minus) ^^ { case a ~ b => (a /: b)((acc, f) => f(acc)) }
  def plus: Parser[D => D] = "+" ~ term ^^ { case "+" ~ b => _ + b }
  def minus: Parser[D => D] = "-" ~ term ^^ { case "-" ~ b => _ - b }
  def term: Parser[D] = factor ~ rep(times | divide) ^^ { case a ~ b => (a /: b)((acc, f) => f(acc)) }
  def times: Parser[D => D] = "*" ~ factor ^^ { case "*" ~ b => _ * b }
  def divide: Parser[D => D] = "/" ~ factor ^^ { case "/" ~ b => _ / b }
  def factor: Parser[D] = fpn | "(" ~> expr <~ ")"
  def fpn: Parser[D] = floatingPointNumber ^^ (_.toDouble)
}

object Main extends Arith with App {
  val input = "(1 + 2 * 3 + 9) * 2 + 1"
  println(parseAll(expr, input).get) // prints 33.0
}
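A quick check of the precedence, mirroring the example from the question (the object name here is arbitrary):
object PrecedenceCheck extends Arith with App {
  // * binds tighter than +, so 1 + 2 * 3 is grouped as 1 + (2 * 3)
  println(parseAll(expr, "1 + 2 * 3").get)   // 7.0
  println(parseAll(expr, "(1 + 2) * 3").get) // 9.0
}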

Scala, Parser Combinator for Tree Structured Data

How can parsers be used to parse records that span multiple lines? I need to parse tree data (and eventually transform it into a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but it's not clear if this is even the best approach with Scala parsers. The question is really more about the problem-solving approach than about debugging the existing code.
The EBNF-ish grammar is:
SP = " "
CRLF = "\r\n"
level = "0" | "1" | "2" | "3"
varName = {alphanum}
varValue = {alphanum}
recordBegin = "0", varName
recordItem = level, varName, [varValue]
record = recordBegin, {recordItem}
file = {record}
An attempt to implement and test the grammar:
import util.parsing.combinator._

val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val skipWhitespace = false

  def CRLF = "\r\n" | "\n"
  def BOF = "\\A".r
  def EOF = "\\Z".r
  def TXT = "[^\r\n]*".r
  def TXTNOSP = "[^ \r\n]*".r
  def SP = "\\s".r

  def level: Parser[Int] = "[0-3]{1}".r ^^ { v => v.toInt }
  def varName: Parser[String] = SP ~> TXTNOSP
  def varValue: Parser[String] = SP ~> TXT
  def recordBegin: Parser[Any] = "0" ~ SP ~ varName ~ CRLF
  def recordItem: Parser[(Int, String, String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
    { case l ~ f ~ v => (l, f, v.map(_ + "").getOrElse("")) }
  def record: Parser[List[(Int, String, String)]] = recordBegin ~> rep(recordItem)
  def file: Parser[List[List[(Int, String, String)]]] = rep(record) <~ EOF
  def parse(input: String) = parseAll(file, input)
}

val result = TreeParser.parse(input).get
result.foreach(println)
As Daniel said, you had better let the parser handle whitespace skipping to keep your code small. However, you may want to tweak the whiteSpace value so you can match ends of lines explicitly. I did that below to prevent the parser from moving to the next line when no value is defined for a record item.
As much as possible, try to use the parsers defined in JavaTokenParsers like ident if you want to match alphabetic words.
To ease your error tracing, perform a NoSuccess match on parseAll so you can see at what point the parser failed.
import util.parsing.combinator._

val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""

object TreeParser extends JavaTokenParsers {
  override val whiteSpace = """[ \t]+""".r

  val level = """[1-3]{1}""".r
  val value = """[a-zA-Z0-9_, ]*""".r
  val eol = """[\r?\n]+""".r

  def recordBegin = "0" ~ ident <~ eol
  def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
    case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
  }
  def record = recordBegin ~> rep1(recordItem)
  def file = rep1(record)

  def parse(input: String) = parseAll(file, input) match {
    case Success(result, _) => result
    case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
  }
}

val result = TreeParser.parse(input)
result.foreach(println)
Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example:
[1.3] failure: string matching regex `\s' expected but `f' found
0 fruit
^
Which is actually pretty clear, though the question is why it expected a space. Now, this was obviously processing a recordBegin rule, which is defined thusly:
"0" ~ SP ~ varName ~ CRLF
So, it parses the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:
SP ~> TXTNOSP
Another space! So, fruit should have begun with a space.
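Following that diagnosis, the smallest repair, staying with the question's explicit-whitespace style and only addressing this first error, is to consume the space in one place only, for example:
// varName already starts with SP, so recordBegin should not consume another one
def recordBegin: Parser[Any] = "0" ~ varName ~ CRLF
Later rules may need similar adjustments; the rewritten parser in the previous answer sidesteps the whole issue by letting the parser skip spaces and tabs automatically.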

Scala combinator parsers - distinguish between number strings and variable strings

I'm doing Cay Horstmann's combinator parser exercises, and I wonder about the best way to distinguish between strings that represent numbers and strings that represent variables in a match statement:
def factor: Parser[ExprTree] = (wholeNumber | "(" ~ expr ~ ")" | ident) ^^ {
  case a: wholeNumber => Number(a.toInt)
  case a: String => Variable(a)
}
The second line there, "case a: wholeNumber" is not legal. I thought about a regexp, but haven't found a way to get it to work with "case".
I would split it up a bit and push the case analysis into the |. This is one of the advantages of combinators and really LL(*) parsing in general:
def factor: Parser[ExprTree] = ( wholeNumber ^^ { n => Number(n.toInt) }
                               | "(" ~> expr <~ ")"
                               | ident ^^ { Variable(_) } )
I apologize if you're not familiar with the underscore syntax. Basically it just means "substitute the nth parameter to the enclosing function value". Thus { Variable(_) } is equivalent to { x => Variable(x) }.
Another bit of syntax magic here is the ~> and <~ operators in place of ~. These operators mean that the parsing of the term should still include both parentheses, but the result should be determined solely by the result of expr. Thus, "(" ~> expr <~ ")" matches exactly the same thing as "(" ~ expr ~ ")", but it doesn't require the extra case analysis to retrieve the inner result value from expr.
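For context, a self-contained sketch of the surrounding pieces this factor assumes; ExprTree, Number, Variable, Sum and the minimal expr rule are hypothetical stand-ins for the definitions in Horstmann's exercise:
import scala.util.parsing.combinator.JavaTokenParsers

sealed trait ExprTree
case class Number(n: Int) extends ExprTree
case class Variable(name: String) extends ExprTree
case class Sum(left: ExprTree, right: ExprTree) extends ExprTree

object ExprParser extends JavaTokenParsers {
  // sums of factors, folded into left-nested Sum nodes
  def expr: Parser[ExprTree] = factor ~ rep("+" ~> factor) ^^ {
    case first ~ rest => rest.foldLeft(first)(Sum(_, _))
  }
  def factor: Parser[ExprTree] = ( wholeNumber ^^ { n => Number(n.toInt) }
                                 | "(" ~> expr <~ ")"
                                 | ident ^^ { Variable(_) } )
}
// ExprParser.parseAll(ExprParser.expr, "2 + (x + 3)").get
// => Sum(Number(2),Sum(Variable(x),Number(3)))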