Scala, Parser Combinator for Tree Structured Data - scala

How can parsers be used to parse records that span multiple lines? I need to parse tree data (and eventually transform it into a tree data structure). I'm getting a difficult-to-trace parse error in the code below, but it's not clear whether this is even the best approach with Scala parsers. The question is really more about the problem-solving approach than about debugging existing code.
The EBNF-ish grammar is:
SP = " "
CRLF = "\r\n"
level = "0" | "1" | "2" | "3"
varName = {alphanum}
varValue = {alphanum}
recordBegin = "0", varName
recordItem = level, varName, [varValue]
record = recordBegin, {recordItem}
file = {record}
An attempt to implement and test the grammar:
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val skipWhitespace = false
def CRLF = "\r\n" | "\n"
def BOF = "\\A".r
def EOF = "\\Z".r
def TXT = "[^\r\n]*".r
def TXTNOSP = "[^ \r\n]*".r
def SP = "\\s".r
def level: Parser[Int] = "[0-3]{1}".r ^^ {v => v.toInt}
def varName: Parser[String] = SP ~> TXTNOSP
def varValue: Parser[String] = SP ~> TXT
def recordBegin: Parser[Any] = "0" ~ SP ~ varName ~ CRLF
def recordItem: Parser[(Int,String,String)] = level ~ varValue ~ opt(varValue) <~ CRLF ^^
{case l ~ f ~ v => (l,f,v.map(_+"").getOrElse(""))}
def record: Parser[List[(Int,String,String)]] = recordBegin ~> rep(recordItem)
def file: Parser[List[List[(Int,String,String)]]] = rep(record) <~ EOF
def parse(input: String) = parseAll(file, input)
}
val result = TreeParser.parse(input).get
result.foreach(println)

As Daniel said, you are better off letting the parser handle whitespace skipping, which keeps your code to a minimum. However, you may want to tweak the whiteSpace value so that you can match ends of lines explicitly. I did that below to prevent the parser from moving on to the next line when a record has no value.
As much as possible, use the parsers already defined in JavaTokenParsers, such as ident, when you want to match identifier-like words.
To ease error tracing, match on NoSuccess in the result of parseAll so you can see at what point the parser failed.
import util.parsing.combinator._
val input = """0 fruit
1 id 2
1 name apple
2 type red
3 size large
3 origin Texas, US
2 var_without_value
2 date 2 aug 2011
0 fruit
1 id 3
1 name apple
2 type green
3 size small
3 origin Florida, US
2 date 3 Aug 2011"""
object TreeParser extends JavaTokenParsers {
override val whiteSpace = """[ \t]+""".r
val level = """[1-3]{1}""".r
val value = """[a-zA-Z0-9_, ]*""".r
val eol = """(\r?\n)+""".r
def recordBegin = "0" ~ ident <~ eol
def recordItem = level ~ ident ~ opt(value) <~ opt(eol) ^^ {
case l ~ n ~ v => (l.toInt, n, v.getOrElse(""))
}
def record = recordBegin ~> rep1(recordItem)
def file = rep1(record)
def parse(input: String) = parseAll(file, input) match {
case Success(result, _) => result
case NoSuccess(msg, _) => throw new RuntimeException("Parsing Failed:" + msg)
}
}
val result = TreeParser.parse(input)
result.foreach(println)
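The question also asks about eventually turning the parsed items into a tree. The following is only a sketch of one way to do that, assuming each record arrives as the flat, depth-first List[(Int, String, String)] produced by the parser above. The names Node, TreeBuilder and buildForest are made up for this illustration and are not part of the original answer.

```scala
// Hypothetical tree node for illustration; not part of the original answer.
final case class Node(level: Int, name: String, value: String, children: List[Node] = Nil)

object TreeBuilder {
  type Item = (Int, String, String)

  /** Turns a flat, depth-first list of (level, name, value) items into a forest. */
  def buildForest(items: List[Item]): List[Node] = {
    // Collects consecutive nodes whose level is >= minLevel.
    // Returns those nodes together with the items that were not consumed.
    def siblings(rest: List[Item], minLevel: Int): (List[Node], List[Item]) =
      rest match {
        case (lvl, name, value) :: tail if lvl >= minLevel =>
          // Everything deeper than this item belongs to it as children.
          val (kids, afterKids) = siblings(tail, lvl + 1)
          // Whatever follows at this depth is a sibling.
          val (more, remaining) = siblings(afterKids, minLevel)
          (Node(lvl, name, value, kids) :: more, remaining)
        case _ =>
          (Nil, rest)
      }
    siblings(items, 0)._1
  }
}
```

With the parser above, result.map(TreeBuilder.buildForest) would give one forest per record. The recursion is not tail-recursive, so this is meant for modest inputs only.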

Handling whitespace explicitly is not a particularly good idea. And, of course, using get means you lose the error message. In this particular example:
[1.3] failure: string matching regex `\s' expected but `f' found
0 fruit
^
Which is actually pretty clear, though the question is why it expected a space. Now, this was obviously processing a recordBegin rule, which is defined thusly:
"0" ~ SP ~ varName ~ CRLF
So, it parses the zero, then the space, and then fruit must be parsed against varName. Now, varName is defined like this:
SP ~> TXTNOSP
Another space! So, fruit should have begun with a space.

Related

Why the expr parser only can parse the first item of it?

I have a parser:
package app
import scala.util.parsing.combinator._
class MyParser extends JavaTokenParsers {
import MyParser._
def expr =
plus | sub | multi | divide | num
def num = floatingPointNumber ^^ (x => Value(x.toDouble).e)
def plus = num ~ rep("+" ~> num) ^^ {
case num ~ nums => nums.foldLeft(num.e) {
(x, y) => Operation("+", x, y)
}
}
def sub = num ~ rep("-" ~> num) ^^ {
case num ~ nums => nums.foldLeft(num.e){
(x, y) => Operation("-", x, y)
}
}
def multi = num ~ rep("*" ~> num) ^^ {
case num ~ nums => nums.foldLeft(num.e){
(x, y) => Operation("*", x, y)
}
}
def divide = num ~ rep("/" ~> num) ^^ {
case num ~ nums => nums.foldLeft(num.e){
(x, y) => Operation("/", x, y)
}
}
}
object MyParser {
sealed trait Expr {
def e = this.asInstanceOf[Expr]
def compute: Double = this match {
case Value(x) => x
case Operation(op, left, right) => (op: @unchecked) match {
case "+" => left.compute + right.compute
case "-" => left.compute - right.compute
case "*" => left.compute * right.compute
case "/" => left.compute / right.compute
}
}
}
case class Value(x: Double) extends Expr
case class Operation(op: String, left: Expr, right: Expr) extends Expr
}
and I use it to parse an expression like this:
package app
object Runner extends App {
val p = new MyParser
println(p.parseAll(p.expr, "1 * 11"))
}
it prints
[1.3] failure: end of input expected
1 * 11
^
but if I parse the expression 1 + 11, it will succeed in parsing it.
[1.7] parsed: Operation(+,Value(1.0),Value(11.0))
I can parse input with the plus, multi, divide, num and sub combinators individually, but the expr combinator can only parse what the first alternative of the | chain accepts.
So why does expr only handle its first alternative? And how can I change the definition of the parsers to make the parse succeed?
The problem is that you're using rep which matches zero or more times.
def rep[T](p: => Parser[T]): Parser[List[T]] = rep1(p) | success(List())
you need to use rep1 instead which would require at least one match.
If you replace all rep with rep1, your code works.
Check out the changes on scastie
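A minimal sketch of the rep1 change, assuming a simplified stand-in for the question's parser that builds strings rather than Expr nodes. The names Rep1Demo, binary and run are invented for this illustration.

```scala
import scala.util.parsing.combinator._

// Simplified stand-in for the question's parser: it builds strings, not Expr nodes.
object Rep1Demo extends JavaTokenParsers {
  def num: Parser[String] = floatingPointNumber

  // Hypothetical helper: one rule per operator. rep1 demands at least one
  // "op num" tail, so the rule fails (and | moves on) when the operator is absent.
  def binary(op: String): Parser[String] =
    num ~ rep1(op ~> num) ^^ {
      case first ~ rest => rest.foldLeft(first)((l, r) => s"($l $op $r)")
    }

  def expr: Parser[String] =
    binary("+") | binary("-") | binary("*") | binary("/") | num

  // Returns the parsed string, or None when the whole input could not be parsed.
  def run(input: String): Option[String] = parseAll(expr, input) match {
    case Success(result, _) => Some(result)
    case _                  => None
  }
}
```

Here Rep1Demo.run("1 * 11") yields Some("(1 * 11)"), because binary("+") and binary("-") now fail instead of succeeding on the bare 1. Note that an input mixing operators, such as "1 + 2 * 3", is still rejected by this grammar; that needs a layered grammar with precedence levels.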
Run an experiment:
println(p.parseAll(p.expr, "1 + 11"))
println(p.parseAll(p.expr, "1 - 11"))
println(p.parseAll(p.expr, "1 * 11"))
println(p.parseAll(p.expr, "1 / 11"))
What will happen?
[1.7] parsed: Operation(+,Value(1.0),Value(11.0))
[1.3] failure: end of input expected
1 - 11
^
[1.3] failure: end of input expected
1 * 11
^
[1.3] failure: end of input expected
1 / 11
+ is consumed, but everything else fails. Let's change def expr definition
def expr =
multi | plus | sub | divide | num
[1.3] failure: end of input expected
1 + 11
^
[1.3] failure: end of input expected
1 - 11
^
[1.7] parsed: Operation(*,Value(1.0),Value(11.0))
[1.3] failure: end of input expected
1 / 11
^
By moving multi to the beginning, * case passed, but + failed.
def expr =
num | multi | plus | sub | divide
[1.3] failure: end of input expected
1 + 11
^
[1.3] failure: end of input expected
1 - 11
^
[1.3] failure: end of input expected
1 * 11
^
[1.3] failure: end of input expected
1 / 11
With num as the first case everything fails. It is apparent now that this code
num | multi | plus | sub | divide
is NOT matching if any of its parts match, but only if the first one matches.
What do the docs say about it?
/** A parser combinator for alternative composition.
*
* `p | q` succeeds if `p` succeeds or `q` succeeds.
* Note that `q` is only tried if `p`s failure is non-fatal (i.e., back-tracking is allowed).
*
* @param q a parser that will be executed if `p` (this parser) fails (and allows back-tracking)
* @return a `Parser` that returns the result of the first parser to succeed (out of `p` and `q`)
* The resulting parser succeeds if (and only if)
* - `p` succeeds, ''or''
* - if `p` fails allowing back-tracking and `q` succeeds.
*/
def | [U >: T](q: => Parser[U]): Parser[U] = append(q).named("|")
Important note: backtracking has to be allowed. If it isn't, then a failure to match the first parser will result in the whole alternative failing without the second parser being tried at all.
How do you make your parser backtrack? Well, you would have to use PackratParsers, as this is the only parser in the library that supports backtracking. Or rewrite your code so that it does not rely on backtracking in the first place.
Personally, I recommend not using Scala Parser Combinators and instead using a library where you explicitly decide when backtracking is still allowed and when it is not, e.g. fastparse.

Why doesn't parser combinator backtrack?

Consider
import util.parsing.combinator._
object TreeParser extends JavaTokenParsers {
lazy val expr: Parser[String] = decimalNumber | sum
//> expr: => TreeParser.Parser[String]
lazy val sum: Parser[String] = expr ~ "+" ~ expr ^^ {case a ~ plus ~ b => s"($a)+($b)"}
//> sum: => TreeParser.Parser[String]
println(parseAll(expr, "1 + 1")) //> TreeParser.ParseResult[String] = [1.3] failure: string matching regex
//| `\z' expected but `+' found
}
The same story with fastparse
import fastparse.all._
val expr: P[Any] = P("1" | sum)
val sum: P[Any] = expr ~ "+" ~ expr
val top = expr ~ End
println(top.parse("1+1")) // Failure(End:1:2 ..."+1")
Both parsers readily discover that taking the first literal was a bad idea, yet neither tries to fall back to the sum production. Why?
I understand that the parser takes the first branch that can successfully consume part of the input string and then exits. Here, the "1" alternative of expr matches the first input character and parsing completes. In order to grab more, we need to make sum the first alternative. However, the naive
lazy val expr: Parser[String] = sum | "1"
ends up with a stack overflow. The library authors therefore approach it from the other side:
val sum: P[Any] = P( num ~ ("+".! ~/ num).rep )
val top: P[Any] = P( sum ~ End )
Here, we start sum with terminal, which is fine but this syntax is more verbose and, furthermore, it produces a terminal, followed by a list, which is good for a reduction operator, like sum, but is difficult to map to a series of binary operators.
What if your language defines expression, which admits a binary operator? You want to match every occurrence of expr op expr and map it to a corresponding tree node
expr ~ "op" ~ expr ^^ { case a ~ _ ~ b => BinOp(a, b) }
How do you do that? In short, I want a greedy parser that consumes the whole string. That is what I mean by 'greedy', as opposed to a greedy algorithm that jumps into the first wagon and ends up in a dead end.
As I have found here, we need to replace the | alternative operator with the little-known |||:
//lazy val expr: Parser[String] = decimalNumber | sum
lazy val backtrackGreedy: Parser[String] = decimalNumber ||| sum
lazy val sum: Parser[String] = decimalNumber ~ "+" ~ backtrackGreedy ^^ {case a ~ plus ~ b => s"($a)+($b)"}
println(parseAll(backtrackGreedy, "1 + 1")) // [1.6] parsed: (1)+(1)
The order of alternatives does not matter with this operator. To stop stack overflow, we need to eliminate the left recursion, sum = expr + expr => sum = number + expr.
Another answer says that we need to normalize, that is instead of
def foo = "foo" | "fo"
def obar = "obar"
def foobar = foo ~ obar
we need to use
def workingFooBar = ("foo" ~ obar) | ("fo" ~ obar)
But the first solution is more striking.
The parser does backtrack. Try val expr: P[String] = P(("1" | "1" ~ "+" ~ "1").!) and expr.parse("1+1") for example.
The problem is in your grammar. expr parses 1 and it is a successful parsing by your definition. Then sum fails and now you want to blame the dutiful expr for what happened?
There are plenty of examples on how to deal with binary operators. For example, the first example here: http://lihaoyi.github.io/fastparse/
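For the scala-parser-combinators side of the question, one common way to get a tree of binary nodes without left recursion is chainl1, which parses operand (operator operand)* and folds the results to the left. This is only a sketch: the names ChainDemo, Ast, Num and BinOp are assumptions made for the example, not types from the question.

```scala
import scala.util.parsing.combinator._

object ChainDemo extends JavaTokenParsers {
  // Hypothetical AST for illustration.
  sealed trait Ast
  final case class Num(text: String) extends Ast
  final case class BinOp(op: String, left: Ast, right: Ast) extends Ast

  def num: Parser[Ast] = decimalNumber ^^ (s => Num(s))

  // The operator parser yields the function that combines two operands.
  def plus: Parser[(Ast, Ast) => Ast] =
    "+" ^^^ ((l: Ast, r: Ast) => BinOp("+", l, r))

  // num (+ num)*, combined left-associatively into nested BinOp nodes.
  def sum: Parser[Ast] = chainl1(num, plus)
}
```

ChainDemo.parseAll(ChainDemo.sum, "1 + 2 + 3") then produces BinOp(+, BinOp(+, Num(1), Num(2)), Num(3)), so each occurrence of the operator maps to its own tree node even though the grammar starts with a terminal.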

Parsing a list of 0 or more idents followed by ident

I want to parse a part of my DSL formed like this:
configSignal: sticky Config
Semantically this is:
argument_name: 0_or_more_modifiers argument_type
I tried implementing the following parser:
def parser = ident ~ ":" ~ rep(ident) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
Thing is, the rep(ident) part is applied until there are no more tokens and the parser fails, because the last ~ ident doesn't match. How should I do this properly?
Edit
In the meantime I realized, that the modifiers will be reserved words (keywords), so now I have:
def parser = ident ~ ":" ~ rep(modifier) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
def modifier = "sticky" | "control" | "count"
Nevertheless, I'm curious if it would be possible to write a parser if the modifiers weren't defined up front.
"0 or more idents followed by ident" is equivalent to "1 or more idents", so just use rep1
Its docs:
def rep1[T](p: ⇒ Parser[T]): Parser[List[T]]
A parser generator for non-empty repetitions.
rep1(p) repeatedly uses p to parse the input until p fails -- p must succeed at least once (the result is a List of the consecutive results of p)
p a Parser that is to be applied successively to the input
returns A parser that returns a list of results produced by repeatedly applying p to the input (and that only succeeds if p matches at least once).
edit in response to OP's comment:
I don't think there's a built-in way to do what you described, but it would still be relatively easy to map to your custom data types by using regular List methods:
def parser = ident ~ ":" ~ rep1(ident) ^^ {
case name ~ ":" ~ idents => Arg(name, idents.last, idents.dropRight(1))
}
In this particular case, you wouldn't have to worry about idents being Nil, since the rep1 parser only succeeds with a non-empty list.
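As an alternative to last and dropRight, the split can be written with the standard :+ extractor, which matches a sequence as "everything but the last element" plus "the last element". A small sketch, assuming an Arg class shaped like the one the question constructs; the definition of Arg and the names ArgSplit and toArg are hypothetical.

```scala
// Hypothetical definition; the question does not show its own Arg class.
final case class Arg(name: String, returnType: String, modifiers: List[String])

object ArgSplit {
  // Splits an identifier list into (modifiers, type) using the :+ extractor.
  def toArg(name: String, idents: List[String]): Option[Arg] = idents match {
    case modifiers :+ returnType => Some(Arg(name, returnType, modifiers))
    case _                       => None // empty list: nothing to use as the type
  }
}
```

Inside the parser action the same pattern can be used directly, e.g. case name ~ ":" ~ (modifiers :+ returnType), since rep1 guarantees a non-empty list.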

Operator Precedence with Scala Parser Combinators

I am working on a Parsing logic that needs to take operator precedence into consideration. My needs are not too complex. To start with I need multiplication and division to take higher precedence than addition and subtraction.
For example: 1 + 2 * 3 should be treated as 1 + (2 * 3). This is a simple example but you get the point!
[There are couple more custom tokens that I need to add to the precedence logic, which I may be able to add based on the suggestions I receive here.]
Here is one example of dealing with operator precedence: http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html#precedencerevisited.
Are there any other ideas?
This is a bit simpler than Jim McBeath's example, but it does what you say you need, i.e. correct arithmetic precedence, and it also allows for parentheses. I adapted the example from Programming in Scala to get it to actually do the calculation and provide the answer.
It should be quite self-explanatory. There is a hierarchy formed by saying that an expr consists of terms interspersed with operators, terms consist of factors with operators, and factors are floating point numbers or expressions in parentheses.
import scala.util.parsing.combinator.JavaTokenParsers
class Arith extends JavaTokenParsers {
type D = Double
def expr: Parser[D] = term ~ rep(plus | minus) ^^ {case a~b => (a /: b)((acc,f) => f(acc))}
def plus: Parser[D=>D] = "+" ~ term ^^ {case "+"~b => _ + b}
def minus: Parser[D=>D] = "-" ~ term ^^ {case "-"~b => _ - b}
def term: Parser[D] = factor ~ rep(times | divide) ^^ {case a~b => (a /: b)((acc,f) => f(acc))}
def times: Parser[D=>D] = "*" ~ factor ^^ {case "*"~b => _ * b }
def divide: Parser[D=>D] = "/" ~ factor ^^ {case "/"~b => _ / b}
def factor: Parser[D] = fpn | "(" ~> expr <~ ")"
def fpn: Parser[D] = floatingPointNumber ^^ (_.toDouble)
}
object Main extends Arith with App {
val input = "(1 + 2 * 3 + 9) * 2 + 1"
println(parseAll(expr, input).get) // prints 33.0
}
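Further precedence levels follow the same pattern: each new level gets its own rule that sits between two existing ones. As a sketch only, here is the same grammar with an extra, tighter-binding and right-associative ^ operator. The operator and the names ArithPow and power are assumptions for illustration; the question does not say which custom tokens it needs.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

class ArithPow extends JavaTokenParsers {
  // Lowest precedence: + and -, folded left to right.
  def expr: Parser[Double] = term ~ rep(("+" | "-") ~ term) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "+" ~ t) => acc + t
      case (acc, _ ~ t)   => acc - t
    }
  }

  // Middle precedence: * and /, built from the new power level.
  def term: Parser[Double] = power ~ rep(("*" | "/") ~ power) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "*" ~ p) => acc * p
      case (acc, _ ~ p)   => acc / p
    }
  }

  // Hypothetical extra level: ^ recurses on its right side, so it is right-associative.
  def power: Parser[Double] = factor ~ opt("^" ~> power) ^^ {
    case base ~ Some(exponent) => math.pow(base, exponent)
    case base ~ None           => base
  }

  // Highest precedence: numbers and parenthesised expressions.
  def factor: Parser[Double] =
    floatingPointNumber ^^ (_.toDouble) | "(" ~> expr <~ ")"
}
```

With val p = new ArithPow, the input "1 + 2 * 3 ^ 2" evaluates as 1 + (2 * (3 ^ 2)), and "2 ^ 3 ^ 2" as 2 ^ (3 ^ 2).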

issue `object Foo { val 1 = 2 }` in scala

I found this issue of scala: https://issues.scala-lang.org/browse/SI-4939
Seems we can define a method whose name is a number:
scala> object Foo { val 1 = 2 }
defined module Foo
But we can't invoke it:
scala> Foo.1
<console>:1: error: ';' expected but double literal found.
Foo.1
And we can invoke it inside the object:
scala> object O { val 1 = 1; def x = 1 }
defined module O
scala> O.x
res1: Int = 1
And the following throws an error:
scala> object O { val 1 = 2; def x = 1 }
defined module O
scala> O.x
scala.MatchError: 2
at O$.<init>(<console>:5)
at O$.<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at RequestResult$.<init>(<console>:9)
I use scalac -Xprint:typer to see the code, the val 1 = 2 part is:
<synthetic> private[this] val x$1: Unit = (2: Int(2) @unchecked) match {
case 1 => ()
}
From this we can see that the method name was changed to x$1 and that it can only be invoked inside that object.
And the resolution of that issue is: Won't Fix
I want to know whether there is any reason to allow a number to be the name of a method. Is there any case where we need a "number" method?
There is no name "1" being bound here. val 1 = 2 is a pattern-matching expression, in much the same way val (x,2) = (1,2) binds x to 1 (and would throw a MatchError if the second element were not the same). It's allowed because there's no real reason to add a special case to forbid it; this way val pattern matching works (almost) exactly the same way as match pattern-matching.
There are usually two factors in this kind of decision:
There are many bugs in Scalac that are much higher priority, and bug fixing resources are limited. This behavior is benign and therefore low priority.
There's a long term cost to any increases in the complexity of the language specification, and the current behavior is consistent with the spec. Once things start getting special cased, there can be an avalanche effect.
It's some combination of these two.
Update. Here's what seems strange to me:
val pair = (1, 2)
object Foo
object Bar
val (1, 2) = pair // Pattern matching on constants 1 and 2
val (Foo, Bar) = pair // Pattern matching on stable ids Foo and Bar
val (foo, bar) = pair // Binds foo and bar because they are lowercase
val 1 = 1 // Pattern matching on constant 1
val Foo = 1 // *Not* pattern matching; binds Foo
If val 1 = 1 is pattern matching, then why should val Foo = 1 bind Foo rather than pattern match?
Update 2. Daniel Sobral pointed out that this is a special exception, and Martin Odersky recently wrote the same.
Here's a few examples to show how the LHS of an assignment is more than just a name:
val pair = (1, 2)
val (a1, b1) = pair // LHS of the = is a pattern
val (1, b2) = pair // okay, b2 is bound to the value 2
val (0, b3) = pair // MatchError, as 0 != 1
val a4 = 1 // okay, a4 is bound to the value 1
val 1 = 1 // okay, but useless, no names are bound
val a @ 1 = 1 // well, we can bind a name to a pattern with @
val 1 = 0 // MatchError
As always, you can use backticks to escape the name. I see no problem in supporting such names – either you use them and they work for you or they do not work for you, and you don’t use them.
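To make the difference concrete, here is a small sketch contrasting a constant pattern on the left-hand side with a backtick-escaped name. It assumes Scala 2 behaviour (Scala 3 additionally warns about refutable patterns in val definitions), and the name ValPatternDemo and its members are made up for the example.

```scala
import scala.util.Try

object ValPatternDemo {
  val pair = (1, 2)

  // The constant 1 is checked against the first element at run time; b is bound to 2.
  val matching: Try[Int] = Try { val (1, b) = pair; b }

  // The constant 0 does not match, so the definition throws a MatchError.
  val failing: Try[Int] = Try { val (0, b) = pair; b }

  // With backticks the characters form an ordinary identifier, so a name is bound.
  val `1` = 2
  val viaBackticks: Int = `1`
}
```

From outside the object the escaped name is reached the same way, as ValPatternDemo.`1`.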