I am trying to implement the following grammar using the FastParse API.
Expr can contain only Foo, Bar, and Baz sub-expressions.
Expr must contain at least one sub-expression (Foo, Bar, or Baz). It cannot be empty.
Foo, Bar, and Baz can appear in any order inside Expr.
Foo, Bar, and Baz cannot repeat, so each can be used only once.
So a valid expression is Expr(Baz(10),Foo(10),Bar(10)), and invalid expressions are Expr() and Expr(Bar(10),Bar(10)).
So far I have written this code, which can parse and enforce rules 1, 2, and 3, but rule 4 is proving to be tricky.
import fastparse.noApi._
import fastparse.WhitespaceApi
object FastParsePOC {
val White = WhitespaceApi.Wrapper{
import fastparse.all._
NoTrace(" ".rep)
}
def print(input: Parsed[(String, String, Seq[(String, String)])]) : Unit = {
input match {
case Parsed.Success(value, index) =>
println(s"${value._1} ${value._2}")
value._3.foreach{case (name, index) => println(s"$name $index")}
case f @ Parsed.Failure(error, line, col) => println(s"Error: $error $line $col ${f.extra.traced.trace}")
}
}
def main(args: Array[String]) : Unit = {
import White._
val base = P("(" ~ (!")" ~ AnyChar).rep(1).! ~ ")")
val foo = P("Foo".! ~ base)
val bar = P("Bar".! ~ base)
val baz = P("Baz".! ~ base)
val foobarbaz = (foo | bar | baz)
val parser = P("Expr" ~ "(" ~ foobarbaz ~ ",".? ~ (foobarbaz).rep(sep=",") ~ ")")
val input3 = "Expr(Baz(20),Bar(10),Foo(30))"
val parsed = parser.parse(input3)
print(parsed)
}
}
You may check the "exactly-once" constraint with a filter call:
test("foo bar baz") {
val arg: P0 = P("(") ~ (!P(")") ~ AnyChar).rep(1) ~ ")"
val subExpr: P[String] = (P("Foo") | P("Bar") | P("Baz")).! ~ arg
val subExprList: P[Seq[String]] = subExpr.rep(min = 1, sep = P(",")).filter { list =>
list.groupBy(identity[String]).values.forall(_.length == 1)
}
val expr: P[Seq[String]] = P("Expr") ~ "(" ~ subExprList ~ ")"
expr.parse("Expr(Foo(10))").get.value
expr.parse("Expr(Foo(10),Bar(20),Baz(30))").get.value
intercept[Throwable] {
expr.parse("Expr()").get.value
}
intercept[Throwable] {
expr.parse("Expr(Foo(10),Foo(20))").get.value
}
}
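The predicate passed to filter can be tried on its own, without FastParse. The following is a minimal sketch in plain Scala; the object and method names (UniqueNamesSketch, allUnique, allUniqueViaDistinct) are made up for illustration and are not part of any library.

```scala
object UniqueNamesSketch {
  // Same check as in the filter call above: group equal names together
  // and require that every group has exactly one member.
  def allUnique(names: Seq[String]): Boolean =
    names.groupBy(identity).values.forall(_.length == 1)

  // An equivalent, shorter formulation: no element is lost by de-duplication.
  def allUniqueViaDistinct(names: Seq[String]): Boolean =
    names.distinct.length == names.length

  def main(args: Array[String]): Unit = {
    println(allUnique(Seq("Baz", "Foo", "Bar")))       // true
    println(allUnique(Seq("Bar", "Bar")))              // false
    println(allUniqueViaDistinct(Seq("Foo", "Foo")))   // false
  }
}
```

Note that both predicates return true for an empty list, so the "at least one" rule still depends on rep(min = 1, ...) in the parser itself.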
I took this from a project that claims to parse real numbers, but it somehow eats the pre-decimal part:
object Main extends App {
import org.parboiled.scala._
val res = TestParser.parseDouble("2.3")
println(s"RESULT: ${res.result}")
object TestParser extends Parser {
def RealNumber = rule {
oneOrMore(Digit) ~ optional( "." ~ oneOrMore(Digit) ) ~> { s =>
println(s"CAPTURED '$s'")
s.toDouble
}
}
def Digit = rule { "0" - "9" }
def parseDouble(input: String): ParsingResult[Double] =
ReportingParseRunner(RealNumber).run(input)
}
}
This prints:
CAPTURED '.3'
RESULT: Some(0.3)
What is wrong here? Note that currently I cannot go from Parboiled-1 to Parboiled-2, because I have a larger grammar that would have to be rewritten.
As stated in the parboiled documentation, action rules like ~> take the match of the immediately preceding peer rule. In the rule sequence oneOrMore(Digit) ~ optional( "." ~ oneOrMore(Digit) ), the immediately preceding rule is optional( "." ~ oneOrMore(Digit) ), so the action rule receives only its match, ".3".
To fix that you may, for example, extract the first two elements into a separate rule:
def RealNumberString = rule {
oneOrMore(Digit) ~ optional( "." ~ oneOrMore(Digit) )
}
def RealNumber = rule {
RealNumberString ~> { s =>
println(s"CAPTURED '$s'")
s.toDouble
}
}
Or push both parts onto the stack and then combine them:
def RealNumber = rule {
oneOrMore(Digit) ~> identity ~
optional( "." ~ oneOrMore(Digit) ) ~> identity ~~> { (s1, s2) =>
val s = s1 + s2
println(s"CAPTURED '$s'")
s.toDouble
}
}
Here is one solution, but it looks very ugly. Probably there is a better way:
def Decimal = rule {
Integer ~ optional[Int]("." ~ PosInteger) ~~> { (a: Int, bOpt: Option[Int]) =>
bOpt.fold(a.toDouble)(b => s"$a.$b".toDouble) /* ??? */
}}
def PosInteger = rule { Digits ~> (_.toInt) }
def Integer = rule { optional[Unit]("-" ~> (_ => ())) /* ??? */ ~
PosInteger ~~> { (neg: Option[Unit], num: Int) =>
if (neg.isDefined) -num else num
}
}
def Digit = rule { "0" - "9" }
def Digits = rule { oneOrMore(Digit) }
def parseDouble(input: String): ParsingResult[Double] =
ReportingParseRunner(Decimal).run(input)
This seems pretty simple!
class SeparatedParser(val input: ParserInput, val delimiter: String = ",") extends Parser {
def pipedField = rule { (zeroOrMore(field).separatedBy("|")) }
def field = rule { capture(zeroOrMore(noneOf(delimiter))) }
def d = delimiter
def record = rule {
field ~ d ~ pipedField ~ d ~ field ~ EOI
}
}
I try:
val parser = new SeparatedParser("""49798,piped1|piped2,sklw""")
val parsed = parser.record.run()
parsed match {
case Success(rel) => println(rel)
case Failure(pe:ParseError) =>println(parser.formatError(pe))
}
But I get:
49798 :: Vector(piped1|piped2) :: sklw :: HNil
I would expect the Vector to have two separate elements: piped1 and piped2.
What dumbass mistake am I making?
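Judging only from the rules shown (this is a reading of the grammar, not a tested parboiled2 fix), a likely cause is that field excludes only the delimiter "," and not "|", so the first field inside pipedField already consumes "piped1|piped2" and the separator never gets a chance to match. The sketch below models that behaviour in plain Scala; PipedFieldSketch and its field helper are hypothetical stand-ins for capture(zeroOrMore(noneOf(...))), not parboiled2 API.

```scala
object PipedFieldSketch {
  // Stand-in for capture(zeroOrMore(noneOf(excluded))):
  // the longest prefix that contains none of the excluded characters.
  def field(input: String, excluded: String): String =
    input.takeWhile(c => excluded.indexOf(c) == -1)

  def main(args: Array[String]): Unit = {
    val rest = "piped1|piped2,sklw"
    // Excluding only the delimiter swallows the pipe as well.
    println(field(rest, ","))    // piped1|piped2
    // Excluding the pipe too stops at the first piped element.
    println(field(rest, ",|"))   // piped1
  }
}
```

If this reading is right, giving pipedField its own element rule that also excludes "|" should yield two elements, but that is an assumption to verify against the real parser.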
I have this code below to check a string. We want to verify that it starts with '{' and ends with '}' and that it contains sequences of non-"{}" characters and strings that also have this property.
import util.parsing.combinator._
class Comp extends RegexParsers with PackratParsers {
lazy val bracefree: PackratParser[String] = """[^{}]*""".r ^^ {
case a => a
}
lazy val matching: PackratParser[String] = (
"{" ~ rep(bracefree | matching) ~ "}") ^^ {
case a ~ b ~ c => a + b.mkString("") + c
}
}
object Brackets extends Comp {
def main(args: Array[String])= {
println(parseAll(matching, "{ foo {hello 3 } {}}").get)
}
}
The desired output for this is to echo { foo {hello 3 } {}}, but it ends up taking a long time before dying from java.lang.OutOfMemoryError: GC overhead limit exceeded. What am I doing wrong and what should I have done instead?
Your regular expression for bracefree matches even the empty string, so the parser produced by rep() keeps succeeding without consuming any input and loops endlessly.
Use the + quantifier instead of *:
lazy val bracefree: PackratParser[String] = """[^{}]+""".r ^^ {
case a => a
}
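The difference between the two quantifiers can be seen with the standard library alone. The sketch below is a simplified model of a repetition loop, not the actual implementation of rep(); EmptyMatchSketch, countMatches, and the cap parameter are names invented for this illustration. The cap exists only so the demonstration terminates.

```scala
import scala.util.matching.Regex

object EmptyMatchSketch {
  val star: Regex = """[^{}]*""".r
  val plus: Regex = """[^{}]+""".r

  // Repeatedly matches `pattern` at the start of the remaining input,
  // the way a repetition combinator would, and counts the successes.
  // Stops at the first failure or after `cap` successes.
  def countMatches(pattern: Regex, input: String, cap: Int): Int = {
    var rest = input
    var count = 0
    var going = true
    while (going && count < cap) {
      pattern.findPrefixOf(rest) match {
        case Some(m) =>
          count += 1
          rest = rest.drop(m.length) // an empty match drops nothing
        case None =>
          going = false
      }
    }
    count
  }

  def main(args: Array[String]): Unit = {
    // With *, the pattern matches "" before the brace over and over.
    println(countMatches(star, "foo }", 1000)) // 1000 (stopped by the cap)
    // With +, it matches "foo " once and then fails at the brace.
    println(countMatches(plus, "foo }", 1000)) // 1
  }
}
```

Without the cap, the * variant would never leave the loop, which matches the out-of-memory symptom described in the question.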
Also, by default RegexParsers skips leading whitespace before each token, so the spaces would be missing from the echoed result. To turn that behavior off, override the method skipWhitespace to always return false. In the end your parser will look like this:
import util.parsing.combinator._
class Comp extends RegexParsers with PackratParsers {
override def skipWhitespace = false
lazy val bracefree: PackratParser[String] = """[^{}]+""".r ^^ {
case a => a
}
lazy val matching: PackratParser[String] = (
"{" ~ rep(bracefree | matching) ~ "}") ^^ {
case a ~ b ~ c => a + b.mkString("") + c
}
}
object Brackets extends Comp {
def main(args: Array[String])= {
println(parseAll(matching, "{ foo {hello 3 } {}}").get)
// prints: { foo {hello 3 } {}}
}
}
I'm trying to write a simple parser in Scala, but when I add a repeated token, Scala seems to get stuck in an infinite loop.
I have 2 parse methods below. One uses rep(). The non-repetitive version works as expected (though it is not what I want); the rep() version results in an infinite loop.
EDIT:
This was a learning example where I tried to enforce that the '=' was surrounded by whitespace.
If it is helpful this is my actual test file:
a = 1
b = 2
c = 1 2 3
I was able to parse: (with the parse1 method)
K = V
but then ran into this problem when I tried to expand the exercise to:
K = V1 V2 V3
import scala.util.parsing.combinator._
import scala.io.Source.fromFile
class MyParser extends RegexParsers {
override def skipWhitespace(): Boolean = { false }
def key: Parser[String] = """[a-zA-Z]+""".r ^^ { _.toString }
def eq: Parser[String] = """\s+=\s+""".r ^^ { _.toString.trim }
def string: Parser[String] = """[^ \t\n]*""".r ^^ { _.toString.trim }
def value: Parser[List[String]] = rep(string)
def foo(key: String, value: String): Boolean = {
println(key + " = " + value)
true
}
def parse1: Parser[Boolean] = key ~ eq ~ string ^^ { case k ~ eq ~ string => foo(k, string) }
def parse2: Parser[Boolean] = key ~ eq ~ value ^^ { case k ~ eq ~ value => foo(k, value.toString) }
def parseLine(line: String): Boolean = {
parse(parse2, line) match {
case Success(matched, _) => true
case Failure(msg, _) => false
case Error(msg, _) => false
}
}
}
object TestParser {
def usage() = {
System.out.println("<file>")
}
def main(args: Array[String]) : Unit = {
if (args.length != 1) {
usage()
} else {
val mp = new MyParser()
fromFile(args(0)).getLines().foreach { mp.parseLine }
println("done")
}
}
}
Next time, please provide some concrete examples, it's not obvious what your input is supposed to look like.
Meanwhile, you can try this, maybe you find it helpful:
import scala.util.parsing.combinator._
import scala.io.Source.fromFile
class MyParser extends JavaTokenParsers {
// override def skipWhitespace(): Boolean = { false }
def key: Parser[String] = """[a-zA-Z]+""".r ^^ { _.toString }
def eq: Parser[String] = "="
def string: Parser[String] = """[^ \t\n]+""".r
def value: Parser[List[String]] = rep(string)
def foo(key: String, value: String): Boolean = {
println(key + " = " + value)
true
}
def parse1: Parser[Boolean] = key ~ eq ~ string ^^ { case k ~ eq ~ string => foo(k, string) }
def parse2: Parser[Boolean] = key ~ eq ~ value ^^ { case k ~ eq ~ value => foo(k, value.toString) }
def parseLine(line: String): Boolean = {
parseAll(parse2, line) match {
case Success(matched, _) => true
case Failure(msg, _) => false
case Error(msg, _) => false
}
}
}
val mp = new MyParser()
for (line <- List("hey = hou", "hello = world ppl", "foo = bar baz blup")) {
println(mp.parseLine(line))
}
Explanation:
JavaTokenParsers extends RegexParsers, so both skip whitespace between tokens by default; the difference from your version is that the skipWhitespace override is commented out.
With that default in place, the whitespace is handled for you. JavaTokenParsers is not specific to Java; it works for most non-esoteric languages. As long as you are not trying to parse whitespace itself, it is a good starting point.
Your string definition used *, so it could match the empty string; rep(string) then kept succeeding without consuming any input, which caused the infinite loop.
Your eq definition matched whitespace itself, which interfered with the whitespace handling (don't do this unless it's really necessary).
Furthermore, if you want to parse the whole line, you must call parseAll;
otherwise parse accepts a matching prefix and ignores the rest of the line.
Final remark: for parsing key-value pairs line by line, some String.split and
String.trim would be completely sufficient. Scala Parser Combinators are a little overkill for that.
PS: Hmm... Did you want to allow =-signs in your key-names? Then my version would not work here, because it does not enforce an empty space after the key-name.
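The split/trim approach mentioned in the final remark could look roughly like the following. This is only a sketch under assumptions: SplitLineSketch and its parseLine are hypothetical names, lines are assumed to have the shape "key = v1 v2 v3", and, unlike the parser versions, it does not check that the '=' is surrounded by whitespace.

```scala
object SplitLineSketch {
  // Splits "key = v1 v2 v3" into the key and its list of values.
  // Returns None when there is no '=' or the key is empty.
  def parseLine(line: String): Option[(String, List[String])] =
    line.split("=", 2) match {
      case Array(k, v) if k.trim.nonEmpty =>
        // Values are separated by runs of whitespace; drop empty pieces.
        val values = v.trim.split("\\s+").filter(_.nonEmpty).toList
        Some((k.trim, values))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    println(parseLine("a = 1"))      // Some((a,List(1)))
    println(parseLine("c = 1 2 3"))  // Some((c,List(1, 2, 3)))
    println(parseLine("no equals"))  // None
  }
}
```

The limit of 2 in split keeps any further '=' characters inside the value part instead of splitting on them.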
This is not a duplicate, it's a different version with RegexParsers that takes care of whitespace explicitly
If for some reason you really care about the whitespace, you could stick to RegexParsers and do the following (notice skipWhitespace = false, the explicit whitespace parser ws, the two ws joined to the equality sign with ~> and <~, and the repsep with ws given explicitly as the separator):
import scala.util.parsing.combinator._
import scala.io.Source.fromFile
class MyParser extends RegexParsers {
override def skipWhitespace(): Boolean = false
def ws: Parser[String] = "[ \t]+".r
def key: Parser[String] = """[a-zA-Z]+""".r ^^ { _.toString }
def eq: Parser[String] = ws ~> """=""" <~ ws
def string: Parser[String] = """[^ \t\n]+""".r
def value: Parser[List[String]] = repsep(string, ws)
def foo(key: String, value: String): Boolean = {
print(key + " = " + value)
true
}
def parse1: Parser[Boolean] = (key ~ eq ~ string) ^^ { case k ~ e ~ v => foo(k, v) }
def parse2: Parser[Boolean] = (key ~ eq ~ value) ^^ { case k ~ e ~ v => foo(k, v.toString) }
def parseLine(line: String): Boolean = {
parseAll(parse2, line) match {
case Success(matched, _) => true
case Failure(msg, _) => false
case Error(msg, _) => false
}
}
}
val mp = new MyParser()
for (line <- List("hey = hou", "hello = world ppl", "foo = bar baz blup", "foo= bar baz", "foo =bar baz")) {
println(" (Matches: " + mp.parseLine(line) + ")")
}
Now the parser rejects the lines where there is no whitespace around the equal sign:
hey = List(hou) (Matches: true)
hello = List(world, ppl) (Matches: true)
foo = List(bar, baz, blup) (Matches: true)
(Matches: false)
(Matches: false)
The bug with * instead of + in string has been removed, just like in the previous version.
I tried to parse an input of two Ints and some elements at the end:
import scala.util.parsing.combinator.JavaTokenParsers
class X extends JavaTokenParsers {
lazy val elems = elem("wrong elem", "#WB-" contains _)
lazy val lists = repsep(rep(elems), ",")
lazy val p1 = int ~ int ~ lists
lazy val p2 = int ~ int ~ (whiteSpace ~> lists)
def go[A](p: Parser[A]) = parseAll(p, "1 2 WB#,---,BBB") match {
case NoSuccess(msg, _) => sys.error(msg)
case _ =>
}
lazy val int: Parser[Int] =
wholeNumber ^^ {
try _.toInt catch {
case e: NumberFormatException => sys.error("invalid number")
}
}
}
An example input is given in method go. The Ints and the elements at the end have to be delimited by spaces. But this works only for the Ints and not for the elements. When I type in
val x = new X
x go x.p1
I get following error:
java.lang.RuntimeException: string matching regex `\z' expected but `W' found
But when I type in
x go x.p2
I get:
java.lang.RuntimeException: string matching regex `\s+' expected but `W' found
At the end I want to have a Parser[Int ~ Int ~ List[List[Char]]]. Why does inserting white spaces in front of elem not work? And how can I get this code to work?
Just replace elems with a regex parser:
import scala.util.parsing.combinator.JavaTokenParsers
class X extends JavaTokenParsers {
lazy val elems = "[#WB-]".r
lazy val lists = repsep(rep(elems), ",")
lazy val p1 = int ~ int ~ lists
def go[A](p: Parser[A]) = parseAll(p, "1 2 WB#,---,BBB") match {
case NoSuccess(msg, _) => sys.error(msg)
case _ =>
}
lazy val int: Parser[Int] =
wholeNumber ^^ {
try _.toInt catch {
case e: NumberFormatException => sys.error("invalid number")
}
}
}
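I have removed p2 because it is no longer needed: regex and literal parsers in RegexParsers skip leading whitespace automatically, whereas elem does not.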