I'm trying to write a simple parser in scala but when I add a repeated token Scala seems to get stuck in an infinite loop...
object RewriteRuleParsers extends RegexParsers {
private def space = regex(new Regex("[ \\n]+"))
private def number = regex(new Regex("[0-9]+"))
private def equals = (space?)~"="~(space?)
private def word = regex(new Regex("[a-zA-Z][a-zA-Z0-9-]*"))
private def string = regex(new Regex("[0-9]+")) >> { len => ":" ~> regex(new Regex(".{" + len + "}")) }
private def matchTokenPartContent: Parser[Any] = (space?)~word~equals~word<~ space?
private def matchTokenPart: Parser[Any] = ((space?) ~> "{" ~> matchTokenPartContent <~ "}"<~ space?)
private def matchTokenParts = (matchTokenPart *)
private def matchToken: Parser[Any] = ("[" ~> matchTokenParts ~ "]")
def parseMatchToken(str: String): ParseResult[Any] = parse(matchToken, str)
}
and the code to call it
val parseResult = RewriteRuleParsers.parseMatchToken("[{tag=hello}]")
Any advice gratefull received
The problem is the precedence of ?. Take the following, for example:
object simpleParser extends RegexParsers {
val a = literal("a")
val b = literal("b")
def apply(s: String) = this.parseAll(a <~ b?, s)
}
The a <~ b? here is interpreted as (a <~ b)?, so it will accept "ab" or "", but not "a". We'd need to write a <~ (b?) if we want it to accept "a".
In your case you can just parenthesize space? at the end of matchTokenPartContent and matchTokenPart and it'll work as expected.
Related
I have a combinator and a result converter that looks like so:
// parses a line like so:
//
// 2
// 00:00:01.610 --> 00:00:02.620 align:start position:0%
//
private def subtitleHeader: Parser[SubtitleBlock] = {
(subtitleNumber ~ whiteSpace).? ~>
time ~ arrow ~ time ~ opt(textLine) ~ eol
} ^^ {
case
startTime ~ _ ~ endTime ~ _ ~ _
=> SubtitleBlock(startTime, endTime, List(""))
}
Because the arrow, textline and eol are not important to my result converter, I was hoping I could use <~ and ~> in the right places within my combinator such that my converter doesn't have to deal with them. As an experiment, I changed the first ~ in the parser to <~ and removed the ~ _ where the "arrow" would be matched in the case statement like so:
private def subtitleHeader: Parser[SubtitleBlock] = {
(subtitleNumber ~ whiteSpace).? ~>
time <~ arrow ~ time ~ opt(textLine) ~ eol
} ^^ {
case
startTime ~ endTime ~ _ ~ _
=> SubtitleBlock(startTime, endTime, List(""))
}
However, I get red-squigglies in IntelliJ with the error message:
Error:(44, 31) constructor cannot be instantiated to expected type;
found : caption.vttdissector.VttParsers.~[a,b] required: Int
startTime ~ endTime ~ _ ~ _
What am I doing wrong?
Since you didn't insert any parentheses in the chain of ~ and <~, most matched subexpressions are thrown out "with the bathwater" (or rather "with the whitespace and arrows"). Just insert some parentheses.
Here is the general pattern what it should look like:
(irrelevant ~> irrelevant ~> RELEVANT <~ irrelevant <~ irrelevant) ~
(irrelevant ~> RELEVANT <~ irrelevant <~ irrelevant) ~
...
i.e. every "relevant" subexpression is surrounded by irrelevant stuff and a pair of parentheses, and then the parenthesized subexpressions are connected by ~'s.
Your example:
import scala.util.parsing.combinator._
import scala.util.{Either, Left, Right}
case class SubtitleBlock(startTime: String, endTime: String, text: List[String])
object YourParser extends RegexParsers {
def subtitleHeader: Parser[SubtitleBlock] = {
(subtitleNumber.? ~> time <~ arrow) ~
time ~
(opt(textLine) <~ eol)
} ^^ {
case startTime ~ endTime ~ _ => SubtitleBlock(startTime, endTime, Nil)
}
override val whiteSpace = "[ \t]+".r
def arrow: Parser[String] = "-->".r
def subtitleNumber: Parser[String] = "\\d+".r
def time: Parser[String] = "\\d{2}:\\d{2}:\\d{2}.\\d{3}".r
def textLine: Parser[String] = ".*".r
def eol: Parser[String] = "\n".r
def parseStuff(s: String): scala.util.Either[String, SubtitleBlock] =
parseAll(subtitleHeader, s) match {
case Success(t, _) => scala.util.Right(t)
case f => scala.util.Left(f.toString)
}
def main(args: Array[String]): Unit = {
val examples: List[String] = List(
"2 00:00:01.610 --> 00:00:02.620 align:start position:0%\n"
) ++ args.map(_ + "\n")
for (x <- examples) {
println(parseStuff(x))
}
}
}
finds:
Right(SubtitleBlock(00:00:01.610,00:00:02.620,List()))
Lets say I want to parse a string in scala, and every time there were parenthesis nested within each other I would multiply some number with itself . Ex
(()) +() + ((())) with number=3 would be 3*3 + 3 + 3*3*3. How would I do this with scala combinators.
class SimpleParser extends JavaTokenParsers {
def Base:Parser[Int] = """(""" ~remainder ~ """)"""
def Plus = atom ~ '+' ~ remainder
def Parens = Base
def remainder:Parser[Int] =(Next|Start) }
How would I make it so that every time an atom is parsed the number would multiply by itself, and then what was inside the atom will also be parsed?
would I put a method after the atom def like
def Base:Parser[Int] = """(""" ~remainder ~ """)""" ^^(2*paser(remainder))
? I don't understand how to do this because of the recursive nature of it, as if I find parenthesis, I must then multiply by three times whatever is in these parenthesis.
This is easiest if you build up the number from the inside out. For the parenthetical groups, we start with the base case (which will result in simply the number itself), and then add the number again for each nesting. For the sum, we start with a single parenthetical group and then optionally add summands until we run out:
import scala.util.parsing.combinator.JavaTokenParsers
class SimpleParser(number: Int) extends JavaTokenParsers {
def base: Parser[Int] = literal("()").map(_ => number)
def pars: Parser[Int] = base | ("(" ~> pars <~ ")").map(_ + number)
def plus: Parser[Int] = "+" ~> expr
def expr: Parser[Int] = (pars ~ opt(plus).map(_.getOrElse(0))).map {
case first ~ rest => first + rest
}
}
object ParserWith3 extends SimpleParser(3)
And then:
scala> ParserWith3.parseAll(ParserWith3.expr, "(())+()+((()))")
res0: ParserWith3.ParseResult[Int] = [1.15] parsed: 18
I'm using map because I can't stand the parsing library's little operator party, but you could replace all the maps with ^^ or ^^^ if you really wanted to.
If you use the fact that you can build right recursive rules using scala parser combinators(here mult appears on the right of its own definition for example):
import scala.util.parsing.combinator.RegexParsers
trait ExprsParsers extends RegexParsers {
val value = 3
lazy val mult: Parser[Int] =
"(" ~> mult <~ ")" ^^ { _ * value } |||
"()" ^^ { _ => value }
lazy val plus: Parser[Int] =
(mult <~ "+") ~ plus ^^ { case m ~ p => m + p } |||
mult
}
To use that code you simply create a structure that inherits ExprsParsers, e.g. :
object MainObj extends ExprsParsers {
def main(args: Array[String]): Unit = {
println(parseAll(plus, "() + ()")) //[1.8] parsed: 6
println(parseAll(plus, "() + (())")) //[1.10] parsed: 12
println(parseAll(plus, "((())) + ()")) //[1.12] parsed: 30
}
}
check scala source file for parser for any operator you don't understand.
I'm trying to write a simple parser in scala but when I add a repeated token Scala seems to get stuck in an infinite loop.
I have 2 parse methods below. One uses rep(). The non repetitive version works as expected (not what I want though) using the rep() version results in an infinite loop.
EDIT:
This was a learning example where I tired to enforce the '=' was surrounded by whitespace.
If it is helpful this is my actual test file:
a = 1
b = 2
c = 1 2 3
I was able to parse: (with the parse1 method)
K = V
but then ran into this problem when tried to expand the exercise out to:
K = V1 V2 V3
import scala.util.parsing.combinator._
import scala.io.Source.fromFile
class MyParser extends RegexParsers {
override def skipWhitespace(): Boolean = { false }
def key: Parser[String] = """[a-zA-Z]+""".r ^^ { _.toString }
def eq: Parser[String] = """\s+=\s+""".r ^^ { _.toString.trim }
def string: Parser[String] = """[^ \t\n]*""".r ^^ { _.toString.trim }
def value: Parser[List[String]] = rep(string)
def foo(key: String, value: String): Boolean = {
println(key + " = " + value)
true
}
def parse1: Parser[Boolean] = key ~ eq ~ string ^^ { case k ~ eq ~ string => foo(k, string) }
def parse2: Parser[Boolean] = key ~ eq ~ value ^^ { case k ~ eq ~ value => foo(k, value.toString) }
def parseLine(line: String): Boolean = {
parse(parse2, line) match {
case Success(matched, _) => true
case Failure(msg, _) => false
case Error(msg, _) => false
}
}
}
object TestParser {
def usage() = {
System.out.println("<file>")
}
def main(args: Array[String]) : Unit = {
if (args.length != 1) {
usage()
} else {
val mp = new MyParser()
fromFile(args(0)).getLines().foreach { mp.parseLine }
println("done")
}
}
}
Next time, please provide some concrete examples, it's not obvious what your input is supposed to look like.
Meanwhile, you can try this, maybe you find it helpful:
import scala.util.parsing.combinator._
import scala.io.Source.fromFile
class MyParser extends JavaTokenParsers {
// override def skipWhitespace(): Boolean = { false }
def key: Parser[String] = """[a-zA-Z]+""".r ^^ { _.toString }
def eq: Parser[String] = "="
def string: Parser[String] = """[^ \t\n]+""".r
def value: Parser[List[String]] = rep(string)
def foo(key: String, value: String): Boolean = {
println(key + " = " + value)
true
}
def parse1: Parser[Boolean] = key ~ eq ~ string ^^ { case k ~ eq ~ string => foo(k, string) }
def parse2: Parser[Boolean] = key ~ eq ~ value ^^ { case k ~ eq ~ value => foo(k, value.toString) }
def parseLine(line: String): Boolean = {
parseAll(parse2, line) match {
case Success(matched, _) => true
case Failure(msg, _) => false
case Error(msg, _) => false
}
}
}
val mp = new MyParser()
for (line <- List("hey = hou", "hello = world ppl", "foo = bar baz blup")) {
println(mp.parseLine(line))
}
Explanation:
JavaTokenParsers and RegexParsers treat white space differently.
The JavaTokenParsers handles the white space for you, it's not specific for Java, it works for most non-esoteric languages. As long as you are not trying to parse Whitespace, JavaTokenParsers is a good starting point.
Your string definition included a *, which caused the infinite recursion.
Your eq definition included something that messed with the empty space handling (don't do this unless it's really necessary).
Furthermore, if you want to parse the whole line, you must call parseAll,
otherwise it parses only the beginning of the string in non-greedy manner.
Final remark: for parsing key-value pairs line by line, some String.split and
String.trim would be completely sufficient. Scala Parser Combinators are a little overkill for that.
PS: Hmm... Did you want to allow =-signs in your key-names? Then my version would not work here, because it does not enforce an empty space after the key-name.
This is not a duplicate, it's a different version with RegexParsers that takes care of whitespace explicitly
If you for some reason really care about the white space, then you could stick to the RegexParsers, and do the following (notice the skipWhitespace = false, explicit parser for whitespace ws, the two ws with squiglies around the equality sign, and the repsep with explicitly specified ws):
import scala.util.parsing.combinator._
import scala.io.Source.fromFile
class MyParser extends RegexParsers {
override def skipWhitespace(): Boolean = false
def ws: Parser[String] = "[ \t]+".r
def key: Parser[String] = """[a-zA-Z]+""".r ^^ { _.toString }
def eq: Parser[String] = ws ~> """=""" <~ ws
def string: Parser[String] = """[^ \t\n]+""".r
def value: Parser[List[String]] = repsep(string, ws)
def foo(key: String, value: String): Boolean = {
print(key + " = " + value)
true
}
def parse1: Parser[Boolean] = (key ~ eq ~ string) ^^ { case k ~ e ~ v => foo(k, v) }
def parse2: Parser[Boolean] = (key ~ eq ~ value) ^^ { case k ~ e ~ v => foo(k, v.toString) }
def parseLine(line: String): Boolean = {
parseAll(parse2, line) match {
case Success(matched, _) => true
case Failure(msg, _) => false
case Error(msg, _) => false
}
}
}
val mp = new MyParser()
for (line <- List("hey = hou", "hello = world ppl", "foo = bar baz blup", "foo= bar baz", "foo =bar baz")) {
println(" (Matches: " + mp.parseLine(line) + ")")
}
Now the parser rejects the lines where there is no whitespace around the equal sign:
hey = List(hou) (Matches: true)
hello = List(world, ppl) (Matches: true)
foo = List(bar, baz, blup) (Matches: true)
(Matches: false)
(Matches: false)
The bug with * instead of + in string has been removed, just like in the previous version.
I am using the following object to parse csv. The parser seems to be working correctly except spaces are being stripped out. Could someone help me figure out where that is happening. Thanks
object CsvParser extends RegexParsers {
override protected val whiteSpace = """[ \t]""".r
def COMMA = ","
def DQUOTE = "\""
def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
def CR = "\r"
def LF = "\n"
def CRLF = "\r\n"
def TXT = "[^\",\r\n]".r
def record: Parser[List[String]] = rep1sep(field, COMMA)
def field: Parser[String] = (escaped | nonescaped)
def escaped: Parser[String] = (DQUOTE ~> ((TXT | COMMA | CR | LF | DQUOTE2)*) <~ DQUOTE) ^^ { case ls => ls.mkString("") }
def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }
def parse(s: String) = parseAll(record, s) match {
case Success(res, _) => res
case _ => List[List[String]]()
}
}
I think I figured it out. Needed to add:
override val skipWhitespace = false
I'm trying to write a parser which should parse a Prolog list (for example [1,2,3,4]) into the corresponding Scala List. I programmed the parser with Scalas parsing combinators.
My parser looks like this so far:
class PListParser extends JavaTokenParsers{
def list:Parser[List[Any]] = "[" ~> listArgs <~ "]"
def listArgs:Parser[List[Any]] = list | repsep(args, ",")
def args:Parser[String] = "(.)*".r
}
Is there any possibility to turn the type parameters of the first two parsers into something more specific? Like a general parameter for nested lists of arbitrary dimension but the same underling type.
I think it should be trees, that is the proper structure for lists nested to arbitrary depth
sealed trait Tree[A]
case class Leaf[A](value: A) extends Tree[A] {
override def toString: String = value.toString
}
case class Node[A](items: List[Tree[A]]) extends Tree[A] {
override def toString: String = "Node(" + items.mkString(", ") + ")"
}
(Do the toString as you like, but I think the default one are rather too verbose)
Then, with minor fixes to your grammar (+ parse method, just to test easily on REPL)
object PrologListParser extends JavaTokenParsers{
def list:Parser[Tree[String]] = "[" ~> listArgs <~ "]"
def listArgs:Parser[Tree[String]] = repsep(list | args, ",") ^^ {Node(_)}
def args:Parser[Tree[String]] = """([^,\[\]])*""".r ^^ {Leaf(_)}
def parse(s: String): ParseResult[Tree[String]] = parse(list, s)
}
PrologListParser.parse("[a, b, [c, [d, e], f, [g], h], [i, j], k]")
res0: PrologList.ParseResult[Tree[String]] = [1.42] parsed: Node(a, b, Node(c, Node(d, e), f, Node(g), h), Node(i, j), k)
(Not tested)
sealed trait PrologTerm
case class PInt(i: Integer) extends PrologTerm {
override def toString = i.toString
}
case class PAtom(s: String) extends PrologTerm {
override def toString = s.toString
}
case class PComplex(f: PAtom, args: List[PrologTerm]) extends PrologTerm {
override def toString = f.toString+"("+args.mkString(", ")+")"
}
case class PList(items: List[PrologTerm]) extends PrologTerm {
override def toString = "["+items.mkString(", ")+"]"
}
object PrologListParser extends JavaTokenParsers {
def term : Parser[PrologTerm] = int | complex | atom | list
def int : Parser[PInt] = wholeNumber ^^ {s => PInt(s.toInt)}
def complex : Parser[PComplex] =
(atom ~ ("(" ~> repsep(term, ",") <~ ")")) ^^ {case f ~ args => PAtom(f, args)}
def atom : Parser[PAtom] = "[a-z][a-zA-Z_]*".r ^^ {PAtom(_)}
def list : Parser[PList] = ("[" ~> repsep(term, ",") <~ "]") ^^ {PList(_)}
}