How to ignore single line comments in a parser-combinator - scala

I have a working parser, but I've just realised I do not cater for comments. In the DSL I am parsing, comments start with a ; character. If a ; is encountered, the rest of the line is ignored (not all of it however, unless the first character is ;).
I am extending RegexParsers for my parser and ignoring whitespace (the default way), so I am losing the new line characters anyway. I don't wish to modify each and every parser I have to cater for the possibility of comments either, because statements can span across multiple lines (thus each part of each statement may end with a comment). Is there any clean way to acheive this?

One thing that may influence your choice is whether comments can be found within your valid parsers. For instance let's say you have something like:
val p = "(" ~> "[a-z]*".r <~ ")"
which would parse something like ( abc ) but because of comments you could actually encounter something like:
( ; comment goes here
abc
)
Then I would recommend using a TokenParser or one of its subclass. It's more work because you have to provide a lexical parser that will do a first pass to discard the comments. But it is also more flexible if you have nested comments or if the ; can be escaped or if the ; can be inside a string literal like:
abc = "; don't ignore this" ; ignore this
On the other hand, you could also try to override the value of whitespace to be something like
override protected val whiteSpace = """(\s|;.*)+""".r
Or something along those lines.
For instance using the example from the RegexParsers scaladoc:
import scala.util.parsing.combinator.RegexParsers
object so1 {
Calculator("""(1 + ; foo
(1 + 2))
; bar""")
}
object Calculator extends RegexParsers {
override protected val whiteSpace = """(\s|;.*)+""".r
def number: Parser[Double] = """\d+(\.\d*)?""".r ^^ { _.toDouble }
def factor: Parser[Double] = number | "(" ~> expr <~ ")"
def term: Parser[Double] = factor ~ rep("*" ~ factor | "/" ~ factor) ^^ {
case number ~ list => (number /: list) {
case (x, "*" ~ y) => x * y
case (x, "/" ~ y) => x / y
}
}
def expr: Parser[Double] = term ~ rep("+" ~ log(term)("Plus term") | "-" ~ log(term)("Minus term")) ^^ {
case number ~ list => list.foldLeft(number) { // same as before, using alternate name for /:
case (x, "+" ~ y) => x + y
case (x, "-" ~ y) => x - y
}
}
def apply(input: String): Double = parseAll(expr, input) match {
case Success(result, _) => result
case failure: NoSuccess => scala.sys.error(failure.msg)
}
}
This prints:
Plus term --> [2.9] parsed: 2.0
Plus term --> [2.10] parsed: 3.0
res0: Double = 4.0

Just filter out all the comments with a regex before you pass the code into your parser.
def removeComments(input: String): String = {
"""(?ms)\".*?\"|;.*?$|.+?""".r.findAllIn(input).map(str => if(str.startsWith(";")) "" else str).mkString
}
val code =
"""abc "def; ghij"
abc ;this is a comment
def"""
println(removeComments(code))

Related

SCALA: How to convert a Parser Combinator result to Scala List[String]?

I am trying to write a parser for a language very similar to Milner's CCS. Basically what I am parsing so far are expressions of the following sort:
a.b.a.1
a.0
An expression must start with a letter (excluding t) and could have any number of letters following the first letter (separated by a '.'). The Expression must terminate with a digit (for simplicity I chose digits between 0 and 2 for now). I want to use Parser Combinators for Scala, however this is the first time that I am working with them. This is what I have so far:
import scala.util.parsing.combinator._
class SimpleParser extends RegexParsers {
def alpha: Parser[String] = """[^t]{1}""".r ^^ { _.toString }
def digit: Parser[Int] = """[0-2]{1}""".r ^^ { _.toInt }
def expr: Parser[Any] = alpha ~ "." ~ digit ^^ {
case al ~ "." ~ di => List(al, di)
}
def simpleExpression: Parser[Any] = alpha ~ "." ~ rep(alpha ~ ".") ~ digit //^^ { }
}
As you can see in def expr :Parser[Any] I am trying to return the result as a list, since Lists in Scala are very easy to work with (in my opinion). Is this the correct way how to convert a Parser[Any] result to a List? Can anyone give me any tips on how I can do this for def simpleExpression:Parser[Any].
The main reason why I want to use Lists is because after parsing and Expression I want to be able to consume it. For example, given the expression a.b.1, if I am given an 'a', I would like to consume the expression to end up with a new expression: b.1 (i.e. a.b.1 ->(a)-> b.1). The idea behind this is to simulate finite state automatas. Any tips on how I may improve my implementation are appreciated.
To keeps things type safe, I recommend a parser that produces a tuple of a list of strings and an int. That is, the input a.b.a.1 would get parsed as (List("a", "b", "a"), 1). Note also that the regex for alpha was modified to exclude anything that is not a lowercase letter (in addition to t).
class SimpleParser extends RegexParsers {
def alpha: Parser[String] = """[a-su-z]{1}""".r ^^ { _.toString }
def digit: Parser[Int] = """[0-2]{1}""".r ^^ { _.toInt }
def repAlpha: Parser[List[String]] = rep1sep(alpha, ".")
def expr: Parser[(List[String], Int)] = repAlpha ~ "." ~ digit ^^ {
case alphas ~ _ ~ num =>
(alphas, num)
}
}
With an instance of this SimpleParser, here's the output I got:
println(parser.parse(parser.expr, "a.b.a.1"))
// [1.8] parsed: (List(a, b, a),1)
println(parser.parse(parser.expr, "a.0"))
// [1.4] parsed: (List(a),0)

Scala parsing, how to go through what has been collected

Using scala parser combinators I have parsed some text input and created some of my own types along the way. The result prints fine. Now I need to go through the output, which I presume is a nested structure that includes the types I created. How do I go about this?
This is how I call the parser:
GMParser1.parseItem(i_inputHard_2) match {
case GMParser1.Success(res, _) =>
println(">" + res + "< of type: " + res.getClass.getSimpleName)
case x =>
println("Could not parse the input string:" + x)
}
EDIT
What I get back from MsgResponse(O2 Flow|Off) is:
>(MsgResponse~(O2 Flow~Off))< of type: $tilde
And what I get back from WthrResponse(Id(Tube 25,Carbon Monoxide)|0.20) is:
>(WthrResponse~IdWithValue(Id(Tube 25,Carbon Monoxide),0.20))< of type: $tilde
Just to give context to the question here is some of the input parsing. I will want to get at Id:
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
And here is Id being created:
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
And here is the whole of the parser:
object GMParser1 extends RegexParsers {
override def skipWhitespace = false
def number = regex(new Regex("[-+]?(\\d*[.])?\\d+"))
def idContents = text ~ ("," ~> text)
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
def text = """[A-Za-z0-9* ]+""".r
def wholeWord = """[A-Za-z]+""".r
def idBracketContents = id ~ ( "|" ~> number ) ^^ { contents => IdWithValue(contents._1,contents._2) }
def nonIdBracketContents = text ~ ( "|" ~> text )
def bracketContents = idBracketContents | nonIdBracketContents
def outerBrackets = "(" ~> bracketContents <~ ")"
def target = wholeWord ~ outerBrackets
def parseItem(str: String): ParseResult[Any] = parse(target, str)
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
case class IdWithValue(leftContents:Id,numberContents:String) extends Keeper
}
The parser created by the ~ operator produces a value of the ~ case class. To get at its contents, you can pattern match on it like on any other case class (keeping in mind that its name is symbolic, so it's used infix).
So you can replace case GMParser1.Success(res, _) => ... with case GMParser1.Success(functionName ~ argument) => ... to get at the function name and argument (or whatever the semantics of wholeWord and bracketContents in wholeWord "(" bracketContents ")" are). You can then similarly use a nested pattern to get at the individual parts of the argument.
You could (and probably should) also use ^^ together with pattern matching in your rules to create a more meaningful AST structure that doesn't contain ~. This would be useful to distinguish a nonIdBracketContents result from a bracketContents result for example.

Ignoring prefixes in a JavaToken combinator parser

I'm trying to use a JavaToken combinator parser to pull out a particular match that's in the middle of larger string (ie ignore a random set of prefix chars). However I can't get it working and think I'm getting caught out by a greedy parser and/or CRs LFs. (the prefix chars can be basically anything). I have:
class RuleHandler extends JavaTokenParsers {
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r
def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
and then in my test case ..
"when looking for the X value" in {
"must find and correctly interpret X" in {
val testString =
"""
|Looking (only)
|for (x=45) within
|this string
""".stripMargin
val answer = ruleHandler.parse(ruleHandler.findX, testString)
System.out.println(" X value is : " + answer.toString)
}
}
I think it's similar to this SO question. Can anyone see whats wrong pls ? Tks.
First, you should not escape "\\s" twice inside """ """:
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r
In your case it was interpreted separately "\" or "s" (s as symbol, not \s)
Second, your allowedPrefixChars parser includes (, x, =, so it captures the whole string, including (x=, nothing is left to subsequent parsers.
The solution is to be more concrete about prefix you want:
object ruleHandler extends JavaTokenParsers {
def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here
def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
ruleHandler.parse(ruleHandler.findX, testString)
res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0
I've told the parser to ignore (, that has x= going after (it's just negative lookahead).
Alternative:
"""\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
res22: List[Double] = List(45.0)
If you want to use parsers correctly, I would recommend you to describe the whole BNF grammar (with all possible (,) and = usages) - not just fragment. For example, include (only) into your parser if it's keyword, "(" ~> valueName <~ "=" ~ value to get value. Don't forget that scala-parser is intended to return you AST, not just some matched value. Pure regexps are better for regular matching from unstructured data.
Example how it would like to use parsers in correct way (didn't try to compile):
trait Command
case class Rule(name: String, value: Double) extends Command
case class Directive(name: String) extends Command
class RuleHandler extends JavaTokenParsers { //why `JavaTokenParsers` (not `RegexParsers`) if you don't use tokens from Java Language Specification ?
def string = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r //it's still wrong you should use some predefined Java-like literals from **JavaToken**Parsers
def rule = "(" ~> string <~ "=" ~> string <~ ")" ^^ { case name ~ num => Rule(name, num.toDouble} }
def directive = "(" ~> string <~ ")" ^^ { case name => Directive(name) }
def commands: Parser[Command] = repsep(rule | directive, string)
}
If you need to process natural language (Chomsky type-0), scalanlp or something similar fits better.

How to skip whitespace but use it as a token delimeter in a parser combinator

I am trying to build a small parser where the tokens (luckily) never contain whitespace. Whitespace (spaces, tabs and newlines) are essentially token delimeters (apart from cases where there are brackets etc.).
I am extending the RegexParsers class. If I turn on skipWhitespace the parser is greedily joining tokens together when the next token matches the regular expression of the previous one. If I turn off skipWhitespace, on the other hand, it complains because of the spaces not being part of the definition. I am trying to match the BNF as much as possible, and given that whitespace is almost always the delimeter (apart from brackets or some other cases where the delimeter is explicitly defined in the BNF), is there away to avoid putting whitespace regex in all my definitions?
UPDATE
This is a small test example where the tokens are being joined together:
import scala.util.parsing.combinator.RegexParsers
object TestParser extends RegexParsers {
def test = "(test" ~> name <~ ")"
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r
def main(args: Array[String]) {
val s = "(test hello these should not be joined and I should get an error)"
val res = parseAll(test, s)
res match {
case Success(r, n) => println(r)
case Failure(msg, n) => println(msg)
case Error(msg, n) => println(msg)
}
}
}
In the above case I just get the string joined together.
A similar effect is if I change test to the following, expecting it to give me the list of separate words after test, but instead it joins them together and just gives me a one element list with a long string, without the middle spaces:
def test = "(test" ~> (name+) <~ ")"
White space is skipped just before every production rule. So, in this snippet:
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
It will skip whitespace before each letter and, even worse, each empty string for good measure (since anyChar* can be empty).
Use regular expressions (or plain strings) for each token, not each lexical element. Like this:
object TestParser extends RegexParsers {
def test = "(test" ~> name <~ ")"
def name : Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r
// ...

Grammars, Scala Parsing Combinators and Orderless Sets

I'm writing an application that will take in various "command" strings. I've been looking at the Scala combinator library to tokenize the commands. I find in a lot of cases I want to say: "These tokens are an orderless set, and so they can appear in any order, and some might not appear".
With my current knowledge of grammars I would have to define all combinations of sequences as such (pseudo grammar):
command = action~content
action = alphanum
content = (tokenA~tokenB~tokenC | tokenB~tokenC~tokenA | tokenC~tokenB~tokenA ....... )
So my question is, considering tokenA-C are unique, is there a shorter way to define a set of any order using a grammar?
You can use the "Parser.^?" operator to check a group of parse elements for duplicates.
def tokens = tokenA | tokenB | tokenC
def uniqueTokens = (tokens*) ^? (
{ case t if (t == t.removeDuplicates) => t },
{ "duplicate tokens found: " + _ })
Here is an example that allows you to enter any of the four stooges in any order, but fails to parse if a duplicate is encountered:
package blevins.example
import scala.util.parsing.combinator._
case class Stooge(name: String)
object StoogesParser extends RegexParsers {
def moe = "Moe".r
def larry = "Larry".r
def curly = "Curly".r
def shemp = "Shemp".r
def stooge = ( moe | larry | curly | shemp ) ^^ { case s => Stooge(s) }
def certifiedStooge = stooge | """\w+""".r ^? (
{ case s: Stooge => s },
{ "not a stooge: " + _ })
def stooges = (certifiedStooge*) ^? (
{ case x if (x == x.removeDuplicates) => x.toSet },
{ "duplicate stooge in: " + _ })
def parse(s: String): String = {
parseAll(stooges, new scala.util.parsing.input.CharSequenceReader(s)) match {
case Success(r,_) => r.mkString(" ")
case Failure(r,_) => "failure: " + r
case Error(r,_) => "error: " + r
}
}
}
And some example usage:
package blevins.example
object App extends Application {
def printParse(s: String): Unit = println(StoogesParser.parse(s))
printParse("Moe Shemp Larry")
printParse("Moe Shemp Shemp")
printParse("Curly Beyonce")
/* Output:
Stooge(Moe) Stooge(Shemp) Stooge(Larry)
failure: duplicate stooge in: List(Stooge(Moe), Stooge(Shemp), Stooge(Shemp))
failure: not a stooge: Beyonce
*/
}
There are ways around it. Take a look at the parser here, for example. It accepts 4 pre-defined numbers, which may appear in any other, but must appear once, and only once.
OTOH, you could write a combinator, if this pattern happens often:
def comb3[A](a: Parser[A], b: Parser[A], c: Parser[A]) =
a ~ b ~ c | a ~ c ~ b | b ~ a ~ c | b ~ c ~ a | c ~ a ~ b | c ~ b ~ a
I would not try to enforce this requirement syntactically. I'd write a production that admits multiple tokens from the set allowed and then use a non-parsing approach to ascertaining the acceptability of the keywords actually given. In addition to allowing a simpler grammar, it will allow you to more easily continue parsing after emitting a diagnostic about the erroneous usage.
Randall Schulz
I don't know what kind of constructs you want to support, but I gather you should be specifying a more specific grammar. From your comment to another answer:
todo message:link Todo class to database
I guess you don't want to accept something like
todo message:database Todo to link class
So you probably want to define some message-level keywords like "link" and "to"...
def token = alphanum~':'~ "link" ~ alphanum ~ "class" ~ "to" ~ alphanum
^^ { (a:String,b:String,c:String) => /* a == "message", b="Todo", c="database" */ }
I guess you would have to define your grammar at that level.
You could of course write a combination rule that does this for you if you encounter this situation frequently.
On the other hand, maybe the option exists to make "tokenA..C" just "token" and then differentiate inside the handler of "token"