Using Scala parser combinators, I have parsed some text input and created some of my own types along the way. The result prints fine. Now I need to walk through the output, which I presume is a nested structure that includes the types I created. How do I go about this?
This is how I call the parser:
GMParser1.parseItem(i_inputHard_2) match {
case GMParser1.Success(res, _) =>
println(">" + res + "< of type: " + res.getClass.getSimpleName)
case x =>
println("Could not parse the input string:" + x)
}
EDIT
What I get back from MsgResponse(O2 Flow|Off) is:
>(MsgResponse~(O2 Flow~Off))< of type: $tilde
And what I get back from WthrResponse(Id(Tube 25,Carbon Monoxide)|0.20) is:
>(WthrResponse~IdWithValue(Id(Tube 25,Carbon Monoxide),0.20))< of type: $tilde
Just to give context to the question here is some of the input parsing. I will want to get at Id:
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
And here is Id being created:
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
And here is the whole of the parser:
object GMParser1 extends RegexParsers {
override def skipWhitespace = false
def number = regex(new Regex("[-+]?(\\d*[.])?\\d+"))
def idContents = text ~ ("," ~> text)
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
def text = """[A-Za-z0-9* ]+""".r
def wholeWord = """[A-Za-z]+""".r
def idBracketContents = id ~ ( "|" ~> number ) ^^ { contents => IdWithValue(contents._1,contents._2) }
def nonIdBracketContents = text ~ ( "|" ~> text )
def bracketContents = idBracketContents | nonIdBracketContents
def outerBrackets = "(" ~> bracketContents <~ ")"
def target = wholeWord ~ outerBrackets
def parseItem(str: String): ParseResult[Any] = parse(target, str)
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
case class IdWithValue(leftContents:Id,numberContents:String) extends Keeper
}
The parser created by the ~ operator produces a value of the ~ case class. To get at its contents, you can pattern match on it like on any other case class (keeping in mind that its name is symbolic, so it's used infix).
So you can replace case GMParser1.Success(res, _) => ... with case GMParser1.Success(functionName ~ argument, _) => ... to get at the function name and argument (or whatever the semantics of wholeWord and bracketContents in wholeWord "(" bracketContents ")" are). Outside the parser object the ~ extractor has to be in scope first, for example via import GMParser1._. You can then similarly use a nested pattern to get at the individual parts of the argument.
You could (and probably should) also use ^^ together with pattern matching in your rules to create a more meaningful AST structure that doesn't contain ~. This would be useful to distinguish a nonIdBracketContents result from a bracketContents result for example.
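As a minimal sketch of the nested pattern match described above, the following trims the question's grammar down to the parts needed for the two sample inputs. The names TildeDemo and describe are hypothetical, and the grammar is a simplified assumption rather than the asker's full parser.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical, trimmed-down version of the grammar from the question.
object TildeDemo extends RegexParsers {
  override def skipWhitespace = false

  case class Id(leftContents: String, rightContents: String)

  def text: Parser[String] = """[A-Za-z0-9* ]+""".r
  def word: Parser[String] = """[A-Za-z]+""".r
  def number: Parser[String] = """[-+]?(\d*[.])?\d+""".r
  def id: Parser[Id] =
    "Id(" ~> (text ~ ("," ~> text)) <~ ")" ^^ { case l ~ r => Id(l, r) }

  // name(Id(a,b)|number)  or  name(text|text)
  def target: Parser[Any] =
    word ~ ("(" ~> (id ~ ("|" ~> number) | text ~ ("|" ~> text)) <~ ")")

  // Walk the nested ~ structure with nested patterns.
  def describe(input: String): String = parseAll(target, input) match {
    case Success(name ~ (Id(l, r) ~ value), _) => s"$name: id=($l, $r) value=$value"
    case Success(name ~ (key ~ value), _)      => s"$name: $key -> $value"
    case other                                 => s"no match: $other"
  }
}
```

Because the patterns live inside the parser object, ~ and Success are already in scope; from outside you would qualify or import them.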
Related
I'm trying to use a JavaTokenParsers combinator parser to pull out a particular match that's in the middle of a larger string (i.e. ignore an arbitrary run of prefix characters). However, I can't get it working and think I'm getting caught out by a greedy parser and/or CR/LF characters (the prefix characters can be basically anything). I have:
class RuleHandler extends JavaTokenParsers {
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r
def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
and then in my test case ..
"when looking for the X value" in {
"must find and correctly interpret X" in {
val testString =
"""
|Looking (only)
|for (x=45) within
|this string
""".stripMargin
val answer = ruleHandler.parse(ruleHandler.findX, testString)
System.out.println(" X value is : " + answer.toString)
}
}
I think it's similar to this SO question. Can anyone see what's wrong? Thanks.
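First, you should not double the backslash in \\s inside a triple-quoted string, because triple-quoted strings do not process escape sequences:
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r
In your version the character class contained a literal backslash and the letter s, not the whitespace class \s.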
Second, your allowedPrefixChars parser includes (, x, =, so it captures the whole string, including (x=, nothing is left to subsequent parsers.
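The effect can be reproduced in isolation. This is a small sketch with hypothetical names (GreedyPrefix, broken, working) and a much smaller character class than the one in the question. It shows that a regex parser commits to its single match and the following parser cannot make it give characters back.

```scala
import scala.util.parsing.combinator.RegexParsers

object GreedyPrefix extends RegexParsers {
  // Prefix class that also contains '(' and '=': it swallows "(x=" too.
  def greedyPrefix: Parser[String] = """[a-z(=]*""".r
  // Prefix class without '(' and '=': it stops right before "(x=".
  def narrowPrefix: Parser[String] = """[a-z]*""".r
  def digits: Parser[String] = """\d+""".r

  def broken: Parser[String]  = greedyPrefix ~ "(x=" ~> digits
  def working: Parser[String] = narrowPrefix ~ "(x=" ~> digits
}
```

With the input "ab(x=45", broken fails because the prefix has already consumed "(x=", while working yields the digits.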
The solution is to be more specific about the prefix you want:
object ruleHandler extends JavaTokenParsers {
def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here
def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
ruleHandler.parse(ruleHandler.findX, testString)
res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0
The prefix parser now accepts a ( only when it is not followed by x= (that is what the negative lookahead (?!x=) does), so the (x= token is left for the next parser.
Alternative:
"""\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
res22: List[Double] = List(45.0)
If you want to use parsers properly, I would recommend describing the whole BNF grammar (with all possible uses of (, ) and =), not just a fragment. For example, include (only) in your parser if it is a keyword, and use "(" ~> valueName <~ "=" ~ value to get the value. Don't forget that scala-parser-combinators is intended to return an AST, not just some matched value. Plain regular expressions are better for simple matching against unstructured data.
An example of how the parsers could be used properly (I didn't try to compile it):
import scala.util.parsing.combinator.JavaTokenParsers

trait Command
case class Rule(name: String, value: Double) extends Command
case class Directive(name: String) extends Command
class RuleHandler extends JavaTokenParsers { //why `JavaTokenParsers` (not `RegexParsers`) if you don't use tokens from Java Language Specification ?
def string = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r //it's still wrong you should use some predefined Java-like literals from **JavaToken**Parsers
// parenthesized so that both the name and the value are kept; chaining only ~> and <~ would drop one of them
def rule: Parser[Command] = ("(" ~> string <~ "=") ~ (string <~ ")") ^^ { case name ~ num => Rule(name, num.toDouble) }
def directive: Parser[Command] = "(" ~> string <~ ")" ^^ { case name => Directive(name) }
// repsep yields a list of commands, with `string` as the filler between them
def commands: Parser[List[Command]] = repsep(rule | directive, string)
}
If you need to process natural language (Chomsky type-0), scalanlp or something similar fits better.
I'm trying to write a SemVer (http://semver.org) parser in Scala using parser combinators, as a sort of familiarisation with them.
This is my current code:
case class SemVer(major: Int, minor: Int, patch: Int, prerelease: Option[List[String]], metadata: Option[List[String]]) {
override def toString = s"$major.$minor.$patch" + prerelease.map("-" + _.mkString(".")).getOrElse("") + metadata.map("+" + _.mkString(".")).getOrElse("")
}
class VersionParser extends RegexParsers {
def number: Parser[Int] = """(0|[1-9]\d*)""".r ^^ (_.toInt)
def separator: Parser[String] = """\.""".r
def prereleaseSeparator: Parser[String] = """-""".r
def metadataSeparator: Parser[String] = """\+""".r
def identifier: Parser[String] = """([0-9A-Za-z-])+""".r ^^ (_.toString)
def prereleaseIdentifiers: Parser[List[String]] = (number | identifier) ~ rep(separator ~> (number | identifier)) ^^ {
case first ~ rest => List(first.toString) ++ rest.map(_.toString)
}
def metadataIdentifiers: Parser[List[String]] = identifier ~ rep(separator ~> identifier) ^^ {
case first ~ rest => List(first.toString) ++ rest.map(_.toString)
}
}
I'd like to know how I should parse identifiers for the prerelease section, because the spec disallows leading zeros in numeric identifiers, and when I try to parse with my current parser, input with leading zeros (e.g. "01.2.3") simply becomes a list containing the element 0.
More generically, how should I detect that the string does not conform to the SemVer spec and consequently force a failure condition?
After some playing around and some searching, I've discovered the issue was that I was calling the parse method instead of the parseAll method. Since parse basically parses as much as it can, ending when it can't parse anymore, it is possible for it to accept partially correct strings. Using parseAll forces all the input to be parsed, and it fails if there is input remaining once parsing stops. This is exactly what I was looking for.
For the sake of completeness, I'd add the following method to VersionParser:
def version = number ~ (separator ~> number) ~ (separator ~> number) ~ ((prereleaseSeparator ~> prereleaseIdentifiers)?) ~ ((metadataSeparator ~> metadataIdentifiers)?) ^^ {
case major ~ minor ~ patch ~ prerelease ~ metadata => SemVer(major, minor, patch, prerelease, metadata)
}
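To illustrate the difference between parse and parseAll mentioned in this answer, here is a small sketch that only handles the major.minor.patch core. The object name and the helpers lenient and strict are hypothetical and not part of the asker's VersionParser.

```scala
import scala.util.parsing.combinator.RegexParsers

object ParseVsParseAll extends RegexParsers {
  // Numeric identifier without leading zeros, as in the question's number rule.
  def number: Parser[Int] = """0|[1-9]\d*""".r ^^ (_.toInt)

  def core: Parser[List[Int]] =
    number ~ ("." ~> number) ~ ("." ~> number) ^^ {
      case major ~ minor ~ patch => List(major, minor, patch)
    }

  // parse stops where the grammar stops matching and still reports success.
  def lenient(s: String): Boolean = parse(core, s).successful

  // parseAll additionally requires that no input is left over.
  def strict(s: String): Boolean = parseAll(core, s).successful
}
```

For "1.2.03" the patch rule matches only the leading 0, so lenient accepts the string with "3" left unread, whereas strict rejects it.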
Let's say I have the following:
case class Var(s: String)
class MyParser extends JavaTokenParsers {
def variableExpr = "?" ~ identifier ^^ { case "?" ~ id => Var(id) }
def identifier = //...
}
I want this to accept inputs of the form ?X but not ? X (with a space in between). How would this be expressed?
Thanks!
JavaTokenParsers by default skips whitespace before every string and regex parser. You can change this behavior like this:
override def skipWhitespace = false
Now you have to specify all white spaces manually:
def ws: Parser[Seq[Char]] = rep(' ')
def variableExpr = ws ~> "?" ~ identifier ^^ { case "?" ~ id => Var(id) }
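Put together, a runnable version might look like the sketch below. It assumes that the elided identifier rule can be stood in for by the ident parser that JavaTokenParsers provides, and the object name VarParser is hypothetical.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

object VarParser extends JavaTokenParsers {
  case class Var(s: String)

  // Disable automatic whitespace skipping between parsers.
  override def skipWhitespace = false

  // Whitespace is now explicit: zero or more spaces.
  def ws: Parser[List[Char]] = rep(elem(' '))

  // Spaces are allowed around the expression, but not between '?' and the name.
  // `ident` is an assumption standing in for the question's identifier rule.
  def variableExpr: Parser[Var] =
    ws ~> ("?" ~> ident) <~ ws ^^ (name => Var(name))
}
```

With this definition, parseAll accepts "?X" and " ?X " but rejects "? X".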
I have a working parser, but I've just realised I do not cater for comments. In the DSL I am parsing, comments start with a ; character. If a ; is encountered, the rest of the line is ignored (not all of it however, unless the first character is ;).
I am extending RegexParsers for my parser and ignoring whitespace (the default way), so I am losing the newline characters anyway. I don't wish to modify each and every parser I have to cater for the possibility of comments either, because statements can span multiple lines (thus each part of each statement may end with a comment). Is there any clean way to achieve this?
One thing that may influence your choice is whether comments can be found within your valid parsers. For instance let's say you have something like:
val p = "(" ~> "[a-z]*".r <~ ")"
which would parse something like ( abc ) but because of comments you could actually encounter something like:
( ; comment goes here
abc
)
Then I would recommend using TokenParsers or one of its subclasses. It's more work because you have to provide a lexical parser that does a first pass to discard the comments. But it is also more flexible if you have nested comments, if the ; can be escaped, or if the ; can appear inside a string literal like:
abc = "; don't ignore this" ; ignore this
On the other hand, you could also try to override the value of whiteSpace to be something like
override protected val whiteSpace = """(\s|;.*)+""".r
Or something along those lines.
For instance using the example from the RegexParsers scaladoc:
import scala.util.parsing.combinator.RegexParsers
object so1 {
Calculator("""(1 + ; foo
(1 + 2))
; bar""")
}
object Calculator extends RegexParsers {
override protected val whiteSpace = """(\s|;.*)+""".r
def number: Parser[Double] = """\d+(\.\d*)?""".r ^^ { _.toDouble }
def factor: Parser[Double] = number | "(" ~> expr <~ ")"
def term: Parser[Double] = factor ~ rep("*" ~ factor | "/" ~ factor) ^^ {
case number ~ list => (number /: list) {
case (x, "*" ~ y) => x * y
case (x, "/" ~ y) => x / y
}
}
def expr: Parser[Double] = term ~ rep("+" ~ log(term)("Plus term") | "-" ~ log(term)("Minus term")) ^^ {
case number ~ list => list.foldLeft(number) { // same as before, using alternate name for /:
case (x, "+" ~ y) => x + y
case (x, "-" ~ y) => x - y
}
}
def apply(input: String): Double = parseAll(expr, input) match {
case Success(result, _) => result
case failure: NoSuccess => scala.sys.error(failure.msg)
}
}
This prints:
Plus term --> [2.9] parsed: 2.0
Plus term --> [2.10] parsed: 3.0
res0: Double = 4.0
Just filter out all the comments with a regex before you pass the code into your parser.
def removeComments(input: String): String = {
"""(?ms)\".*?\"|;.*?$|.+?""".r.findAllIn(input).map(str => if(str.startsWith(";")) "" else str).mkString
}
val code =
"""abc "def; ghij"
abc ;this is a comment
def"""
println(removeComments(code))
I'm wondering if it's possible to get the MatchData generated by the matching regular expression in the grammar below.
object DateParser extends JavaTokenParsers {
....
val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
... get MatchData
}
}
One option of course is to perform the match again inside the block, but since the RegexParser has already performed the match I'm hoping that it passes the MatchData to the block, or stores it?
Here is the implicit definition that converts your Regex into a Parser:
/** A parser that matches a regex string */
implicit def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(source.subSequence(start, start + matched.end).toString,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
Just adapt it:
object X extends RegexParsers {
/** A parser that matches a regex string and returns the Match */
def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(matched,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}
Example:
scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)
No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55
You have a couple of other options, though:
break up your parser into several smaller parsers (for the tokens you actually want to extract)
define a custom parser that extracts the values you want and returns a domain object instead of a string
The first would look like
val separator = "-" | "/"
val year = ("""\d{4}"""r) <~ separator
val month = ("""\d\d"""r) <~ separator
val day = """\d\d"""r
val date = ((year?) ~ (month?) ~ day) map {
case year ~ month ~ day =>
(year.getOrElse("2009"), month.getOrElse("11"), day)
}
The <~ means "require these two tokens together, but only give me the result of the first one".
The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object".
The ? means that the parser is optional and will return an Option.
The .getOrElse bit provides a default value for when the optional parser matched nothing.
When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to apply that Regex to the input. The Match instance yielded upon successful application of the regex at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.
As it stands (in the 2.7 source I looked at), you're out of luck, I believe.
I ran into a similar issue using Scala 2.8.1 while trying to parse input of the form "name:value" using the RegexParsers class:
package scalucene.query
import scala.util.matching.Regex
import scala.util.parsing.combinator._
object QueryParser extends RegexParsers {
override def skipWhitespace = false
private def quoted = regex(new Regex("\"[^\"]+"))
private def colon = regex(new Regex(":"))
private def word = regex(new Regex("\\w+"))
private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
private def term = (fielded | word | quoted)
def parseItem(str: String) = parse(term, str)
}
It seems that you can grab the matched groups after parsing like this:
QueryParser.parseItem("nameExample:valueExample") match {
case QueryParser.Success(result:scala.util.parsing.combinator.Parsers$$tilde, _) => {
println("Name: " + result.productElement(0) + " value: " + result.productElement(1))
}
}
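A typed alternative to reading productElement is to pattern match on ~ directly. The sketch below uses hypothetical names (FieldParser, nameAndValue) and a simplified grammar in which both sides are plain words, so it is an assumption about the intent rather than a drop-in replacement for QueryParser.

```scala
import scala.util.parsing.combinator.RegexParsers

object FieldParser extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r

  // name:value, keeping both sides and dropping the colon.
  def fielded: Parser[String ~ String] = (word <~ ":") ~ word

  // Destructure the ~ result into an ordinary tuple.
  def nameAndValue(input: String): Option[(String, String)] =
    parseAll(fielded, input) match {
      case Success(name ~ value, _) => Some((name, value))
      case _                        => None
    }
}
```

Keeping the result type as String ~ String, rather than letting it widen to Any, is what allows the pattern to bind name and value as strings.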