I am trying to build a simple external DSL in Scala that would be able to parse strings like:
value = "john${tom}peter${greg}${sue}meg"
In general, a substring within quotation marks contains interlaced names and names put between ${ and }.
My grammar is as following:
class Grammar extends JavaTokenParsers {
def workflow = "value" ~> "=" ~> "\"" ~> pair <~ "\""
def pair = rep(str | token)
def str = rep(char)
def char: Parser[String] = """[a-z]""".r
def token = "$" ~> "{" ~> str <~ "}"
}
and executed by:
var res = parseAll(workflow, str)
println(res)
I thought that a method def pair = rep(str | token) would make it possible to parse it properly. Not only it doesn't work but also it leads to an infinite loop within parseAll method.
How can I parse such a string then? It seems that an alternative repetition (rep) is not the right approach.
You should replace rep with rep1.
rep is always successful (unless it is an Error), so in rep(char) | token right part (token) is useless - you'll get an empty successful result of rep(char).
Also, you could either replace """[a-z]""".r with accept('a' to 'z') or define str as def str: Parser[String] = """[a-z]+""".r. Usage of Regex to match a single Char is an overkill.
Related
I'm trying to use a JavaToken combinator parser to pull out a particular match that's in the middle of larger string (ie ignore a random set of prefix chars). However I can't get it working and think I'm getting caught out by a greedy parser and/or CRs LFs. (the prefix chars can be basically anything). I have:
class RuleHandler extends JavaTokenParsers {
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r
def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
and then in my test case ..
"when looking for the X value" in {
"must find and correctly interpret X" in {
val testString =
"""
|Looking (only)
|for (x=45) within
|this string
""".stripMargin
val answer = ruleHandler.parse(ruleHandler.findX, testString)
System.out.println(" X value is : " + answer.toString)
}
}
I think it's similar to this SO question. Can anyone see whats wrong pls ? Tks.
First, you should not escape "\\s" twice inside """ """:
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r
In your case it was interpreted separately "\" or "s" (s as symbol, not \s)
Second, your allowedPrefixChars parser includes (, x, =, so it captures the whole string, including (x=, nothing is left to subsequent parsers.
The solution is to be more concrete about prefix you want:
object ruleHandler extends JavaTokenParsers {
def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here
def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
ruleHandler.parse(ruleHandler.findX, testString)
res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0
I've told the parser to ignore (, that has x= going after (it's just negative lookahead).
Alternative:
"""\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
res22: List[Double] = List(45.0)
If you want to use parsers correctly, I would recommend you to describe the whole BNF grammar (with all possible (,) and = usages) - not just fragment. For example, include (only) into your parser if it's keyword, "(" ~> valueName <~ "=" ~ value to get value. Don't forget that scala-parser is intended to return you AST, not just some matched value. Pure regexps are better for regular matching from unstructured data.
Example how it would like to use parsers in correct way (didn't try to compile):
trait Command
case class Rule(name: String, value: Double) extends Command
case class Directive(name: String) extends Command
class RuleHandler extends JavaTokenParsers { //why `JavaTokenParsers` (not `RegexParsers`) if you don't use tokens from Java Language Specification ?
def string = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r //it's still wrong you should use some predefined Java-like literals from **JavaToken**Parsers
def rule = "(" ~> string <~ "=" ~> string <~ ")" ^^ { case name ~ num => Rule(name, num.toDouble} }
def directive = "(" ~> string <~ ")" ^^ { case name => Directive(name) }
def commands: Parser[Command] = repsep(rule | directive, string)
}
If you need to process natural language (Chomsky type-0), scalanlp or something similar fits better.
I'm trying to write a SemVer (http://semver.org) parser in Scala using parser combinators, as a sort of familiarisation with them.
This is my current code:
case class SemVer(major: Int, minor: Int, patch: Int, prerelease: Option[List[String]], metadata: Option[List[String]]) {
override def toString = s"$major.$minor.$patch" + prerelease.map("-" + _.mkString(".")).getOrElse("") + metadata.map("+" + _.mkString("."))
}
class VersionParser extends RegexParsers {
def number: Parser[Int] = """(0|[1-9]\d*)""".r ^^ (_.toInt)
def separator: Parser[String] = """\.""".r
def prereleaseSeparator: Parser[String] = """-""".r
def metadataSeparator: Parser[String] = """\+""".r
def identifier: Parser[String] = """([0-9A-Za-z-])+""".r ^^ (_.toString)
def prereleaseIdentifiers: Parser[List[String]] = (number | identifier) ~ rep(separator ~> (number | identifier)) ^^ {
case first ~ rest => List(first.toString) ++ rest.map(_.toString)
}
def metadataIdentifiers: Parser[List[String]] = identifier ~ rep(separator ~> identifier) ^^ {
case first ~ rest => List(first.toString) ++ rest.map(_.toString)
}
}
I'd like to know how I should parse identifiers for the prerelease section, because it disallows leading zeros in numeric identifiers and when I try to parse using my current parser leading zeros (for e.g. in "01.2.3") simply become a list containing the element 0.
More generically, how should I detect that the string does not conform to the SemVer spec and consequently force a failure condition?
After some playing around and some searching, I've discovered the issue was that I was calling the parse method instead of the parseAll method. Since parse basically parses as much as it can, ending when it can't parse anymore, it is possible for it to accept partially correct strings. Using parseAll forces all the input to be parsed, and it fails if there is input remaining once parsing stops. This is exactly what I was looking for.
For the sake of completeness I'd add
def version = number ~ (separator ~> number) ~ (separator ~> number) ~ ((prereleaseSeparator ~> prereleaseIdentifiers)?) ~ ((metadataSeparator ~> metadataIdentifiers)?) ^^ {
case major ~ minor ~ patch ~ prerelease ~ metadata => SemVer(major, minor, patch, prerelease, metadata)
}
method to VersionParser
Problem
I want to parse line like this:
fieldName: value1|value2 anotherFieldName: value3 valueWithoutFieldName
into
List(Some("fieldName") ~ List("value1", "value2"), Some("anotherFieldName") ~ List("value3"), None~List("valueWithoutFieldName"))
(Alternative field values are separated by pipe (|). Field name is optional. If field has no name, it should be parsed as None (see valueWithoutFieldName)
My current (not working) solution
This is what I have so far:
val parser: Parser[ParsedQuery] = {
phrase(rep(opt(fieldNameTerm) ~ (multipleValueTerm | singleValueTerm))) ^^ {
case termSequence =>
// operate on List[Option[String] ~ List[String]]
}
}
val fieldNameTerm: Parser[String] = {
("\\w+".r <~ ":(?=\\S)".r) ^^ {
case fieldName => fieldName
}
}
val multipleValueTerm = rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))
val alternativeValueTerm: Parser[String] = {
// matches '|'
("\\|".r) ^^ {
case token => token
}
}
val singleValueTerm: Parser[String] = {
// all non-whitespace characters except '|'
("[\\S&&[^\\|]]+".r) ^^ {
case token => token
}
}
Unfortunately, my code does not parse last possible field value (the last value after pipe) correctly and treats it as value of a new nameless field. For instance:
The following string:
"abc:111|222|333|444 cde:555"
is parsed into:
List((Some(abc)~List(111, 222, 333)), (None~444), (Some(cde)~555))
while I'd like it to be:
List((Some(abc)~List(111, 222, 333, 444)), (Some(cde)~555))
My suspicions
I think that the problem lies in definition of multipleValueTerm:
rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))
It's second part is probably not interpreted correctly, but I have no idea why.
Shouldn't <~ from the first part of multipleValueTerm left pipe representing value separator, so that second part of this expression (alternativeValueTerm ~> singleValueTerm) is able to parse it successfully?
Let's look at what's happening. We want to parse: 111|222|333|444 with multiValueTerm.
111| fits (singleValueTerm <~ alternativeValueTerm). <~ throws away the | and we take the 111.
So we have 222|333|444 left.
Analog to the previous: 222| and 333| are taken. So we are left with 444. But 444 does not fit either (singleValueTerm <~ alternativeValueTerm) or (alternativeValueTerm ~> singleValueTerm). So it is not taken. That is why it will be treated as a new value without variable.
I would improve your parser this way:
val seperator = "|"
lazy val parser: Parser[List[(Option[String] ~ List[String])]] = rep(termParser)
lazy val termParser: Parser[(Option[String] ~ List[String])] = opt(fieldNameTerm) ~ valueParser
lazy val fieldNameTerm: Parser[String] = ("\\w+".r <~ ":(?=\\S)".r)
lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, seperator)
lazy val singleValueTerm: Parser[String] = ("[\\S&&[^\\|]]+".r)
There is no need for all this identity stuff ^^ {case x => x}. I removed that. Then I treat single- and multi-values the same way. It is either a List with exactly one or more elements. repsep is nice for dealing with seperators.
rep1sep(singleValueTerm, seperator) could be equivalently expressed with
singlevalueTerm ~ rep(seperator ~> singlevalueTerm)
I am trying to build a small parser where the tokens (luckily) never contain whitespace. Whitespace (spaces, tabs and newlines) are essentially token delimeters (apart from cases where there are brackets etc.).
I am extending the RegexParsers class. If I turn on skipWhitespace the parser is greedily joining tokens together when the next token matches the regular expression of the previous one. If I turn off skipWhitespace, on the other hand, it complains because of the spaces not being part of the definition. I am trying to match the BNF as much as possible, and given that whitespace is almost always the delimeter (apart from brackets or some other cases where the delimeter is explicitly defined in the BNF), is there away to avoid putting whitespace regex in all my definitions?
UPDATE
This is a small test example where the tokens are being joined together:
import scala.util.parsing.combinator.RegexParsers
object TestParser extends RegexParsers {
def test = "(test" ~> name <~ ")"
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r
def main(args: Array[String]) {
val s = "(test hello these should not be joined and I should get an error)"
val res = parseAll(test, s)
res match {
case Success(r, n) => println(r)
case Failure(msg, n) => println(msg)
case Error(msg, n) => println(msg)
}
}
}
In the above case I just get the string joined together.
A similar effect is if I change test to the following, expecting it to give me the list of separate words after test, but instead it joins them together and just gives me a one element list with a long string, without the middle spaces:
def test = "(test" ~> (name+) <~ ")"
White space is skipped just before every production rule. So, in this snippet:
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
It will skip whitespace before each letter and, even worse, each empty string for good measure (since anyChar* can be empty).
Use regular expressions (or plain strings) for each token, not each lexical element. Like this:
object TestParser extends RegexParsers {
def test = "(test" ~> name <~ ")"
def name : Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r
// ...
I'm trying to write a parser in Scala that gradually builds up a concrete type hierarchy. I started with:
private def word = regex(new Regex("[a-zA-Z][a-zA-Z0-9-]*"))
private def quicktoken: Parser[Quicktoken] = "/" ~> word <~ "/" <~ (space?) ^^ { new Quicktoken(_) }
which is fine. /hello/ will get parsed to a quicktoken
Now I want to add the quicktoken to a composite expression. I have a class
class MatchTokenPart(word:String,quicktoken:RewriteWord){
}
I would have thought that I could write...
private def matchTokenPartContent: Parser[MatchTokenPart] = word<~equals~quicktoken ^^ { case word~quicktoken => new MatchTokenPart(word, quicktoken)}
but it doesn't work. It says that word is of type Option[String] and quicktoken of type String. What am I missing?
Another precedence issue: a <~ b ~ c is interpreted as a <~ (b ~ c), not (a <~ b) ~ c. This is because infix operators starting with < have lower precedence than ones starting with ~, (see the list in 6.12.3 of the language specification).
You want (word <~ equals) ~ quicktoken, so you'll need to supply the parentheses.