I want to parse a document treating each line as a distinct token. A toy example might look something like this:
August Pizza Sales
2021-08-21
22
19
53
P.S. I <3 🐈🐈🐈
Lines might have arbitrary whitespace before and after the text. (My real case is more complicated, and might have different types of data sections, headers, etc. of arbitrary length.)
Here's a basic attempt:
import java.time.LocalDate
import scala.util.parsing.combinator.RegexParsers
case class Document(
title: String,
date: LocalDate,
data: Seq[Integer],
footer: String,
)
class LineParser extends RegexParsers {
private val eol = sys.props("line.separator")
def text: Parser[String] = ".*".r ^^ { _.toString }
def date: Parser[LocalDate] = ".*".r ^^ { LocalDate.parse(_) }
def blank: Parser[String] = "" ^^ { _.toString }
def datum: Parser[Integer] = "[0-9]+".r ^^ { _.toInt }
def data: Parser[Seq[Integer]] = repsep(datum, eol)
def document: Parser[Document] = text ~ date ~ blank ~ data ~ blank ~ text ^^ {
case title ~ date ~ _ ~ data ~ _ ~ footer =>
Document(title=title, date=date, data=data, footer=footer)
}
}
When I run this, I get Document(August Pizza Sales,2021-08-21,List(22),19).
There are many things I don't understand here ...
Tactical Questions
What changes do I need to make to correctly parse the document?
If I change my definition of blank to "^\\s*$".r (which is what should actually match a blank line, I think) then everything breaks. Why?
Strategic Questions
What's really going on with newlines here? The text parser does not match across newlines, so it seems they're being treated specially (i.e., even if I try to match text against the entire document, it only gets August Pizza Sales.)
What is the impact of setting skipWhitespace?
What is the impact of setting val whiteSpace = ... (in combination with skipWhitespace)?
[SUPER BONUS QUESTION] What the heck does ~> mean?
Philosophical Questions
Since I already know how to tokenize my input (i.e., split it on newlines), is RegexParsers (or Scala parser combinators in general) the right tool for this job?
Related
I am trying to write a parser for a language very similar to Milner's CCS. Basically what I am parsing so far are expressions of the following sort:
a.b.a.1
a.0
An expression must start with a letter (excluding t) and can have any number of letters following the first letter (separated by a '.'). The expression must terminate with a digit (for simplicity I chose digits between 0 and 2 for now). I want to use parser combinators for Scala; however, this is the first time I am working with them. This is what I have so far:
import scala.util.parsing.combinator._
class SimpleParser extends RegexParsers {
def alpha: Parser[String] = """[^t]{1}""".r ^^ { _.toString }
def digit: Parser[Int] = """[0-2]{1}""".r ^^ { _.toInt }
def expr: Parser[Any] = alpha ~ "." ~ digit ^^ {
case al ~ "." ~ di => List(al, di)
}
def simpleExpression: Parser[Any] = alpha ~ "." ~ rep(alpha ~ ".") ~ digit //^^ { }
}
As you can see in def expr: Parser[Any], I am trying to return the result as a list, since lists in Scala are very easy to work with (in my opinion). Is this the correct way to convert a Parser[Any] result to a List? Can anyone give me tips on how I can do this for def simpleExpression: Parser[Any]?
The main reason I want to use lists is that after parsing an expression I want to be able to consume it. For example, given the expression a.b.1, if I am given an 'a', I would like to consume the expression and end up with a new expression b.1 (i.e. a.b.1 ->(a)-> b.1). The idea behind this is to simulate finite state automata. Any tips on how I may improve my implementation are appreciated.
To keep things type safe, I recommend a parser that produces a tuple of a list of strings and an int. That is, the input a.b.a.1 would get parsed as (List("a", "b", "a"), 1). Note also that the regex for alpha was modified to exclude anything that is not a lowercase letter (in addition to t).
class SimpleParser extends RegexParsers {
def alpha: Parser[String] = """[a-su-z]{1}""".r ^^ { _.toString }
def digit: Parser[Int] = """[0-2]{1}""".r ^^ { _.toInt }
def repAlpha: Parser[List[String]] = rep1sep(alpha, ".")
def expr: Parser[(List[String], Int)] = repAlpha ~ "." ~ digit ^^ {
case alphas ~ _ ~ num =>
(alphas, num)
}
}
With an instance of this SimpleParser, here's the output I got:
val parser = new SimpleParser
println(parser.parse(parser.expr, "a.b.a.1"))
// [1.8] parsed: (List(a, b, a),1)
println(parser.parse(parser.expr, "a.0"))
// [1.4] parsed: (List(a),0)
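As for consuming a parsed expression like an FSA transition (the question's follow-up): with the (List[String], Int) representation this is just peeling labels off the head of the list. A minimal sketch of my own (the consume helper is hypothetical, not part of any library):
def consume(expr: (List[String], Int), label: String): Option[(List[String], Int)] =
  expr match {
    // one FSA-style step: drop the leading label if it matches, e.g.
    // consume((List("a", "b", "a"), 1), "a") == Some((List("b", "a"), 1))
    case (head :: tail, terminal) if head == label => Some((tail, terminal))
    case _ => None // wrong label, or no labels left to consume
  }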
I'm trying to use a JavaTokenParsers combinator parser to pull out a particular match that's in the middle of a larger string (i.e. ignore a random set of prefix chars). However, I can't get it working and think I'm getting caught out by a greedy parser and/or CRs/LFs. (The prefix chars can be basically anything.) I have:
class RuleHandler extends JavaTokenParsers {
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r
def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
and then in my test case ..
"when looking for the X value" in {
"must find and correctly interpret X" in {
val testString =
"""
|Looking (only)
|for (x=45) within
|this string
""".stripMargin
val answer = ruleHandler.parse(ruleHandler.findX, testString)
System.out.println(" X value is : " + answer.toString)
}
}
I think it's similar to this SO question. Can anyone see what's wrong? Thanks.
First, you should not write the doubly-escaped "\\s" inside """ """:
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r
In your case it was interpreted as two separate characters inside the character class: a literal "\" or the letter "s", not the whitespace class \s.
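A quick way to see the difference in the REPL (my own illustration):
"""[\\s]""".r.findFirstIn(" ") // None: the class matches '\' or 's', not whitespace
"""[\s]""".r.findFirstIn(" ")  // Some(" "): here \s is the whitespace class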
Second, your allowedPrefixChars parser includes (, x and =, so it captures the whole string, including (x=, and nothing is left for the subsequent parsers.
The solution is to be more concrete about the prefix you want:
object ruleHandler extends JavaTokenParsers {
def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here
def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
ruleHandler.parse(ruleHandler.findX, testString)
res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0
I've told the parser to also accept a ( that does not have x= right after it (it's just a negative lookahead).
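A tiny check of that lookahead (my own illustration):
"""\((?!x=)""".r.findFirstIn("(only)") // Some("("): '(' not followed by "x="
"""\((?!x=)""".r.findFirstIn("(x=45)") // None: the lookahead rejects it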
Alternative:
"""\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
res22: List[Double] = List(45.0)
If you want to use parsers correctly, I would recommend describing the whole BNF grammar (with all possible (, ) and = usages), not just a fragment. For example, include (only) in your parser if it's a keyword, and use something like "(" ~> valueName <~ "=" ~ value to extract a value. Don't forget that scala-parser is intended to return an AST, not just some matched value. Pure regexps are better for regular matching over unstructured data.
An example of how it might look to use parsers the correct way (a sketch; I didn't try to compile it):
trait Command
case class Rule(name: String, value: Double) extends Command
case class Directive(name: String) extends Command
class RuleHandler extends JavaTokenParsers { // why JavaTokenParsers (not RegexParsers) if you don't use tokens from the Java Language Specification?
def string = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r // still not ideal: better to use the predefined Java-like literals from JavaTokenParsers
def rule = ("(" ~> string) ~ ("=" ~> string) <~ ")" ^^ { case name ~ num => Rule(name, num.toDouble) }
def directive = "(" ~> string <~ ")" ^^ { case name => Directive(name) }
def commands: Parser[List[Command]] = repsep(rule | directive, string)
}
If you need to process natural language (Chomsky type-0), scalanlp or something similar fits better.
I'm trying to write a SemVer (http://semver.org) parser in Scala using parser combinators, as a sort of familiarisation with them.
This is my current code:
case class SemVer(major: Int, minor: Int, patch: Int, prerelease: Option[List[String]], metadata: Option[List[String]]) {
override def toString = s"$major.$minor.$patch" + prerelease.map("-" + _.mkString(".")).getOrElse("") + metadata.map("+" + _.mkString(".")).getOrElse("")
}
class VersionParser extends RegexParsers {
def number: Parser[Int] = """(0|[1-9]\d*)""".r ^^ (_.toInt)
def separator: Parser[String] = """\.""".r
def prereleaseSeparator: Parser[String] = """-""".r
def metadataSeparator: Parser[String] = """\+""".r
def identifier: Parser[String] = """([0-9A-Za-z-])+""".r ^^ (_.toString)
def prereleaseIdentifiers: Parser[List[String]] = (number | identifier) ~ rep(separator ~> (number | identifier)) ^^ {
case first ~ rest => List(first.toString) ++ rest.map(_.toString)
}
def metadataIdentifiers: Parser[List[String]] = identifier ~ rep(separator ~> identifier) ^^ {
case first ~ rest => List(first.toString) ++ rest.map(_.toString)
}
}
I'd like to know how I should parse identifiers for the prerelease section, because the spec disallows leading zeros in numeric identifiers, and with my current parser leading zeros (e.g. in "01.2.3") simply become a list containing the element 0.
More generally, how should I detect that the string does not conform to the SemVer spec and consequently force a failure condition?
After some playing around and some searching, I've discovered the issue was that I was calling the parse method instead of the parseAll method. Since parse basically parses as much as it can, ending when it can't parse anymore, it is possible for it to accept partially correct strings. Using parseAll forces all the input to be parsed, and it fails if there is input remaining once parsing stops. This is exactly what I was looking for.
For the sake of completeness I'd add
def version = number ~ (separator ~> number) ~ (separator ~> number) ~ ((prereleaseSeparator ~> prereleaseIdentifiers)?) ~ ((metadataSeparator ~> metadataIdentifiers)?) ^^ {
case major ~ minor ~ patch ~ prerelease ~ metadata => SemVer(major, minor, patch, prerelease, metadata)
}
method to VersionParser.
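To make the parse/parseAll difference concrete, here is a small check of my own (assuming the version method above has been added to VersionParser):
val p = new VersionParser
// parse stops when it can no longer continue, so the invalid prerelease
// "01" "succeeds" as List(0), silently leaving the trailing "1" unconsumed:
println(p.parse(p.version, "1.2.3-01"))    // Success, with input remaining
// parseAll additionally requires the whole input to be consumed, so this fails:
println(p.parseAll(p.version, "1.2.3-01")) // Failure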
I am trying to build a small parser where the tokens (luckily) never contain whitespace. Whitespace (spaces, tabs and newlines) is essentially a token delimiter (apart from cases where there are brackets etc.).
I am extending the RegexParsers class. If I turn on skipWhitespace, the parser greedily joins tokens together when the next token matches the regular expression of the previous one. If I turn off skipWhitespace, on the other hand, it complains because the spaces are not part of the definition. I am trying to match the BNF as closely as possible, and given that whitespace is almost always the delimiter (apart from brackets or other cases where the delimiter is explicitly defined in the BNF), is there a way to avoid putting a whitespace regex in all my definitions?
UPDATE
This is a small test example where the tokens are being joined together:
import scala.util.parsing.combinator.RegexParsers
object TestParser extends RegexParsers {
def test = "(test" ~> name <~ ")"
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r
def main(args: Array[String]): Unit = {
val s = "(test hello these should not be joined and I should get an error)"
val res = parseAll(test, s)
res match {
case Success(r, n) => println(r)
case Failure(msg, n) => println(msg)
case Error(msg, n) => println(msg)
}
}
}
In the above case I just get the string joined together.
A similar effect occurs if I change test to the following, expecting it to give me the list of separate words after test; instead it joins them together and gives me a one-element list containing one long string, without the middle spaces:
def test = "(test" ~> (name+) <~ ")"
White space is skipped just before every production rule. So, in this snippet:
def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
It will skip whitespace before each letter and, even worse, before each empty string for good measure (since anyChar* can match zero characters).
Use regular expressions (or plain strings) for each token, not each lexical element. Like this:
object TestParser extends RegexParsers {
def test = "(test" ~> name <~ ")"
def name : Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r
// ...
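With that change, a quick check of my own suggests the joined-token behavior goes away:
println(TestParser.parseAll(TestParser.test, "(test hello)"))
// [1.13] parsed: hello
println(TestParser.parseAll(TestParser.test, "(test hello world)"))
// now fails with something like: `)' expected but 'w' found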
I'm working with the native parser combinator library in Scala and I'd like to parse some parts of my input, but not others. Specifically, I'd like to discard all of the arbitrary text between inputs that I care about. For example, with this input:
begin
Text I care about
Text I care about
DONT CARE
Text I don't care about
begin
More text I care about
...
Right now I have:
object MyParser extends RegexParsers {
val beginToken: Parser[String] = "begin"
val dontCareToken: Parser[String] = "DONT CARE"
val text: Parser[String] = not(dontCareToken) ~> """([^\n]+)""".r
val document: Parser[String] = beginToken ~> text.+ <~ dontCareToken ^^ { _.mkString("\n") }
val documents: Parser[Iterable[String]] = document.+
}
but I'm not sure how to ignore the text that comes after DONT CARE, up until the next begin. Specifically, I don't want to make any assumptions about the form of that text; I just want to start parsing again at the next begin statement.
You almost had it. Parse the text you don't care about, and then do nothing with it.
I added dontCareText and skipDontCare and then in your document parser indicated that skipDontCare was optional.
import scala.util.parsing.combinator.RegexParsers
object MyParser extends RegexParsers {
val beginToken: Parser[String] = "begin"
val dontCareToken: Parser[String] = "DONT CARE"
val text: Parser[String] = not(dontCareToken) ~> """([^\n]+)""".r
val dontCareText: Parser[String] = not(beginToken) ~> """([^\n]+)""".r
val skipDontCare = dontCareToken ~ dontCareText ^^ { case c => "" }
val document: Parser[String] =
beginToken ~> text.+ <~ opt(skipDontCare) ^^ {
_.mkString("\n")
}
val documents: Parser[Iterable[String]] = document.+
}
val s = """begin
Text I care about
Text I care about
DONT CARE
Text I don't care about
begin
More text I care about
"""
MyParser.parseAll(MyParser.documents, s)
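Running this should succeed, producing two documents, "Text I care about\nText I care about" and "More text I care about", with the DONT CARE section discarded.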