How to override delimiters for RegexParsers based parser combinator - scala

Our parser requires the following two capabilities:
regex parsing capabilities.
expand the set of characters that are considered delimiters.
The quandary is that the former requires RegexParsers (or its derivatives including JavaTokenParsers) whereas the latter requires StandardTokenParsers.
The following code would be added to a subclass of StandardTokenParsers:
// Extend the set of delimiters
class LogLexical extends StdLexical {
  delimiters += (
    "|", ",", "<", ">", "/", "(", ")",
    ";", "%", "{", "}", ":", "[", "]", "&", "^", "~"
  )
}
// Use the custom lexer to scan the input text
override val lexical = new LogLexical
override def parse(input: String): Seq[LogLine] = synchronized {
  // lexical.Scanner is already a reader of tokens, which is what phrase expects
  val tokens = new lexical.Scanner(input)
  phrase(allLogTypes)(tokens) match {
    case Success(plan, _) => plan
    case failureOrError => sys.error(failureOrError.toString)
  }
}
However, the above approach cannot be used with RegexParsers: as far as I can tell, they do not support a custom lexer.
It is not clear how to obtain both of those capabilities simultaneously.
One difference between those may impact the approach:
StandardTokenParsers is Char based whereas RegexParsers is String based
But even so - there is some set of delimiters being used by RegexParsers: they need to define how the tilde ~ operator is being applied
E.g. val myParser = someRegexParser ~ someOtherRegexParser
So the tilde is using a set of predefined delimiters to distinguish the beginning of the patterns. But where is that set? And how to change it? Any guidance appreciated.
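For what it's worth, RegexParsers has no token-level delimiter set as far as I know: it reads characters directly, ~ simply runs one parser after the other, and the only input skipped between parsers is whatever the whiteSpace regex matches (when skipWhitespace is true). One possible approach, sketched here under that assumption, is to override whiteSpace so that extra separator characters are skipped as well. DelimiterSketch, word, words and parseWords are made-up names for illustration, and the separator set is arbitrary.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical example parser; names and grammar are illustrative only.
object DelimiterSketch extends RegexParsers {
  // Skip whitespace plus a few punctuation characters between parsers.
  override protected val whiteSpace = """[\s|,;]+""".r

  def word: Parser[String] = """[A-Za-z0-9]+""".r
  def words: Parser[List[String]] = rep(word)

  // Parse the whole input or fail loudly with the parser's message.
  def parseWords(input: String): List[String] =
    parseAll(words, input) match {
      case Success(result, _) => result
      case failure: NoSuccess => sys.error(failure.toString)
    }
}
```

Note that characters skipped this way are discarded rather than turned into tokens, so this only helps when the separators carry no meaning for the grammar.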

Related

Scala - Can Match-extraction be used on backtick identifiers?

The question is a little difficult to phrase so I'll try to provide an example instead:
def myThing(): (String, String, String) = ("", "", "")
// Illegal, this is a Match
val (`r-1`, `r-2`, `r-3`) = myThing()
// Legal
val `r-1` = myThing()._1
The first evaluation is invalid because this is technically a match expression, and in a match backtick marked identifiers are assumed to be references to an existing val in scope.
Outside of a match though, I could freely define "r-1".
Is there a way to perform match extraction using complex variable names?
You can write out the full variable names explicitly:
def myThing(): (String, String, String) = ("a", "b", "c")
// legal, syntactic backtick-sugar replaced by explicit variable names
val (r$minus1, r$minus2, r$minus3) = myThing()
println(`r-1`, `r-2`, `r-3`)
But since variable names can be chosen freely (unlike methods in Java APIs that happen to be named yield etc.), I would suggest inventing simpler variable names; the r$minusx things really don't look pretty.
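If the backtick names really are needed, another option that stays within what the question already marks as legal is to call the method once and bind each component with a plain val definition. This is only a sketch; BacktickSketch and result are made-up names.

```scala
object BacktickSketch {
  def myThing(): (String, String, String) = ("a", "b", "c")

  // Call myThing() once, then bind each component with an ordinary
  // val definition (not a pattern), where backtick names are allowed.
  private val result = myThing()
  val `r-1` = result._1
  val `r-2` = result._2
  val `r-3` = result._3
}
```

This avoids both the pattern-matching restriction and the repeated calls to myThing().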

Make parser include surrounding whitespace in string literals

I wrote a Scala parser for an in-house expression language that has double quote-delimited string literals:
object MyParser extends JavaTokenParsers {
  lazy val strLiteral = "\"" ~> """[^"]*""".r <~ "\"" ^^ {
    case x ⇒ StringLiteral(x)
  }
  // ...
}
(The actual code is a bit different since I support "" as an escape sequence for a literal double quote. While this is not relevant for the discussion, it's the reason why I cannot just use JavaTokenParsers's stringLiteral).
I noticed that the parser fails to include whitespace at the beginning and at the end of a string:
"a" parsed as StringLiteral("a")
" a" parsed as StringLiteral("a")
"a " parsed as StringLiteral("a")
" a " parsed as StringLiteral("a")
I tried matching whitespace in the regex:
"\"" ~> """\s*[^"]*\s*""".r <~ "\""
and also using the explicit whiteSpace parser:
"\"" ~> whiteSpace.? ~ """[^"]*""".r ~ whiteSpace.? <~ "\""
but in both cases the ~> operator has already consumed and ignored the spaces before there's a chance to read and handle them.
I know that I can set skipWhitespace = false, but I prefer not to, since in general I want to allow arbitrary whitespace around tokens in this language.
What's a simple and clean strategy to include surrounding whitespace in string literals?
One option you have is to use a single regexp for your string literal:
val stringLiteral:Parser[String] = """"([^"]*("")?)*"""".r
and then strip matched quotes afterwards.
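A minimal sketch of that idea follows. Because the opening quote, the body and the closing quote are matched by one regex, whitespace is only skipped before the opening quote and everything inside is kept. The regex here is an unrolled variant of the one above, not the same pattern, and StrLiteralSketch is a made-up name; StringLiteral mirrors the case class from the question.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Stand-in for the question's AST node.
case class StringLiteral(value: String)

object StrLiteralSketch extends JavaTokenParsers {
  // Opening quote, body (where "" is an escaped quote), closing quote.
  lazy val strLiteral: Parser[StringLiteral] =
    """"[^"]*(?:""[^"]*)*"""".r ^^ { raw =>
      // Drop the surrounding quotes, then unescape doubled quotes.
      val inner = raw.substring(1, raw.length - 1)
      StringLiteral(inner.replace("\"\"", "\""))
    }
}
```

Whitespace around the literal as a whole is still skipped as usual, so the rest of the grammar can keep skipWhitespace enabled.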

Scala: parsing multiple files using Scala's combinators

I am writing a DSL using Scala parser combinators and have a working version that can read a single file and parse it. However, I would like to split my input into several files where some files are 'standard' and can be used with any top-level file. What I would like is something like:
import "a.dsl"
import "b.dsl"
// rest of file using {a, b}
It isn't important what order the files are read in or that something is necessarily 'defined' before being referred to so parsing the top-level file first then parsing the closure of all imports into a single model is sufficient. I will then post-process the resulting model for my own purposes.
The question I have is, is there a reasonable way of accomplishing this? If necessary I could iterate over the closure, parsing each file into a separate model, and manually 'merge' the resulting models but this feels clunky and seems ugly to me.
BTW, I am using an extension of StandardTokenParsers, if that matters.
I think the only approach would be to open and parse the file indicated by the import directly. From there you can create a sub-expression tree for the module. You may not need to manually merge the trees when parsing: if you're already using ^^ and/or ^^^ to return your own expressions, you should be able to simply emit a relevant expression type in the correct place within the tree. For example:
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.io.Source
object Example {
  sealed trait Expr
  case class Imports(modules: List[Module]) extends Expr
  case class Module(modulePath: String, root: Option[Expr]) extends Expr
  case class BracedExpr(x: String, y: String) extends Expr
  case class Main(imports: Imports, braced: BracedExpr) extends Expr
  class BlahTest extends StandardTokenParsers {
    // the lexer only emits keyword tokens for registered words and delimiters
    lexical.reserved += "import"
    lexical.delimiters ++= List("{", "}", ",")
    // stringLit matches a whole quoted literal, so the quotes are not separate tokens
    def importExpr: Parser[Module] = "import" ~> stringLit ^^ {
      case modulePath =>
        // you could use something other than `expr` below if you
        // wanted to limit the expressions available in modules
        // e.g. you could stop one module importing another.
        phrase(expr)(new lexical.Scanner(Source.fromFile(modulePath).mkString)) match {
          case Success(result, _) =>
            Module(modulePath, Some(result))
          case failure: NoSuccess =>
            // TODO log or act on failure
            Module(modulePath, None)
        }
    }
    def prologExprs = rep(importExpr) ^^ {
      case modules =>
        Imports(modules)
    }
    def bracedExpr = "{" ~> stringLit ~ "," ~ stringLit <~ "}" ^^ {
      case x ~ "," ~ y =>
        BracedExpr(x, y)
    }
    def bodyExprs = bracedExpr
    def expr = prologExprs ~ bodyExprs ^^ {
      case prolog ~ body =>
        Main(prolog, body)
    }
  }
}
You could simply add an eval to your Expression trait, implement each eval as necessary on the sub-classes and then have a visitor recursively descend your AST. In this manner you would not need to manually merge expression trees together.
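As a rough illustration of that last point, here is a sketch in which eval simply collects every braced pair reachable from a node, including those in imported modules. What eval should actually compute depends on your DSL, so the return type and the collection logic here are assumptions; the case classes mirror the ones above, and EvalSketch is a made-up name.

```scala
object EvalSketch {
  sealed trait Expr {
    // Hypothetical "eval": every braced pair reachable from this node.
    def eval: List[(String, String)]
  }
  case class Imports(modules: List[Module]) extends Expr {
    def eval: List[(String, String)] = modules.flatMap(_.eval)
  }
  case class Module(modulePath: String, root: Option[Expr]) extends Expr {
    // A module that failed to parse (root = None) contributes nothing.
    def eval: List[(String, String)] = root.toList.flatMap(_.eval)
  }
  case class BracedExpr(x: String, y: String) extends Expr {
    def eval: List[(String, String)] = List((x, y))
  }
  case class Main(imports: Imports, braced: BracedExpr) extends Expr {
    // Imported definitions first, then this file's own body.
    def eval: List[(String, String)] = imports.eval ++ braced.eval
  }
}
```

Calling eval on the top-level Main then walks the imports recursively, so no separate merge step is needed.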

Parsing sentences using Scala parser combinator

I just started playing with parser combinators in Scala, but got stuck on a parser to parse sentences such as "I like Scala." (words end on a whitespace or a period (.)).
I started with the following implementation:
package example
import scala.util.parsing.combinator._
object Example extends RegexParsers {
  override def skipWhitespace = false
  def character: Parser[String] = """\w""".r
  def word: Parser[String] =
    rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))
  def sentence: Parser[List[String]] = rep(word) <~ "."
}
object Test extends App {
  val result = Example.parseAll(Example.sentence, "I like Scala.")
  println(result)
}
The idea behind using guard() is to have a period demarcate word endings, but not consume it so that sentences can. However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser).
If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right and will not work if I try to add a parser for paragraphs (rep(sentence)) etc.
def word: Parser[String] =
  rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ opt(".")
Any ideas what may be going on here?
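You wrote: "However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser)."
The rep combinator corresponds to * in Perl-style regex notation, so rep(character) matches zero or more characters. At the period, rep(character) matches nothing and guard succeeds without consuming any input, so word succeeds on empty input and rep(word) keeps retrying at the same position. I think you want it to match one or more characters. Changing that to rep1 (corresponding to + in Perl-style regex notation) should fix the problem.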
However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:
object Example extends RegexParsers {
  override def skipWhitespace = false
  def word: Parser[String] = """\w+""".r
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}
Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.
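The same rep1sep idea should extend to the paragraph case mentioned in the question. The sketch below repeats the parser from this answer and adds a paragraph rule; paragraph and the SentenceSketch name are my own additions, assuming sentences are separated by whitespace.

```scala
import scala.util.parsing.combinator.RegexParsers

object SentenceSketch extends RegexParsers {
  // Whitespace is significant here, so it must be matched explicitly.
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."

  // Hypothetical extension: one or more sentences separated by whitespace.
  def paragraph: Parser[List[List[String]]] = rep1sep(sentence, whiteSpace)
}
```

For example, SentenceSketch.parseAll(SentenceSketch.paragraph, "I like Scala. It is fun.") yields one list of words per sentence.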

Runtime exception in syntax like defining two vals with identical names

In a book I found code similar to this:
object ValVarsSamples extends App {
val pattern = "([ 0-9] +) ([ A-Za-z] +)". r // RegEx
val pattern( count, fruit) = "100 Bananas"
}
This is supposed to be a trick: it looks like defining two vals with the same name, but it is not.
However, it fails with an exception.
The question: what is this construct supposed to be, and why does it not work?
--
As I understand it, the first val pattern refers to the Regex constructor function, and in the second val we are trying to pass the parameters to it using this syntax, just by assigning a string?
This is an extractor:
val pattern( count, fruit) = "100 Bananas"
This code is roughly equivalent to:
val res = pattern.unapplySeq("100 Bananas")
val count = res.get(0)
val fruit = res.get(1)
The problem is that your regex doesn't match the input, so the pattern definition fails at runtime with a MatchError. You should change it to:
val pattern = "([ 0-9]+) ([ A-Za-z]+)".r
The space before + in [ A-Za-z] + means you are matching a single character in the class [ A-Za-z] and then at least one space character. You have the same issue with [ 0-9] +.
Scala regexes define an extractor, which returns a sequence of matching groups in the regular expression. Your regex defines two groups so if the match succeeds the sequence will contain two elements.
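If the input might not match, the same extractor can be used in a match expression with a fallback case instead of a val pattern, which avoids the exception. A small sketch follows; FruitExtractorSketch, buggy and parseLine are made-up names.

```scala
object FruitExtractorSketch {
  // Corrected regex: digits/spaces, one space, then letters/spaces.
  val pattern = "([ 0-9]+) ([ A-Za-z]+)".r

  // The regex from the question, with the stray spaces before '+'.
  val buggy = "([ 0-9] +) ([ A-Za-z] +)".r

  // Hypothetical helper: returns None instead of throwing a MatchError.
  def parseLine(line: String): Option[(String, String)] = line match {
    case pattern(count, fruit) => Some((count, fruit))
    case _                     => None
  }
}
```

Here parseLine("100 Bananas") gives Some(("100", "Bananas")), while buggy.unapplySeq("100 Bananas") is None, which is why the original val definition fails.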