Scala: parsing multiple files using Scala's parser combinators

I am writing a DSL using Scala parser combinators and have a working version that can read a single file and parse it. However, I would like to split my input into several files where some files are 'standard' and can be used with any top-level file. What I would like is something like:
import "a.dsl"
import "b.dsl"
// rest of file using {a, b}
It isn't important what order the files are read in or that something is necessarily 'defined' before being referred to so parsing the top-level file first then parsing the closure of all imports into a single model is sufficient. I will then post-process the resulting model for my own purposes.
The question I have is, is there a reasonable way of accomplishing this? If necessary I could iterate over the closure, parsing each file into a separate model, and manually 'merge' the resulting models but this feels clunky and seems ugly to me.
BTW, I am using an extension of StandardTokenParsers, if that matters.

I think the only approach is to open and parse the file indicated by the import directly. From there you can create a sub-expression tree for the module. You may not need to manually merge the trees when parsing: if you're already using ^^ and/or ^^^ to return your own expression types, you should be able to simply emit the relevant expression type in the correct place within the tree. For example:
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.io.Source

object Example {
  sealed trait Expr
  case class Imports(modules: List[Module]) extends Expr
  case class Module(modulePath: String, root: Option[Expr]) extends Expr
  case class BracedExpr(x: String, y: String) extends Expr
  case class Main(imports: Imports, braced: BracedExpr) extends Expr

  class BlahTest extends StandardTokenParsers {
    // Register the keyword and delimiters with the lexer.
    lexical.reserved += "import"
    lexical.delimiters ++= List("{", ",", "}")

    // Note: stringLit already consumes the surrounding quotes, so the
    // import path needs no explicit "\"" delimiters around it.
    def importExpr: Parser[Module] = "import" ~> stringLit ^^ {
      case modulePath =>
        // You could use something other than `expr` below if you
        // wanted to limit the expressions available in modules,
        // e.g. you could stop one module importing another.
        phrase(expr)(new lexical.Scanner(Source.fromFile(modulePath).mkString)) match {
          case Success(result, _) =>
            Module(modulePath, Some(result))
          case failure: NoSuccess =>
            // TODO log or act on failure
            Module(modulePath, None)
        }
    }

    def prologExprs: Parser[Imports] = rep(importExpr) ^^ {
      case modules => Imports(modules)
    }

    def bracedExpr: Parser[BracedExpr] = "{" ~> stringLit ~ "," ~ stringLit <~ "}" ^^ {
      case x ~ "," ~ y => BracedExpr(x, y)
    }

    def bodyExprs: Parser[BracedExpr] = bracedExpr

    // `expr` is recursive (via importExpr), so it and the parsers in the
    // cycle need explicit result types.
    def expr: Parser[Main] = prologExprs ~ bodyExprs ^^ {
      case prolog ~ body => Main(prolog, body)
    }
  }
}
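A hypothetical driver for the parser above might look like this (the file name "main.dsl" is a placeholder, and the imports from the snippet above are assumed to be in scope):
object Run extends App {
  val parser = new Example.BlahTest
  import parser._
  // Parse the top-level file; imports are resolved during the parse itself.
  val input = Source.fromFile("main.dsl").mkString
  phrase(expr)(new lexical.Scanner(input)) match {
    case Success(ast, _)    => println(ast)
    case failure: NoSuccess => println(failure)
  }
}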
You could simply add an eval method to your Expr trait, implement eval as necessary on each subclass, and then recursively descend your AST visitor-style. In this manner you would not need to manually merge expression trees together.
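As a minimal sketch of that idea, folded into the case classes above (the Unit-returning signature and the println leaf are illustrative assumptions, not part of the answer's code):
sealed trait Expr { def eval(): Unit }

case class Imports(modules: List[Module]) extends Expr {
  def eval(): Unit = modules.foreach(_.eval())   // descend into each import
}
case class Module(modulePath: String, root: Option[Expr]) extends Expr {
  def eval(): Unit = root.foreach(_.eval())      // descend into the module body, if it parsed
}
case class BracedExpr(x: String, y: String) extends Expr {
  def eval(): Unit = println(s"braced: $x, $y")  // leaf: act on the values
}
case class Main(imports: Imports, braced: BracedExpr) extends Expr {
  def eval(): Unit = { imports.eval(); braced.eval() }
}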

Related

Scala Parser Combinator ignoring errors in optional elements

We are building a parser for a DSL that looks like SQL. Like SQL, it has multiple blocks, namely select, where, order by, group by, etc. We made some parts of the query, like where and order by, optional. When we do this, the parser ignores all input errors in those blocks.
DSL Code Snippet
case class Query(select: Select, where: Where)
case class Select(cols: Seq[String])
case class Where(conditions: Seq[String])

object QueryLanguage extends StandardTokenParsers {
  lazy val sql: Parser[Query] =
    select_block ~ (where_block?) ^^ { case slct ~ whr => Query(slct, whr.getOrElse(null)) }
  lazy val select_block: Parser[Select] =
    "SELECT" ~> rep1sep(ident, ",") ^^ { case cols => Select(cols) }
  lazy val where_block: Parser[Where] =
    "WHERE" ~> rep1sep(ident, "and") ^^ { case conds => Where(conds) }
}
Any syntax errors in the select block are reported, but errors in the other blocks are not. This is frustrating for someone who wants a where block in their query but has no way to see errors in its definition.
For example, the query below simply ignores the WHERE clause without reporting an error (note that the and keyword is misspelled):
select A, B
WHERE a=1 andd b=2
The SQL parses correctly if I fix the spelling in the input or make the where clause mandatory in the DSL, like so:
lazy val sql: Parser[Query] = select ~ where ~ (orderby?)
Is there another way to handle this or override this default behavior?
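One commonly suggested remedy (a sketch of a general technique, not an answer recorded on this page) is to commit to a block once its opening keyword has been seen, using the commit combinator from scala.util.parsing.combinator.Parsers:
lazy val where_block: Parser[Where] =
  // Once "WHERE" is consumed, commit turns any Failure inside the block
  // into an Error, which `?` and `|` will not backtrack over, so the
  // malformed condition list is reported instead of silently dropped.
  "WHERE" ~> commit(rep1sep(ident, "and")) ^^ { case conds => Where(conds) }
Note this only covers failures inside the block itself; leftover unconsumed input still needs phrase/parseAll to be flagged.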

Recursive definitions with scala-parser-combinators

I have been trying to build a SQL parser with the scala-parser-combinator library, which I've simplified greatly into the code below.
import scala.util.Try
import scala.util.parsing.combinator.RegexParsers

class Expression
case class FalseExpr() extends Expression
case class TrueExpr() extends Expression
case class AndExpression(expr1: Expression, expr2: Expression) extends Expression

object SimpleSqlParser {
  def parse(sql: String): Try[Expression] = new SimpleSqlParser().parse(sql)
}

class SimpleSqlParser extends RegexParsers {
  def parse(sql: String): Try[_ <: Expression] = parseAll(expression, sql) match {
    case Success(matched, _) => scala.util.Success(matched)
    case Failure(msg, remaining) =>
      scala.util.Failure(new Exception(
        "Parser failed: " + msg + ", remaining: " + remaining.source.toString.drop(remaining.offset)))
    case Error(msg, _) => scala.util.Failure(new Exception(msg))
  }

  private def expression: Parser[_ <: Expression] =
    andExpr | falseExpr | trueExpr

  private def falseExpr: Parser[FalseExpr] =
    "false" ^^ (_ => FalseExpr())

  private def trueExpr: Parser[TrueExpr] =
    "true" ^^ (_ => TrueExpr())

  private def andExpr: Parser[Expression] =
    expression ~ "and" ~ expression ^^ { case e1 ~ and ~ e2 => AndExpression(e1, e2) }
}
Without the 'and' parsing, it works fine. But I want to be able to parse things like 'true AND (false OR true)', for example. When I add the 'and' part to the definition of expression, I get a StackOverflowError; the stack alternates between the definitions of andExpr and expression.
I understand why this is happening: the definition of expression begins with andExpr, and vice versa. But this seems like the most natural way to model this problem. In reality an expression could also be LIKE, EQUALS, etc. Is there another way to model this kind of thing in general, to get around the problem of recursive definitions?
scala.util.parsing.combinator.RegexParsers cannot handle left-recursive grammars. Your grammar can be summarized by the following production rules:
expression -> andExpr | falseExpr | trueExpr
...
andExpr -> expression "and" expression
expression is indirectly left-recursive via andExpr.
To avoid the infinite recursion, you need to reformulate the grammar so that it is not left-recursive anymore. One frequently-used way is to use repetition combinators, such as chainl1:
private def expression: Parser[_ <: Expression] =
  chainl1(falseExpr | trueExpr, "and" ^^^ { AndExpression(_, _) })
The new expression matches one or more falseExpr/trueExpr, separated by "and", and combines the matched elements with AndExpression in a left-associative way. Conceptually, it corresponds to the following production rule:
expression -> (falseExpr | trueExpr) ("and" (falseExpr | trueExpr))*
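As a quick check of that left-associative grouping (assuming the chainl1 version of expression above is in place):
object ChainDemo extends App {
  // Expected: Success(AndExpression(AndExpression(TrueExpr(),FalseExpr()),TrueExpr()))
  println(SimpleSqlParser.parse("true and false and true"))
}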
If your grammar contains many tangled left-recursive production rules, you might want to consider other parser combinator libraries, such as GLL combinators, that directly support left recursion.

Reuse parser within another parser with Scala parser combinators

I have a parser for arithmetic expressions:
object FormulaParser extends JavaTokenParsers {
  def apply(input: String) = parseAll(formula, input)
  // ...
  val formula: Parser[Formula] =
    comparison | comparable | concatenable | term | factor
}
I need to parse a different language that can contain formulas. Let's say I need to parse something like X < formula. Unfortunately I cannot reuse FormulaParser.formula in my new parser:
object ConditionParser extends JavaTokenParsers {
  def apply(input: String) = parseAll(condition, input)
  // ...
  val condition: Parser[Condition] =
    "X" ~ ("<=" | "<") ~ FormulaParser.formula ^^ { ... } // doesn't work
}
because the parser on the left-hand side of ~ is an instance of ConditionParser.Parser, so its ~ method expects something with that same type, not something of type FormulaParser.Parser.
The whole point of using parser combinators is, well, combining parsers! It seems silly to me that my first attempt didn't work, although I understand why it happens (we are reusing base parsers by extending a base trait).
Is there a simple way to combine parsers defined in different types?
In order to reuse parsers, you need to use inheritance. So if you make FormulaParser a class or a trait, ConditionParser can inherit from it and reuse its parsers.
This is also how you're already reusing the parsers defined in JavaTokenParsers.
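For example, a minimal sketch of that refactoring (the Formula and Condition types and the grammar here are simplified stand-ins for illustration):
import scala.util.parsing.combinator.JavaTokenParsers

// Simplified stand-in types for illustration.
case class Formula(value: Double)
case class Condition(op: String, rhs: Formula)

// Moving `formula` into a trait lets any parser mix it in, so both
// sides of ~ share the same Parser type.
trait FormulaParsing extends JavaTokenParsers {
  def formula: Parser[Formula] =
    floatingPointNumber ^^ (s => Formula(s.toDouble))
}

object ConditionParser extends FormulaParsing {
  def apply(input: String) = parseAll(condition, input)
  def condition: Parser[Condition] =
    "X" ~> ("<=" | "<") ~ formula ^^ { case op ~ f => Condition(op, f) }
}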

Scala: pattern matching Option with case classes and code blocks

I'm starting to learn the great Scala language and have a question about "deep" pattern matching.
I have a simple Request class:
case class Request(method: String, path: String, version: String) {}
And a function that tries to match a request instance and build a corresponding response:
def guessResponse(requestOrNone: Option[Request]): Response = {
  requestOrNone match {
    case Some(Request("GET", path, _)) => Response.streamFromPath(path)
    case Some(Request(_, _, _)) => new Response(405, "Method Not Allowed", requestOrNone.get)
    case None => new Response(400, "Bad Request")
  }
}
See, I use requestOrNone.get inside the case statement to get the actual Request object. Is it type safe, given that the case statement matched? I find it a bit ugly. Is there a way to "unwrap" the Request object from Some but still be able to match on the Request class fields?
What if I want a complex calculation inside a case, with local variables, etc.? Can I use {} blocks after case statements? I use IntelliJ IDEA with the official Scala plugin and it highlights my brackets, suggesting I remove them.
If that is possible, is it good practice to enclose matches in matches?
... match {
  case Some(Request("GET", path, _)) => {
    var stream = this.getStream(path)
    stream match {
      case Some(InputStream) => Response.stream(stream.get)
      case None => new Response(404, "Not Found")
    }
  }
}
For the first part of your question, you can name the value you match against with @:
scala> case class A(i: Int)
defined class A

scala> Option(A(1)) match {
     |   case None => A(0)
     |   case Some(a @ A(_)) => a
     | }
res0: A = A(1)
From the Scala Specification (8.1.3: Pattern Binders):
A pattern binder x @ p consists of a pattern variable x and a pattern p. The type of the variable x is the static type T of the pattern p. This pattern matches any value v matched by the pattern p, provided the run-time type of v is also an instance of T, and it binds the variable name to that value.
However, you do not need to in your example: since you're not matching against anything about the Request but just its presence, you could do:
case Some(req) => new Response(405, "Method Not Allowed", req)
For the second part, you can nest matches. The reason IntelliJ suggests removing the braces is that they are unnecessary: the keyword case is enough to know that the previous case is done.
As to whether it is a good practice, that obviously depends on the situation, but I would probably try to refactor the code into smaller blocks.
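For instance, the nested match could be pulled out into a helper (a sketch; getStream and the Response API are assumed from the question):
def streamResponse(path: String): Response =
  getStream(path) match {
    case Some(stream) => Response.stream(stream)   // bind the stream directly
    case None         => new Response(404, "Not Found")
  }

... match {
  case Some(Request("GET", path, _)) => streamResponse(path)
  case Some(req)                     => new Response(405, "Method Not Allowed", req)
  case None                          => new Response(400, "Bad Request")
}
Binding with Some(stream) also avoids the Some(InputStream) pattern from the question, which refers to a value named InputStream rather than matching instances of that type.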
You can rewrite the pattern as follows (with an alias):
case Some(req @ Request(_, _, _)) => new Response(405, "Method Not Allowed", req)
You cannot use a code block in a pattern, only a guard (if ...).
There are pattern-matching compiler plugins, such as rich pattern matching.

Is there a good way in Scala to interpret the types of values in a CSV

Suppose I'm given a CSV with the following values:
0, 1.00, Hello
3, 2.13, World
.
.
.
Is there a good method or library that could automatically detect the best type to classify a given column as? In this case (Int, Float, String).
For more context, I'm attempting to extend a CSV parsing library to allow it to report histogram like data on the CSV that is passed in. The idea is to make it very easy to add certain validation tasks into this framework so as to figure out deficiencies or irregularities in a CSV data dump.
Initially I thought to write something where a user could supply a config file that specified the types, but for cases where the CSV column sets are very large, or just for ease of use, I'd like to detect the types automatically instead of requiring the user to write them out.
One answer might be:
import scala.util.Try
def parse(s: String): Any = Try(s.toInt) orElse Try(s.toDouble) getOrElse s
Then you can use pattern-matching to do whatever you want with it.
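For example (hypothetical usage of the parse above; the boxed result matches on its runtime type):
parse("42") match {
  case i: Int    => println(s"Int: $i")     // e.g. "42"
  case d: Double => println(s"Double: $d")  // e.g. "1.00"
  case s: String => println(s"String: $s")  // anything else stays a String
}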
You could, of course, first do regular-expression tests on the string to see which type you have. But I'm fairly sure just brute-forcing the parse for each format, as above, will be faster.
Consider parser combinators; inferred types are reported via a list of case classes,
import scala.util.parsing.combinator._

trait CSVType
// Case classes need an (empty) parameter list in current Scala.
case class LiteralStr() extends CSVType
case class Float() extends CSVType
case class Integer() extends CSVType
case class Bool() extends CSVType
case class NA() extends CSVType // Not Available

class CSV extends JavaTokenParsers {
  def row: Parser[List[CSVType]] = repsep(value, ",")

  def value: Parser[CSVType] =
    floatingPointNumber ^^ { f =>
      if (f.toDouble.toInt == f.toDouble) Integer() else Float()
    } |
    "NA" ^^ { na => NA() } |
    ("true" | "false") ^^ { b => Bool() } |
    stringLiteral ^^ { s => LiteralStr() }
}
object ParseExpr extends CSV with App {
  println("in: " + args(0))
  println(parseAll(row, args(0)))
}
Hence
scala> val s = """1.23,2,true,NA,"hello" """
s: String = "1.23,2,true,NA,"hello" "
scala> ParseExpr.main(Array(s))
in: 1.23,2,true,NA,"hello"
[1.24] parsed: List(Float(), Integer(), Bool(), NA(), LiteralStr())
Note that JavaTokenParsers already provides parsers for types such as numerics and string literals, while custom types, for instance NA, are defined in the parser itself; see the JavaTokenParsers trait for the definitions used here.
Each case class may include additional logic to report typing in whatever way is most convenient.
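Building on this, the per-row type lists could be aggregated into per-column histograms, which is roughly the reporting the question asks about (a sketch; the aggregation structure is my own assumption, and each program argument is treated as one CSV line):
object ColumnStats extends CSV with App {
  val rows: List[List[CSVType]] =
    args.toList.map(line => parseAll(row, line).getOrElse(Nil))

  // For each column index, count how often each inferred type appears.
  val histograms: Map[Int, Map[CSVType, Int]] =
    rows.flatMap(_.zipWithIndex)
      .groupBy { case (_, col) => col }
      .map { case (col, cells) =>
        col -> cells.groupBy { case (t, _) => t }.map { case (t, xs) => t -> xs.size }
      }

  histograms.toSeq.sortBy { case (col, _) => col }.foreach { case (col, hist) =>
    println(s"column $col: $hist")
  }
}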