How to make my parser support logic operations and word case-insensitive? - scala

Recently, I am learning the Scala parser combinator. I would like to parse the key in a given string. For instance,
val expr1 = "local_province != $province_name$ or city=$city_name$ or people_number<>$some_digit$"
// ==> List("local_province", "city", "people_number")
val expr2 = "(local_province=$province_name$)"
// ==> List("local_province")
val expr3 = "(local_province=$province_name$ or city=$city_name$) and (lib_name=$akka$ or lib_author=$martin$)"
// ==> List("local_province", "city", "lib_name", "lib_author")
Trial
import scala.util.parsing.combinator.JavaTokenParsers
class KeyParser extends JavaTokenParsers {
lazy val key = """[a-zA-Z_]+""".r
lazy val value = "$" ~ key ~ "$"
lazy val logicOps = ">" | "<" | "=" | ">=" | "<=" | "!=" | "<>"
lazy val elem: Parser[String] = key <~ (logicOps ~ value)
lazy val expr: Parser[List[String]] =
"(" ~> repsep(elem, "and" | "or") <~ ")" | repsep(elem, "and" | "or")
lazy val multiExpr: Parser[List[String]] =
repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
}
object KeyParser extends KeyParser {
def parse(input: String) = parseAll(multiExpr, input)
}
Here is my test in Scala REPL
KeyParser.parse(expr1)
[1.72] failure: $' expected but >' found
KeyParser.parse(expr2)
[1.33] parsed: List(local_province)
KeyParser.parse(expr3)
[1.98] parsed: List(local_province, city, lib_name, lib_author)
I notice that the KeyParser only works for "=" and it doesn't support the case like "(local_province<>$province_name$ AND city!=$city_name$)" which contains "<> | !=" and "AND".
So I would like to know how to revise it.

I notice that the KeyParser only works for "="
This isn't quite true. It also works for !=, < and >. The ones it doesn't work for are >=, <= and <>.
More generally it does not work for those operators which have a prefix of them appear in the list of alternatives before them. That is >= is not matched because > appears before it and is a prefix of it.
So why does this happen? The | operator creates a parser that produces the result of the left parser if it succeeds or of the right parser otherwise. So if you have a chain of |s, you'll get the result of the first parser in that chain which can match the current input. So if the current input is <>$some_digit$, the parser logicOps will match < and leave you with >$some_digit$ as the remaining input. So now it tries to match value against that input and fails.
Why doesn't backtracking help here? Because the logicOps parser already succeeded, so there's nowhere to backtrack to. If the parser were structured like this:
lazy val logicOpAndValue = ">" ~ value | "<" ~ value | "=" ~ value |
">=" ~ value | "<=" ~ value | "!=" ~ value |
"<>" ~ value
lazy val elem: Parser[String] = key <~ logicOpAndValue
Then the following would happen (with the current input being <>$some_digit$):
">" does not match the current input, so go to next alternative
"<" does match the current input, so try the right operand of the ~ (i.e. value) with the current input >$some_digit$. This fails, so continue with the next alternative.
... bunch of alternatives that don't match ...
"<>" does match the current input, so try the right operand of the ~. This matches as well. Success!
However in your code the ~ value is outside of the list of alternatives, not inside each alternative. So when the parser fails, we're no longer inside any alternative, so there's no next alternative to try and it just fails.
Of course moving the ~ value inside the alternatives isn't really a satisfying solution as it's ugly as hell and not very maintainable in the general case.
One solution is simply to move the longer operators at the beginning of the alternatives (i.e. ">=" | "<=" | "<>" | ">" | "<" | ...). This way ">" and "<" will only be tried if ">=", "<=" and "<>" have already failed.
A still better solution, which does not rely on the order of alternatives and is thus less error-prone, is to use ||| instead of. ||| works like | except that it tries all of the alternatives and then returns the longest successful result - not the first.
PS: This isn't related to your problem but you're currently limiting the nesting depth of parentheses because your grammar is not recursive. To allow unlimited nesting, you'll want your expr and multiExpr rules to look like this:
lazy val expr: Parser[List[String]] =
"(" ~> multiExpr <~ ")" | elem
lazy val multiExpr: Parser[List[String]] =
repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
However I recommend renaming expr to something like primaryExpr and multiExpr to expr.
_.foldLeft(List.empty[String])(_ ++ _) can also be more succinctly expressed as _.flatten.

Related

Scala Parser Combinators Matching Wrong Parser

I am writing an extremely simple programming language using scala's parsers. I am trying to allow users to have multi-word variables ie my variable and I want to allow them to assign variables in three ways: let var = 2, var =2, set var to 4. I have the first two working, but I cant get the latter to work.
Here is my code
lazy val line:PackratParser[Line] = (assignment | element | comment) <~ (";" | "[\n\r]*".r)
lazy val assignment:PackratParser[Assignment] = assignmentLHS ~ element ^^ {
case x ~ y => new Assignment(x,y)
}
lazy val assignmentLHS = "set" ~ "[" ~> identifier <~ "]" ~ "to" | ("let"?) ~> identifier <~ "="
lazy val identifier:Parser[String] = "([a-zA-Z]+[a-zA-Z0-9]* ?)+".r ^^ (X => if (X.charAt(X.length-1) == ' ') X.substring(0, X.length-1) else X)
lazy val element:PackratParser[Element] =(
functionDef
| comparison
| expression
| boolean
| controlStatement
| functionCall
| reference
| value
| codeBlock
)
lazy val reference: PackratParser[Reference] = identifier ^^ (x=>new Reference(x))
Elements are most things in the language.
I would like to replace the assignmentLHS parser with:
lazy val assignmentLHS = "set" ~> identifier <~ "to" | ("let"?) ~> identifier <~ "="
so that the user can write set my variable to 4 instead of `set [my variable] to 4. The problem is that it just parses that as a reference
Why do you have a space in the identifier before the 'optional' '?' symbol:
0-9]* ?)+
Maybe that is causing
set <identifier>
to change to
set other_chars
and set becomes part of the identifier. That would explain why your use of
"set" ~ "[" ~> identifier <~ "]"
works but
"set" ~> identifier <~ "to"
does not

Why doesn't parser combinator backtrack?

Consider
import util.parsing.combinator._
object TreeParser extends JavaTokenParsers {
lazy val expr: Parser[String] = decimalNumber | sum
//> expr: => TreeParser.Parser[String]
lazy val sum: Parser[String] = expr ~ "+" ~ expr ^^ {case a ~ plus ~ b => s"($a)+($b)"}
//> sum: => TreeParser.Parser[String]
println(parseAll(expr, "1 + 1")) //> TreeParser.ParseResult[String] = [1.3] failure: string matching regex
//| `\z' expected but `+' found
}
The same story with fastparse
import fastparse.all._
val expr: P[Any] = P("1" | sum)
val sum: P[Any] = expr ~ "+" ~ expr
val top = expr ~ End
println(top.parse("1+1")) // Failure(End:1:2 ..."+1")
Parsers are great to discover that taking the first literal is a bad idea but do not try to fall back to the sum production. Why?
I understand that parser takes the first branch that can successfully eat up a part of input string and exits. Here, "1" of expression matches the first input char and parsing completes. In order to grab more, we need to make sum the first alternative. However, plain stupid
lazy val expr: Parser[String] = sum | "1"
endы up with stack overflow. The library authors therefore approach it from another side
val sum: P[Any] = P( num ~ ("+".! ~/ num).rep )
val top: P[Any] = P( sum ~ End )
Here, we start sum with terminal, which is fine but this syntax is more verbose and, furthermore, it produces a terminal, followed by a list, which is good for a reduction operator, like sum, but is difficult to map to a series of binary operators.
What if your language defines expression, which admits a binary operator? You want to match every occurrence of expr op expr and map it to a corresponding tree node
expr ~ "op" ~ expr ^^ {case a ~ _ ~ b => BinOp(a,b)"}
How do you do that? In short, I want a greedy parser, that consumes the whole string. This is what I mean by 'greedy' rather than greedy algorigthm that jumps into the first wagon and ends up in a dead end.
As I have found here, we need to replace | alternative operator with secret |||
//lazy val expr: Parser[String] = decimalNumber | sum
lazy val backtrackGreedy: Parser[String] = decimalNumber ||| sum
lazy val sum: Parser[String] = decimalNumber ~ "+" ~ backtrackGreedy ^^ {case a ~ plus ~ b => s"($a)+($b)"}
println(parseAll(backtrackGreedy, "1 + 1")) // [1.6] parsed: (1)+(1)
The order of alternatives does not matter with this operator. To stop stack overflow, we need to eliminate the left recursion, sum = expr + expr => sum = number + expr.
Another answer says that we need to normalize, that is instead of
def foo = "foo" | "fo"
def obar = "obar"
def foobar = foo ~ obar
we need to use
def workingFooBar = ("foo" ~ obar) | ("fo" ~ obar)
But first solution is more striking.
The parser does backtrack. Try val expr: P[String] = P(("1" | "1" ~ "+" ~ "1").!) and expr.parse("1+1") for example.
The problem is in your grammar. expr parses 1 and it is a successful parsing by your definition. Then sum fails and now you want to blame the dutiful expr for what happened?
There are plenty of examples on how to deal with binary operators. For example, the first example here: http://lihaoyi.github.io/fastparse/

Parsing a list of 0 or more idents followed by ident

I want to parse a part of my DSL formed like this:
configSignal: sticky Config
Semantically this is:
argument_name: 0_or_more_modifiers argument_type
I tried implementing the following parser:
def parser = ident ~ ":" ~ rep(ident) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
Thing is, the rep(ident) part is applied until there are no more tokens and the parser fails, because the last ~ ident doesn't match. How should I do this properly?
Edit
In the meantime I realized, that the modifiers will be reserved words (keywords), so now I have:
def parser = ident ~ ":" ~ rep(modifier) ~ ident ^^ {
case name ~ ":" ~ modifiers ~ returnType => Arg(name, returnType, modifiers)
}
def modifier = "sticky" | "control" | "count"
Nevertheless, I'm curious if it would be possible to write a parser if the modifiers weren't defined up front.
"0 or more idents followed by ident" is equivalent to "1 or more idents", so just use rep1
Its docs:
def rep1[T](p: ⇒ Parser[T]): Parser[List[T]]
A parser generator for non-empty repetitions.
rep1(p) repeatedly uses p to parse the input until p fails -- p must succeed at least once (the result is a List of the consecutive results of p)
p a Parser that is to be applied successively to the input
returns A parser that returns a list of results produced by repeatedly applying p to the input (and that only succeeds if p matches at least once).
edit in response to OP's comment:
I don't think there's a built-in way to do what you described, but it would still be relatively easy to map to your custom data types by using regular List methods:
def parser = ident ~ ":" ~ rep1(ident) ^^ {
case name ~ ":" ~ idents => Arg(name, idents.last, idents.dropRight(1))
}
In this particular case, you wouldn't have to worry about idents being Nil, since the rep1 parser only succeeds with a non-empty list.

parsing recursive structures in scala

I'm trying to contruct a parser in scala which can parse simple SQL-like strings. I've got the basics working and can parse something like:
select id from users where name = "peter" and age = 30 order by lastname
But now I wondered how to parse nested and classes, i.e.
select name from users where name = "peter" and (age = 29 or age = 30)
The current production of my 'combinedPredicate' looks like this:
def combinedPredicate = predicate ~ ("and"|"or") ~ predicate ^^ {
case l ~ "and" ~ r => And(l,r)
case l ~ "or" ~ r => Or(l,r)
}
I tried recursively referencing the combinedPredicate production within itself but that results in a stackoverflow.
btw, I'm just experimenting here... not implementing the entire ansi-99 spec ;)
Well, recursion has to be delimited somehow. In this case, you could do this:
def combinedPredicate = predicate ~ rep( ("and" | "or" ) ~ predicate )
def predicate = "(" ~ combinedPredicate ~ ")" | simplePredicate
def simplePredicate = ...
So it will never stack overflow because, to recurse, it first has to accept a character. This is the important part -- if you always ensure recursion won't happen without first accepting a character, you'll never get into an infinite recursion. Unless, of course, you have infinite input. :-)
The stack overflow you're experiencing is probably the result of a left-recursive language:
def combinedPredicate = predicate ~ ...
def predicate = combinedPrediacate | ...
The parser combinators in Scala 2.7 are recursive descent parsers. Recursive descent parsers have problems with these because they have no idea how the terminal symbol is when they first encounter it. Other kinds of parsers like Scala 2.8's packrat parser combinators will have no problem with this, though you'll need to define the grammar using lazy vals instead of methods, like so:
lazy val combinedPredicate = predicate ~ ...
lazy val predicate = combinedPrediacate | ...
Alternatively, you could refactor the language to avoid left recursion. From the example you're giving me, requiring parentheses in this language could solve the problem effectively.
def combinedPredicate = predicate ~ ...
def predicate = "(" ~> combinedPrediacate <~ ")" | ...
Now each deeper level of recursion corresponds to another parentheses parsed. You know you don't have to recurse deeper when you run out of parentheses.
After reading about solutions for operator precedence and came up with the following:
def clause:Parser[Clause] = (predicate|parens) * (
"and" ^^^ { (a:Clause, b:Clause) => And(a,b) } |
"or" ^^^ { (a:Clause, b:Clause) => Or(a,b) } )
def parens:Parser[Clause] = "(" ~> clause <~ ")"
Wich is probably just another way writing what #Daniel wrote ;)

Scala combinator parsers - distinguish between number strings and variable strings

I'm doing Cay Horstmann's combinator parser exercises, I wonder about the best way to distinguish between strings that represent numbers and strings that represent variables in a match statement:
def factor: Parser[ExprTree] = (wholeNumber | "(" ~ expr ~ ")" | ident) ^^ {
case a: wholeNumber => Number(a.toInt)
case a: String => Variable(a)
}
The second line there, "case a: wholeNumber" is not legal. I thought about a regexp, but haven't found a way to get it to work with "case".
I would split it up a bit and push the case analysis into the |. This is one of the advantages of combinators and really LL(*) parsing in general:
def factor: Parser[ExprTree] = ( wholeNumber ^^ { Number(_.toInt) }
| "(" ~> expr <~ ")"
| ident ^^ { Variable(_) } )
I apologize if you're not familiar with the underscore syntax. Basically it just means "substitute the nth parameter to the enclosing function value". Thus { Variable(_) } is equivalent to { x => Variable(x) }.
Another bit of syntax magic here is the ~> and <~ operators in place of ~. These operators mean that the parsing of that term should include the syntax of both the parens, but the result should be solely determined by the result of expr. Thus, the "(" ~> expr <~ ")" matches exactly the same thing as "(" ~ expr ~ ")", but it doesn't require the extra case analysis to retrieve the inner result value from expr.