Scala parsing combinators: forcing parser ordering

Assume two parsers, p and q.
Now, given the following code:
val p1 = p | q
val p2 = "$" ~> p1 <~ "$"
Something weird happens. Assume some input string, val input = "...":
parseAll(p1, input)
succeeds with p's parsing result, but:
parseAll(p2, "$" + input +"$")
succeeds with q's parsing result.
The order is important: I only want to try q after p has failed.
Is there any way to force the parser to evaluate p before q?
EDIT:
After checking the docs more carefully:
p | q succeeds if p succeeds or q succeeds.
Note that q is only tried if p's failure is non-fatal (i.e., back-tracking is allowed).
So my code should work.
I tried reproducing the problem with a simple example (my original code is too use-case specific, and most of it is not relevant here), but didn't quite succeed. Instead, I came up with a parser that fails when it should succeed, as shown in the following REPL session:
Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import scala.util.parsing.combinator._
import scala.util.parsing.combinator._
scala> object MyParser extends RegexParsers {
| val xyz = "xyz".r ^^ { case s => Right(s) }
| val any = "\\S+".r ^^ { case s => Left(s) }
| val either: Parser[Either[String,String]] = xyz | any
| val wrapped = "$" ~> either <~ "$"
| def parseMeATest1(input: String) = parseAll(either, input) match {
| case Success(Right(s),_) => println(s"right: $s")
| case Success(Left(s),_) => println(s"left: $s")
| case NoSuccess(msg, _) => println(msg)
| }
| def parseMeATest2(input: String) = parseAll(wrapped, input) match {
| case Success(Right(s),_) => println(s"right: $s")
| case Success(Left(s),_) => println(s"left: $s")
| case NoSuccess(msg, _) => println(msg)
| }
| }
defined object MyParser
scala> MyParser.parseMeATest1("xyz")
right: xyz
scala> MyParser.parseMeATest1("zzz")
left: zzz
scala> MyParser.parseMeATest2("$xyz$")
right: xyz
scala> MyParser.parseMeATest2("$zzz$")
`$' expected but end of source found
So, am I missing something here, or is this a bug in the Scala parsing combinators?
SOLVED:
The problem was that parser p consumed the trailing $ character, which made the overall parse fail. Since backtracking is used, q, which does not consume the $ character, was then tried and completed successfully.
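Applied to the REPL example above, where the `\S+` fallback swallows the closing `$`, one possible fix is to exclude the delimiter from the fallback pattern. This is only a minimal sketch: the character class `[^\s$]+`, the object name `DollarParser`, the `run` helper, and the scala-cli dependency line (including its version number) are assumptions for illustration, not part of the original code.

```scala
//> using dep "org.scala-lang.modules::scala-parser-combinators:2.3.0"
import scala.util.parsing.combinator._

object DollarParser extends RegexParsers {
  // First alternative: the literal keyword.
  val xyz: Parser[Either[String, String]] = "xyz" ^^ (s => Right(s))

  // Fallback: any run of non-whitespace characters except '$',
  // so the closing delimiter is left for the enclosing parser.
  val any: Parser[Either[String, String]] = """[^\s$]+""".r ^^ (s => Left(s))

  val either: Parser[Either[String, String]] = xyz | any
  val wrapped: Parser[Either[String, String]] = "$" ~> either <~ "$"

  // Hypothetical helper: Some(result) on success, None on any failure.
  def run(input: String): Option[Either[String, String]] =
    parseAll(wrapped, input) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }
}
```

With this change, `DollarParser.run("$zzz$")` should take the fallback branch and still see the closing `$`.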

Related

Scala Parser Combinator: recursion

I am writing a parser for boolean expressions, and try to parse input like "true and false"
def boolExpression: Parser[BoolExpression] = boolLiteral | andExpression
def andExpression: Parser[AndExpression] = (boolExpression ~ "and" ~ boolExpression) ^^ {
case b1 ~ "and" ~ b2 => AndExpression(b1, b2)
}
def boolLiteral: Parser[BoolLiteral] = ("true" | "false") ^^ {
s => BoolLiteral(java.lang.Boolean.valueOf(s))
}
The above code does not parse "true and false", since it reads only "true" and applies the rule boolLiteral immediately.
But if I change the rule boolExpression to this:
def boolExpression: Parser[BoolExpression] = andExpression | boolLiteral
Then, when parsing "true and false", the code throws a StackOverflowError due to endless recursion:
java.lang.StackOverflowError
at Parser.boolExpression(NewParser.scala:58)
at Parser.andExpression(NewParser.scala:62)
at Parser.boolExpression(NewParser.scala:58)
at Parser.andExpression(NewParser.scala:62)
...
How to solve this?
This appears to be parsing a string of boolean constants separated by "and", which is best done using the chainl1 primitive in the parser. It processes a chain of operations a op b op c op d, associating them from left to right.
It might look something like this totally untested code:
trait BoolExpression
case class BoolLiteral(value: Boolean) extends BoolExpression
case class AndExpression(l: BoolExpression, r: BoolExpression) extends BoolExpression
def boolLiteral: Parser[BoolLiteral] =
("true" | "false") ^^ { s => BoolLiteral(java.lang.Boolean.valueOf(s)) }
def andExpression: Parser[(BoolExpression, BoolExpression) => BoolExpression] =
"and" ^^ { _ => (l: BoolExpression, r: BoolExpression) => AndExpression(l, r) }
def boolExpression: Parser[BoolExpression] =
chainl1(boolLiteral, andExpression) ^^ { expr => expr }
Presumably the requirement is more complex than this, but the chainl1 parser is a good starting point.
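To make the left-to-right grouping concrete, here is a self-contained variant of the chainl1 approach. It is a sketch under a few assumptions: the object name `BoolParser`, the operator parser name `andOp`, the `parse` helper, and the scala-cli dependency line (including its version number) are illustrative and not from the original answer.

```scala
//> using dep "org.scala-lang.modules::scala-parser-combinators:2.3.0"
import scala.util.parsing.combinator._

sealed trait BoolExpression
final case class BoolLiteral(value: Boolean) extends BoolExpression
final case class AndExpression(l: BoolExpression, r: BoolExpression) extends BoolExpression

object BoolParser extends RegexParsers {
  // Operand: a single boolean literal.
  def boolLiteral: Parser[BoolExpression] =
    ("true" | "false") ^^ (s => BoolLiteral(s.toBoolean))

  // Operator: consumes "and" and yields the function that combines two operands.
  def andOp: Parser[(BoolExpression, BoolExpression) => BoolExpression] =
    "and" ^^ { _ => (l: BoolExpression, r: BoolExpression) => AndExpression(l, r) }

  // chainl1 parses operand (operator operand)* and folds the results from the left,
  // so no parser calls itself before consuming input.
  def boolExpression: Parser[BoolExpression] = chainl1(boolLiteral, andOp)

  // Hypothetical helper: Some(tree) on success, None on any failure.
  def parse(input: String): Option[BoolExpression] =
    parseAll(boolExpression, input) match {
      case Success(expr, _) => Some(expr)
      case _                => None
    }
}
```

For an input such as "true and false and true", the resulting tree nests on the left: the first two literals are combined first, and that result becomes the left operand of the second AndExpression.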
The core of the issue is that the usual parser combinator libraries use parsing algorithms that don't support left recursion. One solution is to rewrite the grammar without left recursion, meaning that every parser needs to consume at least some input before invoking itself recursively. For instance, your grammar can be written like so:
def boolExpression: Parser[BoolExpression] = andExpression | boolLiteral
def andExpression: Parser[AndExpression] = (boolLiteral ~ "and" ~ boolExpression) ^^ {
case b1 ~ "and" ~ b2 => AndExpression(b1, b2)
}
But you can also use a parser combinator library that's based on another parsing algorithm that supports left recursion. I'm only aware of one for Scala:
https://github.com/djspiewak/gll-combinators
I don't know to what degree it is production ready.
//edit:
I think this one might also support left recursion. Again, I don't know the degree to which it is production ready.
https://github.com/djspiewak/parseback

How to solve reassignment to val error?

I am writing a parser that calculates the result of an expression containing floats and an RDD. I have overridden + - / * and that works fine. In one part, though, I am getting the well-known "reassignment to val" error and cannot figure out how to solve it.
Part of the code is as follows:
def calc: Parser[Any]=rep(term2 ~ operator) ^^ {
  //match a list of term~operator
  case termss =>
    var stack =List[Either[RDD[(Int,Array[Float])], Float]]()
    var lastop:(Either[RDD[(Int,Array[Float])], Float], Either[RDD[(Int,Array[Float])], Float]) => RDD[(Int,Array[Float])] = add
    termss.foreach(t =>
      t match { case nums ~ op => {
        if (nums=="/path1/test3D.xml")
          nums=sv.getInlineArrayRDD()
        lastop = op; stack = reduce(stack ++ nums, op)}}
    )
    stack.reduceRight((x, y) => lastop(y, x))
}
def term2: Parser[List[Any]] = rep(factor2)
def factor2: Parser[Any] = pathIdent | num | "(" ~> calc <~ ")"
def num: Parser[Float] = floatingPointNumber ^^ (_.toFloat)
I defined pathIdent to parse paths.
Here is the error:
[error] reassignment to val:
[error] nums=sv.getInlineArrayRDD()
[error] ^
I changed def to var in term2, factor2, and num, although I knew that was probably wrong; it was the only thing that came to mind to test, and it didn't work.
Where is it coming from?
In this piece of code:
case nums ~ op => {
if (nums=="/path1/test3D.xml")
nums=sv.getInlineArrayRDD()
nums can't be reassigned because it is bound by the pattern match (see the case line), and pattern-bound variables are immutable. The last line (nums = ...) tries to assign to it anyway.
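One way around this is to bind the replacement value to a new name instead of assigning to the pattern variable. The sketch below is deliberately simplified and does not use the parser or Spark types from the question: `loadInlineArray` is a hypothetical stand-in for `sv.getInlineArrayRDD()`, and `resolveTerms` and the tuple-based input are assumptions for illustration.

```scala
object RebindExample {
  // Hypothetical stand-in for sv.getInlineArrayRDD() from the question.
  def loadInlineArray(): Vector[Float] = Vector(1.0f, 2.0f)

  def resolveTerms(terms: List[(Any, String)]): List[Any] =
    terms.map { case (nums, _) =>
      // `nums` is bound by the pattern, so it behaves like a val.
      // Bind the replacement to a new name instead of assigning to `nums`.
      val resolved: Any =
        if (nums == "/path1/test3D.xml") loadInlineArray() else nums
      resolved
    }
}
```

The rest of the case body would then use `resolved` wherever it previously used `nums`.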

Threading extra state through a parser in Scala

I'll give you the tl;dr up front
I'm trying to use the state monad transformer in Scalaz 7 to thread extra state through a parser, and I'm having trouble doing anything useful without writing a lot of t m a -> t m b versions of m a -> m b methods.
An example parsing problem
Suppose I have a string containing nested parentheses with digits inside them:
val input = "((617)((0)(32)))"
I also have a stream of fresh variable names (characters, in this case):
val names = Stream('a' to 'z': _*)
I want to pull a name off the top of the stream and assign it to each parenthetical
expression as I parse it, and then map that name to a string representing the
contents of the parentheses, with the nested parenthetical expressions (if any) replaced by their
names.
To make this more concrete, here's what I'd want the output to look like for the example input above:
val target = Map(
'a' -> "617",
'b' -> "0",
'c' -> "32",
'd' -> "bc",
'e' -> "ad"
)
There may be either a string of digits or arbitrarily many sub-expressions at a given level, but these two kinds of content won't be mixed in a single parenthetical expression.
To keep things simple, we'll assume that the stream of names will never
contain either duplicates or digits, and that it will always contain enough
names for our input.
Using parser combinators with a bit of mutable state
The example above is a slightly simplified version of the parsing problem in
this Stack Overflow question.
I answered that question with
a solution that looked roughly like this:
import scala.util.parsing.combinator._
class ParenParser(names: Iterator[Char]) extends RegexParsers {
def paren: Parser[List[(Char, String)]] = "(" ~> contents <~ ")" ^^ {
case (s, m) => (names.next -> s) :: m
}
def contents: Parser[(String, List[(Char, String)])] =
"\\d+".r ^^ (_ -> Nil) | rep1(paren) ^^ (
ps => ps.map(_.head._1).mkString -> ps.flatten
)
def parse(s: String) = parseAll(paren, s).map(_.toMap)
}
It's not too bad, but I'd prefer to avoid the mutable state.
What I want
Haskell's Parsec library makes
adding user state to a parser trivially easy:
import Control.Applicative ((*>), (<$>), (<*))
import Data.Map (fromList)
import Text.Parsec
paren = do
  (s, m) <- char '(' *> contents <* char ')'
  h : t <- getState
  putState t
  return $ (h, s) : m
  where
    contents
      = flip (,) []
      <$> many1 digit
      <|> (\ps -> (map (fst . head) ps, concat ps))
      <$> many1 paren
main = print $
  runParser (fromList <$> paren) ['a'..'z'] "example" "((617)((0)(32)))"
This is a fairly straightforward translation of my Scala parser above, but without mutable state.
What I've tried
I'm trying to get as close to the Parsec solution as I can using Scalaz's state monad transformer, so instead of Parser[A] I'm working with StateT[Parser, Stream[Char], A].
I have a "solution" that allows me to write the following:
import scala.util.parsing.combinator._
import scalaz._, Scalaz._
object ParenParser extends ExtraStateParsers[Stream[Char]] with RegexParsers {
protected implicit def monadInstance = parserMonad(this)
def paren: ESP[List[(Char, String)]] =
(lift("(" ) ~> contents <~ lift(")")).flatMap {
case (s, m) => get.flatMap(
names => put(names.tail).map(_ => (names.head -> s) :: m)
)
}
def contents: ESP[(String, List[(Char, String)])] =
lift("\\d+".r ^^ (_ -> Nil)) | rep1(paren).map(
ps => ps.map(_.head._1).mkString -> ps.flatten
)
def parse(s: String, names: Stream[Char]) =
parseAll(paren.eval(names), s).map(_.toMap)
}
This works, and it's not that much less concise than either the mutable state version or the Parsec version.
But my ExtraStateParsers is ugly as sin—I don't want to try your patience more than I already have, so I won't include it here (although here's a link, if you really want it). I've had to write new versions of every Parser and Parsers method I use above
for my ExtraStateParsers and ESP types (rep1, ~>, <~, and |, in case you're counting). If I had needed to use other combinators, I'd have had to write new state transformer-level versions of them as well.
Is there a cleaner way to do this? I'd love to see an example of a Scalaz 7's state monad transformer being used to thread state through a parser, but Scalaz 6 or Haskell examples would also be useful and appreciated.
Probably the most general solution would be to rewrite Scala's parser library to accommodate monadic computations while parsing (like you partly did), but that would be quite a laborious task.
I suggest a solution using Scalaz's State where each of our results isn't a value of type Parser[X], but a value of type Parser[State[Stream[Char],X]] (aliased as ParserS[X]). So the overall parsed result isn't a value, but a monadic state value, which is then run on some Stream[Char]. This is almost a monad transformer, but we have to do the lifting/unlifting manually. It makes the code a bit uglier, as we sometimes need to lift values or use map/flatMap in several places, but I believe it's still reasonable.
import scala.util.parsing.combinator._
import scalaz._
import Scalaz._
import Traverse._
object ParenParser extends RegexParsers with States {
type S[X] = State[Stream[Char],X];
type ParserS[X] = Parser[S[X]];
// Haskell's `return` for States
def toState[S,X](x: X): State[S,X] = gets(_ => x)
// Haskell's `mapM` for State
def mapM[S,X](l: List[State[S,X]]): State[S,List[X]] =
l.traverse[({type L[Y] = State[S,Y]})#L,X](identity _);
// .................................................
// Read the next character from the stream inside the state
// and update the state to the stream's tail.
def next: S[Char] = state(s => (s.tail, s.head));
def paren: ParserS[List[(Char, String)]] =
"(" ~> contents <~ ")" ^^ (_ flatMap {
case (s, m) => next map (v => (v -> s) :: m)
})
def contents: ParserS[(String, List[(Char, String)])] = digits | parens;
def digits: ParserS[(String, List[(Char, String)])] =
"\\d+".r ^^ (_ -> Nil) ^^ (toState _)
def parens: ParserS[(String, List[(Char, String)])] =
rep1(paren) ^^ (mapM _) ^^ (_.map(
ps => ps.map(_.head._1).mkString -> ps.flatten
))
def parse(s: String): ParseResult[S[Map[Char,String]]] =
parseAll(paren, s).map(_.map(_.toMap))
def parse(s: String, names: Stream[Char]): ParseResult[Map[Char,String]] =
parse(s).map(_ ! names);
}
object ParenParserTest extends App {
{
println(ParenParser.parse("((617)((0)(32)))", Stream('a' to 'z': _*)));
}
}
Note: I believe that your approach with StateT[Parser, Stream[Char], _] isn't conceptually correct. The type says that we're constructing a parser given some state (a stream of names). So it would be possible that given different streams we get different parsers. This is not what we want to do. We only want that the result of parsing depends on the names, not the whole parser. In this way Parser[State[Stream[Char],_]] seems to be more appropriate (Haskell's Parsec takes a similar approach, the state/monad is inside the parser).
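The idea that parsing yields a state computation, which only consumes names once it is run, can be illustrated without Scalaz or the parser library. The following is a minimal sketch under stated assumptions: the hand-rolled `State` class, the `Paren`/`Leaf`/`Node` tree standing in for an already-parsed input, and the use of `List[Char]` instead of `Stream[Char]` are all hypothetical simplifications, not the answer's actual code.

```scala
object FreshNames {
  // A minimal state monad: a function from a state to (new state, result).
  final case class State[S, A](run: S => (S, A)) {
    def map[B](f: A => B): State[S, B] =
      State { s => val (s1, a) = run(s); (s1, f(a)) }
    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State { s => val (s1, a) = run(s); f(a).run(s1) }
  }

  type Names[A] = State[List[Char], A]
  type Bindings = List[(Char, String)]

  def pure[A](a: A): Names[A] = State(s => (s, a))

  // Take the next fresh name and leave the rest as the new state.
  val next: Names[Char] = State(s => (s.tail, s.head))

  // Hypothetical parse result: each parenthetical holds digits or sub-expressions.
  sealed trait Paren
  final case class Leaf(digits: String) extends Paren
  final case class Node(children: List[Paren]) extends Paren

  // Builds the naming computation; no names are consumed until it is run.
  def label(p: Paren): Names[Bindings] = p match {
    case Leaf(digits) => next.map(n => List(n -> digits))
    case Node(children) =>
      // Label the children left to right, threading the remaining names through.
      val kids: Names[List[Bindings]] =
        children.foldRight(pure(List.empty[Bindings])) { (child, rest) =>
          label(child).flatMap(h => rest.map(t => h :: t))
        }
      // Name this node after its children, using the children's names as its contents.
      kids.flatMap { ps =>
        next.map(n => (n -> ps.map(_.head._1).mkString) :: ps.flatten)
      }
  }

  def run(p: Paren, names: List[Char]): Map[Char, String] =
    label(p).run(names)._2.toMap
}
```

Here `label` plays the role of the value inside the parser: the same tree can be run against different name supplies without being rebuilt.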

Scala: 'implicit conversions are not applicable' in a simple for expression

I started out with Scala today, and I ran into an intriguing problem. I am running a for expression to iterate over the characters in a string, like such:
class Example {
def forString(s: String) = {
for (c <- s) {
// ...
}
}
}
and it is consistently failing with the message:
error: type mismatch;
found : Int
required: java.lang.Object
Note that implicit conversions are not applicable because they are ambiguous:
...
for (c <- s) {
^
one error found
I tried changing the loop to several things, including using the string's length and using hardcoded numbers (just for testing), but to no avail. Searching the web didn't yield anything either...
Edit: This code is the smallest I could reduce it to, while still yielding the error:
class Example {
def forString(s: String) = {
for (c <- s) {
println(String.format("%03i", c.toInt))
}
}
}
The error is the same as above, and happens at compile time. Running in the 'interpreter' yields the same.
Don't use the raw String.format method. Instead use the .format method on the implicitly converted RichString. It will box the primitives for you. i.e.
jem@Respect:~$ scala
Welcome to Scala version 2.8.0.final (Java HotSpot(TM) Client VM, Java 1.6.0_21).
Type in expressions to have them evaluated.
Type :help for more information.
scala> class Example {
| def forString(s: String) = {
| for (c <- s) {
| println("%03i".format(c.toInt))
| }
| }
| }
defined class Example
scala> new Example().forString("9")
java.util.UnknownFormatConversionException: Conversion = 'i'
Closer, but not quite. You might want to try "%03d" as your format string.
scala> "%03d".format("9".toInt)
res3: String = 009
Scala 2.8.1 produces the following, clearer error:
scala> class Example {
| def forString(s: String) = {
| for (c <- s) {
| println(String.format("%03i", c.toInt))
| }
| }
| }
<console>:8: error: type mismatch;
found : Int
required: java.lang.Object
Note: primitive types are not implicitly converted to AnyRef.
You can safely force boxing by casting x.asInstanceOf[AnyRef].
println(String.format("%03i", c.toInt))
^
Taking into account the other suggestion about String.format, here's the minimal fix for the above code:
scala> def forString(s: String) = {
| for (c: Char <- s) {
| println(String.format("%03d", c.toInt.asInstanceOf[AnyRef]))
| }}
forString: (s: String)Unit
scala> forString("ciao")
099
105
097
111
In this case, using the implicit format method is even better, but if you ever need to call a Java varargs method again, this solution always works.
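The two approaches from the answers above can be put side by side. This is a small sketch; the object and method names are made up for illustration, and `Int.box` is used here as an alternative to the `asInstanceOf[AnyRef]` cast shown earlier.

```scala
object FormatCodes {
  // Scala's format method on strings accepts Any*, so the Int is boxed for us.
  def viaScalaFormat(s: String): List[String] =
    s.toList.map(c => "%03d".format(c.toInt))

  // Java's String.format takes Object varargs, so the Int is boxed explicitly.
  def viaJavaFormat(s: String): List[String] =
    s.toList.map(c => String.format("%03d", Int.box(c.toInt)))
}
```

Both methods produce the zero-padded character codes; the second form is the one to reach for when a Java varargs method has no Scala wrapper.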
I tried your code (with an extra println) and it works in 2.8.1:
class Example {
| def forString(s:String) = {
| for (c <- s) {
| println(c)
| }
| }
| }
It can be used with:
new Example().forString("hello")
h
e
l
l
o

Grammars, Scala Parsing Combinators and Orderless Sets

I'm writing an application that will take in various "command" strings. I've been looking at the Scala combinator library to tokenize the commands. I find in a lot of cases I want to say: "These tokens are an orderless set, and so they can appear in any order, and some might not appear".
With my current knowledge of grammars I would have to define all combinations of sequences as such (pseudo grammar):
command = action~content
action = alphanum
content = (tokenA~tokenB~tokenC | tokenB~tokenC~tokenA | tokenC~tokenB~tokenA ....... )
So my question is, considering tokenA-C are unique, is there a shorter way to define a set of any order using a grammar?
You can use the "Parser.^?" operator to check a group of parse elements for duplicates.
def tokens = tokenA | tokenB | tokenC
def uniqueTokens = (tokens*) ^? (
{ case t if (t == t.distinct) => t },
{ "duplicate tokens found: " + _ })
Here is an example that allows you to enter any of the four stooges in any order, but fails to parse if a duplicate is encountered:
package blevins.example
import scala.util.parsing.combinator._
case class Stooge(name: String)
object StoogesParser extends RegexParsers {
def moe = "Moe".r
def larry = "Larry".r
def curly = "Curly".r
def shemp = "Shemp".r
def stooge = ( moe | larry | curly | shemp ) ^^ { case s => Stooge(s) }
def certifiedStooge = stooge | """\w+""".r ^? (
{ case s: Stooge => s },
{ "not a stooge: " + _ })
def stooges = (certifiedStooge*) ^? (
{ case x if (x == x.distinct) => x.toSet },
{ "duplicate stooge in: " + _ })
def parse(s: String): String = {
parseAll(stooges, new scala.util.parsing.input.CharSequenceReader(s)) match {
case Success(r,_) => r.mkString(" ")
case Failure(r,_) => "failure: " + r
case Error(r,_) => "error: " + r
}
}
}
And some example usage:
package blevins.example
object App extends scala.App {
def printParse(s: String): Unit = println(StoogesParser.parse(s))
printParse("Moe Shemp Larry")
printParse("Moe Shemp Shemp")
printParse("Curly Beyonce")
/* Output:
Stooge(Moe) Stooge(Shemp) Stooge(Larry)
failure: duplicate stooge in: List(Stooge(Moe), Stooge(Shemp), Stooge(Shemp))
failure: not a stooge: Beyonce
*/
}
There are ways around it. Take a look at the parser here, for example. It accepts 4 pre-defined numbers, which may appear in any order, but must appear once, and only once.
OTOH, you could write a combinator, if this pattern happens often:
def comb3[A](a: Parser[A], b: Parser[A], c: Parser[A]) =
a ~ b ~ c | a ~ c ~ b | b ~ a ~ c | b ~ c ~ a | c ~ a ~ b | c ~ b ~ a
I would not try to enforce this requirement syntactically. I'd write a production that admits multiple tokens from the set allowed and then use a non-parsing approach to ascertaining the acceptability of the keywords actually given. In addition to allowing a simpler grammar, it will allow you to more easily continue parsing after emitting a diagnostic about the erroneous usage.
Randall Schulz
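As a rough illustration of that suggestion, the grammar can accept any sequence of tokens and a plain function can check the result afterwards. The sketch below assumes the tokens have already been parsed into a list of strings; `OrderlessTokens`, `validate`, and the error message wording are hypothetical names chosen for this example.

```scala
object OrderlessTokens {
  // Accepts tokens in any order; rejects unknown or repeated ones after parsing.
  def validate(tokens: List[String], allowed: Set[String]): Either[String, Set[String]] = {
    val unknown = tokens.filterNot(allowed.contains)
    // Multiset difference leaves exactly the extra occurrences of repeated tokens.
    val duplicates = tokens.diff(tokens.distinct).distinct
    if (unknown.nonEmpty) Left("unknown tokens: " + unknown.mkString(", "))
    else if (duplicates.nonEmpty) Left("duplicate tokens: " + duplicates.mkString(", "))
    else Right(tokens.toSet)
  }
}
```

Because the check runs after parsing, the parser can keep going and report every problem it finds, rather than failing at the first token that appears out of place.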
I don't know what kind of constructs you want to support, but I gather you should be specifying a more specific grammar. From your comment to another answer:
todo message:link Todo class to database
I guess you don't want to accept something like
todo message:database Todo to link class
So you probably want to define some message-level keywords like "link" and "to"...
def token = alphanum~':'~ "link" ~ alphanum ~ "class" ~ "to" ~ alphanum
^^ { (a:String,b:String,c:String) => /* a == "message", b="Todo", c="database" */ }
I guess you would have to define your grammar at that level.
You could of course write a combination rule that does this for you if you encounter this situation frequently.
On the other hand, it may be possible to collapse tokenA..C into a single token and then differentiate between them inside the handler for that token.