I wrote a Scala parser for an in-house expression language that has double-quote-delimited string literals:
object MyParser extends JavaTokenParsers {
  lazy val strLiteral = "\"" ~> """[^"]*""".r <~ "\"" ^^ {
    case x ⇒ StringLiteral(x)
  }
  // ...
}
(The actual code is a bit different since I support "" as an escape sequence for a literal double quote. While this is not relevant for the discussion, it's the reason why I cannot just use JavaTokenParsers's stringLiteral).
I noticed that the parser fails to include whitespace at the beginning and at the end of a string:
"a" parsed as StringLiteral("a")
" a" parsed as StringLiteral("a")
"a " parsed as StringLiteral("a")
" a " parsed as StringLiteral("a")
I tried matching whitespace in the regex:
"\"" ~> """\s*[^"]*\s*""".r <~ "\""
and also using the explicit whiteSpace parser:
"\"" ~> whiteSpace.? ~ """[^"]*""".r ~ whiteSpace.? <~ "\""
but in both cases the automatic whitespace skipping has already consumed and discarded the spaces before my parsers get a chance to read and handle them.
I know that I can set skipWhitespace = false, but I prefer not to, since in general I want to allow arbitrary whitespace around tokens in this language.
What's a simple and clean strategy to include surrounding whitespace in string literals?
One option is to use a single regex for the whole string literal:
val stringLiteral: Parser[String] = """"([^"]*("")?)*"""".r
and then strip the surrounding quotes afterwards.
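For example, a minimal sketch of that idea which also handles the "" escape mentioned in the question (the quotedString name is mine, and StringLiteral is assumed to be the case class from the question):

import scala.util.parsing.combinator.JavaTokenParsers

case class StringLiteral(value: String)

object QuotedStringParser extends JavaTokenParsers {
  // One regex for the whole literal: opening quote, any mix of non-quote
  // characters and "" escape pairs, closing quote.
  val quotedString: Parser[StringLiteral] = """"([^"]*("")?)*"""".r ^^ { s =>
    // Strip the surrounding quotes, then turn each "" escape into a single quote.
    StringLiteral(s.substring(1, s.length - 1).replace("\"\"", "\""))
  }
}

// QuotedStringParser.parseAll(QuotedStringParser.quotedString, "\" a \"")
// succeeds with StringLiteral(" a "), surrounding whitespace included,
// because the whole literal is matched by a single regex and the automatic
// whitespace skipping only happens before it.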
Our parser requires the following two capabilities:
regex parsing;
an expanded set of characters treated as delimiters.
The quandary is that the former requires RegexParsers (or one of its derivatives, such as JavaTokenParsers), whereas the latter requires StandardTokenParsers.
The following code would be added to a subclass of StandardTokenParsers:
// Extend the set of delimiters
class LogLexical extends StdLexical {
  delimiters += (
    "|", ",", "<", ">", "/", "(", ")",
    ";", "%", "{", "}", ":", "[", "]", "&", "^", "~"
  )
}
// Use the custom lexer to scan the input text
override val lexical = new LogLexical

override def parse(input: String): Seq[LogLine] = synchronized {
  val tokens = new lexical.Scanner(input).asInstanceOf[Reader[Char]]
  phrase(allLogTypes)(tokens) match {
    case Success(plan, _) => plan
    case failureOrError   => sys.error(failureOrError.toString)
  }
}
However, the above approach cannot be used with RegexParsers: as far as I can tell, they do not support a custom lexer.
It is not clear how to obtain both capabilities simultaneously.
One difference between the two may impact the approach:
StandardTokenParsers is Char-based whereas RegexParsers is String-based.
But even so, there must be some set of delimiters used by RegexParsers: something has to define how the tilde (~) operator is applied, e.g.
val myParser = someRegexParser ~ someOtherRegexParser
So the tilde is presumably using a set of predefined delimiters to tell where one pattern ends and the next begins. But where is that set, and how can it be changed? Any guidance appreciated.
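For reference, RegexParsers does not actually keep a delimiter set: the only token-separation machinery it has is the whiteSpace regex (skipped before each literal or regex parser when skipWhitespace is true), and ~ simply runs the second parser at the position where the first one stopped. A small illustrative sketch (the names are mine, not from the question):

import scala.util.parsing.combinator.RegexParsers

object DelimiterFreeParser extends RegexParsers {
  // There is no `delimiters` collection here; anything that should act as a
  // delimiter has to be parsed explicitly, e.g. the "|" literal below.
  def name: Parser[String] = """[A-Za-z]+""".r
  def pair: Parser[(String, String)] = name ~ ("|" ~> name) ^^ { case a ~ b => (a, b) }
}

// DelimiterFreeParser.parseAll(DelimiterFreeParser.pair, "foo | bar")
// succeeds with (foo,bar)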
With scala-parser-combinators, I want to parse repeated content followed by a terminating string (end), but the preceding parser consumes the end contents. How can I fix this?
(Simply changing the regex to "[^z]".r is not a good answer.)
val input = "aaabbbzzz"
parseAll(body, input) // java.lang.IllegalStateException: `zzz' expected but end of source found

def body: Parser[List[String]] = content.+ <~ end
def content: Parser[String] = repChar      // in reality more complex, e.g. (repChar | foo | bar)
def repChar: Parser[String] = ".{3}".r     // "[^z]{3}".r would work here, but loses flexibility
def end: Parser[String] = "zzz"
I would like to try something like the following:
"""(.*)(?=zzz)""".r.into(str => ...check content.+ or not... <~ end)
that is, scan the input up to the end string, then parse that prefix with another parser.
Another way to fix this is to use the not combinator: you just need to check that what you are about to parse is not the end value.
The trick is that not does not consume any input, so if not(end) succeeds (meaning that end failed at this position), we have not yet reached the stopping condition and content can safely consume the next three characters.
As with the non-greedy approach linked in the comments, it will fail for input that has characters after "zzz" (such as "aaabbbzzzzzz").
But it may be sufficient for your use case, so you could give it a try with:
def body: Parser[List[String]] = rep(not(end) ~> content) <~ end
In fact this is a kind of takeUntil parser, because it parses with content repeatedly until you're able to parse with end.
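Put together, a self-contained sketch of that approach (the object name is mine; the parser definitions are taken from the question):

import scala.util.parsing.combinator.RegexParsers

object UntilEndParser extends RegexParsers {
  def end: Parser[String] = "zzz"
  def repChar: Parser[String] = ".{3}".r
  def content: Parser[String] = repChar
  // Consume content only while `end` does not match at the current position.
  def body: Parser[List[String]] = rep(not(end) ~> content) <~ end
}

// UntilEndParser.parseAll(UntilEndParser.body, "aaabbbzzz")
// succeeds with List(aaa, bbb)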
Why does this simple example of a scala combinator parser fail?
def test: Parser[String] = "< " ~> ident <~ " >"
When I provide the following string:
"< a >"
I get this error:
[1.8] failure: ` >' expected but `>' found
< a >
^
Why is it tripping up on the space?
You are probably using RegexParsers. In its documentation, you can find that:
The parsing methods call the method skipWhitespace (defaults to true)
and, if true, skip any whitespace before each parser is called.
To change this:
object MyParsers extends RegexParsers {
  override def skipWhitespace = false
  // your parsers...
}
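For instance, a small self-contained check of that fix (the AngleParsers name is mine; I extend JavaTokenParsers because it mixes in RegexParsers and already provides ident):

import scala.util.parsing.combinator.JavaTokenParsers

object AngleParsers extends JavaTokenParsers {
  override def skipWhitespace = false
  // With whitespace skipping disabled, the spaces inside the literals
  // "< " and " >" are matched explicitly instead of being skipped away.
  def test: Parser[String] = "< " ~> ident <~ " >"
}

// AngleParsers.parseAll(AngleParsers.test, "< a >") now succeeds with "a"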
I just started playing with parser combinators in Scala, but got stuck on a parser for sentences such as "I like Scala." (words end at a whitespace or at a period).
I started with the following implementation:
package example

import scala.util.parsing.combinator._

object Example extends RegexParsers {
  override def skipWhitespace = false

  def character: Parser[String] = """\w""".r

  def word: Parser[String] =
    rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))

  def sentence: Parser[List[String]] = rep(word) <~ "."
}

object Test extends App {
  val result = Example.parseAll(Example.sentence, "I like Scala.")
  println(result)
}
The idea behind using guard() is to have a period demarcate word endings without being consumed, so that the sentence parser can still see it. However, the parser gets stuck (adding log() reveals that it repeatedly tries the word and character parsers).
If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right and will not work if I try to add a parser for paragraphs (rep(sentence)), etc.
def word: Parser[String] =
rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ opt(".")
Any ideas what may be going on here?
However, the parser gets stuck (adding log() reveals that it repeatedly tries the word and character parsers).
The rep combinator corresponds to a * in Perl-style regex notation: it matches zero or more occurrences, so here word can succeed without consuming any input, and rep(word) then loops forever. I think you want it to match one or more characters; changing rep to rep1 (corresponding to + in Perl-style regex notation) should fix the problem.
However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:
object Example extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}
Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.
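As a quick sanity check, reusing the Test object from the question against this version:

object Test extends App {
  // Should now succeed and print something like: [1.14] parsed: List(I, like, Scala)
  println(Example.parseAll(Example.sentence, "I like Scala."))
}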
In Scala's parser combinators (JavaTokenParsers in particular) there is a definition, stringLiteral, that matches a Java-like string. Is there a way to convert the matched string literal into the String it denotes? For example, if I parse "Run \" run \\ run", I would want to convert the matched literal into Run " run \ run.
Also, is there a definition for string literals that supports triple quotes (""")?
I have a hunch you are asking a more complicated question, but just in case, the simple answer is to write your own parser and trim the quotes in the ^^ transformation.
In the REPL you can test it as follows:
import scala.util.parsing.combinator.JavaTokenParsers

object testParsers extends JavaTokenParsers {
  val aString: Parser[String] = stringLiteral ^^ {
    case s => s.substring(1, s.length - 1)
  }
}
testParsers.parseAll(testParsers.stringLiteral,""""Run \" run \\ run"""")
testParsers.parseAll(testParsers.aString,""""Run \" run \\ run"""")
I am not aware of any built-in triple-quote parser, so I guess you will have to roll your own.
For decoding the escape sequences, Apache Commons provides a useful method: StringEscapeUtils.unescapeJava.
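For completeness, a sketch combining the two ideas (the unescapedString name is mine, and I'm assuming the commons-lang3 flavour of StringEscapeUtils is on the classpath):

import org.apache.commons.lang3.StringEscapeUtils
import scala.util.parsing.combinator.JavaTokenParsers

object UnescapingParsers extends JavaTokenParsers {
  val unescapedString: Parser[String] = stringLiteral ^^ { s =>
    // Trim the surrounding quotes, then decode \" , \\ and the other Java escapes.
    StringEscapeUtils.unescapeJava(s.substring(1, s.length - 1))
  }
}

// UnescapingParsers.parseAll(UnescapingParsers.unescapedString, """"Run \" run \\ run"""")
// yields: Run " run \ run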