Parsing sentences using Scala parser combinator

I just started playing with parser combinators in Scala, but got stuck on a parser for sentences such as "I like Scala." (a word ends at whitespace or at a period).
I started with the following implementation:
package example
import scala.util.parsing.combinator._
object Example extends RegexParsers {
  override def skipWhitespace = false
  def character: Parser[String] = """\w""".r
  def word: Parser[String] =
    rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))
  def sentence: Parser[List[String]] = rep(word) <~ "."
}
object Test extends App {
  val result = Example.parseAll(Example.sentence, "I like Scala.")
  println(result)
}
The idea behind using guard() is to let a period mark the end of a word without consuming it, so that the sentence parser can consume it instead. However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parsers).
If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right and will not work if I try to add a parser for paragraphs (rep(sentence)) and so on.
def word: Parser[String] =
  rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ opt(".")
Any ideas what may be going on here?

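You wrote: "However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser)."
The rep combinator corresponds to * in Perl-style regex notation, so it matches zero or more repetitions. At the period, rep(character) matches zero characters and guard(literal(".")) succeeds without consuming anything, so word succeeds on empty input and rep(word) keeps retrying at the same position. You want a word to contain one or more characters. Changing the inner rep to rep1 (corresponding to + in Perl-style regex notation) should fix the problem.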
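As a minimal sketch of that fix, here is the grammar from the question with only the inner rep changed to rep1. The object name Rep1Example and the run helper are made up for illustration, and the scala-parser-combinators library is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator.RegexParsers

// Sketch: the question's grammar with the inner rep swapped for rep1.
object Rep1Example extends RegexParsers {
  override def skipWhitespace = false

  def character: Parser[String] = """\w""".r

  // rep1 needs at least one character, so word can no longer
  // succeed without consuming input.
  def word: Parser[String] =
    rep1(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))

  def sentence: Parser[List[String]] = rep(word) <~ "."

  // Hypothetical helper: the parsed words, or None if parsing fails.
  def run(input: String): Option[List[String]] =
    parseAll(sentence, input) match {
      case Success(words, _) => Some(words)
      case _                 => None
    }
}
```

With this change, Rep1Example.run("I like Scala.") should terminate and yield the three words, because rep(word) stops as soon as word fails at the period.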
However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:
object Example extends RegexParsers {
  override def skipWhitespace = false
  def word: Parser[String] = """\w+""".r
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}
Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.
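The question also mentions adding a paragraph parser. One possible way to extend the grammar above is sketched here. The paragraph rule is an assumption (sentences separated by whitespace), not something the question specifies, and the names SentenceParser and run are hypothetical.

```scala
import scala.util.parsing.combinator.RegexParsers

object SentenceParser extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r

  // A non-empty list of words separated by whitespace, closed by a period.
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."

  // Assumption: a paragraph is one or more sentences separated by whitespace.
  def paragraph: Parser[List[List[String]]] = rep1sep(sentence, whiteSpace)

  // Hypothetical helper: the parsed sentences, or None if parsing fails.
  def run(input: String): Option[List[List[String]]] =
    parseAll(paragraph, input) match {
      case Success(sentences, _) => Some(sentences)
      case _                     => None
    }
}
```

Because every sentence consumes its own period, the same rep1sep pattern can be reused one level up without any lookahead.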

Related

Using regex parser within a JavaTokenParsers subclass

I am trying out scala parser combinators with the following object:
object LogParser extends JavaTokenParsers with PackratParsers {
Some of the parsers are working. But the following one is getting tripped up:
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)"""
The following input does not work:
09:58:24.608891
On reaching that line we get:
[2.22] failure: `([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)' expected but `:' found
09:58:24.608891
Note: I did verify correct behavior of that regex within the scala repl on the same input pattern.
val r = """([\d]{2}):([\d]{2}):([\d]{2}\.[\d]+)""".r
val s = """09:58:24.608891"""
val r(t,t2,t3) = s
t: String = 09
t2: String = 58
t3: String = 24.608891
So, as far as the parser combinator is concerned: is there an issue with the ":" token itself, i.e. do I need to create my own custom Lexer and add ":" to lexical.delimiters?
Update: an answer suggested adding ".r". I had already tried that, but to be explicit: the following has the same behavior (does not work):
def time = """([\d]{2}:[\d]{2}:[\d]{2}.[\d]+)""".r

I think you're just missing a .r at the end here to actually get a Regex as opposed to a string literal.
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)"""
it should be
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)""".r
The first one expects the text to be exactly like the regex string literal (which obviously isn't present), the second one expects text that actually matches the regex. Both create a Parser[String], so it's not immediately obvious that something is missing.
There's an implicit conversion from java.lang.String to Parser[String], so that string literals can be used as parser combinators.
There's an implicit conversion from scala.util.matching.Regex to Parser[String], so that regex expressions can be used as parser combinators.
http://www.scala-lang.org/files/archive/api/2.11.2/scala-parser-combinators/#scala.util.parsing.combinator.RegexParsers
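The difference between the two implicit conversions can be seen in a small sketch. It uses plain RegexParsers instead of the question's JavaTokenParsers with PackratParsers to keep the example short. The object and method names are hypothetical, and the pattern is the one from the question without its outer group.

```scala
import scala.util.parsing.combinator.RegexParsers

object TimeParser extends RegexParsers {
  // Without .r the String is converted to a literal parser:
  // it only accepts the pattern text itself, character for character.
  def timeLiteral: Parser[String] = """[\d]{2}:[\d]{2}:[\d]{2}\.[\d]+"""

  // With .r the Regex is converted to a parser that accepts
  // any input matching the pattern.
  def timeRegex: Parser[String] = """[\d]{2}:[\d]{2}:[\d]{2}\.[\d]+""".r

  // Hypothetical helpers for trying both parsers on the same input.
  def literalAccepts(input: String): Boolean = parseAll(timeLiteral, input).successful
  def regexAccepts(input: String): Boolean = parseAll(timeRegex, input).successful
}
```

Here regexAccepts("09:58:24.608891") should be true while literalAccepts on the same input is false. This sketch only illustrates the missing .r and does not explain why the asker's updated version with .r still failed inside the larger grammar.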

Make parser include surrounding whitespace in string literals

I wrote a Scala parser for an in-house expression language that has double quote-delimited string literals:
object MyParser extends JavaTokenParsers {
  lazy val strLiteral = "\"" ~> """[^"]*""".r <~ "\"" ^^ {
    case x ⇒ StringLiteral(x)
  }
  // ...
}
(The actual code is a bit different since I support "" as an escape sequence for a literal double quote. While this is not relevant for the discussion, it's the reason why I cannot just use JavaTokenParsers's stringLiteral).
I noticed that the parser fails to include whitespace at the beginning and at the end of a string:
"a" parsed as StringLiteral("a")
" a" parsed as StringLiteral("a")
"a " parsed as StringLiteral("a")
" a " parsed as StringLiteral("a")
I tried matching whitespace in the regex:
"\"" ~> """\s*[^"]*\s*""".r <~ "\""
and also using the explicit whiteSpace parser:
"\"" ~> whiteSpace.? ~ """[^"]*""".r ~ whiteSpace.? <~ "\""
but in both cases the ~> operator has already consumed and ignored the spaces before there's a chance to read and handle them.
I know that I can set skipWhitespace = false, but I prefer not to, since in general I want to allow arbitrary whitespace around tokens in this language.
What's a simple and clean strategy to include surrounding whitespace in string literals?

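One option is to use a single regex for the whole string literal, quotes included:
val stringLiteral: Parser[String] = """"([^"]*("")?)*"""".r
and then strip the surrounding quotes from the match afterwards. Whitespace is only skipped before the regex is applied, so any whitespace inside the quotes is part of the match.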
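A minimal sketch of the strip step, assuming the "" escape described in the question. It uses RegexParsers directly and returns a plain String rather than the question's StringLiteral node, and the names StringLiteralParser and run are hypothetical.

```scala
import scala.util.parsing.combinator.RegexParsers

object StringLiteralParser extends RegexParsers {
  // One regex covers the opening quote, the body (with "" as an escaped
  // quote) and the closing quote, so whitespace inside the quotes is kept.
  def stringLiteral: Parser[String] =
    """"([^"]*("")?)*"""".r ^^ { raw =>
      raw.substring(1, raw.length - 1) // drop the surrounding quotes
        .replace("\"\"", "\"")         // unescape doubled quotes
    }

  // Hypothetical helper: the literal's contents, or None if parsing fails.
  def run(input: String): Option[String] =
    parseAll(stringLiteral, input) match {
      case Success(value, _) => Some(value)
      case _                 => None
    }
}
```

Whitespace skipping stays enabled, so whitespace around the literal is still ignored while whitespace between the quotes survives.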

Scala - Parse String until end of string

With scala-parser-combinators, I want to parse content that is followed by a fixed terminator string (end), but the preceding parser also consumes the terminator. How can I fix this?
(Just changing the regex to "[^z]".r is not a good answer.)
val input = "aaabbbzzz"
parseAll(body, input) // java.lang.IllegalStateException: `zzz' expected but end of source found
def body: Parser[List[String]] = content.+ <~ end
def content: Parser[String] = repChar // more complex like (repChar | foo | bar)
def repChar: Parser[String] = ".{3}".r // Changing this to "[^z]{3}" would work, but it is not flexible.
def end: Parser[String] = "zzz"
I would like to do something like the following:
"""(.*)(?=zzz)""".r.into(str => ...check content.+ or not... <~ end)
That is, search the input up to the end string, then parse that part with another parser.

Another way to fix this is to use the not combinator. You just need to check that what you are about to parse is not the end value.
The trick is that not doesn't consume any input. If not(end) succeeds (meaning that end failed), we have not reached the stopping condition yet, so we can parse the next three characters with the content parser.
As with the non-greedy approach linked in the comments, it will fail for input that has characters after "zzz" (such as "aaabbbzzzzzz").
But it may be sufficient for your use case, so you could give it a try with:
def body: Parser[List[String]] = rep(not(end) ~> content) <~ end
In fact this is a kind of takeUntil parser, because it parses with content repeatedly until you're able to parse with end.
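Put together with the definitions from the question, this could look like the following sketch. The object name UntilEnd and the run helper are hypothetical. The sketch uses rep as in the answer, so an input consisting only of "zzz" yields an empty list.

```scala
import scala.util.parsing.combinator.RegexParsers

object UntilEnd extends RegexParsers {
  def end: Parser[String] = "zzz"
  def content: Parser[String] = ".{3}".r

  // not(end) succeeds only where end would fail and consumes nothing,
  // so content is applied repeatedly until the terminator is next.
  def body: Parser[List[String]] = rep(not(end) ~> content) <~ end

  // Hypothetical helper: the parsed chunks, or None if parsing fails.
  def run(input: String): Option[List[String]] =
    parseAll(body, input) match {
      case Success(chunks, _) => Some(chunks)
      case _                  => None
    }
}
```

For "aaabbbzzz" this should give the chunks "aaa" and "bbb", while "aaabbbzzzzzz" is rejected because input remains after the first "zzz".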

Scala PackratParsers: backtracking seems not to work

The following scala code fails to work as expected:
import scala.util.parsing.combinator.PackratParsers
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.combinator.lexical.StdLexical
object Minimal extends StandardTokenParsers with PackratParsers {
  override val lexical = new StdLexical
  lexical.delimiters += ("<", "(", ")")
  lazy val expression: PackratParser[Any] = (
      numericLit
    | numericLit ~ "<" ~ numericLit
  )
  def parseAll[T](p: PackratParser[T], in: String): ParseResult[T] =
    phrase(p)(new PackratReader(new lexical.Scanner(in)))
  def main(args: Array[String]) = println(parseAll(expression, "2 < 4"))
}
I get the error message:
[1.3] failure: end of input expected
2 < 4
^
If however I change the definition of "expression" to
lazy val expression: PackratParser[Any] = (
    numericLit ~ "<" ~ numericLit
  | numericLit
)
the problem disappears.
The problem seems to be that with the original definition code for "expression" the first rule consisting only of "numericLit" is applied, such that the parser indeed expects the input to end immediately afterwards. I do not understand why the parser does not backtrack as soon as it notices that the input does not indeed end; scala PackratParsers are supposed to be backtracking, and I also made sure to replace "def" by "lazy val" as suggested in the answer to another question.

The reason you are seeing this behaviour is that the alternation operator (vertical bar) is designed to accept the first of its alternatives that succeeds. In your case numericLit succeeds so the alternation never considers other alternatives.
With this kind of grammar specification you have to be careful if one alternative can match a prefix of another. As you've seen, the longer alternative should be placed earlier in the alternatives, otherwise it can never succeed.
If you wish the shorter alternative to match only if there is no more input after it, then you could try using the not combinator to express that extra condition. However, this approach will cause problems if expression is intended to be used inside other constructs.
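The effect of alternative order can be reproduced in a small sketch. It swaps the question's StandardTokenParsers setup for plain RegexParsers to stay short, so num stands in for numericLit, and all names here are hypothetical.

```scala
import scala.util.parsing.combinator.RegexParsers

object OrderedChoice extends RegexParsers {
  def num: Parser[String] = """\d+""".r

  // The short alternative comes first: it succeeds on "2", so the
  // longer alternative is never tried.
  def shortFirst: Parser[Any] = num | num ~ "<" ~ num

  // The long alternative comes first: the short one is only a fallback.
  def longFirst: Parser[Any] = num ~ "<" ~ num | num

  // Hypothetical helper: does p accept the whole input?
  def accepts(p: Parser[Any], input: String): Boolean =
    parseAll(p, input).successful
}
```

With these definitions, accepts(shortFirst, "2 < 4") should be false, while longFirst accepts both "2 < 4" and "2".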

This has nothing to do with packrat parsing.
What you need to know is that in a PEG, the choice operator selects the first alternative that matches, which is numericLit in your case.

Scala parser-combinators: how to invert matches?

Is it possible to invert matches with Scala parser combinators? I am trying to match lines with a parser that do not start with a set of keywords. I could do this with an annoying zero width negative lookahead regular expression (e.g. "(?!h1|h2).*"), but I'd rather do it with a Scala parser. The best I've been able to come up with is this:
def keyword = "h1." | "h2."
def alwaysfails = "(?=a)b".r
def linenotstartingwithkeyword = keyword ~! alwaysfails | ".*".r
The idea here is that I use ~! to forbid backtracking to the all-matching regex, and then continue with a regex "(?=a)b".r that matches nothing. (By the way, is there a predefined parser that always fails?) That way the line is not matched if a keyword is found, but is matched if keyword does not match.
I am wondering if there is a better way to do this. Is there?

You can use not here:
import scala.util.parsing.combinator._
object MyParser extends RegexParsers {
  val keyword = "h1." | "h2."
  val lineNotStartingWithKeyword = not(keyword) ~> ".*".r
  def apply(s: String) = parseAll(lineNotStartingWithKeyword, s)
}
Now:
scala> MyParser("h1. test")
res0: MyParser.ParseResult[String] =
[1.1] failure: Expected failure
h1. test
^
scala> MyParser("h1 test")
res1: MyParser.ParseResult[String] = [1.8] parsed: h1 test
Note that there is also a failure method on Parsers, so you could just as well have written your version with keyword ~! failure("keyword!"). But not's a lot nicer, anyway.
