With scala-parser-combinators, I want to parse input that ends with a fixed postfix string (end), but the preceding parser consumes the end marker as well. How can I fix this?
(Just changing the regex to "[^z]".r is not a good answer.)
val input = "aaabbbzzz"
parseAll(body, input) // java.lang.IllegalStateException: `zzz' expected but end of source found
def body: Parser[List[String]] = content.+ <~ end
def content: Parser[String] = repChar // more complex like (repChar | foo | bar)
def repChar: Parser[String] = ".{3}".r // If change this like "[^z]{3}", but no flexibility.
def end: Parser[String] = "zzz"
I want to try something like the following:
"""(.*)(?=zzz)""".r.into(str => ...check content.+ or not... <~ end)
That is, search for everything up to the end string, then parse it with another parser.
Another way to fix this is to use the not combinator. You just need to check that what you are parsing is not the end value.
The trick is that not doesn't consume any input: if not(end) succeeds (meaning that end failed), we haven't reached the stopping condition yet, so we can parse the next three characters with the content parser.
As with the non-greedy approach linked in the comments, it will fail for an input that has more characters after "zzz" (such as "aaabbbzzzzzz").
But it may be sufficient for your use case, so you could give it a try with:
def body: Parser[List[String]] = rep(not(end) ~> content) <~ end
In fact this is a kind of takeUntil parser, because it parses with content repeatedly until you're able to parse with end.
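For reference, here is a minimal self-contained sketch of the not-based approach, assuming the parsers from the question live in an object extending RegexParsers. The object names TakeUntil and TakeUntilDemo are made up for illustration.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object; the parser definitions mirror the question.
object TakeUntil extends RegexParsers {
  def end: Parser[String] = "zzz"
  def repChar: Parser[String] = ".{3}".r
  def content: Parser[String] = repChar // could be (repChar | foo | bar)

  // not(end) succeeds without consuming input when end does NOT match here,
  // so content only runs while the end marker has not been reached.
  def body: Parser[List[String]] = rep(not(end) ~> content) <~ end
}

object TakeUntilDemo extends App {
  // Succeeds with the two content chunks "aaa" and "bbb".
  println(TakeUntil.parseAll(TakeUntil.body, "aaabbbzzz"))
  // Fails: the first "zzz" ends the repetition and the second one is left over.
  println(TakeUntil.parseAll(TakeUntil.body, "aaabbbzzzzzz"))
}
```

The second call illustrates the limitation mentioned above: the repetition stops at the first occurrence of the end marker, not the last.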
I am trying out scala parser combinators with the following object:
object LogParser extends JavaTokenParsers with PackratParsers {
Some of the parsers are working. But the following one is getting tripped up:
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)"""
The following input does not work:
09:58:24.608891
On reaching that line we get:
[2.22] failure: `([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)' expected but `:' found
09:58:24.608891
Note: I did verify correct behavior of that regex within the scala repl on the same input pattern.
val r = """([\d]{2}):([\d]{2}):([\d]{2}\.[\d]+)""".r
val s = """09:58:24.608891"""
val r(t,t2,t3) = s
t: String = 09
t2: String = 58
t3: String = 24.608891
So, as far as the parser combinators are concerned: is there an issue with the ":" token itself, i.e. do I need to create my own custom lexer and add ":" to lexical.delimiters?
Update: an answer suggested adding ".r". I had already tried that, but to be explicit: the following has the same behavior (it does not work):
def time = """([\d]{2}:[\d]{2}:[\d]{2}.[\d]+)""".r
I think you're just missing a .r at the end here, which is what turns the string literal into an actual Regex.
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)"""
It should be:
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)""".r
The first one expects the text to be exactly like the regex string literal (which obviously isn't present), the second one expects text that actually matches the regex. Both create a Parser[String], so it's not immediately obvious that something is missing.
There's an implicit conversion from java.lang.String to Parser[String], so that string literals can be used as parser combinators.
There's an implicit conversion from scala.util.matching.Regex to Parser[String], so that regex expressions can be used as parser combinators.
http://www.scala-lang.org/files/archive/api/2.11.2/scala-parser-combinators/#scala.util.parsing.combinator.RegexParsers
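To make the difference between the two implicit conversions concrete, here is a small sketch with a simpler pattern. The object and parser names are invented for this example.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical object contrasting the String and Regex conversions.
object LiteralVsRegex extends RegexParsers {
  // String -> Parser[String]: matches only the literal characters \d+
  def asLiteral: Parser[String] = """\d+"""
  // Regex -> Parser[String]: matches one or more digits
  def asRegex: Parser[String] = """\d+""".r
}

object LiteralVsRegexDemo extends App {
  println(LiteralVsRegex.parseAll(LiteralVsRegex.asLiteral, "123"))      // fails
  println(LiteralVsRegex.parseAll(LiteralVsRegex.asLiteral, """\d+""")) // succeeds
  println(LiteralVsRegex.parseAll(LiteralVsRegex.asRegex, "123"))        // succeeds
}
```

Both definitions have the same declared type, so the compiler cannot flag a missing .r.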
In the following Parser:
object Foo extends JavaTokenParsers {
def word(x: String) = s"\\b$x\\b".r
lazy val expr = aSentence | something
lazy val aSentence = noun ~ verb ~ obj
lazy val noun = word("noun")
lazy val verb = word("verb") | err("not a verb!")
lazy val obj = word("object")
lazy val something = word("FOO")
}
It will parse noun verb object.
scala> Foo.parseAll(Foo.expr, "noun verb object")
res1: Foo.ParseResult[java.io.Serializable] = [1.17] parsed: ((noun~verb)~object)
But when given a valid noun followed by an invalid verb, why doesn't err("not a verb!") return an Error with that particular error message?
scala> Foo.parseAll(Foo.expr, "noun vedsfasdf")
res2: Foo.ParseResult[java.io.Serializable] =
[1.6] failure: string matching regex `\bverb\b' expected but `v' found
noun vedsfasdf
^
credit: Thanks to Travis Brown for explaining the need for the word function here.
This question seems similar, but I'm not sure how to handle err with the ~ function.
Here's another question you might ask: why isn't it complaining that it expected the word "FOO" but got "noun"? After all, if it fails to parse aSentence, it's then going to try something.
The culprit should be obvious when you think about it: what in that source code is taking two Failure results and choosing one? | (aka append).
This method on Parser will feed the input to both parsers, and then call append on ParseResult. That method is abstract at that level, and defined on Success, Failure and Error in different ways.
On both Success and Error, it always takes this (that is, the result of the parser on the left). On Failure, though, it does something else:
case class Failure(override val msg: String, override val next: Input) extends NoSuccess(msg, next) {
/** The toString method of a Failure yields an error message. */
override def toString = "["+next.pos+"] failure: "+msg+"\n\n"+next.pos.longString
def append[U >: Nothing](a: => ParseResult[U]): ParseResult[U] = { val alt = a; alt match {
case Success(_, _) => alt
case ns: NoSuccess => if (alt.next.pos < next.pos) this else alt
}}
}
Or, in other words, if both sides have failed, then it will take the side that read the most of the input (which is why it won't complain about a missing FOO), but if both have read the same amount, it will give precedence to the second failure.
I do wonder if it shouldn't check whether the right side is an Error and, if so, return that. After all, if the left side is an Error, it always returns that. This looks suspicious to me, but maybe it's supposed to be that way. But I digress.
Back to the problem: it would seem that it should have gone with err, as both alternatives consumed the same amount of input, right? Well, here's the thing: RegexParsers skips whitespace first, but only in the parsers built from regexes and literal strings. It does not apply to any other method, including err.
That means that err's input is positioned at the whitespace, while the word's input is positioned at the word and is therefore further along in the input. Try this:
lazy val verb = word("verb") | " *".r ~ err("not a verb!")
Arguably, err ought to be overridden by RegexParsers to do the right thing (tm). Since Scala Parser Combinators is now a separate project, I suggest you open an issue and follow it up with a Pull Request implementing the change. It will have the impact of changing error messages for some parsers (well, that's the whole purpose of changing it :).
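Putting the workaround into the full grammar gives something like the sketch below. The object name FooWithErr and the explicit type annotations are additions for this example, and regex(...) is called explicitly instead of relying on the implicit conversion; otherwise it follows the question's code.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical variant of the question's Foo object with the whitespace workaround.
object FooWithErr extends JavaTokenParsers {
  def word(x: String): Parser[String] = s"\\b$x\\b".r

  lazy val expr: Parser[Any] = aSentence | something
  lazy val aSentence: Parser[String ~ String ~ String] = noun ~ verb ~ obj
  lazy val noun: Parser[String] = word("noun")
  // regex(" *".r) first skips leading whitespace like any regex parser, so err
  // is positioned at the same place where word("verb") reported its failure.
  lazy val verb: Parser[String] =
    word("verb") | (regex(" *".r) ~> err("not a verb!"))
  lazy val obj: Parser[String] = word("object")
  lazy val something: Parser[String] = word("FOO")
}

object FooWithErrDemo extends App {
  println(FooWithErr.parseAll(FooWithErr.expr, "noun verb object"))
  // Expected to report the custom "not a verb!" message.
  println(FooWithErr.parseAll(FooWithErr.expr, "noun vedsfasdf"))
}
```

Because an Error on the left of | is returned as-is, the something alternative is never tried once verb has produced the Error.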
I want a parser that matches if and only if the parsed String is contained by a given list of Strings.
def box: Parser[String] = // match if token is element of boxSyms: List[String]
Even after hours of searching the web, I have no idea how to achieve this. (Which makes me think I've looked for it the wrong way).
Edit:
This is only a snippet from a bigger parser. The input is going to be used in further parser combinators:
lazy val boxModal = box ~ formula ^^ {
case boxSym ~ formula => Box(boxSyms.get(boxSym).get, formula)
}
The problem is that the List boxSyms is unknown at compile time.
Maybe something like this would work:
lazy val boxModal = box ~ formula ^^ {
case boxSym ~ formula if boxSyms.contains(boxSym) =>
Box(boxSyms.get(boxSym).get, formula)
}
Or some other, more specific condition.
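One way to get a parser that fails cleanly (instead of throwing a MatchError from ^^ when the guard does not match) is the ^? combinator, which takes a partial function. The sketch below assumes that box symbols look like identifiers and that boxSyms is a List[String] passed in at runtime; the class name BoxParsers is made up.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical parser class; boxSyms is only known at runtime.
class BoxParsers(boxSyms: List[String]) extends JavaTokenParsers {
  // ^? turns the parse into a failure when the partial function is not
  // defined for the parsed token, i.e. when the token is not in boxSyms.
  def box: Parser[String] = ident ^? { case sym if boxSyms.contains(sym) => sym }
}

object BoxParsersDemo extends App {
  val parsers = new BoxParsers(List("K", "B"))
  println(parsers.parseAll(parsers.box, "K")) // succeeds
  println(parsers.parseAll(parsers.box, "X")) // fails: "X" is not a box symbol
}
```

If the symbols are not identifier-like, ident would have to be replaced by a token parser that fits them.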
I just started playing with parser combinators in Scala, but got stuck on a parser to parse sentences such as "I like Scala." (words end on a whitespace or a period (.)).
I started with the following implementation:
package example
import scala.util.parsing.combinator._
object Example extends RegexParsers {
override def skipWhitespace = false
def character: Parser[String] = """\w""".r
def word: Parser[String] =
rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ "."
}
object Test extends App {
val result = Example.parseAll(Example.sentence, "I like Scala.")
println(result)
}
The idea behind using guard() is to have a period demarcate word endings, but not consume it so that sentences can. However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser).
If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right and will not work if I try to add a parser for paragraphs (rep(sentence)) etc.
def word: Parser[String] =
rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ opt(".")
Any ideas what may be going on here?
However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser).
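The rep combinator corresponds to a * in perl-style regex notation. This means it matches zero or more characters. At the period, rep(character) succeeds with zero characters and guard(literal(".")) succeeds without consuming anything, so word succeeds on empty input and rep(word) keeps applying it at the same position forever. I think you want it to match one or more characters. Changing that to a rep1 (corresponding to + in perl-style regex notation) should fix the problem.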
However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:
object Example extends RegexParsers {
override def skipWhitespace = false
def word: Parser[String] = """\w+""".r
def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}
Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.
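The same pattern extends to paragraphs. The paragraph parser below is not part of the original answer; it is a sketch assuming that sentences are separated by whitespace. The object names are invented.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical extension of the answer's grammar with a paragraph rule.
object SentenceParser extends RegexParsers {
  override def skipWhitespace = false

  def word: Parser[String] = """\w+""".r
  // A non-empty list of words separated by whitespace, ended by a period.
  def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
  // Assumption: a paragraph is one or more sentences separated by whitespace.
  def paragraph: Parser[List[List[String]]] = rep1sep(sentence, whiteSpace)
}

object SentenceParserDemo extends App {
  println(SentenceParser.parseAll(SentenceParser.sentence, "I like Scala."))
  println(SentenceParser.parseAll(SentenceParser.paragraph, "I like Scala. It is fun."))
}
```

Since whitespace is only consumed as a separator, a trailing space after the last period would make parseAll fail with this grammar.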
I'm trying to define a grammar for the commands below.
object ParserWorkshop {
def main(args: Array[String]) = {
ChoiceParser("todo link todo to database")
ChoiceParser("todo link todo to database deadline: next tuesday context: app.model")
}
}
The second command should be tokenized as:
action = todo
message = link todo to database
properties = [deadline: next tuesday, context: app.model]
When I run this input on the grammar defined below, I receive the following error message:
[1.27] parsed: Command(todo,link todo to database,List())
[1.36] failure: string matching regex `\z' expected but `:' found
todo link todo to database deadline: next tuesday context: app.model
^
As far as I can see, it fails because the pattern for matching the words of the message is nearly identical to the pattern for the key of the property key:value pair, so the parser cannot tell where the message ends and the property starts. I can solve this by insisting that a start token be used for each property, like so:
todo link todo to database :deadline: next tuesday :context: app.model
But I would prefer to keep the command as close to natural language as possible.
I have two questions:
What does the error message actually mean?
And how would I modify the existing grammar to work for the given input strings?
import scala.util.parsing.combinator._
case class Command(action: String, message: String, properties: List[Property])
case class Property(name: String, value: String)
object ChoiceParser extends JavaTokenParsers {
def apply(input: String) = println(parseAll(command, input))
def command = action~message~properties ^^ {case a~m~p => new Command(a, m, p)}
def action = ident
def message = """[\w\d\s\.]+""".r
def properties = rep(property)
def property = propertyName~":"~propertyValue ^^ {
case n~":"~v => new Property(n, v)
}
def propertyName: Parser[String] = ident
def propertyValue: Parser[String] = """[\w\d\s\.]+""".r
}
It is really simple. When you use ~, you have to understand that there's no backtracking into individual parsers which have completed successfully.
So, for instance, message got everything up to before the colon, as all of that is an acceptable pattern. Next, properties is a rep of property, which requires propertyName, but it only finds the colon (the first char not gobbled by message). So propertyName fails, and property fails. Now, properties, as mentioned, is a rep, so it finishes successfully with 0 repetitions, which then makes command finish successfully.
So, back to parseAll. The command parser returned successfully, having consumed everything before the colon. It then asks the question: are we at the end of the input (\z)? No, because a colon comes next. So, it expected end-of-input, but got a colon.
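This can be observed directly by running the command parser with parse instead of parseAll, which does not require the whole input to be consumed. The sketch below reuses the question's grammar with explicit types; the object names OriginalGrammar and OriginalGrammarDemo are made up.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

case class Property(name: String, value: String)
case class Command(action: String, message: String, properties: List[Property])

// The question's grammar, unchanged apart from type annotations.
object OriginalGrammar extends JavaTokenParsers {
  def command: Parser[Command] =
    action ~ message ~ properties ^^ { case a ~ m ~ p => Command(a, m, p) }
  def action: Parser[String] = ident
  def message: Parser[String] = """[\w\d\s\.]+""".r
  def properties: Parser[List[Property]] = rep(property)
  def property: Parser[Property] =
    propertyName ~ (literal(":") ~> propertyValue) ^^ { case n ~ v => Property(n, v) }
  def propertyName: Parser[String] = ident
  def propertyValue: Parser[String] = """[\w\d\s\.]+""".r
}

object OriginalGrammarDemo extends App {
  val input = "todo link todo to database deadline: next tuesday context: app.model"
  val result = OriginalGrammar.parse(OriginalGrammar.command, input)
  // message has swallowed "deadline" as well, and no property was parsed.
  println(result.get)
  // The parser stopped at the first colon (offset 35, i.e. column 36).
  println(result.next.offset)
}
```

The offset matches the position [1.36] reported in the error message from parseAll.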
You'll have to change the regex so it won't consume the last identifier before a colon (propertyValue uses the same pattern, so it needs the same change). For example:
def message = """[\w\d\s\.]+(?![:\w])""".r
By the way, when you use def you force the expression to be reevaluated. In other words, each of these defs creates a parser every time it is called. The regular expressions are instantiated every time the parsers they belong to are processed. If you change everything to val (or lazy val, where a definition refers to a parser declared later in the object), you'll get much better performance.
Remember, these things define the parser, they do not run it. It is parseAll which runs a parser.
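Combining both suggestions, a revised grammar might look like the sketch below. It is one possible reading of the advice, not a definitive fix: the lookahead is applied to propertyValue as well as message (otherwise the first value would swallow the next property name), and lazy val is used so that forward references between parsers are safe. The names FixedGrammar and words are invented.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

case class Property(name: String, value: String)
case class Command(action: String, message: String, properties: List[Property])

// Hypothetical revised grammar based on the question's ChoiceParser.
object FixedGrammar extends JavaTokenParsers {
  // Words that must not be followed by a colon or by further word characters,
  // so the identifier right before a colon is left for propertyName.
  private val words = """[\w\d\s\.]+(?![:\w])""".r

  lazy val command: Parser[Command] =
    action ~ message ~ properties ^^ { case a ~ m ~ p => Command(a, m, p) }
  lazy val action: Parser[String] = ident
  lazy val message: Parser[String] = words
  lazy val properties: Parser[List[Property]] = rep(property)
  lazy val property: Parser[Property] =
    propertyName ~ (literal(":") ~> propertyValue) ^^ { case n ~ v => Property(n, v) }
  lazy val propertyName: Parser[String] = ident
  lazy val propertyValue: Parser[String] = words
}

object FixedGrammarDemo extends App {
  println(FixedGrammar.parseAll(FixedGrammar.command, "todo link todo to database"))
  println(FixedGrammar.parseAll(FixedGrammar.command,
    "todo link todo to database deadline: next tuesday context: app.model"))
}
```

With this grammar a property name is always a single identifier, so a message whose last word is directly followed by a colon will be read as the start of a property.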