I am trying out scala parser combinators with the following object:
object LogParser extends JavaTokenParsers with PackratParsers {
Some of the parsers are working. But the following one is getting tripped up:
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)"""
Following is the input not working:
09:58:24.608891
On reaching that line we get:
[2.22] failure: `([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)' expected but `:' found
09:58:24.608891
Note: I did verify correct behavior of that regex within the scala repl on the same input pattern.
val r = """([\d]{2}):([\d]{2}):([\d]{2}\.[\d]+)""".r
val s = """09:58:24.608891"""
val r(t,t2,t3) = s
t: String = 09
t2: String = 58
t3: String = 24.608891
So.. AFA parser combinator: is there an issue with the ":" token itself - i.e. need to create my own custom Lexer and add ":" to lexical.delimiters?
Update an answer was provided to add ".r". I had already tried that- but in any case to be explicit: the following has the same behavior (does not work):
def time = """([\d]{2}:[\d]{2}:[\d]{2}.[\d]+)""".r
I think you're just missing an .r at the end here to actually have a Regex as opposed to a string literal.
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)"""
it should be
def time = """([\d]{2}:[\d]{2}:[\d]{2}\.[\d]+)""".r
The first one expects the text to be exactly like the regex string literal (which obviously isn't present), the second one expects text that actually matches the regex. Both create a Parser[String], so it's not immediately obvious that something is missing.
There's an implicit conversion from java.lang.String to Parser[String], so that string literals can be used as parser combinators.
There's an implicit conversion from scala.util.matching.Regex to > Parser[String], so that regex expressions can be used as parser combinators.
http://www.scala-lang.org/files/archive/api/2.11.2/scala-parser-combinators/#scala.util.parsing.combinator.RegexParsers
Related
In the following Parser:
object Foo extends JavaTokenParsers {
def word(x: String) = s"\\b$x\\b".r
lazy val expr = aSentence | something
lazy val aSentence = noun ~ verb ~ obj
lazy val noun = word("noun")
lazy val verb = word("verb") | err("not a verb!")
lazy val obj = word("object")
lazy val something = word("FOO")
}
It will parse noun verb object.
scala> Foo.parseAll(Foo.expr, "noun verb object")
res1: Foo.ParseResult[java.io.Serializable] = [1.17] parsed: ((noun~verb)~object)
But, when entering a valid noun, but an invalid verb, why won't the err("not a verb!") return an Error with that particular error message?
scala> Foo.parseAll(Foo.expr, "noun vedsfasdf")
res2: Foo.ParseResult[java.io.Serializable] =
[1.6] failure: string matching regex `\bverb\b' expected but `v' found
noun vedsfasdf
^
credit: Thanks to Travis Brown for explaining the need for the word function here.
This question seems similar, but I'm not sure how to handle err with the ~ function.
Here's another question you might ask: why isn't it complaining that it expected the word "FOO" but got "noun"? After all, if it fails to parse aSentence, it's then going to try something.
The culprit should be obvious when you think about it: what in that source code is taking two Failure results and choosing one? | (aka append).
This method on Parser will feed the input to both parsers, and then call append on ParseResult. That method is abstract at that level, and defined on Success, Failure and Error in different ways.
On both Success and Error, it always take this (that is, the parser on the left). On Failure, though, it does something else:
case class Failure(override val msg: String, override val next: Input) extends NoSuccess(msg, next) {
/** The toString method of a Failure yields an error message. */
override def toString = "["+next.pos+"] failure: "+msg+"\n\n"+next.pos.longString
def append[U >: Nothing](a: => ParseResult[U]): ParseResult[U] = { val alt = a; alt match {
case Success(_, _) => alt
case ns: NoSuccess => if (alt.next.pos < next.pos) this else alt
}}
}
Or, in other words, if both sides have failed, then it will take the side that read the most of the input (which is why it won't complain about a missing FOO), but if both have read the same amount, it will give precedence to the second failure.
I do wonder if it shouldn't check whether the right side is an Error, and, if so, return that. After all, if the left side is an Error, it always return that. This look suspicious to me, but maybe it's supposed to be that way. But I digress.
Back to the problem, it would seem that it should have gone with err, as they both consumed the same amount of input, right? Well... Here's the thing: regex parsers skip whiteSpace first, but that's for regex literals and literal strings. It does not apply over all other methods, including err.
That means that err's input is at the whitespace, while the word's input is at the word, and, therefore, further on the input. Try this:
lazy val verb = word("verb") | " *".r ~ err("not a verb!")
Arguably, err ought to be overridden by RegexParsers to do the right thing (tm). Since Scala Parser Combinators is now a separate project, I suggest you open an issue and follow it up with a Pull Request implementing the change. It will have the impact of changing error messages for some parser (well, that's the whole purpose of changing it :).
String interpolation is available in Scala starting Scala 2.10
This is the basic example
val name = "World" //> name : String = World
val message = s"Hello $name" //> message : String = Hello World
I was wondering if there is a way to do dynamic interpolation, e.g. the following (doesn't compile, just for illustration purposes)
val name = "World" //> name : String = World
val template = "Hello $name" //> template : String = Hello $name
//just for illustration:
val message = s(template) //> doesn't compile (not found: value s)
Is there a way to "dynamically" evaluate a String like that? (or is it inherently wrong / not possible)
And what is s exactly? it's not a method def (apparently it is a method on StringContext), and not an object (if it was, it would have thrown a different compile error than not found I think)
s is actually a method on StringContext (or something which can be implicitly converted from StringContext). When you write
whatever"Here is text $identifier and more text"
the compiler desugars it into
StringContext("Here is text ", " and more text").whatever(identifier)
By default, StringContext gives you s, f, and raw* methods.
As you can see, the compiler itself picks out the name and gives it to the method. Since this happens at compile time, you can't sensibly do it dynamically--the compiler doesn't have information about variable names at runtime.
You can use vars, however, so you can swap in values that you want. And the default s method just calls toString (as you'd expect) so you can play games like
class PrintCounter {
var i = 0
override def toString = { val ans = i.toString; i += 1; ans }
}
val pc = new PrintCounter
def pr[A](a: A) { println(s"$pc: $a") }
scala> List("salmon","herring").foreach(pr)
1: salmon
2: herring
(0 was already called by the REPL in this example).
That's about the best you can do.
*raw is broken and isn't slated to be fixed until 2.10.1; only text before a variable is actually raw (no escape processing). So hold off on using that one until 2.10.1 is out, or look at the source code and define your own. By default, there is no escape processing, so defining your own is pretty easy.
Here is a possible solution to #1 in the context of the original question based on Rex's excellent answer
val name = "World" //> name: String = World
val template = name=>s"Hello $name" //> template: Seq[Any]=>String = <function1>
val message = template(name) //> message: String = Hello World
String interpolation happens at compile time, so the compiler does not generally have enough information to interpolate s(str). It expects a string literal, according to the SIP.
Under Advanced Usage in the documentation you linked, it is explained that an expression of the form id"Hello $name ." is translated at compile time to new StringContext("Hello", "."). id(name).
Note that id can be a user-defined interpolator introduced through an implicit class. The documentation gives an example for a json interpolator,
implicit class JsonHelper(val sc: StringContext) extends AnyVal {
def json(args: Any*): JSONObject = {
...
}
}
This is inherently impossible in the current implementation: local variable names are not available at execution time -- may be kept around as debug symbols, but can also have been stripped. (Member variable names are, but that's not what you're describing here).
Is it possible to invert matches with Scala parser combinators? I am trying to match lines with a parser that do not start with a set of keywords. I could do this with an annoying zero width negative lookahead regular expression (e.g. "(?!h1|h2).*"), but I'd rather do it with a Scala parser. The best I've been able to come up with is this:
def keyword = "h1." | "h2."
def alwaysfails = "(?=a)b".r
def linenotstartingwithkeyword = keyword ~! alwaysfails | ".*".r
The idea is here that I use ~! to forbid backtracking to the all-matching regexp, and then continue with a regex "(?=a)b".r that matches nothing. (By the way, is there a predefined parser that always fails?) That way the line would not be matched if a keyword is found but would be matched if keyword does not match.
I am wondering if there is a better way to do this. Is there?
You can use not here:
import scala.util.parsing.combinator._
object MyParser extends RegexParsers {
val keyword = "h1." | "h2."
val lineNotStartingWithKeyword = not(keyword) ~> ".*".r
def apply(s: String) = parseAll(lineNotStartingWithKeyword, s)
}
Now:
scala> MyParser("h1. test")
res0: MyParser.ParseResult[String] =
[1.1] failure: Expected failure
h1. test
^
scala> MyParser("h1 test")
res1: MyParser.ParseResult[String] = [1.8] parsed: h1 test
Note that there is also a failure method on Parsers, so you could just as well have written your version with keyword ~! failure("keyword!"). But not's a lot nicer, anyway.
I'm trying to define a grammar for the commands below.
object ParserWorkshop {
def main(args: Array[String]) = {
ChoiceParser("todo link todo to database")
ChoiceParser("todo link todo to database deadline: next tuesday context: app.model")
}
}
The second command should be tokenized as:
action = todo
message = link todo to database
properties = [deadline: next tuesday, context: app.model]
When I run this input on the grammar defined below, I receive the following error message:
[1.27] parsed: Command(todo,link todo to database,List())
[1.36] failure: string matching regex `\z' expected but `:' found
todo link todo to database deadline: next tuesday context: app.model
^
As far as I can see it fails because the pattern for matching the words of the message is nearly identical to the pattern for the key of the property key:value pair, so the parser cannot tell where the message ends and the property starts. I can solve this by insisting that start token be used for each property like so:
todo link todo to database :deadline: next tuesday :context: app.model
But i would prefer to keep the command as close natural language as possible.
I have two questions:
What does the error message actually mean?
And how would I modify the existing grammar to work for the given input strings?
import scala.util.parsing.combinator._
case class Command(action: String, message: String, properties: List[Property])
case class Property(name: String, value: String)
object ChoiceParser extends JavaTokenParsers {
def apply(input: String) = println(parseAll(command, input))
def command = action~message~properties ^^ {case a~m~p => new Command(a, m, p)}
def action = ident
def message = """[\w\d\s\.]+""".r
def properties = rep(property)
def property = propertyName~":"~propertyValue ^^ {
case n~":"~v => new Property(n, v)
}
def propertyName: Parser[String] = ident
def propertyValue: Parser[String] = """[\w\d\s\.]+""".r
}
It is really simple. When you use ~, you have to understand that there's no backtracking on individual parsers which have completed succesfully.
So, for instance, message got everything up to before the colon, as all of that is an acceptable pattern. Next, properties is a rep of property, which requires propertyName, but it only finds the colon (the first char not gobbled by message). So propertyName fails, and property fails. Now, properties, as mentioned, is a rep, so it finishes succesfully with 0 repetitions, which then makes command finish succesfully.
So, back to parseAll. The command parser returned succesfully, having consumed everything before the colon. It then asks the question: are we at the end of the input (\z)? No, because there is a colon right next. So, it expected end-of-input, but got a colon.
You'll have to change the regex so it won't consume the last identifier before a colon. For example:
def message = """[\w\d\s\.]+(?![:\w])""".r
By the way, when you use def you force the expression to be reevaluated. In other words, each of these defs create a parser every time each one is called. The regular expressions are instantiated every time the parsers they belong to are processed. If you change everything to val, you'll get much better performance.
Remember, these things define the parser, they do not run it. It is parseAll which runs a parser.
In Scala it is possible formulate patterns based on the invididual characters of a string by treating it as a Seq[Char].
An example of this feature is mentioned in A Tour of Scala
This is the example code used there:
object RegExpTest1 extends Application {
def containsScala(x: String): Boolean = {
val z: Seq[Char] = x
z match {
case Seq('s','c','a','l','a', rest # _*) =>
println("rest is "+rest)
true
case Seq(_*) =>
false
}
}
}
The problem I have with this is the third line of the snippet:
val z: Seq[Char] = x
Why is this sort of cast necessary? Shouldn't a String behave like a Seq[Char] under all circumstances (which would include pattern matching)? However, without this conversion, the code snippet will not work.
There is some real abuse of terminology going on in the question and the comments. There is no cast in this code, and especially "So basically, this is a major concession to Java interoperability, sacrificing some type soundness" has no basis in reality.
A scala cast looks like this: x.asInstanceOf[Y].
What you see above is an assignment: val z: Seq[Char] = x
This assignment is legal because there is an implicit conversion from String to Seq[Char]. I emphasize again, this is not a cast. A cast is an arbitrary assertion which can fail at runtime. There is no way for the implicit conversion to fail.
The problem with depending on implicit conversions between types, and the answer to the original question, is that implicit conversions only take place if the original value doesn't type check. Since it's perfectly legal to match on a String, no conversion takes place, the match just fails.
Not 100% sure if this is correct, but my intuition says that without this explicit cast you would pattern match against java.lang.String, which is not what you want.
The explicit cast forces the Scala compiler to use Predef.stringWrapper implicit conversion; thus, as RichString extends Seq[Char], you are able to do a pattern match as if the string were a sequence of characters.
I'm going to echo everything that andri said. For interoperability, Scala strings are java.lang.Strings. In Predef, there's an implicit conversion from String to RichString, which implements Seq[Char].
A perhaps nicer way of coding the pattern match, without needing an intermediate val z to hold the Seq[Char]:
def containsScala(x: String): Boolean = {
(x: Seq[Char]) match {
...
}
}