Accessing Scala Parser regular expression match data - scala

I wondering if it's possible to get the MatchData generated from the matching regular expression in the grammar below.
object DateParser extends JavaTokenParsers {
....
val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
... get MatchData
}
}
One option of course is to perform the match again inside the block, but since the RegexParser has already performed the match I'm hoping that it passes the MatchData to the block, or stores it?

Here is the implicit definition that converts your Regex into a Parser:
/** A parser that matches a regex string */
implicit def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(source.subSequence(start, start + matched.end).toString,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
Just adapt it:
object X extends RegexParsers {
/** A parser that matches a regex string and returns the Match */
def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(matched,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}
Example:
scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)

No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55
You have a couple of other options, though:
break up your parser into several smaller parsers (for the tokens you actually want to extract)
define a custom parser that extracts the values you want and returns a domain object instead of a string
The first would look like
val separator = "-" | "/"
val year = ("""\d{4}"""r) <~ separator
val month = ("""\d\d"""r) <~ separator
val day = """\d\d"""r
val date = ((year?) ~ (month?) ~ day) map {
case year ~ month ~ day =>
(year.getOrElse("2009"), month.getOrElse("11"), day)
}
The <~ means "require these two tokens together, but only give me the result of the first one.
The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object.
The ? means that the parser is optional and will return an Option.
The .getOrElse bit provides a default value for when the parser didn't define a value.

When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to appoly that Regex to the input. The Match instance yielded upon successful application of the RE at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.
As it stands (in the 2.7 source I looked at), you're out of luck, I believe.

I ran into a similar issue using scala 2.8.1 and trying to parse input of the form "name:value" using the RegexParsers class:
package scalucene.query
import scala.util.matching.Regex
import scala.util.parsing.combinator._
object QueryParser extends RegexParsers {
override def skipWhitespace = false
private def quoted = regex(new Regex("\"[^\"]+"))
private def colon = regex(new Regex(":"))
private def word = regex(new Regex("\\w+"))
private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
private def term = (fielded | word | quoted)
def parseItem(str: String) = parse(term, str)
}
It seems that you can grab the matched groups after parsing like this:
QueryParser.parseItem("nameExample:valueExample") match {
case QueryParser.Success(result:scala.util.parsing.combinator.Parsers$$tilde, _) => {
println("Name: " + result.productElement(0) + " value: " + result.productElement(1))
}
}

Related

SCALA: How to convert a Parser Combinator result to Scala List[String]?

I am trying to write a parser for a language very similar to Milner's CCS. Basically what I am parsing so far are expressions of the following sort:
a.b.a.1
a.0
An expression must start with a letter (excluding t) and could have any number of letters following the first letter (separated by a '.'). The Expression must terminate with a digit (for simplicity I chose digits between 0 and 2 for now). I want to use Parser Combinators for Scala, however this is the first time that I am working with them. This is what I have so far:
import scala.util.parsing.combinator._
class SimpleParser extends RegexParsers {
def alpha: Parser[String] = """[^t]{1}""".r ^^ { _.toString }
def digit: Parser[Int] = """[0-2]{1}""".r ^^ { _.toInt }
def expr: Parser[Any] = alpha ~ "." ~ digit ^^ {
case al ~ "." ~ di => List(al, di)
}
def simpleExpression: Parser[Any] = alpha ~ "." ~ rep(alpha ~ ".") ~ digit //^^ { }
}
As you can see in def expr :Parser[Any] I am trying to return the result as a list, since Lists in Scala are very easy to work with (in my opinion). Is this the correct way how to convert a Parser[Any] result to a List? Can anyone give me any tips on how I can do this for def simpleExpression:Parser[Any].
The main reason why I want to use Lists is because after parsing and Expression I want to be able to consume it. For example, given the expression a.b.1, if I am given an 'a', I would like to consume the expression to end up with a new expression: b.1 (i.e. a.b.1 ->(a)-> b.1). The idea behind this is to simulate finite state automatas. Any tips on how I may improve my implementation are appreciated.
To keeps things type safe, I recommend a parser that produces a tuple of a list of strings and an int. That is, the input a.b.a.1 would get parsed as (List("a", "b", "a"), 1). Note also that the regex for alpha was modified to exclude anything that is not a lowercase letter (in addition to t).
class SimpleParser extends RegexParsers {
def alpha: Parser[String] = """[a-su-z]{1}""".r ^^ { _.toString }
def digit: Parser[Int] = """[0-2]{1}""".r ^^ { _.toInt }
def repAlpha: Parser[List[String]] = rep1sep(alpha, ".")
def expr: Parser[(List[String], Int)] = repAlpha ~ "." ~ digit ^^ {
case alphas ~ _ ~ num =>
(alphas, num)
}
}
With an instance of this SimpleParser, here's the output I got:
println(parser.parse(parser.expr, "a.b.a.1"))
// [1.8] parsed: (List(a, b, a),1)
println(parser.parse(parser.expr, "a.0"))
// [1.4] parsed: (List(a),0)

Scala parsing, how to go through what has been collected

Using scala parser combinators I have parsed some text input and created some of my own types along the way. The result prints fine. Now I need to go through the output, which I presume is a nested structure that includes the types I created. How do I go about this?
This is how I call the parser:
GMParser1.parseItem(i_inputHard_2) match {
case GMParser1.Success(res, _) =>
println(">" + res + "< of type: " + res.getClass.getSimpleName)
case x =>
println("Could not parse the input string:" + x)
}
EDIT
What I get back from MsgResponse(O2 Flow|Off) is:
>(MsgResponse~(O2 Flow~Off))< of type: $tilde
And what I get back from WthrResponse(Id(Tube 25,Carbon Monoxide)|0.20) is:
>(WthrResponse~IdWithValue(Id(Tube 25,Carbon Monoxide),0.20))< of type: $tilde
Just to give context to the question here is some of the input parsing. I will want to get at Id:
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
And here is Id being created:
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
And here is the whole of the parser:
object GMParser1 extends RegexParsers {
override def skipWhitespace = false
def number = regex(new Regex("[-+]?(\\d*[.])?\\d+"))
def idContents = text ~ ("," ~> text)
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
def text = """[A-Za-z0-9* ]+""".r
def wholeWord = """[A-Za-z]+""".r
def idBracketContents = id ~ ( "|" ~> number ) ^^ { contents => IdWithValue(contents._1,contents._2) }
def nonIdBracketContents = text ~ ( "|" ~> text )
def bracketContents = idBracketContents | nonIdBracketContents
def outerBrackets = "(" ~> bracketContents <~ ")"
def target = wholeWord ~ outerBrackets
def parseItem(str: String): ParseResult[Any] = parse(target, str)
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
case class IdWithValue(leftContents:Id,numberContents:String) extends Keeper
}
The parser created by the ~ operator produces a value of the ~ case class. To get at its contents, you can pattern match on it like on any other case class (keeping in mind that its name is symbolic, so it's used infix).
So you can replace case GMParser1.Success(res, _) => ... with case GMParser1.Success(functionName ~ argument) => ... to get at the function name and argument (or whatever the semantics of wholeWord and bracketContents in wholeWord "(" bracketContents ")" are). You can then similarly use a nested pattern to get at the individual parts of the argument.
You could (and probably should) also use ^^ together with pattern matching in your rules to create a more meaningful AST structure that doesn't contain ~. This would be useful to distinguish a nonIdBracketContents result from a bracketContents result for example.

How to distinguish triple quotes from single quotes in macros?

I am writing a macro m(expr: String), where expr is an expression in some language (not Scala):
m("SOME EXPRESSION")
m("""
SOME EXPRESSION
""")
When I am parsing the expression I would like to report error messages with proper locations in the source file. To achieve this I should know the location of the string literal itself and the number of quotes of the literal (3 or 1). Unfortunately, I did not find any method that returns the number of quotes of the literal:
import scala.language.experimental.macros
import scala.reflect.macros.blackbox.Context
object Temp {
def m(s: String) : String = macro mImpl
def mImpl(context: Context)(s: context.Expr[String]): context.universe.Tree = {
import context.universe._
s match {
case l # Literal(Constant(p: String)) =>
if (l.<TRIPLE QUOTES>) {
...
} else {
...
}
case _ =>
context.abort(context.enclosingPosition, "The argument of m must be a string literal")
}
}
}
What should I put instead of <TRIPLE QUOTES>?
The only way i can think of is to access the source file and check for triple quotes:
l.tree.pos.source.content.startsWith("\"\"\"",l.tree.pos.start)
You need also to edit your matching case:
case l # Expr(Literal(Constant(p: String))) =>
Here the version with some explanation:
val tree: context.universe.Tree = l.tree
val pos: scala.reflect.api.Position = tree.pos
val source: scala.reflect.internal.util.SourceFile = pos.source
val content: Array[Char] = source.content
val start: Int = pos.start
val isTriple: Boolean = content.startsWith("\"\"\"",start)

Ignoring prefixes in a JavaToken combinator parser

I'm trying to use a JavaToken combinator parser to pull out a particular match that's in the middle of larger string (ie ignore a random set of prefix chars). However I can't get it working and think I'm getting caught out by a greedy parser and/or CRs LFs. (the prefix chars can be basically anything). I have:
class RuleHandler extends JavaTokenParsers {
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r
def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
and then in my test case ..
"when looking for the X value" in {
"must find and correctly interpret X" in {
val testString =
"""
|Looking (only)
|for (x=45) within
|this string
""".stripMargin
val answer = ruleHandler.parse(ruleHandler.findX, testString)
System.out.println(" X value is : " + answer.toString)
}
}
I think it's similar to this SO question. Can anyone see whats wrong pls ? Tks.
First, you should not escape "\\s" twice inside """ """:
def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r
In your case it was interpreted separately "\" or "s" (s as symbol, not \s)
Second, your allowedPrefixChars parser includes (, x, =, so it captures the whole string, including (x=, nothing is left to subsequent parsers.
The solution is to be more concrete about prefix you want:
object ruleHandler extends JavaTokenParsers {
def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here
def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
}
ruleHandler.parse(ruleHandler.findX, testString)
res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0
I've told the parser to ignore (, that has x= going after (it's just negative lookahead).
Alternative:
"""\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
res22: List[Double] = List(45.0)
If you want to use parsers correctly, I would recommend you to describe the whole BNF grammar (with all possible (,) and = usages) - not just fragment. For example, include (only) into your parser if it's keyword, "(" ~> valueName <~ "=" ~ value to get value. Don't forget that scala-parser is intended to return you AST, not just some matched value. Pure regexps are better for regular matching from unstructured data.
Example how it would like to use parsers in correct way (didn't try to compile):
trait Command
case class Rule(name: String, value: Double) extends Command
case class Directive(name: String) extends Command
class RuleHandler extends JavaTokenParsers { //why `JavaTokenParsers` (not `RegexParsers`) if you don't use tokens from Java Language Specification ?
def string = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r //it's still wrong you should use some predefined Java-like literals from **JavaToken**Parsers
def rule = "(" ~> string <~ "=" ~> string <~ ")" ^^ { case name ~ num => Rule(name, num.toDouble} }
def directive = "(" ~> string <~ ")" ^^ { case name => Directive(name) }
def commands: Parser[Command] = repsep(rule | directive, string)
}
If you need to process natural language (Chomsky type-0), scalanlp or something similar fits better.

Idiomatic way of extracting known key-> val pairs from a string

Example context:
An HTTP Response with a body as follows:
key1=val1&key2=val2&key3=val3.
The names of the keys are always known.
Currently the extraction is done with regex:
val params = response split ("""&""") map { _.split("""=""") } map { el => { el(0) -> el(1) } } toMap;
Is there a simpler way of pattern matching the response for specific params?
I think using split is probably going to be the fastest/simplest solution here. You're not doing any advanced parsing, so using parser combinators or regex capture groups seems a little overkill.
However, when you have complex expressions involving multiple calls to map, filter, etc., it's usually an indicator that you can clean things up with a for-comprehension:
val response = "key1=val1&key2=val2&key3=val3"
val params = (for { x <- response split ("&")
Array(k, v) = x split ("=") }
yield k->v).toMap
You can use parser combinators here for most flexibility and robustness (i.e., handle failed parsing):
object Parser extends RegexParsers with App {
def lit: Parser[String] = "[^=&]+".r
def pair: Parser[(String, String)] = lit ~ "=" ~ lit ^^ {
case key ~ "=" ~ value => key -> value
}
def parse: Parser[Seq[(String, String)]] = repsep(pair, "&")
val response = "key1=val1&key2=val2&key3=val3"
val params = parse(new CharSequenceReader(response)).get.toMap
println(params)
}
You can use regexp as a matcher like this:
val r = "([^=]+)=([^=]+)".r
def toKv(s:String) = s match {
case r(k,v) => (k,v)
case _ => throw InvalidFormatException
}
So, for your case it would look like:
response split ("&") map (toKv)