I have an "object" with the following format within a string: {myObjectIdIKnow?someInfo,{someBracedInfo},{someOtherBracedInfo},someInfo,...,lastInfo}.
I want to retrieve its content (i.e. from someInfo to lastInfo).
Here is the function I built:
def retrieveMyObject(line: String, myObjectId: String) =
{
  if (line.contains(myObjectId))
  {
    var openingDelimiterCount = 0
    var closingDelimiterCount = 0
    // Everything after the last occurrence of the id (note: split treats the id as a regex)
    val bit = line.split(myObjectId).last
    var i = -1
    do
    {
      i += 1
      // bit(i) is a Char, so compare against Char literals rather than Strings
      if (bit(i) == '{') openingDelimiterCount += 1
      else if (bit(i) == '}') closingDelimiterCount += 1
    } while (openingDelimiterCount >= closingDelimiterCount)
    if (i.equals(0)) bit
    else bit.splitAt(i)._1
  }
}
I look for myObjectId, then walk through each character of the rest of the line, counting { and } as I go: once the number of closing braces exceeds the number of opening braces, I have reached the end of the content and can return it.
This does not seem like a good method at all. Is there a better way to do it?
I've tried to implement a simple parser using Scala Parser Combinators. Here is what I got. I'm not very experienced with parser combinators, but I put together something that works, mostly out of curiosity.
import scala.util.parsing.combinator.JavaTokenParsers
case class InfoObject(id: String, objInfo: String, bracedInfos: List[String], infos: List[String])
class ObjectParser extends JavaTokenParsers {
  def objDefinition: Parser[InfoObject] = "{" ~> (idPlusInfo <~ ",") ~ (bracedInfos <~ ",") ~ infos <~ "}" ^^ {
    case (id, objInfo) ~ bracedInfos ~ infos => InfoObject(id, objInfo, bracedInfos, infos)
  }
  def idPlusInfo: Parser[(String, String)] = (infoObj <~ "?") ~ infoObj ^^ { case id ~ info => (id, info) }
  def bracedInfos: Parser[List[String]] = repsep(bracedInfo, ",")
  def bracedInfo: Parser[String] = "{" ~> infoObj <~ "}"
  def infos: Parser[List[String]] = repsep(infoObj, ",")
  def infoObj: Parser[String] = """\w+""".r
}
val parser = new ObjectParser()
parser.parse(parser.infoObj, "someInfo").get == "someInfo" // true
parser.parse(parser.bracedInfo, "{someBracedInfo}").get == "someBracedInfo" // true
val expected = InfoObject(
"myObjectIdIKnow",
"someInfo",
List("someBracedInfo", "someOtherBracedInfo"),
List("someInfo", "lastInfo")
)
val objectDef = "{myObjectIdIKnow?someInfo,{someBracedInfo},{someOtherBracedInfo},someInfo,lastInfo}"
parser.parse(parser.objDefinition, objectDef).get == expected // true
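If pulling in a parser library is more than you need, the brace-counting idea from the question can also be written as a single pass over the string, without mutable counters. This is only a sketch: BraceContent and contentOf are made-up names, and it assumes that the id is always followed by ? and that the content ends at the brace that closes the enclosing object.

```scala
object BraceContent {
  // Returns the text between "<myObjectId>?" and the brace that closes the object,
  // or None if the id is missing or the closing brace is never reached.
  def contentOf(line: String, myObjectId: String): Option[String] = {
    val marker = myObjectId + "?" // assumption: the id is followed by '?'
    val start = line.indexOf(marker)
    if (start < 0) None
    else {
      val rest = line.substring(start + marker.length)
      // depths(k) is the nesting depth after reading k characters of `rest`;
      // it starts at 1 because we are already inside the outer '{'.
      val depths = rest.scanLeft(1) {
        case (depth, '{') => depth + 1
        case (depth, '}') => depth - 1
        case (depth, _)   => depth
      }
      val end = depths.indexOf(0) // number of characters read when the object closes
      if (end < 0) None else Some(rest.substring(0, end - 1))
    }
  }
}
```

For the sample line, BraceContent.contentOf(objectDef, "myObjectIdIKnow") yields the content from someInfo to lastInfo, nested braces included. Unlike the parser above, it does not split the content into its parts.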
Related
Using scala parser combinators I have parsed some text input and created some of my own types along the way. The result prints fine. Now I need to go through the output, which I presume is a nested structure that includes the types I created. How do I go about this?
This is how I call the parser:
GMParser1.parseItem(i_inputHard_2) match {
case GMParser1.Success(res, _) =>
println(">" + res + "< of type: " + res.getClass.getSimpleName)
case x =>
println("Could not parse the input string:" + x)
}
EDIT
What I get back from MsgResponse(O2 Flow|Off) is:
>(MsgResponse~(O2 Flow~Off))< of type: $tilde
And what I get back from WthrResponse(Id(Tube 25,Carbon Monoxide)|0.20) is:
>(WthrResponse~IdWithValue(Id(Tube 25,Carbon Monoxide),0.20))< of type: $tilde
Just to give context to the question here is some of the input parsing. I will want to get at Id:
trait Keeper
case class Id(leftContents:String,rightContents:String) extends Keeper
And here is Id being created:
def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
And here is the whole of the parser:
object GMParser1 extends RegexParsers {
  override def skipWhitespace = false
  def number = regex(new Regex("[-+]?(\\d*[.])?\\d+"))
  def idContents = text ~ ("," ~> text)
  def id = "Id(" ~> idContents <~ ")" ^^ { contents => Id(contents._1,contents._2) }
  def text = """[A-Za-z0-9* ]+""".r
  def wholeWord = """[A-Za-z]+""".r
  def idBracketContents = id ~ ( "|" ~> number ) ^^ { contents => IdWithValue(contents._1,contents._2) }
  def nonIdBracketContents = text ~ ( "|" ~> text )
  def bracketContents = idBracketContents | nonIdBracketContents
  def outerBrackets = "(" ~> bracketContents <~ ")"
  def target = wholeWord ~ outerBrackets
  def parseItem(str: String): ParseResult[Any] = parse(target, str)
  trait Keeper
  case class Id(leftContents:String,rightContents:String) extends Keeper
  case class IdWithValue(leftContents:Id,numberContents:String) extends Keeper
}
The parser created by the ~ operator produces a value of the ~ case class. To get at its contents, you can pattern match on it like on any other case class (keeping in mind that its name is symbolic, so it's used infix).
So you can replace case GMParser1.Success(res, _) => ... with case GMParser1.Success(functionName ~ argument, _) => ... to get at the function name and argument (or whatever the semantics of wholeWord and bracketContents in wholeWord "(" bracketContents ")" are). Note that Success has two fields (the result and the remaining input), and that the ~ extractor has to be in scope, as it is inside GMParser1 itself. You can then similarly use a nested pattern to get at the individual parts of the argument.
You could (and probably should) also use ^^ together with pattern matching in your rules to create a more meaningful AST structure that doesn't contain ~. This would be useful to distinguish a nonIdBracketContents result from a bracketContents result for example.
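Both suggestions can be sketched on a cut-down grammar. The names CallParser, Call, raw and describe are invented for this illustration and are not part of the question's parser; the sketch assumes the scala-parser-combinators module is on the classpath and only handles the simple text|text form.

```scala
import scala.util.parsing.combinator.RegexParsers

object CallParser extends RegexParsers {
  // Hypothetical AST node, introduced only for this example
  case class Call(name: String, left: String, right: String)

  def word: Parser[String] = """[A-Za-z]+""".r
  def text: Parser[String] = """[A-Za-z0-9* ]+""".r

  // Without ^^ the result is a nested ~ value: (name ~ left) ~ right
  def raw: Parser[String ~ String ~ String] =
    word ~ ("(" ~> text) ~ ("|" ~> text <~ ")")

  // With ^^ the ~ structure is turned into a case class inside the rule
  def call: Parser[Call] = raw ^^ {
    case name ~ left ~ right => Call(name, left, right)
  }

  // Pattern matching directly on the ~ result; ~ is in scope inside the parser object
  def describe(input: String): String = parseAll(raw, input) match {
    case Success(name ~ left ~ right, _) => s"$name: $left / $right"
    case failure: NoSuccess              => s"parse error: ${failure.msg}"
  }
}
```

With the call rule, callers only ever see Call values, so they never have to know how the grammar was put together.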
Problem
I want to parse a line like this:
fieldName: value1|value2 anotherFieldName: value3 valueWithoutFieldName
into
List(Some("fieldName") ~ List("value1", "value2"), Some("anotherFieldName") ~ List("value3"), None~List("valueWithoutFieldName"))
(Alternative field values are separated by a pipe (|). The field name is optional. If a field has no name, it should be parsed as None (see valueWithoutFieldName).)
My current (not working) solution
This is what I have so far:
val parser: Parser[ParsedQuery] = {
  phrase(rep(opt(fieldNameTerm) ~ (multipleValueTerm | singleValueTerm))) ^^ {
    case termSequence =>
      // operate on List[Option[String] ~ List[String]]
  }
}
val fieldNameTerm: Parser[String] = {
  ("\\w+".r <~ ":(?=\\S)".r) ^^ {
    case fieldName => fieldName
  }
}
val multipleValueTerm = rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))
val alternativeValueTerm: Parser[String] = {
  // matches '|'
  ("\\|".r) ^^ {
    case token => token
  }
}
val singleValueTerm: Parser[String] = {
  // all non-whitespace characters except '|'
  ("[\\S&&[^\\|]]+".r) ^^ {
    case token => token
  }
}
Unfortunately, my code does not parse the last field value (the one after the final pipe) correctly and treats it as the value of a new nameless field. For instance:
The following string:
"abc:111|222|333|444 cde:555"
is parsed into:
List((Some(abc)~List(111, 222, 333)), (None~444), (Some(cde)~555))
while I'd like it to be:
List((Some(abc)~List(111, 222, 333, 444)), (Some(cde)~555))
My suspicions
I think that the problem lies in definition of multipleValueTerm:
rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))
Its second part is probably not interpreted correctly, but I have no idea why.
Shouldn't <~ in the first part of multipleValueTerm leave the pipe (the value separator) in the input, so that the second part of the expression (alternativeValueTerm ~> singleValueTerm) can parse it successfully?
Let's look at what's happening. We want to parse 111|222|333|444 with multipleValueTerm.
111| fits (singleValueTerm <~ alternativeValueTerm). <~ consumes the | but drops it from the result, so we keep the 111.
So we have 222|333|444 left.
In the same way, 222| and 333| are taken, which leaves 444. But 444 fits neither (singleValueTerm <~ alternativeValueTerm) nor (alternativeValueTerm ~> singleValueTerm), so multipleValueTerm stops there. That is why 444 is then parsed as a new value without a field name.
I would improve your parser this way:
val separator = "|"
lazy val parser: Parser[List[(Option[String] ~ List[String])]] = rep(termParser)
lazy val termParser: Parser[(Option[String] ~ List[String])] = opt(fieldNameTerm) ~ valueParser
lazy val fieldNameTerm: Parser[String] = ("\\w+".r <~ ":(?=\\S)".r)
lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, separator)
lazy val singleValueTerm: Parser[String] = ("[\\S&&[^\\|]]+".r)
There is no need for the identity mappings ^^ { case x => x }, so I removed them. I also treat single and multiple values the same way: the result is always a List with one or more elements. rep1sep is handy for dealing with separators.
rep1sep(singleValueTerm, separator) could be equivalently expressed as
singleValueTerm ~ rep(separator ~> singleValueTerm) ^^ { case first ~ rest => first :: rest }
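To check the behaviour end to end, here is a self-contained sketch of the same idea. QueryLineParser, line and the tuple result type are choices made for this example only (the original returns ~ values), and it assumes the scala-parser-combinators module is available.

```scala
import scala.util.parsing.combinator.RegexParsers

object QueryLineParser extends RegexParsers {
  // A field name is a word followed by ':' and a non-whitespace character
  lazy val fieldNameTerm: Parser[String] = "\\w+".r <~ ":(?=\\S)".r

  // One value: any run of non-whitespace characters except '|'
  lazy val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r

  // One or more values separated by '|'; the separators are dropped
  lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, "|")

  // Convert the ~ result into a plain tuple so callers do not need the ~ extractor
  lazy val termParser: Parser[(Option[String], List[String])] =
    opt(fieldNameTerm) ~ valueParser ^^ { case name ~ values => (name, values) }

  lazy val line: Parser[List[(Option[String], List[String])]] = rep(termParser)
}
```

Parsing "abc:111|222|333|444 cde:555" with QueryLineParser.parseAll(QueryLineParser.line, ...) should now keep 444 in the list of values that belongs to abc.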
Let's say I have the following:
case class Var(s: String)
class MyParser extends JavaTokenParsers {
def variableExpr = "?" ~ identifier ^^ { case "?" ~ id => Var(id) }
def identifier = //...
}
I want this to accept inputs of the form ?X but not ? X (with a space in between). How would this be expressed?
Thanks!
JavaTokenParsers by default allows whitespace between any parsers. You can change this behavior like this:
override def skipWhitespace = false
Now you have to specify all white spaces manually:
def ws: Parser[Seq[Char]] = rep(' ')
def variableExpr = ws ~> "?" ~ identifier ^^ { case "?" ~ id => Var(id) }
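Put together, a minimal sketch might look like the following. VarParser is a hypothetical name, ident (provided by JavaTokenParsers) stands in for the question's elided identifier rule, and ws is written as a regex here rather than with rep(' '); the scala-parser-combinators module is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

case class Var(s: String)

object VarParser extends JavaTokenParsers {
  // Turn off automatic whitespace skipping for every parser in this object
  override def skipWhitespace = false

  // Optional leading blanks, handled explicitly now
  def ws: Parser[String] = """[ \t]*""".r

  // '?' must be followed immediately by an identifier
  def variableExpr: Parser[Var] = ws ~> "?" ~> ident ^^ { id => Var(id) }
}
```

With this, parseAll(variableExpr, "?X") succeeds while "? X" is rejected. The cost of this approach is that every other rule in the same parser now has to deal with whitespace explicitly as well.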
I'm working with the native parser combinator library in Scala and I'd like to parse some parts of my input, but not others. Specifically, I'd like to discard all of the arbitrary text between inputs that I care about. For example, with this input:
begin
Text I care about
Text I care about
DONT CARE
Text I don't care about
begin
More text I care about
...
Right now I have:
object MyParser extends RegexParsers {
  val beginToken: Parser[String] = "begin"
  val dontCareToken: Parser[String] = "DONT CARE"
  val text: Parser[String] = not(dontCareToken) ~> """([^\n]+)""".r
  val document: Parser[String] = beginToken ~> text.+ <~ dontCareToken ^^ { _.mkString("\n") }
  val documents: Parser[Iterable[String]] = document.+
}
but I'm not sure how to ignore the text that comes after DONT CARE and until the next begin. Specifically, I don't want to make any assumptions about the form of that text, I just want to start parsing again at the next begin statement.
You almost had it. Parse the part you don't care about and then discard the result.
I added dontCareText and skipDontCare, and then made skipDontCare optional in your document parser.
import scala.util.parsing.combinator.RegexParsers
object MyParser extends RegexParsers {
  val beginToken: Parser[String] = "begin"
  val dontCareToken: Parser[String] = "DONT CARE"
  val text: Parser[String] = not(dontCareToken) ~> """([^\n]+)""".r
  val dontCareText: Parser[String] = not(beginToken) ~> """([^\n]+)""".r
  val skipDontCare = dontCareToken ~ dontCareText ^^ { case c => "" }
  val document: Parser[String] =
    beginToken ~> text.+ <~ opt(skipDontCare) ^^ {
      _.mkString("\n")
    }
  val documents: Parser[Iterable[String]] = document.+
}
val s = """begin
Text I care about
Text I care about
DONT CARE
Text I don't care about
begin
More text I care about
"""
MyParser.parseAll(MyParser.documents,s)
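If the ignored section can span several lines, the same trick extends naturally by repeating the skipped line. The variant below is a sketch rather than a drop-in replacement: SkippingParser, line and ignored are names invented here, and text additionally refuses to start at a begin token so that a document without a DONT CARE section does not swallow the next one. One caveat of this simple approach is that any line starting with the letters "begin" is treated as a begin marker.

```scala
import scala.util.parsing.combinator.RegexParsers

object SkippingParser extends RegexParsers {
  val beginToken: Parser[String] = "begin"
  val dontCareToken: Parser[String] = "DONT CARE"

  // One whole line; leading whitespace (including newlines) is skipped by RegexParsers
  val line: Parser[String] = """[^\n]+""".r

  // A line we keep: it starts with neither marker
  val text: Parser[String] = not(dontCareToken | beginToken) ~> line

  // A line we drop: anything up to the next begin marker
  val ignored: Parser[String] = not(beginToken) ~> line

  // The DONT CARE marker followed by any number of ignored lines
  val skipDontCare: Parser[List[String]] = dontCareToken ~> rep(ignored)

  val document: Parser[String] =
    beginToken ~> rep1(text) <~ opt(skipDontCare) ^^ { _.mkString("\n") }

  val documents: Parser[List[String]] = rep1(document)
}
```

Calling SkippingParser.parseAll(SkippingParser.documents, input) returns one string per begin block, with everything between DONT CARE and the next begin left out.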
I'm wondering if it's possible to get the MatchData generated from the matching regular expression in the grammar below.
object DateParser extends JavaTokenParsers {
....
val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
... get MatchData
}
}
One option of course is to perform the match again inside the block, but since the RegexParser has already performed the match I'm hoping that it passes the MatchData to the block, or stores it?
Here is the implicit definition that converts your Regex into a Parser:
/** A parser that matches a regex string */
implicit def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(source.subSequence(start, start + matched.end).toString,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
Just adapt it:
import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

object X extends RegexParsers {
  /** A parser that matches a regex string and returns the Match */
  def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
    def apply(in: Input) = {
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      (r findPrefixMatchOf (source.subSequence(start, source.length))) match {
        case Some(matched) =>
          Success(matched,
                  in.drop(start + matched.end - offset))
        case None =>
          Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
      }
    }
  }
  val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}
Example:
scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)
No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55
You have a couple of other options, though:
break up your parser into several smaller parsers (for the tokens you actually want to extract)
define a custom parser that extracts the values you want and returns a domain object instead of a string
The first would look like
val separator = "-" | "/"
val year = """\d{4}""".r <~ separator
val month = """\d\d""".r <~ separator
val day = """\d\d""".r
// .r and .? are written with dot notation so no postfix-operator import is needed
val date = (year.? ~ month.? ~ day) map {
  case year ~ month ~ day =>
    (year.getOrElse("2009"), month.getOrElse("11"), day)
}
The <~ means "require these two tokens together, but only give me the result of the first one".
The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object".
The ? means that the parser is optional and will return an Option.
The .getOrElse bit provides a default value for when the parser didn't define a value.
When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to apply that Regex to the input. The Match instance yielded upon successful application of the RE at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.
As it stands (in the 2.7 source I looked at), you're out of luck, I believe.
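The fallback mentioned in the question, matching a second time inside the block, is at least short to write, because a Regex can be used as an extractor on the string that the parser already matched. The sketch below is an illustration under assumptions: DateLiteralParser is a made-up name, and the pattern is a variant of the question's regex with the separators moved into non-capturing groups so that the captured groups hold only digits.

```scala
import scala.util.parsing.combinator.RegexParsers

object DateLiteralParser extends RegexParsers {
  // Variant of the question's pattern: separators sit in non-capturing groups
  private val DatePattern = """(?:(\d{4})[-/])?(?:(\d\d)[-/])?(\d\d)""".r

  // The parser yields the matched string; the extractor re-applies the regex to it.
  // Groups that did not take part in the match come back as null, hence Option(...).
  val dateLiteral: Parser[(Option[String], Option[String], String)] =
    DatePattern ^^ {
      case DatePattern(year, month, day) => (Option(year), Option(month), day)
    }
}
```

This runs the regex twice on each date literal, which is usually negligible for short tokens; a custom Parser[Regex.Match] avoids the second match if that matters.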
I ran into a similar issue using scala 2.8.1 and trying to parse input of the form "name:value" using the RegexParsers class:
package scalucene.query
import scala.util.matching.Regex
import scala.util.parsing.combinator._
object QueryParser extends RegexParsers {
  override def skipWhitespace = false
  private def quoted = regex(new Regex("\"[^\"]+"))
  private def colon = regex(new Regex(":"))
  private def word = regex(new Regex("\\w+"))
  private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
  private def term = (fielded | word | quoted)
  def parseItem(str: String) = parse(term, str)
}
It seems that you can grab the matched groups after parsing like this:
QueryParser.parseItem("nameExample:valueExample") match {
case QueryParser.Success(result:scala.util.parsing.combinator.Parsers$$tilde, _) => {
println("Name: " + result.productElement(0) + " value: " + result.productElement(1))
}
}