Parse string with unescaped characters

Parse string with unescaped characters - scala

I have one library which supports some kind of custom language. The parser is written using scala RegexParsers. Now I'm trying to rewrite our parser using fastparse library to speedup our engine.
The question is: Is it possible to parse properly params inside our pseudolanguage function?
Here is an example:
$out <= doSomething('/mypath[text() != '']', 'def f(a) {a * 2}', ',') <= $in
here is a function doSomething with 3 params:
/mypath[text() != '']
def f(a) {a * 2}
,
I'm expecting to get a tree for the function with params:
Function(
name = doSomething
params = List[String](
"/mypath[text() != '']",
"def f(a) {a * 2}",
","
)
)
What I do:
val ws = P(CharsWhileIn(" \r\n"))
def wsSep(sep: String) = P(ws.? ~ sep ~ ws.?)
val name = P(CharsIn('a' to 'z', 'A' to 'Z'))
val param = P(ws.? ~ "'" ~ CharPred(_ != '\'').rep ~ "'" ~ ws.?)
val params = P("(" ~ param.!.rep(sep = wsSep(",")) ~ ")")
val function = P(name.! ~ params.?).map(case (name, params) => Function(name, params.getOrElse(List())))
The problem here that the single quotes represent a String in my code, but inside that string sometimes we have additional single quotes like here:
/mypath[text() != '']
So, I can't use CharPred(_ != '\'') in my case
We also have a commas inside a Strings like in 3rd param
This is works somehow using scala parser but I can't parse the same using fastparse
Does anyone have ideas how to make the parser work properly?
Update
Got it!
The main magic is in val param
object X {
import fastparse.all._
case class Fn(name: String, params: Seq[String])
val ws = P(CharsWhileIn(" \r\n"))
def wsSep(sep: String) = P(ws.? ~ sep ~ ws.?)
val name = P(CharIn('a' to 'z', 'A' to 'Z').rep)
val param = P(ws.? ~ "'" ~ (!("'" ~ ws.? ~ ("," | ")")) ~ AnyChar).rep ~ "'" ~ ws.?)
val params = P("(" ~ param.!.rep(sep = wsSep(",")) ~ ")")
val function = P(name.! ~ params.?).map{case (name, params) => Fn(name, params.getOrElse(Seq()))}
}
object Test extends App {
val res = X.function.parse("myFunction('/hello[name != '']' , 'def f(a) {mytest}', ',')")
res match {
case Success(r, z) =>
println(s"fn name: ${r.name}")
println(s"params:\n {${r.params.mkString("\n")}\n}")
case Failure(e, z, m) => println(m)
}
}
out:
name: myFunction
params:
'/hello[name != '']'
'def f(a) {mytest}'
','

Related

Retrieve object method

I have an "object" with the following typo within a string: {myObjectIdIKnow?someInfo,{someBracedInfo},{someOtherBracedInfo},someInfo,...,lastInfo}.
I want to retrieve its content (i.e. from someInfo to lastInfo).
Following, the function I built:
def retrieveMyObject(line: String, myObjectId: String) =
{
if (line.contains(myObjectId))
{
var openingDelimiterCount = 0
var closingDelimiterCount = 0
val bit = iLine.split(myObjectIdIKnow).last
var i = -1
do
{
i += 1
if (bit(i).equals("{")) openingDelimiterCount += 1
else if (bit(i).equals("}")) closingDelimiterCount += 1
} while (lDelimiterOpeningCount >= closingDelimiterCount)
if (i.equals(0)) bit
else bit.splitAt(i)._1
}
}
I match with my myObjectId and browse through each character of the input line to check if it is a brace delimiter or not, then compare the numbers of { and }: if the second is bigger than the first, it means I reached the end of my content and thus I retrieve it.
It does not seem like a good method at all and I was wondering what better way could I do it?

I've tried to implement simple parser using Scala Parser Combinators. Here is what I got. I'm not very experienced with parser combinator but did something working just for the sake of curiosity.
import scala.util.parsing.combinator.JavaTokenParsers
case class InfoObject(id: String, objInfo: String, bracedInfos: List[String], infos: List[String])
class ObjectParser extends JavaTokenParsers {
def objDefinition: Parser[InfoObject] = "{" ~> (idPlusInfo <~ ",") ~ (bracedInfos <~ ",") ~ infos <~ "}" ^^ {
case (id, objInfo) ~ bracedInfos ~ infos => InfoObject(id, objInfo, bracedInfos, infos)
}
def idPlusInfo: Parser[(String, String)] = (infoObj <~ "?") ~ infoObj ^^ { case id ~ info => (id, info) }
def bracedInfos: Parser[List[String]] = repsep(bracedInfo, ",")
def bracedInfo: Parser[String] = "{" ~> infoObj <~ "}"
def infos: Parser[List[String]] = repsep(infoObj, ",")
def infoObj: Parser[String] = """\w+""".r
}
val parser = new ObjectParser()
parser.parse(parser.infoObj, "someInfo").get == "someInfo" // true
parser.parse(parser.bracedInfo, "{someBracedInfo}").get == "someBracedInfo" // true
val expected = InfoObject(
"myObjectIdIKnow",
"someInfo",
List("someBracedInfo", "someOtherBracedInfo"),
List("someInfo", "lastInfo")
)
val objectDef = "{myObjectIdIKnow?someInfo,{someBracedInfo},{someOtherBracedInfo},someInfo,lastInfo}"
parser.parse(parser.objDefinition, objectDef).get == expected // true

Representing ad-hoc operator precedence in Scala Parser

So I'm trying to write a parser specifically for the arithmetic fragment of a programming language I'm playing with, using scala RegexParsers.
As it stands, my top-level expression parser is of the form:
parser: Parser[Exp] = binAppExp | otherKindsOfParserLike | lval | int
It accepts lvals (things like "a.b, a.b[c.d], a[b], {record=expression, like=this}" just fine. Now, I'd like to enable expressions like "1 + b / c = d", but potentially with (source language, not Scala) compile-time user-defined operators.
My initial thought was, if I encode the operations recursively and numerically by precedence, then I could add higher precedences ad-hoc, and each level of precedence could parse consuming lower-precedence sub-terms on the right-side of the operation expression. So, I'm trying to build a toy of that idea with just some fairly common operators.
So I'd expect "1 * 2+1" to parse into something like Call(*, Seq(1, Call(+ Seq(2,1)))), where case class Call(functionName: String, args: Seq[Exp]) extends Exp.
Instead though, it parses as IntExp(1).
Is there a reason that this can't work (is it left-recursive in a way I'm missing? If so, I'm sure there's something else wrong, or it'd never terminate, right?), or is it just plain wrong for some other reason?
def binAppExp: Parser[Exp] = {
//assume a registry of operations
val ops = Map(
(7, Set("*", "/")),
(6, Set("-", "+")),
(4, Set("=", "!=", ">", "<", ">=", "<=")),
(3, Set("&")),
(2, Set("|"))
)
//relevant ops for a level of precedence
def opsWithPrecedence(n: Int): Set[String] = ops.getOrElse(n, Set.empty)
//parse an op with some level of precedence
def opWithPrecedence(n: Int): Parser[String] = ".+".r ^? (
{ case s if opsWithPrecedence(n).contains(s) => s },
{ case s => s"SYMBOL NOT FOUND: $s" }
)
//assuming the parse happens, encode it as an AST representation
def folder(h: Exp, t: LangParser.~[String, Exp]): CallExp =
CallExp(t._1, Seq(h, t._2))
val maxPrecedence: Int = ops.maxBy(_._1)._1
def term: (Int => Parser[Exp]) = {
case 0 => lval | int | notApp | "(" ~> term(maxPrecedence) <~ ")"
case n =>
val lowerTerm = term(n - 1)
lowerTerm ~ rep(opWithPrecedence(n) ~ lowerTerm) ^^ {
case h ~ ts => ts.foldLeft(h)(folder)
}
}
term(maxPrecedence)
}

Okay, so there was nothing inherently impossible with what I was trying to do, it was just wrong in the details.
The core idea is just: maintain a mapping from level of precedence to operators/parsers, and recursively look for parses based on that table. If you allow parenthetical expressions, just nest a call to your most precedent possible parser within the call to the parenthetical terms' parser.
Just in case anyone else ever wants to do something like this, here's code for a set of arithmetic/logical operators, heavily commented to relate it to the above:
def opExp: Parser[Exp] = {
sealed trait Assoc
val ops = Map(
(1, Set("*", "/")),
(2, Set("-", "+")),
(3, Set("=", "!=", ">", "<", ">=", "<=")),
(4, Set("&")),
(5, Set("|"))
)
def opsWithPrecedence(n: Int): Set[String] = ops.getOrElse(n, Set.empty)
/* before, this was trying to match the remainder of the expression,
so something like `3 - 2` would parse the Int(3),
and try to pass "- 2" as an operator to the op parser.
RegexParsers has an implicit def "literal : String => SubclassOfParser[String]",
that I'm using explicitly here.
*/
def opWithPrecedence(n: Int): Parser[String] = {
val ops = opsWithPrecedence(n)
if (ops.size > 1) {
ops.map(literal).fold (literal(ops.head)) {
case (l1, l2) => l1 | l2
}
} else if (ops.size == 1) {
literal(ops.head)
} else {
failure(s"No Ops for Precedence $n")
}
}
def folder(h: Exp, t: TigerParser.~[String, Exp]): CallExp = CallExp(t._1, Seq(h, t._2))
val maxPrecedence: Int = ops.maxBy(_._1)._1
def term: (Int => Parser[Exp]) = {
case 0 => lval | int | "(" ~> { term(maxPrecedence) } <~ ")"
case n if n > 0 =>
val lowerTerm = term(n - 1)
lowerTerm ~ rep(opWithPrecedence(n) ~ lowerTerm) ^^ {
case h ~ ts if ts.nonEmpty => ts.foldLeft(h)(folder)
case h ~ _ => h
}
}
term(maxPrecedence)
}

How to ignore single line comments in a parser-combinator

I have a working parser, but I've just realised I do not cater for comments. In the DSL I am parsing, comments start with a ; character. If a ; is encountered, the rest of the line is ignored (not all of it however, unless the first character is ;).
I am extending RegexParsers for my parser and ignoring whitespace (the default way), so I am losing the new line characters anyway. I don't wish to modify each and every parser I have to cater for the possibility of comments either, because statements can span across multiple lines (thus each part of each statement may end with a comment). Is there any clean way to acheive this?

One thing that may influence your choice is whether comments can be found within your valid parsers. For instance let's say you have something like:
val p = "(" ~> "[a-z]*".r <~ ")"
which would parse something like ( abc ) but because of comments you could actually encounter something like:
( ; comment goes here
abc
)
Then I would recommend using a TokenParser or one of its subclass. It's more work because you have to provide a lexical parser that will do a first pass to discard the comments. But it is also more flexible if you have nested comments or if the ; can be escaped or if the ; can be inside a string literal like:
abc = "; don't ignore this" ; ignore this
On the other hand, you could also try to override the value of whitespace to be something like
override protected val whiteSpace = """(\s|;.*)+""".r
Or something along those lines.
For instance using the example from the RegexParsers scaladoc:
import scala.util.parsing.combinator.RegexParsers
object so1 {
Calculator("""(1 + ; foo
(1 + 2))
; bar""")
}
object Calculator extends RegexParsers {
override protected val whiteSpace = """(\s|;.*)+""".r
def number: Parser[Double] = """\d+(\.\d*)?""".r ^^ { _.toDouble }
def factor: Parser[Double] = number | "(" ~> expr <~ ")"
def term: Parser[Double] = factor ~ rep("*" ~ factor | "/" ~ factor) ^^ {
case number ~ list => (number /: list) {
case (x, "*" ~ y) => x * y
case (x, "/" ~ y) => x / y
}
}
def expr: Parser[Double] = term ~ rep("+" ~ log(term)("Plus term") | "-" ~ log(term)("Minus term")) ^^ {
case number ~ list => list.foldLeft(number) { // same as before, using alternate name for /:
case (x, "+" ~ y) => x + y
case (x, "-" ~ y) => x - y
}
}
def apply(input: String): Double = parseAll(expr, input) match {
case Success(result, _) => result
case failure: NoSuccess => scala.sys.error(failure.msg)
}
}
This prints:
Plus term --> [2.9] parsed: 2.0
Plus term --> [2.10] parsed: 3.0
res0: Double = 4.0

Just filter out all the comments with a regex before you pass the code into your parser.
def removeComments(input: String): String = {
"""(?ms)\".*?\"|;.*?$|.+?""".r.findAllIn(input).map(str => if(str.startsWith(";")) "" else str).mkString
}
val code =
"""abc "def; ghij"
abc ;this is a comment
def"""
println(removeComments(code))

Parentheses matching in Scala --- functional approach

Let's say I want to parse a string with various opening and closing brackets (I used parentheses in the title because I believe it is more common -- the question is the same nevertheless) so that I get all the higher levels separated in a list.
Given:
[hello:=[notting],[hill]][3.4(4.56676|5.67787)][the[hill[is[high]]not]]
I want:
List("[hello:=[notting],[hill]]", "[3.4(4.56676|5.67787)]", "[the[hill[is[high]]not]]")
The way I am doing this is by counting the opening and closing brackets and adding to the list whenever I get my counter to 0. However, I have an ugly imperative code. You may assume that the original string is well formed.
My question is: what would be a nice functional approach to this problem?
Notes: I have thought of using the for...yield construct but given the use of the counters I cannot get a simple conditional (I must have conditionals just for updating the counters as well) and I do not know how I could use this construct in this case.

Quick solution using Scala parser combinator library:
import util.parsing.combinator.RegexParsers
object Parser extends RegexParsers {
lazy val t = "[^\\[\\]\\(\\)]+".r
def paren: Parser[String] =
("(" ~ rep1(t | paren) ~ ")" |
"[" ~ rep1(t | paren) ~ "]") ^^ {
case o ~ l ~ c => (o :: l ::: c :: Nil) mkString ""
}
def all = rep(paren)
def apply(s: String) = parseAll(all, s)
}
Checking it in REPL:
scala> Parser("[hello:=[notting],[hill]][3.4(4.56676|5.67787)][the[hill[is[high]]not]]")
res0: Parser.ParseResult[List[String]] = [1.72] parsed: List([hello:=[notting],[hill]], [3.4(4.56676|5.67787)], [the[hill[is[high]]not]])

What about:
def split(input: String): List[String] = {
def loop(pos: Int, ends: List[Int], xs: List[String]): List[String] =
if (pos >= 0)
if ((input charAt pos) == ']') loop(pos-1, pos+1 :: ends, xs)
else if ((input charAt pos) == '[')
if (ends.size == 1) loop(pos-1, Nil, input.substring(pos, ends.head) :: xs)
else loop(pos-1, ends.tail, xs)
else loop(pos-1, ends, xs)
else xs
loop(input.length-1, Nil, Nil)
}
scala> val s1 = "[hello:=[notting],[hill]][3.4(4.56676|5.67787)][the[hill[is[high]]not]]"
s1: String = [hello:=[notting],[hill]][3.4(4.56676|5.67787)][the[hill[is[high]]not]]
scala> val s2 = "[f[sad][add]dir][er][p]"
s2: String = [f[sad][add]dir][er][p]
scala> split(s1) foreach println
[hello:=[notting],[hill]]
[3.4(4.56676|5.67787)]
[the[hill[is[high]]not]]
scala> split(s2) foreach println
[f[sad][add]dir]
[er]
[p]

Given your requirements counting the parenthesis seems perfectly fine. How would you do that in a functional way? You can make the state explicitly passed around.
So first we define our state which accumulates results in blocks or concatenates the next block and keeps track of the depth:
case class Parsed(blocks: Vector[String], block: String, depth: Int)
Then we write a pure function that processed that returns the next state. Hopefully, we can just carefully look at this one function and ensure it's correct.
def nextChar(parsed: Parsed, c: Char): Parsed = {
import parsed._
c match {
case '[' | '(' => parsed.copy(block = block + c,
depth = depth + 1)
case ']' | ')' if depth == 1
=> parsed.copy(blocks = blocks :+ (block + c),
block = "",
depth = depth - 1)
case ']' | ')' => parsed.copy(block = block + c,
depth = depth - 1)
case _ => parsed.copy(block = block + c)
}
}
Then we just used a foldLeft to process the data with an initial state:
val data = "[hello:=[notting],[hill]][3.4(4.56676|5.67787)][the[hill[is[high]]not]]"
val parsed = data.foldLeft(Parsed(Vector(), "", 0))(nextChar)
parsed.blocks foreach println
Which returns:
[hello:=[notting],[hill]]
[3.4(4.56676|5.67787)]
[the[hill[is[high]]not]]

You have an ugly imperative solution, so why not make a good-looking one? :)
This is an imperative translation of huynhjl's solution, but just posting to show that sometimes imperative is concise and perhaps easier to follow.
def parse(s: String) = {
var res = Vector[String]()
var depth = 0
var block = ""
for (c <- s) {
block += c
c match {
case '[' => depth += 1
case ']' => depth -= 1
if (depth == 0) {
res :+= block
block = ""
}
case _ =>
}
}
res
}

Try this:
val s = "[hello:=[notting],[hill]][3.4(4.56676|5.67787)][the[hill[is[high]]not]]"
s.split("]\\[").toList
returns:
List[String](
[hello:=[notting],[hill],
3.4(4.56676|5.67787),
the[hill[is[high]]not]]
)

In Scala, how to find an elemein in CSV by a pair of key values?

For example, from a following file:
Name,Surname,E-mail
John,Smith,john.smith#hotmail.com
Nancy,Smith,nancy.smith#gmail.com
Jane,Doe,jane.doe#aol.com
John,Doe,john.doe#yahoo.com
how do I get e-mail address of John Doe?
I use the following code now, but can specify only one key field now:
val src = Source.fromFile(file)
val iter = src.getLines().drop(1).map(_.split(","))
var quote = ""
iter.find( _(1) == "Doe" ) foreach (a => println(a(2)))
src.close()
I've tried writing "iter.find( _(0) == "John" && _(1) == "Doe" )", but this raises an error saying that only one parameter is expected (enclosing the condition into extra pair of parentheses does not help).

The underscore as a placeholder for a parameter to a lambda doesn't work the way that you think.
a => println(a)
// is equivalent to
println(_)
(a,b) => a + b
// is equivalent to
_ + _
a => a + a
// is not equivalent to
_ + _
That is, the first underscore means the first parameter and the second one means the second parameter and so on. So that's the reason for the error that you're seeing -- you're using two underscores but have only one parameter. The fix is to use the explicit version:
iter.find( a=> a(0) == "John" && a(1) == "Doe" )

You can use Regex:
scala> def getRegex(v1: String, v2: String) = (v1 + "," + v2 +",(\\S+)").r
getRegex: (v1: String,v2: String)scala.util.matching.Regex
scala> val src = """John,Smith,john.smith#hotmail.com
| Nancy,Smith,nancy.smith#gmail.com
| Jane,Doe,jane.doe#aol.com
| John,Doe,john.doe#yahoo.com
| """
src: java.lang.String =
John,Smith,john.smith#hotmail.com
Nancy,Smith,nancy.smith#gmail.com
Jane,Doe,jane.doe#aol.com
John,Doe,john.doe#yahoo.com
scala> val MAIL = getRegex("John","Doe")
MAIL: scala.util.matching.Regex = John,Doe,(\S+)
scala> val itr = src.lines
itr: Iterator[String] = non-empty iterator
scala> for(MAIL(address) <- itr) println(address)
john.doe#yahoo.com
scala>

You could also do a pattern match on the result of split in a for comprehension.
val firstName = "John"
val surName = "Doe"
val emails = for {
Array(`firstName`, `surName`, email) <-
src.getLines().drop(1) map { _ split ',' }
} yield { email }
println(emails.mkString(","))
Note the backticks in the pattern: this means we match on the value of firstName instead of introducing a new variable matching anything and shadowing the val firstname.