Idiomatic way of extracting known key -> val pairs from a string - Scala

Example context:
An HTTP Response with a body as follows:
key1=val1&key2=val2&key3=val3.
The names of the keys are always known.
Currently the extraction is done with regex:
val params = response split ("""&""") map { _.split("""=""") } map { el => { el(0) -> el(1) } } toMap;
Is there a simpler way of pattern matching the response for specific params?

I think using split is probably going to be the fastest/simplest solution here. You're not doing any advanced parsing, so using parser combinators or regex capture groups seems a little overkill.
However, when you have complex expressions involving multiple calls to map, filter, etc., it's usually an indicator that you can clean things up with a for-comprehension:
val response = "key1=val1&key2=val2&key3=val3"
val params = (for {
  x <- response split "&"
  Array(k, v) = x split "="
} yield k -> v).toMap

You can use parser combinators here for the most flexibility and robustness (i.e., to handle failed parsing):
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.input.CharSequenceReader

object Parser extends RegexParsers with App {
  def lit: Parser[String] = "[^=&]+".r
  def pair: Parser[(String, String)] = lit ~ "=" ~ lit ^^ {
    case key ~ "=" ~ value => key -> value
  }
  def parse: Parser[Seq[(String, String)]] = repsep(pair, "&")

  val response = "key1=val1&key2=val2&key3=val3"
  val params = parse(new CharSequenceReader(response)).get.toMap
  println(params)
}
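One caveat: the .get above will still throw if the response cannot be parsed. A minimal sketch of actually using the failure handling (replacing the last two statements of the object body above) matches on the ParseResult instead:
// Inside the Parser object above: handle a malformed response instead of
// calling .get, which would throw.
parse(new CharSequenceReader(response)) match {
  case Success(pairs, _) => println(pairs.toMap)
  case NoSuccess(msg, _) => println(s"Could not parse response: $msg")
}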

You can use a regex as an extractor in a pattern match like this:
val r = "([^=]+)=([^=]+)".r

def toKv(s: String): (String, String) = s match {
  case r(k, v) => (k, v)
  case _       => throw new IllegalArgumentException(s"Not a key=value pair: $s")
}
So, for your case it would look like:
response split ("&") map (toKv)
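If you would rather skip malformed pairs than throw, a small variation (a sketch, not part of the original answer) returns an Option and silently drops anything that does not match:
// Sketch: skip malformed pairs instead of throwing.
val pair = "([^=]+)=([^=]+)".r
def toKvOpt(s: String): Option[(String, String)] = s match {
  case pair(k, v) => Some(k -> v)
  case _          => None
}
val params = "key1=val1&key2=val2&key3=val3".split("&").flatMap(toKvOpt).toMap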

Related

Pattern matching and RDDs

I have a very simple (n00b) question but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid I do not want the extractor logic to fail but simply write an entry in a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {

  def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
    val pattern = "<some pattern>".r
    val logs = sc.wholeTextFiles(path, numPartitions)
    val entries = logs.map(fileContent => {
      val file = fileContent._1
      val content = fileContent._2
      content.split("\\r?\\n").map(line => line match {
        case pattern(dt, ev, seq) => Some(LogEntry(<...>))
        case _ => logger.error(s"Cannot parse $file: $line"); None
      })
    })
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntrys? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.
To get rid of the Array, simply replace the map with flatMap: flatMap treats a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option, collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivalent to a map and a filter combined) is very different from collect() (which sends all data to the driver).
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
  val pattern = "<some pattern>".r
  val logs = sc.wholeTextFiles(path, numPartitions)
  val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(<...>))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

  entries.collect { case Some(entry) => entry }
}
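Since an Option is itself a (very small) collection, a slightly more compact variant (a sketch, not part of the answer above) flattens the per-line Options inside the single flatMap and skips the collect step entirely:
// Sketch: flatten the per-line Options inside flatMap, yielding the
// RDD of entries directly (same pattern/LogEntry placeholders as above).
val entries = logs.flatMap { case (file, content) =>
  content.split("\\r?\\n").flatMap {
    case pattern(dt, ev, seq) => Some(LogEntry(<...>))
    case line                 => logger.error(s"Cannot parse $file: $line"); None
  }
}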

Converting MongoCursor to JSON

Using Casbah, I query Mongo.
val mongoClient = MongoClient("localhost", 27017)
val db = mongoClient("test")
val coll = db("test")
val results: MongoCursor = coll.find(builder)
var matchedDocuments = List[DBObject]()
for (result <- results) {
  matchedDocuments = matchedDocuments :+ result
}
Then, I convert the List[DBObject] into JSON via:
val jsonString: String = buildJsonString(matchedDocuments)
Is there a better way to convert from "results" (MongoCursor) to JSON (JsValue)?
private def buildJsonString(list: List[DBObject]): Option[String] = {
  def go(list: List[DBObject], json: String): Option[String] = list match {
    case Nil                   => Some(json)
    case x :: xs if json == "" => go(xs, x.toString)
    case x :: xs               => go(xs, json + "," + x.toString)
    case _                     => None
  }
  go(list, "")
}
Assuming you want implicit conversion (like in flavian's answer), the easiest way to join the elements of your list with commas is:
private implicit def buildJsonString(list: List[DBObject]): String =
list.mkString(",")
Which is basically the answer given in "Scala: join an iterable of strings".
If you want to include the square brackets to properly construct a JSON array you'd just change it to:
list.mkString("[", ",", "]") // punctuation madness
However if you'd actually like to get to Play JsValue elements as you seem to indicate in the original question, then you could do:
list.map { x => Json.parse(x.toString) }
Which should produce a List[JsValue] instead of a String. However, if you're just going to convert it back to a string again when sending a response, then it's an unneeded step.
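If the goal is one Play JsValue for the whole result set, a short sketch (assuming Play JSON is on the classpath, and a fresh cursor, since a MongoCursor can only be traversed once) builds a JsArray directly, skipping the comma-joined string altogether:
import play.api.libs.json.{JsArray, JsValue, Json}
// Sketch: one JsValue (a JSON array) straight from the cursor.
val jsonArray: JsValue = JsArray(results.map(dbo => Json.parse(dbo.toString)).toSeq)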

Internal DSL in Scala: Lists without ","

I'm trying to build an internal DSL in Scala to represent algebraic definitions. Let's consider this simplified data model:
case class Var(name:String)
case class Eq(head:Var, body:Var*)
case class Definition(name:String, body:Eq*)
For example a simple definition would be:
val x = Var("x")
val y = Var("y")
val z = Var("z")
val eq1 = Eq(x, y, z)
val eq2 = Eq(y, x, z)
val defn = Definition("Dummy", eq1, eq2)
I would like to have an internal DSL to represent such an equation in the form:
Dummy {
  x = y z
  y = x z
}
The closest I could get is the following:
Definition("Dummy") := (
"x" -> ("y", "z")
"y" -> ("x", "z")
)
The first problem I encountered is that I cannot have two implicit conversions for Definition and Var, hence Definition("Dummy"). The main problem, however, is the lists. I don't want to surround them with anything, e.g. (), and I also don't want their elements separated by commas.
Is what I want possible using Scala? If yes, can anyone show me an easy way of achieving it?
While Scala's syntax is powerful, it is not flexible enough to define arbitrary delimiters. Thus, there is no way to drop the commas and replace them with nothing but whitespace.
Nevertheless, it is possible to use macros and parse a string with arbitrary content at compile time. It is not an "easy" solution, but one that works:
object AlgDefDSL {
  import language.experimental.macros
  import scala.reflect.macros.Context

  implicit class DefDSL(sc: StringContext) {
    def dsl(): Definition = macro __dsl_impl
  }

  def __dsl_impl(c: Context)(): c.Expr[Definition] = {
    import c.universe._
    val defn = c.prefix.tree match {
      case Apply(_, List(Apply(_, List(Literal(Constant(s: String)))))) =>
        def toAST[A : TypeTag](xs: Tree*): Tree =
          Apply(
            Select(Ident(typeOf[A].typeSymbol.companionSymbol), newTermName("apply")),
            xs.toList
          )
        def toVarAST(varObj: Var) =
          toAST[Var](c.literal(varObj.name).tree)
        def toEqAST(eqObj: Eq) =
          toAST[Eq]((eqObj.head +: eqObj.body).map(toVarAST(_)): _*)
        def toDefAST(defObj: Definition) =
          toAST[Definition](c.literal(defObj.name).tree +: defObj.body.map(toEqAST(_)): _*)
        parsers.parse(s) match {
          case parsers.Success(defn, _) => toDefAST(defn)
          case parsers.NoSuccess(msg, _) => c.abort(c.enclosingPosition, msg)
        }
    }
    c.Expr(defn)
  }

  import scala.util.parsing.combinator.JavaTokenParsers

  private object parsers extends JavaTokenParsers {
    override val whiteSpace = "[ \t]*".r

    lazy val newlines = opt(rep("\n"))
    lazy val varP = "[a-z]+".r ^^ Var
    lazy val eqP = (varP <~ "=") ~ rep(varP) ^^ {
      case lhs ~ rhs => Eq(lhs, rhs: _*)
    }
    lazy val defHead = newlines ~> ("[a-zA-Z]+".r <~ "{") <~ newlines
    lazy val defBody = rep(eqP <~ rep("\n"))
    lazy val defEnd = "}" ~ newlines
    lazy val defP = defHead ~ defBody <~ defEnd ^^ {
      case name ~ eqs => Definition(name, eqs: _*)
    }

    def parse(s: String) = parseAll(defP, s)
  }

  case class Var(name: String)
  case class Eq(head: Var, body: Var*)
  case class Definition(name: String, body: Eq*)
}
It can be used with something like this:
scala> import AlgDefDSL._
import AlgDefDSL._
scala> dsl"""
| Dummy {
| x = y z
| y = x z
| }
| """
res12: AlgDefDSL.Definition = Definition(Dummy,WrappedArray(Eq(Var(x),WrappedArray(Var(y), Var(z))), Eq(Var(y),WrappedArray(Var(x), Var(z)))))
In addition to sschaef's nice solution I want to mention a few possibilities that are commonly used to get rid of commas in list construction for a DSL.
Colons
This might be trivial, but it is sometimes overlooked as a solution.
line1 ::
line2 ::
line3 ::
Nil
For a DSL it is often desirable that every line containing some instruction/data is terminated the same way (as opposed to Lists, where every line but the last gets a comma). With such a solution, reordering lines can no longer mess up a trailing comma. Unfortunately, the Nil looks a bit ugly.
Fluent API
Another alternative that might be interesting for a DSL is something like this:
BuildDefinition()
.line1
.line2
.line3
.build
where each line is a member function of the builder (and returns a modified builder). This solution requires eventually converting the builder to a list (which might be done with an implicit conversion). Note that for some APIs it might be possible to pass around the builder instances themselves, and only extract the data wherever needed.
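A minimal sketch of such a builder for the Definition/Eq/Var model from the question (the DefinitionBuilder name and its equation method are made up for illustration):
// Hypothetical builder; collects Eq entries and finally produces a Definition.
case class DefinitionBuilder(name: String, eqs: Vector[Eq] = Vector.empty) {
  def equation(head: Var, body: Var*): DefinitionBuilder = copy(eqs = eqs :+ Eq(head, body: _*))
  def build: Definition = Definition(name, eqs: _*)
}
val dummy =
  DefinitionBuilder("Dummy")
    .equation(x, y, z)
    .equation(y, x, z)
    .build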
Constructor API
Similarly, another possibility is to exploit constructors.
new BuildInterface {
line1
line2
line3
}
Here, BuildInterface is a trait and we simply instantiate an anonymous class from the interface. The line functions call some member functions of this trait. Each invocation can internally update the state of the build interface. Note that this commonly results in a mutable design (but only during construction). To extract the list, an implicit conversion could be used.
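A minimal sketch of that idea for the same model (the DefinitionInterface trait and its equation method are invented for illustration; the mutable state exists only while the block runs):
// Hypothetical constructor-based interface for the Definition/Eq/Var model.
import scala.collection.mutable.ListBuffer
trait DefinitionInterface {
  private val eqs = ListBuffer.empty[Eq]
  def name: String
  protected def equation(head: Var, body: Var*): Unit = eqs += Eq(head, body: _*)
  def toDefinition: Definition = Definition(name, eqs: _*)
}
val dummyDefinition =
  new DefinitionInterface {
    val name = "Dummy"
    equation(x, y, z)
    equation(y, x, z)
  }.toDefinition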
Since I don't understand the actual purpose of your DSL, I'm not really sure if any of these techniques is interesting for your scenario. I just wanted to add them since they are common ways to get rid of ",".
Here is another solution which is relatively simple and enables a syntax that is pretty close to your ideal (as others have pointed out, the exact syntax you asked for is not possible, in particular because you cannot redefine delimiter symbols).
My solution stretches what is reasonable a bit, because it adds an operator directly on scala.Symbol, but if you're going to use this DSL in a constrained scope then this should be OK.
object VarOps {
  val currentEqs = new util.DynamicVariable(Vector.empty[Eq])
}

implicit class VarOps(val variable: Var) extends AnyVal {
  import VarOps._
  def :=[T](body: Var*) = {
    val eq = Eq(variable, body: _*)
    currentEqs.value = currentEqs.value :+ eq
  }
}

implicit class SymbolOps(val sym: Symbol) extends AnyVal {
  def apply[T](body: => Unit): Definition = {
    import VarOps._
    currentEqs.withValue(Vector.empty[Eq]) {
      body
      Definition(sym.name, currentEqs.value: _*)
    }
  }
}
Now you can do:
'Dummy {
  x := (y, z)
  y := (x, z)
}
Which builds the following definition (as printed in the REPL):
Definition(Dummy,Vector(Eq(Var(x),WrappedArray(Var(y), Var(z))), Eq(Var(y),WrappedArray(Var(x), Var(z)))))

traverse collection of type "Any" in Scala

I would like to traverse a collection resulting from the Scala JSON toolkit at github.
The problem is that the JsonParser returns "Any" so I am wondering how I can avoid the following error:
"Value foreach is not a member of Any".
val json = Json.parse(urls)
for(l <- json) {...}
object Json {
  def parse(s: String): Any = (new JsonParser).parse(s)
}
You will have to do pattern matching to traverse the structures returned from the parser.
/*
 * (untested)
 */
def printThem(a: Any) {
  a match {
    case l: List[_] =>
      println("List:")
      l foreach printThem
    case m: Map[_, _] =>
      for ((k, v) <- m) {
        print("%s -> " format k)
        printThem(v)
      }
    case x =>
      println(x)
  }
}

val json = Json.parse(urls)
printThem(json)
You might have more luck using the lift-json parser, available at: http://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
It has a much richer type-safe DSL available, and (despite the name) can be used completely standalone outside of the Lift framework.
If you are sure that in all cases there will be only one type you can come up with the following cast:
for (l <- json.asInstanceOf[List[List[String]]]) {...}
Otherwise, do a pattern match for all expected cases.
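For example, a short sketch that matches only the shapes you expect (the expected shape here, a list of maps, is just an illustrative assumption) rather than casting blindly:
// Sketch: match the expected shapes instead of casting with asInstanceOf.
Json.parse(urls) match {
  case l: List[_] => l.collect { case m: Map[_, _] => m } foreach println
  case other      => println("Unexpected JSON structure: " + other)
}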

Accessing Scala Parser regular expression match data

I am wondering if it's possible to get the MatchData generated from the matching regular expression in the grammar below.
object DateParser extends JavaTokenParsers {
  ....
  val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
    ... get MatchData
  }
}
One option of course is to perform the match again inside the block, but since the RegexParser has already performed the match I'm hoping that it passes the MatchData to the block, or stores it?
Here is the implicit definition that converts your Regex into a Parser:
/** A parser that matches a regex string */
implicit def regex(r: Regex): Parser[String] = new Parser[String] {
  def apply(in: Input) = {
    val source = in.source
    val offset = in.offset
    val start = handleWhiteSpace(source, offset)
    (r findPrefixMatchOf (source.subSequence(start, source.length))) match {
      case Some(matched) =>
        Success(source.subSequence(start, start + matched.end).toString,
          in.drop(start + matched.end - offset))
      case None =>
        Failure("string matching regex `" + r + "' expected but `" + in.first + "' found", in.drop(start - offset))
    }
  }
}
Just adapt it:
object X extends RegexParsers {
  /** A parser that matches a regex string and returns the Match */
  def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
    def apply(in: Input) = {
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      (r findPrefixMatchOf (source.subSequence(start, source.length))) match {
        case Some(matched) =>
          Success(matched, in.drop(start + matched.end - offset))
        case None =>
          Failure("string matching regex `" + r + "' expected but `" + in.first + "' found", in.drop(start - offset))
      }
    }
  }

  val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}
Example:
scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)
No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55
You have a couple of other options, though:
break up your parser into several smaller parsers (for the tokens you actually want to extract)
define a custom parser that extracts the values you want and returns a domain object instead of a string
The first would look like
val separator = "-" | "/"
val year = ("""\d{4}""".r) <~ separator
val month = ("""\d\d""".r) <~ separator
val day = """\d\d""".r
val date = ((year?) ~ (month?) ~ day) map {
  case year ~ month ~ day =>
    (year.getOrElse("2009"), month.getOrElse("11"), day)
}
The <~ means "require these two tokens together, but only give me the result of the first one."
The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object."
The ? means that the parser is optional and will return an Option.
The .getOrElse bit provides a default value for when the parser didn't define a value.
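A usage sketch, assuming the smaller parsers above are defined inside an object extending RegexParsers (the DateParsers name is made up here):
// Hypothetical usage of the smaller parsers, assuming they live in
// `object DateParsers extends RegexParsers`.
DateParsers.parseAll(DateParsers.date, "2009-11-23") match {
  case DateParsers.Success((y, m, d), _) => println(s"year=$y month=$m day=$d")
  case failure                           => println(failure)
}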
When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to apply that Regex to the input. The Match instance yielded upon successful application of the RE at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.
As it stands (in the 2.7 source I looked at), you're out of luck, I believe.
I ran into a similar issue using scala 2.8.1 and trying to parse input of the form "name:value" using the RegexParsers class:
package scalucene.query

import scala.util.matching.Regex
import scala.util.parsing.combinator._

object QueryParser extends RegexParsers {
  override def skipWhitespace = false
  private def quoted = regex(new Regex("\"[^\"]+"))
  private def colon = regex(new Regex(":"))
  private def word = regex(new Regex("\\w+"))
  private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
  private def term = (fielded | word | quoted)
  def parseItem(str: String) = parse(term, str)
}
It seems that you can grab the matched groups after parsing by destructuring the ~ result, like this:
import QueryParser._
QueryParser.parseItem("nameExample:valueExample") match {
  case Success(name ~ value, _) =>
    println("Name: " + name + " value: " + value)
  case other =>
    println(other)
}