Exception handling in Spark Scala UDF - scala

def parse_values(value: String) = {
val values = value.split(",").map(_.trim)
values.foldLeft(Array[(Int, Double)]()) {
case (acc, present) =>
val Array(k, v) = present.split(",")(0).split(":")
acc :+ (k.trim.toInt, v.trim.toDouble)
}
I am currently using the above UDF to parse a column of string into an array of keys and values.
"50:63.25,100:58.38" to [[50,63.2], [100,58.38]].
In some cases, the string is "\N" and I am unable to parse the column value.
If the string is "\N" then I should return an empty array. Can anyone help me to handle this exception or help me adding a new case? I am new to spark-scala.
Error: scala.MatchError: [Ljava.lang.String;#497cb6a9 (of class [Ljava.lang.String;)

You need to check that the resulting Array has two elements. You need a pattern matching like this to avoid that parse error:
def parse_values(value: String) = {
val values = value.split(",").map(_.trim)
values.foldLeft(Array[(Int, Double)]()) {
case (acc, present) =>
val Array(k, v) = {
present.split(",")(0).split(":") match {
case Array(_) => Array("0", "0.0")
case arr => arr
}
}
acc :+ (k.trim.toInt, v.trim.toDouble)
}
}

Related

Scala alternative to series of if statements that append to a list?

I have a Seq[String] in Scala, and if the Seq contains certain Strings, I append a relevant message to another list.
Is there a more 'scalaesque' way to do this, rather than a series of if statements appending to a list like I have below?
val result = new ListBuffer[Err]()
val malformedParamNames = // A Seq[String]
if (malformedParamNames.contains("$top")) result += IntegerMustBePositive("$top")
if (malformedParamNames.contains("$skip")) result += IntegerMustBePositive("$skip")
if (malformedParamNames.contains("modifiedDate")) result += FormatInvalid("modifiedDate", "yyyy-MM-dd")
...
result.toList
If you want to use some scala iterables sugar I would use
sealed trait Err
case class IntegerMustBePositive(msg: String) extends Err
case class FormatInvalid(msg: String, format: String) extends Err
val malformedParamNames = Seq[String]("$top", "aa", "$skip", "ccc", "ddd", "modifiedDate")
val result = malformedParamNames.map { v =>
v match {
case "$top" => Some(IntegerMustBePositive("$top"))
case "$skip" => Some(IntegerMustBePositive("$skip"))
case "modifiedDate" => Some(FormatInvalid("modifiedDate", "yyyy-MM-dd"))
case _ => None
}
}.flatten
result.toList
Be warn if you ask for scala-esque way of doing things there are many possibilities.
The map function combined with flatten can be simplified by using flatmap
sealed trait Err
case class IntegerMustBePositive(msg: String) extends Err
case class FormatInvalid(msg: String, format: String) extends Err
val malformedParamNames = Seq[String]("$top", "aa", "$skip", "ccc", "ddd", "modifiedDate")
val result = malformedParamNames.flatMap {
case "$top" => Some(IntegerMustBePositive("$top"))
case "$skip" => Some(IntegerMustBePositive("$skip"))
case "modifiedDate" => Some(FormatInvalid("modifiedDate", "yyyy-MM-dd"))
case _ => None
}
result
Most 'scalesque' version I can think of while keeping it readable would be:
val map = scala.collection.immutable.ListMap(
"$top" -> IntegerMustBePositive("$top"),
"$skip" -> IntegerMustBePositive("$skip"),
"modifiedDate" -> FormatInvalid("modifiedDate", "yyyy-MM-dd"))
val result = for {
(k,v) <- map
if malformedParamNames contains k
} yield v
//or
val result2 = map.filterKeys(malformedParamNames.contains).values.toList
Benoit's is probably the most scala-esque way of doing it, but depending on who's going to be reading the code later, you might want a different approach.
// Some type definitions omitted
val malformations = Seq[(String, Err)](
("$top", IntegerMustBePositive("$top")),
("$skip", IntegerMustBePositive("$skip")),
("modifiedDate", FormatInvalid("modifiedDate", "yyyy-MM-dd")
)
If you need a list and the order is siginificant:
val result = (malformations.foldLeft(List.empty[Err]) { (acc, pair) =>
if (malformedParamNames.contains(pair._1)) {
pair._2 ++: acc // prepend to list for faster performance
} else acc
}).reverse // and reverse since we were prepending
If the order isn't significant (although if the order's not significant, you might consider wanting a Set instead of a List):
val result = (malformations.foldLeft(Set.empty[Err]) { (acc, pair) =>
if (malformedParamNames.contains(pair._1)) {
acc ++ pair._2
} else acc
}).toList // omit the .toList if you're OK with just a Set
If the predicates in the repeated ifs are more complex/less uniform, then the type for malformations might need to change, as they would if the responses changed, but the basic pattern is very flexible.
In this solution we define a list of mappings that take your IF condition and THEN statement in pairs and we iterate over the inputted list and apply the changes where they match.
// IF THEN
case class Operation(matcher :String, action :String)
def processInput(input :List[String]) :List[String] = {
val operations = List(
Operation("$top", "integer must be positive"),
Operation("$skip", "skip value"),
Operation("$modify", "modify the date")
)
input.flatMap { in =>
operations.find(_.matcher == in).map { _.action }
}
}
println(processInput(List("$skip","$modify", "$skip")));
A breakdown
operations.find(_.matcher == in) // find an operation in our
// list matching the input we are
// checking. Returns Some or None
.map { _.action } // if some, replace input with action
// if none, do nothing
input.flatMap { in => // inputs are processed, converted
// to some(action) or none and the
// flatten removes the some/none
// returning just the strings.

Scala: dynamically joining data frames

I have data split into multiple files. I want to load and join the files.
I'd like to build a dynamic function that
1. will join n data files into a single data frame
2. given the input of file location and join column (e.g., pk)
I think this can be done with foldLeft, but I am not quite sure how:
Here is my code so far:
#throws
def dataJoin(path:String, fileNames:String*): DataFrame=
{
try
{
val dfList:ArrayBuffer[DataFrame]=new ArrayBuffer
for(fileName <- fileNames)
{
val df:DataFrame=DataFrameUtils.openFile(spark, s"$path${File.separator}$fileName")
dfList += df
}
dfList.foldLeft
{
(df,df1) => joinDataFrames(df,df1, "UID")
}
}
catch
{
case e:Exception => throw new Exception(e)
}
}
def joinDataFrames(df:DataFrame,df1:DataFrame, joinColum:String): Unit =
{
df.join(df1, Seq(joinColum))
}
foldLeft might indeed be suitable here, but it requires a "zero" element to start the folding from (in addition to the folding function). In this case, that "zero" can be the first DataFrame:
dfList.tail.foldLeft(dfList.head) { (df1, df2) => df1.join(df2, "UID") }
To avoid errors, you probably want to make sure the list isn't empty before trying to access that first item - one way of doing that would be using pattern matching.
dfList match {
case head :: tail => tail.foldLeft(head) { (df1, df2) => df1.join(df2, "UID") }
case Nil => spark.emptyDataFrame
}
Lastly, it's simpler, safer and more idiomatic to map over a collection instead of iterating over it and populated another (empty, mutable) collection:
val dfList = fileNames.map(fileName => DataFrameUtils.openFile(spark, s"$path${File.separator}$fileName"))
Altogether:
def dataJoin(path:String, fileNames: String*): DataFrame = {
val dfList = fileNames
.map(fileName => DataFrameUtils.openFile(spark, s"$path${File.separator}$fileName"))
.toList
dfList match {
case head :: tail => tail.foldLeft(head) { (df1, df2) => df1.join(df2, "UID") }
case Nil => spark.emptyDataFrame
}
}

Scala removing IF Else statement and write in a functional way

Let's say I have the following tuple
(colType, colDocV)
Where colType is a boolean and colDocV is a String
Depending on those two values, I will apply some chunk of code that applies transformations to a Dataframe.
Now, this code works. However, I am not convinced this is the proper way to write functional programming code.
I don't know which of these 3 approaches will improve the quality of the code and remove all if-if else-else :
Should I apply some kind of design pattern and which one?
Should I use some kind of pattern matching?
Should I use some anonymous function?
if (colDocV) {
val newCol = udf(UDFHashCode.udfHashCode).apply(col(columnName))
dataframe.withColumn(columnName, newCol)
} else if (colType.contains("string") || colType.contains("text")) {
val newCol = udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType)
dataframe.withColumn(columnName, newCol)
} else if (colType.contains("date")) {
val newCol = udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType)
dataframe.withColumn(columnName, newCol)
} else if (colType.contains("long")) {
dataframe.withColumn(columnName, dataframe(columnName).cast(DoubleType) )
} else {
dataframe.drop(columnName) //Dropping column that cannot be processed
}
You can do this with a match statement and a bunch of regexps.
val str = ".*(?:string|text).*".r
val date = ".*date.*".r
val long = ".*long.*".r
def col(tuple: (Boolean, String)) = tuple match {
case (true, _) => Some(udf(...))
case (_, str()) => Some(udf(...))
case (_, date()) => Some(udf(...))
case (, long()) => Some(udf(...))
case _ => None
}
col(colType -> colDocv)
.fold(dataframe.drop(columnName)) { newCol =>
dataframe.withColumn(columnName, newCol)
}
According to what I understand from your question following can be a solution using match case
def callUdf(colDocV: String, colType: Boolean, dataframe: DataFrame) = (colDocV, colType) match {
case x if (x._1.contains("string") || x._1.contains("text")) => dataframe.withColumn(columnName, udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType))
case x if (x._1.contains("date")) => dataframe.withColumn(columnName, udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType))
case x if (x._1.contains("long")) => dataframe.withColumn(columnName, dataframe(columnName).cast(DoubleType) )
case _ => dataframe.drop(columnName)
}

Tuple seen as Product, compiler rejects reference to element

Constructing phoneVector:
val phoneVector = (
for (i <- 1 until 20) yield {
val p = killNS(r.get("Phone %d - Value" format(i)))
val t = killNS(r.get("Phone %d - Type" format(i)))
if (p == None) None
else
if (t == None) (p,"Main") else (p,t)
}
).filter(_ != None)
Consider this very simple snippet:
for (pTuple <- phoneVector) {
println(pTuple.getClass.getName)
println(pTuple)
//val pKey = pTuple._1.replaceAll("[^\\d]","")
associate() // stub prints "associate"
}
When I run it, I see output like this:
scala.Tuple2
((609) 954-3815,Mobile)
associate
When I uncomment the line with replaceAll(), compile fails:
....scala:57: value _1 is not a member of Product with Serializable
[error] val pKey = pTuple._1.replaceAll("[^\\d]","")
[error] ^
Why does it not recognize pTuple as a Tuple2 and treat it only as Product
OK, this compiles and produces the desired result. But it's too verbose. Can someone please demonstrate a more concise solution for dealing with this typesafe stuff?
for (pTuple <- phoneVector) {
println(pTuple.getClass.getName)
println(pTuple)
val pPhone = pTuple match {
case t:Tuple2[_,_] => t._1
case _ => None
}
val pKey = pPhone match {
case s:String => s.replaceAll("[^\\d]","")
case _ => None
}
println(pKey)
associate()
}
You can do:
for (pTuple <- phoneVector) {
val pPhone = pTuple match {
case (key, value) => key
case _ => None
}
val pKey = pPhone match {
case s:String => s.replaceAll("[^\\d]","")
case _ => None
}
println(pKey)
associate()
}
Or simply phoneVector.map(_._1.replaceAll("[^\\d]",""))
By changing the construction of phoneVector, as wrick's question implied, I've been able to eliminate the match/case stuff because Tuple is assured. Not thrilled by it, but Change is Hard, and Scala seems cool.
Now, it's still possible to slip a None value into either of the Tuple values. My match/case does not check for that, and I suspect that could lead to a runtime error in the replaceAll call. How is that allowed?
def killNS (s:Option[_]) = {
(s match {
case _:Some[_] => s.get
case _ => None
}) match {
case None => None
case "" => None
case s => s
}
}
val phoneVector = (
for (i <- 1 until 20) yield {
val p = killNS(r.get("Phone %d - Value" format(i)))
val t = killNS(r.get("Phone %d - Type" format(i)))
if (t == None) (p,"Main") else (p,t)
}
).filter(_._1 != None)
println(phoneVector)
println(name)
println
// Create the Neo4j nodes:
for (pTuple <- phoneVector) {
val pPhone = pTuple._1 match { case p:String => p }
val pType = pTuple._2
val pKey = pPhone.replaceAll(",.*","").replaceAll("[^\\d]","")
associate(Map("target"->Map("label"->"Phone","key"->pKey,
"dial"->pPhone),
"relation"->Map("label"->"IS_AT","key"->pType),
"source"->Map("label"->"Person","name"->name)
)
)
}
}

Converting MongoCursor to JSON

Using Casbah, I query Mongo.
val mongoClient = MongoClient("localhost", 27017)
val db = mongoClient("test")
val coll = db("test")
val results: MongoCursor = coll.find(builder)
var matchedDocuments = List[DBObject]()
for(result <- results) {
matchedDocuments = matchedDocuments :+ result
}
Then, I convert the List[DBObject] into JSON via:
val jsonString: String = buildJsonString(matchedDocuments)
Is there a better way to convert from "results" (MongoCursor) to JSON (JsValue)?
private def buildJsonString(list: List[DBObject]): Option[String] = {
def go(list: List[DBObject], json: String): Option[String] = list match {
case Nil => Some(json)
case x :: xs if(json == "") => go(xs, x.toString)
case x :: xs => go(xs, json + "," + x.toString)
case _ => None
}
go(list, "")
}
Assuming you want implicit conversion (like in flavian's answer), the easiest way to join the elements of your list with commas is:
private implicit def buildJsonString(list: List[DBObject]): String =
list.mkString(",")
Which is basically the answer given in Scala: join an iterable of strings
If you want to include the square brackets to properly construct a JSON array you'd just change it to:
list.mkString("[", ",", "]") // punctuation madness
However if you'd actually like to get to Play JsValue elements as you seem to indicate in the original question, then you could do:
list.map { x => Json.parse(x.toString) }
Which should produce a List[JsValue] instead of a String. However, if you're just going to convert it back to a string again when sending a response, then it's an unneeded step.