Scala functional programming dry run

Could you please help me understand the following method:
def extractGlobalID(custDimIndex: Int)(gaData: DataFrame): DataFrame = {
  val getGlobId = udf[String, Seq[GenericRowWithSchema]](genArr => {
    val globId: List[String] =
      genArr.toList
        .filter(_(0) == custDimIndex)
        .map(custDim => custDim(1).toString)
    globId match {
      case Nil    => ""
      case x :: _ => x
    }
  })
  gaData.withColumn("globalId", getGlobId('customDimensions))
}

The method applies a UDF to the dataframe. The UDF appears intended to extract a single ID from a column of type array<struct>, where the first element of each struct is an index and the second an ID.
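For concreteness, a sketch of the assumed input shape (the field names index and value are illustrative assumptions, not taken from the question):

// Assumed shape of the input column:
//   customDimensions: array<struct<index: int, value: string>>
// Example: with custDimIndex = 4 and a row holding [(1, "US"), (4, "G-123")],
// the UDF yields "G-123"; if no struct has index 4 it yields "".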
You could rewrite the code to be more readable:
def extractGlobalID(custDimIndex: Int)(gaData: DataFrame): DataFrame = {
  val getGlobId = udf((genArr: Seq[Row]) => {
    genArr
      .find(_(0) == custDimIndex)
      .map(_(1).toString)
      .getOrElse("")
  })
  gaData.withColumn("globalId", getGlobId('customDimensions))
}
or even shorter with collectFirst:
def extractGlobalID(custDimIndex: Int)(gaData: DataFrame): DataFrame = {
  val getGlobId = udf((genArr: Seq[Row]) => {
    genArr
      .collectFirst { case r if r.getInt(0) == custDimIndex => r.getString(1) }
      .getOrElse("")
  })
  gaData.withColumn("globalId", getGlobId('customDimensions))
}
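Since the parameter lists are curried, the method also slots straight into Dataset.transform; a usage sketch (the index 4 is just an example value):

// Apply directly:
val withGlobalId = extractGlobalID(4)(gaData)
// Or as a pipeline stage:
val withGlobalId2 = gaData.transform(extractGlobalID(4))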

Related

Exception handling in Spark Scala UDF

def parse_values(value: String) = {
  val values = value.split(",").map(_.trim)
  values.foldLeft(Array[(Int, Double)]()) {
    case (acc, present) =>
      val Array(k, v) = present.split(",")(0).split(":")
      acc :+ (k.trim.toInt, v.trim.toDouble)
  }
}
I am currently using the above UDF to parse a column of strings into an array of keys and values, e.g.
"50:63.25,100:58.38" to [[50, 63.25], [100, 58.38]].
In some cases the string is "\N" and I am unable to parse the column value.
If the string is "\N" then I should return an empty array. Can anyone help me handle this exception or help me add a new case? I am new to Spark and Scala.
Error: scala.MatchError: [Ljava.lang.String;@497cb6a9 (of class [Ljava.lang.String;)
You need to check that the resulting Array has two elements. A pattern match like this avoids that parse error:
def parse_values(value: String) = {
  val values = value.split(",").map(_.trim)
  values.foldLeft(Array[(Int, Double)]()) {
    case (acc, present) =>
      val Array(k, v) = present.split(",")(0).split(":") match {
        case Array(_) => Array("0", "0.0")
        case arr      => arr
      }
      acc :+ (k.trim.toInt, v.trim.toDouble)
  }
}
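Since the goal is an empty array for "\N" rather than a placeholder 0 -> 0.0 pair, a hedged alternative sketch (same parsing assumptions as above):

def parse_values(value: String): Array[(Int, Double)] =
  if (value.trim == "\\N") Array.empty // "\N" marks a missing value: return an empty array
  else value.split(",").map(_.trim).flatMap { pair =>
    pair.split(":") match {
      case Array(k, v) => Some((k.trim.toInt, v.trim.toDouble))
      case _           => None // skip malformed pairs instead of failing
    }
  }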

Scala: removing if/else statements and writing in a functional way

Let's say I have the following tuple
(colType, colDocV)
Where colDocV is a Boolean and colType is a String.
Depending on those two values, I will apply some chunk of code that applies transformations to a DataFrame.
Now, this code works. However, I am not convinced this is the proper way to write functional code.
I don't know which of these three approaches would improve the quality of the code and remove all the if/else-if/else branches:
Should I apply some kind of design pattern and which one?
Should I use some kind of pattern matching?
Should I use some anonymous function?
if (colDocV) {
  val newCol = udf(UDFHashCode.udfHashCode).apply(col(columnName))
  dataframe.withColumn(columnName, newCol)
} else if (colType.contains("string") || colType.contains("text")) {
  val newCol = udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType)
  dataframe.withColumn(columnName, newCol)
} else if (colType.contains("date")) {
  val newCol = udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType)
  dataframe.withColumn(columnName, newCol)
} else if (colType.contains("long")) {
  dataframe.withColumn(columnName, dataframe(columnName).cast(DoubleType))
} else {
  dataframe.drop(columnName) // dropping a column that cannot be processed
}
You can do this with a match statement and a bunch of regexps.
val str = ".*(?:string|text).*".r
val date = ".*date.*".r
val long = ".*long.*".r

def col(tuple: (Boolean, String)) = tuple match {
  case (true, _)   => Some(udf(...))
  case (_, str())  => Some(udf(...))
  case (_, date()) => Some(udf(...))
  case (_, long()) => Some(udf(...))
  case _           => None
}

col(colDocV -> colType)
  .fold(dataframe.drop(columnName)) { newCol =>
    dataframe.withColumn(columnName, newCol)
  }
From what I understand of your question, the following could be a solution using match/case:
def callUdf(colType: String, colDocV: Boolean, dataframe: DataFrame) = (colType, colDocV) match {
  case x if x._1.contains("string") || x._1.contains("text") =>
    dataframe.withColumn(columnName, udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType))
  case x if x._1.contains("date") =>
    dataframe.withColumn(columnName, udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType))
  case x if x._1.contains("long") =>
    dataframe.withColumn(columnName, dataframe(columnName).cast(DoubleType))
  case _ =>
    dataframe.drop(columnName)
}
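For completeness, a hedged sketch consolidating both answers into one total function (columnName, UDFHashCode, Entropy and DateUtils come from the question; packaging the decision as an Option[Column] is my own framing):

def transform(colDocV: Boolean, colType: String, columnName: String)(df: DataFrame): DataFrame = {
  // Decide which new column, if any, should replace the existing one.
  val newCol: Option[Column] = (colDocV, colType) match {
    case (true, _)                                            => Some(udf(UDFHashCode.udfHashCode).apply(col(columnName)))
    case (_, t) if t.contains("string") || t.contains("text") => Some(udf(Entropy.stringEntropyFunc).apply(col(columnName)).cast(DoubleType))
    case (_, t) if t.contains("date")                         => Some(udf(DateUtils.getTimeAsDoubleFunc).apply(col(columnName)).cast(DoubleType))
    case (_, t) if t.contains("long")                         => Some(df(columnName).cast(DoubleType))
    case _                                                    => None
  }
  // None means the column cannot be processed, so drop it.
  newCol.fold(df.drop(columnName))(c => df.withColumn(columnName, c))
}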

Wait for a future to complete before executing another one, for a sequence

I've run into a read/write problem these last few days and I cannot fix an issue in my test. I have a JSON document based on the following model:
package models

import play.api.libs.json._

object Models {
  case class Record(
    id: Int,
    samples: List[Double]
  )

  object Record {
    implicit val recordFormat = Json.format[Record]
  }
}
I have two functions: one to read a record and another one to update it.
case class MongoIO(futureCollection: Future[JSONCollection]) {
  def readRecord(id: Int): Future[Option[Record]] =
    futureCollection
      .flatMap { collection =>
        collection.find(Json.obj("id" -> id)).one[Record]
      }

  def updateRecord(id: Int, newSample: Double): Future[UpdateWriteResult] = {
    readRecord(id) flatMap { recordOpt =>
      recordOpt match {
        case None =>
          Future { UpdateWriteResult(ok = false, -1, -1, Seq(), Seq(), None, None, None) }
        case Some(record) =>
          val newRecord =
            record.copy(samples = record.samples :+ newSample)
          futureCollection
            .flatMap { collection =>
              collection.update(Json.obj("id" -> id), newRecord)
            }
      }
    }
  }
}
Now I have a List[Future[UpdateWriteResult]] that corresponds to many updates on the document. What I want is this: wait until the first future completes before executing the second one, then wait for the second to complete before executing the third, and so on. I tried to do that with foldLeft and flatMap like this:
val l: List[Future[UpdateWriteResult]] = ...
println(l.size) // gives me 10
l.foldLeft(Future.successful(UpdateWriteResult(ok = false, -1, -1, Seq(), Seq(), None, None, None))) {
  case (cur, next) => cur.flatMap(_ => next)
}
but the document is never updated as expected: instead of a document with a samples list of size 10, I get a list with a single sample. So the reads seem to be faster than the writes (that is my impression), and the foldLeft/flatMap combination does not seem to wait for the completion of the current future. How can I fix this issue properly (without Await)?
Update
val futureCollection = DB.getCollection("foo")
val mongoIO = MongoIO(futureCollection)
val id = 1
val samples = List(1.1, 2.2, 3.3)
val l : List[Future[UpdateWriteResult]] = samples.map(sample => mongoIO.updateRecord(id, sample))
You have to do the foldLeft on the samples:
(mongoIO.updateRecord(id, samples.head) /: samples.tail) { (acc, next) =>
  acc.flatMap(_ => mongoIO.updateRecord(id, next))
}
updateRecord is what triggers the future, so you have to make sure not to call it until the previous one has finished.
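To spell out the pitfall (a sketch reusing the values from the Update section above): futures are eager, so every update had already started before the fold ran:

// All ten futures start running right here, concurrently; each
// updateRecord reads the same initial document before any write lands.
val eager: List[Future[UpdateWriteResult]] =
  samples.map(sample => mongoIO.updateRecord(id, sample))
// Folding over the list afterwards only observes their completion;
// it cannot delay work that has already begun.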
I took the liberty of making some minor modifications to your original sources:
case class MongoIO(futureCollection: Future[JSONCollection]) {
  def readRecord(id: Int): Future[Option[Record]] =
    futureCollection.flatMap(_.find(Json.obj("id" -> id)).one[Record])

  def updateRecord(id: Int, newSample: Double): Future[UpdateWriteResult] =
    readRecord(id) flatMap {
      case None => Future.successful(UpdateWriteResult(ok = false, -1, -1, Nil, Nil, None, None, None))
      case Some(record) =>
        val newRecord = record.copy(samples = record.samples :+ newSample)
        futureCollection.flatMap(_.update(Json.obj("id" -> id), newRecord))
    }
}
Then we only need to write a serialized/sequential update:
def update(samples: Seq[Double], id: Int): Future[Unit] = samples match {
  case sample +: remainingSamples =>
    updateRecord(id, sample).flatMap(_ => update(remainingSamples, id))
  case _ =>
    Future.successful(())
}

Tuple seen as Product, compiler rejects reference to element

Constructing phoneVector:
val phoneVector = (
  for (i <- 1 until 20) yield {
    val p = killNS(r.get("Phone %d - Value" format i))
    val t = killNS(r.get("Phone %d - Type" format i))
    if (p == None) None
    else if (t == None) (p, "Main")
    else (p, t)
  }
).filter(_ != None)
Consider this very simple snippet:
for (pTuple <- phoneVector) {
  println(pTuple.getClass.getName)
  println(pTuple)
  //val pKey = pTuple._1.replaceAll("[^\\d]","")
  associate() // stub prints "associate"
}
When I run it, I see output like this:
scala.Tuple2
((609) 954-3815,Mobile)
associate
When I uncomment the line with replaceAll(), compilation fails:
....scala:57: value _1 is not a member of Product with Serializable
[error] val pKey = pTuple._1.replaceAll("[^\\d]","")
[error] ^
Why does it not recognize pTuple as a Tuple2, treating it only as a Product?
OK, this compiles and produces the desired result. But it's too verbose. Can someone please demonstrate a more concise solution for dealing with this typesafe stuff?
for (pTuple <- phoneVector) {
  println(pTuple.getClass.getName)
  println(pTuple)
  val pPhone = pTuple match {
    case t: Tuple2[_, _] => t._1
    case _               => None
  }
  val pKey = pPhone match {
    case s: String => s.replaceAll("[^\\d]", "")
    case _         => None
  }
  println(pKey)
  associate()
}
Because some iterations yield None and others yield a tuple, the element type of phoneVector is inferred as their least upper bound, Product with Serializable, and _1 is not a member of that type. You can do:
for (pTuple <- phoneVector) {
  val pPhone = pTuple match {
    case (key, value) => key
    case _            => None
  }
  val pKey = pPhone match {
    case s: String => s.replaceAll("[^\\d]", "")
    case _         => None
  }
  println(pKey)
  associate()
}
Or, once the element type is an actual tuple, simply phoneVector.map(_._1.replaceAll("[^\\d]",""))
By changing the construction of phoneVector, as wrick's question implied, I've been able to eliminate the match/case stuff because Tuple is assured. Not thrilled by it, but Change is Hard, and Scala seems cool.
Now, it's still possible to slip a None value into either of the Tuple values. My match/case does not check for that, and I suspect that could lead to a runtime error in the replaceAll call. How is that allowed?
def killNS(s: Option[_]) = {
  (s match {
    case _: Some[_] => s.get
    case _          => None
  }) match {
    case None => None
    case ""   => None
    case s    => s
  }
}
val phoneVector = (
  for (i <- 1 until 20) yield {
    val p = killNS(r.get("Phone %d - Value" format i))
    val t = killNS(r.get("Phone %d - Type" format i))
    if (t == None) (p, "Main") else (p, t)
  }
).filter(_._1 != None)

println(phoneVector)
println(name)
println

// Create the Neo4j nodes:
for (pTuple <- phoneVector) {
  val pPhone = pTuple._1 match { case p: String => p }
  val pType = pTuple._2
  val pKey = pPhone.replaceAll(",.*", "").replaceAll("[^\\d]", "")
  associate(Map(
    "target"   -> Map("label" -> "Phone", "key" -> pKey, "dial" -> pPhone),
    "relation" -> Map("label" -> "IS_AT", "key" -> pType),
    "source"   -> Map("label" -> "Person", "name" -> name)
  ))
}
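To address the remaining worry about a None slipping into a tuple, a hedged sketch of an Option-based construction (it assumes r.get returns Option[String], as the original usage suggests):

// killNS as an Option filter: absent and empty strings both become None.
def killNS(s: Option[String]): Option[String] = s.filter(_.nonEmpty)

// p.map(...) drops entries with no phone value, so the element type is
// (String, String) and no None can ever reach replaceAll at runtime.
val phoneVector: Seq[(String, String)] =
  (1 until 20).flatMap { i =>
    val p = killNS(r.get("Phone %d - Value" format i))
    val t = killNS(r.get("Phone %d - Type" format i))
    p.map(phone => (phone, t.getOrElse("Main")))
  }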

mongodb casbah and list handling

I am having problems writing this function, which takes a string and returns the list of strings associated with it.
(I'm expecting entries like {_id: ..., hash: "abcde", n: ["a", "b", "ijojoij"]} in MongoDB.)
def findByHash(hash: Hash) = {
  val dbobj = mongoColl.findOne(MongoDBObject("hash" -> hash.hashStr))
  val n = dbobj match {
    case Some(doc: com.mongodb.casbah.Imports.DBObject) =>
      doc("n") match {
        case Some(n: com.mongodb.casbah.Imports.DBObject) =>
          Some(List[String]() ++ n map { x => x.asInstanceOf[String] })
        case _ =>
          None // hash matched but no n in the object
      }
    case _ =>
      None // no hash match
  }
  n
}
Is there anything wrong with the code? Do you know how to correct it?
doc("n") returns AnyRef, so you should explicitly cast it to BasicDBList.
val n = doc("n").asInstanceOf[BasicDBList]
Some(List[String]() ++ n map { x => x.asInstanceOf[String] })
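Putting it together, a hedged sketch of the corrected function (getAs is Casbah's typed accessor; the document shape is assumed from the question):

import com.mongodb.casbah.Imports._

def findByHash(hash: Hash): Option[List[String]] =
  mongoColl.findOne(MongoDBObject("hash" -> hash.hashStr)).flatMap { doc =>
    // A missing "n" (or a non-list value) yields None instead of a cast failure.
    doc.getAs[MongoDBList]("n").map(_.collect { case s: String => s }.toList)
  }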