How to dynamically invoke UDAFs in Spark? - scala

Update:
It seems the request below is not reasonable. The better practice is to include the aggregation name in the partitioner function; that way the job scales better.
I have multiple independent UDAFs (user-defined aggregation functions) and want to run different aggregations on the same Dataset, targetDataSet. It looks like the code below.
// The Dataset
val targetDataSet = ....
// The first UDAF
val agg1 = new Agg1
targetDataSet
  .groupByKey(c => agg1.genPartitionKey(c))(Encoders.STRING)
  .reduceGroups(agg1.aggs(_, _))
  .map {
    case (key, value) => agg1.generateRuleOutput(key, value)
  }
  .write
  .format("json")
  .mode(SaveMode.Append)
  .save(agg1.getOutputPath())
// The second UDAF
val agg2 = new Agg2
targetDataSet
  .groupByKey(c => agg2.genPartitionKey(c))(Encoders.STRING)
  .reduceGroups(agg2.aggs(_, _))
  .map {
    case (key, value) => agg2.generateRuleOutput(key, value)
  }
  .write
  .format("json")
  .mode(SaveMode.Append)
  .save(agg2.getOutputPath())
The code above works, but the repetition is ugly since the UDAFs share the same interface.
So my question is whether we can invoke the UDAFs dynamically. If so, we could configure the UDAFs in config files and the code would be much cleaner.
I tried calling these UDAFs in a for loop, but it did not work.
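For illustration, here is a minimal sketch of what dynamic invocation could look like if the aggregators sit behind a common trait. The Agg trait, its method signatures, and the runAll helper below are assumptions inferred from the calls above (Agg1 and Agg2 would need to extend the trait); this is a sketch, not an existing API.

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SaveMode}

// Hypothetical common interface, inferred from the calls above
trait Agg[T, R] extends Serializable {
  def genPartitionKey(c: T): String
  def aggs(a: T, b: T): T
  def generateRuleOutput(key: String, value: T): R
  def getOutputPath(): String
}

// Run every configured aggregator against the same Dataset
def runAll[T, R: Encoder](targetDataSet: Dataset[T], aggregators: Seq[Agg[T, R]]): Unit =
  aggregators.foreach { agg =>
    targetDataSet
      .groupByKey(c => agg.genPartitionKey(c))(Encoders.STRING)
      .reduceGroups(agg.aggs(_, _))
      .map { case (key, value) => agg.generateRuleOutput(key, value) }
      .write
      .format("json")
      .mode(SaveMode.Append)
      .save(agg.getOutputPath())
  }

// e.g. runAll(targetDataSet, Seq(new Agg1, new Agg2))

If the aggregator class names came from a config file, instances could be created reflectively, e.g. Class.forName(name).getDeclaredConstructor().newInstance().asInstanceOf[Agg[T, R]], though as the update above notes, folding the aggregation name into the partition key may scale better than running one job per aggregator.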

Related

How to deal with None output of a function in Scala?

I have the following function:
import org.apache.spark.sql.{DataFrame, SparkSession}

def getData(spark: SparkSession,
            indices: Option[String]): Option[DataFrame] = {
  indices.map { ind =>
    spark
      .read
      .format("org.elasticsearch.spark.sql")
      .load(ind)
  }
}
This function returns Option[DataFrame].
Then I want to use this function as follows:
val df = getData(spark, indices)
df.persist(StorageLevel.MEMORY_AND_DISK)
Of course the last line will not compile, because df is an Option[DataFrame] rather than a DataFrame and might be None. What is the idiomatic way to deal with a None result in Scala?
I would like to throw an exception and stop the program if df is None. Otherwise I want to persist it.
If you do care about the None case, I'd use a simple pattern match here:
df match {
  case None => throw new RuntimeException()
  case Some(dataFrame) => dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
}
But if you don't care, just use foreach like:
df.foreach { dataFrame =>
  dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
}
Alternatively, as a one-liner with getOrElse:
val df = dfOption.getOrElse(throw new Exception("Disaster Strikes"))
df.persist(...)
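Another idiomatic possibility, a sketch of my own assuming the getData signature from the question: Option.fold expresses both branches in a single expression.

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Throw if getData returned None, otherwise persist and keep the DataFrame
val df: DataFrame = getData(spark, indices).fold[DataFrame](
  throw new RuntimeException("No indices provided") // None case
)(dataFrame => dataFrame.persist(StorageLevel.MEMORY_AND_DISK))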

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speedup using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is something going on with lazy execution where, even with my .count, it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)

// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => {
      ((itemId, ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count

// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => {
      ((f.getInt(0), ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
})

val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

is rdd.contains function in spark-scala expensive

I am getting millions of messages from a Kafka stream in Spark Streaming. There are 15 different types of messages, and they all come from a single topic. I can only differentiate messages by their content, so I am using the rdd.contains method to get the different types of RDDs.
sample message
{"a":"foo", "b":"bar","type":"first" .......}
{"a":"foo1", "b":"bar1","type":"second" .......}
{"a":"foo2", "b":"bar2","type":"third" .......}
{"a":"foo", "b":"bar","type":"first" .......}
..............
...............
.........
so on
code
DStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val rdd_first = rdd.filter { ele =>
      ele.contains("First")
    }
    if (!rdd_first.isEmpty()) {
      insertIntoTableFirst(hivecontext.read.json(rdd_first))
    }
    val rdd_second = rdd.filter { ele =>
      ele.contains("Second")
    }
    if (!rdd_second.isEmpty()) {
      insertIntoTableSecond(hivecontext.read.json(rdd_second))
    }
    // ... the same way for all 15 different RDDs
  }
}
Is this use of contains expensive, and is there a better way to get the different RDDs from the Kafka topic messages?
There's no rdd.contains. The function contains used here is applied to the Strings in the RDD.
Like here:
val rdd_first = rdd.filter { element =>
  element.contains("First") // each `element` is a String
}
This method is not robust, because other content in the String might match the comparison, resulting in errors.
e.g.
{"a":"foo", "b":"bar","type":"second", "c": "first", .......}
One way to deal with this would be to first transform the JSON data into proper records, and then apply grouping or filtering logic on those records. For that, we first need a schema definition of the data. With the schema, we can parse the raw strings as JSON and apply any processing on top of the resulting records:
case class Record(a: String, b: String, `type`: String)

import org.apache.spark.sql.types._
val schema = StructType(
  Array(
    StructField("a", StringType, true),
    StructField("b", StringType, true),
    StructField("type", StringType, true)
  )
)
// from_json and $ require: import org.apache.spark.sql.functions.from_json and import spark.implicits._
val processPerType: Map[String, Dataset[Record] => Unit] = Map(...)

stream.foreachRDD { rdd =>
  val records = rdd.toDF("value")
    .select(from_json($"value", schema) as "record")
    .select("record.*")
    .as[Record]
  processPerType.foreach { case (tpe, process) =>
    val target = records.filter(entry => entry.`type` == tpe)
    process(target)
  }
}
The question does not specify what kind of logic needs to be applied to each type of record. What's presented here is a generic way of approaching the problem where any custom logic can be expressed as a function Dataset[Record] => Unit.
If the logic can be expressed as an aggregation, the Dataset aggregation functions will probably be more appropriate.
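For illustration only (the table names and routing logic here are hypothetical, not from the original question), the per-type processing map might look something like this, writing each type of record to its own table:

import org.apache.spark.sql.{Dataset, SaveMode}

// Hypothetical routing: write each record type to its own table
val processPerType: Map[String, Dataset[Record] => Unit] = Map(
  "first"  -> { (ds: Dataset[Record]) => ds.write.mode(SaveMode.Append).saveAsTable("table_first") },
  "second" -> { (ds: Dataset[Record]) => ds.write.mode(SaveMode.Append).saveAsTable("table_second") }
  // ... one entry per message type
)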

Spark : How to use mapPartition and create/close connection per partition

So, I want to do certain operations on my Spark DataFrame, write them to a DB, and create another DataFrame at the end. It looks like this:
import sqlContext.implicits._
val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    iterator.map(
      row => {
        addRowToBatch(row)
        convertRowToObject(row)
      })
    conn.writeTheBatchToDB()
    conn.close()
  })
  .toDF()
This gives me an error because mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with foreachPartition, but I'd like to do the mapping as well. Doing it separately would be an overhead (an extra Spark job). What should I do?
Thanks!
In most cases, eagerly consuming the iterator will result in execution failure, if not a slowdown of jobs. So what I've done is check whether the iterator is already empty and then do the cleanup routines.
rdd.mapPartitions(itr => {
  val conn = new DbConnection
  itr.map(data => {
    val yourActualResult = // do something with your data and conn here
    if (itr.isEmpty) conn.close // close the connection
    yourActualResult
  })
})
I thought this was a Spark problem at first, but it was actually a Scala one: http://www.scala-lang.org/api/2.12.0/scala/collection/Iterator.html#isEmpty:Boolean
The last expression in the anonymous function implementation must be the return value:
import sqlContext.implicits._
val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    // using toList to force eager computation - make it happen now when connection is open
    val result = iterator.map(/* the same... */).toList
    conn.writeTheBatchToDB()
    conn.close()
    result.iterator
  }
).toDF()
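One possible refinement, my own suggestion rather than part of either answer: wrap the body in try/finally so the connection is also closed if writing the batch fails. A sketch, assuming the same hypothetical DbConnection, addRowToBatch and convertRowToObject from the question:

val newDF = myDF.mapPartitions { iterator =>
  val conn = new DbConnection
  try {
    // toList forces eager evaluation while the connection is still open
    val result = iterator.map { row =>
      addRowToBatch(row)
      convertRowToObject(row)
    }.toList
    conn.writeTheBatchToDB()
    result.iterator
  } finally {
    conn.close()
  }
}.toDF()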

scala.MatchError: null on spark RDDs

I am relatively new to both Spark and Scala.
I was trying to implement collaborative filtering using Scala on Spark.
Below is the code:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null at
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I going wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.
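As a small usage example of the structure described above (my own addition, with arbitrary sample lookups): individual users can be queried from the predictions RDD with lookup from PairRDDFunctions.

import org.apache.spark.SparkContext._ // PairRDDFunctions (already in scope in spark-shell)

// Prediction map for a single user
val someUser = distinctUsers.first()
val userPredictions: Map[Int, Double] = predictions.lookup(someUser).head

// Predicted rating for one keyword, if it was predicted
val someKeyword = distinctKeywords.first()
println(userPredictions.get(someKeyword))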