Pass an array as a UDF parameter in Spark SQL - Scala

I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:
def getCategory(categories: Array[String], input: String): String = {
  categories(input.toInt)
}
val myArray = Array("a", "b", "c")
val myCategories = udf(getCategory _)
val df = sqlContext.parquetFile("myfile.parquet")
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))
However, lit doesn't accept arrays and this script errors out. I tried defining a partially applied function and then the UDF after that:
val newFunc = getCategory(myArray, _:String)
val myCategories = udf(newFunc)
val df1 = df.withColumn("newCategory", myCategories(col("myInput")))
This doesn't work either: I get a NullPointerException and it appears myArray is not being recognized. Any ideas on how to pass an array as a parameter to a function used with a dataframe?
On a separate note, is there any explanation for why doing something simple like using a function on a dataframe is so complicated (define the function, redefine it as a UDF, etc.)?

Most likely not the prettiest solution, but you can try something like this:
def getCategory(categories: Array[String]) = {
  udf((input: String) => categories(input.toInt))
}
df.withColumn("newCategory", getCategory(myArray)(col("myInput")))
You could also try an array of literals:
val getCategory = udf(
  (input: String, categories: Seq[String]) => categories(input.toInt))
df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))
On a side note, using a Map instead of an Array is probably a better idea:
def mapCategory(categories: Map[String, String], default: String) = {
  udf((input: String) => categories.getOrElse(input, default))
}
val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")
df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))
Since Spark 1.5.0 you can also use an array function:
import org.apache.spark.sql.functions.array
val colArray = array(myArray.map(lit _): _*)
myCategories(colArray, col("myInput"))
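For this call to work at runtime, keep in mind that Spark hands an array column to a Scala UDF as a Seq, not an Array, so the UDF should declare Seq[String] (as in the getCategory variant above). A minimal self-contained sketch (myCategoriesSeq is just an illustrative name), reusing myArray and the myInput column from the question:
import org.apache.spark.sql.functions.{array, col, lit, udf}

// Sketch: a UDF declared against Seq[String] so it can accept the array column
val myCategoriesSeq = udf((categories: Seq[String], input: String) => categories(input.toInt))
val colArray = array(myArray.map(lit _): _*)
val df1 = df.withColumn("newCategory", myCategoriesSeq(colArray, col("myInput")))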
See also Spark UDF with varargs

Related

Cannot splat an Array into a function's arguments that accept varargs

I have tried to make a function that can enrich a given DataFrame with a "session" column using a window function. So I need to use partitionBy and orderBy.
val by_uuid_per_date = Window.partitionBy("uuid").orderBy("year","month","day")
// A Session = A day of events for a certain user. uuid x (year+month+day)
val enriched_df = df
  .withColumn("session", dense_rank().over(by_uuid_per_date))
  .orderBy("uuid", "timestamp")
  .select("uuid", "year", "month", "day", "session")
This works perfectly, but when I try to make a function that encapsulates this behavior:
PS: I used the _* splat operator.
def enrich_with_session(df: DataFrame,
                        window_partition_cols: Array[String],
                        window_order_by_cols: Array[String],
                        presentation_order_by_cols: Array[String]): DataFrame = {
  val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
  df.withColumn("session", dense_rank().over(by_uuid_per_date))
    .orderBy(presentation_order_by_cols: _*)
    .select("uuid", "year", "month", "mday", "session")
}
I get the following error:
notebook:6: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
val by_uuid_per_date = Window.partitionBy(window_partition_cols: _*).orderBy(window_order_by_cols: _*)
partitionBy and orderBy expect Column varargs here, so the splatted argument has to be a Seq[Column] or Array[Column]; see below:
val data = Seq(
  (1, 99),
  (1, 99),
  (1, 70),
  (1, 20)
).toDF("id", "value")
data.select('id, 'value, rank().over(Window.partitionBy('id).orderBy('value))).show()

val partitionBy: Seq[Column] = Seq(data("id"))
val orderBy: Seq[Column] = Seq(data("value"))
data.select('id, 'value, rank().over(Window.partitionBy(partitionBy: _*).orderBy(orderBy: _*))).show()
So in this case, your code should look like this:
def enrich_with_session(df: DataFrame,
                        window_partition_cols: Array[String],
                        window_order_by_cols: Array[String],
                        presentation_order_by_cols: Array[String]): DataFrame = {
  val window_partition_cols_2: Array[Column] = window_partition_cols.map(df(_))
  val window_order_by_cols_2: Array[Column] = window_order_by_cols.map(df(_))
  val presentation_order_by_cols_2: Array[Column] = presentation_order_by_cols.map(df(_))
  val by_uuid_per_date = Window.partitionBy(window_partition_cols_2: _*).orderBy(window_order_by_cols_2: _*)
  df.withColumn("session", dense_rank().over(by_uuid_per_date))
    .orderBy(presentation_order_by_cols_2: _*)
    .select("uuid", "year", "month", "mday", "session")
}
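For illustration, a hypothetical call to the fixed function might look like this (the column names simply mirror the ones used above):
val enriched_df = enrich_with_session(
  df,
  window_partition_cols = Array("uuid"),
  window_order_by_cols = Array("year", "month", "day"),
  presentation_order_by_cols = Array("uuid", "timestamp"))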

Convert map to mapPartitions in Spark

I have code that analyzes a log file using a map transformation. The RDD is then converted to a DataFrame.
val logData = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/syslog.txt")
val logDataDF = logData.map(rec => (rec.split(" ")(0), rec.split(" ")(2), rec.split(" ")(5))).toDF("month", "date", "process")
I would like to know whether I can use mapPartitions in this case instead of map.
I don't know what your use case is, but you can definitely use mapPartitions instead of map. The code below returns the same logDataDF.
val logDataDF = logData.mapPartitions(x => {
  val lst = scala.collection.mutable.ListBuffer[(String, String, String)]()
  while (x.hasNext) {
    val rec = x.next().split(" ")
    lst += ((rec(0), rec(2), rec(5)))
  }
  lst.iterator
}).toDF("month", "date", "process")
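If you'd rather avoid buffering a whole partition in a ListBuffer, a sketch of a lazier variant maps the partition's iterator directly; it produces the same logDataDF:
val logDataDF = logData.mapPartitions { iter =>
  // transform each line lazily instead of materializing the partition first
  iter.map { line =>
    val rec = line.split(" ")
    (rec(0), rec(2), rec(5))
  }
}.toDF("month", "date", "process")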

Return a tuple of Map and String from UDF

I am trying to construct a temporary column from an expensive UDF that I need to run on each row of my Dataset[Row]. Currently it looks something like:
val myUDF = udf((values: Array[Byte], schema: String) => {
  val list = new MyDecoder(schema).decode(values)
  val myMap = list.map(
    (value: SomeStruct) => (value.field1, value.field2)
  ).toMap
  val field3 = list.head.field3
  return (myMap, field3)
})
val decoded = myDF.withColumn("decoded_tmp", myUDF(col("data"), lit(schema)))
  .withColumn("myMap", col("decoded_tmp._1"))
  .withColumn("field3", col("decoded_tmp._2"))
  .drop("decoded_tmp")
However, when I try to compile this, I get a type mismatch error:
type mismatch;
found : (scala.collection.immutable.Map[String,Double], String)
required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
How can I get around this, or will I have to have 2 expensive UDF functions, one to produce myMap and the other to produce the field3 column?
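One likely culprit here is the explicit return inside the lambda: in Scala, return in a closure tries to return from the enclosing method (whose result type is apparently Dataset[Row]), which produces exactly this mismatch. A minimal sketch of the same UDF without return, assuming MyDecoder and SomeStruct as in the question; a Scala UDF that returns a tuple yields a struct column, so decoded_tmp._1 and decoded_tmp._2 can be read as above:
val myUDF = udf((values: Array[Byte], schema: String) => {
  val list = new MyDecoder(schema).decode(values)  // MyDecoder and SomeStruct as in the question
  val myMap = list.map((value: SomeStruct) => (value.field1, value.field2)).toMap
  (myMap, list.head.field3)  // last expression is the result; no return needed
})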

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
  // execute myFunction()
  (new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray) {
  newRDD = sc.union(newRDD, (new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get an RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
  (new MyClass(word)).myFunction().collect()
}
You cannot create an RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input with one letter removed, into a function modifiedFunction: String => Seq[String] that can be used from within an RDD transformation. That way it is also executed in parallel on your cluster. With modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)

  val input = sc.parallelize(Seq("apple", "ananas", "banana"))
  // RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
  val result = input.flatMap(modifiedFunction)
}

def modifiedFunction(word: String): Seq[String] = {
  word.indices map {
    index => word.substring(0, index) + word.substring(index + 1)
  }
}

How to join two RDDs by key to get RDD of (String, String)?

I have two paired RDDs of the form RDD[(String, mutable.HashSet[String])]:
For example:
rdd1: 332101231222, "320758, 320762, 320760, 320759, 320757, 320761"
rdd2: 332101231222, "220758, 220762, 220760, 220759, 220757, 220761"
I want to combine rdd1 and rdd2 based on common keys, so the output should look like:
332101231222 320758, 320762, 320760, 320759, 320757, 320761 220758, 220762, 220760, 220759, 220757, 220761
Here is my code:
def cogroupTest(rdd1: RDD[(String, mutable.HashSet[String])], rdd2: RDD[(String, mutable.HashSet[String])]): Unit = {
  val prods_per_user_co_grouped = rdd1.cogroup(rdd2)
  prods_per_user_co_grouped.map { case (key: String, (value1: mutable.HashSet[String], value2: mutable.HashSet[String])) => {
    val combinedhs = value1 ++ value2
    val sstr = combinedhs.mkString("\t")
    val keypadded = key + "\t"
    s"$keypadded$sstr"
  }
  }.saveAsTextFile("/scratch/rdds_joined/")
}
Here is the error that I get when I run my program:
scala.MatchError: (32101231222,(CompactBuffer(Set(320758, 320762, 320760, 320759, 320757, 320761)),CompactBuffer(Set(220758, 220762, 220760, 220759, 220757, 220761)))) (of class scala.Tuple2)
Any help with this will be great!
As you might guess from the name, cogroup groups observations by key. That means in your case you get:
(String, (Iterable[mutable.HashSet[String]], Iterable[mutable.HashSet[String]]))
not
(String, (mutable.HashSet[String], mutable.HashSet[String]))
It is pretty clear when you take a look at the error you get. If you want to combine pairs you should use the join method. If not, you should adjust the pattern to match the structure you get and then use something like this:
val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
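Putting that together, a sketch of the adjusted map (assuming every key appears in both RDDs, so both Iterables are non-empty; otherwise reduce would throw and a fold would be safer):
prods_per_user_co_grouped.map { case (key, (value1, value2)) =>
  // each side is an Iterable[mutable.HashSet[String]]; merge the sets before concatenating
  val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
  s"$key\t${combinedhs.mkString("\t")}"
}.saveAsTextFile("/scratch/rdds_joined/")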