In Spark SQL, how do you register and use a generic UDF? - scala

In my project, I want to implement an ADD(+) function, but the parameters may be LongType, DoubleType, or IntType. I use sqlContext.udf.register("add", XXX), but I don't know how to write XXX so that the function is generic.

You can create a generic UDF by building a struct column with struct($"col1", $"col2") that holds your values and having your UDF work off of that. The struct gets passed into your UDF as a Row object, so you can do something like this:
val multiAdd = udf[Double, Row](r => {
  var n = 0.0
  r.toSeq.foreach(n1 => n = n + (n1 match {
    case l: Long   => l.toDouble
    case i: Int    => i.toDouble
    case d: Double => d
    case f: Float  => f.toDouble
  }))
  n
})
val df = Seq((1.0,2),(3.0,4)).toDF("c1","c2")
df.withColumn("add", multiAdd(struct($"c1", $"c2"))).show
+---+---+---+
| c1| c2|add|
+---+---+---+
|1.0| 2|3.0|
|3.0| 4|7.0|
+---+---+---+
You can even do interesting things like take a variable number of columns as input. In fact, our UDF defined above already does that:
val df = Seq((1, 2L, 3.0f,4.0),(5, 6L, 7.0f,8.0)).toDF("int","long","float","double")
df.printSchema
root
|-- int: integer (nullable = false)
|-- long: long (nullable = false)
|-- float: float (nullable = false)
|-- double: double (nullable = false)
df.withColumn("add", multiAdd(struct($"int", $"long", $"float", $"double"))).show
+---+----+-----+------+----+
|int|long|float|double| add|
+---+----+-----+------+----+
| 1| 2| 3.0| 4.0|10.0|
| 5| 6| 7.0| 8.0|26.0|
+---+----+-----+------+----+
You can even add a hard-coded number into the mix:
df.withColumn("add", multiAdd(struct(lit(100), $"int", $"long"))).show
+---+----+-----+------+-----+
|int|long|float|double| add|
+---+----+-----+------+-----+
| 1| 2| 3.0| 4.0|103.0|
| 5| 6| 7.0| 8.0|111.0|
+---+----+-----+------+-----+
If you want to use the UDF in SQL syntax, you can do:
sqlContext.udf.register("multiAdd", (r: Row) => {
var n = 0.0
r.toSeq.foreach(n1 => n = n + (n1 match {
case l: Long => l.toDouble
case i: Int => i.toDouble
case d: Double => d
case f: Float => f.toDouble
}))
n
})
df.registerTempTable("df")
// Note that 'int' and 'long' are column names
sqlContext.sql("SELECT *, multiAdd(struct(int, long)) as add from df").show
+---+----+-----+------+----+
|int|long|float|double| add|
+---+----+-----+------+----+
| 1| 2| 3.0| 4.0| 3.0|
| 5| 6| 7.0| 8.0|11.0|
+---+----+-----+------+----+
This works too:
sqlContext.sql("SELECT *, multiAdd(struct(*)) as add from df").show
+---+----+-----+------+----+
|int|long|float|double| add|
+---+----+-----+------+----+
| 1| 2| 3.0| 4.0|10.0|
| 5| 6| 7.0| 8.0|26.0|
+---+----+-----+------+----+

I don't think you can register a generic UDF.
If we take a look at the signature of the register method
(actually, it's just one of the 22 register overloads, used for UDFs with one argument, the others are equivalent):
def register[RT: TypeTag, A1: TypeTag](name: String, func: Function1[A1, RT]): UserDefinedFunction
We can see that it's parameterized with an A1: TypeTag type - the TypeTag means that at the time of registration we must have evidence of the actual type of the UDF's argument. So passing a generic function func without typing it explicitly can't compile.
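For example, one verbose but workable approach is to register a separately typed UDF per numeric type - a minimal sketch, where the names add_long, add_int and add_double are chosen purely for illustration:
// Hypothetical workaround: one explicitly typed registration per numeric type.
sqlContext.udf.register("add_long",   (x: Long, y: Long)     => x + y)
sqlContext.udf.register("add_int",    (x: Int, y: Int)       => x + y)
sqlContext.udf.register("add_double", (x: Double, y: Double) => x + y)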
For your case, you might be able to take advantage of Spark's ability to cast numeric types automatically - write a UDF for Doubles only, and you can also apply it to Ints (the output would be Double, though):
sqlContext.udf.register("add", (i: Double) => i + 1)
// creating a table with Double and Int types:
sqlContext.createDataFrame(Seq((1.5, 4), (2.2, 5))).registerTempTable("table1")
// applying UDF to both types:
sqlContext.sql("SELECT add(_1), add(_2) FROM table1").show()
// output:
// +---+---+
// |_c0|_c1|
// +---+---+
// |2.5|5.0|
// |3.2|6.0|
// +---+---+

Related

How to add a column collection based on the maximum and minimum values in a dataframe

I've got this DataFrame
val for_df = Seq((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-2k")).toDF("min","max","salary")
I want to convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8.
Here is what I have tried so far:
a.select("min","max","salary")
  .as[(Integer,Integer,String)]
  .map { case (min, max, salary) =>
    (min, max, salary.split("-").flatMap(x => {
      for (i <- 0 to x.length - 1) yield i
    }))
  }.toDF("1","2","3").show()
You need to create a UDF to expand the limits. The following UDF will convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8, and so on:
import org.apache.spark.sql.functions._
val inputDF = sc.parallelize(List((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
val extendUDF = udf((str: String) => {
  val nums = str.replace("k","").split("-").map(_.toInt)
  (nums(0) to nums(1)).toList.mkString(",")
})
val output = inputDF.withColumn("salary_level", extendUDF($"salary"))
Output:
scala> output.show
+---+---+------+----------------+
|min|max|salary| salary_level|
+---+---+------+----------------+
| 5| 7| 5k-7k| 5,6,7|
| 4| 8| 4k-8k| 4,5,6,7,8|
| 6| 12|6k-12k|6,7,8,9,10,11,12|
+---+---+------+----------------+
You can easily do this with a udf.
// The following defines a udf in Spark which creates a list as per your requirement.
val makeRangeLists = udf( (min: Int, max: Int) => List.range(min, max+1) )
val input = sc.parallelize(List((5,7,"5k-7k"),
(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
// Create a new column using the UDF and pass the max and min columns.
input.withColumn("salary_level", makeRangeLists($"min", $"max")).show
Here is one quick option with a UDF:
import org.apache.spark.sql.functions
val toSalary = functions.udf((value: String) => {
  val array = value.filterNot(_ == 'k').split("-").map(_.trim.toInt).sorted
  val (startSalary, endSalary) = (array.headOption, array.tail.headOption)
  (startSalary, endSalary) match {
    case (Some(s), Some(e)) => (s to e).toList.mkString(",")
    case _ => ""
  }
})
for_df.withColumn("salary_level", toSalary($"salary")).drop("salary")
Input
+---+---+------+
|min|max|salary|
+---+---+------+
| 5| 7| 5k-7k|
| 4| 8| 4k-8k|
| 6| 12| 6k-2k|
+---+---+------+
Result
+---+---+------------+
|min|max|salary_level|
+---+---+------------+
| 5| 7| 5,6,7|
| 4| 8| 4,5,6,7,8|
| 6| 12| 2,3,4,5,6|
+---+---+------------+
First you remove the k and split the string on the dash. Then you take the start and end salaries and build a range between them.

Spark Scala replace Dataframe blank records to "0"

I need to replace the blank records in my DataFrame field with "0". Here is my code:
import sqlContext.implicits._
case class CInspections (business_id:Int, score:String, date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map ( line => line.split ("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map ( raw_inspections => CInspections (raw_inspections(0).toInt,raw_inspections(1), raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
raw_inspectionsDF.show()
I am using a case class and then converting to a DataFrame. But I need "score" as Int because I have to perform some operations on it and sort it.
If I declare it as score: Int, I get an error for the blank values:
java.lang.NumberFormatException: For input string: ""
+-----------+-----+--------+--------------------+
|business_id|score| date| type1|
+-----------+-----+--------+--------------------+
| 10| |20140807|Reinspection/Foll...|
| 10| 94|20140729|Routine - Unsched...|
| 10| |20140124|Reinspection/Foll...|
| 10| 92|20140114|Routine - Unsched...|
| 10| 98|20121114|Routine - Unsched...|
| 10| |20120920|Reinspection/Foll...|
| 17| |20140425|Reinspection/Foll...|
+-----------+-----+--------+--------------------+
I need the score field as Int because the query below sorts it as String, not Int, and gives the wrong result:
sqlContext.sql("""select raw_inspectionsDF.score from raw_inspectionsDF where score <>"" order by score""").show()
+-----+
|score|
+-----+
| 100|
| 100|
| 100|
+-----+
An empty string can't be converted to an Integer; you need to make the score nullable so that a missing value is represented as null. You can try the following:
import scala.util.{Try, Success, Failure}
1) Define a custom parse function which returns None if the string can't be converted to an Int (in your case, an empty string):
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(_) => None
  }
}
2) Define the score field in your case class to be of type Option[Int]:
case class CInspections (business_id:Int, score: Option[Int], date:String, type1:String)
val raw_inspections = sc.textFile("test.csv")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
3) Use the customized parseScore function to parse the score field:
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections =>
  CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)),
    raw_inspections(2), raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
//root
// |-- business_id: integer (nullable = false)
// |-- score: integer (nullable = true)
// |-- date: string (nullable = true)
// |-- type1: string (nullable = true)
raw_inspectionsDF.show()
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| null| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+
4) After parsing the file correctly, you can easily replace the null values with 0 using the na.fill function:
raw_inspectionsDF.na.fill(0).show
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| 0| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+

how to map a DataFrame to an EdgeRDD

I have a DataFrame like:
val data = sc.parallelize(Array((1,10,10,7,7),(2,7,7,7,8),(3, 5,5,6,8))).toDF("id","col1","col2","col3","col4")
What I want to do is create an EdgeRDD where two ids share a link if they share the same value in at least one of the columns:
id col1 col2 col3 col4
1 10 10 7 7
2 7 7 7 8
3 5 5 6 8
then nodes 1 and 2 have an undirected link 1--2, because they share a common value in col3.
For the same reason, nodes 2 and 3 share an undirected link because they share a common value in col4.
I know how to solve this in an ugly way (but I have far too many columns to adopt this strategy in my real case):
val data2 = data.withColumnRenamed("id", "idd")
  .withColumnRenamed("col1", "col1d")
  .withColumnRenamed("col2", "col2d")
  .withColumnRenamed("col3", "col3d")
  .withColumnRenamed("col4", "col4d")
val res = data.join(data2, data("id") < data2("idd")
&& (data("col1") === data2("col1d")
|| data("col2") === data2("col2d")
|| data("col3") === data2("col3d")
|| data("col4") === data2("col4d")))
//> res : org.apache.spark.sql.DataFrame = [id: int, col1: int, col2: int, col
//| 3: int, col4: int, idd: int, col1d: int, col2d: int, col3d: int, col4d: int
//| ]
res.show //> +---+----+----+----+----+---+-----+-----+-----+-----+
//| | id|col1|col2|col3|col4|idd|col1d|col2d|col3d|col4d|
//| +---+----+----+----+----+---+-----+-----+-----+-----+
//| | 1| 10| 10| 7| 7| 2| 7| 7| 7| 8|
//| | 2| 7| 7| 7| 8| 3| 5| 5| 6| 8|
//| +---+----+----+----+----+---+-----+-----+-----+-----+
//|
val links = EdgeRDD.fromEdges(res.map(row => Edge(row.getAs[Int]("id").toLong, row.getAs[Int]("idd").toLong, "indirect")))
//> links : org.apache.spark.graphx.impl.EdgeRDDImpl[String,Nothing] = EdgeRDD
//| Impl[27] at RDD at EdgeRDD.scala:42
links.foreach(println) //> Edge(1,2,indirect)
//| Edge(2,3,indirect)
How can I solve this for many more columns?
Do you mean something like this?
val expr = data.columns.diff(Seq("id"))
  .map(c => data(c) === data2(s"${c}d"))
  .reduce(_ || _)

data.join(data2, data("id") < data2("idd") && expr)
You can also use aliases
import org.apache.spark.sql.functions.col
val expr = data.columns.diff(Seq("id"))
  .map(c => col(s"d1.$c") === col(s"d2.$c"))
  .reduce(_ || _)

data.alias("d1").join(data.alias("d2"), col("d1.id") < col("d2.id") && expr)
You can follow either of these with a simple select ($ is equivalent to col but requires an import of sqlContext.implicits.StringToColumn):
.select($"id".cast("long"), $"idd".cast("long"))
or
.select($"d1.id".cast("long"), $"d2.id".cast("long"))
and a pattern matching:
.rdd.map { case Row(src: Long, dst: Long) => Edge(src, dst, "indirect") }
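Putting those pieces together, a sketch of the full pipeline using the alias variant above (data, expr and the "indirect" attribute are reused from the snippets in this answer):
import org.apache.spark.graphx.{Edge, EdgeRDD}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Join, keep only the two id columns as longs, and build the edges.
val links = EdgeRDD.fromEdges(
  data.alias("d1")
    .join(data.alias("d2"), col("d1.id") < col("d2.id") && expr)
    .select($"d1.id".cast("long"), $"d2.id".cast("long"))
    .rdd
    .map { case Row(src: Long, dst: Long) => Edge(src, dst, "indirect") }
)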
Just note that logical disjunctions like this one cannot be optimized and are expanded to a Cartesian product followed by a filter. If you want to avoid that, you can try to approach the problem in a different way.
Let's start by reshaping the data from wide to long:
val expr = explode(array(data.columns.tail.map(
  c => struct(lit(c).alias("column"), col(c).alias("value"))
): _*))

val long = data.withColumn("tmp", expr)
  .select($"id", $"tmp.column", $"tmp.value")
This will give us a DataFrame with the following schema:
long.printSchema
// root
// |-- id: integer (nullable = false)
// |-- column: string (nullable = false)
// |-- value: integer (nullable = false)
With data like this you have multiple choices, including an optimized join:
val pairs = long.as("long1")
.join(long.as("long2"),
$"long1.column" === $"long2.column" && // Optimized
$"long1.value" === $"long2.value" && // Optimized
$"long1.id" < $"long2.id" // Not optimized - filtered after sort-merge join
)
// Select only ids
.select($"long1.id".alias("src"), $"long2.id".alias("dst"))
// And keep distict
.distinct
pairs.show
// +---+---+
// |src|dst|
// +---+---+
// | 1| 2|
// | 2| 3|
// +---+---+
This can be further improved by using different hashing techniques to avoid the large number of records generated by explode.
You can also think about this problem as a bipartite graph where observations belong to one category of nodes and property-value pairs to another:
sealed trait CustomNode
case class Record(id: Long) extends CustomNode
case class Property(name: String, value: Int) extends CustomNode
With this as a starting point you can use long to generate edges of the following type:
Record -> Property
and solve this problem using GraphX directly by searching for paths like
Record -> Property <- Record
Hint: Collect neighbors for each property and propagate back.
As before, you should consider using hashing or buckets to limit the number of generated Property nodes.
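A rough sketch of that bipartite idea, reusing the long DataFrame built above; the vertex-id offset and the hashing of (column, value) pairs are simplifying assumptions with no collision handling:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Property vertices get ids hashed from (column, value) and shifted past the record ids.
val offset = 1000000L
def propId(column: String, value: Int): VertexId =
  offset + ((column, value).hashCode.toLong & 0x7FFFFFFFL)

val edges: RDD[Edge[String]] = long.rdd.map { r =>
  Edge(r.getInt(0).toLong, propId(r.getString(1), r.getInt(2)), "has")
}
val vertices: RDD[(VertexId, String)] =
  edges.flatMap(e => Seq((e.srcId, "record"), (e.dstId, "property"))).distinct
val graph = Graph(vertices, edges)

// Collect the record neighbours of every Property node and emit all record pairs.
val indirect = graph.collectNeighborIds(EdgeDirection.Either)
  .filter { case (vid, _) => vid >= offset }   // keep only Property nodes
  .flatMap { case (_, recs) =>
    for (a <- recs; b <- recs if a < b) yield Edge(a, b, "indirect")
  }
  .distinct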

Aggregations in JDBCRDD or RDD

I'm brand new to Scala and Spark, and I'm trying to create a SQL query over SQL Server with Spark using JdbcRDD, and do some transformations on it with mappings and aggregations.
This is what I have: a table with n String columns and m numeric columns, like
"A", "A1",1,2
"A", "A1",4,3
"A", "A2",3,4
"B", "B1",6,7
...
...
What I'm looking for is to create a hierarchical structure, grouping the strings and aggregating the numeric columns, like
A
|->A1
   |->(5,5)
|->A2
   |->(3,4)
B
|->B1
   |->(6,7)
I was able to create the hierarchy, but I'm not able to perform the aggregation on the list of numeric values.
If you're loading your data over JDBC I would simply use DataFrames:
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
val options: Map[String, String] = ???
val df: DataFrame = sqlContext.read
.format("jdbc")
.options(options)
.load()
.toDF("k1", "k2", "v1", "v2")
df.printSchema
// root
// |-- k1: string (nullable = true)
// |-- k2: string (nullable = true)
// |-- v1: integer (nullable = true)
// |-- v2: integer (nullable = true)
df.show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 1| 2|
// | A| A1| 4| 3|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
With input like the above, all you need is a basic aggregation:
df
  .groupBy($"k1", $"k2")
  .agg(sum($"v1").alias("v1"), sum($"v2").alias("v2"))
  .show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 5| 5|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
If you have an RDD like this:
val rdd: RDD[(String, String, Int, Int)] = ???
rdd.first
// (String, String, Int, Int) = (A,A1,1,2)
There is no reason to build a complex hierarchy. A simple PairRDD should be enough:
val aggregated: RDD[((String, String), breeze.linalg.Vector[Int])] = rdd
  .map { case (k1, k2, v1, v2) => ((k1, k2), breeze.linalg.Vector(v1, v2)) }
  .reduceByKey(_ + _)
aggregated.first
// ((String, String), breeze.linalg.Vector[Int]) = ((A,A2),DenseVector(3, 4))
Keeping a hierarchical structure is inefficient, but you can group the RDD above like this:
aggregated.map{case ((k1, k2), v) => (k1, (k2, v))}.groupByKey
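If you do want to see the nested view, a small sketch that collects the grouped result and prints it indented (for small outputs only; the formatting is just illustrative):
// Collect the grouped hierarchy to the driver and print it.
aggregated
  .map { case ((k1, k2), v) => (k1, (k2, v)) }
  .groupByKey()
  .collect()
  .foreach { case (k1, children) =>
    println(k1)
    children.foreach { case (k2, v) =>
      println(s"  |->$k2")
      println(s"      |->$v")
    }
  }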

Derive multiple columns from a single column in a Spark DataFrame

I have a DataFrame with a huge amount of parseable metadata as a single string column; let's call the DataFrame DFA and the column ColmnA.
I would like to break this column, ColmnA, into multiple columns through a function, ClassXYZ = Func1(ColmnA). This function returns a class ClassXYZ with multiple variables, and each of these variables now has to be mapped to a new column, such as ColmnA1, ColmnA2, etc.
How would I do such a transformation from one DataFrame to another with these additional columns by calling this Func1 just once, and not have to repeat it to create all the columns?
It's easy to solve if I were to call this huge function every time to add a new column, but that is what I wish to avoid.
Kindly advise with working or pseudo code.
Thanks
Sanjay
Generally speaking, what you want is not directly possible. A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:
Return a column of complex type. The most general solution is a StructType but you can consider ArrayType or MapType as well.
import org.apache.spark.sql.functions.udf
val df = Seq(
  (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")

case class Foobar(foo: Double, bar: Double)

val foobarUdf = udf((x: Long, y: Double, z: String) =>
  Foobar(x * y, z.head.toInt * y))
val df1 = df.withColumn("foobar", foobarUdf($"x", $"y", $"z"))
df1.show
// +---+----+---+------------+
// | x| y| z| foobar|
// +---+----+---+------------+
// | 1| 3.0| a| [3.0,291.0]|
// | 2|-1.0| b|[-2.0,-98.0]|
// | 3| 0.0| c| [0.0,0.0]|
// +---+----+---+------------+
df1.printSchema
// root
// |-- x: long (nullable = false)
// |-- y: double (nullable = false)
// |-- z: string (nullable = true)
// |-- foobar: struct (nullable = true)
// | |-- foo: double (nullable = false)
// | |-- bar: double (nullable = false)
This can be easily flattened later but usually there is no need for that.
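For instance, a quick sketch of the flattening, reusing df1 from above:
// Pull the struct fields out into top-level columns.
df1.select($"x", $"y", $"z",
  $"foobar.foo".alias("foo"),
  $"foobar.bar".alias("bar")).show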
Switch to RDD, reshape and rebuild DF:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
def foobarFunc(x: Long, y: Double, z: String): Seq[Any] =
  Seq(x * y, z.head.toInt * y)

val schema = StructType(df.schema.fields ++
  Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

val rows = df.rdd.map(r => Row.fromSeq(
  r.toSeq ++
    foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z"))))
val df2 = sqlContext.createDataFrame(rows, schema)
df2.show
// +---+----+---+----+-----+
// | x| y| z| foo| bar|
// +---+----+---+----+-----+
// | 1| 3.0| a| 3.0|291.0|
// | 2|-1.0| b|-2.0|-98.0|
// | 3| 0.0| c| 0.0| 0.0|
// +---+----+---+----+-----+
Assume that after applying your function there will be a sequence of elements. Here is an example:
val df = sc.parallelize(List(("Mike,1986,Toronto", 30), ("Andre,1980,Ottawa", 36), ("jill,1989,London", 27))).toDF("infoComb", "age")
df.show
+------------------+---+
| infoComb|age|
+------------------+---+
|Mike,1986,Toronto| 30|
| Andre,1980,Ottawa| 36|
| jill,1989,London| 27|
+------------------+---+
Now what you can do with this infoComb is split the string and get more columns with:
import org.apache.spark.sql.functions.expr

df.select(
  expr("(split(infoComb, ','))[0]").cast("string").as("name"),
  expr("(split(infoComb, ','))[1]").cast("integer").as("yearOfBorn"),
  expr("(split(infoComb, ','))[2]").cast("string").as("city"),
  $"age"
).show
+-----+----------+-------+---+
| name|yearOfBorn| city|age|
+-----+----------+-------+---+
|Mike| 1986|Toronto| 30|
|Andre| 1980| Ottawa| 36|
| jill| 1989| London| 27|
+-----+----------+-------+---+
Hope this helps.
If your resulting columns will be of the same length as the original one, you can create brand new columns with the withColumn function by applying a udf. After this you can drop your original column, e.g.:
val newDf = myDf.withColumn("newCol1", myFun(myDf("originalColumn")))
  .withColumn("newCol2", myFun2(myDf("originalColumn")))
  .drop(myDf("originalColumn"))
where myFun is a udf defined like this:
def myFun = udf(
  (originalColumnContent: String) => {
    // do something with your original column content and return a new one
  }
)
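A minimal concrete instance of that skeleton (the uppercase/length logic is purely illustrative, and myFun2 stands in for the second hypothetical udf used above):
import org.apache.spark.sql.functions.udf

// Illustrative only: derive new values from the original column.
val myFun  = udf((originalColumnContent: String) => originalColumnContent.toUpperCase)
val myFun2 = udf((originalColumnContent: String) => originalColumnContent.length)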
I opted to create a function to flatten one column and then just call it together with the udf.
First define this:
implicit class DfOperations(df: DataFrame) {

  def flattenColumn(col: String) = {
    def addColumns(df: DataFrame, cols: Array[String]): DataFrame = {
      if (cols.isEmpty) df
      else addColumns(
        df.withColumn(col + "_" + cols.head, df(col + "." + cols.head)),
        cols.tail
      )
    }

    val field = df.select(col).schema.fields(0)
    val newCols = field.dataType.asInstanceOf[StructType].fields.map(x => x.name)

    addColumns(df, newCols).drop(col)
  }

  def withColumnMany(colName: String, col: Column) = {
    df.withColumn(colName, col).flattenColumn(colName)
  }
}
Then usage is very simple:
case class MyClass(a: Int, b: Int)
val df = sc.parallelize(Seq(
(0),
(1)
)).toDF("x")
val f = udf((x: Int) => MyClass(x*2,x*3))
df.withColumnMany("test", f($"x")).show()
// +---+------+------+
// | x|test_a|test_b|
// +---+------+------+
// | 0| 0| 0|
// | 1| 2| 3|
// +---+------+------+
This can be easily achieved by using the pivot function:
df4.groupBy("year").pivot("course").sum("earnings").collect()