Spark - pass full row to a udf and then get column name inside udf - scala

I am using Spark with Scala and want to pass the entire row to a udf and then select each column name and column value inside the udf. How can I do this?
I am trying the following -
inputDataDF.withColumn("errorField", mapCategory(ruleForNullValidation)(col(_*)))

def mapCategory(categories: Map[String, Boolean]) = {
  udf((input: Row) => ???) // write a recursive function: check if each column is in categories; if yes, check for null; if null then false; repeat for all columns and then combine the results
}

In Spark 1.6 you can use Row as the external type and struct as the expression. Column names can be fetched from the schema. For example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
val f = udf((row: Row) => row.schema.fieldNames)
df.select(f(struct(df.columns map col: _*))).show
// +-----------------------------------------------------------------------------+
// |UDF(named_struct(NamePlaceholder, a, NamePlaceholder, b, NamePlaceholder, c))|
// +-----------------------------------------------------------------------------+
// | [a, b, c]|
// +-----------------------------------------------------------------------------+
Values can be accessed by name using Row.getAs method.
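For example, to read individual fields by name inside the udf, a small sketch building on the same df (the choice of fields here is just for illustration):
val g = udf((row: Row) => row.getAs[Int]("a") + row.getAs[Int]("c"))

df.select(g(struct(df.columns map col: _*)).as("a_plus_c")).show
// +--------+
// |a_plus_c|
// +--------+
// |       4|
// +--------+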

Here is a simple working example:
Input Data:
+-----+---+--------+
| NAME|AGE|CATEGORY|
+-----+---+--------+
| RIO| 35| FIN|
| TOM| 90| ACC|
|KEVIN| 32| |
| STEF| 22| OPS|
+-----+---+--------+
// Define the category list and the UDF
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

val categoryList = List("FIN", "ACC")
def mapCategoryUDF(ls: List[String]) = udf[Boolean, Row]((x: Row) => ls.contains(x.getAs[String]("CATEGORY")))

df.withColumn("errorField", mapCategoryUDF(categoryList)(struct("*"))).show()
Result should look like this:
+-----+---+--------+----------+
| NAME|AGE|CATEGORY|errorField|
+-----+---+--------+----------+
| RIO| 35| FIN| true|
| TOM| 90| ACC| true|
|KEVIN| 32| | false|
| STEF| 22| OPS| false|
+-----+---+--------+----------+
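If you need the full rule-map version from the original question (checking every column against a null rule rather than just CATEGORY), a minimal sketch could look like the following; the contents of ruleForNullValidation and the null/empty-string semantics are assumptions on my part:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// true = the column must not be null/empty (assumed rule encoding)
val ruleForNullValidation = Map("NAME" -> true, "AGE" -> false, "CATEGORY" -> true)

def mapCategory(rules: Map[String, Boolean]) = udf((row: Row) =>
  rules.forall { case (name, mustBePresent) =>
    !mustBePresent || Option(row.getAs[Any](name)).exists(_.toString.trim.nonEmpty)
  })

inputDataDF
  .withColumn("errorField", mapCategory(ruleForNullValidation)(struct(inputDataDF.columns.map(col): _*)))
  .show()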
Hope this helps!!

Related

Transform dataset with empty data for dates

I have a dataset with date, accountid and value. I want to transform it into a new dataset where, if an accountid is not present on a particular date, a row with a value of 0 is added for that accountid against that date. Is this possible?
val df = sc.parallelize(Seq(("2018-01-01", 100.5,"id1"),
("2018-01-02", 120.6,"id1"),
("2018-01-03", 450.2,"id2")
)).toDF("date", "val","accountid")
+----------+-----+---------+
| date| val|accountid|
+----------+-----+---------+
|2018-01-01|100.5| id1|
|2018-01-02|120.6| id1|
|2018-01-03|450.2| id2|
+----------+-----+---------+
I want to transform this dataset into this format
+----------+-----+---------+
| date| val|accountid|
+----------+-----+---------+
|2018-01-01|100.5| id1|
|2018-01-01| 0.0| id2|
|2018-01-02|120.6| id1|
|2018-01-02| 0.0| id2|
|2018-01-03|450.2| id2|
|2018-01-03|0.0 | id1|
+----------+-----+---------+
You can simply use a udf function to fulfil this requirement.
But before that you will have to collect the complete set of accountids and broadcast it so that it can be used inside the udf function.
The array returned by the udf function is then exploded, and finally the required columns are selected.
import org.apache.spark.sql.functions._
val idList = df.select(collect_set("accountid")).first().getAs[Seq[String]](0)
val broadCastedIdList = sc.broadcast(idList)
def populateUdf = udf((date: String, value: Double, accountid: String) =>
  Array(accounts(date, value, accountid)) ++
    broadCastedIdList.value.filterNot(_ == accountid).map(accounts(date, 0.0, _)))
df.select(populateUdf(col("date"), col("val"), col("accountid")).as("struct"))
.withColumn("struct", explode(col("struct")))
.select(col("struct.date"), col("struct.value").as("val"), col("struct.accountid"))
.show(false)
And of course you would need a case class
case class accounts(date:String, value:Double, accountid:String)
which should give you
+----------+-----+---------+
|date |val |accountid|
+----------+-----+---------+
|2018-01-01|100.5|id1 |
|2018-01-01|0.0 |id2 |
|2018-01-02|120.6|id1 |
|2018-01-02|0.0 |id2 |
|2018-01-03|450.2|id2 |
|2018-01-03|0.0 |id1 |
+----------+-----+---------+
Note: the field is named value rather than val in the case class because val is a reserved word in Scala and cannot be used as a plain identifier.
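As a side note, Scala also allows a reserved word as a field name if you backtick-quote it, in which case the select above would use col("struct.val") instead of col("struct.value"):
// alternative: keep the original column name by quoting the reserved word
case class accounts(date: String, `val`: Double, accountid: String)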
You can create a reference date range and cross join it with the distinct account ids:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val Row(minTs: Long, maxTs: Long) = df
.select(to_date($"date").cast("timestamp").cast("bigint") as "date")
.select(min($"date"), max($"date")).first
val by = 60 * 60 * 24
val ref = spark
.range(minTs, maxTs + by, by)
.select($"id".cast("timestamp").cast("date").cast("string").as("date"))
.crossJoin(df.select("accountid").distinct)
and outer join it with the input data:
ref.join(df, Seq("date", "accountid"), "leftouter").na.fill(0.0).show
// +----------+---------+-----+
// | date|accountid| val|
// +----------+---------+-----+
// |2018-01-03| id1| 0.0|
// |2018-01-01| id1|100.5|
// |2018-01-02| id2| 0.0|
// |2018-01-02| id1|120.6|
// |2018-01-03| id2|450.2|
// |2018-01-01| id2| 0.0|
// +----------+---------+-----+
Concept adopted from this sparklyr answer by user6910411.
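If the output should match the row order shown in the question, an explicit sort can be appended:
ref.join(df, Seq("date", "accountid"), "leftouter")
  .na.fill(0.0)
  .orderBy("date", "accountid")
  .show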

Writing Spark UDAFs in Scala to return Array type as output

I have a dataframe as below -
val myDF = Seq(
(1,"A",100),
(1,"E",300),
(1,"B",200),
(2,"A",200),
(2,"C",300),
(2,"D",100)
).toDF("id","channel","time")
myDF.show()
+---+-------+----+
| id|channel|time|
+---+-------+----+
| 1| A| 100|
| 1| E| 300|
| 1| B| 200|
| 2| A| 200|
| 2| C| 300|
| 2| D| 100|
+---+-------+----+
For each id, I want the channels sorted by time in ascending order. I want to implement a UDAF for this logic.
I would like to call this UDAF as -
scala > spark.sql("""select customerid , myUDAF(customerid,channel,time) group by customerid """).show()
The output dataframe should look like -
+---+-------+
| id|channel|
+---+-------+
| 1|[A,B,E]|
| 2|[D,A,C]|
+---+-------+
I am trying to write a UDAF but am unable to implement it -
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class myUDAF extends UserDefinedAggregateFunction {
  // These are the input fields for your aggregate function
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(
      StructField("id", IntegerType) ::
      StructField("channel", StringType) ::
      StructField("time", IntegerType) :: Nil
    )
  // These are the internal fields we would keep for computing the aggregate output
  override def bufferSchema: StructType =
    StructType(
      StructField("Sequence", ArrayType(StringType)) :: Nil
    )
  // This is the output type of my aggregate function
  override def dataType: DataType = ArrayType(StringType)
  // no comments here
  override def deterministic: Boolean = true
  // initialize
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq("")
  }
}
Please help.
This will do it (no need to define your own UDF):
df.groupBy("id")
.agg(sort_array(collect_list( // NOTE: sort based on the first element of the struct
struct("time", "channel"))).as("stuff"))
.select("id", "stuff.channel")
.show(false)
+---+---------+
|id |channel |
+---+---------+
|1 |[A, B, E]|
|2 |[D, A, C]|
+---+---------+
I would not write a UDAF for that. In my experience UDAFs are rather slow, especially with complex types. I would use the collect_list & UDF approach:
val sortByTime = udf((rws:Seq[Row]) => rws.sortBy(_.getInt(0)).map(_.getString(1)))
myDF
.groupBy($"id")
.agg(collect_list(struct($"time",$"channel")).as("channel"))
.withColumn("channel", sortByTime($"channel"))
.show()
+---+---------+
| id| channel|
+---+---------+
| 1|[A, B, E]|
| 2|[D, A, C]|
+---+---------+
A much simpler way, without a UDF:
import org.apache.spark.sql.functions._
myDF.orderBy($"time".asc).groupBy($"id").agg(collect_list($"channel") as "channel").show()
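For completeness, if you still want the UDAF from the question, here is a minimal sketch using the pre-Spark-3 UserDefinedAggregateFunction API; the class name, buffer layout and registration name below are my own choices, not from the original post:
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// buffer keeps two parallel arrays (times and channels); evaluate sorts channels by time
class ChannelsByTime extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(
    StructField("channel", StringType) ::
    StructField("time", IntegerType) :: Nil)

  override def bufferSchema: StructType = StructType(
    StructField("times", ArrayType(IntegerType)) ::
    StructField("channels", ArrayType(StringType)) :: Nil)

  override def dataType: DataType = ArrayType(StringType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq.empty[Int]
    buffer(1) = Seq.empty[String]
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getSeq[Int](0) :+ input.getInt(1)
    buffer(1) = buffer.getSeq[String](1) :+ input.getString(0)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getSeq[Int](0) ++ buffer2.getSeq[Int](0)
    buffer1(1) = buffer1.getSeq[String](1) ++ buffer2.getSeq[String](1)
  }

  override def evaluate(buffer: Row): Any =
    buffer.getSeq[Int](0).zip(buffer.getSeq[String](1)).sortBy(_._1).map(_._2)
}

// usage:
spark.udf.register("myUDAF", new ChannelsByTime)
myDF.createOrReplaceTempView("my_table")
spark.sql("select id, myUDAF(channel, time) as channel from my_table group by id").show()
Expect it to be slower than the collect_list approaches above, since the buffer grows row by row.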

How to compose column name using another column's value for withColumn in Scala Spark

I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For instance, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to obtain this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C whose value came from either column A_1 or B_2. The name of the source column A_1 comes from concatenating the value of columns A and B.
I know that I can add a new column based on another and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
Which doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
You can achieve your requirement by writing a udf function. I am suggesting a udf because your requirement is to process the dataframe row by row, in contrast to the built-in functions, which work column by column.
But before that you will need the array of column names:
val columns = df.columns
Then write a udf function as:
import scala.collection.mutable
import org.apache.spark.sql.functions._

def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A + "_" + B)))
where
A is the value of the first column
B is the value of the second column
array is the array of all the column values
Now just call the udf function using the withColumn API:
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
You can select from a map. Define a map which translates names to column values:
import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
val dataMap = map(
df.columns.diff(Seq("A", "B")).flatMap(c => lit(c) :: col(c) :: Nil): _*
)
df.select(dataMap).show(false)
+---------------------------+
|map(A_1, A_1, B_2, B_2) |
+---------------------------+
|Map(A_1 -> 0.1, B_2 -> 0.3)|
|Map(A_1 -> 0.2, B_2 -> 0.4)|
+---------------------------+
and select from it with apply:
df.withColumn("C", dataMap(concat_ws("_", $"A", $"B"))).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
You can also try mapping, but I suspect it won't perform well with very wide data:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val outputEncoder = RowEncoder(df.schema.add(StructField("C", DoubleType)))
df.map(row => {
val a = row.getAs[String]("A")
val b = row.getAs[String]("B")
val key = s"${a}_${b}"
Row.fromSeq(row.toSeq :+ row.getAs[Double](key))
})(outputEncoder).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
and in general I wouldn't recommend this approach.
If the data comes from CSV you might consider skipping the default CSV reader and using custom logic to push the column selection directly into the parsing process. In pseudocode:
spark.read.text(...).map { line => {
val a = ??? // parse A
val b = ??? // parse B
val c = ??? // find c, based on a and b
(a, b, c)
}}
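A slightly more concrete sketch of that idea; everything here is assumed (a headerless comma-separated file, columns in the fixed order A,B,A_1,B_2, and a made-up path):
import spark.implicits._

// assumed layout: A,B,A_1,B_2 per line, comma-separated, no header row
val header = Seq("A", "B", "A_1", "B_2")

spark.read.textFile("/path/to/data.csv").map { line =>
  val fields = line.split(",")
  val a = fields(header.indexOf("A"))
  val b = fields(header.indexOf("B"))
  // resolve column "<a>_<b>" while parsing and take its value as C
  val c = fields(header.indexOf(s"${a}_${b}")).toDouble
  (a, b, c)
}.toDF("A", "B", "C")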

I need to compare two dataframes for type validation and send a nonzero value as output

I am comparing two dataframes (basically these are the schemas of two different data sources, one from Hive and the other from SAS 9.2).
I need to validate the structure of both data sources, so I converted the schemas into two dataframes; here they are:
The SAS schema is in the following format:
scala> metadata.show
+----+----------------+----+---+-----------+-----------+
|S_No| Variable|Type|Len| Format| Informat|
+----+----------------+----+---+-----------+-----------+
| 1| DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 2| LOAD_DATETIME| Num| 8|DATETIME20.|DATETIME20.|
| 3| SOURCE_BANK|Char| 1| null| null|
| 4| EMP_NAME|Char| 50| null| null|
| 5|HEADER_ROW_COUNT| Num| 8| null| null|
| 6| EMP_HOURS| Num| 8| 15.2| 15.1|
+----+----------------+----+---+-----------+-----------+
Similarly, the Hive metadata is in the following format:
df2.show
+----------------+-------------+
| Variable| type|
+----------------+-------------+
| datetime|TimestampType|
| load_datetime|TimestampType|
| source_bank| StringType|
| emp_name| StringType|
|header_row_count| IntegerType|
| emp_hours| DoubleType|
+----------------+-------------+
Now, I need to compare both of these on column type and validate the structure. For example, the equivalent of the "Num" type is "IntegerType".
Finally, I need to store a non-zero value as output if the schema validation is successful.
How can I achieve this?
You can join the two dataframes and then compare the two columns corresponding to the column types via a Map and a UDF.
This is a code sample that does that.
You need to complete the map with the right values.
import org.apache.spark.sql.{DataFrame, functions}
import org.apache.spark.sql.functions.col

val sqlCtx = sqlContext
import sqlCtx.implicits._

// SAS metadata (types like "Num", "Char")
val metadata: DataFrame = Seq(
  (Some("1"), "DATETIME", "Num", "8", "DATETIME20", "DATETIME20"),
  (Some("3"), "SOURCEBANK", "Num", "1", "null", "null")
).toDF("SNo", "Variable", "Type", "Len", "Format", "Informat")

val metadataAdapted: DataFrame = metadata
  .withColumn("Name", functions.upper(col("Variable")))
  .withColumnRenamed("Type", "TypeSaS")

// Hive metadata (types like "TimestampType", "StringType")
val hiveDF = Seq(
  ("datetime", "TimestampType"),
  ("load_datetime", "TimestampType")
).toDF("variable", "type")

val hiveDFAdapted = hiveDF
  .withColumn("Name", functions.upper(col("variable")))
  .withColumnRenamed("type", "TypeHive")

val res = hiveDFAdapted.join(metadataAdapted, Seq("Name"), "inner")

// map from the Hive type to its expected SAS equivalent; complete it with the right values
val map = Map("TimestampType" -> "Num")

def udfType(dict: Map[String, String]) = functions.udf((typeVar: String) => dict(typeVar))

val result = res.withColumn("correctMapping", udfType(map)(col("TypeHive")) === col("TypeSaS"))
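To get the final non-zero output mentioned in the question, a small follow-up on result could be, for example:
// 1 if every joined column pair maps correctly, 0 otherwise
val mismatches = result.filter(col("correctMapping") === false).count()
val validationFlag = if (mismatches == 0) 1 else 0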

how to concat multiple columns in spark while getting the column names to be concatenated from another table (different for each row)

I am trying to concatenate multiple columns in Spark using the concat function.
For example, below is the table to which I have to add a new concatenated column:
table t
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
and below is the table which has the information about which columns are to be concatenated for a given id (for id 1, columns id and name need to be concatenated; for id 2, only id):
table r
+---+-------+
| id| att |
+---+-------+
| 1|id,name|
| 2| id |
+---+-------+
If I join the two tables and do something like below, I am able to concat, but not based on table r (the new column has 1,a for the first row, but for the second row it should be 2 only):
t.withColumn("new",concat_ws(",",t.select("att").first.mkString.split(",").map(c => col(c)): _*)).show
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2,b|
+---+----+-------+---+
I have to apply a filter before the select in the above query, but I am not sure how to do that in withColumn for each row.
Something like below, if that is possible:
t.withColumn("new",concat_ws(",",t.filter("id="+this.id).select("att").first.mkString.split(",").map(c => col(c)): _*)).show
since it will require filtering each row based on the id.
scala> t.filter("id=1").select("att").first.mkString.split(",").map(c => col(c))
res90: Array[org.apache.spark.sql.Column] = Array(id, name)
scala> t.filter("id=2").select("att").first.mkString.split(",").map(c => col(c))
res89: Array[org.apache.spark.sql.Column] = Array(id)
Below is the final required result.
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2 |
+---+----+-------+---+
We can use a UDF.
Requirement for this logic to work: the column names of your table t should be in the same order as they appear in the att column of table r.
scala> input_df_1.show
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
scala> input_df_2.show
+---+-------+
| id| att|
+---+-------+
| 1|id,name|
| 2| id|
+---+-------+
scala> val join_df = input_df_1.join(input_df_2,Seq("id"),"inner")
join_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val req_cols = input_df_1.columns
req_cols: Array[String] = Array(id, name)
scala> def new_col_udf = udf((cols : Seq[String],row : String,attr : String) => {
| val row_values = row.split(",")
| val attrs = attr.split(",")
| val req_val = attrs.map{at =>
| val index = cols.indexOf(at)
| row_values(index)
| }
| req_val.mkString(",")
| })
new_col_udf: org.apache.spark.sql.expressions.UserDefinedFunction
scala> val intermediate_df = join_df.withColumn("concat_column",concat_ws(",",'id,'name)).withColumn("new_col",new_col_udf(lit(req_cols),'concat_column,'att))
intermediate_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]
scala> val result_df = intermediate_df.select('id,'name,'att,'new_col)
result_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> result_df.show
+---+----+-------+-------+
| id|name| att|new_col|
+---+----+-------+-------+
| 1| a|id,name| 1,a|
| 2| b| id| 2|
+---+----+-------+-------+
Hope it answers your question.
This may be done in a UDF:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, udf}

val colNames: Seq[String] = dataFrame.columns.toSeq
val cols: Seq[Column] = colNames.map(col)

val generateNew = udf((values: Seq[String]) => {
  // the columns to concatenate, read from this row's "att" value
  val att = values(colNames.indexOf("att")).split(",")
  // keep only the values whose column name appears in att, in column order
  colNames.zip(values)
    .filter { case (name, _) => att.contains(name) }
    .map { case (_, value) => value }
    .mkString(",")
})

// cast everything to string so the array has a single element type
val dfColumns = array(cols.map(_.cast("string")): _*)
val dNew = dataFrame.withColumn("new", generateNew(dfColumns))
This is just a sketch, but the idea is that you can pass a sequence of items to the user defined function, and select the ones that are needed dynamically.
Note that there are additional types of collections/maps that you can pass - see, for example, How to pass array to UDF.
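For instance, a Scala Map can be passed as a literal column with typedLit (available since Spark 2.2) and consumed inside a udf; the lookup contents and column names below are just for illustration:
import org.apache.spark.sql.functions.{col, split, typedLit, udf}

// hypothetical lookup from attribute name to some position
val lookup = typedLit(Map("id" -> 0, "name" -> 1))

val pick = udf((m: Map[String, Int], key: String) => m.getOrElse(key, -1))

// e.g. resolve the position of the first attribute listed in "att"
dataFrame.withColumn("firstAttPos", pick(lookup, split(col("att"), ",")(0)))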