scala spark use expr to value inside a column - scala

I need to add a new column to a dataframe with a boolean value, evaluating a column inside the dataframe. For example, I have a dataframe of
+----+----+----+----+----+-----------+----------------+
|colA|colB|colC|colD|colE|colPRODRTCE| colCOND|
+----+----+----+----+----+-----------+----------------+
| 1| 1| 1| 1| 3| 39|colA=1 && colB>0|
| 1| 1| 1| 1| 3| 45| colD=1|
| 1| 1| 1| 1| 3| 447|colA>8 && colC=1|
+----+----+----+----+----+-----------+----------------+
In my new column I need to evaluate if the expression of colCOND is true or false.
It's easy if you have something like this:
val df = List(
(1,1,1,1,3),
(2,2,3,4,4)
).toDF("colA", "colB", "colC", "colD", "colE")
val myExpression = "colA<colC"
import org.apache.spark.sql.functions.expr
df.withColumn("colRESULT",expr(myExpression)).show()
+----+----+----+----+----+---------+
|colA|colB|colC|colD|colE|colRESULT|
+----+----+----+----+----+---------+
| 1| 1| 1| 1| 3| false|
| 2| 2| 3| 4| 4| true|
+----+----+----+----+----+---------+
But I have to evaluate a different expression in each row and it is inside the column colCOND.
I thought in create a UDF function with all columns, but my real dataframe have a lot of columns. How can I do it?
Thanks to everyone

if && change to AND, can try
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
object DataFrameLogicWithColumn extends App{
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val sourceDF = Seq((1,1,1,1,3,39,"colA=1 AND colB>0"),
(1,1,1,1,3,45,"colD=1"),
(1,1,1,1,3,447,"colA>8 AND colC=1")
).toDF("colA", "colB", "colC", "colD", "colE", "colPRODRTCE", "colCOND").persist(MEMORY_AND_DISK)
val exprs = sourceDF.select('colCOND).distinct().as[String].collect()
val d1 = exprs.map(i => {
val df = sourceDF.filter('colCOND.equalTo(i))
df.withColumn("colRESULT", expr(i))
})
val resultDF = d1.reduce(_ union _)
resultDF.show(false)
// +----+----+----+----+----+-----------+-----------------+---------+
// |colA|colB|colC|colD|colE|colPRODRTCE|colCOND |colRESULT|
// +----+----+----+----+----+-----------+-----------------+---------+
// |1 |1 |1 |1 |3 |39 |colA=1 AND colB>0|true |
// |1 |1 |1 |1 |3 |447 |colA>8 AND colC=1|false |
// |1 |1 |1 |1 |3 |45 |colD=1 |true |
// +----+----+----+----+----+-----------+-----------------+---------+
sourceDF.unpersist()
}
can try DataSet
case class c1 (colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String)
case class cRes (colA: Int, colB: Int, colC: Int, colD: Int, colE: Int, colPRODRTCE: Int, colCOND: String, colResult: Boolean)
val sourceData = Seq(c1(1,1,1,1,3,39,"colA=1 AND colB>0"),
c1(1,1,1,1,3,45,"colD=1"),
c1(1,1,1,1,3,447,"colA>8 AND colC=1")
).toDS()
def f2(a: c1): Boolean={
// we need parse value with colCOUND
a.colCOND match {
case "colA=1 AND colB>0" => (a.colA == 1 && a.colB > 0) == true
case _ => false
}
}
val res2 = sourceData
.map(i => cRes(i.colA, i.colB, i.colC, i.colD, i.colE, i.colPRODRTCE, i.colCOND,
f2(i)))

Related

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say that it contains 4 columns and 3 rows. I want to write a function to return the columns where all the values in that column are equal to 1.
This is a Scala code. I want to use some spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integral, c2: Integral, c3: Integral, c4: Integral)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return 3 row 2 columns dataframe with value all 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column
val tmp = dfInput.map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()
for (column <- tmp.columns) {
val withoutNullsCount = tmp.select(column).na.drop().count()
if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}
dfInput.select(colsToRetain.toArray:_*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case object
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
def dropWhenNotEqualsTo(n: Integer): Grade = {
Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
}
def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade with values that not satisfies the condition replaced to nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> if column contains nulls just drop it
one of the options is reduce on rdd:
import spark.implicits._
val df= Seq(("1","A","3","4"),("1","2","?","4"),("1","2","3","4")).toDF()
df.show()
val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"
def rowToArray(row: Row): Array[String] = {
val arr = new Array[String](row.length)
for (i <- 0 to row.length-1){
arr(i) = row.getString(i)
}
arr
}
def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
val arr = new Array[String](a1.length)
for (i <- 0 to a1.length-1){
arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
}
arr
}
val diff = df.rdd
.map(rowToArray)
.reduce(compareArrays)
val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s=>df(s._1))
df.select(cols:_*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+
I would try to prepare dataset for processing without nulls. In case of few columns this simple iterative approach might work fine (don't forget to import spark implicits before import spark.implicits._):
val example = spark.sparkContext.parallelize(Seq(
Grade(1,3,1,1),
Grade(1,1,0,1),
Grade(1,10,2,1)
)).toDS().cache()
def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
val row = ds.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
val schema = StructType(Seq(
StructField("c1", IntegerType, nullable = true),
StructField("c2", IntegerType, nullable = true),
StructField("c3", IntegerType, nullable = true),
StructField("c4", IntegerType, nullable = true)
))
val example = spark.sparkContext.parallelize(Seq(
Row(1,3,1,1),
Row(1,1,null,1),
Row(1,10,2,1)
))
val dfInput = spark.createDataFrame(example, schema).cache()
def allOnes(colName: String, df: DataFrame): Boolean = {
val row = df.select(colName).distinct().collect()
if (row.length == 1 && row.head.getInt(0) == 1) true
else false
}
val resultColumns= dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()

How to split Comma-separated multiple columns into multiple rows?

I have a data-frame with N fields as mentioned below. The number of columns and length of the value will vary.
Input Table:
+--------------+-----------+-----------------------+
|Date |Amount |Status |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST |
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------------------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+-------------+------+---------+
|Date |Amount|Status| Sequence|
+------+------+------+---------+
|2019 |100 |IN | 1 |
|2018 |200 |PRE | 2 |
|2017 |300 |POST | 3 |
|2018 |73 |IN | 1 |
|2018 |56 |IN | 1 |
|2017 |89 |PRE | 2 |
+-------------+------+---------+
I have Tried using explode but explode only take one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
val df = Seq(
(Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
(Seq("2018"), Seq("73"), Seq("IN")),
(Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)
df.columns.filter(_ != "idx").map{
c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using Dataframe inbuilt functions arrays_zip,split,posexplode
Explanation:
scala>val df=Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")
scala>:paste
df.selectExpr("""posexplode(
arrays_zip(
split(date,","), //split date string with ',' to create array
split(amount,","),
split(status,","))) //zip arrays
as (p,colum) //pos explode on zip arrays will give position and column value
""")
.selectExpr("colum.`0` as Date", //get 0 column as date
"colum.`1` as Amount",
"colum.`2` as Status",
"p+1 as Sequence") //add 1 to the position value
.show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq((Seq(2019,2018,2017), Seq(100,200,300), Seq("IN","PRE","POST")),(Seq(2018), Seq(73), Seq("IN")),(Seq(2018,2017), Seq(56,89), Seq("IN","PRE")))).toDF()
val transformedDF = df
.flatMap{case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
dates.indices.map(index => (dates(index), amounts(index), statuses(index), index+1))}
.toDF("Date", "Amount", "Status", "Sequence")
Output:
df.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
val buffer = new ListBuffer[(String, String, String, Int)]
val dateSplit = row(0).toString.split("\\,", -1)
val amountSplit = row(1).toString.split("\\,", -1)
val statusSplit = row(2).toString.split("\\,", -1)
val seqSize = dateSplit.size
for(i <- 0 to seqSize-1)
buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i+1)
buffer.toList
}).toDF((df.columns:+"Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this will still result with a DF with the correct answer.
Let me know if you have any questions.
I took advantage of the transpose to zip all Sequences by position and then did a posexplode. Selects on dataFrames are dynamic to satisfy the condition: The number of columns and length of the value will vary in the question.
import org.apache.spark.sql.functions._
val df = Seq(
("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love spark-core APIs. Just with the help of map and flatMap you can handle many problems. Just pass your df and the instance of SQLContext to below method and it will give the desired result -
def reShapeDf(df:DataFrame, sqlContext: SQLContext): DataFrame ={
val rdd = df.rdd.map(m => (m.getAs[String](0),m.getAs[String](1),m.getAs[String](2)))
val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
val rdd2 = rdd1.map{
case ((a,b),c) => (a,b,c)
}
sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)),df.schema)
}

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish it using scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using pyspark I have used the following code to get the above result. I could not rewrite such a code via scala. Do you have any suggestion? Thanking you in advance I wish you a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
# split string by ', ', drop duplicates and join back
words = re.split(',',row)
return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()
Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>
There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+

Writing Spark UDAFs in Scala to return Array type as output

I have a dataframe as below -
val myDF = Seq(
(1,"A",100),
(1,"E",300),
(1,"B",200),
(2,"A",200),
(2,"C",300),
(2,"D",100)
).toDF("id","channel","time")
myDF.show()
+---+-------+----+
| id|channel|time|
+---+-------+----+
| 1| A| 100|
| 1| E| 300|
| 1| B| 200|
| 2| A| 200|
| 2| C| 300|
| 2| D| 100|
+---+-------+----+
For each id, I want the channel sorted by time in ascending fashion. I want to implement an UDAF for this logic.
I would like to call this UDAF as -
scala > spark.sql("""select customerid , myUDAF(customerid,channel,time) group by customerid """).show()
Ouptut dataframe should look like -
+---+-------+
| id|channel|
+---+-------+
| 1|[A,B,E]|
| 2|[D,A,C]|
+---+-------+
I am trying to write an UDAF but unable to implement it -
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class myUDAF extends UserDefinedAggregateFunction {
// This is the input fields for your aggregate function
override def inputSchema : org.apache.spark.sql.types.Structype =
Structype(
StructField("id" , IntegerType)
StructField("channel", StringType)
StructField("time", IntegerType) :: Nil
)
// This is the internal fields we would keep for computing the aggregate
// output
override def bufferSchema : Structype =
Structype(
StructField("Sequence", ArrayType(StringType)) :: Nil
)
// This is the output type of my aggregate function
override def dataType : DataType = ArrayType(StringType)
// no comments here
override def deterministic : Booelan = true
// initialize
override def initialize(buffer: MutableAggregationBuffer) : Unit = {
buffer(0) = Seq("")
}
}
Please help.
This will do it (no need to define your own UDF):
df.groupBy("id")
.agg(sort_array(collect_list( // NOTE: sort based on the first element of the struct
struct("time", "channel"))).as("stuff"))
.select("id", "stuff.channel")
.show(false)
+---+---------+
|id |channel |
+---+---------+
|1 |[A, B, E]|
|2 |[D, A, C]|
+---+---------+
I would not write an UDAF for that. In my experience UDAF are rather slow, especially with complex types. I would use the collect_list & UDF approach:
val sortByTime = udf((rws:Seq[Row]) => rws.sortBy(_.getInt(0)).map(_.getString(1)))
myDF
.groupBy($"id")
.agg(collect_list(struct($"time",$"channel")).as("channel"))
.withColumn("channel", sortByTime($"channel"))
.show()
+---+---------+
| id| channel|
+---+---------+
| 1|[A, B, E]|
| 2|[D, A, C]|
+---+---------+
A much simpler way without UDF.
import org.apache.spark.sql.functions._
myDF.orderBy($"time".asc).groupBy($"id").agg(collect_list($"channel") as "channel").show()

spark flatten records using a key column

I am trying to implement the logic to flatten the records using spark/Scala API. I am trying to use map function.
Could you please help me with the easiest approach to solve this problem?
Assume, for a given key I need to have 3 process codes
Input dataframe-->
Keycol|processcode
John |1
Mary |8
John |2
John |4
Mary |1
Mary |7
==============================
Output dataframe-->
Keycol|processcode1|processcode2|processcode3
john |1 |2 |4
Mary |8 |1 |7
Assuming same number of rows per Keycol, one approach would be to aggregate processcode into an array for each Keycol and expand out into individual columns:
val df = Seq(
("John", 1),
("Mary", 8),
("John", 2),
("John", 4),
("Mary", 1),
("Mary", 7)
).toDF("Keycol", "processcode")
val df2 = df.groupBy("Keycol").agg(collect_list("processcode").as("processcode"))
val numCols = df2.select( size(col("processcode")) ).as[Int].first
val cols = (0 to numCols - 1).map( i => col("processcode")(i) )
df2.select(col("Keycol") +: cols: _*).show
+------+--------------+--------------+--------------+
|Keycol|processcode[0]|processcode[1]|processcode[2]|
+------+--------------+--------------+--------------+
| Mary| 8| 1| 7|
| John| 1| 2| 4|
+------+--------------+--------------+--------------+
A couple of alternative approaches.
SQL
df.createOrReplaceTempView("tbl")
val q = """
select keycol,
c[0] processcode1,
c[1] processcode2,
c[2] processcode3
from (select keycol, collect_list(processcode) c
from tbl
group by keycol) t0
"""
sql(q).show
Result
scala> sql(q).show
+------+------------+------------+------------+
|keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
| Mary| 1| 7| 8|
| John| 4| 1| 2|
+------+------------+------------+------------+
PairRDDFunctions (groupByKey) + mapPartitions
import org.apache.spark.sql.Row
val my_rdd = df.map{ case Row(a1: String, a2: Int) => (a1, a2)
}.rdd.groupByKey().map(t => (t._1, t._2.toList))
def f(iter: Iterator[(String, List[Int])]) : Iterator[Row] = {
var res = List[Row]();
while (iter.hasNext) {
val (keycol: String, c: List[Int]) = iter.next
res = res ::: List(Row(keycol, c(0), c(1), c(2)))
}
res.iterator
}
import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
val schema = new StructType().add(
StructField("Keycol", StringType, true)).add(
StructField("processcode1", IntegerType, true)).add(
StructField("processcode2", IntegerType, true)).add(
StructField("processcode3", IntegerType, true))
spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
Result
scala> spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
+------+------------+------------+------------+
|Keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
| Mary| 1| 7| 8|
| John| 4| 1| 2|
+------+------------+------------+------------+
Please keep in mind that in all cases order of values in columns for process codes is undetermined unless explicitly specified.