select array of columns and expr from dataframe spark scala - scala

Can we select a list of columns along with an expr from a dataframe?
I need to select a list of columns and an expr from the same dataframe.
Below is the list of columns:
val dynamicColumnSelection = Array("a", "b", "c", "d", "e", "f")
// These columns will change dynamically.
I also have an expr to select from the same dataframe along with the above columns:
expr("stack(3, 'g', g, 'h', h, 'i', i) as (Key,Value)")
I am able to select either the array of columns, or individual columns along with the expr:
df.select(col("a"), col("b"), col("c"), col("d"), col("e"),
expr("stack(3, 'g', g, 'h', h, 'i', i) as (Key,Value)") )
But here the columns in dynamicColumnSelection are prepared dynamically.
Can we select a list of columns along with an expr from a dataframe?
Please guide me: how can I achieve this?
The dataframe is huge, so I am not looking for a join.

What you can do is transform your Array of column names to an array of columns, add the expression to it and use :_* to "splat" the resulting array.
import org.apache.spark.sql.functions.{col, expr}
// simply creating a one-line dataframe to check that it's working
val df = Seq((1, 2, 3, 4, 5 ,6, 7, 8, 9))
.toDF("a", "b", "c", "d", "e", "f", "g", "h", "i")
val e = expr("stack(3, 'g', g, 'h', h, 'i', i) as (Key,Value)")
val dynamicColumnSelection = Array("a", "b", "c", "d", "e", "f")
val result = df.select(dynamicColumnSelection.map(col) :+ e :_*)
Which yields
| a| b| c| d| e| f|Key|Value|
| 1| 2| 3| 4| 5| 6| g| 7|
| 1| 2| 3| 4| 5| 6| h| 8|
| 1| 2| 3| 4| 5| 6| i| 9|
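The append-then-splat step can be tried without Spark. The following is a minimal sketch in plain Scala; `describe` is a hypothetical stand-in for a varargs method such as `select(cols: Column*)`, and the "columns" are just tagged strings, so none of these names come from the Spark API.

```scala
object SplatDemo {
  // Hypothetical stand-in for a varargs API such as select(cols: Column*)
  def describe(parts: String*): String = parts.mkString("select(", ", ", ")")

  val dynamicNames: Array[String] = Array("a", "b", "c")

  // Map each name to a "column" (here a tagged string), then append one extra item with :+
  val all: Array[String] = dynamicNames.map(n => s"col($n)") :+ "expr(stack)"

  // `: _*` passes the array elements as individual varargs arguments
  val rendered: String = describe(all: _*)
}
```

The same shape applies to the answer above: map the names, append the extra expression, and pass the whole array with `: _*`.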


How to apply a customized function with multiple parameters to each group of a dataframe and union the resulting dataframes in Scala Spark?

I have a customized function that returns a different dataframe as its output:
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame = {...}
I want to apply this function to each group of
df.groupBy("type")
and then append the output dataframes from each type into one dataframe.
This is a little different from other questions about applying customized functions to grouped dataframes, in that this function also takes other inputs in addition to the dataframe in question.
What's the best way to do this?
You can filter down the original df to the different groups, call customizedfun for each group and then union the results.
I assume that customizedfun is a function that simply adds the two parameters as a new column, but it could be any function:
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame =
data.withColumn("newCol", lit(s"$param2 $param1"))
I need two helper functions that calculate the values of param1 and param2 depending on the value of type. In a real-world application, these functions could, for example, be a lookup in a dictionary.
def calcParam1(typ: Integer): Boolean = typ % 2 == 0
def calcParam2(typ: Integer): String = s"type is $typ"
Now the original df is filtered into the different groups, customizedfun is called for each group, and the results are unioned:
//create some test data
val df = Seq((1, "A", "a"), (1, "B", "b"), (1, "C", "c"), (2, "D", "d"), (2, "E", "e"), (3, "F", "f"))
.toDF("type", "val1", "val2")
//| 1| A| a|
//| 1| B| b|
//| 1| C| c|
//| 2| D| d|
//| 2| E| e|
//| 3| F| f|
//get the distinct values of column type
val distinctTypes = df.select("type").distinct().as[Integer].collect()
//call customizedfun for each group
val resultPerGroup= for( typ <- distinctTypes)
yield customizedfun( df.filter(s"type = $typ"), calcParam1(typ), calcParam2(typ))
//the final union
val result = resultPerGroup.tail.foldLeft(resultPerGroup.head)(_ union _)
//|type|val1|val2| newCol|
//| 1| A| a|type is 1 false|
//| 1| B| b|type is 1 false|
//| 1| C| c|type is 1 false|
//| 3| F| f|type is 3 false|
//| 2| D| d| type is 2 true|
//| 2| E| e| type is 2 true|
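The filter, call, and union pattern above can be mirrored on local collections to check the logic before running it on a cluster. This is only a sketch under stated assumptions: `Rec`, `Out` and `process` are hypothetical stand-ins for the dataframe rows and for customizedfun, and `++` plays the role of `union`.

```scala
object PerGroupDemo {
  final case class Rec(typ: Int, v: String)
  final case class Out(typ: Int, v: String, newCol: String)

  // Hypothetical per-group function that takes extra parameters, like customizedfun
  def process(rows: Seq[Rec], p1: Boolean, p2: String): Seq[Out] =
    rows.map(r => Out(r.typ, r.v, s"$p2 $p1"))

  def calcParam1(typ: Int): Boolean = typ % 2 == 0
  def calcParam2(typ: Int): String = s"type is $typ"

  val data: Seq[Rec] = Seq(Rec(1, "A"), Rec(1, "B"), Rec(2, "D"), Rec(3, "F"))

  // distinct keys -> filter per key -> call with per-key parameters -> concatenate
  val result: Seq[Out] = data.map(_.typ).distinct
    .map(t => process(data.filter(_.typ == t), calcParam1(t), calcParam2(t)))
    .reduce(_ ++ _)
}
```

Like `resultPerGroup.head` in the answer, `reduce` throws on an empty input, so an empty dataframe would need separate handling.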

Group by and find count before doing pivot spark

I have a dataframe like the one below:
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to group by A and B, pivot on column C, and sum column D.
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However, I also need the count after the groupBy. If I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after the groupBy, before doing the pivot?
On the output, try
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns afterwards if you want to.
To do it before the pivot, try
df.groupBy("A", "B").agg(count("C"))
Is this what you are expecting?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
| A| B| C| D|
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
| A| B|sumd|large|small|
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
This won't require a join. Is this what you are looking for?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
| A| B| C| D|
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
// register the dataframe under the table name used in the query
df.createOrReplaceTempView("dummy")
spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
| A| B|large|small|total|
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
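To see what the grouping sets plus pivot compute, here is a small sketch on a local collection in plain Scala. It is not the Spark query itself; `PivotTotalDemo`, `pivoted` and `sumFor` are hypothetical names, and `None` stands in for the null cell.

```scala
object PivotTotalDemo {
  val rows: Seq[(String, String, String, Int)] = Seq(
    ("foo", "one", "small", 1),
    ("foo", "one", "large", 2),
    ("foo", "one", "large", 2),
    ("foo", "two", "small", 3))

  // For each (A, B): sum of D per value of C (large, small), plus the overall sum
  val pivoted: Map[(String, String), (Option[Int], Option[Int], Int)] =
    rows.groupBy { case (a, b, _, _) => (a, b) }.map { case (key, grp) =>
      def sumFor(c: String): Option[Int] = {
        val ds = grp.collect { case (_, _, `c`, d) => d }
        if (ds.isEmpty) None else Some(ds.sum)
      }
      key -> ((sumFor("large"), sumFor("small"), grp.map(_._4).sum))
    }
}
```

The total per (A, B) comes from the same grouping pass as the per-C sums, which is why no join is needed.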

Grouping by values on a Spark Dataframe

I'm working on a Spark dataframe containing this kind of data:
I want to aggregate this data on the last three columns, so the output would be:
How can I do it in Scala? (This is not a big dataframe, so performance is secondary here.)
As mentioned in the comments, you can first use groupBy to group your columns and then use concat_ws on your first column. Here is one way of doing it:
//create your original DF
val df = Seq(("A",1,2,3),("B",1,2,3),("C",1,2,3),("D",4,2,3)).toDF("col1","col2","col3","col4")
| A| 1| 2| 3|
| B| 1| 2| 3|
| C| 1| 2| 3|
| D| 4| 2| 3|
//group by "col2","col3","col4" and store "col1" as a list, then
//convert it to a string
//you can change the string separator via the first argument of concat_ws
df.groupBy("col2","col3","col4").agg(collect_list("col1") as "col1")
.select(concat_ws("", $"col1") as "col1",$"col2",$"col3",$"col4").show
| D| 4| 2| 3|
| ABC| 1| 2| 3|
Alternatively, you can map by your key (in this case c2, c3, c4) and then concatenate your values via reduceByKey. In the end I format each row as needed through the last map. It should be something like the following:
val data=sc.parallelize(List(
("A", "1", "2", "3"),
("B", "1", "2", "3"),
("C", "1", "2", "3"),
("D", "4", "2", "3")))
val res = data.map { case (c1, c2, c3, c4) => ((c2, c3, c4), String.valueOf(c1)) }
.reduceByKey((x, y) => x + y)
.map(v => v._2.toString + "," + v._1.productIterator.toArray.mkString(","))
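One caveat worth keeping in mind: on a distributed dataset, the order in which values reach reduceByKey (or collect_list) is not guaranteed, so "ABC" could in principle come out in another order. The sketch below, in plain Scala on a local list, sorts the grouped values before joining them; `ConcatByKeyDemo` and `merged` are hypothetical names, and sorting is an assumption about the desired output, not something the question asks for.

```scala
object ConcatByKeyDemo {
  // Input deliberately not in alphabetical order
  val data: List[(String, String, String, String)] = List(
    ("B", "1", "2", "3"),
    ("A", "1", "2", "3"),
    ("C", "1", "2", "3"),
    ("D", "4", "2", "3"))

  // Group on the last three fields, then sort the first-field values so the
  // concatenated string does not depend on the input order
  val merged: Map[(String, String, String), String] =
    data.groupBy { case (_, c2, c3, c4) => (c2, c3, c4) }
      .map { case (key, rows) => key -> rows.map(_._1).sorted.mkString }
}
```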

Fill Nan with mean of the row in Scala-Spark

I have an RDD with 6 columns, where the last 5 columns might contain NaNs. My intention is to replace each NaN with the average of the non-NaN values among the last 5 values of the same row. For instance, having this input:
1, 2, 3, 4, 5, 6
2, 2, 2, NaN, 4, 0
3, NaN, NaN, NaN, 6, 0
4, NaN, NaN, 4, 4, 0
The output should be:
1, 2, 3, 4, 5, 6
2, 2, 2, 2, 4, 0
3, 3, 3, 3, 6, 0
4, 3, 3, 4, 4, 0
I know how to fill those NaNs with the column average after transforming the RDD to a DataFrame:
var aux1 = df.select(df.columns.map(c => mean(col(c))) :_*)
var aux2 = df.na.fill(/*get values of aux1*/)
My question is, how can you do this operation but instead of filling the NaN with the column average, fill it with an average of the values of a subgroup of the row?
You can do this by defining a function to get the mean, and another function to fill nulls in a row.
Given the DF you presented:
val df = sc.parallelize(List((Some(1),Some(2),Some(3),Some(4),Some(5),Some(6)),(Some(2),Some(2),Some(2),None,Some(4),Some(0)),(Some(3),None,None,None,Some(6),Some(0)),(Some(4),None,None,Some(4),Some(4),Some(0)))).toDF("a","b","c","d","e","f")
We need a function to get the mean of a Row:
import org.apache.spark.sql.Row
def rowMean(row: Row): Int = {
  val nonNulls = (0 until row.length).map(i => (!row.isNullAt(i), row.getAs[Int](i))).filter(_._1).map(_._2).toList
  nonNulls.sum / nonNulls.length
}
And another to fill nulls in a Row:
def rowFillNulls(row: Row, fill: Int): Row = {
  Row((0 until row.length).map(i => if (row.isNullAt(i)) fill else row.getAs[Int](i)) : _*)
}
Now we can first compute each row mean:
val rowWithMean = df.rdd.map(row => (row,rowMean(row)))
And then fill it:
val result = sqlContext.createDataFrame(rowWithMean.map{case (row,mean) => rowFillNulls(row,mean)}, df.schema)
Finally view before and after...
| a| b| c| d| e| f|
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2|null| 4| 0|
| 3|null|null|null| 6| 0|
| 4|null|null| 4| 4| 0|
| a| b| c| d| e| f|
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2| 2| 4| 0|
| 3| 3| 3| 3| 6| 0|
| 4| 3| 3| 4| 4| 0|
This will work for any width DF with Int columns. You can easily update this to other datatypes, even non-numeric (hint, inspect the df schema!)
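Note that rowMean above averages over every column of the row, including the first one, and uses integer division. A variant that follows the question more literally (mean of the last five values only, rounded to the nearest integer) can be sketched on local data in plain Scala. `RowFillDemo` and `fillRow` are hypothetical names, and the fallback to 0 when every value is missing is an assumption.

```scala
object RowFillDemo {
  // id is the first column; values are the last five columns, where None marks a missing value
  def fillRow(id: Int, values: Seq[Option[Int]]): Seq[Int] = {
    val present = values.flatten
    // Assumption: fall back to 0 when every value in the row is missing
    val mean =
      if (present.isEmpty) 0
      else math.round(present.sum.toDouble / present.length).toInt
    id +: values.map(_.getOrElse(mean))
  }

  val input: Seq[(Int, Seq[Option[Int]])] = Seq(
    (2, Seq(Some(2), Some(2), None, Some(4), Some(0))),
    (3, Seq(None, None, None, Some(6), Some(0))),
    (4, Seq(None, None, Some(4), Some(4), Some(0))))

  val filled: Seq[Seq[Int]] = input.map { case (id, values) => fillRow(id, values) }
}
```

On the sample rows from the question this gives the expected output; for the last row the mean of 4, 4, 0 is about 2.67, which rounds to 3.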
A bunch of imports:
import org.apache.spark.sql.functions.{col, isnan, isnull, lit, round, when}
import org.apache.spark.sql.Column
A few helper functions:
def nullOrNan(c: Column) = isnan(c) || isnull(c)
def rowMean(cols: Column*): Column = {
  // sum of the non-missing values
  val sum = cols
    .map(c => when(nullOrNan(c), lit(0.0)).otherwise(c))
    .fold(lit(0.0))(_ + _)
  // number of non-missing values
  val count = cols
    .map(c => when(nullOrNan(c), lit(0.0)).otherwise(lit(1.0)))
    .fold(lit(0.0))(_ + _)
  sum / count
}
A solution:
val mean = round(
  rowMean(df.columns.tail.map(col): _*)
)
val exprs = df.columns.tail.map(
  c => when(nullOrNan(col(c)), mean).otherwise(col(c)).alias(c)
)
val filled = df.select(col(df.columns.head) +: exprs: _*)
Well, this is a fun little problem - I will post my solution, but I will definitely watch and see if someone comes up with a better way of doing it :)
First I would introduce a couple of udfs:
val avg = udf((values: Seq[Integer]) => {
  val notNullValues = values.filter(_ != null).map(_.toInt)
  notNullValues.sum / notNullValues.length
})
val replaceNullWithAvg = udf((x: Integer, avg: Integer) => if(x == null) avg else x)
which I would then apply to the DataFrame like this:
df
.withColumn("avg", avg(array(df.columns.tail.map(s => df.col(s)):_*)))
.select('col1, replaceNullWithAvg('col2, 'avg) as "col2", replaceNullWithAvg('col3, 'avg) as "col3", replaceNullWithAvg('col4, 'avg) as "col4", replaceNullWithAvg('col5, 'avg) as "col5", replaceNullWithAvg('col6, 'avg) as "col6")
This will get you what you are looking for, but arguably not the most sophisticated code I have ever put together...

spark aggregateByKey with tuple

I am new to the RDD API of Spark. Thanks to "Spark migrate sql window function to RDD for better performance", I managed to generate the following table:
| _1| _2|
| [col3TooMany,C]| 0|
| [col1,A]| 0|
| [col2,B]| 0|
| [col3TooMany,C]| 1|
| [col1,A]| 1|
| [col2,B]| 1|
|[col3TooMany,jkl]| 0|
| [col1,d]| 0|
| [col2,a]| 0|
| [col3TooMany,C]| 0|
| [col1,d]| 0|
| [col2,g]| 0|
| [col3TooMany,t]| 1|
| [col1,A]| 1|
| [col2,d]| 1|
| [col3TooMany,C]| 1|
| [col1,d]| 1|
| [col2,c]| 1|
| [col3TooMany,C]| 1|
| [col1,c]| 1|
with an initial input of
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
import org.apache.spark.sql.functions._
val exploded = explode(array(
  (columnsToDrop ++ columnsToCode).map(c =>
    struct(lit(c).alias("k"), col(c).alias("v"))): _*
))
val long = df.select(exploded, $"TARGET")
import org.apache.spark.util.StatCounter
Then
long.as[((String, String), Int)].rdd.aggregateByKey(StatCounter())(_ merge _, _ merge _).collect.head
res71: ((String, String), org.apache.spark.util.StatCounter) = ((col2,B),(count: 3, mean: 0,666667, stdev: 0,471405, max: 1,000000, min: 0,000000))
aggregates statistics of all the unique values for each column.
How can I add to the count (which is 3 for B in col2) a second count (maybe as a tuple) that represents the number of B in col2 where TARGET == 1? In this case, it should be 2.
You shouldn't need an additional aggregate here. With a binary target column, the mean is just the empirical probability of the target being equal to 1:
number of 1s = count * mean
number of 0s = count * (1 - mean)
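As a quick check of this arithmetic, here is a sketch in plain Scala for the key (col2, B), whose TARGET values in the question's input are 0, 1 and 1. `TargetCountsDemo` and its fields are hypothetical names; the rounding is there because the mean is a floating-point value.

```scala
object TargetCountsDemo {
  // TARGET values observed for the key (col2, B) in the question's input
  val targets: Seq[Int] = Seq(0, 1, 1)

  val count: Long = targets.length.toLong
  val mean: Double = targets.sum.toDouble / count

  // Round to the nearest integer, since count * mean is computed in floating point
  val ones: Long = math.round(count * mean)
  val zeros: Long = math.round(count * (1 - mean))
}
```

With a StatCounter, the same two values would come from its count and mean, so no second aggregation pass is needed.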