How to write a function that takes a list of column names of a DataFrame, reorders the selected columns to the left, and preserves the unselected columns - scala

I'd like to build a function
def reorderColumns(columnNames: List[String]) = ...
that can be applied to a Spark DataFrame such that the columns specified in columnNames get reordered to the left, and the remaining columns (in any order) stay to the right.
Example:
Given a df with the following 5 columns
| A | B | C | D | E
df.reorderColumns(List("D", "B", "A")) returns a df with columns ordered like so:
| D | B | A | C | E

Try this one:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def reorderColumns(df: DataFrame, columns: Array[String]): DataFrame = {
  // requested columns first, then every remaining column in its original order
  val restColumns: Array[String] = df.columns.filterNot(c => columns.contains(c))
  df.select((columns ++ restColumns).map(col): _*)
}
Usage example:
val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val df = List((1, 3, 1, 6), (2, 4, 2, 5), (3, 6, 3, 4)).toDF("colA", "colB", "colC", "colD")
reorderColumns(df, Array("colC", "colB")).show
// output:
//+----+----+----+----+
//|colC|colB|colA|colD|
//+----+----+----+----+
//|   1|   3|   1|   6|
//|   2|   4|   2|   5|
//|   3|   6|   3|   4|
//+----+----+----+----+
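If you want the df.reorderColumns(...) call style from the question, one option is to wrap the function in an implicit class. A minimal sketch, assuming Spark 2.x (the DataFrameOps object name is just an illustration):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DataFrameOps {
  // implicit class so reorderColumns can be called as a method on any DataFrame
  implicit class RichDataFrame(df: DataFrame) {
    def reorderColumns(columnNames: List[String]): DataFrame = {
      val rest = df.columns.filterNot(columnNames.contains)
      df.select((columnNames ++ rest).map(col): _*)
    }
  }
}
// Usage (hypothetical):
// import DataFrameOps._
// df.reorderColumns(List("D", "B", "A"))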

Related

Group by and find count before doing pivot spark

I have a dataframe like the one below:
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy on A and B, pivot on column C, and sum column D.
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However, I also need a count after the groupBy. If I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after the groupBy, before doing the pivot?
On the output, try
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns afterwards if you want to (a combined sketch follows).
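A rough sketch of that route, with sum/count aliases of my own choosing (an assumption) so the pivoted column names come out predictably as large_count, small_count, etc.; coalesce guards against the nulls the pivot produces:
import org.apache.spark.sql.functions.{sum, count, coalesce, lit}
val output = df.groupBy("A", "B").pivot("C")
  .agg(sum("D").as("sum"), count("D").as("count"))
// one overall count per (A, B) group, then drop the per-value count columns
output
  .withColumn("count", coalesce($"large_count", lit(0L)) + coalesce($"small_count", lit(0L)))
  .drop("large_count", "small_count")
  .show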
To do it before the pivot, try
df.groupBy("A", "B").agg(count("C"))
Is this what you are expecting?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2, 'A === 'A1 and 'B === 'B1, "inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
This won't require a join. Is this what you are looking for?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.createOrReplaceTempView("dummy")
spark.sql("""SELECT * FROM (
    SELECT A, B, C, sum(D) AS D FROM dummy
    GROUP BY A, B, C GROUPING SETS ((A,B,C), (A,B))
    ORDER BY A NULLS LAST, B NULLS LAST, C NULLS LAST) dummy
  PIVOT (first(D) FOR C IN ('large' large, 'small' small, null total))""").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output, where each item in each column is converted into a unique key,
for example abc -> 1, def -> 2, xyz -> 3:
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code will be the same whenever the input value and datatype are the same. Try this:
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (since zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames to pick up the index keys:
val rdd = sc.parallelize(Seq(
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
  join(df1, Seq("col1")).
  join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc|    2|    1|
// |783b| def|    3|    1|
// |123a| xyz|    1|    2|
// |123a| abc|    2|    2|
// |674b| abc|    2|    3|
// +----+----+-----+-----+
dfJoined.
  select($"c1key".as("col1"), $"c2key".as("col2")).
  show
// +----+----+
// |col1|col2|
// +----+----+
// |   2|   1|
// |   3|   1|
// |   1|   2|
// |   2|   2|
// |   2|   3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.
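For reference, a zero-based sketch under that simplification:
// zero-based keys: just drop the "+ 1" mapping step
val df1ZeroBased = rdd.map(_._1).distinct.zipWithIndex.toDF("col1", "c1key")
val df2ZeroBased = rdd.map(_._2).distinct.zipWithIndex.toDF("col2", "c2key")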

Fill NaN with the mean of the row in Scala-Spark

I have an RDD with 6 columns, where the last 5 columns might contain NaNs. My intention is to replace each NaN with the average of those values among the last 5 values of the same row which are not NaN. For instance, given this input:
1, 2, 3, 4, 5, 6
2, 2, 2, NaN, 4, 0
3, NaN, NaN, NaN, 6, 0
4, NaN, NaN, 4, 4, 0
The output should be:
1, 2, 3, 4, 5, 6
2, 2, 2, 2, 4, 0
3, 3, 3, 3, 6, 0
4, 3, 3, 4, 4, 0
I know how to fill those NaNs with the average value of the column by transforming the RDD into a DataFrame:
var aux1 = df.select(df.columns.map(c => mean(col(c))) :_*)
var aux2 = df.na.fill(/*get values of aux1*/)
My question is: how can you do this operation, but instead of filling the NaN with the column average, fill it with the average of a subgroup of values from the same row?
You can do this by defining a function to get the mean, and another function to fill nulls in a row.
Given the DF you presented:
val df = sc.parallelize(List(
  (Some(1), Some(2), Some(3), Some(4), Some(5), Some(6)),
  (Some(2), Some(2), Some(2), None, Some(4), Some(0)),
  (Some(3), None, None, None, Some(6), Some(0)),
  (Some(4), None, None, Some(4), Some(4), Some(0))
)).toDF("a", "b", "c", "d", "e", "f")
We need a function to get the mean of a Row:
import org.apache.spark.sql.Row
def rowMean(row: Row): Int = {
  val nonNulls = (0 until row.length)
    .map(i => (!row.isNullAt(i), row.getAs[Int](i)))
    .filter(_._1).map(_._2).toList
  nonNulls.sum / nonNulls.length
}
And another to fill nulls in a Row:
def rowFillNulls(row: Row, fill: Int): Row = {
  Row((0 until row.length).map(i => if (row.isNullAt(i)) fill else row.getAs[Int](i)): _*)
}
Now we can first compute each row mean:
val rowWithMean = df.map(row => (row,rowMean(row)))
And then fill it:
val result = sqlContext.createDataFrame(rowWithMean.map{case (row,mean) => rowFillNulls(row,mean)}, df.schema)
Finally view before and after...
df.show
+---+----+----+----+---+---+
| a| b| c| d| e| f|
+---+----+----+----+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2|null| 4| 0|
| 3|null|null|null| 6| 0|
| 4|null|null| 4| 4| 0|
+---+----+----+----+---+---+
result.show
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2| 2| 4| 0|
| 3| 3| 3| 3| 6| 0|
| 4| 3| 3| 4| 4| 0|
+---+---+---+---+---+---+
This will work for a DF of any width with Int columns. You can easily adapt it to other datatypes, even non-numeric ones (hint: inspect the df schema!), as sketched below.
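For instance, a hedged sketch of that schema-driven generalization; rowMeanByType is a hypothetical helper, only Int and Double fields are handled here, and everything else is simply skipped:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, IntegerType, DoubleType}
// use the schema to pick the right getter for each field
def rowMeanByType(row: Row, schema: StructType): Double = {
  val values = schema.fields.indices.flatMap { i =>
    if (row.isNullAt(i)) None
    else schema.fields(i).dataType match {
      case IntegerType => Some(row.getInt(i).toDouble)
      case DoubleType  => Some(row.getDouble(i))
      case _           => None // non-numeric columns are ignored
    }
  }
  if (values.isEmpty) Double.NaN else values.sum / values.length
}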
A bunch of imports:
import org.apache.spark.sql.functions.{col, isnan, isnull, lit, round, when}
import org.apache.spark.sql.Column
A few helper functions:
def nullOrNan(c: Column) = isnan(c) || isnull(c)
def rowMean(cols: Column*): Column = {
  val sum = cols
    .map(c => when(nullOrNan(c), lit(0.0)).otherwise(c))
    .fold(lit(0.0))(_ + _)
  val count = cols
    .map(c => when(nullOrNan(c), lit(0.0)).otherwise(lit(1.0)))
    .fold(lit(0.0))(_ + _)
  sum / count
}
A solution:
val mean = round(
  rowMean(df.columns.tail.map(col): _*)
).cast("int").alias("mean")
val exprs = df.columns.tail.map(
  c => when(nullOrNan(col(c)), mean).otherwise(col(c)).alias(c)
)
val filled = df.select(col(df.columns(0)) +: exprs: _*)
Well, this is a fun little problem - I will post my solution, but I will definitely watch and see if someone comes up with a better way of doing it :)
First I would introduce a couple of UDFs:
import org.apache.spark.sql.functions.{udf, array}
val avg = udf((values: Seq[Integer]) => {
  val notNullValues = values.filter(_ != null).map(_.toInt)
  notNullValues.sum / notNullValues.length
})
val replaceNullWithAvg = udf((x: Integer, avg: Integer) => if (x == null) avg else x)
which I would then apply to the DataFrame like this:
dataframe
  .withColumn("avg", avg(array(df.columns.tail.map(s => df.col(s)): _*)))
  .select('col1,
    replaceNullWithAvg('col2, 'avg) as "col2", replaceNullWithAvg('col3, 'avg) as "col3",
    replaceNullWithAvg('col4, 'avg) as "col4", replaceNullWithAvg('col5, 'avg) as "col5",
    replaceNullWithAvg('col6, 'avg) as "col6")
This will get you what you are looking for, but arguably not the most sophisticated code I have ever put together...

Select column by name with multiple aggregate columns after pivot with Spark Scala

I am trying to aggregate multiple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
Seems like a bug; the backticks cause the problem. One fix here would be to remove the backticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
  df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+
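If you prefer column names without the parentheses at all, the same foldLeft/withColumnRenamed trick can strip them too (a small variant of the above, not from the original answer):
val pivotedFlat = pivoted.columns.foldLeft(pivoted)((df, c) =>
  df.withColumnRenamed(c, c.replace("`", "").replace("(", "_").replace(")", "")))
// pivotedFlat.select("2_max_c", "3_max_d").show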

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A B C
--------------
4 blah 2
2 3
56 foo 3
And add a column to the end based on whether B is empty or not:
A B C D
--------------------
4 blah 2 1
2 3 0
56 foo 3 1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $"..."
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
  .toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0, but as commented in the code of when, it works on versions after 1.4.0.
My bad, I had missed one part of the question.
The best, cleanest way is to use a UDF. The explanation is within the code.
// create some example data as a DataFrame
// note: the third record has an empty string
case class Stuff(a: String, b: Int)
val d = sc.parallelize(Seq(("a", 1), ("b", 2), ("", 3), ("d", 4))
  .map { x => Stuff(x._1, x._2) }).toDF
// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty, 1 otherwise
val func = udf((s: String) => if (s.isEmpty) 0 else 1)
// create a new dataframe with an added column named "notempty"
val r = d.select($"a", $"b", func($"a").as("notempty"))
scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
  case Array() => df
  case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.
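Note the withColumn above yields a boolean; if you want D as 0/1 as in the question, one small tweak (a sketch, assuming Spark 2.x and its =!= operator) is to cast the condition:
import org.apache.spark.sql.functions.{coalesce, lit}
// rows where B is null or empty get 0, everything else gets 1
val withD = df.withColumn("D", (coalesce($"B", lit("")) =!= "").cast("int"))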