Adding a column of rowsums across a list of columns in Spark Dataframe - scala

I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 10 27
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? Based off of this answer which is basically what I want but it is using the python API instead of scala (Add column sum as new column in PySpark dataframe) I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columstosum.head, columnstosum.tail: _*).sum)
This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.

You should try the following:
import org.apache.spark.sql.functions._
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
("a", 5, 7, 9, 12, 13),
("b", 6, 4, 3, 20, 17),
("c", 4, 9, 4, 6 , 9),
("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+

Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col
def sum_(*cols):
return reduce(add, cols, lit(0))
columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
select("*", sum_(*columnstosum))
Both will default to NA if there is a missing value in the row. You can use DataFrameNaFunctions.fill or coalesce function to avoid that.

I assume you have a dataframe df. Then you can sum up all cols except your ID col. This is helpful when you have many cols and you don't want to manually mention names of all columns like everyone mentioned above. This post has the same answer.
val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)

Here's an elegant solution using python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will influence something similar in Spark ... anyone?.

Related

How to write a function that takes a list of column names of a DataFrame, reorders selected columns the left and preserves unselected columns

I'd like to build a function
def reorderColumns(columnNames: List[String]) = ...
that can be applied to a Spark DataFrame such that the columns specified in columnNames gets reordered to the left, and remaining columns (in any order) remain to the right.
Example:
Given a df with the following 5 columns
| A | B | C | D | E
df.reorderColumns(["D","B","A"]) returns a df with columns ordered like so:
| D | B | A | C | E
Try this one:
def reorderColumns(df: DataFrame, columns: Array[String]): DataFrame = {
val restColumns: Array[String] = df.columns.filterNot(c => columns.contains(c))
df.select((columns ++ restColumns).map(col): _*)
}
Usage example:
val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val df = List((1, 3, 1, 6), (2, 4, 2, 5), (3, 6, 3, 4)).toDF("colA", "colB", "colC", "colD")
reorderColumns(df, Array("colC", "colB")).show
// output:
//+----+----+----+----+
//|colC|colB|colA|colD|
//+----+----+----+----+
//| 1| 3| 1| 6|
//| 2| 4| 2| 5|
//| 3| 6| 3| 4|
//+----+----+----+----+

PySpark: Pivot only one row to column

I have a dataframe like so:
df = sc.parallelize([("num1", "1"), ("num2", "5"), ("total", "10")]).toDF(("key", "val"))
key val
num1 1
num2 5
total 10
I want to pivot only the total row to a new column and keep its value for each row:
key val total
num1 1 10
num2 5 10
I've tried pivoting and aggregating but cannot get the one column with the same value.
You could join a dataframe with only the total to a dataframe without the total.
Another option would be to collect the total and add it as a literal.
from pyspark.sql import functions as f
# option 1
df1 = df.filter("key <> 'total'")
df2 = df.filter("key = 'total'").select(f.col('val').alias('total'))
df1.join(df2).show()
+----+---+-----+
| key|val|total|
+----+---+-----+
|num1| 1| 10|
|num2| 5| 10|
+----+---+-----+
# option 2
total = df.filter("key = 'total'").select('val').collect()[0][0]
df.filter("key <> 'total'").withColumn('total', f.lit(total)).show()
+----+---+-----+
| key|val|total|
+----+---+-----+
|num1| 1| 10|
|num2| 5| 10|
+----+---+-----+

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id
Value
Weights
1
2
4
1
5
2
2
1
4
2
6
2
2
9
4
3
2
4
I need to groupBy by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number - the number inside the array will be repeated as many times as is specified in the column 'Weights' (casting to int is necessary, because array_repeat expects this column to be of int type. After this part the first value of 2 will be transformed into [2,2,2,2].
Then, explode will create a row for every element in the array. So, the line [2,2,2,2] will be transformed into 4 rows, each containing an integer 2.
Then you can calculate statistics, the results will have weights applied, as your dataframe is now transformed according to the weights.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+

Create separate columns from array column in Spark Dataframe in Scala when array is large [duplicate]

This question already has answers here:
Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]
(5 answers)
Closed 4 years ago.
I have two columns: one of type Integer and one of type linalg.Vector. I can convert linalg.Vector to array. Each array has 32 elements. I want to convert each element in the array to a column. So the input is like :
column1 column2
(3, 5, 25, ...., 12) 3
(2, 7, 15, ...., 10) 4
(1, 10, 12, ..., 35) 2
Output should be:
column1_1 column1_2 column1_3 ......... column1_32 column 2
3 5 25 ......... 12 3
2 7 15 ......... 10 4
1 1 0 12 ......... 12 2
Except, in my case there are 32 elements in the array. It is too many to use the method in question Convert Array of String column to multiple columns in spark scala
I tried a few ways and none of it worked. What is the right way to do this?
Thanks a lot.
scala> import org.apache.spark.sql.Column
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]
scala> def getColAtIndex(id:Int): Column = col(s"column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column
scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") //Here, instead of 2, you can give the value of n
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)
scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
| 3| 5| 25| 3|
| 2| 7| 15| 4|
| 1| 10| 12| 2|
+---------+---------+---------+-------+
This can be done best by writing a UserDefinedFunction like:
val getElementFromVectorUDF = udf(getElementFromVector(_: Vector, _: Int))
def getElementFromVector(vec: Vector, idx: Int) = {
vec(idx)
}
You can use it like this then:
df.select(
getElementFromVectorUDF($"column1", 0) as "column1_0",
...
getElementFromVectorUDF($"column1", n) as "column1_n",
)
I hope this helps.

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A B C
--------------
4 blah 2
2 3
56 foo 3
And add a column to the end based on whether B is empty or not:
A B C D
--------------------
4 blah 2 1
2 3 0
56 foo 3 1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
| A| B| C| D|
+---+----+---+---+
| 4|blah| 2| 1|
| 2| | 3| 0|
| 56| foo| 3| 1|
|100|null| 5| 0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0 but as commented in the code of when, it works on the versions after 1.4.0.
My bad, I had missed one part of the question.
Best, cleanest way is to use a UDF.
Explanation within the code.
// create some example data...BY DataFrame
// note, third record has an empty string
case class Stuff(a:String,b:Int)
val d= sc.parallelize(Seq( ("a",1),("b",2),
("",3) ,("d",4)).map { x => Stuff(x._1,x._2) }).toDF
// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 is string empty
val func = udf( (s:String) => if(s.isEmpty) 0 else 1 )
// create new dataframe with added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )
scala> r.show
+---+---+--------+
| a| b|notempty|
+---+---+--------+
| a| 1| 1111|
| b| 2| 1111|
| | 3| 0|
| d| 4| 1111|
+---+---+--------+
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
case Array() => df
case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal hit