drop all columns with a special condition on a column spark - scala

I have a dataset and I need to drop columns which has a standard deviation equal to 0. I've tried:
val df = spark.read.option("header",true)
.option("inferSchema", "false").csv("C:/gg.csv")
val finalresult = df
.agg(df.columns.map(stddev(_)).head, df.columns.map(stddev(_)).tail: _*)
I want to compute the standard deviation of each column and drop the column if it it is is equal to zero
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe,
0,87,72,160,5,0.6993,2.9421,2.3745,3,4,
1,54,70,163,5,0.6301,2.7273,2.2205,3,4,
2,72,51,164,5,0.6551,2.9834,2.3993,3,4,
3,75,74,170,5,0.6966,2.9654,2.3699,3,4,
4,108,62,165,5,0.6087,2.7093,2.1619,3,4,
5,84,61,159,5,0.6876,2.938,2.3601,3,4,
6,89,64,168,5,0.6757,2.9547,2.3676,3,4,
7,75,72,160,5,0.7432,2.9331,2.3339,3,4,
8,64,62,153,5,0.6505,2.7676,2.2255,3,4,
9,82,58,159,5,0.6748,2.992,2.4043,3,4,
10,67,49,160,5,0.6633,2.9367,2.333,3,4,
11,85,53,160,5,0.6821,2.981,2.3822,3,4,

You can try this, use getValueMap and filter to get the column names which you want to drop, and then drop them:
//Extract the standard deviation from the data frame summary:
val stddev = df.describe().filter($"summary" === "stddev").drop("summary").first()
// Use `getValuesMap` and `filter` to get the columns names where stddev is equal to 0:
val to_drop = stddev.getValuesMap[String](df.columns).filter{ case (k, v) => v.toDouble == 0 }.keys
//Drop 0 stddev columns
df.drop(to_drop.toSeq: _*).show
+---------+-----+---+------+------+---------+--------+
|RowNumber|Poids|Age|Taille| Hmean|CoocParam|LdpParam|
+---------+-----+---+------+------+---------+--------+
| 0| 87| 72| 160|0.6993| 2.9421| 2.3745|
| 1| 54| 70| 163|0.6301| 2.7273| 2.2205|
| 2| 72| 51| 164|0.6551| 2.9834| 2.3993|
| 3| 75| 74| 170|0.6966| 2.9654| 2.3699|
| 4| 108| 62| 165|0.6087| 2.7093| 2.1619|
| 5| 84| 61| 159|0.6876| 2.938| 2.3601|
| 6| 89| 64| 168|0.6757| 2.9547| 2.3676|
| 7| 75| 72| 160|0.7432| 2.9331| 2.3339|
| 8| 64| 62| 153|0.6505| 2.7676| 2.2255|
| 9| 82| 58| 159|0.6748| 2.992| 2.4043|
| 10| 67| 49| 160|0.6633| 2.9367| 2.333|
| 11| 85| 53| 160|0.6821| 2.981| 2.3822|
+---------+-----+---+------+------+---------+--------+

OK, I have written a solution that is independent of your dataset. Required imports and example data:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, stddev, col}
val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
df.show(5)
// +---+---+
// | X1| X2|
// +---+---+
// | 1| 0|
// | 2| 0|
// | 3| 0|
// | 4| 0|
// | 5| 0|
First compute standard deviation by column:
// no need to rename but I did it to become more human
// readable when you show df2
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)
stddevs.show // df2 contains the stddev of each columns
// +-----------------+---+
// | X1| X2|
// +-----------------+---+
// |288.5307609250702|0.0|
// +-----------------+---+
Collect the first row and filter columns to keep:
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(df.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
Select and check the results:
df.select(columnsToKeep: _*).show
// +---+
// | X1|
// +---+
// | 1|
// | 2|
// | 3|
// | 4|
// | 5|

Related

Collect statistics from the DataFrame

I'm collecting dataframe statistics.
The maximum minimum average value of the column. The number of zeros in the column. The number of empty values in the column.
Conditions:
Number of columns n < 2000
Number of dataframe entries r < 10^9
The stack() function is used for the solution
https://www.hadoopinrealworld.com/understanding-stack-function-in-spark/#:~:text=stack%20function%20in%20Spark%20takes,an%20argument%20followed%20by%20expressions.&text=stack%20function%20will%20generate%20n%20rows%20by%20evaluating%20the%20expressions.
What scares:
The number of rows in the intermediate dataframe rusultDF. cal("period_date").dropDuplicates * columnsNames.size * r = many
Input:
val columnsNames = List("col_name1", "col_name2")
+---------+---------+-----------+
|col_name1|col_name2|period_date|
+---------+---------+-----------+
| 11| 21| 2022-01-31|
| 12| 22| 2022-01-31|
| 13| 23| 2022-03-31|
+---------+---------+-----------+
Output:
+-----------+---------+----------+----------+---------+---------+---------+
|period_date| columns|count_null|count_zero|avg_value|mix_value|man_value|
+-----------+---------+----------+----------+---------+---------+---------+
| 2022-01-31|col_name2| 0| 0| 21.5| 21| 22|
| 2022-03-31|col_name1| 0| 0| 13.0| 13| 13|
| 2022-03-31|col_name2| 0| 0| 23.0| 23| 23|
| 2022-01-31|col_name1| 0| 0| 11.5| 11| 12|
+-----------+---------+----------+----------+---------+---------+---------+
My solution:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local").appName("spark test5").getOrCreate()
import spark.implicits._
case class RealStructure(col_name1: Int, col_name2: Int, period_date: String)
val userTableDf = List(
RealStructure(11, 21, "2022-01-31"),
RealStructure(12, 22, "2022-01-31"),
RealStructure(13, 23, "2022-03-31")
) toDF()
//userTableDf.show()
//Start
new StatisticCollector(userTableDf)
class StatisticCollector(userTableDf: DataFrame) {
val columnsNames = List("col_name1", "col_name2")
val stack = s"stack(${columnsNames.length}, ${columnsNames.map(name => s"'$name', $name").mkString(",")})"
val resultDF = userTableDf.select(col("period_date"),
expr(s"$stack as (columns, values)")
)
//resultDF.show()
println(stack)
/**
+-----------+---------+------+
|period_date| columns|values|
+-----------+---------+------+
| 2022-01-31|col_name1| 11|
| 2022-01-31|col_name2| 21|
| 2022-01-31|col_name1| 12|
| 2022-01-31|col_name2| 22|
| 2022-03-31|col_name1| 13|
| 2022-03-31|col_name2| 23|
+-----------+---------+------+
stack(2, 'col_name1', col_name1,'col_name2', col_name2)
**/
val superResultDF = resultDF.groupBy(col("period_date"), col("columns")).agg(
sum(when(col("values").isNull, 1).otherwise(0)).alias("count_null"),
sum(when(col("values") === 0, 1).otherwise(0)).alias("count_zero"),
avg("values").cast("double").alias("avg_value"),
min(col("values")).alias("mix_value"),
max(col("values")).alias("man_value")
)
superResultDF.show()
}
Please give your assessment, if you see what can be solved more efficiently, then write how you would solve it.
The calculation speed is important.
It is necessary as quickly as it is provided by God.

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to get C1 to C100 in the cumulative sum based on id and date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping from C1- C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
Window.partitionBy("ID")
.orderBy(col("date").asc)))
I found a similar question here but he manually did it for two columns : Cumulative sum in Spark
Its a classical use for foldLeft. Let's generate some data first :
import org.apache.spark.sql.expressions._
val df = spark.range(1000)
.withColumn("c1", 'id + 3)
.withColumn("c2", 'id % 2 + 1)
.withColumn("date", monotonically_increasing_id)
.withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way using simple select expression :
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+

Apply UDF function to Spark window where the input paramter is a list of all column values in range

I would like to build a moving average on each row in a window. Let's say -10 rows. BUT if there are less than 10 rows available I would like to insert a 0 in the resulting row -> new column.
So what I would try to achieve is using a UDF in an aggregate window with input paramter List() (or whatever superclass) which has the values of all rows available.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List( 1,2,3,4) if no more rows are available and take this as input paramter for the udf function. udf function should return a calculated value or 0 if less than 10 rows available.
the above code terminates: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition without the need of UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
toDF("id", "amount")
val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)
df.
withColumn("slidingCount", count($"amount").over(winSpec)).
withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
otherwise(avg($"amount").over(winSpec))
).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// | 1| 10.0| 1| 0.0|
// | 1| 11.0| 2| 0.0|
// | 1| 12.0| 3| 0.0|
// | 1| 13.0| 4| 11.5|
// | 1| 14.0| 4| 12.5|
// | 1| 15.0| 4| 13.5|
// | 1| 16.0| 4| 14.5|
// | 1| 17.0| 4| 15.5|
// | 2| 20.0| 1| 0.0|
// | 2| 21.0| 2| 0.0|
// | 2| 22.0| 3| 0.0|
// | 2| 23.0| 4| 21.5|
// | 2| 24.0| 4| 22.5|
// | 2| 25.0| 4| 23.5|
// | 2| 26.0| 4| 24.5|
// | 2| 27.0| 4| 25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf{ (ls: Seq[Double]) =>
val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)){
case ((a, i), next) => (a + (next-a)/i, i + 1)
}
if (count <= n) 0.0 else avg // Expand/Modify this for specific requirement
}
// To apply the UDF:
df.
withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
show
Note that unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large to be passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for what's needed for your specific requirement, consider passing them to the UDF instead.
In general, especially if the data at hand is already in DataFrame format, it'd perform and scale better by using built-in DataFrame API to take advantage of Spark's execution engine optimization than using user-defined UDF/UDAF. You might be interested in reading this article re: advantages of DataFrame/Dataset API over UDF/UDAF.

Combining RDD's with some values missing

Hi I have two RDD's I want to combine into 1.
The first RDD is of the format
//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
I have another RDD
//((UserID,MovID),"NA")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))
So the keys in the second RDD are more in no. but overlap with RDD1. I need to combine the RDD's so that only those keys of 2nd RDD append to RDD1 which are not there in RDD1.
You can do something like this -
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)
val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))
val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
// Convert to Dataframe so it is easier to handle
val df1 = rdd1.toDF
val df2 = rdd2.toDF
// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
+----+---+----+
df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| -1|
| U5| M4| 3|
| U6| M6| -1|
| U3| M2| -1|
| U4| M2| -1|
| U4| M3| 5|
+----+---+----+
// Figure out the extra reviews in second dataframe that do not match (user, mov) in first
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
// Final result of combining only unique values in df2
unionByName(df1, xtraReviews).show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
| U5| M4| 3|
| U4| M3| 5|
| U6| M6| -1|
+----+---+----+
It might also be possible to do it in this way:
RDD's are really slow, so read your data or convert your data in dataframes.
Use spark dropDuplicates() on both the dataframes like df.dropDuplicates(['Key1', 'Key2']) to get distinct values on keys in both of your dataframe and then
simply union them like df1.union(df2).
Benefit is you are doing it in spark way and hence you have all the parallelism and speed.

I want to convert all my existing UDTFs in Hive to Scala functions and use it from Spark SQL

Can any one give me an example UDTF (eg; explode) written in scala which returns multiple row and use it as UDF in SparkSQL?
Table: table1
+------+----------+----------+
|userId|someString| varA|
+------+----------+----------+
| 1| example1| [0, 2, 5]|
| 2| example2|[1, 20, 5]|
+------+----------+----------+
I'd like to create the following Scala code:
def exampleUDTF(var: Seq[Int]) = <Return Type???> {
// code to explode varA field ???
}
sqlContext.udf.register("exampleUDTF",exampleUDTF _)
sqlContext.sql("FROM table1 SELECT userId, someString, exampleUDTF(varA)").collect().foreach(println)
Expected output:
+------+----------+----+
|userId|someString|varA|
+------+----------+----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+----+
You can't do this with a UDF. A UDF can only add a single column to a DataFrame. There is, however, a function called DataFrame.explode, which you can use instead. To do it with your example, you would do this:
import org.apache.spark.sql._
val df = Seq(
(1,"example1", Array(0,2,5)),
(2,"example2", Array(1,20,5))
).toDF("userId", "someString", "varA")
val explodedDf = df.explode($"varA"){
case Row(arr: Seq[Int]) => arr.toArray.map(a => Tuple1(a))
}.drop($"varA").withColumnRenamed("_1", "varA")
+------+----------+-----+
|userId|someString| varA|
+------+----------+-----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+-----+
Note that explode takes a function as an argument. So even though you can't create a UDF to do what you want, you can create a function to pass to explode to do what you want. Like this:
def exploder(row: Row) : Array[Tuple1[Int]] = {
row match { case Row(arr) => arr.toArray.map(v => Tuple1(v)) }
}
df.explode($"varA")(exploder)
That's about the best you are going to get in terms of recreating a UDTF.
Hive Table:
name id
["Subhajit Sen","Binoy Mondal","Shantanu Dutta"] 15
["Gobinathan SP","Harsh Gupta","Rahul Anand"] 16
Creating a scala function :
def toUpper(name: Seq[String]) = (name.map(a => a.toUpperCase)).toSeq
Registering function as UDF :
sqlContext.udf.register("toUpper",toUpper _)
Calling the UDF using sqlContext and storing output as DataFrame object :
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name")
Exploding the DataFrame :
df.explode(df("Name")){case org.apache.spark.sql.Row(arr: Seq[String]) => arr.toSeq.map(v => Tuple1(v))}.drop(df("Name")).withColumnRenamed("_1","Name").show
Result:
+--------------+
| Name|
+--------------+
| SUBHAJIT SEN|
| BINOY MONDAL|
|SHANTANU DUTTA|
| GOBINATHAN SP|
| HARSH GUPTA|
| RAHUL ANAND|
+--------------+