Spark Scala Split DataFrame by some value range

Suppose I have a dataframe with a column named x with a value range of [0, 1]. I hope to split it by the value of column x with ranges like [0, 0.1), [0.1, 0.2)...[0.9, 1]. Is there a good and fast way to do that? I'm using Spark 2 in Scala.
Update: Ideally there should be 10 new dataframes that contain data for each range.

Expanding on @Psidom's solution for creating ranges, here's one approach to create a dataframe for each range:
import org.apache.spark.sql.types.IntegerType
val df = Seq(0.2, 0.71, 0.95, 0.33, 0.28, 0.8, 0.73).toDF("x")
val df2 = df.withColumn("g", ($"x" * 10.0).cast(IntegerType))
df2.show
+----+---+
| x| g|
+----+---+
| 0.2| 2|
|0.71| 7|
|0.95| 9|
|0.33| 3|
|0.28| 2|
| 0.8| 8|
|0.73| 7|
+----+---+
val dfMap = df2.select($"g").distinct.
  collect.
  flatMap(_.toSeq).
  map( g => g -> df2.where($"g" === g) ).
  toMap
dfMap.getOrElse(3, null).show
+----+---+
| x| g|
+----+---+
|0.33| 3|
+----+---+
dfMap.getOrElse(7, null).show
+----+---+
| x| g|
+----+---+
|0.71| 7|
|0.73| 7|
+----+---+
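If the end goal is to persist each range separately rather than keep ten DataFrames around, a partitioned write may be simpler than building a Map of DataFrames; a minimal sketch (the output path is just a placeholder):
// Write one directory per range value of "g" instead of keeping a Map of DataFrames.
// "/tmp/x_buckets" is a placeholder output path.
df2.write
  .partitionBy("g")
  .parquet("/tmp/x_buckets")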
[UPDATE]
If your ranges are irregular, you can define a function which maps a Double into the corresponding Int range id, then wrap it with a UDF, like in the following:
val g: Double => Int = x => x match {
  case x if (x >= 0.0 && x < 0.12345) => 1
  case x if (x >= 0.12345 && x < 0.4834) => 2
  case x if (x >= 0.4834 && x < 1.0) => 3
  case _ => 99 // catch-all
}
val groupUDF = udf(g)
val df = Seq(0.1, 0.2, 0.71, 0.95, 0.03, 0.09, 0.44, 5.0).toDF("x")
val df2 = df.withColumn("g", groupUDF($"x"))
df2.show
+----+---+
| x| g|
+----+---+
| 0.1| 1|
| 0.2| 2|
|0.71| 3|
|0.95| 3|
|0.03| 1|
|0.09| 1|
|0.44| 2|
| 5.0| 99|
+----+---+
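If you would rather not hand-maintain the ranges in a UDF, Spark ML ships a Bucketizer that accepts arbitrary split points; a rough equivalent of the above (bucket ids start at 0, and the outer infinities catch out-of-range values such as 5.0):
import org.apache.spark.ml.feature.Bucketizer

// Split points must be strictly increasing; the bucket index is
// written out as a double column.
val bucketizer = new Bucketizer()
  .setInputCol("x")
  .setOutputCol("g")
  .setSplits(Array(Double.NegativeInfinity, 0.12345, 0.4834, 1.0, Double.PositiveInfinity))

bucketizer.transform(df).show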

If you meant to discretize a double-typed column, you might just do this (multiply the column by 10 and then cast it to integer type; the column will be cut into 10 discrete bins):
import org.apache.spark.sql.types.IntegerType
val df = Seq(0.32, 0.5, 0.99, 0.72, 0.11, 0.03).toDF("A")
// df: org.apache.spark.sql.DataFrame = [A: double]
df.withColumn("new", ($"A" * 10).cast(IntegerType)).show
+----+---+
| A|new|
+----+---+
|0.32| 3|
| 0.5| 5|
|0.99| 9|
|0.72| 7|
|0.11| 1|
|0.03| 0|
+----+---+
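One caveat for the [0.9, 1] bucket in the original question: multiplying by 10 sends a value of exactly 1.0 into bin 10 rather than 9. If that matters, the result could be clamped, for example:
import org.apache.spark.sql.functions.{least, lit}

// Values of exactly 1.0 end up in bin 9 together with [0.9, 1).
df.withColumn("new", least(($"A" * 10).cast(IntegerType), lit(9))).show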

Related

How to remove Spark values that are out of sequence

I need to remove some values from a dataframe that are not in the right place.
I have the following dataframe, for example:
+-----+-----+
|count|PHASE|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 6|
| 4| 6|
| 5| 8|
| 6| 4|
| 7| 4|
| 8| 4|
+-----+-----+
I need to remove 6 and 8 from the dataframe because of the following rules:
phase === 3 and lastPhase.isNull
phase === 4 and lastPhase.isin(2, 3)
phase === 6 and lastPhase.isin(4, 5)
phase === 8 and lastPhase.isin(6, 7)
This is a huge dataframe and those misplaced values can happen many times.
Could you help with that, please?
Expected output:
+-----+-----+------+
|count|PHASE|CHANGE|
+-----+-----+------+
| 1| 3| 3|
| 2| 3| 3|
| 3| 6| 3|
| 4| 6| 3|
| 5| 8| 3|
| 6| 4| 4|
| 7| 4| 4|
| 8| 4| 4|
+-----+-----+------+
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rows = Seq(
  Row(1, 3),
  Row(2, 3),
  Row(3, 6),
  Row(4, 6),
  Row(5, 8),
  Row(6, 4),
  Row(7, 4),
  Row(8, 4)
)
val schema = StructType(
  Seq(StructField("count", IntegerType), StructField("PHASE", IntegerType))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows),
  schema
)
Thanks in advance!
If I understood your question correctly, you want to populate the CHANGE column as follows:
For a dataframe sorted by the count column: for each row, if the value of the PHASE column matches a defined set of rules, set that value in the CHANGE column; if the value doesn't match the rules, set the latest valid PHASE value in the CHANGE column.
To do so, you can use a user-defined aggregate function to compute the CHANGE column over a window ordered by the count column.
First, define an Aggregator object whose buffer will be the last valid phase, and implement your set of rules in its reduce function:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}
object LatestValidPhase extends Aggregator[Integer, Integer, Integer] {
  def zero: Integer = null

  def reduce(lastPhase: Integer, phase: Integer): Integer = {
    if (lastPhase == null && phase == 3) {
      phase
    } else if (Set(2, 3).contains(lastPhase) && phase == 4) {
      phase
    } else if (Set(4, 5).contains(lastPhase) && phase == 6) {
      phase
    } else if (Set(6, 7).contains(lastPhase) && phase == 8) {
      phase
    } else {
      lastPhase
    }
  }

  def merge(b1: Integer, b2: Integer): Integer = {
    throw new NotImplementedError("should not use as general aggregation")
  }

  def finish(reduction: Integer): Integer = reduction

  def bufferEncoder: Encoder[Integer] = Encoders.INT

  def outputEncoder: Encoder[Integer] = Encoders.INT
}
Then transform it into a user-defined aggregate function and apply it over a window ordered by the count column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val latest_valid_phase = udaf(LatestValidPhase)
val window = Window.orderBy("count")
df.withColumn("CHANGE", latest_valid_phase(col("PHASE")).over(window))

How to retain the column structure of a Spark Dataframe following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match {
  case 0 => 5
  case x: Int => x
}))
b.show()
+---------+
| value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.
You can define a UDF to apply this substitution. For example:
def subsDef(k: Int): Int = if(k==0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you like, to every column of the DF (note that df must be declared as a var for the reassignments below):
// to a single column, for example "x"
df = df.withColumn("x", subs(col("x")))
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 5| 0| 1|
| 5| 0| 0|
+---+---+---+
// to every column of the DF
df.columns.foreach(c => {
  df = df.withColumn(c, subs(col(c)))
})
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 5|
| 5| 5| 1|
| 5| 5| 5|
+---+---+---+
Rather than transforming the DataFrame row-wise, consider using the built-in Spark API functions when/otherwise, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select( cols.map( c =>
when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
): _*
).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | 1| 2| 5|
// | 5| 5| 1|
// | 5| 5| 5|
// +---+---+---+
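The single select builds one projection, which tends to be friendlier to the optimizer than many chained withColumn calls; but if you prefer withColumn, the same replacement can be folded over the columns:
// Same when/otherwise substitution expressed as a foldLeft over the column names.
val replaced = cols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) === vFrom, vTo).otherwise(col(c)))
}
replaced.show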
There are various ways to do it; here are some:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

df.map(row => {
  val size = row.size
  var seq: Seq[Int] = Seq.empty[Int]
  for (a <- 0 to size - 1) {
    val value: Int = row(a).asInstanceOf[Int]
    val newVal: Int = value match {
      case 0 => 5
      case _ => value
    }
    seq = seq :+ newVal
  }
  Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
val columns = df.columns
df.select(
columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
.show()
def fun: (Int => Int) = { x =>
  if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
          function(col("y")).as("y"),
          function(col("z")).as("z"))
  .show()
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
  case Row(a: Int, b: Int, c: Int) =>
    Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
  .show()
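An alternative that avoids RowEncoder altogether (assuming spark.implicits._ is in scope and the three columns are Int) is to go through a typed Dataset and restore the names with toDF:
// Map over a typed Dataset of tuples, then rename the columns back.
df.as[(Int, Int, Int)]
  .map { case (a, b, c) => (checkZero(a), checkZero(b), checkZero(c)) }
  .toDF("x", "y", "z")
  .show()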

Find smallest value in a rolling window partitioned by group

I have a dataframe containing different geographical positions as well as the distance to some other places. My problem is that I want to find the closest n places for each geographical position. My first idea was to use groupBy() followed by some sort of aggregation but I couldn't get that to work.
Instead I tried to first convert the dataframe to an RDD and then use groupByKey(); it works, but the method is cumbersome. Is there any better alternative to solve this problem? Maybe using groupBy() and aggregating somehow?
A small example of my approach where n=2 with input:
+---+--------+
| id|distance|
+---+--------+
| 1| 5.0|
| 1| 3.0|
| 1| 7.0|
| 1| 4.0|
| 2| 1.0|
| 2| 3.0|
| 2| 3.0|
| 2| 7.0|
+---+--------+
Code:
df.rdd.map{case Row(id: Long, distance: Double) => (id, distance)}
.groupByKey()
.map{case (id: Long, iter: Iterable[Double]) => (id, iter.toSeq.sorted.take(2))}
.toDF("id", "distance")
.withColumn("distance", explode($"distance"))
Output:
+---+--------+
| id|distance|
+---+--------+
| 1| 3.0|
| 1| 4.0|
| 2| 1.0|
| 2| 3.0|
+---+--------+
You can use Window as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

case class A(id: Long, distance: Double)

val df = List(A(1, 5.0), A(1, 3.0), A(1, 7.0), A(1, 4.0), A(2, 1.0), A(2, 3.0), A(2, 4.0), A(2, 7.0))
  .toDF("id", "distance")
val window = Window.partitionBy("id").orderBy("distance")
val result = df.withColumn("rank", row_number().over(window)).where(col("rank") <= 2)
result.drop("rank").show()
You can increase the number of results you get by changing the 2.
Hope this helps.
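To address the "maybe using groupBy() and aggregate somehow?" part of the question: on Spark 2.4+ the same result can be had without a window by collecting, sorting and slicing the distances per id; a sketch:
import org.apache.spark.sql.functions.{col, collect_list, explode, slice, sort_array}

df.groupBy("id")
  .agg(sort_array(collect_list(col("distance"))).as("distances"))
  .withColumn("distance", explode(slice(col("distances"), 1, 2)))  // slice is 1-based
  .select("id", "distance")
  .show()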

Fill Nan with mean of the row in Scala-Spark

I have an RDD with 6 columns, where the last 5 columns might contain NaNs. My intention is to replace each NaN with the average of the row's other values among the last 5 columns that are not NaN. For instance, having this input:
1, 2, 3, 4, 5, 6
2, 2, 2, NaN, 4, 0
3, NaN, NaN, NaN, 6, 0
4, NaN, NaN, 4, 4, 0
The output should be:
1, 2, 3, 4, 5, 6
2, 2, 2, 2, 4, 0
3, 3, 3, 3, 6, 0
4, 3, 3, 4, 4, 0
I know how to fill those NaNs with the average value of the column by transforming the RDD to a DataFrame:
var aux1 = df.select(df.columns.map(c => mean(col(c))) :_*)
var aux2 = df.na.fill(/*get values of aux1*/)
My question is, how can you do this operation but instead of filling the NaN with the column average, fill it with an average of the values of a subgroup of the row?
You can do this by defining a function to get the mean, and another function to fill nulls in a row.
Given the DF you presented:
val df = sc.parallelize(List((Some(1),Some(2),Some(3),Some(4),Some(5),Some(6)),(Some(2),Some(2),Some(2),None,Some(4),Some(0)),(Some(3),None,None,None,Some(6),Some(0)),(Some(4),None,None,Some(4),Some(4),Some(0)))).toDF("a","b","c","d","e","f")
We need a function to get the mean of a Row:
import org.apache.spark.sql.Row
def rowMean(row: Row): Int = {
  val nonNulls = (0 until row.length).map(i => (!row.isNullAt(i), row.getAs[Int](i))).filter(_._1).map(_._2).toList
  nonNulls.sum / nonNulls.length
}
And another to fill nulls in a Row:
def rowFillNulls(row: Row, fill: Int): Row = {
  Row((0 until row.length).map(i => if (row.isNullAt(i)) fill else row.getAs[Int](i)) : _*)
}
Now we can first compute each row mean:
val rowWithMean = df.map(row => (row,rowMean(row)))
And then fill it:
val result = sqlContext.createDataFrame(rowWithMean.map{case (row,mean) => rowFillNulls(row,mean)}, df.schema)
Finally view before and after...
df.show
+---+----+----+----+---+---+
| a| b| c| d| e| f|
+---+----+----+----+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2|null| 4| 0|
| 3|null|null|null| 6| 0|
| 4|null|null| 4| 4| 0|
+---+----+----+----+---+---+
result.show
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
| 2| 2| 2| 2| 4| 0|
| 3| 3| 3| 3| 6| 0|
| 4| 3| 3| 4| 4| 0|
+---+---+---+---+---+---+
This will work for any width DF with Int columns. You can easily update this to other datatypes, even non-numeric (hint, inspect the df schema!)
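Following that hint, a minimal sketch of a type-aware row mean that reads each value according to its declared type (only IntegerType and DoubleType handled here) might look like:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Uses the row's schema to decide how to extract each non-null value.
def rowMeanTyped(row: Row): Double = {
  val values = row.schema.fields.zipWithIndex.collect {
    case (f, i) if !row.isNullAt(i) && f.dataType == IntegerType => row.getInt(i).toDouble
    case (f, i) if !row.isNullAt(i) && f.dataType == DoubleType  => row.getDouble(i)
  }
  values.sum / values.length
}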
A bunch of imports:
import org.apache.spark.sql.functions.{col, isnan, isnull, lit, round, when}
import org.apache.spark.sql.Column
A few helper functions:
def nullOrNan(c: Column) = isnan(c) || isnull(c)
def rowMean(cols: Column*): Column = {
  val sum = cols
    .map(c => when(nullOrNan(c), lit(0.0)).otherwise(c))
    .fold(lit(0.0))(_ + _)
  val count = cols
    .map(c => when(nullOrNan(c), lit(0.0)).otherwise(lit(1.0)))
    .fold(lit(0.0))(_ + _)
  sum / count
}
A solution:
val mean = round(
  rowMean(df.columns.tail.map(col): _*)
).cast("int").alias("mean")
val exprs = df.columns.tail.map(
  c => when(nullOrNan(col(c)), mean).otherwise(col(c)).alias(c)
)
val filled = df.select(col(df.columns(0)) +: exprs: _*)
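A quick check: the round(...) is what makes the row with remaining values (4, 4, 0) average to 8/3 ≈ 2.67 and land on 3, matching the expected output in the question.
// Should reproduce the expected output from the question.
filled.show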
Well, this is a fun little problem - I will post my solution, but I will definitely watch and see if someone comes up with a better way of doing it :)
First I would introduce a couple of udfs:
val avg = udf((values: Seq[Integer]) => {
  val notNullValues = values.filter(_ != null).map(_.toInt)
  notNullValues.sum / notNullValues.length
})
val replaceNullWithAvg = udf((x: Integer, avg: Integer) => if (x == null) avg else x)
which I would then apply to the DataFrame like this:
dataframe
  .withColumn("avg", avg(array(dataframe.columns.tail.map(s => dataframe.col(s)): _*)))
  .select('col1, replaceNullWithAvg('col2, 'avg) as "col2", replaceNullWithAvg('col3, 'avg) as "col3", replaceNullWithAvg('col4, 'avg) as "col4", replaceNullWithAvg('col5, 'avg) as "col5", replaceNullWithAvg('col6, 'avg) as "col6")
This will get you what you are looking for, but arguably not the most sophisticated code I have ever put together...
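The hard-coded select list can be generalized with a fold over the columns, assuming the first column is the one to leave untouched:
// Apply replaceNullWithAvg to every column after the first, then drop the helper column.
val tailCols = dataframe.columns.tail
val result = tailCols
  .foldLeft(dataframe.withColumn("avg", avg(array(tailCols.map(dataframe.col): _*)))) {
    (acc, c) => acc.withColumn(c, replaceNullWithAvg(acc(c), acc("avg")))
  }
  .drop("avg")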

drop all columns with a special condition on a column spark

I have a dataset and I need to drop the columns which have a standard deviation equal to 0. I've tried:
val df = spark.read.option("header",true)
.option("inferSchema", "false").csv("C:/gg.csv")
val finalresult = df
.agg(df.columns.map(stddev(_)).head, df.columns.map(stddev(_)).tail: _*)
I want to compute the standard deviation of each column and drop the column if it is equal to zero. Here is the data:
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe,
0,87,72,160,5,0.6993,2.9421,2.3745,3,4,
1,54,70,163,5,0.6301,2.7273,2.2205,3,4,
2,72,51,164,5,0.6551,2.9834,2.3993,3,4,
3,75,74,170,5,0.6966,2.9654,2.3699,3,4,
4,108,62,165,5,0.6087,2.7093,2.1619,3,4,
5,84,61,159,5,0.6876,2.938,2.3601,3,4,
6,89,64,168,5,0.6757,2.9547,2.3676,3,4,
7,75,72,160,5,0.7432,2.9331,2.3339,3,4,
8,64,62,153,5,0.6505,2.7676,2.2255,3,4,
9,82,58,159,5,0.6748,2.992,2.4043,3,4,
10,67,49,160,5,0.6633,2.9367,2.333,3,4,
11,85,53,160,5,0.6821,2.981,2.3822,3,4,
You can try this: use getValuesMap and filter to get the column names which you want to drop, and then drop them:
//Extract the standard deviation from the data frame summary:
val stddev = df.describe().filter($"summary" === "stddev").drop("summary").first()
// Use `getValuesMap` and `filter` to get the column names where stddev is equal to 0:
val to_drop = stddev.getValuesMap[String](df.columns).filter{ case (k, v) => v.toDouble == 0 }.keys
//Drop 0 stddev columns
df.drop(to_drop.toSeq: _*).show
+---------+-----+---+------+------+---------+--------+
|RowNumber|Poids|Age|Taille| Hmean|CoocParam|LdpParam|
+---------+-----+---+------+------+---------+--------+
| 0| 87| 72| 160|0.6993| 2.9421| 2.3745|
| 1| 54| 70| 163|0.6301| 2.7273| 2.2205|
| 2| 72| 51| 164|0.6551| 2.9834| 2.3993|
| 3| 75| 74| 170|0.6966| 2.9654| 2.3699|
| 4| 108| 62| 165|0.6087| 2.7093| 2.1619|
| 5| 84| 61| 159|0.6876| 2.938| 2.3601|
| 6| 89| 64| 168|0.6757| 2.9547| 2.3676|
| 7| 75| 72| 160|0.7432| 2.9331| 2.3339|
| 8| 64| 62| 153|0.6505| 2.7676| 2.2255|
| 9| 82| 58| 159|0.6748| 2.992| 2.4043|
| 10| 67| 49| 160|0.6633| 2.9367| 2.333|
| 11| 85| 53| 160|0.6821| 2.981| 2.3822|
+---------+-----+---+------+------+---------+--------+
OK, I have written a solution that is independent of your dataset. Required imports and example data:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, stddev, col}
val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
df.show(5)
// +---+---+
// | X1| X2|
// +---+---+
// | 1| 0|
// | 2| 0|
// | 3| 0|
// | 4| 0|
// | 5| 0|
First compute standard deviation by column:
// no need to rename, but it makes the output more readable when you show stddevs
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)
stddevs.show // stddevs contains the stddev of each column
// +-----------------+---+
// | X1| X2|
// +-----------------+---+
// |288.5307609250702|0.0|
// +-----------------+---+
Collect the first row and filter columns to keep:
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
  .toSeq            // convert to Seq[Any]
  .zip(df.columns)  // zip with column names
  .collect {
    // keep only names where stddev != 0
    case (s: Double, c) if s != 0.0 => col(c)
  }
Select and check the results:
df.select(columnsToKeep: _*).show
// +---+
// | X1|
// +---+
// | 1|
// | 2|
// | 3|
// | 4|
// | 5|
// ...
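Equivalently, you could collect the names whose stddev is 0 and drop them:
// Collect the names of zero-stddev columns and drop them instead.
val columnsToDrop: Seq[String] = stddevs.first.toSeq
  .zip(df.columns)
  .collect { case (s: Double, c) if s == 0.0 => c }

df.drop(columnsToDrop: _*).show(5)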