transform a feature of a spark groupedBy DataFrame - scala

I'm looking for a Scala analogue of the pandas .transform() method.
Namely, I need to create a new feature: the mean of val within each row's class group.
val df = Seq(
("a", 1),
("a", 3),
("b", 3),
("b", 7)
).toDF("class", "val")
+-----+---+
|class|val|
+-----+---+
| a| 1|
| a| 3|
| b| 3|
| b| 7|
+-----+---+
val grouped_df = df.groupBy('class)
Here's the pandas implementation:
df["class_mean"] = df.groupby("class")["val"].transform(
lambda x: x.mean())
So, the desired result is:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
| a| 1| 2.0|
| a| 3| 2.0|
| b| 3| 5.0|
| b| 7| 5.0|
+-----+---+----------+

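If one row per class is enough, you can aggregate:
import org.apache.spark.sql.functions.mean
df.groupBy("class").agg(mean("val").as("class_mean"))
If you want to keep all the rows and columns, use a window function instead:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("class")
df.withColumn("class_mean", mean("val").over(w))
.show(false)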
Output:
+-----+---+----------+
|class|val|class_mean|
+-----+---+----------+
|b |3 |5.0 |
|b |7 |5.0 |
|a |1 |2.0 |
|a |3 |2.0 |
+-----+---+----------+
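The aggregated means can also be joined back onto the original rows, which gives the same shape as the window version. This is a minimal sketch, assuming a local SparkSession; the names classMeans and withMean are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.mean

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("group-mean-join").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 3), ("b", 3), ("b", 7)).toDF("class", "val")

// One row per class, holding the group mean.
val classMeans = df.groupBy("class").agg(mean("val").as("class_mean"))

// Joining on "class" attaches the group mean to every original row.
val withMean = df.join(classMeans, Seq("class"))
withMean.orderBy("class", "val").show(false)
```

The window function is usually the shorter option; the join variant may be handy when the aggregated table is needed on its own as well.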

Related

Spark: explode multiple columns into one

Is it possible to explode multiple columns into one new column in spark? I have a dataframe which looks like this:
userId varA varB
1 [0,2,5] [1,2,9]
desired output:
userId bothVars
1 0
1 2
1 5
1 1
1 2
1 9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
(1, Seq(0, 2, 5), Seq(1, 2, 9)),
(2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")
df.
select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
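Alternatively, use array_union and then explode. Note that array_union removes duplicate elements, so values that occur more than once (such as the 2 for userId 1) appear only once in the result: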
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
.select($"userId",explode(array_union($"varA",$"varB")).as("bothVars"))
.show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+.
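If the repeated values must be kept, as in the desired output of the question, concatenating the two arrays before exploding is another option. A minimal sketch, assuming Spark 2.4+ (where concat accepts array columns) and a local SparkSession; the name exploded is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, explode}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("explode-concat").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Seq(0, 2, 5), Seq(1, 2, 9)),
  (2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")

// concat appends varB to varA and keeps repeated values, unlike array_union.
val exploded = df.select($"userId", explode(concat($"varA", $"varB")).as("bothVars"))
exploded.show(false)
```

Each userId should end up with six rows here, one per element of the two arrays.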

I have a DataFrame in two rows and multiple columns, how to transpose into two columns and multiple rows?

I have a spark DataFrame like this:
+---+---+---+---+---+---+---+
| f1| f2| f3| f4| f5| f6| f7|
+---+---+---+---+---+---+---+
| 5| 4| 5| 2| 5| 5| 5|
+---+---+---+---+---+---+---+
How can I pivot it to the following?
+---+---+
| f1|  5|
| f2|  4|
| f3|  5|
| f4|  2|
| f5|  5|
| f6|  5|
| f7|  5|
+---+---+
Is there a simple way to do this transposition in Spark Scala?
scala> df.show()
+---+---+---+---+---+---+---+
| f1| f2| f3| f4| f5| f6| f7|
+---+---+---+---+---+---+---+
| 5| 4| 5| 2| 5| 5| 5|
+---+---+---+---+---+---+---+
scala> import org.apache.spark.sql.DataFrame
scala> import org.apache.spark.sql.functions._
scala> def transposeUDF(transDF: DataFrame, transBy: Seq[String]): DataFrame = {
| val (cols, types) = transDF.dtypes.filter{ case (c, _) => !transBy.contains(c)}.unzip
| require(types.distinct.size == 1)
|
| val kvs = explode(array(
| cols.map(c => struct(lit(c).alias("columns"), col(c).alias("value"))): _*
| ))
|
| val byExprs = transBy.map(col(_))
|
| transDF
| .select(byExprs :+ kvs.alias("_kvs"): _*)
| .select(byExprs ++ Seq($"_kvs.columns", $"_kvs.value"): _*)
| }
scala> val df1 = df.withColumn("tempColumn", lit("1"))
scala> transposeUDF(df1, Seq("tempColumn")).drop("tempColumn").show(false)
+-------+-----+
|columns|value|
+-------+-----+
|f1 |5 |
|f2 |4 |
|f3 |5 |
|f4 |2 |
|f5 |5 |
|f6 |5 |
|f7 |5 |
+-------+-----+
On Spark 2.4+, you can use map_from_arrays:
scala> var df =Seq(( 5, 4, 5, 2, 5, 5, 5)).toDF("f1", "f2", "f3", "f4", "f5", "f6", "f7")
scala> df.select(array('*).as("v"), lit(df.columns).as("k")).select('v.getItem(0).as("cust_id"), map_from_arrays('k,'v).as("map")).select(explode('map)).show(false)
+---+-----+
|key|value|
+---+-----+
|f1 |5 |
|f2 |4 |
|f3 |5 |
|f4 |2 |
|f5 |5 |
|f6 |5 |
|f7 |5 |
+---+-----+
Hope this helps.
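Another way to express the same transposition is Spark SQL's stack generator, called through selectExpr. A minimal sketch, assuming all value columns share one type (as they do here) and a local SparkSession; the names pairs and transposed are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("stack-transpose").getOrCreate()
import spark.implicits._

val df = Seq((5, 4, 5, 2, 5, 5, 5)).toDF("f1", "f2", "f3", "f4", "f5", "f6", "f7")

// Build the argument list "'f1', `f1`, 'f2', `f2`, ..." from the column names.
val pairs = df.columns.map(c => s"'$c', `$c`").mkString(", ")

// stack(n, k1, v1, ..., kn, vn) emits n rows of (key, value) per input row.
val transposed = df.selectExpr(s"stack(${df.columns.length}, $pairs) as (key, value)")
transposed.show(false)
```

Because the expression is built from df.columns, the same code should work for any number of columns, as long as their types are compatible.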
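I wrote a function:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DataTypes}
object DT {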
val KEY_COL_NAME = "dt_key"
val VALUE_COL_NAME = "dt_value"
def pivot(df: DataFrame, valueDataType: DataType, cols: Array[String], keyColName: String, valueColName: String): DataFrame = {
val tempData: RDD[Row] = df.rdd.flatMap(row => row.getValuesMap(cols).map(Row.fromTuple))
val keyStructField = DataTypes.createStructField(keyColName, DataTypes.StringType, false)
val valueStructField = DataTypes.createStructField(valueColName, DataTypes.StringType, true)
val structType = DataTypes.createStructType(Array(keyStructField, valueStructField))
df.sparkSession.createDataFrame(tempData, structType).select(col(keyColName), col(valueColName).cast(valueDataType))
}
def pivot(df: DataFrame, valueDataType: DataType): DataFrame = {
pivot(df, valueDataType, df.columns, KEY_COL_NAME, VALUE_COL_NAME)
}
}
It worked. Usage:
import org.apache.spark.sql.types.DoubleType
df.show()
DT.pivot(df,DoubleType).show()
The result looks like this:
+---+---+-----------+---+---+ +------+-----------+
| f1| f2| f3| f4| f5| |dt_key| dt_value|
+---+---+-----------+---+---+ to +------+-----------+
|100| 1|0.355072464| 0| 31| | f1| 100.0|
+---+---+-----------+---+---+ | f5| 31.0|
| f3|0.355072464|
| f4| 0.0|
| f2| 1.0|
+------+-----------+
and
+---+---+-----------+-----------+---+ +------+-----------+
| f1| f2| f3| f4| f5| |dt_key| dt_value|
+---+---+-----------+-----------+---+ to +------+-----------+
|100| 1|0.355072464| 0| 31| | f1| 100.0|
| 63| 2|0.622775801|0.685809375| 16| | f5| 31.0|
+---+---+-----------+-----------+---+ | f3|0.355072464|
| f4| 0.0|
| f2| 1.0|
| f1| 63.0|
| f5| 16.0|
| f3|0.622775801|
| f4|0.685809375|
| f2| 2.0|
+------+-----------+
very nice!

Scala/Spark drop duplicates based in other column specific value [duplicate]

I want to drop duplicate rows that share the same ID, keeping only the row that has a specific value in another column (in this case, among rows with the same ID, keep the one where value = 1).
Input df:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 6|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
Result I want:
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 1| 3|
| 4| 1| 5|
| 5| 4| 6|
+---+-----+------+
This can be done by selecting the rows where value is 1 and then left-joining them with the original data:
val df = List(
(3, 0, 2),
(3, 1, 3),
(4, 0, 6),
(4, 1, 5),
(5, 4, 6)
).toDF("id", "value", "sorted")
val withOne = df.filter($"value" === 1)
val joinedWithOriginal = df.alias("orig").join(withOne.alias("one"), Seq("id"), "left")
val result = joinedWithOriginal
.where($"one.value".isNull || $"one.value" === $"orig.value")
.select("orig.id", "orig.value", "orig.sorted")
result.show(false)
Output:
+---+-----+------+
|id |value|sorted|
+---+-----+------+
|3 |1 |3 |
|4 |1 |5 |
|5 |4 |6 |
+---+-----+------+
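The same rule can be sketched without a join by using a window over id: flag whether any row of the group has value = 1, then keep a row if it has value = 1 or if its group has no such row. This is a sketch assuming a local SparkSession; byId, hasOne and has_one are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("dedupe-window").getOrCreate()
import spark.implicits._

val df = List(
  (3, 0, 2),
  (3, 1, 3),
  (4, 0, 6),
  (4, 1, 5),
  (5, 4, 6)
).toDF("id", "value", "sorted")

val byId = Window.partitionBy("id")

// 1 if any row with the same id has value = 1, otherwise 0.
val hasOne = max(when($"value" === 1, 1).otherwise(0)).over(byId)

val result = df
  .withColumn("has_one", hasOne)
  .where($"has_one" === 0 || $"value" === 1)
  .drop("has_one")
result.show(false)
```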

Flatten RDD[(String,Map[String,Int])] to RDD[String,String,Int]

I am trying to flatten an RDD[(String, Map[String, Int])] to an RDD[(String, String, Int)] and ultimately save it as a DataFrame. I have tried:
val rdd=hashedContent.map(f=>(f._1,f._2.flatMap(x=> (x._1, x._2))))
val rdd=hashedContent.map(f=>(f._1,f._2.flatMap(x=>x)))
Both attempts fail with type mismatch errors.
Any help on how to flatten structures like this one?
EDIT:
hashedContent -- ("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("c", Map("dg"->2, "vd"->2, "dgr"->1))
You were close:
rdd.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
.toDF()
.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| A|acs| 2|
| A|sdv| 2|
| A|sfd| 1|
| B|ass| 2|
| B|fvv| 2|
| B|ffd| 1|
| c| dg| 2|
| c| vd| 2|
| c|dgr| 1|
+---+---+---+
Data
val data = Seq(("A", Map("acs"->2, "sdv"->2, "sfd"->1)),
("B", Map("ass"->2, "fvv"->2, "ffd"->1)),
("c", Map("dg"->2, "vd"->2, "dgr"->1)))
val rdd = sc.parallelize(data)
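The attempts in the question appear to fail because flatMap on a Map expects a function that returns a collection, while (x._1, x._2) and x are plain tuples. The working shape can be tried on ordinary Scala collections, without Spark. This is only a sketch on a local Seq; the names flat, key, counts, word and n are made up for illustration.

```scala
val data = Seq(
  ("A", Map("acs" -> 2, "sdv" -> 2, "sfd" -> 1)),
  ("B", Map("ass" -> 2, "fvv" -> 2, "ffd" -> 1))
)

// The inner map builds one triple per map entry; the outer flatMap
// concatenates those per-key collections into a single flat sequence.
val flat: Seq[(String, String, Int)] =
  data.flatMap { case (key, counts) =>
    counts.map { case (word, n) => (key, word, n) }
  }

flat.foreach(println)
```

RDD.flatMap accepts the same kind of function, which is why the one-liner above works on the RDD as well.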
For completeness: an alternative solution (which might be considered more readable) would be to first convert the RDD into a DataFrame, and then to transform its structure using explode:
import org.apache.spark.sql.functions._
import spark.implicits._
rdd.toDF("c1", "map")
.select($"c1", explode($"map"))
.show(false)
// same result:
// +---+---+-----+
// |c1 |key|value|
// +---+---+-----+
// |A |acs|2 |
// |A |sdv|2 |
// |A |sfd|1 |
// |B |ass|2 |
// |B |fvv|2 |
// |B |ffd|1 |
// |c |dg |2 |
// |c |vd |2 |
// |c |dgr|1 |
// +---+---+-----+

Perform Arithmetic Operations on multiple columns in Spark dataframe

I have an input spark-dataframe named df as
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
Here, P1, P2 and P3 are product names, and Total_Count is the sum of P1, P2 and P3. I need to find the frequency of each product by dividing each product's value by Total_Count, and store the result in a new Spark DataFrame named frequencyTable, as follows:
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
I have done this in Scala as follows:
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
if (index != "Main_CustomerID" && index != "Total_Count") {
frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
}
}
But I would rather avoid this for loop because my df is large. What is a more optimized solution?
If you have a DataFrame such as
val df = Seq(
("725153", 1, 0, 2, 3),
("873008", 0, 0, 3, 3),
("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")
+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153 |1 |0 |2 |3 |
|873008 |0 |0 |3 |3 |
|625109 |1 |1 |0 |2 |
+---------------+---+---+---+-----------+
You can simply use foldLeft over the columns other than Main_CustomerID and Total_Count, i.e. over P1, P2 and P3:
val df_columns = df.columns.filterNot(Set("Main_CustomerID", "Total_Count")).toList
df_columns.foldLeft(df){(tempdf, colName) => tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))}.show(false)
which should give you
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153 |0.3333333333333333|0.0|0.6666666666666666|3 |
|873008 |0.0 |0.0|1.0 |3 |
|625109 |0.5 |0.5|0.0 |2 |
+---------------+------------------+---+------------------+-----------+
I hope the answer is helpful
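Both the for loop and foldLeft iterate over column names on the driver, not over rows, so neither scans the data per iteration. Each withColumn call does add a projection to the query plan, though, so with many columns a single select may keep the plan smaller. A minimal sketch, assuming a local SparkSession; keep and projected are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("frequency-select").getOrCreate()
import spark.implicits._

val df = Seq(
  ("725153", 1, 0, 2, 3),
  ("873008", 0, 0, 3, 3),
  ("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")

// Columns that are passed through unchanged.
val keep = Set("Main_CustomerID", "Total_Count")

// One expression per column: pass through, or divide by Total_Count under the same name.
val projected = df.columns.map { c =>
  if (keep(c)) col(c) else (col(c) / col("Total_Count")).as(c)
}

val frequencyTable = df.select(projected: _*)
frequencyTable.show(false)
```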