How to perform a division operation on a Spark DataFrame using Scala? - scala

I have a DataFrame like the one below.
+---+---+----+
|uId| Id| sum|
+---+---+----+
|  3|  1| 1.0|
|  7|  1| 1.0|
|  1|  2| 3.0|
|  1|  1| 1.0|
|  6|  5| 1.0|
+---+---+----+
Using the above DataFrame, I want to generate the new DataFrame shown below.
The sum column should be computed as follows:
For example:
For uid=3 and id=1, my sum column value should be (old sum value * 1 / count of id 1), i.e. 1.0*1/3 = 0.333
For uid=7 and id=1, my sum column value should be (old sum value * 1 / count of id 1), i.e. 1.0*1/3 = 0.333
For uid=1 and id=2, my sum column value should be (old sum value * 1 / count of id 2), i.e. 3.0*1/1 = 3.0
For uid=6 and id=5, my sum column value should be (old sum value * 1 / count of id 5), i.e. 1.0*1/1 = 1.0
My final output should be:
+---+---+-------+
|uId| Id|    sum|
+---+---+-------+
|  3|  1|0.33333|
|  7|  1|0.33333|
|  1|  2|    3.0|
|  1|  1|0.33333|
|  6|  5|    1.0|
+---+---+-------+

You can use a Window function to get the count of each group of the id column, and then divide the original sum by that count.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Count the rows in each id partition and divide the existing sum by that count.
val windowSpec = Window.partitionBy("id")
df.withColumn("sum", $"sum" / count("id").over(windowSpec))
You should get the final DataFrame as:
+---+---+------------------+
|uId|Id |sum |
+---+---+------------------+
|3 |1 |0.3333333333333333|
|7 |1 |0.3333333333333333|
|1 |1 |0.3333333333333333|
|6 |5 |1.0 |
|1 |2 |3.0 |
+---+---+------------------+
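If you prefer to avoid a window, the same result can be reached with a groupBy followed by a join; a minimal sketch using the same column names (the cnt alias is just illustrative):

import org.apache.spark.sql.functions.{col, count}

// Count the rows per Id, join the counts back, and divide the original sum.
val counts = df.groupBy("Id").agg(count("*").as("cnt"))
val result = df.join(counts, Seq("Id"))
  .withColumn("sum", col("sum") / col("cnt"))
  .drop("cnt")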

Related

How to perform one-to-many mapping on a Spark Scala DataFrame column using flatMap

I am specifically looking for a flatMap solution to the problem of mocking a data column in a Spark Scala DataFrame by duplicating data, i.e. a one-to-many mapping inside flatMap.
My given data is something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
and my expectation after doing a 1-to-3 mapping of the id column will be something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please feel free to let me know if there is any clarification required on the requirement part
Thanks in advance!!!
I see that you are attempting to generate data with a requirement of re-using values in the ID column.
You can just select the ID column and generate random values and do a union back to your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Reuse the id column and generate random testName and marks values.
val newRecords = data.select("id")
  .withColumn("testName", concat(lit("name_"), (rand()*10).cast(IntegerType).cast(StringType)))
  .withColumn("marks", (rand()*100).cast(IntegerType))
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.union(newRecords)
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation portion of the code in a loop and union all of the generated DataFrames.
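Since the question asks for flatMap specifically, here is a minimal sketch of that route, reusing the data DataFrame defined above: for each source row it keeps the original and emits two null-padded copies (as Options), which gives the 1-to-3 mapping described in the question.

import spark.implicits._   // assumes a SparkSession named spark, as in spark-shell

// Keep the original row and add two copies that retain only the id,
// with the other columns set to null (None).
val expanded = data.flatMap { row =>
  val id = row.getInt(0)
  val original = (id, Option(row.getString(1)), Option(row.getInt(2)))
  original +: Seq.fill(2)((id, None: Option[String], None: Option[Int]))
}.toDF("id", "testName", "marks")

expanded.show()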

Pyspark: How to build a sum of a column (which contains negative and positive values) with a stop at 0

I think an example says more than a description.
The right column "sum" is the one I am looking for.
to_count|sum
-------------
-1 |0
+1 |1
-1 |0
-1 |0
+1 |1
+1 |2
-1 |1
+1 |2
. |.
. |.
I tried to rebuild that with several groupings, comparing lead and lag, but that only works the first time; afterwards the sum usually ends in a negative value.
Summing the positive and negative values separately also ends in a different final result.
It would be great if anyone has a good idea how to solve this in PySpark!
I would use pandas_udf here:
import math

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'g': [1]*8, 'id': range(8), 'value': [-1, 1, -1, -1, 1, 1, -1, 1]})
df = spark.createDataFrame(pdf)
df = df.withColumn('cumsum', F.lit(math.inf))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
    # Sort by id, then fold the values with a floor at zero.
    pdf.sort_values(by=['id'], inplace=True, ascending=True)
    cumsums = []
    prev = 0
    for v in pdf['value'].values:
        prev = max(prev + v, 0)
        cumsums.append(prev)
    pdf['cumsum'] = cumsums
    return pdf

df = df.groupby('g').apply(_calc_cumsum)
df.show()
The results:
+---+---+-----+------+
| g| id|value|cumsum|
+---+---+-----+------+
| 1| 0| -1| 0.0|
| 1| 1| 1| 1.0|
| 1| 2| -1| 0.0|
| 1| 3| -1| 0.0|
| 1| 4| 1| 1.0|
| 1| 5| 1| 2.0|
| 1| 6| -1| 1.0|
| 1| 7| 1| 2.0|
+---+---+-----+------+
Please look at the picture first: it shows the test dataset (the first 3 columns) and the calculation steps.
The column "flag" is now in another format. We also checked our data source and realized that we only have to handle 1 and -1 entries. We mapped 1 to 0 and -1 to 1. Now it works as expected, as you can see in the "result" column.
The code is this:
from pyspark.sql import Window
from pyspark.sql import functions as F

w1 = Window.partitionBy('group').orderBy('order')
df_0 = tst.withColumn('edge_det',F.when(((F.col('flag')==0)&((F.lag('flag',default=1).over(w1))==1)),1).otherwise(0))
df_0 = df_0.withColumn('edge_cyl',F.sum('edge_det').over(w1))
df1 = df_0.withColumn('condition', F.when(F.col('edge_cyl')==0,0).otherwise(F.when(F.col('flag')==1,-1).otherwise(1)))
df1 =df1.withColumn('cond_sum',F.sum('condition').over(w1))
cond = (F.col('cond_sum')>=0)|(F.col('condition')==1)
df2 = df1.withColumn('new_cond',F.when(cond,F.col('condition')).otherwise(0))
df3 = df2.withColumn("result",F.sum('new_cond').over(w1))

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum-length values in a Spark DataFrame. Each column has a value of string type.
The dataframe is like:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | | d |
|3 |b | c | a | d |
+-----+---+----+---+---+
The expectation is:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I can't figure out how to do this easily using Spark...
Thanks in advance
Note: This approach takes care of any addition/deletion of columns to the DataFrame, without the need for code changes.
It can be done by first finding the length of all columns (except the first one) concatenated together, and then filtering out all rows except those with the maximum length per id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// Concatenate every column except id, measure the length per row, and keep
// only the rows whose length equals the per-id maximum.
val output = input.withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
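For reference, the input DataFrame used above can be built from the question's sample like this (the blank cells are assumed to be empty strings):

import spark.implicits._

// Sample rows from the question; "" stands in for the blank cells.
val input = Seq(
  ("1", "toto", "tata", "titi", ""),
  ("1", "toto", "tata", "titi", "tutu"),
  ("2", "bla",  "blo",  "",     ""),
  ("3", "b",    "c",    "",     "d"),
  ("3", "b",    "c",    "a",    "d")
).toDF("id", "A", "B", "C", "D")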
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(concat_ws("",collect_set(col("A"))).alias("A"),concat_ws("",collect_set(col("B"))).alias("B"),concat_ws("",collect_set(col("C"))).alias("C"),concat_ws("",collect_set(col("D"))).alias("D")).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+
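The same aggregation can also be written dynamically over every non-id column, which avoids spelling out each collect_set by hand (a sketch under the same assumptions as above):

import org.apache.spark.sql.functions.{col, collect_set, concat_ws}

// One concat_ws(collect_set(...)) aggregation per non-id column.
val aggs = df.columns.filterNot(_ == "id")
  .map(c => concat_ws("", collect_set(col(c))).alias(c))
df.groupBy("id").agg(aggs.head, aggs.tail: _*).show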

Apply function on all rows of dataframe [duplicate]

This question already has answers here:
Process all columns / the entire row in a Spark UDF
I want to apply a function to all rows of a DataFrame.
Example:
|A |B |C |
|1 |3 |5 |
|6 |2 |0 |
|8 |2 |7 |
|0 |9 |4 |
Myfunction(df)
Myfunction(df: DataFrame):{
//Apply sum of columns on each row
}
Wanted output:
1+3+5 = 9
6+2+0 = 8
...
How can that be done in Scala, please? I followed this but had no luck.
It's simple. You don't need to write any function for this; all you need to do is create a new column by summing up the columns you want.
scala> df.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 1| 2| 4|
| 1| 2| 5|
+---+---+---+
scala> df.withColumn("sum",col("A")+col("B")+col("C")).show
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 1| 2| 3| 6|
| 1| 2| 4| 7|
| 1| 2| 5| 8|
+---+---+---+---+
Edited:
Well, you can run a map function on each row and get the sum using the row index or field name.
scala> df.map(x=>x.getInt(0) + x.getInt(1) + x.getInt(2)).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
scala> df.map(x=>x.getInt(x.fieldIndex("A")) + x.getInt(x.fieldIndex("B")) + x.getInt(x.fieldIndex("C"))).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
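If the columns aren't known in advance, the same map approach can be written dynamically; a sketch assuming every column is an integer (spark.implicits._ must be in scope, as it is in spark-shell):

// Sum all integer columns of each row, however many there are.
df.map(row => (0 until row.length).map(row.getInt).sum).toDF("sum").show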
Map is the solution if you want to apply a function to every row of a DataFrame. For every Row, you can return a tuple, and a new RDD is made.
This is perfect when working with a Dataset or an RDD, but not really for a DataFrame. For your use case and for a DataFrame, I would recommend just adding a column and using Column objects to do what you want.
// Using expr
df.withColumn("TOTAL", expr("A+B+C"))
// Using columns
df.withColumn("TOTAL", col("A")+col("B")+col("C"))
// Using dynamic selection of all columns
df.withColumn("TOTAL", df.colums.map(col).reduce((c1, c2) => c1 + c2))
In that case, you'll be very interested in this question.
UDF is also a good solution and is better explained here.
If you don't want to keep source columns, you can replace .withColumn(name, value) with .select(value.alias(name))
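A minimal sketch of that last suggestion combined with the dynamic column selection above, so only the total is kept (assumes df is the numeric DataFrame shown earlier):

import org.apache.spark.sql.functions.col

// Sum every column dynamically and keep only the result.
val total = df.columns.map(col).reduce(_ + _)
df.select(total.alias("TOTAL")).show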

Fill null values in dataframe column with next value

I have to fill the first null values with the next non-null value of the same column in a DataFrame. This logic applies only to the first consecutive null values of the column.
I have a dataframe with similar to below
// I replaced null with 0 in the value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this DataFrame I am expecting the result below:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16   |exA |30  | // Change the value 0 to 16 in the value column
|16   |exB |22  | // Change the value 0 to 16 in the value column
|16   |exC |19  | // Change the value 0 to 16 in the value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0    |exG |12  | // value should not be changed here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
You can use a Window function for this purpose:
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.orderBy($"col2")
.show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The expression df.orderBy($"col2") is needed only to show the final results in the right order. You can skip it if you don't care about the final order.
UPDATE
To get exactly what you need, you have to use slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
I think you need to take the first non-null or non-zero value based on col2's order. Please find the script below. I have created a temporary view in Spark's memory to write SQL.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.createOrReplaceTempView("table_df")
spark.sql("""
  with cte as (select *, row_number() over (order by col2) as rno from table_df)
  select case
           when value = 0 and rno < (select min(rno) from cte where value != 0)
             then (select value from cte where rno = (select min(rno) from cte where value != 0))
           else value
         end as value,
         col2, col3
  from cte
""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental id to your DF:
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
Filter the DF to keep only the non-zero values:
val df_2 = df_1.filter("value != 0")
Create a variable "limit" to bound the first N rows that we need, and a variable "nVal" holding the first non-zero value:
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF with a column of the same name ("value"), using a condition:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
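Putting the pieces together, df_4 should now contain the leading zeros replaced by the first non-zero value (16 for the sample data), while the later zero at exG is left untouched. For example:

// Drop the helper column to get back to the original shape.
df_4.drop("UniqueID").show(false)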