Apply function on all rows of dataframe [duplicate] - scala

I want to apply a function to every row of a DataFrame.
Example:
|A |B |C |
|1 |3 |5 |
|6 |2 |0 |
|8 |2 |7 |
|0 |9 |4 |
myFunction(df)
def myFunction(df: DataFrame): DataFrame = {
  // Apply sum of columns on each row
  ???
}
Wanted output:
1+3+5 = 9
6+2+0 = 8
...
How can that be done in Scala, please? I followed this but had no luck.

It's simple. You don't need to write a function for this; you can create a new column by adding up all the columns you want.
scala> df.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 1| 2| 4|
| 1| 2| 5|
+---+---+---+
scala> import org.apache.spark.sql.functions.col
scala> df.withColumn("sum",col("A")+col("B")+col("C")).show
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 1| 2| 3| 6|
| 1| 2| 4| 7|
| 1| 2| 5| 8|
+---+---+---+---+
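Edit:
Alternatively, you can run map on each row and compute the sum by reading the fields by index or by name. This needs an implicit encoder (import spark.implicits._, which spark-shell imports automatically).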
scala> df.map(x=>x.getInt(0) + x.getInt(1) + x.getInt(2)).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
scala> df.map(x=>x.getInt(x.fieldIndex("A")) + x.getInt(x.fieldIndex("B")) + x.getInt(x.fieldIndex("C"))).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+

map is the solution if you want to apply a function to every row of a DataFrame. For every Row you can return a tuple, and a new Dataset is built from the results (an RDD in Spark 1.x).
This is perfect when working with a Dataset or an RDD, but less so for a DataFrame. For your use case, and for DataFrames in general, I would recommend simply adding a column and using Column objects to do what you want.
// Using expr
df.withColumn("TOTAL", expr("A+B+C"))
// Using columns
df.withColumn("TOTAL", col("A")+col("B")+col("C"))
// Using dynamic selection of all columns
df.withColumn("TOTAL", df.columns.map(col).reduce((c1, c2) => c1 + c2))
In that case, you'll be very interested in this question.
A UDF is also a good solution; it is better explained here.
If you don't want to keep the source columns, you can replace .withColumn(name, value) with .select(value.alias(name)).

Related

Pyspark: How to build sum of a column(which contains negative and positive values) with stop at 0

I think an example says more than the description.
The right column "sum" is the one I am looking for.
to_count|sum
-------------
-1 |0
+1 |1
-1 |0
-1 |0
+1 |1
+1 |2
-1 |1
+1 |2
. |.
. |.
I tried to rebuild that with several groupings, comparing lead and lag, but that only works the first time; after that, the sum usually ends up at a negative value.
Summing the positive and negative values separately also gives a different final result.
It would be great if anyone has a good idea how to solve this in PySpark!
I would use pandas_udf here:
import math
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'g': [1]*8, 'id': range(8), 'value': [-1, 1, -1, -1, 1, 1, -1, 1]})
df = spark.createDataFrame(pdf)
# Placeholder double column so that df.schema already contains 'cumsum'
df = df.withColumn('cumsum', F.lit(math.inf))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
    # Order the rows of the group, then keep a running sum that never drops below 0
    pdf.sort_values(by=['id'], inplace=True, ascending=True)
    cumsums = []
    prev = 0
    for v in pdf['value'].values:
        prev = max(prev + v, 0)
        cumsums.append(prev)
    pdf['cumsum'] = cumsums
    return pdf

df = df.groupby('g').apply(_calc_cumsum)
df.show()
The results:
+---+---+-----+------+
| g| id|value|cumsum|
+---+---+-----+------+
| 1| 0| -1| 0.0|
| 1| 1| 1| 1.0|
| 1| 2| -1| 0.0|
| 1| 3| -1| 0.0|
| 1| 4| 1| 1.0|
| 1| 5| 1| 2.0|
| 1| 6| -1| 1.0|
| 1| 7| 1| 2.0|
+---+---+-----+------+
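The loop inside the UDF can be checked on its own, without Spark or pandas. The following is a minimal plain-Python sketch of the same "running sum that stops at 0" rule; the function name clamped_cumsum is made up for this illustration, and the input is the to_count column from the question.

```python
def clamped_cumsum(values):
    """Running sum that is held at 0 whenever it would go negative."""
    out = []
    running = 0
    for v in values:
        # Add the next value, but never let the total drop below 0
        running = max(running + v, 0)
        out.append(running)
    return out


to_count = [-1, 1, -1, -1, 1, 1, -1, 1]
result = clamped_cumsum(to_count)
print(result)  # [0, 1, 0, 0, 1, 2, 1, 2]
```

The printed list matches the "sum" column requested in the question.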
Please look at the picture first: it shows a test dataset (the first 3 columns) and the calculation steps.
The column "flag" is now in another format. We also checked our data source and realized that we only have to handle 1 and -1 entries. We mapped 1 to 0 and -1 to 1. Now it works as expected, as you can see in the column "result".
The code is:
from pyspark.sql import functions as F
from pyspark.sql import Window

w1 = Window.partitionBy('group').orderBy('order')
df_0 = tst.withColumn('edge_det', F.when(((F.col('flag') == 0) & ((F.lag('flag', default=1).over(w1)) == 1)), 1).otherwise(0))
df_0 = df_0.withColumn('edge_cyl', F.sum('edge_det').over(w1))
df1 = df_0.withColumn('condition', F.when(F.col('edge_cyl') == 0, 0).otherwise(F.when(F.col('flag') == 1, -1).otherwise(1)))
df1 = df1.withColumn('cond_sum', F.sum('condition').over(w1))
cond = (F.col('cond_sum') >= 0) | (F.col('condition') == 1)
df2 = df1.withColumn('new_cond', F.when(cond, F.col('condition')).otherwise(0))
df3 = df2.withColumn("result", F.sum('new_cond').over(w1))

Sum of column in sqlDataframe without using groupBy or agg functions in scala/spark

For the dataframe given below, I want a new column that holds, in every row, the same constant value: the sum of the freq column.
+------+----+
|number|freq|
+------+----+
| 8| 1|
| 6| 2|
| 2| 4|
+------+----+
The result should look like
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
| 8| 1| 7|
| 6| 2| 7|
| 2| 4| 7|
+------+----+-------+
I want this without using groupBy or agg.
I tried:
var x = sum(df("freq"))
df.withColumn("new_col",lit(x))
or
df.withColumn("new_col",x)
or
df.withColumn("new_col",sum($"freq"))
But none worked.
You can try this, but be careful: it moves all the data to a single partition.
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(8,1),
(6,2),
(2,4)
).toDF("number","freq")
df.withColumn("new_col", sum($"freq").over())
.show(false)
+------+----+-------+
|number|freq|new_col|
+------+----+-------+
|8 |1 |7 |
|6 |2 |7 |
|2 |4 |7 |
+------+----+-------+
You could use a window over the entire dataframe to do that, but I strongly recommend against it: all the data would have to go to a single partition, which would be terrible for performance.
A simple way to do it, very similar to your first approach, is to compute the sum once and attach it as a literal:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, sum}
val Row(x) = df.select(sum('freq)).head
val new_df = df.withColumn("new_col", lit(x))
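The idea behind this second approach (aggregate once, then attach the result as a constant) can be sketched in plain Python, independent of Spark. The variable names are chosen only for this illustration; the rows are the sample data from the question.

```python
# (number, freq) rows from the question
rows = [(8, 1), (6, 2), (2, 4)]

# Step 1: compute the total of the freq column once
total = sum(freq for _, freq in rows)

# Step 2: attach that constant to every row as a third field
with_total = [(number, freq, total) for number, freq in rows]

for row in with_total:
    print(row)
```

Every row ends up with the same third value, which is what lit(x) does in the Spark version.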

How to create a sequence of events (column values) per some other column?

I have a Spark data frame as shown below -
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
| 1| A| 100| 0| 0|
| 1| E| 200| 0| 0|
| 1| | 300| 1| 49|
| 2| A| 200| 0| 0|
| 2| C| 300| 0| 0|
| 2| D| 100| 0| 0|
+-------+-------+---------+-------------+------+
I would like to create a sequence dataframe from myDF that traces each visitor's path to purchase, ordered by timestamp.
The output dataframe should look like the one below (-> can be any delimiter):
+-------+---------------------+
|visitor|channel sequence |
+-------+---------------------+
| 1| A->E->purchase |
| 2| D->A->C->no_purchase|
+-------+---------------------+
To make things clear, visitor 2 has been exposed to channel D, then A and then C, and does not make a purchase.
Hence the sequence is to be formed as D->A->C->no_purchase.
NOTE: Whenever a purchase happens, the channel value is blank and purchase_flag is set to 1.
I want to do this using a Scala UDF in Spark so that I can re-apply the method to other datasets.
Here is how it can be done with a udf function:
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
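import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// Sort each visitor's (channel, timestamp) pairs by timestamp, drop blank channels,
// join the channels with "->", then append the purchase outcome
def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2)
    .map(_._1)
    .filterNot(_ == "")
    .mkString("->") + (if (purchased.contains(1)) "->purchase" else "->no_purchase"))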
myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
.select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
.show(false)
which should give you
+-------+--------------------+
|visitor|channel sequence |
+-------+--------------------+
|1 |A->E->purchase |
|2 |D->A->C->no_purchase|
+-------+--------------------+
You can make it as generic as you need; this is just a demo of how to proceed.
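To make the per-visitor logic easier to follow, here is a plain-Python sketch of the same steps (group by visitor, sort by timestamp, drop blank channels, append the outcome). It does not use Spark; the function name channel_sequences and its sep parameter are hypothetical and exist only for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# (visitor, channel, timestamp, purchase_flag, amount) rows from the question
rows = [
    (1, "A", 100, 0, 0),
    (1, "E", 200, 0, 0),
    (1, "", 300, 1, 49),
    (2, "A", 200, 0, 0),
    (2, "C", 300, 0, 0),
    (2, "D", 100, 0, 0),
]


def channel_sequences(rows, sep="->"):
    """Return {visitor: 'ch1->ch2->...->purchase|no_purchase'}."""
    result = {}
    # groupby needs its input sorted by the grouping key
    by_visitor = sorted(rows, key=itemgetter(0))
    for visitor, group in groupby(by_visitor, key=itemgetter(0)):
        events = sorted(group, key=itemgetter(2))  # order by timestamp
        channels = [e[1] for e in events if e[1] != ""]  # skip blank channels
        purchased = any(e[3] == 1 for e in events)
        outcome = "purchase" if purchased else "no_purchase"
        result[visitor] = sep.join(channels + [outcome])
    return result


sequences = channel_sequences(rows)
print(sequences)
```

For the sample data this gives A->E->purchase for visitor 1 and D->A->C->no_purchase for visitor 2, the same strings the UDF produces.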

spark withColumn: create a column duplicating values from an existing column

I am having a problem figuring this out. Here is the problem statement:
Let's say I have a dataframe. I want to select the value of column C where column B is foo, then create a new column D that repeats that value ("3") for all rows.
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
I want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
First, extract the value of column C (i.e. 3) from the row where column B is foo:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then pass that value to the lit function inside withColumn to create the new column D:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.

How to perform division operation in dataFrame Spark using Scala?

I have a DataFrame like the one below.
+---+---+-----+
|uId| Id| sum |
+---+---+-----+
| 3| 1| 1.0|
| 7| 1| 1.0|
| 1| 2| 3.0|
| 1| 1| 1.0|
| 6| 5| 1.0|
+---+---+-----+
Using the above DataFrame, I want to generate a new DataFrame in which each sum is divided by the number of rows that share the same Id, i.e. new sum = old sum * 1 / count of rows with that Id.
For example:
For uId=3 and Id=1, Id 1 occurs 3 times, so the sum should be
1.0*1/3=0.333
For uId=7 and Id=1, Id 1 occurs 3 times, so the sum should be
1.0*1/3=0.333
For uId=1 and Id=2, Id 2 occurs once, so the sum should be
3.0*1/1=3.0
For uId=6 and Id=5, Id 5 occurs once, so the sum should be
1.0*1/1=1.0
My final output should be:
+---+---+-------+
|uId| Id|    sum|
+---+---+-------+
|  3|  1|0.33333|
|  7|  1|0.33333|
|  1|  2|    3.0|
|  1|  1|0.33333|
|  6|  5|    1.0|
+---+---+-------+
You can use a Window function to get the row count of each group of the id column, and then divide the original sum by that count:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("id")
import org.apache.spark.sql.functions._
df.withColumn("sum", $"sum"/count("id").over(windowSpec))
You should get the final dataframe as
+---+---+------------------+
|uId|Id |sum |
+---+---+------------------+
|3 |1 |0.3333333333333333|
|7 |1 |0.3333333333333333|
|1 |1 |0.3333333333333333|
|6 |5 |1.0 |
|1 |2 |3.0 |
+---+---+------------------+
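The arithmetic behind the window expression can be checked without Spark. The following plain-Python sketch counts the rows per Id and divides each sum by that count; the variable names are chosen only for this illustration, and the rows are the sample data from the question.

```python
from collections import Counter

# (uId, Id, sum) rows from the question
rows = [(3, 1, 1.0), (7, 1, 1.0), (1, 2, 3.0), (1, 1, 1.0), (6, 5, 1.0)]

# Number of rows per Id; plays the role of count("id").over(windowSpec)
counts = Counter(id_ for _, id_, _ in rows)

# Divide each row's sum by the size of its Id group
scaled = [(uid, id_, s / counts[id_]) for uid, id_, s in rows]

for row in scaled:
    print(row)
```

Rows with Id 1 are divided by 3, while the rows with Id 2 and Id 5 keep their original value because each of those groups has a single row.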