Spark: Transpose DataFrame Without Aggregating - scala

I have looked at a number of questions online, but they don't seem to do what I'm trying to achieve.
I'm using Apache Spark 2.0.2 with Scala.
I have a dataframe:
+----------+----+----+----+----+----+----+
|segment_id|val1|val2|val3|val4|val5|val6|
+----------+----+----+----+----+----+----+
|         1| 100|   0|   0|   0|   0|   0|
|         2|   0|  50|   0|   0|  20|   0|
|         3|   0|   0|   0|   0|   0|   0|
|         4|   0|   0|   0|   0|   0|   0|
+----------+----+----+----+----+----+----+
which I want to transpose to
+----+---+---+---+---+
|vals|  1|  2|  3|  4|
+----+---+---+---+---+
|val1|100|  0|  0|  0|
|val2|  0| 50|  0|  0|
|val3|  0|  0|  0|  0|
|val4|  0|  0|  0|  0|
|val5|  0| 20|  0|  0|
|val6|  0|  0|  0|  0|
+----+---+---+---+---+
I've tried using pivot(), but I couldn't get the right answer. I ended up looping through my val{x} columns and pivoting each one as shown below, but this is proving to be very slow.
val d = df.select('segment_id, 'val1)
+----------+----+
|segment_id|val1|
+----------+----+
|         1| 100|
|         2|   0|
|         3|   0|
|         4|   0|
+----------+----+
d.groupBy('val1).sum().withColumnRenamed("val1", "vals")
+----+---+---+---+---+
|vals|  1|  2|  3|  4|
+----+---+---+---+---+
|val1|100|  0|  0|  0|
+----+---+---+---+---+
Then I union() the result of each val{x} iteration onto my first dataframe.
+----+---+---+---+---+
|vals|  1|  2|  3|  4|
+----+---+---+---+---+
|val2|  0| 50|  0|  0|
+----+---+---+---+---+
Is there a more efficient way to do this transpose, given that I do not want to aggregate the data?
Thanks :)

Unfortunately there is no case in which both of the following hold:
the amount of data justifies using a Spark DataFrame, and
a transposition of that data is feasible.
You have to remember that a DataFrame, as implemented in Spark, is a distributed collection of rows, and each row is stored and processed on a single node.
You can express a transposition on a DataFrame as a pivot:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ column syntax; assumes a SparkSession named spark, as in spark-shell

// Turn every val{x} column into a (column name, value) struct and explode,
// so each cell becomes its own row.
val kv = explode(array(df.columns.tail.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))

df
  .withColumn("kv", kv)
  .select($"segment_id", $"kv.k", $"kv.v")
  .groupBy($"k")
  .pivot("segment_id")
  .agg(first($"v")) // exactly one value per (k, segment_id) pair, so first() is not a real aggregation
  .orderBy($"k")
  .withColumnRenamed("k", "vals")
but it is merely toy code with no practical application. In practice, it is no better than simply collecting the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Collect everything to the driver and transpose locally.
val (header, data) = df.collect.map(_.toSeq.toArray).transpose match {
  case Array(h, t @ _*) =>
    (h.map(_.toString), t.map(_.collect { case x: Int => x }))
}

val rows = df.columns.tail.zip(data).map { case (x, ys) => Row.fromSeq(x +: ys) }

val schema = StructType(
  StructField("vals", StringType) +: header.map(StructField(_, IntegerType))
)

spark.createDataFrame(sc.parallelize(rows), schema)
For DataFrame defined as:
val df = Seq(
  (1, 100, 0, 0, 0, 0, 0),
  (2, 0, 50, 0, 0, 20, 0),
  (3, 0, 0, 0, 0, 0, 0),
  (4, 0, 0, 0, 0, 0, 0)
).toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
both would give you the desired result:
+----+---+---+---+---+
|vals|  1|  2|  3|  4|
+----+---+---+---+---+
|val1|100|  0|  0|  0|
|val2|  0| 50|  0|  0|
|val3|  0|  0|  0|  0|
|val4|  0|  0|  0|  0|
|val5|  0| 20|  0|  0|
|val6|  0|  0|  0|  0|
+----+---+---+---+---+
That being said, if you need an efficient transposition on a distributed data structure, you will have to look elsewhere. There are a number of structures, including CoordinateMatrix and BlockMatrix from MLlib's distributed linear algebra package, which distribute data across both dimensions and can be transposed.
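For example, here is a minimal sketch (an illustration of the idea, not production code) that builds a CoordinateMatrix from the example DataFrame and transposes it without collecting anything to the driver. The val{x} labels are not kept inside the matrix, so they would have to be tracked separately:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// One MatrixEntry per cell: the row index comes from segment_id,
// the column index from the position of the val{x} column.
val entries = df.rdd.flatMap { row =>
  val i = row.getInt(0).toLong - 1                          // segment_id 1..4 -> rows 0..3
  (1 until row.length).map { j =>
    MatrixEntry(i, (j - 1).toLong, row.getInt(j).toDouble)  // val1..val6 -> columns 0..5
  }
}

val matrix = new CoordinateMatrix(entries)
val transposed = matrix.transpose() // still distributed; nothing is collected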

In Python this can be done in a simple way: I normally use the transpose function in pandas after converting the Spark DataFrame:
spark_df.toPandas().T
Keep in mind that toPandas() collects the whole DataFrame to the driver, so this only works when the data fits in driver memory.

Here is a solution for PySpark, using the pandas-on-Spark transpose:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
Note that .T and sort_index() below are pandas-on-Spark operations, so the Spark DataFrame has to be converted first (for example with df.pandas_api() on Spark 3.2+), and on a pandas-on-Spark frame the rename in step 3 would be done with rename(columns=...) rather than withColumnRenamed.
Here is the solution code for your problem:
Step 1: Choose the columns.
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
This step produces a data frame like this:
+----+----+----+----+----+----+----------+
|val1|val2|val3|val4|val5|val6|segment_id|
+----+----+----+----+----+----+----------+
| 100|   0|   0|   0|   0|   0|         1|
|   0|  50|   0|   0|  20|   0|         2|
|   0|   0|   0|   0|   0|   0|         3|
|   0|   0|   0|   0|   0|   0|         4|
+----+----+----+----+----+----+----------+
Step 2: Transpose the whole table.
d_transposed = d.T.sort_index()
This step produces a data frame like this:
+----------+----+----+----+----+
|segment_id|   1|   2|   3|   4|
+----------+----+----+----+----+
|      val1| 100|   0|   0|   0|
|      val2|   0|  50|   0|   0|
|      val3|   0|   0|   0|   0|
|      val4|   0|   0|   0|   0|
|      val5|   0|  20|   0|   0|
|      val6|   0|   0|   0|   0|
+----------+----+----+----+----+
Step 3: You need to rename the segment_id to vals:
d_transposed.withColumnRenamed("segment_id","vals")
+----+---+---+---+---+
|vals|  1|  2|  3|  4|
+----+---+---+---+---+
|val1|100|  0|  0|  0|
|val2|  0| 50|  0|  0|
|val3|  0|  0|  0|  0|
|val4|  0|  0|  0|  0|
|val5|  0| 20|  0|  0|
|val6|  0|  0|  0|  0|
+----+---+---+---+---+
Here is your full code:
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
d_transposed = d.T.sort_index()
d_transposed.withColumnRenamed("segment_id","vals")

This should do it: flatten each row into (segment_id, column name, value) triples with flatMap, then pivot back on segment_id.
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val seq = Seq((1, 100, 0, 0, 0, 0, 0), (2, 0, 50, 0, 0, 20, 0), (3, 0, 0, 0, 0, 0, 0), (4, 0, 0, 0, 0, 0, 0))
val df1 = seq.toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
df1.show()

val schema = df1.schema

// Flatten every row into (segment_id, column name, value) triples.
val df2 = df1.flatMap(row => {
  val metric = row.getInt(0)
  (1 until row.size).map(i => {
    (metric, schema(i).name, row.getInt(i))
  })
})

val df3 = df2.toDF("metric", "vals", "value")
df3.show()

// Pivot back: one row per val{x}, one column per segment_id.
// first() is effectively a no-op because there is exactly one value per cell.
val df4 = df3.groupBy("vals").pivot("metric").agg(first("value"))
df4.show()
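If you also want the rows in the order shown in the desired output, you can append a sort:
df4.orderBy("vals").show()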

Related

How to count rows of a group and add groups of count zero in Spark Dataset?

I'm using Spark in Scala, with Datasets, and I'm facing the following situation. (EDIT: I've rewritten the examples to be more concrete. This is meant to be put in a method and reused with different datasets.)
Suppose we have the following Dataset, representing moves from region a to region b at specific timestamps:
case class Move(id: Int, ra: Int, rb: Int, time_a: Timestamp, time_b: Timestamp)
val ds = Seq(
  (1, 123, 125, "2021-07-25 13:05:20", "2021-07-25 15:05:20"),
  (2, 470, 125, "2021-07-25 00:05:20", "2021-07-25 02:05:20"),
  (1, 470, 125, "2021-07-25 02:05:20", "2021-07-25 04:05:20"),
  (3, 123, 125, "2021-07-26 00:45:20", "2021-07-26 02:45:20"),
  (3, 125, 123, "2021-07-28 16:05:20", "2021-07-28 18:05:20"),
  (1, 125, 123, "2021-07-29 20:05:20", "2021-07-30 01:05:20")
).toDF("id", "ra", "rb", "time_a", "time_b")
  .withColumn("time_a", to_timestamp($"time_a"))
  .withColumn("time_b", to_timestamp($"time_b"))
  .as[Move]
What I want is to count the number of moves for each day, so that I can get an origin-destination matrix with some time divisions. Also, I want to be able to apply some labels/categories along those time divisions. (The code for how this categorization is made isn't relevant; we can just assume here that it works and that each category is described in a val categs: Seq[TimeCategory].) This could be done like this:
case class Flow(ra: Int, rb: Int, time: Long, categ: Long, inflow: Long, outflow: Long)

val Row(t0: Timestamp, tf: Timestamp) = ds.select(min($"time_a"), max($"time_b")).head
val ldt0 = t0.toLocalDateTime()
val ldtf = tf.toLocalDateTime()
val unit = ChronoUnit.DAYS

val fmds = ds.map({
  case Move(id, ra, rb, ta, tb) => {
    val ldta = ta.toLocalDateTime()
    val ldtb = tb.toLocalDateTime()
    (
      id, ra, rb,
      unit.between(ldt0, ldta),
      unit.between(ldt0, ldtb),
      TimeCategory.encode(categs, ldta),
      TimeCategory.encode(categs, ldtb)
    )
  }
}).toDF("id", "ra", "rb", "time_a", "time_b", "categ_a", "categ_b") // name the tuple columns so the groupBy calls below can refer to them

val outfds = fmds
  .groupBy($"ra", $"rb", col("time_a").as("time"), col("categ_a").as("categ"))
  .agg(count("id").as("outflow"))
  .withColumn("inflow", lit(0))
val infds = fmds
  .groupBy($"ra", $"rb", col("time_b").as("time"), col("categ_b").as("categ"))
  .agg(count("id").as("inflow"))
  .withColumn("outflow", lit(0))

val fds = outfds
  .unionByName(infds)
  .groupBy("ra", "rb", "time", "categ")
  .agg(sum("outflow").as("outflow"), sum("inflow").as("inflow"))
  .orderBy("ra", "rb", "time", "categ")
  .as[Flow]
and that would yield the following result:
+---+---+----+-----+-------+------+
| ra| rb|time|categ|outflow|inflow|
+---+---+----+-----+-------+------+
|123|125| 0| 19| 1| 1|
|123|125| 1| 13| 1| 0|
|123|125| 1| 14| 0| 1|
|125|123| 3| 11| 1| 0|
|125|123| 3| 12| 0| 1|
|125|123| 4| 12| 1| 0|
|125|123| 5| 14| 0| 1|
|470|125| 0| 21| 1| 0|
|470|125| 0| 22| 1| 2|
+---+---+----+-----+-------+------+
The problem is, if I wanted to get the average inflow per day for each pair of regions, many days with inflow = 0 would be missing. For example, if the following agg is done:
fds.groupBy("ra", "rb", "categ").agg(avg("inflow"))
// output:
+---+---+-----+-----------+
| ra| rb|categ|avg(inflow)|
+---+---+-----+-----------+
|123|125| 14| 1.0|
|123|125| 13| 0.0|
|470|125| 21| 0.0|
|125|123| 12| 0.5|
|125|123| 11| 0.0|
|123|125| 19| 1.0|
|125|123| 14| 1.0|
|470|125| 22| 2.0|
+---+---+-----+-----------+
For 125 -> 123 and category 12, the avg was 0.5, but considering the start and end timestamp of the whole dataset, there should be 5 days with category 12, not just 2. The avg should be 1 / 5 = 0.2. This second case is what I want to calculate. Considering I want to calculate other agg functions too (like stddev), I suppose the most flexible alternative would be to "fill" in the rows whose values should be zero. What is the best way, in terms of performance/scalability, of doing that?
(EDIT: I thought of a better approach here as well.) So far, a possible solution that comes to my mind is to create another DataFrame with the "time slots" (in this case, each "slot" would be a day index with all suitable categories) and do one more union, like this:
// TimeCategory.encodeAll basically just returns
// every category that suits that timestamp
val timeindex: Seq[(Long, Long)] = (0L to unit.between(ldt0, ldtf))
  .flatMap(i => {
    val t = ldt0.plus(i, unit)
    val categCodes = TimeCategory.encodeAll(categs, unit, t)
    categCodes.map((i, _))
  })

val zds = infds
  .select($"ra", $"rb", explode(typedLit(timeindex)).as("time"))
  .select($"ra", $"rb", col("time")("_1").as("time"), col("time")("_2").as("categ"),
    lit(0).as("outflow"), lit(0).as("inflow"))

val fds = outfds
  .unionByName(infds)
  .unionByName(zds)
  .groupBy("ra", "rb", "time", "categ")
  .agg(sum("outflow").as("outflow"), sum("inflow").as("inflow"))
  .orderBy("ra", "rb", "time", "categ")
  .as[Flow]

fds.show()

fds.groupBy("ra", "rb", "categ")
  .agg(avg("inflow"))
  .where($"avg(inflow)" > 0.0)
  .show()
and the result:
+---+---+----+-----+-------+------+
| ra| rb|time|categ|outflow|inflow|
+---+---+----+-----+-------+------+
|123|125| 0| 17| 0| 0|
|123|125| 0| 18| 0| 0|
|123|125| 0| 19| 1| 1|
|123|125| 0| 20| 0| 0|
|123|125| 0| 21| 0| 0|
|123|125| 0| 22| 0| 0|
|123|125| 1| 9| 0| 0|
|123|125| 1| 10| 0| 0|
|123|125| 1| 11| 0| 0|
|123|125| 1| 12| 0| 0|
|123|125| 1| 13| 1| 0|
|123|125| 1| 14| 0| 1|
|123|125| 2| 9| 0| 0|
|123|125| 2| 10| 0| 0|
|123|125| 2| 11| 0| 0|
|123|125| 2| 12| 0| 0|
|123|125| 2| 13| 0| 0|
|123|125| 2| 14| 0| 0|
|123|125| 3| 9| 0| 0|
|123|125| 3| 10| 0| 0|
+---+---+----+-----+-------+------+
only showing top 20 rows
+---+---+-----+-----------+
| ra| rb|categ|avg(inflow)|
+---+---+-----+-----------+
|123|125| 14| 0.2|
|125|123| 12| 0.2| // <- the correct avg
|123|125| 19| 1.0|
|125|123| 14| 0.2|
|470|125| 22| 2.0|
+---+---+-----+-----------+
Is it possible to improve this?
I can only propose a small improvement to your initial solution: replace the join with unionByName. It should be more efficient than a join, but I didn't test the performance:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val ds = Seq(
  1L -> java.sql.Timestamp.valueOf("2021-07-25 13:05:20"),
  2L -> java.sql.Timestamp.valueOf("2021-07-25 00:05:20"),
  1L -> java.sql.Timestamp.valueOf("2021-07-25 02:05:20"),
  3L -> java.sql.Timestamp.valueOf("2021-07-26 00:45:20"),
  3L -> java.sql.Timestamp.valueOf("2021-07-28 16:05:20"),
  1L -> java.sql.Timestamp.valueOf("2021-07-29 20:05:20")
).toDF("id", "timestamp")

val ddf = ds
  .withColumn("day", dayofmonth($"timestamp"))
  .drop("timestamp")

val r = ds.select(min($"timestamp"), max($"timestamp")).head()
val t0 = r.getTimestamp(0)
val tf = r.getTimestamp(1)

// Pad with one null-id row per day in the full range, so days without moves still appear.
val idf = (t0.getDate() to tf.getDate()).toDF("day")
  .withColumn("id", lit(null).cast(LongType))

ddf.unionByName(idf)
  .groupBy($"day")
  .agg(sum(when($"id".isNotNull, 1).otherwise(0)).as("count"))
  .orderBy("day")
  .show
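Since count() skips nulls, the padded null-id rows already contribute zero, so the conditional sum can equivalently be written as a plain count:
ddf.unionByName(idf)
  .groupBy($"day")
  .agg(count($"id").as("count"))
  .orderBy("day")
  .show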

How to remove the first set of zero-valued columns (or rows) in spark and scala

Hello I am new to spark and I have two data frames such that:
+--------------+-------+-------+-------+-------+-------+-------+-------+
| Region| 3/7/20| 3/8/20| 3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
+--------------+-------+-------+-------+-------+-------+-------+-------+
| Paris| 0| 0| 0| 1| 7| 0| 5|
+--------------+-------+-------+-------+-------+-------+-------+-------+
+----------+-------+
| Period|Reports|
+----------+-------+
|2020/07/20| 0|
|2020/07/21| 0|
|2020/07/22| 0|
|2020/07/23| 8|
|2020/07/24| 0|
|2020/07/25| 1|
+----------+-------+
How can I drop the first consecutive 0-valued columns 3/7/20, 3/8/20 and 3/9/20 without deleting the column 3/12/20?
Similarly, for the second dataframe, how can I remove the rows (2020/07/20, 0), (2020/07/21, 0) and (2020/07/22, 0) without deleting the row (2020/07/24, 0)?
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("0", "0", "0", "1", "7", "0", "5"))
  .toDF("3/7/20", "3/8/20", "3/9/20", "3/10/20", "3/11/20", "3/12/20", "3/13/20")
val columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()

// Turn the single wide row into (column name, value) rows.
val df1 = df.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Region", "Paris")

val windowSpec = Window.partitionBy(lit("A")).orderBy(lit("A"))

// Keep rows whose value or previous value is non-zero; the leading zeros
// have both equal to 0 and are therefore dropped.
df1.withColumn("row_number", row_number.over(windowSpec))
  .withColumn("lag", lag("Paris", 1, 0).over(windowSpec))
  .withColumn("lead", lead("Paris", 1, 0).over(windowSpec))
  .where(($"lag" > 0) or ($"Paris" > 0))
  .show()
/*
+-------+-----+----------+---+----+
| Region|Paris|row_number|lag|lead|
+-------+-----+----------+---+----+
|3/10/20| 1| 4| 0| 7|
|3/11/20| 7| 5| 1| 0|
|3/12/20| 0| 6| 7| 5|
|3/13/20| 5| 7| 0| 0|
+-------+-----+----------+---+----+
*/
val df2 = Seq(("2020/07/20", "0"), ("2020/07/21", "0"), ("2020/07/22", "0"), ("2020/07/23", "8"), ("2020/07/24", "0"), ("2020/07/25", "1"))
  .toDF("Period", "Reports")

df2.withColumn("row_number", row_number.over(windowSpec))
  .withColumn("lag", lag("Reports", 1, 0).over(windowSpec))
  .withColumn("lead", lead("Reports", 1, 0).over(windowSpec))
  .where((($"lag" > 0) or ($"Reports" > 0)) and ($"row_number" > 1))
  .show()
/*
+----------+-------+----------+---+----+
| Period|Reports|row_number|lag|lead|
+----------+-------+----------+---+----+
|2020/07/23| 8| 4| 0| 0|
|2020/07/24| 0| 5| 8| 1|
|2020/07/25| 1| 6| 0| 0|
+----------+-------+----------+---+----+
*/

Is there any generic function to finding the column names in pyspark

How can I print column names in a generic way? I want col1, col2, … instead of _1, _2, …
+---+---+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|_11|_12|
+---+---+---+---+---+---+---+---+---+---+---+---+
|  0|  0|  0|  1|  0|  1|  0|  0|  0|  1|  0|   |
|  0|  0|  0|  1|  0|  1|  0|  0|  0|  1|  0|   |
|  0|  0|  0|  0|  0|  1|  1|  0|  1|  1|  0|   |
|  0|  0|  0|  0|  0|  1|  1|  0|  1|  1|  0|   |
|  0|  0|  0|  0|  0|  1|  1|  0|  1|  1|  0|   |
+---+---+---+---+---+---+---+---+---+---+---+---+
Assuming df is your dataframe, you can just rename:
for col in df.columns:
    df = df.withColumnRenamed(col, col.replace("_", "col"))

How to transform row-information into columns?

I have the following DataFrame in Spark 2.2 and Scala 2.11.8:
+--------+---------+-------+-------+----+-------+
|event_id|person_id|channel| group|num1| num2|
+--------+---------+-------+-------+----+-------+
| 560| 9410| web| G1| 0| 5|
| 290| 1430| web| G1| 0| 3|
| 470| 1370| web| G2| 0| 18|
| 290| 1430| web| G2| 0| 5|
| 290| 1430| mob| G2| 1| 2|
+--------+---------+-------+-------+----+-------+
Here is the code to create the equivalent DataFrame (shown in PySpark syntax):
df = sqlCtx.createDataFrame(
    [(560, 9410, "web", "G1", 0, 5),
     (290, 1430, "web", "G1", 0, 3),
     (470, 1370, "web", "G2", 0, 18),
     (290, 1430, "web", "G2", 0, 5),
     (290, 1430, "mob", "G2", 1, 2)],
    ["event_id", "person_id", "channel", "group", "num1", "num2"]
)
The column group can only have two values: G1 and G2. I need to transform these values of the column group into new columns as follows:
+--------+---------+-------+--------+-------+--------+-------+
|event_id|person_id|channel| num1_G1|num2_G1| num1_G2|num2_G2|
+--------+---------+-------+--------+-------+--------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 0|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| web| 0| 0| 0| 5|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+--------+-------+--------+-------+
How can I do it?
AFAIK (at least I couldn't find a way to perform a pivot without aggregation), we must use an aggregation function when pivoting in Spark.
Scala version:
df.groupBy("event_id", "person_id", "channel")
  .pivot("group")
  .agg(max("num1") as "num1", max("num2") as "num2")
  .na.fill(0)
  .show
+--------+---------+-------+-------+-------+-------+-------+
|event_id|person_id|channel|G1_num1|G1_num2|G2_num1|G2_num2|
+--------+---------+-------+-------+-------+-------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 5|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+-------+-------+-------+-------+
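If you need the num1_G1 naming from the question rather than G1_num1, the pivoted columns can be renamed afterwards; a quick sketch, assuming the pivoted result above is stored in a val named pivoted:
val renamed = pivoted.columns.foldLeft(pivoted) { (acc, c) =>
  c.split("_") match {
    case Array(g, m) if g.startsWith("G") => acc.withColumnRenamed(c, s"${m}_$g")
    case _                                => acc
  }
}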

Run a Cumulative/Iterative Custom Method on a Column in Spark Scala

Hi, I am new to Spark/Scala. I have been trying (AKA failing) to create a column in a Spark dataframe based on a particular recursive formula.
Here it is in pseudocode:
someDf.col2[0] = 0
for i > 0
someDf.col2[i] = x * someDf.col1[i-1] + (1-x) * someDf.col2[i-1]
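For reference, the same recurrence over a plain Scala sequence can be written with scanLeft (illustrative values only; x is chosen arbitrarily here):
val x = 0.5
val col1 = Seq(0, 0, 1, 1, 1, 1)
// col2(0) == 0; col2(i) == x * col1(i-1) + (1 - x) * col2(i-1)
val col2 = col1.scanLeft(0.0)((prev, c) => x * c + (1 - x) * prev).init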
To dive into more details, here is my starting point:
this dataframe is the result of aggregations both on the level of dates and of individual ids.
all further calculations have to happen with respect to that particular id, and have to take into consideration what happened in the previous week.
to illustrate this, I have simplified the values to zeros and ones, removed the multipliers x and 1-x, and initialized col2 to zero.
var someDf = Seq(("2016-01-10 00:00:00.0","385608",0,0),
("2016-01-17 00:00:00.0","385608",0,0),
("2016-01-24 00:00:00.0","385608",1,0),
("2016-01-31 00:00:00.0","385608",1,0),
("2016-02-07 00:00:00.0","385608",1,0),
("2016-02-14 00:00:00.0","385608",1,0),
("2016-01-17 00:00:00.0","105010",0,0),
("2016-01-24 00:00:00.0","105010",1,0),
("2016-01-31 00:00:00.0","105010",0,0),
("2016-02-07 00:00:00.0","105010",1,0)
).toDF("dates", "id", "col1","col2" )
someDf.show()
+--------------------+------+----+----+
|               dates|    id|col1|col2|
+--------------------+------+----+----+
|2016-01-10 00:00:...|385608|   0|   0|
|2016-01-17 00:00:...|385608|   0|   0|
|2016-01-24 00:00:...|385608|   1|   0|
|2016-01-31 00:00:...|385608|   1|   0|
|2016-02-07 00:00:...|385608|   1|   0|
|2016-02-14 00:00:...|385608|   1|   0|
|2016-01-17 00:00:...|105010|   0|   0|
|2016-01-24 00:00:...|105010|   1|   0|
|2016-01-31 00:00:...|105010|   0|   0|
|2016-02-07 00:00:...|105010|   1|   0|
+--------------------+------+----+----+
What I have tried so far vs. what is desired:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val date_id_window = Window.partitionBy("id").orderBy(asc("dates"))

someDf.withColumn("col2",
  lag($"col1", 1).over(date_id_window) + lag($"col2", 1).over(date_id_window)
).show()
+--------------------+------+----+----+ / +--------------------+
| dates| id|col1|col2| / | what_col2_should_be|
+--------------------+------+----+----+ / +--------------------+
|2016-01-17 00:00:...|105010| 0|null| / | 0|
|2016-01-24 00:00:...|105010| 1| 0| / | 0|
|2016-01-31 00:00:...|105010| 0| 1| / | 1|
|2016-02-07 00:00:...|105010| 1| 0| / | 1|
+-------------------------------------+ / +--------------------+
|2016-01-10 00:00:...|385608| 0|null| / | 0|
|2016-01-17 00:00:...|385608| 0| 0| / | 0|
|2016-01-24 00:00:...|385608| 1| 0| / | 0|
|2016-01-31 00:00:...|385608| 1| 1| / | 1|
|2016-02-07 00:00:...|385608| 1| 1| / | 2|
|2016-02-14 00:00:...|385608| 1| 1| / | 3|
+--------------------+------+----+----+ / +--------------------+
Is there a way to do this with a Spark dataframe? I have seen multiple cumulative-type computations, but never one that includes the column being computed itself. I believe the problem is that the newly computed value for row i-1 is not used; instead the old value of col2 at row i-1 is used, which is always 0.
Any help would be appreciated.
Dataset should work just fine:
val x = 0.1

case class Record(dates: String, id: String, col1: Int)

someDf.drop("col2").as[Record]
  .groupByKey(_.id)
  .flatMapGroups((_, records) => {
    // Sort each id's rows by date, then fold the recurrence with scanLeft;
    // the accumulator carries (previous record, previous col2 value).
    val sorted = records.toSeq.sortBy(_.dates)
    sorted.scanLeft((null: Record, 0.0)) {
      case ((_, col2), record) => (record, x * record.col1 + (1 - x) * col2)
    }.tail
  })
  .select($"_1.*", $"_2".alias("col2"))
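The recurrence has to be evaluated sequentially within each id, because every value depends on the previously computed value, which is why it is folded with scanLeft inside flatMapGroups rather than expressed with built-in window functions. In the simplified example, where x and 1-x are dropped, the result reduces to a running sum of the previous col1 values, which is what the window-based answers below compute.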
You can use the rowsBetween API with the Window function you are already using, and you should get the desired output:
val date_id_window = Window.partitionBy("id").orderBy(asc("dates"))

someDf.withColumn("col2", sum(lag($"col1", 1).over(date_id_window))
    .over(date_id_window.rowsBetween(Long.MinValue, 0)))
  .withColumn("col2", when($"col2".isNull, lit(0)).otherwise($"col2"))
  .show()
Given the input dataframe:
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 0|
|2016-02-07 00:00:...|385608| 1| 0|
|2016-02-14 00:00:...|385608| 1| 0|
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 0|
|2016-02-07 00:00:...|105010| 1| 0|
+--------------------+------+----+----+
You should get the following output dataframe after applying the above logic:
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 1|
|2016-02-07 00:00:...|105010| 1| 1|
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 1|
|2016-02-07 00:00:...|385608| 1| 2|
|2016-02-14 00:00:...|385608| 1| 3|
+--------------------+------+----+----+
I hope the answer is helpful
You should apply transformations to your dataframe rather than treating it as a var. One way to get what you want is to use Window's rowsBetween to cumulatively sum the values of col1 for the rows within each window partition up through the previous row (i.e. row -1):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy("dates").rowsBetween(Long.MinValue, -1)
val newDF = someDf.
withColumn(
"col2", sum($"col1").over(window)
).withColumn(
"col2", when($"col2".isNull, 0).otherwise($"col2")
).orderBy("id", "dates")
newDF.show
+--------------------+------+----+----+
| dates| id|col1|col2|
+--------------------+------+----+----+
|2016-01-17 00:00:...|105010| 0| 0|
|2016-01-24 00:00:...|105010| 1| 0|
|2016-01-31 00:00:...|105010| 0| 1|
|2016-02-07 00:00:...|105010| 1| 1|
|2016-01-10 00:00:...|385608| 0| 0|
|2016-01-17 00:00:...|385608| 0| 0|
|2016-01-24 00:00:...|385608| 1| 0|
|2016-01-31 00:00:...|385608| 1| 1|
|2016-02-07 00:00:...|385608| 1| 2|
|2016-02-14 00:00:...|385608| 1| 3|
+--------------------+------+----+----+
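On Spark 2.1+ the lower bound can also be written with the Window.unboundedPreceding constant, which is equivalent to Long.MinValue here:
val window = Window.partitionBy("id").orderBy("dates").rowsBetween(Window.unboundedPreceding, -1)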