Get a cumulative sum with time interval condition - pyspark

I have a dataframe with group, value, date_start, and date_end columns. For each row, I want to take the cumulative sum of all values that are in the same group and whose date_end is before the current row's date_start.
Here is what the data looks like:
+-----+-----+----------+----------+
|group|value|date_start|date_end  |
+-----+-----+----------+----------+
|a    |1    |2016-05-04|2016-05-05|
|a    |2    |2016-05-06|2016-05-06|
|a    |5    |2016-07-06|2016-10-06|
|a    |2    |2016-09-10|2016-09-20|
|a    |3    |2016-11-12|2016-12-20|
|b    |8    |2016-09-03|2016-11-06|
|b    |2    |2016-11-04|2016-12-05|
|b    |4    |2016-12-04|2016-12-06|
+-----+-----+----------+----------+
This is what I expect:
+-----+-----+----------+----------+-------+
|group|value|date_start|date_end  |cum_sum|
+-----+-----+----------+----------+-------+
|a    |1    |2016-05-04|2016-05-05|      0|
|a    |2    |2016-05-06|2016-05-06|      1|
|a    |5    |2016-07-06|2016-10-06|      3| => 1+2
|a    |2    |2016-09-10|2016-09-20|      3| => 1+2; does not include the 3rd row
|a    |3    |2016-11-12|2016-12-20|     10| => 1+2+5+2
|b    |8    |2016-09-03|2016-11-06|      0| => no row satisfies the time condition
|b    |2    |2016-11-04|2016-12-05|      0| => no row satisfies the time condition
|b    |4    |2016-12-04|2016-12-06|      8|
+-----+-----+----------+----------+-------+
I currently have the window function set up like this:
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

windowval = (Window.partitionBy('group').orderBy(F.col("date_start"))
             .rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('cum_sum', F.sum('value').over(windowval) - F.col('value'))
Here's the result I got. I know it's not correct; it simply takes the cumulative sum along the partition.
+-----+-----+----------+----------+-------+
|group|value|date_start|date_end  |cum_sum|
+-----+-----+----------+----------+-------+
|a    |1    |2016-05-04|2016-05-05|      0|
|a    |2    |2016-05-06|2016-05-06|      1|
|a    |5    |2016-07-06|2016-10-06|      3|
|a    |2    |2016-09-10|2016-09-20|      8|
|a    |3    |2016-11-12|2016-12-20|     10|
|b    |8    |2016-09-03|2016-11-06|      0|
|b    |2    |2016-11-04|2016-12-05|      8|
|b    |4    |2016-12-04|2016-12-06|     12|
+-----+-----+----------+----------+-------+
Is there a way to apply the condition on the date_start and date_end?
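One way to express the date condition, since a window frame ordered by date_start cannot look at date_end, is a conditional self-join: for each row, sum the values of the rows in the same group whose date_end falls before that row's date_start. A minimal PySpark sketch of that idea (untested; row_id and the r_* aliases are names introduced here for illustration):

from pyspark.sql import functions as F

# Tag each row so it can be grouped back together after the join.
left = df.withColumn("row_id", F.monotonically_increasing_id())

# Aliased copy of the columns needed from the "other" rows.
right = df.select(
    F.col("group").alias("r_group"),
    F.col("value").alias("r_value"),
    F.col("date_end").alias("r_date_end"),
)

cum = (
    left.join(
        right,
        (left["group"] == right["r_group"]) & (right["r_date_end"] < left["date_start"]),
        "left",
    )
    .groupBy("row_id", "group", "value", "date_start", "date_end")
    .agg(F.coalesce(F.sum("r_value"), F.lit(0)).alias("cum_sum"))
    .drop("row_id")
)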

Related

How would I repeat each row in a Scala dataframe N times

Here is the dataframe before, and here is the after (screenshots not reproduced here). Notice how the repeated rows are all next to each other, as opposed to the whole dataframe simply being appended again at the end.
Thanks
Try array_repeat with the struct function, then explode the array.
Example:
df.show()
/*
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 2| 5|
| 3| 6|
+----+----+
*/
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.withColumn("arr",explode(array_repeat(struct(df.columns.head,df.columns.tail:_*),7))).
select("arr.*").
toDF("col1","col2").
show(100,false)
/*
+----+----+
|col1|col2|
+----+----+
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
+----+----+
*/
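For reference, roughly the same idea in PySpark (array_repeat needs Spark 2.4+); a minimal sketch assuming the same two-column df and 7 repeats, untested:

from pyspark.sql import functions as F

n = 7  # number of copies of each row
df_repeated = (
    df.withColumn("arr", F.explode(F.array_repeat(F.struct(*df.columns), n)))
      .select("arr.*")
)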
Here's a function which duplicates a DataFrame:
import org.apache.spark.sql.DataFrame

def repeatRows(df: DataFrame, numRepeats: Int): DataFrame = {
  // Union df onto itself numRepeats - 1 times
  (1 until numRepeats).foldLeft(df)((growingDF, _) => growingDF.union(df))
}
The problem of having the resulting DataFrame sorted is separate from the duplication process, and hence wasn't included in the function, but can be easily achieved afterwards.
So let's take your problem:
// Problem setup
val someDF = Seq((1,4),(2,4),(3,6)).toDF("col1","col2")
// Duplicate followed by sort
val duplicatedSortedDF = repeatRows(someDF, 3).sort("col1")
// Show result
duplicatedSortedDF.show()
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 1| 4|
| 1| 4|
| 2| 4|
| 2| 4|
| 2| 4|
| 3| 6|
| 3| 6|
| 3| 6|
+----+----+
And there you have it.
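If you prefer PySpark, the same union-based approach can be sketched with functools.reduce (assumed equivalent, untested):

from functools import reduce

def repeat_rows(df, num_repeats):
    # Union the dataframe onto itself num_repeats - 1 times
    return reduce(lambda acc, _: acc.union(df), range(1, num_repeats), df)

duplicated_sorted = repeat_rows(some_df, 3).sort("col1")  # some_df: a PySpark DataFrame like someDF above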

Update the value of a array of column depend on the value of different array of column in same dataframe in Spark-Scala

There are almost 1500+ columns in a dataframe, grouped into several families of columns with the same naming pattern: col1, col2, col3, col4..., then c1, c2, c3, c4..., then column1, column2, column3, column4..., etc. Now, based on certain logic and conditions, a subset of one of these families of columns needs to be updated/modified. Suppose the logic is:
col(i) = 2*column(i) + 3*c(i)
and
col(i) = 2*column(i) + col(i)
where i ranges between the lower and upper limits of the subset. For example, the subset of col1-col4 could be col1, col2, col3.
I need an expression with which the two operations above can be written.
Here is an example of what I did for the simpler expression below:
col(i) = col(i) + 1
Example :
scala> val original_df = Seq((1,2,3,4,9,8,7,6),(2,3,4,5,8,7,6,5),(3,4,5,6,7,6,5,4),(4,5,6,7,6,5,4,3),(5,6,7,8,5,4,3,2),(6,7,8,9,4,3,2,1)).toDF("col1","col2","col3","col4","c1","c2","c3","c4")
original_df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 6 more fields]
scala> original_df.show()
+----+----+----+----+---+---+---+---+
|col1|col2|col3|col4| c1| c2| c3| c4|
+----+----+----+----+---+---+---+---+
| 1| 2| 3| 4| 9| 8| 7| 6|
| 2| 3| 4| 5| 8| 7| 6| 5|
| 3| 4| 5| 6| 7| 6| 5| 4|
| 4| 5| 6| 7| 6| 5| 4| 3|
| 5| 6| 7| 8| 5| 4| 3| 2|
| 6| 7| 8| 9| 4| 3| 2| 1|
+----+----+----+----+---+---+---+---+
scala> val requiredColumns = original_df.columns.zipWithIndex.filter(_._2 < 3).map(_._1).toSet
requiredColumns: scala.collection.immutable.Set[String] = Set(col1, col2, col3)
scala> val allColumns = original_df.columns
allColumns: Array[String] = Array(col1, col2, col3, col4, c1, c2, c3, c4)
scala> val columnExpr = allColumns.filterNot(requiredColumns(_)).map(col(_)) ++ requiredColumns.map(c => (col(c) + 1).alias(c))
columnExpr: Array[org.apache.spark.sql.Column] = Array(col4, c1, c2, c3, c4, (col1 + 1) AS `col1`, (col2 + 1) AS `col2`, (col3 + 1) AS `col3`)
scala> original_df.select(columnExpr:_*).show(false)
+----+---+---+---+---+----+----+----+
|col4|c1 |c2 |c3 |c4 |col1|col2|col3|
+----+---+---+---+---+----+----+----+
|4 |9 |8 |7 |6 |2 |3 |4 |
|5 |8 |7 |6 |5 |3 |4 |5 |
|6 |7 |6 |5 |4 |4 |5 |6 |
|7 |6 |5 |4 |3 |5 |6 |7 |
|8 |5 |4 |3 |2 |6 |7 |8 |
|9 |4 |3 |2 |1 |7 |8 |9 |
+----+---+---+---+---+----+----+----+
So in this case I need an expression for the two formulas below:
col(i) = 2*column(i) + 3*c(i)
and
col(i) = 2*column(i) + col(i)
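Following the select-expression pattern from the example above, the column expressions for the two formulas can be generated from the index range. A minimal PySpark sketch, assuming (hypothetically) that the dataframe has matching col<i>, c<i>, and column<i> columns and that the subset is i = 1..3; untested:

from pyspark.sql import functions as F

idx = range(1, 4)                      # hypothetical subset: i = 1..3
updated = {f"col{i}" for i in idx}

# col(i) = 2*column(i) + 3*c(i)
expr1 = [(2 * F.col(f"column{i}") + 3 * F.col(f"c{i}")).alias(f"col{i}") for i in idx]
# col(i) = 2*column(i) + col(i)
expr2 = [(2 * F.col(f"column{i}") + F.col(f"col{i}")).alias(f"col{i}") for i in idx]

passthrough = [F.col(c) for c in original_df.columns if c not in updated]
result1 = original_df.select(*passthrough, *expr1)  # first formula
result2 = original_df.select(*passthrough, *expr2)  # second formula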

Iterate Over a Dataframe as each time column is passing to do transformation

I have a dataframe with 100 columns and column names like col1, col2, col3.... I want to apply certain transformations to the column values when a condition matches. I can store the column names in an array of strings, pass each element of the array to withColumn, and transform the values of that column with a when condition.
But since a DataFrame is immutable, each updated version needs to be stored in a new variable, and that new dataframe then has to be used with withColumn for the next iteration.
Is there a way to create an array of dataframes, so that each new dataframe can be stored as an element of the array and iterated over by index?
Or is there any other way to handle this?
var arr_df : Array[DataFrame] = new Array[DataFrame](60)
--> This throws the error "not found: type DataFrame"
val df(0) = df1.union(df2)
for(i <- 1 to 99){
  val df(i) = df(i-1).withColumn(col(i), when(col(i) > 0, col(i) + 1).otherwise(col(i)))
}
Here col(i) is the i-th element of an array of strings that stores the names of the columns of the original dataframe.
As an example:
scala> val original_df = Seq((1,2,3,4),(2,3,4,5),(3,4,5,6),(4,5,6,7),(5,6,7,8),(6,7,8,9)).toDF("col1","col2","col3","col4")
original_df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 2 more fields]
scala> original_df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 4|
| 2| 3| 4| 5|
| 3| 4| 5| 6|
| 4| 5| 6| 7|
| 5| 6| 7| 8|
| 6| 7| 8| 9|
+----+----+----+----+
I want to iterate over 3 columns: col1, col2, col3. If the value in one of those columns is greater than 3, it should be incremented by 1.
Check below code.
scala> df.show(false)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |4 |
|2 |3 |4 |5 |
|3 |4 |5 |6 |
|4 |5 |6 |7 |
|5 |6 |7 |8 |
|6 |7 |8 |9 |
+----+----+----+----+
scala> val requiredColumns = df.columns.zipWithIndex.filter(_._2 < 3).map(_._1).toSet
requiredColumns: scala.collection.immutable.Set[String] = Set(col1, col2, col3)
scala> val allColumns = df.columns
allColumns: Array[String] = Array(col1, col2, col3, col4)
scala> val columnExpr = allColumns.filterNot(requiredColumns(_)).map(col(_)) ++ requiredColumns.map(c => when(col(c) > 3, col(c) + 1).otherwise(col(c)).as(c))
scala> df.select(columnExpr:_*).show(false)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |4 |
|2 |3 |5 |5 |
|3 |5 |6 |6 |
|5 |6 |7 |7 |
|6 |7 |8 |8 |
|7 |8 |9 |9 |
+----+----+----+----+
If I understand you right, you are trying to do a dataframe-wise operation; you don't need to iterate for this. I can show you how it can be done in PySpark; it can probably be carried over to Scala.
from pyspark.sql import functions as F
tst= sqlContext.createDataFrame([(1,7,0),(1,8,4),(1,0,10),(5,1,90),(7,6,0),(0,3,11)],schema=['col1','col2','col3'])
expr = [F.when(F.col(coln)>3,F.col(coln)+1).otherwise(F.col(coln)).alias(coln) for coln in tst.columns if 'col3' not in coln]
tst1= tst.select(*expr)
results:
tst1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 8|
| 1| 9|
| 1| 0|
| 6| 1|
| 8| 7|
| 0| 3|
+----+----+
This should give you the desired result
You can iterate over all columns and apply the condition in a single line, as below:
original_df.select(original_df.columns.map(c => (when(col(c) > lit(3), col(c)+1).otherwise(col(c))).alias(c)):_*).show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 5|
| 2| 3| 5| 6|
| 3| 5| 6| 7|
| 5| 6| 7| 8|
| 6| 7| 8| 9|
| 7| 8| 9| 10|
+----+----+----+----+
You can use foldLeft whenever you want to make changes to multiple columns, as below:
val original_df = Seq(
  (1,2,3,4),
  (2,3,4,5),
  (3,4,5,6),
  (4,5,6,7),
  (5,6,7,8),
  (6,7,8,9)
).toDF("col1","col2","col3","col4")

// Filter the columns that you want to update
val columns = original_df.columns

columns.foldLeft(original_df){ (acc, colName) =>
  acc.withColumn(colName, when(col(colName) > 3, col(colName) + 1).otherwise(col(colName)))
}
.show(false)
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |5 |
|2 |3 |5 |6 |
|3 |5 |6 |7 |
|5 |6 |7 |8 |
|6 |7 |8 |9 |
|7 |8 |9 |10 |
+----+----+----+----+
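For completeness, roughly the same foldLeft pattern in PySpark can be written with functools.reduce; a minimal sketch assuming a dataframe df with the columns above, untested:

from functools import reduce
from pyspark.sql import functions as F

result = reduce(
    lambda acc, c: acc.withColumn(c, F.when(F.col(c) > 3, F.col(c) + 1).otherwise(F.col(c))),
    df.columns,
    df,
)
result.show(truncate=False)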

column split in Spark Scala dataframe

I have the below dataframe -
scala> val df1=Seq(
| ("1_10","2_20","3_30"),
| ("7_70","8_80","9_90")
| ).toDF("c1","c2","c3")
scala> df1.show
+----+----+----+
| c1| c2| c3|
+----+----+----+
|1_10|2_20|3_30|
|7_70|8_80|9_90|
+----+----+----+
How do I split these into different columns based on the delimiter "_"?
Expected output -
+----+----+----+----+----+----+
| c1| c2| c3|c1_1|c2_1|c3_1|
+----+----+----+----+----+----+
|1 |2 |3 | 10| 20| 30|
|7 |8 |9 | 70| 80| 90|
+----+----+----+----+----+----+
Also, I have 50+ columns in the DF. Thanks in advance.
This is a good use case for foldLeft: split each column, then create a new column for each split value.
val cols = df1.columns

cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols:_*)
 .show(false)
If you need the column names exactly as in your expected output, then you need to take the split columns and rename them again with foldLeft (see the sketch after the output below).
Output:
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1 |10 |2 |20 |3 |30 |
|7 |70 |8 |80 |9 |90 |
+----+----+----+----+----+----+
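A minimal sketch of that renaming step in PySpark terms (assuming split_df is the result of the split step above, with columns like c1_1, c1_2, ...; untested):

from functools import reduce

base_names = ["c1", "c2", "c3"]  # original column names before the split

# c<i>_1 -> c<i>, then c<i>_2 -> c<i>_1, to match the expected output
renamed = reduce(
    lambda acc, c: acc.withColumnRenamed(f"{c}_1", c).withColumnRenamed(f"{c}_2", f"{c}_1"),
    base_names,
    split_df,
)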
You can use the split method:
split(col("c1"), "_")
This returns a column of ArrayType(StringType).
Then you can access the items with the .getItem(index) method.
That works if you have a stable number of elements after splitting; if that isn't the case, you will get null values where the indexed value isn't present in the array after splitting.
Example of code:
df.select(
split(col("c1"), "_").alias("c1_items"),
split(col("c2"), "_").alias("c2_items"),
split(col("c3"), "_").alias("c3_items"),
).select(
col("c1_items").getItem(0).alias("c1"),
col("c1_items").getItem(1).alias("c1_1"),
col("c2_items").getItem(0).alias("c2"),
col("c2_items").getItem(1).alias("c2_1"),
col("c3_items").getItem(0).alias("c3"),
col("c3_items").getItem(1).alias("c3_1")
)
Since you need to do this for 50+ columns, I would suggest wrapping it into a method that handles a single column with withColumn, in this kind of way:
import org.apache.spark.sql.Dataset

def splitMyCol(df: Dataset[_], name: String) = {
  df.withColumn(
    s"${name}_items", split(col(name), "_")
  ).withColumn(
    name, col(s"${name}_items").getItem(0)
  ).withColumn(
    s"${name}_1", col(s"${name}_items").getItem(1)
  ).drop(s"${name}_items")
}
Note that I assume you do not need the items column to be kept, so I drop it. Also note that since _ can be part of a Scala identifier, the variable reference inside the s"" string has to be wrapped in {} so that the suffix is not read as part of the variable name.
You can then wrap this in a fold like this:
val result = columnsToExpand.foldLeft(df)(
(acc, next) => splitMyCol(acc, next)
)
PySpark solution:
import pyspark.sql.functions as F

df1 = sqlContext.createDataFrame([("1_10","2_20","3_30"),("7_70","8_80","9_90")]).toDF("c1","c2","c3")
expr = [F.split(coln,"_") for coln in df1.columns]
df2 = df1.select(*expr)
#%%
df3 = df2.withColumn("clctn", F.flatten(F.array(df2.columns)))
#%% assuming all columns will have data in the same format x_y
arr_size = len(df1.columns)*2
df_fin = df3.select([F.expr("clctn["+str(x)+"]").alias("c"+str(x//2)+'_'+str(x%2)) for x in range(arr_size)])
Results:
+----+----+----+----+----+----+
|c0_0|c0_1|c1_0|c1_1|c2_0|c2_1|
+----+----+----+----+----+----+
| 1| 10| 2| 20| 3| 30|
| 7| 70| 8| 80| 9| 90|
+----+----+----+----+----+----+
Try to use select instead of foldLeft for better performance, as foldLeft might take longer than select.
Check this post comparing foldLeft and select.
val expr = df
.columns
.flatMap(c => Seq(
split(col(c),"_")(0).as(s"${c}_1"),
split(col(c),"_")(1).as(s"${c}_2")
)
)
.toSeq
Result
df.select(expr:_*).show(false)
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1 |10 |2 |20 |3 |30 |
|7 |70 |8 |80 |9 |90 |
+----+----+----+----+----+----+
You can do it like this.
var df=Seq(("1_10","2_20","3_30"),("7_70","8_80","9_90")).toDF("c1","c2","c3")

for (cl <- df.columns) {
  df=df.withColumn(cl+"_temp",split(df.col(cl),"_")(0))
  df=df.withColumn(cl+"_"+cl.substring(1),split(df.col(cl),"_")(1))
  df=df.withColumn(cl,df.col(cl+"_temp")).drop(cl+"_temp")
}
df.show(false)
//Sample output
+---+---+---+----+----+----+
|c1 |c2 |c3 |c1_1|c2_2|c3_3|
+---+---+---+----+----+----+
|1 |2 |3 |10 |20 |30 |
|7 |8 |9 |70 |80 |90 |
+---+---+---+----+----+----+

apply an aggregate result to all ungrouped rows of a dataframe in spark

Assume there is a dataframe as follows:
machine_id | value
1          | 5
1          | 3
2          | 6
2          | 10
2          | 14
I want to produce a final dataframe like this
machine_id | value | diff
1          | 5     | 1
1          | 3     | -1
2          | 6     | -4
2          | 10    | 0
2          | 14    | 4
the values in "diff" column is computed as groupBy($"machine_id").avg($"value") - value.
note that the avg for machine_id==1 is (5+3)/2 = 4 and for machine_id ==2 is (6+10+14)/3 = 10
What is the best way to produce such a final dataframe in Apache Spark?
You can use a Window function to get the desired output.
Given the dataframe as
+----------+-----+
|machine_id|value|
+----------+-----+
|1 |5 |
|1 |3 |
|2 |6 |
|2 |10 |
|2 |14 |
+----------+-----+
You can use the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._  // assuming a SparkSession named spark, for the 'value symbol syntax

df.withColumn("diff", avg("value").over(Window.partitionBy("machine_id")))
  .withColumn("diff", 'value - 'diff)
to get the final result as
+----------+-----+----+
|machine_id|value|diff|
+----------+-----+----+
|1 |5 |1.0 |
|1 |3 |-1.0|
|2 |6 |-4.0|
|2 |10 |0.0 |
|2 |14 |4.0 |
+----------+-----+----+
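Roughly the same thing in PySpark, for reference; a minimal sketch assuming a dataframe df with machine_id and value columns:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("machine_id")
result = df.withColumn("diff", F.col("value") - F.avg("value").over(w))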