Last Entry that matches a condition per Window - scala

This dummy data represents a device with measurement cycles.
One measurement cycle goes from one "Type" Init row to the next Init row.
What I want to find is, for example, the last error (the real condition will get far more complicated) within each measurement cycle.
I already have a working solution. What I really want to know is whether there is an easier / more efficient way to calculate this.
Example Dataset
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df_orig = spark.sparkContext.parallelize(Seq(
("Init", 1, 17, "I"),
("TypeA", 2, 17, "W"),
("TypeA", 3, 17, "E"),
("TypeA", 4, 17, "W"),
("TypeA", 5, 17, "E"),
("TypeA", 6, 17, "W"),
("Init", 7, 12, "I"),
("TypeB", 8, 12, "W"),
("TypeB", 9, 12, "E"),
("TypeB", 10, 12, "W"),
("TypeB", 11, 12, "W"),
("TypeB", 12, 12, "E"),
("TypeB", 13, 12, "E")
)).toDF("Type", "rn", "X_ChannelC", "Error_Type")
The following code represents my solution.
val fillWindow = Window.partitionBy().orderBy($"rn").rowsBetween(Window.unboundedPreceding, 0)

// create the measurement-cycle window: flag every Init row and carry the running sum forward
val df_with_window = df_orig
  .withColumn("window_flag", when($"Type".contains("Init"), 1).otherwise(null))
  .withColumn("window_filled", sum($"window_flag").over(fillWindow))

val window = Window.partitionBy("window_filled").orderBy($"rn").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// calculate the last matching entry per cycle
val df_new = df_with_window
  .withColumn("is_relevant", when($"Error_Type".contains("E"), $"rn").otherwise(null))
  .withColumn("last", last($"is_relevant", true).over(window))
  .withColumn("pass", when($"last" === $"is_relevant", "This one").otherwise(null))
df_new.show()
Result:
+-----+---+----------+----------+-----------+-------------+-----------+----+--------+
| Type| rn|X_ChannelC|Error_Type|window_flag|window_filled|is_relevant|last| pass|
+-----+---+----------+----------+-----------+-------------+-----------+----+--------+
| Init| 1| 17| I| 1| 1| null| 5| null|
|TypeA| 2| 17| W| null| 1| null| 5| null|
|TypeA| 3| 17| E| null| 1| 3| 5| null|
|TypeA| 4| 17| W| null| 1| null| 5| null|
|TypeA| 5| 17| E| null| 1| 5| 5|This one|
|TypeA| 6| 17| W| null| 1| null| 5| null|
| Init| 7| 12| I| 1| 2| null| 13| null|
|TypeB| 8| 12| W| null| 2| null| 13| null|
|TypeB| 9| 12| E| null| 2| 9| 13| null|
|TypeB| 10| 12| W| null| 2| null| 13| null|
|TypeB| 11| 12| W| null| 2| null| 13| null|
|TypeB| 12| 12| E| null| 2| 12| 13| null|
|TypeB| 13| 12| E| null| 2| 13| 13|This one|
+-----+---+----------+----------+-----------+-------------+-----------+----+--------+

Not sure whether this is more efficient (it still uses 2 window functions, but it is a bit shorter):
val df_new = df_orig
  .withColumn("measurement", sum(when($"Type" === "Init", 1)).over(Window.orderBy($"rn")))
  .withColumn("pass", $"rn" === max(when($"Error_Type" === "E", $"rn")).over(Window.partitionBy($"measurement")))
df_new.show()
+-----+---+----------+----------+-----------+-----+
| Type| rn|X_ChannelC|Error_Type|measurement| pass|
+-----+---+----------+----------+-----------+-----+
| Init| 1| 17| I| 1|false|
|TypeA| 2| 17| W| 1|false|
|TypeA| 3| 17| E| 1|false|
|TypeA| 4| 17| W| 1|false|
|TypeA| 5| 17| E| 1| true|
|TypeA| 6| 17| W| 1|false|
| Init| 7| 12| I| 2|false|
|TypeB| 8| 12| W| 2|false|
|TypeB| 9| 12| E| 2|false|
|TypeB| 10| 12| W| 2|false|
|TypeB| 11| 12| W| 2|false|
|TypeB| 12| 12| E| 2|false|
|TypeB| 13| 12| E| 2| true|
+-----+---+----------+----------+-----------+-----+
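If you prefer the "This one" / null labels from the first output over a boolean, a small follow-up sketch (building on the df_new defined just above) could be:
// Translate the boolean flag back into the "This one" / null labelling of the original solution.
val df_labelled = df_new.withColumn("pass", when($"pass", "This one"))
df_labelled.show()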

Related

Pyspark: Window function for stddev and quantiles produce NaN and Nulls

Trying to compute the stddev and the 25th/75th percentiles, but they produce NaN and null values.
from functools import partial, reduce

import pyspark.sql.functions as F
from pyspark.sql import Window

# Window Time = 30min
window_time = 1800
# Stats fields for window
stat_fields = ['source_packets', 'destination_packets']

df = sqlContext.createDataFrame([('192.168.1.1','10.0.0.1',22,51000, 17, 1, "2017-03-10T15:27:18+00:00"),
                                 ('192.168.1.2','10.0.0.2',51000,22, 1,2, "2017-03-15T12:27:18+00:00"),
                                 ('192.168.1.2','10.0.0.2',53,51000, 2,3, "2017-03-15T12:28:18+00:00"),
                                 ('192.168.1.2','10.0.0.2',51000,53, 3,4, "2017-03-15T12:29:18+00:00"),
                                 ('192.168.1.3','10.0.0.3',80,51000, 4,5, "2017-03-15T12:28:18+00:00"),
                                 ('192.168.1.3','10.0.0.3',51000,80, 5,6, "2017-03-15T12:29:18+00:00"),
                                 ('192.168.1.3','10.0.0.3',22,51000, 25,7, "2017-03-18T11:27:18+00:00")],
                                ["source_ip","destination_ip","source_port","destination_port", "source_packets", "destination_packets", "timestampGMT"])

def add_stats_column(r_df, field, window):
    '''
    Input:
    r_df: dataframe
    field: field to generate stats with
    window: pyspark window to be used
    '''
    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
    return r_df

w_s = (Window()
       .partitionBy("ip")
       .orderBy(F.col("timestamp"))
       .rangeBetween(-window_time, 0))

df2 = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
    .withColumn("arr", F.array(F.col("source_ip"), F.col("destination_ip")))\
    .selectExpr("explode(arr) as ip", "*")\
    .drop(*['arr', 'source_ip', 'destination_ip'])

df2 = reduce(partial(add_stats_column, window=w_s),
             stat_fields,
             df2)

# print(df2.explain())
df2.show(100)
output
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| ip|source_port|destination_port|source_packets|destination_packets| timestampGMT| timestamp|source_packets_sum_30m|source_packets_avg_30m|source_packets_std_30m|source_packets_min_30m|source_packets_max_30m|source_packets_q25_30m|source_packets_q75_30m|destination_packets_sum_30m|destination_packets_avg_30m|destination_packets_std_30m|destination_packets_min_30m|destination_packets_max_30m|destination_packets_q25_30m|destination_packets_q75_30m|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|192.168.1.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
|192.168.1.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
|192.168.1.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
| 10.0.0.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
| 10.0.0.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
| 10.0.0.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
| 10.0.0.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
| 10.0.0.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
| 10.0.0.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
| 10.0.0.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
|192.168.1.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
|192.168.1.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
|192.168.1.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
|192.168.1.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
From the PySpark API doc we can see:
pyspark.sql.functions.stddev(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_pop(col)
Aggregate function: returns population standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_samp(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
So maybe you can try stddev_pop (population standard deviation) instead of stddev (unbiased sample standard deviation).
The unbiased sample standard deviation divides by n - 1, so when a window contains only one sample it divides by zero and returns NaN.
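For example, a minimal sketch (reusing the add_stats_column function from the question) would just swap the stddev line:
# Use the population standard deviation so that a single-row window yields 0.0 instead of NaN.
r_df = r_df.withColumn('{}_std_30m'.format(field), F.stddev_pop(field).over(window))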

Pyspark cumsum over same values in orderBy column

I have the following dataframe:
+----+----+-----+
|col1|col2|value|
+----+----+-----+
| 11| a| 1|
| 11| a| 2|
| 11| b| 3|
| 11| a| 4|
| 11| b| 5|
| 22| a| 6|
| 22| b| 7|
+----+----+-----+
I want to calculate the cumulative sum (cumsum) of the 'value' column, partitioned by 'col1' and ordered by 'col2'.
This is the desired output:
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 1| 1|
| 11| a| 2| 3|
| 11| a| 4| 7|
| 11| b| 3| 10|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
I have used the code below, which gives me the df shown after it. It is not what I wanted. Can someone help me, please?
df.withColumn("cumsum", F.sum("value").over(Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)))
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 2| 7|
| 11| a| 1| 7|
| 11| a| 4| 7|
| 11| b| 3| 15|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
You have to use .rowsBetween instead of .rangeBetween in your window clause. With rangeBetween, every row that shares the same orderBy value ("a" or "b" here) falls into the same frame, so all of those rows get the same cumulative sum; rowsBetween grows the frame one row at a time.
Example:
df.withColumn("cumsum", sum("value").over(Window.partitionBy("col1").orderBy("col2").rowsBetween(Window.unboundedPreceding, 0))).show()
#+----+----+-----+------+
#|col1|col2|value|cumsum|
#+----+----+-----+------+
#| 11| a| 1| 1|
#| 11| a| 2| 3|
#| 11| a| 4| 7|
#| 11| b| 3| 10|
#| 11| b| 5| 15|
#|  22|   a|    6|     6|
#|  22|   b|    7|    13|
#+----+----+-----+------+
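A quick way to see the difference between the two frame types (a small sketch, assuming the same df as above):
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("col1").orderBy("col2")

# range_cumsum puts all rows that share the same col2 value into one frame;
# rows_cumsum advances the frame one row at a time.
df.withColumn("range_cumsum", F.sum("value").over(w.rangeBetween(Window.unboundedPreceding, 0))) \
  .withColumn("rows_cumsum", F.sum("value").over(w.rowsBetween(Window.unboundedPreceding, 0))) \
  .show()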

Spark Scala Window extend result until the end

I will explain my problem using the initial dataframe and the one I want to obtain:
val df_997 = Seq [(Int, Int, Int, Int)]((1,1,7,10),(1,10,4,300),(1,3,14,50),(1,20,24,70),(1,30,12,90),(2,10,4,900),(2,25,30,40),(2,15,21,60),(2,5,10,80)).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
df_997.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
df_998.show
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
+--------+-------+---+-------+-------+
Now, for each window, whenever the value of aux is 4, I want to copy the value of the IND_DEF column into the FECMVTO column for that row and for every following row until the end of the window.
The resulting DF would be:
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
+--------+-------+---+-------+-------+
Thanks for your suggestions, as I am very stuck here...
Here's one approach: first left-join the DataFrame with its aux == 4 filtered version, then apply the window function first (with ignoreNulls) to fill the wanted IND_DEF value forward within each partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
  (2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF")

val win = Window.partitionBy("policyId").orderBy("FECMVTO").
  rowsBetween(Window.unboundedPreceding, 0)

val df2 = df.
  select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
  where($"aux" === 4)

df.join(df2, Seq("policyId", "aux"), "left_outer").
  withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
  withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
  show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).
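If the helper columns are not wanted at all, a slightly more compact variant (a sketch under the same assumptions as the snippet above) folds the window expression directly into the coalesce:
// Same logic, but without materialising IND_DEF3, and dropping IND_DEF2 at the end.
val result = df.join(df2, Seq("policyId", "aux"), "left_outer").
  withColumn("FECMVTO", coalesce(first($"IND_DEF2", ignoreNulls = true).over(win), $"FECMVTO")).
  drop("IND_DEF2")
result.show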
I believe the below can be a solution for your issue.
Considering input_df is your input dataframe:
//Step#1 - Filter the rows with aux = 4 from input_df
val only_aux_4_df1 = input_df.filter($"aux" === 4)
//Step#2 - Overwrite the FECMVTO value with IND_DEF for the above result (keeps the original column order)
val only_aux_4_df2 = only_aux_4_df1.withColumn("FECMVTO", $"IND_DEF")
//Step#3 - Remove all the records from Step#1 from input_df
val input_df_without_aux_4 = input_df.except(only_aux_4_df1)
//Step#4 - Combine the Step#2 output with the output of Step#3
val final_df = input_df_without_aux_4.union(only_aux_4_df2)

How to find the previous occurrence of a value 'a' before some value 'b'

I join two data frames and get the resulting data frame shown below.
+---------+-----------+-----------+-------------------+---------+-------------------+
|a |b | c | d | e | f |
+---------+-----------+-----------+-------------------+---------+-------------------+
| 7| 2| 1|2015-04-12 23:59:01| null| null |
| 15| 2| 2|2015-04-12 23:59:02| | |
| 11| 2| 4|2015-04-12 23:59:03| null| null|
| 3| 2| 4|2015-04-12 23:59:04| null| null|
|        8|          2|          3|2015-04-12 23:59:05| {NORMAL}|2015-04-12 23:59:05|
| 16| 2| 3|2017-03-12 23:59:06| null| null|
| 5| 2| 3|2015-04-12 23:59:07| null| null|
| 18| 2| 3|2015-03-12 23:59:08| null| null|
| 17| 2| 1|2015-03-12 23:59:09| null| null|
| 6| 2| 1|2015-04-12 23:59:10| null| null|
| 19| 2| 3|2015-03-12 23:59:11| null| null|
| 9| 2| 3|2015-04-12 23:59:12| null| null|
| 1| 2| 2|2015-04-12 23:59:13| null| null|
| 1| 2| 2|2015-04-12 23:59:14| null| null|
| 1| 2| 2|2015-04-12 23:59:15| null| null|
| 10| 3| 2|2015-04-12 23:59:16| null| null|
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
| 12| 3| 1|2015-04-12 23:59:18| null| null|
| 13| 3| 1|2015-04-12 23:59:19| null| null|
| 14| 2| 1|2015-04-12 23:59:20| null| null|
+---------+-----------+-----------+-------------------+---------+-------------------+
Now I have to find the first occurring 1 before each 3 in column c. For example:
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
Before this record, I want to know the first 1 that occurred in column c, which is:
| 17| 2| 1|2015-03-12 23:59:09| null| null|
Any help is appreciated
You can use the Spark window function lag (import org.apache.spark.sql.expressions.Window).
In the first step, filter your data on column "c", keeping only the values 1 and 3. You will get data similar to:
dft.show()
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1| 7| 2| 1|
| 2| 15| 2| 3|
| 3| 11| 2| 3|
| 4| 3| 2| 1|
| 5| 8| 2| 3|
+---+---+---+---+
Next, define the window
val w = Window.orderBy("id")
Once this is done, create a new column and put the previous value of "c" in it:
dft.withColumn("prev", lag("c",1).over(w)).show()
+---+---+---+---+----+
| id| a| b| c|prev|
+---+---+---+---+----+
| 1| 7| 2| 1|null|
| 2| 15| 2| 3| 1|
| 3| 11| 2| 3| 3|
| 4| 3| 2| 1| 3|
| 5| 8| 2| 3| 1|
+---+---+---+---+----+
Finally, filter on the values of columns "c" and "prev"; see the sketch below.
Note: combine the steps when writing the final code, so that the filter is applied directly.
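A minimal sketch of that last step, assuming the dft DataFrame and the window w defined above:
import org.apache.spark.sql.functions.{col, lag}

// Keep only the rows where c is 3 and the previous kept row had c = 1.
val result = dft
  .withColumn("prev", lag("c", 1).over(w))
  .filter(col("c") === 3 && col("prev") === 1)
result.show()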

Variable number of arguments for pyspark udf

I have around 275 columns and I would like to search 25 of them for the regex string "^D(410|412)". If this search string is present in any of the 25 columns I would like to set MyNewColumn to true.
Using the code below I can do it for 2 columns. Is there any way to pass a variable number of columns?
The code below works for 2 columns:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def moreThanTwoArgs(col1, col2):
    return bool(re.search("^D(410|412)", col1) or re.search("^D(410|412)", col2))
twoUDF = udf(moreThanTwoArgs, BooleanType())
df = df.withColumn("MyNewColumn", twoUDF(df["X1"], df["X2"]))
I tried something similar; here is some sample code, try this and adapt it to your case:
df1 = sc.parallelize(
    [
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    ]).toDF(['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10'])
df1.show()
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
import pyspark.sql.functions as F
import pyspark.sql.types as T
import re
def booleanFindFunc(*args):
    return sum(args)

# Note: booleanFindFunc is applied directly to Column objects below, so this udf wrapper is not actually used.
udfBoolean = F.udf(booleanFindFunc, T.StringType())

#Below is the sum of three columns (c1 + c2 + c2)
df1.withColumn("MyNewColumn", booleanFindFunc(F.col("c1"), F.col("c2"), F.col("c2"))).show()
+---+---+---+---+---+---+---+---+---+---+-----------+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|MyNewColumn|
+---+---+---+---+---+---+---+---+---+---+-----------+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
+---+---+---+---+---+---+---+---+---+---+-----------+
#Below is the sum of all columns (c1 + c2 + ... + c10)
df1.withColumn("MyNewColumn", booleanFindFunc(*[F.col(i) for i in df1.columns])).show()
+---+---+---+---+---+---+---+---+---+---+-----------+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|MyNewColumn|
+---+---+---+---+---+---+---+---+---+---+-----------+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
+---+---+---+---+---+---+---+---+---+---+-----------+
#Below is the sum of all odd-numbered columns (c1 + c3 + ... + c9)
df1.withColumn("MyNewColumn", booleanFindFunc(*[F.col(i) for i in df1.columns if int(i[1:])%2])).show()
+---+---+---+---+---+---+---+---+---+---+-----------+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|MyNewColumn|
+---+---+---+---+---+---+---+---+---+---+-----------+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
+---+---+---+---+---+---+---+---+---+---+-----------+
Hope this solves your problem. The same *args pattern can be applied to your regex check, as sketched below.
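For the original question, a minimal sketch (assuming your original df, and with cols_to_search as a hypothetical placeholder for the 25 real column names) could look like this:
import re
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Hypothetical list of the columns to scan; replace with your real 25 column names.
cols_to_search = ['X1', 'X2', 'X3']

def any_col_matches(*values):
    # True if the regex matches at least one of the passed column values.
    return bool(any(v is not None and re.search(r"^D(410|412)", str(v)) for v in values))

matchUDF = F.udf(any_col_matches, T.BooleanType())

df = df.withColumn("MyNewColumn", matchUDF(*[F.col(c) for c in cols_to_search]))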