I want to calculate the runtime of multiple recorders. There can be arbitrarily many recorders running at the same time.
When I have a start and an end point, I get the expected result with the following code snippet:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val ds2 = ds
  .withColumn("started", when($"status" === "start", 1).otherwise(lit(0)))
  .withColumn("stopped", when($"status" === "stop", -1).otherwise(lit(0)))
  .withColumn("engFlag", when($"started" === 1, $"started").otherwise($"stopped"))
  // running count of recorders that are currently on
  .withColumn("engWindow", sum($"engFlag").over(Window.orderBy($"timestamp")))
  // minutes until the next event, multiplied by the number of running recorders
  .withColumn("runtime", when($"engWindow" > 0,
    (unix_timestamp(lead($"timestamp", 1).over(Window.orderBy($"timestamp"))) - unix_timestamp($"timestamp")) / 60 * $"engWindow")
    .otherwise(lit(0)))
Input data:
val ds_working = spark.sparkContext.parallelize(Seq(
("2017-01-01 06:00:00", "start", "1"),
("2017-01-01 07:00:00", "start", "2"),
("2017-01-01 08:00:00", "foo", "2"),
("2017-01-01 09:00:00", "blub", "2"),
("2017-01-01 10:00:00", "stop", "3"),
("2017-01-01 11:00:00", null, "3"),
("2017-01-01 12:00:00", "ASC_c", "4"),
("2017-01-01 13:00:00", "stop", "5" ),
("2017-01-01 14:00:00", null, "3"),
("2017-01-01 15:00:00", "ASC_c", "4")
)).toDF("timestamp", "status", "msg")
Output:
+-------------------+------+---+-------+-------+-------+---------+-------+
| timestamp|status|msg|started|stopped|engFlag|engWindow|runtime|
+-------------------+------+---+-------+-------+-------+---------+-------+
|2017-01-01 06:00:00| start| 1| 1| 0| 1| 1| 60.0|
|2017-01-01 07:00:00| start| 2| 1| 0| 1| 2| 120.0|
|2017-01-01 08:00:00| foo| 2| 0| 0| 0| 2| 120.0|
|2017-01-01 09:00:00| blub| 2| 0| 0| 0| 2| 120.0|
|2017-01-01 10:00:00| stop| 3| 0| -1| -1| 1| 60.0|
|2017-01-01 11:00:00| null| 3| 0| 0| 0| 1| 60.0|
|2017-01-01 12:00:00| ASC_c| 4| 0| 0| 0| 1| 60.0|
|2017-01-01 13:00:00| stop| 5| 0| -1| -1| 0| 0.0|
|2017-01-01 14:00:00| null| 3| 0| 0| 0| 0| 0.0|
|2017-01-01 15:00:00| ASC_c| 4| 0| 0| 0| 0| 0.0|
+-------------------+------+---+-------+-------+-------+---------+-------+
Now to my problem:
I have no idea how to calculate the runtime if I start calculating in the middle of a running recorder. In that case I don't see the start flag, only a stop flag, which indicates that a start must have happened before the first record.
Data:
val ds_notworking = spark.sparkContext.parallelize(Seq(
("2017-01-01 02:00:00", "foo", "1"),
("2017-01-01 03:00:00", null, "2"),
("2017-01-01 04:00:00", "stop", "1"),
("2017-01-01 05:00:00", "stop", "2"),
("2017-01-01 06:00:00", "start", "1"),
("2017-01-01 07:00:00", "start", "2"),
("2017-01-01 08:00:00", "foo", "2"),
("2017-01-01 09:00:00", "blub", "2"),
("2017-01-01 10:00:00", "stop", "3"),
("2017-01-01 11:00:00", null, "3"),
("2017-01-01 12:00:00", "ASC_c", "4"),
("2017-01-01 13:00:00", "stop", "5" ),
("2017-01-01 14:00:00", null, "3"),
("2017-01-01 15:00:00", "ASC_c", "4"),
)).toDF("timestamp", "status", "msg")
Wanted output:
+-------------------+------+---+-------+-------+---------+-----+
| timestamp|status|msg|started|stopped|engWindow|runt |
+-------------------+------+---+-------+-------+---------+-----+
|2017-01-01 02:00:00| foo| 1| 0| 0| 0| 120 |
|2017-01-01 03:00:00| null| 2| 0| 0| 0| 120 |
|2017-01-01 04:00:00| stop| 1| 0| -1| -1| 60 |
|2017-01-01 05:00:00| stop| 2| 0| -1| -1| 0 |
|2017-01-01 06:00:00| start| 1| 1| 0| 1| 60 |
|2017-01-01 07:00:00| start| 2| 1| 0| 1| 120 |
|2017-01-01 08:00:00| foo| 2| 0| 0| 0| 120 |
|2017-01-01 09:00:00| blub| 2| 0| 0| 0| 120 |
|2017-01-01 10:00:00| stop| 3| 0| -1| -1| 60 |
|2017-01-01 11:00:00| null| 3| 0| 0| 0| 60 |
|2017-01-01 12:00:00| ASC_c| 4| 0| 0| 0| 60 |
|2017-01-01 13:00:00| stop| 5| 0| -1| -1| 0 |
|2017-01-01 14:00:00| null| 3| 0| 0| 0| 0 |
|2017-01-01 15:00:00| ASC_c| 4| 0| 0| 0| 0 |
+-------------------+------+---+-------+-------+---------+-----+
I have solved this problem for the case where only one recorder instance can run at a time with:
.withColumn("engWindow", last($"engFlag", true).over(systemWindow.rowsBetween(Window.unboundedPreceding, 0)))
But with two or more instances I sadly have no clue how to accomplish this.
It would be nice if someone could point me in the right direction.
I think I found the answer; I was overcomplicating this.
Though I am not yet sure whether there are cases where this approach will not work.
I sum the flags as in the working example, order the data descending by timestamp, take the minimum value seen from each row to the end, and subtract it from the current value (since that minimum is negative, this effectively adds its absolute value). This should always yield the correct number of running recorders.
val ds2 = ds_notworking
  .withColumn("started", when($"status" === "start", 1).otherwise(lit(0)))
  .withColumn("stopped", when($"status" === "stop", -1).otherwise(lit(0)))
  .withColumn("engFlag", when($"started" === 1, $"started").otherwise($"stopped"))
  .withColumn("engWindow", sum($"engFlag").over(Window.orderBy($"timestamp")))
  // shift by the minimum seen from this row to the end (computed via the descending order)
  .withColumn("newEngWindow", $"engWindow" - min($"engWindow").over(Window.orderBy($"timestamp".desc)))
  .withColumn("runtime2", when($"newEngWindow" > 0,
    (unix_timestamp(lead($"timestamp", 1).over(Window.orderBy($"timestamp"))) - unix_timestamp($"timestamp")) / 60 * $"newEngWindow")
    .otherwise(lit(0)))
EDIT: maybe it would be more correct to calculate the minimum over the whole dataset and apply it to the entire window (see also the sketch after the output below):
.withColumn("test1", last(min($"engWindow").over(Window.orderBy($"timestamp"))).over(Window.orderBy($"timestamp").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
Output:
+-------------------+------+---+-------+-------+-------+---------+------------+--------+
| timestamp|status|msg|started|stopped|engFlag|engWindow|newEngWindow|runtime2|
+-------------------+------+---+-------+-------+-------+---------+------------+--------+
|2017-01-01 02:00:00| foo| 1| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 03:00:00| null| 2| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 04:00:00| stop| 1| 0| -1| -1| -1| 1| 60.0|
|2017-01-01 05:00:00| stop| 2| 0| -1| -1| -2| 0| 0.0|
|2017-01-01 06:00:00| start| 1| 1| 0| 1| -1| 1| 60.0|
|2017-01-01 07:00:00| start| 2| 1| 0| 1| 0| 2| 120.0|
|2017-01-01 08:00:00| foo| 2| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 09:00:00| blub| 2| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 10:00:00| stop| 3| 0| -1| -1| -1| 1| 60.0|
|2017-01-01 11:00:00| null| 3| 0| 0| 0| -1| 1| 60.0|
|2017-01-01 12:00:00| ASC_c| 4| 0| 0| 0| -1| 1| 60.0|
|2017-01-01 13:00:00| stop| 5| 0| -1| -1| -2| 0| 0.0|
|2017-01-01 14:00:00| null| 3| 0| 0| 0| -2| 0| 0.0|
|2017-01-01 15:00:00| ASC_c| 4| 0| 0| 0| -2| 0| 0.0|
+-------------------+------+---+-------+-------+-------+---------+------------+--------+
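Regarding the EDIT: a simpler way to obtain the same global offset (a sketch, not verified against this data; fullFrame, globalMin and ds3 are just names introduced here) is to take the minimum over a frame spanning the whole dataset and subtract it:
// frame that covers every row, so min() yields the dataset-wide minimum
val fullFrame = Window.orderBy($"timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val ds3 = ds2
  .withColumn("globalMin", min($"engWindow").over(fullFrame))
  .withColumn("newEngWindow", $"engWindow" - $"globalMin")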
Related
How can I rename columns in a generic way? I want col1, col2, … instead of _1, _2, …
+---+---+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|_11|_12|
+---+---+---+---+---+---+---+---+---+---+---+---+
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
Assuming df is your DataFrame, you can just rename the columns:
for col in df.columns:
    df = df.withColumnRenamed(col, col.replace("_", "col"))
Is there any generic function to assign column names in PySpark? Instead of _1, _2, _3, … it has to give col_1, col_2, col_3, …
+---+---+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|_11|_12|
+---+---+---+---+---+---+---+---+---+---+---+---+
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
+---+---+---+---+---+---+---+---+---+---+---+---+
only showing top 20 rows
Try this-
df.toDF(*["col_{}".format(i) for i in range(1,len(df.columns)+1)])
Below is the DataFrame I have:
df = sqlContext.createDataFrame(
[("0", "0"), ("1", "2"), ("2", "3"), ("3", "4"), ("4", "0"), ("5", "5"), ("6", "5")],
["id", "value"])
+---+-----+
| id|value|
+---+-----+
| 0| 0|
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 0|
| 5| 5|
| 6| 5|
+---+-----+
And what I want to get is:
+---+-----+--------+-------+
| id|value|masterid|partsum|
+---+-----+--------+-------+
|  0|    0|       0|      0|
|  1|    2|       0|      2|
|  2|    3|       0|      5|
|  3|    4|       0|      9|
|  4|    0|       4|      0|
|  5|    5|       4|      5|
|  6|    5|       4|     10|
+---+-----+--------+-------+
So I tried to use Spark SQL to do so:
df = df.withColumn("masterid", F.when(df.value != 0, F.lag(df.id)).otherwise(df.id))
I originally thought the lag function could help me carry the previous row's value forward and so derive the masterid column. Unfortunately, after checking the manual, it can't help here.
So I would like to ask whether there are any special functions I could use to do what I want, or whether there is some kind of "conditional lag" function, so that when I see a non-zero item I can keep lagging until I find a zero.
IIUC, you can try defining a sub-group label (g in the code below) and two window specs:
from pyspark.sql import Window, functions as F
w1 = Window.orderBy('id')
w2 = Window.partitionBy('g').orderBy('id')
df.withColumn('g', F.sum(F.expr('if(value=0,1,0)')).over(w1)).select(
'id'
, 'value'
, F.first('id').over(w2).alias('masterid')
, F.sum('value').over(w2).alias('partsum')
).show()
#+---+-----+--------+-------+
#| id|value|masterid|partsum|
#+---+-----+--------+-------+
#| 0| 0| 0| 0.0|
#| 1| 2| 0| 2.0|
#| 2| 3| 0| 5.0|
#| 3| 4| 0| 9.0|
#| 4| 0| 4| 0.0|
#| 5| 5| 4| 5.0|
#| 6| 5| 4| 10.0|
#+---+-----+--------+-------+
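For reference, the same approach in Scala (a sketch, assuming an equivalent DataFrame df and spark.implicits._ in scope; w1 and w2 mirror the window specs above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.orderBy("id")
val w2 = Window.partitionBy("g").orderBy("id")

df.withColumn("g", sum(when($"value" === 0, 1).otherwise(0)).over(w1))  // a new sub-group starts at every zero
  .select(
    $"id",
    $"value",
    first("id").over(w2).alias("masterid"),  // first id within the sub-group
    sum("value").over(w2).alias("partsum"))  // running sum within the sub-group
  .show()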
Suppose I have a dataframe in Spark as shown below -
val df = Seq(
(0,0,0,0.0),
(1,0,0,0.1),
(0,1,0,0.11),
(0,0,1,0.12),
(1,1,0,0.24),
(1,0,1,0.27),
(0,1,1,0.30),
(1,1,1,0.40)
).toDF("A","B","C","rate")
Here is how it looks -
scala> df.show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 0| 0| 0.0|
| 1| 0| 0| 0.1|
| 0| 1| 0|0.11|
| 0| 0| 1|0.12|
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
| 1| 1| 1| 0.4|
+---+---+---+----+
A, B and C are the advertising channels in this case; 0 and 1 represent the absence and presence of a channel respectively, giving the 2^3 = 8 combinations in the DataFrame.
I want to filter the records from this DataFrame that show the presence of exactly 2 channels at a time (AB, AC, BC). Here is how I want my output to be -
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+
I can write 3 statements to get the output by doing -
scala> df.filter($"A" === 1 && $"B" === 1 && $"C" === 0).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
+---+---+---+----+
scala> df.filter($"A" === 1 && $"B" === 0 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 0| 1|0.27|
+---+---+---+----+
scala> df.filter($"A" === 0 && $"B" === 1 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 1| 1| 0.3|
+---+---+---+----+
However, I want to achieve this using either a single statement that does the job or a function that helps me get the output.
I was thinking of using a case statement to match the values. However, in general my DataFrame might consist of more than 3 channels -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 0| 0| 0.0|
| 0| 0| 0| 1| 0.1|
| 0| 0| 1| 0| 0.1|
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 0| 0.1|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 0| 1| 1| 1| 0.4|
| 1| 0| 0| 0| 0.0|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 0| 1| 1| 0.1|
| 1| 1| 0| 0|0.79|
| 1| 1| 0| 1| 0.1|
| 1| 1| 1| 0| 0.1|
| 1| 1| 1| 1| 0.1|
+---+---+---+---+----+
In this scenario I would want my output as -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 1| 0| 0|0.79|
+---+---+---+---+----+
which shows rates for paired presence of channels => (AB, AC, AD, BC, BD, CD).
Kindly help.
One way could be to sum the columns and then filter only when the result of the sum is 2.
import org.apache.spark.sql.functions._
df.withColumn("res", $"A" + $"B" + $"C").filter($"res" === lit(2)).drop("res").show
The output is:
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+
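To handle the 4-channel case (or any number of channels) without hard-coding column names, the same sum can be built dynamically. A sketch, assuming every column except rate is a 0/1 channel flag; channelCols and presence are just helper names:
import org.apache.spark.sql.functions.col

val channelCols = df.columns.filter(_ != "rate")   // e.g. A, B, C, D
val presence = channelCols.map(col).reduce(_ + _)  // number of channels present in each row
df.filter(presence === 2).show()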
I have the following DataFrame in Spark 2.2 and Scala 2.11.8:
+--------+---------+-------+-------+----+-------+
|event_id|person_id|channel| group|num1| num2|
+--------+---------+-------+-------+----+-------+
| 560| 9410| web| G1| 0| 5|
| 290| 1430| web| G1| 0| 3|
| 470| 1370| web| G2| 0| 18|
| 290| 1430| web| G2| 0| 5|
| 290| 1430| mob| G2| 1| 2|
+--------+---------+-------+-------+----+-------+
Here is the code to create an equivalent DataFrame (the snippet below uses PySpark syntax):
df = sqlCtx.createDataFrame(
[(560,9410,"web","G1",0,5),
(290,1430,"web","G1",0,3),
(470,1370,"web","G2",0,18),
(290,1430,"web","G2",0,5),
(290,1430,"mob","G2",1,2)],
["event_id","person_id","channel","group","num1","num2"]
)
The column group can only have two values: G1 and G2. I need to transform these values of the column group into new columns as follows:
+--------+---------+-------+--------+-------+--------+-------+
|event_id|person_id|channel| num1_G1|num2_G1| num1_G2|num2_G2|
+--------+---------+-------+--------+-------+--------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 0|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| web| 0| 0| 0| 5|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+--------+-------+--------+-------+
How can I do it?
AFAIK (at least I couldn't find a way to perform a pivot without aggregation), we must use an aggregation function when pivoting in Spark.
Scala version:
scala> df.groupBy("event_id","person_id","channel")
.pivot("group")
.agg(max("num1") as "num1", max("num2") as "num2")
.na.fill(0)
.show
+--------+---------+-------+-------+-------+-------+-------+
|event_id|person_id|channel|G1_num1|G1_num2|G2_num1|G2_num2|
+--------+---------+-------+-------+-------+-------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 5|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+-------+-------+-------+-------+
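As a side note (a sketch, assuming the group values G1 and G2 are known up front): passing the values to pivot explicitly saves Spark the extra pass it otherwise needs to discover them:
import org.apache.spark.sql.functions.max

df.groupBy("event_id", "person_id", "channel")
  .pivot("group", Seq("G1", "G2"))   // explicit pivot values, no distinct-value scan
  .agg(max("num1") as "num1", max("num2") as "num2")
  .na.fill(0)
  .show()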