Related
I am trying to group by the values of metric (which can be I or M) and compute a sum of x for each group. The result should be stored in every row of its respective group. In R I would normally do this with group and then ungroup, but I don't know the equivalent in PySpark. Any advice?
from pyspark.sql.types import DoubleType
data = [["1", "Amit", "DU", "I", "8", "6"],
["2", "Mohit", "DU", "I", "4", "2"],
["3", "rohith", "BHU", "I", "5", "3"],
["4", "sridevi", "LPU", "I", "1", "6"],
["1", "sravan", "KLMP", "M", "2", "4"],
["5", "gnanesh", "IIT", "M", "6", "8"],
["6", "gnadesh", "KLM", "M","0", "9"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
dataframe = dataframe.withColumn("x",dataframe.x.cast(DoubleType()))
This is what the data looks like:
+---+-------+-------+------+----+---+
| ID| NAME|college|metric| x| y|
+---+-------+-------+------+----+---+
| 1| Amit| DU| I| 8| 6|
| 2| Mohit| DU| I| 4| 2|
| 3| rohith| BHU| I| 5| 3|
| 4|sridevi| LPU| I| 1| 6|
| 1| sravan| KLMP| M| 2| 4|
| 5|gnanesh| IIT| M| 6| 8|
| 6|gnadesh| KLM| M| 0| 9|
+---+-------+-------+------+----+---+
Expected output
+---+-------+-------+------+----+---+------+
| ID| NAME|college|metric| x| y| total|
+---+-------+-------+------+----+---+------+
| 1| Amit| DU| I| 8| 6| 18 |
| 2| Mohit| DU| I| 4| 2| 18 |
| 3| rohith| BHU| I| 5| 3| 18 |
| 4|sridevi| LPU| I| 1| 6| 18 |
| 1| sravan| KLMP| M| 2| 4| 8 |
| 5|gnanesh| IIT| M| 6| 8| 8 |
| 6|gnadesh| KLM| M| 0| 9| 8 |
+---+-------+-------+------+----+---+------+
I tried this, but it does not work:
dataframe.withColumn("total",dataframe.groupBy("metric").sum("x"))
You can group the dataframe by metric, compute the total, and then join the aggregated result back to the original dataframe:
from pyspark.sql import functions as F

metric_sum_df = dataframe.groupby('metric').agg(F.sum('x').alias('total'))
total_df = dataframe.join(metric_sum_df, 'metric')
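Alternatively, a window partitioned by metric avoids the join; a minimal sketch, assuming the same dataframe and column names as above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# sum x within each metric group and attach the result to every row
w = Window.partitionBy("metric")
total_df = dataframe.withColumn("total", F.sum("x").over(w))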
I have a sample dataset with salaries. I want to distribute the salaries into 3 buckets, find the lowest salary in each bucket, collect those lower bounds into an array, and attach that array to every row of the original set. I am trying to use window functions to do that, but the array is built up progressively instead of being the same for every row.
Here is the code that I have written
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = sparkSession
import spark.implicits._
val simpleData = Seq(("James", "Sales", 3000),
("Michael", "Sales", 3100),
("Robert", "Sales", 3200),
("Maria", "Finance", 3300),
("James", "Sales", 3400),
("Scott", "Finance", 3500),
("Jen", "Finance", 3600),
("Jeff", "Marketing", 3700),
("Kumar", "Marketing", 3800),
("Saif", "Sales", 3900)
)
val df = simpleData.toDF("employee_name", "department", "salary")
val windowSpec = Window.orderBy("salary")
val ntileFrame = df.withColumn("ntile", ntile(3).over(windowSpec))
val lowWindowSpec = Window.partitionBy("ntile")
val ntileMinDf = ntileFrame.withColumn("lower_bound", min("salary").over(lowWindowSpec))
var rangeDf = ntileMinDf.withColumn("range", collect_set("lower_bound").over(windowSpec))
rangeDf.show()
I am getting the dataset like this
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000| [3000]|
| Michael| Sales| 3100| 1| 3000| [3000]|
| Robert| Sales| 3200| 1| 3000| [3000]|
| Maria| Finance| 3300| 1| 3000| [3000]|
| James| Sales| 3400| 2| 3400| [3000, 3400]|
| Scott| Finance| 3500| 2| 3400| [3000, 3400]|
| Jen| Finance| 3600| 2| 3400| [3000, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
I am expecting the dataset to look like this
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000|[3000, 3700, 3400]|
| Michael| Sales| 3100| 1| 3000|[3000, 3700, 3400]|
| Robert| Sales| 3200| 1| 3000|[3000, 3700, 3400]|
| Maria| Finance| 3300| 1| 3000|[3000, 3700, 3400]|
| James| Sales| 3400| 2| 3400|[3000, 3700, 3400]|
| Scott| Finance| 3500| 2| 3400|[3000, 3700, 3400]|
| Jen| Finance| 3600| 2| 3400|[3000, 3700, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
To ensure that your window takes into account all rows and not only the rows up to the current one (which is the default frame of an ordered window), you can use the rowsBetween method with Window.unboundedPreceding and Window.unboundedFollowing as arguments. Your last line thus becomes:
var rangeDf = ntileMinDf.withColumn(
"range",
collect_set("lower_bound")
.over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
)
and you get the following rangeDf dataframe:
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000|[3000, 3700, 3400]|
| Michael| Sales| 3100| 1| 3000|[3000, 3700, 3400]|
| Robert| Sales| 3200| 1| 3000|[3000, 3700, 3400]|
| Maria| Finance| 3300| 1| 3000|[3000, 3700, 3400]|
| James| Sales| 3400| 2| 3400|[3000, 3700, 3400]|
| Scott| Finance| 3500| 2| 3400|[3000, 3700, 3400]|
| Jen| Finance| 3600| 2| 3400|[3000, 3700, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
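For reference, a rough PySpark equivalent of the same fix would look like this (a sketch, assuming the intermediate dataframe is called ntileMinDf as in the Scala code above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# frame spanning the whole dataset, so collect_set sees every lower_bound
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
rangeDf = ntileMinDf.withColumn("range", F.collect_set("lower_bound").over(w_all))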
I am trying to get the time difference "time_d" in seconds between consecutive timestamps within each "name" in PySpark.
+-------------------+----+
| timestamplast|name|
+-------------------+----+
|2019-08-01 00:00:00| 1|
|2019-08-01 00:01:00| 1|
|2019-08-01 00:01:15| 1|
|2019-08-01 03:00:00| 2|
|2019-08-01 04:00:00| 2|
|2019-08-01 00:15:00| 3|
+-------------------+----+
Output should look like:
+-------------------+----+--------+
| timestamplast|name| time_d |
+-------------------+----+--------+
|2019-08-01 00:00:00| 1| 0 |
|2019-08-01 00:01:00| 1| 60 |
|2019-08-01 00:01:15| 1| 15 |
|2019-08-01 03:00:00| 2| 0 |
|2019-08-01 04:00:00| 2| 3600 |
|2019-08-01 00:15:00| 3| 0 |
+-------------------+----+--------+
In Pandas this would be:
df['time_d'] = df.groupby("name")['timestamplast'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()
How would this be done in Pyspark?
You can use a lag window function (partitioned by name) and then compute the difference using the timestamp in seconds (unix_timestamp):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("name").orderBy(F.col("timestamplast"))
df.withColumn("time_d", F.lag(F.unix_timestamp("timestamplast")).over(w))\
.withColumn("time_d", F.when(F.col("time_d").isNotNull(), F.unix_timestamp("timestamplast")-F.col("time_d"))\
.otherwise(F.lit(0))).orderBy("name","timestamplast").show()
#+-------------------+----+------+
#| timestamplast|name|time_d|
#+-------------------+----+------+
#|2019-08-01 00:00:00| 1| 0|
#|2019-08-01 00:01:00| 1| 60|
#|2019-08-01 00:01:15| 1| 15|
#|2019-08-01 03:00:00| 2| 0|
#|2019-08-01 04:00:00| 2| 3600|
#|2019-08-01 00:15:00| 3| 0|
#+-------------------+----+------+
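A slightly more compact variant of the same idea, using coalesce to turn the null produced by lag on the first row of each group into 0 (a sketch, equivalent to the code above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("name").orderBy("timestamplast")
ts = F.unix_timestamp("timestamplast")

# difference to the previous timestamp within each name; first row falls back to 0
df.withColumn("time_d", F.coalesce(ts - F.lag(ts).over(w), F.lit(0))).orderBy("name", "timestamplast").show()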
I have got a dataframe like this:
client_username|workstation|session_duration|access_point_name|start_date|
XX1#AD |Apple |1.55 |idf_1 |2019-06-01|
XX2#AD |Apple |30.12 |idf_2 |2019-06-04|
XX3#AD |Apple |78.25 |idf_3 |2019-06-02|
XX4#AD |Apple |0.45 |idf_1 |2019-06-02|
XX1#AD |Apple |23.11 |idf_1 |2019-06-02|
client_username - id of user in domain
workstation - user workstation
session_duration - duration (in hours) of the active session (the user is logged on to their host)
access_point_name - the name of access point that supplies the network to users host
start_date - start session
I would like to achieve dataframe like this:
client_username|workstation|session_duration|access_point_name|start_date|
XX1#AD |Apple |1.55 |idf_1 |2019-06-01|
XX2#AD |Apple |8 |idf_2 |2019-06-04|
XX2#AD |Apple |8 |idf_2 |2019-06-05|
XX3#AD |Apple |8 |idf_3 |2019-06-02|
XX3#AD |Apple |8 |idf_3 |2019-06-03|
XX3#AD |Apple |8 |idf_3 |2019-06-04|
XX3#AD |Apple |8 |idf_3 |2019-06-05|
XX4#AD |Apple |0.45 |idf_1 |2019-06-02|
XX1#AD |Apple |23.11 |idf_1 |2019-06-02|
The idea is as follows:
* if the length of a session is over 24 hours but less than 48 hours, I would like to change this:
XX2#AD |Apple |30.12 |idf_2 |2019-06-04|
to it:
XX2#AD |Apple |8 |idf_2 |2019-06-04|
XX2#AD |Apple |8 |idf_2 |2019-06-05|
The duration of the session changes to 8 hours, and the session now spans two days (2019-06-04 and 2019-06-05).
The same applies analogously for durations above 48 hours (3 days), 72 hours (4 days), etc.
I'm starting to learn PySpark. I tried using union or crossJoin on the dataframe, but this is too complicated for me at the moment. I would like to do this task in PySpark.
Here are some methods you can try:
Method-1: string functions: repeat, substring
* calculate the number of repeats: n = ceil(session_duration/24)
* create a string a which repeats the substring "8," n times, and then use substring() or regexp_replace() to remove the trailing comma ,
* split a by comma and then posexplode it into rows of pos and session_duration
* adjust start_date by pos days from the above step
* cast the string session_duration into double
see the code example below:
from pyspark.sql import functions as F
# assume the columns in your dataframe are read with proper data types
# for example using inferSchema=True
df = spark.read.csv('/path/to/file', header=True, inferSchema=True)
df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
.withColumn('a', F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration')))
>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
|client_username|workstation|session_duration|access_point_name| start_date| n| a|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
| XX1#AD| Apple| 1.55| idf_1|2019-06-01 00:00:00| 1| 1.55|
| XX2#AD| Apple| 30.12| idf_2|2019-06-04 00:00:00| 2| 8,8|
| XX3#AD| Apple| 78.25| idf_3|2019-06-02 00:00:00| 4|8,8,8,8|
| XX4#AD| Apple| 0.45| idf_1|2019-06-02 00:00:00| 1| 0.45|
| XX1#AD| Apple| 23.11| idf_1|2019-06-02 00:00:00| 1| 23.11|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
df_new = df1.select(
'client_username'
, 'workstation'
, F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
, 'access_point_name'
, F.expr('date_add(start_date, pos)').alias('start_date')
).drop('pos')
>>> df_new.show()
+---------------+-----------+----------------+-----------------+----------+
|client_username|workstation|session_duration|access_point_name|start_date|
+---------------+-----------+----------------+-----------------+----------+
| XX1#AD| Apple| 1.55| idf_1|2019-06-01|
| XX2#AD| Apple| 8| idf_2|2019-06-04|
| XX2#AD| Apple| 8| idf_2|2019-06-05|
| XX3#AD| Apple| 8| idf_3|2019-06-02|
| XX3#AD| Apple| 8| idf_3|2019-06-03|
| XX3#AD| Apple| 8| idf_3|2019-06-04|
| XX3#AD| Apple| 8| idf_3|2019-06-05|
| XX4#AD| Apple| 0.45| idf_1|2019-06-02|
| XX1#AD| Apple| 23.11| idf_1|2019-06-02|
+---------------+-----------+----------------+-----------------+----------+
The above code can also be written into one chain:
df_new = df.withColumn('n'
, F.ceil(F.col('session_duration')/24).astype('int')
).withColumn('a'
, F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration'))
).select('client_username'
, 'workstation'
, F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
, 'access_point_name'
, F.expr('date_add(start_date, pos)').alias('start_date')
).withColumn('session_duration'
, F.col('session_duration').astype('double')
).drop('pos')
Method-2: array function array_repeat (pyspark 2.4+)
Similar to Method-1, but a is already an array, so there is no need to split a string into an array:
df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
.withColumn('a', F.when(F.col('n')>1, F.expr('array_repeat(8,n)')).otherwise(F.array('session_duration')))
>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
|client_username|workstation|session_duration|access_point_name| start_date| n| a|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
| XX1#AD| Apple| 1.55| idf_1|2019-06-01 00:00:00| 1| [1.55]|
| XX2#AD| Apple| 30.12| idf_2|2019-06-04 00:00:00| 2| [8.0, 8.0]|
| XX3#AD| Apple| 78.25| idf_3|2019-06-02 00:00:00| 4|[8.0, 8.0, 8.0, 8.0]|
| XX4#AD| Apple| 0.45| idf_1|2019-06-02 00:00:00| 1| [0.45]|
| XX1#AD| Apple| 23.11| idf_1|2019-06-02 00:00:00| 1| [23.11]|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
df_new = df1.select('client_username'
, 'workstation'
, F.posexplode('a').alias('pos', 'session_duration')
, 'access_point_name'
, F.expr('date_add(start_date, pos)').alias('start_date')
).drop('pos')
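As in Method-1, you can add an explicit cast at the end if you want session_duration to be a double (with array_repeat the exploded values should already be doubles after the when/otherwise type coercion, so this is mostly for clarity):
df_new = df_new.withColumn('session_duration', F.col('session_duration').astype('double'))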
I want to calculate the runtime of multiple recorders. Any number of recorders can be running at the same time.
When I have a start and an end point, I get the expected result with the following code snippet.
val ds2 = ds
.withColumn("started", when($"status" === "start", 1).otherwise(lit(0)))
.withColumn("stopped", when($"status" === "stop", -1).otherwise(lit(0)))
.withColumn("engFlag", when($"started" === 1, $"started").otherwise($"stopped"))
.withColumn("engWindow", sum($"engFlag").over(Window.orderBy($"timestamp")))
.withColumn("runtime", when($"engWindow" > 0,
(unix_timestamp(lead($"timestamp", 1).over(Window.orderBy($"timestamp"))) - unix_timestamp($"timestamp"))/60*$"engWindow").otherwise(lit(0)))
Input data:
val ds_working = spark.sparkContext.parallelize(Seq(
("2017-01-01 06:00:00", "start", "1"),
("2017-01-01 07:00:00", "start", "2"),
("2017-01-01 08:00:00", "foo", "2"),
("2017-01-01 09:00:00", "blub", "2"),
("2017-01-01 10:00:00", "stop", "3"),
("2017-01-01 11:00:00", null, "3"),
("2017-01-01 12:00:00", "ASC_c", "4"),
("2017-01-01 13:00:00", "stop", "5" ),
("2017-01-01 14:00:00", null, "3"),
("2017-01-01 15:00:00", "ASC_c", "4")
)).toDF("timestamp", "status", "msg")
Output:
+-------------------+------+---+-------+-------+-------+---------+-------+
| timestamp|status|msg|started|stopped|engFlag|engWindow|runtime|
+-------------------+------+---+-------+-------+-------+---------+-------+
|2017-01-01 06:00:00| start| 1| 1| 0| 1| 1| 60.0|
|2017-01-01 07:00:00| start| 2| 1| 0| 1| 2| 120.0|
|2017-01-01 08:00:00| foo| 2| 0| 0| 0| 2| 120.0|
|2017-01-01 09:00:00| blub| 2| 0| 0| 0| 2| 120.0|
|2017-01-01 10:00:00| stop| 3| 0| -1| -1| 1| 60.0|
|2017-01-01 11:00:00| null| 3| 0| 0| 0| 1| 60.0|
|2017-01-01 12:00:00| ASC_c| 4| 0| 0| 0| 1| 60.0|
|2017-01-01 13:00:00| stop| 5| 0| -1| -1| 0| 0.0|
|2017-01-01 14:00:00| null| 3| 0| 0| 0| 0| 0.0|
|2017-01-01 15:00:00| ASC_c| 4| 0| 0| 0| 0| 0.0|
+-------------------+------+---+-------+-------+-------+---------+-------+
Now to my problem:
I have no idea how to calculate the runtime if I start calculating in the middle of a running recording, i.e. I don't see the start flag, only a stop flag, which indicates that a start must have happened in the past.
Data:
val ds_notworking = spark.sparkContext.parallelize(Seq(
("2017-01-01 02:00:00", "foo", "1"),
("2017-01-01 03:00:00", null, "2"),
("2017-01-01 04:00:00", "stop", "1"),
("2017-01-01 05:00:00", "stop", "2"),
("2017-01-01 06:00:00", "start", "1"),
("2017-01-01 07:00:00", "start", "2"),
("2017-01-01 08:00:00", "foo", "2"),
("2017-01-01 09:00:00", "blub", "2"),
("2017-01-01 10:00:00", "stop", "3"),
("2017-01-01 11:00:00", null, "3"),
("2017-01-01 12:00:00", "ASC_c", "4"),
("2017-01-01 13:00:00", "stop", "5" ),
("2017-01-01 14:00:00", null, "3"),
("2017-01-01 15:00:00", "ASC_c", "4"),
)).toDF("timestamp", "status", "msg")
Wanted output:
+-------------------+------+---+-------+-------+---------+-----+
| timestamp|status|msg|started|stopped|engWindow|runt |
+-------------------+------+---+-------+-------+---------+-----+
|2017-01-01 02:00:00| foo| 1| 0| 0| 0| 120 |
|2017-01-01 03:00:00| null| 2| 0| 0| 0| 120 |
|2017-01-01 04:00:00| stop| 1| 0| -1| -1| 60 |
|2017-01-01 05:00:00| stop| 2| 0| -1| -1| 0 |
|2017-01-01 06:00:00| start| 1| 1| 0| 1| 60 |
|2017-01-01 07:00:00| start| 2| 1| 0| 1| 120 |
|2017-01-01 08:00:00| foo| 2| 0| 0| 0| 120 |
|2017-01-01 09:00:00| blub| 2| 0| 0| 0| 120 |
|2017-01-01 10:00:00| stop| 3| 0| -1| -1| 60 |
|2017-01-01 11:00:00| null| 3| 0| 0| 0| 60 |
|2017-01-01 12:00:00| ASC_c| 4| 0| 0| 0| 60 |
|2017-01-01 13:00:00| stop| 5| 0| -1| -1| 0 |
|2017-01-01 14:00:00| null| 3| 0| 0| 0| 0 |
|2017-01-01 15:00:00| ASC_c| 4| 0| 0| 0| 0 |
+-------------------+------+---+-------+-------+---------+-----+
I have solved this problem for the case where only one recorder instance can run at a time with:
.withColumn("engWindow", last($"engFlag", true).over(systemWindow.rowsBetween(Window.unboundedPreceding, 0)))
But with two or more instances, sadly, I have no clue how to accomplish this.
It would be nice if someone could point me in the right direction.
I think I found the answer. I was overcomplicating this.
Though I am not sure yet whether there are any cases where this approach will not work.
I sum the flags like I did in the working example, then, over the data ordered descending by timestamp, take the minimum of that running sum and subtract it from the current value (the minimum is negative or zero, so this effectively shifts everything up). This should always yield the correct number of running recorders.
val ds2 = ds_notworking
.withColumn("started", when($"status" === "start", 1).otherwise(lit(0)))
.withColumn("stopped", when($"status" === "stop", -1).otherwise(lit(0)))
.withColumn("engFlag", when($"started" === 1, $"started").otherwise($"stopped"))
.withColumn("engWindow", sum($"engFlag").over(Window.orderBy($"timestamp")))
.withColumn("newEngWindow", $"engWindow" - min($"engWindow").over(Window.orderBy($"timestamp".desc)))
.withColumn("runtime2", when($"newEngWindow" > 0,
(unix_timestamp(lead($"timestamp", 1).over(Window.orderBy($"timestamp"))) - unix_timestamp($"timestamp"))/60*$"newEngWindow").otherwise(lit(0)))
EDIT: maybe it would be more correct to calculate the minimum value over the entire window and apply it to every row:
.withColumn("test1", last(min($"engWindow").over(Window.orderBy($"timestamp"))).over(Window.orderBy($"timestamp").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
Output:
+-------------------+------+---+-------+-------+-------+---------+------------+--------+
| timestamp|status|msg|started|stopped|engFlag|engWindow|newEngWindow|runtime2|
+-------------------+------+---+-------+-------+-------+---------+------------+--------+
|2017-01-01 02:00:00| foo| 1| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 03:00:00| null| 2| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 04:00:00| stop| 1| 0| -1| -1| -1| 1| 60.0|
|2017-01-01 05:00:00| stop| 2| 0| -1| -1| -2| 0| 0.0|
|2017-01-01 06:00:00| start| 1| 1| 0| 1| -1| 1| 60.0|
|2017-01-01 07:00:00| start| 2| 1| 0| 1| 0| 2| 120.0|
|2017-01-01 08:00:00| foo| 2| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 09:00:00| blub| 2| 0| 0| 0| 0| 2| 120.0|
|2017-01-01 10:00:00| stop| 3| 0| -1| -1| -1| 1| 60.0|
|2017-01-01 11:00:00| null| 3| 0| 0| 0| -1| 1| 60.0|
|2017-01-01 12:00:00| ASC_c| 4| 0| 0| 0| -1| 1| 60.0|
|2017-01-01 13:00:00| stop| 5| 0| -1| -1| -2| 0| 0.0|
|2017-01-01 14:00:00| null| 3| 0| 0| 0| -2| 0| 0.0|
|2017-01-01 15:00:00| ASC_c| 4| 0| 0| 0| -2| 0| 0.0|
+-------------------+------+---+-------+-------+-------+---------+------------+--------+
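For reference, a rough PySpark sketch of the EDIT variant (shifting the running sum by its global minimum; assuming the same data is loaded into a PySpark dataframe called ds_notworking):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_ts = Window.orderBy("timestamp")
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

result = (ds_notworking
    .withColumn("engFlag", F.when(F.col("status") == "start", 1)
                            .when(F.col("status") == "stop", -1)
                            .otherwise(0))
    # running count of recorders, which can go negative when starts are missing
    .withColumn("engWindow", F.sum("engFlag").over(w_ts))
    # shift by the global minimum so the count is never negative
    .withColumn("newEngWindow", F.col("engWindow") - F.min("engWindow").over(w_all))
    .withColumn("runtime", F.when(F.col("newEngWindow") > 0,
            (F.unix_timestamp(F.lead("timestamp", 1).over(w_ts))
             - F.unix_timestamp("timestamp")) / 60 * F.col("newEngWindow"))
        .otherwise(F.lit(0))))
result.orderBy("timestamp").show()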