Calculating and aggregating data by date/time - pyspark

I am working with a dataframe like this:
Id | TimeStamp | Event | DeviceId
1 | 5.2.2019 8:00:00 | connect | 1
2 | 5.2.2019 8:00:05 | disconnect| 1
I am using databricks and pyspark to do the ETL process. How can I calculate and create such a dataframe as shown at the bottom? I have already tried using a UDF but I could not find a way to make it work. I also tried to do it by iterating over the whole data frame, but this is extremely slow.
I want to aggregate this dataframe to get a new dataframe which tells me the times, how long each device has been connected and disconnected:
Id | StartDateTime | EndDateTime | EventDuration |State | DeviceId
1 | 5.2.19 8:00:00 | 5.2.19 8:00:05| 0.00:00:05 |connected| 1

I think you can make this work with a window function and some further column creations with withColumn.
The code I did should create the mapping for devices and create a table with the duration for each state. The only requirement is that connect and disconnect appear alternatively.
Then you can use the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import datetime
test_df = sqlContext.createDataFrame([(1,datetime.datetime(2019,2,5,8),"connect",1),
(2,datetime.datetime(2019,2,5,8,0,5),"disconnect",1),
(3,datetime.datetime(2019,2,5,8,10),"connect",1),
(4,datetime.datetime(2019,2,5,8,20),"disconnect",1),],
["Id","TimeStamp","Event","DeviceId"])
#creation of dataframe with 4 events for 1 device
test_df.show()
Output:
+---+-------------------+----------+--------+
| Id| TimeStamp| Event|DeviceId|
+---+-------------------+----------+--------+
| 1|2019-02-05 08:00:00| connect| 1|
| 2|2019-02-05 08:00:05|disconnect| 1|
| 3|2019-02-05 08:10:00| connect| 1|
| 4|2019-02-05 08:20:00|disconnect| 1|
+---+-------------------+----------+--------+
Then you can create the helper functions and the window:
my_window = Window.partitionBy("DeviceId").orderBy(col("TimeStamp").desc()) #create window
get_prev_time = lag(col("Timestamp"),1).over(my_window) #get previous timestamp
time_diff = get_prev_time.cast("long") - col("TimeStamp").cast("long") #compute duration
test_df.withColumn("EventDuration",time_diff)\
.withColumn("EndDateTime",get_prev_time)\ #apply the helper functions
.withColumnRenamed("TimeStamp","StartDateTime")\ #rename according to your schema
.withColumn("State",when(col("Event")=="connect", "connected").otherwise("disconnected"))\ #create the state column
.filter(col("EventDuration").isNotNull()).select("Id","StartDateTime","EndDateTime","EventDuration","State","DeviceId").show()
#finally some filtering for the last events, which do not have a previous time
Output:
+---+-------------------+-------------------+-------------+------------+--------+
| Id| StartDateTime| EndDateTime|EventDuration| State|DeviceId|
+---+-------------------+-------------------+-------------+------------+--------+
| 3|2019-02-05 08:10:00|2019-02-05 08:20:00| 600| connected| 1|
| 2|2019-02-05 08:00:05|2019-02-05 08:10:00| 595|disconnected| 1|
| 1|2019-02-05 08:00:00|2019-02-05 08:00:05| 5| connected| 1|
+---+-------------------+-------------------+-------------+------------+--------+

Related

assign values to a new column depending on old column values in dataframe

I have assigned values to 4 variables in a conf or application.properties file,
A = 1
B = 2
C = 3
D = 4
I have a dataframe as follows,
+-----+
|name |
+-----+
| A |
| C |
| B |
| D |
| B |
+-----+
I want to add a new column that has the values assigned from the conf variables declared above for A,B,C,D respectively depending on the value in the name column.
Final Dataframe should have,
+----+----------+
|name|NAME_VALUE|
+----+----------+
| A | 1 |
| C | 3 |
| B | 2 |
| D | 4 |
| B | 2 |
+----+----------+
I tried lit function in .WITHCOLUMN with conf.getint($name), not accepting Column in lit func requires string, I have to hardcode the variable names in lit. Is there anyway for me to dynamically assign those respective conf variable names in LIT so it can automatically assign values to another column in spark scala?
For this moment i dont have any ideas how to do it as you intended with dynamic usage of vals names.
My proposition is to use a seq of tuples instead of multiple vals, in such case you can create some udf and try to map this value for each row, but you can also use join which i am showing in below example:
val data = Seq(("A"),("C"), ("B"), ("D"), ("B"))
val df = data.toDF("name")
val mappings = Seq(("A",1), ("B",2), ("C",3), ("D",4))
val mappingsDf = mappings.toDF("name", "value")
df.join(broadcast(mappingsDf), df("name") === mappingsDf("name"), "left")
.select(
df("name"),
mappingsDf("value")
).show
output is as expected:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| C| 3|
| B| 2|
| D| 4|
| B| 2|
+----+-----+
This solution is pretty generic as your mapping are df here so you can hardcode them as showed in my example or load them from some csv or json easily with spark api
Due to broadcast join it should be quite efficient (you should remove this hint if you want to use big amount of mappings!)
I think its easy to understand and maintain as its not udf but only Spark api

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to Pyspark and trying to transform data
Given dataframe
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3b
id4a null null
I have tried pivot, but that gives first value.
There might be a better way , however an approach is splitting the column on spaces to create array of the entries and then using higher order functions(spark 2.4+) to split on the '=' for each entry in the splitted array .Then explode and create 2 columns one with the id and one with the value. Then we can assign a row number to each partition and groupby then pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1",F.split(F.col("Col1"),"\s+")).withColumn("Col1",
F.explode(F.expr("transform(Col1,x->split(x,'='))")))
.select(F.col("Col1")[0].alias("cols"),F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum",F.row_number().over(w)).groupBy("Rnum")
.pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
this is how df1 looks like after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
May be I don't know the full picture, but the data format seems to be strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this.
import pyspark.sql.functions as F
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----

How to check whether a the whole column in a pyspark contains a value using Expr

In pyspark how can i use expr to check whether a whole column contains the value in columnA of that row.
pseudo code below
df=df.withColumn("Result", expr(if any the rows in column1 contains the value colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
Application of expr() here. How one can use expr(), just look for the corresponding SQL syntax and it should work with expr() mostly.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+

Apache spark aggregation: aggregate column based on another column value

I am not sure if I am asking this correctly and maybe that is the reason why I didn't find the correct answer so far. Anyway, if it will be duplicate I will delete this question.
I have following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group by this data by "id" column, get max value from "last_updated" and regarding "count" column I want keep value from row where "last_updated" has max value. So in that case result should be like that:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look like that:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on "last_updated" column.
I am using spark 2.4.0.
Thanks for any help
You have two options, the first the better as for my understanding
OPTION 1
Perform a window function over the ID, create a column with the max value over that window function. Then select where the desired column equals the max value and finally drop the column and rename the max column as desired
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated")
.join(df, Seq("id", "last_updated")).show
+---+-----------------+----------+
| id| last_updated| count |
+---+-----------------+----------+
| 1| 20190301| 1|
+---+-----------------+----------+

pyspark - Can I use substring of value as a key of groupBy() function?

I have a dataframe looks like this:
datetime | ID |
======================
20180201000000 | 275 |
20171231113024 | 534 |
20180201220000 | 275 |
20170205000000 | 28 |
what I want to do is to count by ID, monthly.
this way was perfactly worked :
add column of month by extracting from datetime column :
new_df = df.withColumn('month', df.datetime.substr(0,6))
count by ID & month :
count_df = new_df.groupBy('ID','month').count()
but is there a way to use substring of certain column values as an argument of groupBy() function? like :
`count_df = df.groupBy('ID', df.datetime.substr(0,6)).count()`
at least, this code didn't work.
if there exist the way to use substring of values, don't need to add new column and save much of resources(in case of big data).
but even if this approach is wrong, do you have a better idea to get same result?
Try this
>>> df.show()
+--------------+---+
| datetime| id|
+--------------+---+
|20180201000000|275|
|20171231113024|534|
|20180201220000|275|
|20170205000000| 28|
+--------------+---+
>>> df.groupBy('id',df.datetime.substr(0,6)).agg(count('id')).show()
+---+-----------------------+---------+
| id|substring(datetime,0,6)|count(id)|
+---+-----------------------+---------+
|275| 201802| 2|
|534| 201712| 1|
| 28| 201702| 1|
+---+-----------------------+---------+