Pyspark calculated field based off time difference - pyspark

I have a table that looks like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime|
+-------------+----------------------+----------------------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 |
In the end, I need to create a speed column for each row, something like this:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime| speed |
+-------------+----------------------+----------------------+-------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 13.5 |
This is what I'm trying in order to get there. I figured I should add an interim column called trip_time, calculated as tpep_dropoff_datetime - tpep_pickup_datetime. Here is the code I'm using to get that:
df4 = df.withColumn('trip_time', df.tpep_dropoff_datetime - df.tpep_pickup_datetime)
which is producing a nice trip_time column:
trip_distance | tpep_pickup_datetime | tpep_dropoff_datetime| trip_time|
+-------------+----------------------+----------------------+-----------------------+
1.5 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 6 minutes 40 seconds|
But now I want to add the speed column, and this is how I'm trying to do that:
df4 = df4.withColumn('speed', (F.col('trip_distance') / F.col('trip_time')))
But that is giving me this error:
AnalysisException: cannot resolve '(trip_distance/trip_time)' due to data type mismatch: differing types in '(trip_distance/trip_time)' (float and interval).;;
Is there a better way?

One option is to convert both timestamps with unix_timestamp, which returns seconds since the epoch. Subtracting them gives the trip duration as an integer number of seconds, which can then be used to calculate the speed (the factor 3600 converts seconds to hours):
import pyspark.sql.functions as f
df.withColumn('speed', f.col('trip_distance') * 3600 / (
f.unix_timestamp('tpep_dropoff_datetime') - f.unix_timestamp('tpep_pickup_datetime'))
).show()
+-------------+--------------------+---------------------+-----+
|trip_distance|tpep_pickup_datetime|tpep_dropoff_datetime|speed|
+-------------+--------------------+---------------------+-----+
| 1.5| 2019-01-01 00:46:40| 2019-01-01 00:53:20| 13.5|
+-------------+--------------------+---------------------+-----+
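The arithmetic in this answer can be sanity-checked in plain Python for the sample row, without a Spark session. This is only a sketch: the helper name speed_mph is made up for illustration, it assumes trip_distance is in miles (so the result is miles per hour), and returning None for zero-length trips is a choice made here, not something the answer above does.

```python
from datetime import datetime


def speed_mph(distance_miles, pickup, dropoff):
    """Hypothetical helper: distance divided by elapsed time in hours."""
    fmt = "%Y-%m-%d %H:%M:%S"
    elapsed = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    seconds = elapsed.total_seconds()
    if seconds <= 0:
        # Assumption: treat zero or negative durations as "no speed"
        # instead of dividing by zero.
        return None
    return distance_miles * 3600 / seconds


# 400 seconds for 1.5 miles
print(speed_mph(1.5, "2019-01-01 00:46:40", "2019-01-01 00:53:20"))  # 13.5
```

In the Spark version, rows where pickup and dropoff are identical would divide by zero, so it may be worth checking how such rows should be handled in your data.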


pyspark - left join with random row matching the key

I am looking for a way to join 2 dataframes, but with random rows matching the key. This strange request is due to a very long calculation to generate positions.
I would like to do a kind of "random left join" in pyspark.
I have a dataframe with an areaID (string) and a count (int). The areaID is unique (around 7k).
+--------+-------+
| areaID | count |
+--------+-------+
| A | 10 |
| B | 30 |
| C | 1 |
| D | 25 |
| E | 18 |
+--------+-------+
I have a second dataframe with around 1000 precomputed rows for each areaID, with 2 position columns x (float) and y (float). This dataframe has around 7 million rows.
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.0 | 0 |
| A | 0.1 | 0.7 |
| A | 0.3 | 1 |
| A | 0.1 | 0.3 |
| ... | | |
| E | 3.15 | 4.17 |
| E | 3.14 | 4.22 |
+--------+------+------+
I would like to end up with a dataframe like:
+--------+------+------+
| areaID | x | y |
+--------+------+------+
| A | 0.1 | 0.32 | < row 1/10 - randomly picked where areaID are the same
| A | 0.0 | 0.18 | < row 2/10
| A | 0.09 | 0.22 | < row 3/10
| ... | | |
| E | 3.14 | 4.22 | < row 1/18
| ... | | |
+--------+------+------+
My first idea is to iterate over each areaID of the first dataframe, filter the second dataframe by areaID, and sample count rows from the result. The problem is that this is quite slow, with 7k load/filter/sample passes.
The second approach is to do an outer join on areaID, then shuffle the dataframe (which seems quite complex), apply a rank, and keep the rows where rank <= count. But I don't like the approach of loading a lot of data only to filter it afterward.
I am wondering if there is a way to do it using a "random" left join? In that case, I would duplicate each row count times and apply it.
Many thanks in advance,
Nicolas
One can interpret the question as stratified sampling of the second dataframe, where the number of samples to take from each subpopulation is given by the first dataframe.
Spark has a function for stratified sampling, sampleBy.
from pyspark.sql import functions as F
df1 = ...
df2 = ...
# first calculate the fraction for each areaID, based on the required number
# given in df1 and the number of rows for that areaID in df2
fractionRows = df2.groupBy("areaId").agg(F.count("areaId").alias("count2")) \
    .join(df1, "areaId") \
    .withColumn("fraction", F.col("count") / F.col("count2")) \
    .select("areaId", "fraction") \
    .collect()
# map each areaID to its sampling fraction
fractions = {f[0]: f[1] for f in fractionRows}
# now run the stratified sampling
df2.stat.sampleBy("areaID", fractions).show()
There is a caveat with this approach: because the sampling done by Spark is a random process, the number of rows requested in the first dataframe will not always be met exactly.
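If the exact counts matter, the second approach from the question (put the rows of each areaID in random order, then keep the first count of them) avoids this caveat. In Spark this would presumably be a row number over a window partitioned by areaID and ordered by a random column. The plain-Python sketch below only illustrates that logic on toy data; sample_exact and the sample rows are hypothetical and not part of the answer above.

```python
import random
from collections import defaultdict


def sample_exact(counts, rows, seed=42):
    """Pick exactly counts[area_id] random rows per area_id (or all rows, if fewer exist)."""
    rng = random.Random(seed)
    by_area = defaultdict(list)
    for area_id, x, y in rows:
        by_area[area_id].append((area_id, x, y))
    result = []
    for area_id, wanted in counts.items():
        candidates = by_area.get(area_id, [])
        # Sampling without replacement cannot return more rows than exist.
        k = min(wanted, len(candidates))
        result.extend(rng.sample(candidates, k))
    return result


counts = {"A": 2, "E": 1}
rows = [("A", 0.0, 0.0), ("A", 0.1, 0.7), ("A", 0.3, 1.0),
        ("E", 3.15, 4.17), ("E", 3.14, 4.22)]
picked = sample_exact(counts, rows)
print(len(picked))  # 3
```

Unlike sampleBy, this sketch samples without replacement, so it cannot return more rows for an area than were precomputed.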
Edit: fractions > 1.0 are not supported by sampleBy. Looking at the Scala code of sampleBy shows why: the function is implemented as a filter with a random variable indicating whether to keep the row or not. Returning multiple copies of a single row will therefore not work.
A similar idea can be used to support fractions > 1.0: instead of using a filter, a UDF is created that returns an array. The array contains one entry per copy of the row that should appear in the result. After applying the UDF, the array column is exploded and then dropped:
from pyspark.sql import functions as F
from pyspark.sql import types as T
fractions = {'A': 1.5, 'C': 0.5}
def ff(stratum, x):
    # x is a uniform random number in [0, 1) generated for this row
    fraction = fractions.get(stratum, 0.0)
    ret = []
    # one guaranteed copy for each whole unit of the fraction
    while fraction >= 1.0:
        ret.append("x")
        fraction = fraction - 1
    # one more copy with probability equal to the remaining fraction
    if x < fraction:
        ret.append("x")
    return ret
f = F.udf(ff, T.ArrayType(T.StringType())).asNondeterministic()
seed = 42
df2.withColumn("r", F.rand(seed)) \
    .withColumn("r", f("areaID", F.col("r"))) \
    .withColumn("r", F.explode("r")) \
    .drop("r") \
    .show()
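To see what the UDF does for a single row, the same rule can be written as a small pure function: the whole part of the fraction gives guaranteed copies, and the remainder is the probability of one extra copy. This is a sketch for illustration only; copies_for_row is a hypothetical name, and it assumes fractions are non-negative.

```python
def copies_for_row(fraction, r):
    """Number of copies of a row, given its stratum fraction and a uniform draw r in [0, 1)."""
    whole = int(fraction)          # guaranteed copies
    remainder = fraction - whole   # probability of one additional copy
    return whole + (1 if r < remainder else 0)


# Fraction 1.5: always one copy, plus a second one for half of the draws.
print(copies_for_row(1.5, 0.3))  # 2
print(copies_for_row(1.5, 0.7))  # 1
# Fraction 0.5: behaves like the plain filter used by sampleBy.
print(copies_for_row(0.5, 0.3))  # 1
print(copies_for_row(0.5, 0.7))  # 0
```

On average a row is returned fraction times, so the caveat about not hitting the exact count applies here as well.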

complex logic on pyspark dataframe including previous row existing value as well as previous row value generated on the fly

I have to apply logic to a Spark dataframe or RDD (preferably a dataframe) that requires generating two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. Columns A and B are available in the dataframe. Columns C and D are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an RDD, and then pass a Python function (containing all the calculation logic) to the mapPartitions API.
However, this approach may put all the load on one executor.
Mathematically, D1-C1 with D1 = C1-B1 becomes C1-B1-C1, which is -B1. So C in each row is simply the negated B of the previous row.
In pyspark, the lag function has a parameter called default, which supplies the value for the first row. This should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1,100),(2,200),(3,300),(4,400),(5,500)],['a','b'])
w=Window.orderBy('a')
df_lag =df.withColumn('c',F.lag((F.col('b')*-1),default=1000).over(w))
df_final = df_lag.withColumn('d',F.col('c')-F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is more complex than a subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
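The simplification used in this answer can be checked in plain Python by computing the columns row by row, exactly as the question defines them, and comparing with the shortcut C = -B of the previous row. Both function names are made up for this sketch; the sample values are the ones from the question.

```python
def sequential(b_values, default_c):
    """Row-by-row definition: C_n = D_(n-1) - C_(n-1), D_n = C_n - B_n."""
    rows = []
    prev_c = prev_d = None
    for b in b_values:
        c = default_c if prev_c is None else prev_d - prev_c
        d = c - b
        rows.append((b, c, d))
        prev_c, prev_d = c, d
    return rows


def shortcut(b_values, default_c):
    """Shortcut from the answer: C_n = -B_(n-1), with a default for the first row."""
    rows = []
    for i, b in enumerate(b_values):
        c = default_c if i == 0 else -b_values[i - 1]
        rows.append((b, c, c - b))
    return rows


b = [100, 200, 300, 400, 500]
print(sequential(b, 1000) == shortcut(b, 1000))  # True
```

The shortcut only holds because D - C collapses to -B for this particular formula; for other formulas the previous row's values may not cancel out like this.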
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    .withColumn("C",
                F.when(F.lag("C").over(w).isNotNull(),
                       F.lag("D").over(w) - F.lag("C").over(w))
                .otherwise(F.col("C")))
    .withColumn("D", F.col("C") - F.col("B"))
)

Calculating and aggregating data by date/time

I am working with a dataframe like this:
Id | TimeStamp | Event | DeviceId
1 | 5.2.2019 8:00:00 | connect | 1
2 | 5.2.2019 8:00:05 | disconnect| 1
I am using Databricks and pyspark for the ETL process. How can I calculate and create a dataframe like the one shown at the bottom? I have already tried using a UDF, but I could not find a way to make it work. I also tried iterating over the whole dataframe, but this is extremely slow.
I want to aggregate this dataframe into a new dataframe that tells me when, and for how long, each device has been connected and disconnected:
Id | StartDateTime | EndDateTime | EventDuration |State | DeviceId
1 | 5.2.19 8:00:00 | 5.2.19 8:00:05| 0.00:00:05 |connected| 1
I think you can make this work with a window function and a few additional columns created with withColumn.
The code below builds the mapping for each device and creates a table with the duration of each state. The only requirement is that connect and disconnect events alternate.
You can use the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import datetime
test_df = sqlContext.createDataFrame([(1,datetime.datetime(2019,2,5,8),"connect",1),
(2,datetime.datetime(2019,2,5,8,0,5),"disconnect",1),
(3,datetime.datetime(2019,2,5,8,10),"connect",1),
(4,datetime.datetime(2019,2,5,8,20),"disconnect",1),],
["Id","TimeStamp","Event","DeviceId"])
#creation of dataframe with 4 events for 1 device
test_df.show()
Output:
+---+-------------------+----------+--------+
| Id| TimeStamp| Event|DeviceId|
+---+-------------------+----------+--------+
| 1|2019-02-05 08:00:00| connect| 1|
| 2|2019-02-05 08:00:05|disconnect| 1|
| 3|2019-02-05 08:10:00| connect| 1|
| 4|2019-02-05 08:20:00|disconnect| 1|
+---+-------------------+----------+--------+
Then you can create the helper functions and the window:
# window over each device's events, ordered from newest to oldest
my_window = Window.partitionBy("DeviceId").orderBy(col("TimeStamp").desc())
# with descending order, lag() returns the timestamp of the next (later) event
get_next_time = lag(col("TimeStamp"), 1).over(my_window)
# duration in seconds until the next event
time_diff = get_next_time.cast("long") - col("TimeStamp").cast("long")
(test_df.withColumn("EventDuration", time_diff)
    .withColumn("EndDateTime", get_next_time)  # apply the helper expressions
    .withColumnRenamed("TimeStamp", "StartDateTime")  # rename according to your schema
    .withColumn("State", when(col("Event") == "connect", "connected").otherwise("disconnected"))  # create the state column
    # finally, filter out the latest event of each device, which has no following event
    .filter(col("EventDuration").isNotNull())
    .select("Id", "StartDateTime", "EndDateTime", "EventDuration", "State", "DeviceId")
    .show())
Output:
+---+-------------------+-------------------+-------------+------------+--------+
| Id| StartDateTime| EndDateTime|EventDuration| State|DeviceId|
+---+-------------------+-------------------+-------------+------------+--------+
| 3|2019-02-05 08:10:00|2019-02-05 08:20:00| 600| connected| 1|
| 2|2019-02-05 08:00:05|2019-02-05 08:10:00| 595|disconnected| 1|
| 1|2019-02-05 08:00:00|2019-02-05 08:00:05| 5| connected| 1|
+---+-------------------+-------------------+-------------+------------+--------+
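The pairing logic behind the window can be illustrated in plain Python: sort each device's events by time, then pair every event with the one that follows it. This is only a sketch of the idea on the same four test events; state_intervals is a hypothetical helper, and it lists the intervals oldest first rather than newest first.

```python
from datetime import datetime

# (Id, TimeStamp, Event, DeviceId), same test data as above
events = [
    (1, datetime(2019, 2, 5, 8, 0, 0), "connect", 1),
    (2, datetime(2019, 2, 5, 8, 0, 5), "disconnect", 1),
    (3, datetime(2019, 2, 5, 8, 10, 0), "connect", 1),
    (4, datetime(2019, 2, 5, 8, 20, 0), "disconnect", 1),
]


def state_intervals(events):
    """Pair each event with the next event of the same device."""
    by_device = {}
    for ev in sorted(events, key=lambda e: (e[3], e[1])):
        by_device.setdefault(ev[3], []).append(ev)
    intervals = []
    for device_id, evs in by_device.items():
        # zip stops before the last event, which has no following event
        for cur, nxt in zip(evs, evs[1:]):
            state = "connected" if cur[2] == "connect" else "disconnected"
            seconds = int((nxt[1] - cur[1]).total_seconds())
            intervals.append((cur[0], cur[1], nxt[1], seconds, state, device_id))
    return intervals


for row in state_intervals(events):
    print(row[0], row[3], row[4])
```

As in the Spark version, this relies on connect and disconnect events alternating for each device.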

pyspark - Can I use substring of value as a key of groupBy() function?

I have a dataframe that looks like this:
datetime | ID |
======================
20180201000000 | 275 |
20171231113024 | 534 |
20180201220000 | 275 |
20170205000000 | 28 |
What I want to do is count by ID, monthly.
This way worked perfectly:
add a month column by extracting it from the datetime column:
new_df = df.withColumn('month', df.datetime.substr(0,6))
count by ID & month :
count_df = new_df.groupBy('ID','month').count()
But is there a way to use a substring of a column's values directly as an argument of the groupBy() function? Like:
`count_df = df.groupBy('ID', df.datetime.substr(0,6)).count()`
At least, this code didn't work.
If there is a way to group by a substring of the values, I wouldn't need to add a new column, which would save a lot of resources (in the case of big data).
But even if this approach is wrong, do you have a better idea for getting the same result?
Try this (count here is the function imported from pyspark.sql.functions):
>>> df.show()
+--------------+---+
| datetime| id|
+--------------+---+
|20180201000000|275|
|20171231113024|534|
|20180201220000|275|
|20170205000000| 28|
+--------------+---+
>>> df.groupBy('id',df.datetime.substr(0,6)).agg(count('id')).show()
+---+-----------------------+---------+
| id|substring(datetime,0,6)|count(id)|
+---+-----------------------+---------+
|275| 201802| 2|
|534| 201712| 1|
| 28| 201702| 1|
+---+-----------------------+---------+
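The grouping itself can be checked in plain Python with the sample rows: the key is the pair of ID and the first six characters of the datetime string. This is only an illustrative sketch, and monthly_counts is a made-up name. As far as I know, Spark's substr positions are 1-based and a start of 0 behaves like 1, which matches the output above; in Python the equivalent is the slice [:6].

```python
from collections import Counter

# (datetime, id), same sample rows as above
rows = [
    ("20180201000000", 275),
    ("20171231113024", 534),
    ("20180201220000", 275),
    ("20170205000000", 28),
]


def monthly_counts(rows):
    """Count rows per (id, yyyymm), where yyyymm is the first six characters of datetime."""
    return Counter((row_id, dt[:6]) for dt, row_id in rows)


for (row_id, month), n in monthly_counts(rows).items():
    print(row_id, month, n)
```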

Filter out null strings and empty strings in hivecontext.sql

I'm using pyspark and hivecontext.sql, and I want to filter out all null and empty values from my data.
So I used a simple SQL command to first filter out the null values, but it doesn't work.
My code:
hiveContext.sql("select column1 from table where column2 is not null")
but it works without the expression "where column2 is not null"
Error:
Py4JavaError: An error occurred while calling o577.showString
I think it is because my select is wrong.
Data example:
column 1 | column 2
null | 1
null | 2
1 | 3
2 | 4
null | 2
3 | 8
Objective:
column 1 | column 2
1 | 3
2 | 4
3 | 8
Tks
We cannot pass the Hive table name directly to the Hive context sql method, since it doesn't understand the Hive table name. One way to read a Hive table is to use the pyspark shell.
We need to register the dataframe we get from reading the Hive table. Then we can run the SQL query.
You have to give database_name.table; run the same query and it will work. Please let me know if that helps.
This works for me:
df.na.drop(subset=["column1"])
Have you entered the null values manually?
If so, they will be treated as normal strings.
I tried the following two use cases:
dbname.person table in hive
name age
aaa null // this null is entered manually -case 1
Andy 30
Justin 19
okay NULL // this null came as this field was left blank. case 2
---------------------------------
hiveContext.sql("select * from dbname.person").show();
+------+----+
| name| age|
+------+----+
| aaa |null|
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
-----------------------------
case 2
hiveContext.sql("select * from dbname.person where age is not null").show();
+------+----+
| name|age |
+------+----+
| aaa |null|
| Andy| 30 |
|Justin| 19 |
+------+----+
------------------------------------
case 1
hiveContext.sql("select * from dbname.person where age!= 'null'").show();
+------+----+
| name| age|
+------+----+
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
------------------------------------
I hope the use cases above clear up your doubts about filtering out null values.
And if you are querying a table registered in Spark, use sqlContext.
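The distinction between a real NULL and the text "null" can be illustrated in plain Python, where a real NULL corresponds to None. The predicate below is only a sketch of what "null or empty" might mean for the question; is_blank is a hypothetical helper, and treating whitespace-only strings and any capitalization of "null" as blank are assumptions made here.

```python
def is_blank(value):
    """True for None, empty or whitespace-only strings, and the literal text 'null'."""
    if value is None:
        return True
    if isinstance(value, str):
        stripped = value.strip()
        return stripped == "" or stripped.lower() == "null"
    return False


# (name, age): 'aaa' holds the typed-in text "null", 'okay' holds a real missing value
people = [("aaa", "null"), ("Andy", "30"), ("Justin", "19"), ("okay", None), ("empty", "")]

# A check for real NULLs alone keeps the row with the text "null"
print([name for name, age in people if age is not None])
# Checking all three cases keeps only the rows with usable values
print([name for name, age in people if not is_blank(age)])
```

In SQL terms, this corresponds to combining an IS NOT NULL check with comparisons against the empty string and the text 'null'.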