Duration between the current row and the next one using PySpark

Trip data:
id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10
Using Pandas:
trip['time_diff'] = np.where(trip['id'] == trip['id'].shift(-1),
                             trip['timestamp'].shift(-1) - trip['timestamp']/1000,
                             None)
trip['time_diff'] = pd.to_numeric(trip['time_diff'])
I tried to do the same operation in PySpark, but nothing works. I have only been programming with Spark for a week, so I still have trouble using windows.
from pyspark.sql.types import *
from pyspark.sql import window
from pyspark.sql import functions as F
my_window = Window.partition('id').orderBy('timestamp').rowsBetween(0, 1)
timeFmt = "yyyy-MM-dd HH:mm:ss"
time_diff = (F.unix_timestamp(trip.timestamp, format=timeFmt).cast("long") -
F.unix_timestamp(trip.timestamp, format=timeFmt).over(my_window).cast("long"))
trip = trip.withColumn('time_diff', time_diff)
I wonder if that is the right way to do it. If not, how can I translate this operation into PySpark?
The result should look like this:
id, timestamp, diff_time
1008, 2003-11-03 15:00:31, 127
1008, 2003-11-03 15:02:38, 26
1008, 2003-11-03 15:03:04, 896
1008, 2003-11-03 15:18:00, None
1009, 2003-11-03 22:00:00, 173
1009, 2003-11-03 22:02:53, 51
1009, 2003-11-03 22:03:44, 956776
1009, 2003-11-14 10:00:00, .....
1009, 2003-11-14 10:02:02, .....
1009, 2003-11-14 10:03:10, .....

You can use the lead function to calculate the time difference. The following gives what you want:
val interdf = spark.sql("select id, timestamp, lead(timestamp) over (partition by id order by timestamp) as next_ts from data")
interdf.createOrReplaceTempView("interdf")
spark.sql("select id, timestamp, next_ts, unix_timestamp(next_ts) - unix_timestamp(timestamp) from interdf").show()
If you want to avoid spark-sql, you can do the same by importing the relevant functions:
import org.apache.spark.sql.functions.lead
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy("timestamp")
Relevant Python code:
from pyspark.sql import Window
from pyspark.sql.functions import abs, col, lead

window = Window.partitionBy("id").orderBy("timestamp")
diff = col("timestamp").cast("long") - lead("timestamp", 1).over(window).cast("long")
df = df.withColumn("diff", diff)
df = df.withColumn("diff", abs(df.diff))
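For completeness, here is a minimal end-to-end sketch of this approach applied to a subset of the sample data above (the string timestamps are parsed with unix_timestamp; column names follow the question):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# A subset of the sample trips from the question; timestamps start out as strings.
trip = spark.createDataFrame(
    [(1008, '2003-11-03 15:00:31'), (1008, '2003-11-03 15:02:38'),
     (1008, '2003-11-03 15:03:04'), (1008, '2003-11-03 15:18:00'),
     (1009, '2003-11-03 22:00:00'), (1009, '2003-11-03 22:02:53')],
    ['id', 'timestamp'])

w = Window.partitionBy('id').orderBy('timestamp')

# Seconds between the next row and the current one within each id; the last row per id is null.
trip = trip.withColumn(
    'time_diff',
    F.unix_timestamp(F.lead('timestamp', 1).over(w)) - F.unix_timestamp('timestamp'))
trip.show(truncate=False)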

Related

Convert event time into date and time in Pyspark?

I have the below event_time in my data frame. I would like to convert the event_time into a date/time. I used the code below, but the result is not coming out properly:
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
The output I am getting is not correct. Can anyone advise how to do this properly? I am new to PySpark.
It seems that your data is in microseconds (1/1,000,000 of a second), so you have to divide by 1,000,000:
df = spark.createDataFrame(
    [
        ('1645904274665267',),
        ('1645973845823770',),
        ('1644134156697560',),
        ('1644722868485010',),
        ('1644805678702121',),
        ('1645071502180365',),
        ('1644220446396240',),
        ('1645736052650785',),
        ('1646006645296010',),
        ('1644544811297016',),
        ('1644614023559317',),
        ('1644291365608571',),
        ('1645643575551339',)
    ], ['Event_Time']
)

import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime(f.col("Event_Time") / 1000000))
df.show(truncate=False)
Output:
+----------------+-------------------+
|Event_Time |date |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
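Note that from_unixtime returns a string with only seconds precision, so the microsecond part is dropped in the output above. If you need to keep the sub-second part, a small variation (a sketch, assuming the same Event_Time column) is to cast the scaled value to a timestamp instead of formatting it:
import pyspark.sql.functions as f

# Casting epoch seconds (a double) to timestamp keeps the fractional seconds,
# up to the microsecond precision that Spark timestamps store.
df = df.withColumn("ts", (f.col("Event_Time") / 1000000).cast("timestamp"))
df.show(truncate=False)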

Adding float column to TimestampType column (seconds+miliseconds)

I am trying to add a float column to a TimestampType column in PySpark, but there does not seem to be a way to do this while maintaining the fractional seconds. An example of float_seconds is 19.702300786972046, and an example of the timestamp is 2021-06-17 04:31:32.48761.
What I want:
calculated_df = beginning_df.withColumn("calculated_column", float_seconds_col + TimestampType_col)
I have tried the following methods, but neither completely solves the problem:
#method 1 adds a single time, but cannot be used to add an entire column to the timestamp column.
calculated_df = beginning_df.withColumn("calculated_column",col("TimestampType_col") + F.expr('INTERVAL 19.702300786 seconds'))
#method 2 converts the float column to unixtime, but cuts off the decimals (which are important)
timestamp_seconds = beginning_df.select(from_unixtime("float_seconds"))
Image of the two columns in question
You could achieve it using a UDF as follows:
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, FloatType, TimestampType

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

schema = StructType([
    StructField('dt', TimestampType(), nullable=True),
    StructField('sec', FloatType(), nullable=True),
])

item1 = {
    "dt": datetime.fromtimestamp(1611859271.516),
    "sec": 19.702300786,
}
item2 = {
    "dt": datetime.fromtimestamp(1611859271.517),
    "sec": 19.702300787,
}
item3 = {
    "dt": datetime.fromtimestamp(1611859271.518),
    "sec": 19.702300788,
}

df = spark.createDataFrame([item1, item2, item3], schema=schema)
df.printSchema()

@udf(returnType=TimestampType())
def add_time(dt, sec):
    return dt + timedelta(seconds=sec)

df = df.withColumn("new_dt", add_time(col("dt"), col("sec")))
df.printSchema()
df.show(truncate=False)
The Timestamp data type supports nanoseconds (max 9 digits of precision). Your float_seconds_col has more than 9 digits of precision (15 in your example, i.e. femtoseconds), so the extra precision will be lost when converted to a timestamp anyway.
Plain vanilla Hive:
select
    timestamp(
        cast(concat(cast(unix_timestamp(TimestampType_col) as string),  -- seconds
                    '.',
                    regexp_extract(TimestampType_col, '\\.(\\d+)$'))    -- fractional part
             as decimal(30, 15))
        + float_seconds_col  -- round this value to nanos to get a better timestamp conversion: round(float_seconds_col, 9)
    ) as result              -- max precision is 9 (nanoseconds)
from
(
    select 19.702300786972046 float_seconds_col,
           timestamp('2021-06-17 04:31:32.48761') TimestampType_col
) s
Result:
2021-06-17 04:31:52.189910786
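If you would rather stay in PySpark and avoid both the UDF and the Hive SQL, a rough sketch of the same idea (assuming the dt and sec column names from the UDF example above) is to go through epoch seconds as a double and cast back; a double only retains roughly microsecond precision for present-day epoch values, which is also the precision Spark timestamps store:
from pyspark.sql import functions as F

# Cast the timestamp to epoch seconds (double, keeps the fractional part),
# add the float seconds, then cast back to a timestamp.
df = df.withColumn("new_dt", (F.col("dt").cast("double") + F.col("sec")).cast("timestamp"))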

convert integer into date to count number of days

I need to convert an integer column to date (yyyy-MM-dd) format in order to calculate the number of days.
registryDate
20130826
20130829
20130816
20130925
20130930
20130926
Desired output:
registryDate TodaysDate DaysInBetween
20130826 2018-11-24 1916
20130829 2018-11-24 1913
20130816 2018-11-24 1926
You can cast registryDate to String type, then apply to_date and datediff to compute the difference in days, as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Date
val df = Seq(
  20130826, 20130829, 20130816, 20130825
).toDF("registryDate")

df.
  withColumn("registryDate2", to_date($"registryDate".cast(StringType), "yyyyMMdd")).
  withColumn("todaysDate", lit(Date.valueOf("2018-11-24"))).
  withColumn("DaysInBetween", datediff($"todaysDate", $"registryDate2")).
  show
// +------------+-------------+----------+-------------+
// |registryDate|registryDate2|todaysDate|DaysInBetween|
// +------------+-------------+----------+-------------+
// | 20130826| 2013-08-26|2018-11-24| 1916|
// | 20130829| 2013-08-29|2018-11-24| 1913|
// | 20130816| 2013-08-16|2018-11-24| 1926|
// | 20130825| 2013-08-25|2018-11-24| 1917|
// +------------+-------------+----------+-------------+
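For reference, a PySpark sketch of the same approach (the DataFrame and column names are assumed to mirror the Scala example above):
from pyspark.sql import functions as F

# Cast the integer to a string, parse it as a date, then take the day difference
# against a fixed reference date.
df = (df
      .withColumn("registryDate2", F.to_date(F.col("registryDate").cast("string"), "yyyyMMdd"))
      .withColumn("todaysDate", F.lit("2018-11-24").cast("date"))
      .withColumn("DaysInBetween", F.datediff(F.col("todaysDate"), F.col("registryDate2"))))
df.show()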

how to get months,years difference between two dates in sparksql

I am getting the error:
org.apache.spark.sql.AnalysisException: cannot resolve 'year'
My input data:
1,2012-07-21,2014-04-09
My code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
What can I do to resolve the above error?
I have tried the following code, but it gives the output in days and I need it in years:
val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")
Please advise whether I can use any function like to_date to get the year difference.
Another simple way is to cast the string columns to DateType in Spark SQL and apply the SQL date and time functions on them, like the following:
import org.apache.spark.sql.types._
val c4 = c3.select(col("id"),col("start").cast(DateType),col("end").cast(DateType))
c4.withColumn("dateDifference", datediff(col("end"), col("start")))
  .withColumn("monthDifference", months_between(col("end"), col("start")))
  .withColumn("yearDifference", year(col("end")) - year(col("start")))
  .show()
One of the above answers doesn't return the right year when the number of days between the two dates is less than 365. The example below provides the right year and rounds the month and year differences to 2 decimals.
Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()
Outputs:
+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate| endDate|datesDiff| montsDiff|montsDiff_round| yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24| 23| 0.74193548| 0.74| 0.06182795666666666| 0.06|
|2019-06-24|2019-07-24| 30| 1.0| 1.0| 0.08333333333333333| 0.08|
|2019-08-24|2019-07-24| -31| -1.0| -1.0|-0.08333333333333333| -0.08|
|2018-12-23|2019-07-24| 213| 7.03225806| 7.03| 0.586021505| 0.59|
|2018-07-20|2019-07-24| 369|12.12903226| 12.13| 1.0107526883333333| 1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
You can find a complete working example at the URL below:
https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/
Hope this helps.
Happy Learning !!
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
In the above code, "year" is not a column in the data frame, i.e. it is not a valid column in table "c4". That is why the AnalysisException is thrown: the query is invalid because it cannot resolve the "year" column.
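For reference, a sketch of a query that avoids the unresolved 'year' reference, assuming the same "c4" temp table (shown here in PySpark syntax, but the SQL string is the same from Scala); months_between has been available since Spark 1.5:
# Hypothetical sketch: fractional years as months_between / 12.
r = sqlContext.sql("select id, months_between(to_date(end), to_date(start)) / 12 as years_diff from c4")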
Use a Spark user-defined function (UDF); that will be a more robust approach. Since datediff only returns the difference in days, I prefer to use my own UDF:
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame

def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
  chronoUnit.between(
    Instant.ofEpochMilli(dateA.getTime),
    Instant.ofEpochMilli(dateB.getTime)
  )
}

def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
  val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
  df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}
Then I call it as a DataFrame transformation:
df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))

Transform rows to columns in Spark Scala SQL

I have a database table containing unique user ids and items clicked.
e.g.
user id,item id
1 , 345
1 , 78993
1 , 784
5, 345
5, 897
15, 454
and I want to transform this data into the following format using Spark SQL (in Scala, if possible):
user id, item ids
1, 345, 78993, 784
5, 345,897
15, 454
Thanks,
A local example:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

object Main extends App {

  case class Record(user: Int, item: Int)

  val items = List(
    Record(1, 345),
    Record(1, 78993),
    Record(1, 784),
    Record(5, 345),
    Record(5, 897),
    Record(15, 454)
  )

  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  val hiveContext = new HiveContext(sc)

  import hiveContext.implicits._
  import hiveContext.sql

  val df = sc.parallelize(items).toDF()
  df.registerTempTable("records")

  sql("SELECT * FROM records").collect().foreach(println)
  sql("SELECT user, collect_set(item) FROM records GROUP BY user").collect().foreach(println)
}
This produces:
[1,ArrayBuffer(78993, 784, 345)]
[5,ArrayBuffer(897, 345)]
[15,ArrayBuffer(454)]
This is a pretty simple groupByKey scenario, although if you want to do something else with the data afterwards, I would suggest using a more efficient PairRDDFunctions method, since groupByKey is inefficient for follow-up queries.
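For comparison, here is the groupByKey idea mentioned above as a small RDD-level sketch (written in PySpark here; the Scala RDD API is analogous). The DataFrame/collect_set route shown in the answer is usually preferable for anything beyond a one-off grouping:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("groupByKeyExample").getOrCreate()

# Pair RDD of (user, item); groupByKey gathers all items per user.
pairs = spark.sparkContext.parallelize(
    [(1, 345), (1, 78993), (1, 784), (5, 345), (5, 897), (15, 454)])
print(pairs.groupByKey().mapValues(list).collect())
# e.g. [(1, [345, 78993, 784]), (5, [345, 897]), (15, [454])]  (ordering may vary)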