PySpark groupBy agg with new column: difference between oldest and newest timestamp

I have a PySpark dataframe with the following columns:
session_id
timestamp
data = [
    ("ID1", "2021-12-10 10:00:00"),
    ("ID1", "2021-12-10 10:05:00"),
    ("ID2", "2021-12-10 10:20:00"),
    ("ID2", "2021-12-10 10:24:00"),
    ("ID2", "2021-12-10 10:26:00"),
]
I would like to group the sessions and add a new column called duration, which would be the difference between the oldest and newest timestamp for that session (in seconds):
ID1: 300
ID2: 360
How can I achieve this?
Thanks,

You can use an aggregate function like collect_list and then apply array_max and array_min to the collected list. To get the duration in seconds, convert the resulting time values with unix_timestamp and take the difference.
Try this:
from pyspark.sql.functions import (
    col,
    array_max,
    collect_list,
    array_min,
    unix_timestamp,
)

data = [
    ("ID1", "2021-12-10 10:00:00"),
    ("ID1", "2021-12-10 10:05:00"),
    ("ID2", "2021-12-10 10:20:00"),
    ("ID2", "2021-12-10 10:24:00"),
    ("ID2", "2021-12-10 10:26:00"),
]

# cast the string column to a proper timestamp
df = spark.createDataFrame(data, ["sessionId", "time"]).select(
    "sessionId", col("time").cast("timestamp")
)

# collect all timestamps per session, take the newest/oldest, and diff their Unix seconds
df2 = (
    df.groupBy("sessionId")
    .agg(
        array_max(collect_list("time")).alias("max_time"),
        array_min(collect_list("time")).alias("min_time"),
    )
    .withColumn("duration", unix_timestamp("max_time") - unix_timestamp("min_time"))
)
df2.show()
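If you do not need the intermediate array, a slightly shorter variant (a sketch along the same lines, not part of the original answer) aggregates the timestamp column with min and max directly and should produce the same durations:
from pyspark.sql import functions as F

# sketch: aggregate the timestamp column directly instead of collecting a list first
df3 = (
    df.groupBy("sessionId")
    .agg(
        F.max("time").alias("max_time"),
        F.min("time").alias("min_time"),
    )
    .withColumn("duration", F.unix_timestamp("max_time") - F.unix_timestamp("min_time"))
)
df3.show()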

Related

Spark (scala): How to get interval of dates from dataframe into a new dataframe

I need help exploding a dataframe in Spark (Scala) that has a date interval into a new dataframe, as in the example below.
I use Spark 2.3.2, so I don't have the specific function to do it.
Original dataframe:
EVENT INITIAL_DATE END_DATE
event1 01/01/2023 04/01/2023
event2 15/02/2023 17/02/2023
New dataframe:
EVENT DATE
event1 01/01/2023
event1 02/01/2023
event1 03/01/2023
event1 04/01/2023
event2 15/02/2023
event2 16/02/2023
event2 17/02/2023
Thanks a lot!
The code below creates a sample dataframe with your two rows of data.
// libraries
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
import spark.implicits._

// column definition
val cols = new StructType()
  .add("event_str", "string")
  .add("start_str", "string")
  .add("end_str", "string")

// create rdd
val rows: RDD[Row] = sc.parallelize(
  Seq(
    Row("event1", "01/01/2023", "04/01/2023"),
    Row("event2", "15/02/2023 ", "17/02/2023")
  )
)

// create df
val df = spark.createDataFrame(rows, cols)

// show data
df.show()
Running df.show() confirms that the two rows were loaded as string columns.
The issue is that the to_date() function does not handle your format directly. But we can use substring() and concat() to rearrange the string into a parsable form, and then explode() an expr() that builds a date sequence.
import org.apache.spark.sql.functions._

// str to date
val df2 = df.withColumn("start_date",
  to_date(concat(substring(col("start_str"), 4, 2), lit("/"),
                 substring(col("start_str"), 1, 2), lit("/"),
                 substring(col("start_str"), 7, 4)), "MM/dd/yyyy"))

// str to date
val df3 = df2.withColumn("end_date",
  to_date(concat(substring(col("end_str"), 4, 2), lit("/"),
                 substring(col("end_str"), 1, 2), lit("/"),
                 substring(col("end_str"), 7, 4)), "MM/dd/yyyy"))

// explode sequence
val df4 = df3.withColumn("event_date",
  explode(expr("sequence(start_date, end_date, interval 1 day)")))

// show result
df4.select(col("event_str"), col("event_date")).show()
The result pairs each event with every date in its range, matching the desired output. Since I do not use Scala often, the syntax took me a little while longer.

Convert event time into date and time in Pyspark?

I have an event_time column in my data frame (epoch values like 1645904274665267).
I would like to convert the event_time into date/time. I used the code below, however the result does not come out properly:
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
The output I get is not correct.
Can anyone advise how to do this properly, as I am new to PySpark?
It seems that your data is in microseconds (1/1,000,000 of a second), so you have to divide by 1,000,000:
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [
        ('1645904274665267',),
        ('1645973845823770',),
        ('1644134156697560',),
        ('1644722868485010',),
        ('1644805678702121',),
        ('1645071502180365',),
        ('1644220446396240',),
        ('1645736052650785',),
        ('1646006645296010',),
        ('1644544811297016',),
        ('1644614023559317',),
        ('1644291365608571',),
        ('1645643575551339',),
    ], ['Event_Time']
)

df = df.withColumn("date", f.from_unixtime(f.col("Event_Time") / 1000000))
df.show(truncate=False)
Output:
+----------------+-------------------+
|Event_Time      |date               |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
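Note that from_unixtime returns a formatted string. If you prefer to keep a real timestamp column, a small sketch of the same idea (reusing the Event_Time column above) divides and casts instead, retaining sub-second precision up to the limits of a double:
import pyspark.sql.functions as f

# sketch: keep a true timestamp column instead of a formatted string
df = df.withColumn("event_ts", (f.col("Event_Time") / 1000000).cast("timestamp"))
df.select("Event_Time", "event_ts").show(truncate=False)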

Adding float column to TimestampType column (seconds + milliseconds)

I am trying to add a float column to a TimestampType column in PySpark, but there does not seem to be a way to do this while maintaining the fractional seconds. An example float_seconds value is 19.702300786972046; an example timestamp is 2021-06-17 04:31:32.48761.
What I want:
calculated_df = beginning_df.withColumn("calculated_column", float_seconds_col + TimestampType_col)
I have tried the following methods, but neither completely solves the problem:
# method 1 adds a single fixed interval, but cannot be used to add an entire column to the timestamp column.
calculated_df = beginning_df.withColumn("calculated_column",col("TimestampType_col") + F.expr('INTERVAL 19.702300786 seconds'))
#method 2 converts the float column to unixtime, but cuts off the decimals (which are important)
timestamp_seconds = beginning_df.select(from_unixtime("float_seconds"))
Image of the two columns in question
You could achieve it using a UDF as follows:
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, FloatType, TimestampType

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

schema = StructType([
    StructField('dt', TimestampType(), nullable=True),
    StructField('sec', FloatType(), nullable=True),
])

item1 = {
    "dt": datetime.fromtimestamp(1611859271.516),
    "sec": 19.702300786,
}
item2 = {
    "dt": datetime.fromtimestamp(1611859271.517),
    "sec": 19.702300787,
}
item3 = {
    "dt": datetime.fromtimestamp(1611859271.518),
    "sec": 19.702300788,
}

df = spark.createDataFrame([item1, item2, item3], schema=schema)
df.printSchema()

# the decorator registers add_time as a UDF that returns a timestamp
@udf(returnType=TimestampType())
def add_time(dt, sec):
    return dt + timedelta(seconds=sec)

df = df.withColumn("new_dt", add_time(col("dt"), col("sec")))
df.printSchema()
df.show(truncate=False)
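If you would rather avoid a Python UDF, a non-UDF sketch of the same idea (assuming the dt and sec columns from the example above) uses the fact that casting a timestamp to double yields epoch seconds with a fractional part, so the float seconds can be added directly and the sum cast back:
from pyspark.sql import functions as F

# sketch: timestamp -> double (epoch seconds with fraction), add the float seconds, cast back;
# a double keeps roughly microsecond accuracy for present-day epoch values
df_alt = df.withColumn(
    "new_dt_cast",
    (F.col("dt").cast("double") + F.col("sec")).cast("timestamp"),
)
df_alt.show(truncate=False)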
The timestamp data type supports nanoseconds (max 9 digits of precision). Your float_seconds_col has more than 9 fractional digits (15 in your example, i.e. femtoseconds), so that extra precision will be lost when converted to a timestamp anyway.
Plain vanilla Hive:
select
timestamp(cast(concat(cast(unix_timestamp(TimestampType_col) as string), --seconds
'.',
regexp_extract(TimestampType_col,'\\.(\\d+)$')) --fractional
as decimal (30, 15)
) + float_seconds_col --round this value to nanos to get better timestamp conversion (round(float_seconds_col,9))
) as result --max precision is 9 (nanoseconds)
from
(
select 19.702300786972046 float_seconds_col,
timestamp('2021-06-17 04:31:32.48761') TimestampType_col
) s
Result:
2021-06-17 04:31:52.189910786

How can I split a timestamp into date and time?

//loading DF
val df1 = spark.read.option("header",true).option("inferSchema",true).csv("time.csv ")
//
+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error => the datatype does not match (date_time: bigint) in df:
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Any solution for solving it?
I tried doing:
val a = df1.withColumn("date_time",df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime",a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch format to date and then do the computation. You can try this:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv ")

val df1 = df
  .withColumn(
    "dateCreated",
    date_format(
      to_date(
        substring(from_unixtime($"date_time".divide(1000)), 0, 10),
        "yyyy-MM-dd"
      ),
      "dd-MM-yyyy"
    )
  )
  .withColumn(
    "timeCreated",
    substring(from_unixtime($"date_time".divide(1000)), 11, 19)
  )
Sample data from my use case:
+---------+-------------+--------+-----------+-----------+
|     adId|    date_time|   price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000|  5950.0| 22-07-2016|   14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016|   19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016|   07:09:52|
|229218611|1467815284000|  9996.0| 06-07-2016|   19:58:04|
|229105894|1467656022000|  7700.0| 04-07-2016|   23:43:42|
|230214681|1469559471000|  4600.0| 27-07-2016|   00:27:51|
|230158375|1469469248000|   999.0| 25-07-2016|   23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust for the time zone: by default the output is in your local time zone (for me it's GMT+05:30). Hope it helps.

Scala: For loop on dataframe, create new column from existing by index

I have a dataframe with two columns:
id (string), date (timestamp)
I would like to loop through the dataframe and add a new column with a URL, which includes the id. The algorithm should look something like this:
add one new column with the following value:
for each id
"some url" + the value of the dataframe's id column
I tried to make this work in Scala, but I have problems getting the specific id at index a:
val k = df2.count().asInstanceOf[Int]

// for loop execution with a range
for (a <- 1 to k) {
  // println("Value of a: " + a)
  val dfWithFileURL = dataframe.withColumn("fileUrl", "https://someURL/" + dataframe("id")[a])
}
But this
dataframe("id")[a]
is not working in Scala. I could not find a solution yet, so any kind of suggestion is welcome!
You can simply use the withColumn function in Scala, something like this:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "1 Jan 2000"),
  (2, "2 Feb 2014"),
  (3, "3 Apr 2017")
).toDF("id", "date")

// Add the fileUrl column
val dfNew = df
  .withColumn("fileUrl", concat(lit("https://someURL/"), $"id"))
  .show
My results show the new fileUrl column appended to each row.
Not sure if this is what you require, but you can use zipWithIndex for indexing.
data.show()
+---+------------------+
| Id|               Url|
+---+------------------+
|111|http://abc.go.org/|
|222|http://xyz.go.net/|
+---+------------------+
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val df = sqlContext.createDataFrame(
  data.rdd.zipWithIndex
    .map { case (r, i) => Row.fromSeq(r.toSeq :+ s"""${r.getString(1)}${i + 1}""") },
  StructType(data.schema.fields :+ StructField("fileUrl", StringType, false))
)
Output:
df.show(false)
+---+------------------+-------------------+
|Id |Url               |fileUrl            |
+---+------------------+-------------------+
|111|http://abc.go.org/|http://abc.go.org/1|
|222|http://xyz.go.net/|http://xyz.go.net/2|
+---+------------------+-------------------+