Fetch month start and end dates between two dates inclusive in PySpark

I have been trying to fetch the range of months between two given dates, but it's not working as expected.
e.g.
start_date (dd-mm-yyyy) = 12-01-2022
end_date (dd-mm-yyyy) = 03-06-2022
Expected output:
Valid_From | Valid_To
2022-01-12 | 2022-01-31
2022-02-01 | 2022-02-28
2022-03-01 | 2022-03-31
2022-04-01 | 2022-04-30
2022-05-01 | 2022-05-31
2022-06-01 | 2022-06-03
My code:
var_forecast_start_date = datetime.datetime(2022, 1, 12)
var_forecast_end_date = datetime.datetime(2022, 6, 2)
df_datetime = pandas_to_spark(
    df_datetime(start=var_forecast_start_date, end=var_forecast_end_date)
)
df_datetime = df_datetime.withColumn(
    "DateID", date_format(df_datetime.Date, "yyyyMMdd").cast(IntegerType())
).withColumn("FiscalDate", date_format(df_datetime.Date, "yyyy-MM-dd"))
df_datetime = df_datetime.selectExpr(
    "add_months(date_add(last_day(Date),1),-1) AS Valid_From",
    "last_day(Date) AS Valid_To",
).distinct()

You could try the following:
import findspark
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
findspark.init()
spark = SparkSession.builder.appName("local").getOrCreate()
columns = ["start_date", "end_date"]
data = [("12-01-2022", "03-06-2022")]
df = spark.createDataFrame(data).toDF(*columns)
df = (
    df.withColumn(
        "start_date", F.to_date(F.col("start_date"), "dd-MM-yyyy").cast("DATE")
    )
    .withColumn(
        "end_date", F.to_date(F.col("end_date"), "dd-MM-yyyy").cast("DATE")
    )
    .withColumn(
        "months_between",
        F.round(
            F.months_between(F.col("end_date"), F.col("start_date"), True)
        ).cast("Integer"),
    )
    .withColumn(
        "months_between_seq", F.sequence(F.lit(1), F.col("months_between"))
    )
    .withColumn("months_between_seq", F.explode(F.col("months_between_seq")))
    .withColumn(
        "end_of_month",
        F.expr("LAST_DAY(ADD_MONTHS(start_date, months_between_seq - 1))"),
    )
    .withColumn(
        "begin_of_month",
        F.expr("LAST_DAY(ADD_MONTHS(start_date, months_between_seq - 1)) + 1"),
    )
)
start_window_agg = Window.partitionBy().orderBy("Valid_From")
start_union_sdf = (
    df.select(F.col("start_date").alias("Valid_From"))
    .unionByName(df.select(F.col("begin_of_month").alias("Valid_From")))
    .drop_duplicates()
    .withColumn("row_number", F.row_number().over(start_window_agg))
)
end_window_agg = Window.partitionBy().orderBy("Valid_To")
end_union_sdf = (
    df.select(F.col("end_date").alias("Valid_To"))
    .unionByName(df.select(F.col("end_of_month").alias("Valid_To")))
    .drop_duplicates()
    .withColumn("row_number", F.row_number().over(end_window_agg))
)
join_sdf = (
    end_union_sdf.join(start_union_sdf, how="inner", on=["row_number"])
    .drop("row_number")
    .withColumn("Valid_To", F.col("Valid_To").cast("DATE"))
    .withColumn("Valid_From", F.col("Valid_From").cast("DATE"))
    .select("Valid_From", "Valid_To")
    .orderBy("Valid_From")
)
join_sdf.show()
It returns:
+----------+----------+
|Valid_From| Valid_To|
+----------+----------+
|2022-01-12|2022-01-31|
|2022-02-01|2022-02-28|
|2022-03-01|2022-03-31|
|2022-04-01|2022-04-30|
|2022-05-01|2022-05-31|
|2022-06-01|2022-06-03|
+----------+----------+
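For what it's worth, here is a more compact sketch of the same idea, assuming Spark 2.4+ (for sequence with an interval step) and re-creating the single-row input from the question:
from pyspark.sql import functions as F
bounds = spark.createDataFrame(
    [("12-01-2022", "03-06-2022")], ["start_date", "end_date"]
).select(
    F.to_date("start_date", "dd-MM-yyyy").alias("start_date"),
    F.to_date("end_date", "dd-MM-yyyy").alias("end_date"),
)
months = (
    bounds
    # one row per calendar month covered by the two dates
    .withColumn(
        "month_start",
        F.explode(
            F.sequence(
                F.trunc("start_date", "month"),
                F.trunc("end_date", "month"),
                F.expr("INTERVAL 1 MONTH"),
            )
        ),
    )
    # clamp the first and last month to the given bounds
    .select(
        F.greatest("month_start", "start_date").alias("Valid_From"),
        F.least(F.last_day("month_start"), F.col("end_date")).alias("Valid_To"),
    )
    .orderBy("Valid_From")
)
months.show()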

Related

pyspark groupby agg with new col: diff between oldest and newest timestamp

I have a PySpark dataframe with the following columns:
session_id
timestamp
data = [("ID1", "2021-12-10 10:00:00"),
        ("ID1", "2021-12-10 10:05:00"),
        ("ID2", "2021-12-10 10:20:00"),
        ("ID2", "2021-12-10 10:24:00"),
        ("ID2", "2021-12-10 10:26:00"),
        ]
I would like to group the sessions and add a new column called duration, which would be the difference between the oldest and newest timestamp for that session (in seconds):
ID1: 300
ID2: 360
How can I achieve this?
Thanks,
You can use an aggregate function like collect_list and then take the max and min of the resulting list. To get the duration in seconds, convert the time values with unix_timestamp and take the difference.
Try this:
from pyspark.sql.functions import (
    col,
    array_max,
    collect_list,
    array_min,
    unix_timestamp,
)
data = [
    ("ID1", "2021-12-10 10:00:00"),
    ("ID1", "2021-12-10 10:05:00"),
    ("ID2", "2021-12-10 10:20:00"),
    ("ID2", "2021-12-10 10:24:00"),
    ("ID2", "2021-12-10 10:26:00"),
]
df = spark.createDataFrame(data, ["sessionId", "time"]).select(
    "sessionId", col("time").cast("timestamp")
)
df2 = (
    df.groupBy("sessionId")
    .agg(
        array_max(collect_list("time")).alias("max_time"),
        array_min(collect_list("time")).alias("min_time"),
    )
    .withColumn("duration", unix_timestamp("max_time") - unix_timestamp("min_time"))
)
df2.show()
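As a side note, you could get the same result without building the intermediate arrays; a minimal sketch, assuming the same df as above:
from pyspark.sql import functions as F
# aggregate with min/max directly instead of collecting the timestamps into a list
df3 = df.groupBy("sessionId").agg(
    (F.unix_timestamp(F.max("time")) - F.unix_timestamp(F.min("time"))).alias("duration")
)
df3.show()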

CDC with pyspark

I am trying to write PySpark code that handles two scenarios.
Scenario 1:
Input data:
col1|col2|date
100|Austin|2021-01-10
100|Newyork|2021-02-15
100|Austin|2021-03-02
Expected output with CDC:
col1|col2|start_date|end_date
100|Austin|2021-01-10|2021-02-15
100|Newyork|2021-02-15|2021-03-02
100|Austin|2021-03-02|2099-12-31
Across the sequence there are changes in the col2 values, and I want to maintain CDC.
Scenario 2:
Input:
col1|col2|date
100|Austin|2021-01-10
100|Austin|2021-03-02 -> I want to eliminate this version because there is no change in col1 and col2 values between records.
Expected Output:
col1|col2|start_date|end_date
100|Austin|2021-01-10|2099-12-31
I am looking for the same code to work in both scenarios.
I am trying something like this, but it is not working for both scenarios:
inputdf = inputdf.groupBy('col1', 'col2', 'date').agg(F.min("date").alias("r_date"))
inputdf = inputdf.drop("date").withColumnRenamed("r_date", "start_date")
my_allcolumnwindowasc = Window.partitionBy('col1', 'col2').orderBy("start_date")
inputdf = (
    inputdf.withColumn(
        'dropDuplicates', F.lead(inputdf.start_date).over(my_allcolumnwindowasc)
    )
    .where(F.col("dropDuplicates").isNotNull())
    .drop('dropDuplicates')
)
There are more than 20 columns in some of the scenarios.
Thanks for the help!
Check this out.
Steps:
Use a window function to assign the row number
Convert the dataframe to a view
Use a self join (the condition checks are the key)
Use the lead window function, wrapped in coalesce, to supply the "2099-12-31" value when lead returns null
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession \
.builder \
.appName("SO") \
.getOrCreate()
df = spark.createDataFrame(
    [(100, "Austin", "2021-01-10"),
     (100, "Newyork", "2021-02-15"),
     (100, "Austin", "2021-03-02"),
    ],
    ['col1', 'col2', 'date']
)
# df = spark.createDataFrame(
#     [(100, "Austin", "2021-01-10"),
#      (100, "Austin", "2021-03-02"),
#     ],
#     ['col1', 'col2', 'date']
# )
df1 = df.withColumn("start_date", F.to_date("date"))
w = Window.partitionBy("col1",).orderBy("start_date")
df_1 = df1.withColumn("rn", F.row_number().over(w))
df_1.createTempView("temp_1")
df_dupe = spark.sql('select temp_1.col1,temp_1.col2,temp_1.start_date, case when temp_1.col1=temp_2.col1 and temp_1.col2=temp_2.col2 then "delete" else "no-delete" end as dupe from temp_1 left join temp_1 as temp_2 '
'on temp_1.col1=temp_2.col1 and temp_1.col2=temp_2.col2 and temp_1.rn-1 = temp_2.rn order by temp_1.start_date ')
df_dupe.filter(F.col("dupe")=="no-delete").drop("dupe")\
.withColumn("end_date", F.coalesce(F.lead("start_date").over(w),F.lit("2099-12-31"))).show()
# Result:
# Scenario1:
#+----+-------+----------+----------+
# |col1| col2|start_date| end_date|
# +----+-------+----------+----------+
# | 100| Austin|2021-01-10|2021-02-15|
# | 100|Newyork|2021-02-15|2021-03-02|
# | 100| Austin|2021-03-02|2099-12-31|
# +----+-------+----------+----------+
#
# Scenario 2:
# +----+------+----------+----------+
# |col1| col2|start_date| end_date|
# +----+------+----------+----------+
# | 100|Austin|2021-01-10|2099-12-31|
# +----+------+----------+----------+
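If you prefer to stay in the DataFrame API, a sketch along these lines gives the same result without the temp view and self join. It assumes, as in the example data, that col2 is the only tracked column; with 20+ columns you would compare a concatenation or hash of the tracked columns instead:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("col1").orderBy("start_date")
cdc = (
    df.withColumn("start_date", F.to_date("date"))
    .withColumn("prev_col2", F.lag("col2").over(w))
    # keep the first row per col1 and every row where col2 actually changed
    .filter(F.col("prev_col2").isNull() | (F.col("prev_col2") != F.col("col2")))
    # lead() runs on the filtered rows, so it points at the next change;
    # coalesce supplies the open-ended "2099-12-31" end date
    .withColumn(
        "end_date", F.coalesce(F.lead("start_date").over(w), F.lit("2099-12-31"))
    )
    .select("col1", "col2", "start_date", "end_date")
)
cdc.show()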

How can I split a timestamp into date and time?

//loading DF
val df1 = spark.read.option("header",true).option("inferSchema",true).csv("time.csv ")
//
+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error
=> the datatype does not match (date_time: bigint) in df
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Any solution for solving it?
I tried doing
val a = df1.withColumn("date_time",df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime",a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch format to date and then do the computation. You can try this:
import spark.implicits._
val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")
val df1 = df.withColumn(
  "dateCreated",
  date_format(
    to_date(
      substring(from_unixtime($"date_time".divide(1000)), 0, 10),
      "yyyy-MM-dd"
    ),
    "dd-MM-yyyy"
  )
).withColumn(
  "timeCreated",
  substring(from_unixtime($"date_time".divide(1000)), 11, 19)
)
Sample data from my use case:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust the time: by default from_unixtime uses your session timezone (for me it's GMT+05:30). Hope it helps.
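If you want the result to be independent of the cluster's local timezone, you can pin the session timezone before converting. Here is a minimal PySpark sketch of the same idea; spark.sql.session.timeZone is a standard Spark SQL setting, while df and the date_time column are assumptions mirroring the question:
from pyspark.sql import functions as F
# force a fixed timezone so from_unixtime is deterministic across environments
spark.conf.set("spark.sql.session.timeZone", "UTC")
out = (
    df.withColumn("ts", F.from_unixtime(F.col("date_time") / 1000).cast("timestamp"))
    .withColumn("dateCreated", F.date_format("ts", "dd-MM-yyyy"))
    .withColumn("timeCreated", F.date_format("ts", "HH:mm:ss"))
)
out.show()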

Validate data from the same column in different rows with pyspark

How can I change the value of a column depending on a validation involving other rows? What I need is to compare the kilometraje values of each customer's (id) records, checking whether each following record has a higher kilometraje than the one before it.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The ERROR in the error_km column is there because, for customer (id) 2, the kilometraje value is less than the same customer's record from 2/1/2019. (As time passes the car is used, so the kilometraje increases; for a record to be error-free, the mileage has to be higher than or equal to the previous one.)
I know that with withColumn I can overwrite or create a column that doesn't exist, and that using when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and write ERROR into the error_code column where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
file_path = 'archive.txt'
error = 'ERROR'
df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))
df = df.withColumn(
    'error_code',
    F.when(
        (F.col('status') == 'O') & (F.col('client_id') != '') |
        (F.col('status') == 'D') & (F.col('client_id') != '') |
        (F.col('status') == 'A') & (F.col('client_id') == ''),
        F.concat(F.col("error_code"), F.lit(":[{}]".format(error)))
    ).otherwise(F.col('error_code'))
)
You can achieve that with the lag window function. The lag function returns the value of the previous row in the window, so you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 30),
     ('7/1/2019', 3, 30),
     ('4/1/2019', 2, 5)]
columns = ['fecha', 'id', 'kilometraje']
df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha, 'dd/MM/yyyy'))
w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row (2019-01-05, id 1) doesn't get labeled with 'ERROR' because the previous value had a smaller kilometraje (10 < 30). If you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join:
df.drop('error_km').join(
    df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')),
    'id',
    'left'
).show()
I use .rangeBetween(Window.unboundedPreceding, 0).
This makes the window run from the first row of the partition up to the current row, so the maximum mileage seen so far can be compared with the current value.
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
error = 'This is error'
l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 22),
     ('7/1/2019', 1, 23),
     ('22/1/2019', 2, 5),
     ('11/1/2019', 2, 24),
     ('13/2/2019', 1, 16),
     ('14/2/2019', 2, 18),
     ('5/2/2019', 1, 19),
     ('6/2/2019', 2, 23),
     ('7/2/2019', 1, 14),
     ('8/3/2019', 1, 50),
     ('8/3/2019', 2, 50)]
columns = ['date', 'vin', 'mileage']
df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to drop the helper column that holds the running maximum:
df = df.drop('max')
df.show()

Spark Scala: DateDiff of two columns by hour or minute

I have two timestamp columns in a dataframe that I'd like to get the minute difference of, or alternatively, the hour difference of. Currently I'm able to get the day difference, with rounding, by doing
val df2 = df1.withColumn("time", datediff(df1("ts1"), df1("ts2")))
However, when I looked at the doc page
https://issues.apache.org/jira/browse/SPARK-8185
I didn't see any extra parameters to change the unit. Is there a different function I should be using for this?
You can get the difference in seconds by
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
Then you can do some math to get the unit you want. For example:
val df2 = df1
  .withColumn("diff_secs", diff_secs_col)
  .withColumn("diff_mins", diff_secs_col / 60D)
  .withColumn("diff_hrs", diff_secs_col / 3600D)
  .withColumn("diff_days", diff_secs_col / (24D * 3600D))
Or, in pyspark:
from pyspark.sql.functions import *
diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
df2 = df1 \
    .withColumn("diff_secs", diff_secs_col) \
    .withColumn("diff_mins", diff_secs_col / 60.0) \
    .withColumn("diff_hrs", diff_secs_col / 3600.0) \
    .withColumn("diff_days", diff_secs_col / (24.0 * 3600.0))
The answer given by Daniel de Paula works; here is an alternative that computes the difference for each row with a single SQL expression:
import org.apache.spark.sql.functions
val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2))/3600")
This first converts the data in the columns to a unix timestamp in seconds, subtracts them and then converts the difference to hours.
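For what it's worth, the same expression can be used from PySpark as well; the diff_hrs alias below is just illustrative:
df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2))/3600 AS diff_hrs")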
A useful list of functions can be found at:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$