I have a DataFrame with bookingDt and arrivalDt columns. I need to find all the dates between these two dates.
Sample code:
from pyspark.sql import Row
from pyspark.sql.functions import datediff

df = spark.sparkContext.parallelize(
    [Row(vyge_id=1000, bookingDt='2018-01-01', arrivalDt='2018-01-05')]).toDF()
diffDaysDF = df.withColumn("diffDays", datediff('arrivalDt', 'bookingDt'))
diffDaysDF.show()
code output:
+----------+----------+-------+--------+
| arrivalDt| bookingDt|vyge_id|diffDays|
+----------+----------+-------+--------+
|2018-01-05|2018-01-01| 1000| 4|
+----------+----------+-------+--------+
What I tried was finding the number of days between the two dates, calculating all the dates in between using timedelta, and exploding the result (a sketch that completes this idea follows the expected output below):
dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays)]
Expected output:
Basically, I need to build a DF with a record for each date in between bookingDt and arrivalDt, inclusive.
+----------+----------+-------+----------+
| arrivalDt| bookingDt|vyge_id|     txnDt|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01|   1000|2018-01-01|
|2018-01-05|2018-01-01|   1000|2018-01-02|
|2018-01-05|2018-01-01|   1000|2018-01-03|
|2018-01-05|2018-01-01|   1000|2018-01-04|
|2018-01-05|2018-01-01|   1000|2018-01-05|
+----------+----------+-------+----------+
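For reference, here is a minimal sketch that completes the timedelta idea above: a UDF (the date_range helper below is hypothetical) returns every date from bookingDt to arrivalDt inclusive as an array of 'yyyy-MM-dd' strings, which is then exploded:
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# hypothetical helper: return every date between the two bounds (inclusive) as strings
@F.udf(returnType=ArrayType(StringType()))
def date_range(bookingDt, arrivalDt):
    start = datetime.strptime(bookingDt, '%Y-%m-%d').date()
    end = datetime.strptime(arrivalDt, '%Y-%m-%d').date()
    return [str(start + timedelta(days=i)) for i in range((end - start).days + 1)]

df.withColumn('txnDt', F.explode(date_range('bookingDt', 'arrivalDt'))).show()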
For Spark 2.4+, sequence can be used to create an array containing all dates between bookingDt and arrivalDt. This array can then be exploded.
from pyspark.sql import functions as F
df = df \
.withColumn('bookingDt', F.col('bookingDt').cast('date')) \
.withColumn('arrivalDt', F.col('arrivalDt').cast('date'))
df.withColumn('txnDt', F.explode(F.expr('sequence(bookingDt, arrivalDt, interval 1 day)')))\
.show()
Output:
+-------+----------+----------+----------+
|vyge_id| bookingDt| arrivalDt| txnDt|
+-------+----------+----------+----------+
| 1000|2018-01-01|2018-01-05|2018-01-01|
| 1000|2018-01-01|2018-01-05|2018-01-02|
| 1000|2018-01-01|2018-01-05|2018-01-03|
| 1000|2018-01-01|2018-01-05|2018-01-04|
| 1000|2018-01-01|2018-01-05|2018-01-05|
+-------+----------+----------+----------+
As long as you're using Spark version 2.1 or higher, you can exploit the fact that we can use column values as arguments when using pyspark.sql.functions.expr():
Create a dummy string of repeating commas with a length equal to diffDays
Split this string on ',' to turn it into an array of size diffDays
Use pyspark.sql.functions.posexplode() to explode this array along with its indices
Finally use pyspark.sql.functions.date_add() to add the index value (as a number of days) to the bookingDt
Code:
import pyspark.sql.functions as f
diffDaysDF.withColumn("repeat", f.expr("split(repeat(',', diffDays), ',')"))\
.select("*", f.posexplode("repeat").alias("txnDt", "val"))\
.drop("repeat", "val", "diffDays")\
.withColumn("txnDt", f.expr("date_add(bookingDt, txnDt)"))\
.show()
#+----------+----------+-------+----------+
#| arrivalDt| bookingDt|vyge_id| txnDt|
#+----------+----------+-------+----------+
#|2018-01-05|2018-01-01| 1000|2018-01-01|
#|2018-01-05|2018-01-01| 1000|2018-01-02|
#|2018-01-05|2018-01-01| 1000|2018-01-03|
#|2018-01-05|2018-01-01| 1000|2018-01-04|
#|2018-01-05|2018-01-01| 1000|2018-01-05|
#+----------+----------+-------+----------+
Well, you can do the following.
Create a dataframe with dates only:
dates_df # with all days between first bookingDt and last arrivalDt
and then join the two DataFrames with a between condition:
from pyspark.sql.functions import col

df.alias('df').join(dates_df.alias('dates_df'),
    on=col('dates_df.dates').between(col('df.bookingDt'), col('df.arrivalDt'))) \
  .select('df.*', 'dates_df.dates')
It might even work faster than the solution with explode; however, you need to figure out the start and end dates for this DataFrame (a sketch of building dates_df that way is shown below).
Even a 10-year range is only about 3650 rows, not that many to worry about.
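A minimal sketch of building dates_df from the overall range, assuming Spark 2.4+ so that sequence() is available (column names follow the example above):
from pyspark.sql import functions as F

# overall date range across the whole DataFrame
bounds = df.select(
    F.min(F.col('bookingDt').cast('date')).alias('start'),
    F.max(F.col('arrivalDt').cast('date')).alias('end'))

# one row per day between the earliest bookingDt and the latest arrivalDt
dates_df = bounds.select(
    F.explode(F.expr('sequence(start, end, interval 1 day)')).alias('dates'))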
As @vvg suggested:
# I assume the distinct bookingDt values cover every date up to arrivalDt;
# otherwise you have to build dates_df from the full date range instead
dates_df = df.select('bookingDt').distinct()
dates_df = dates_df.withColumnRenamed('bookingDt', 'day_of_listing')
listing_days_df = df.join(dates_df, on=dates_df.day_of_listing.between(df.bookingDt, df.arrivalDt))
Output:
+----------+----------+-------+--------------+
| arrivalDt| bookingDt|vyge_id|day_of_listing|
+----------+----------+-------+--------------+
|2018-01-05|2018-01-01|   1000|    2018-01-01|
|2018-01-05|2018-01-01|   1000|    2018-01-02|
|2018-01-05|2018-01-01|   1000|    2018-01-03|
|2018-01-05|2018-01-01|   1000|    2018-01-04|
|2018-01-05|2018-01-01|   1000|    2018-01-05|
+----------+----------+-------+--------------+
Currently I'm working with a dataframe and need to calculate the number of days (as an integer) between two dates formatted as timestamps.
I've opted for this solution:
from pyspark.sql.functions import lit, when, col, datediff
df1 = df1.withColumn("LD", datediff("MD", "TD"))
But after calculating a sum from a list I get the error "Column is not iterable", which makes it impossible for me to calculate the sum of the rows based on column names:
col_list = ["a", "b", "c"]
df2 = df1.withColumn("My_Sum", sum([F.col(c) for c in col_list]))
How can I deal with it in order to calculate the difference between dates and then calculate the sum of the rows given the names of certain columns?
The datediff has nothing to do with the sum of a column. The pyspark sql sum function takes in 1 column and it calculates the sum of the rows in that column.
Here are a couple of ways to get the sum of a column from a list of columns using list comprehension.
Single row output with the sum of the column
from pyspark.sql import functions as func

data_sdf. \
    select(*[func.sum(c).alias(c+'_sum') for c in col_list]). \
    show()
# +-----+-----+-----+
# |a_sum|b_sum|c_sum|
# +-----+-----+-----+
# | 1337| 3778| 6270|
# +-----+-----+-----+
The sum of all rows of each column, repeated in every row (using a window):
from pyspark.sql.window import Window as wd
data_sdf. \
select('*',
*[func.sum(c).over(wd.partitionBy()).alias(c+'_sum') for c in col_list]
). \
show(5)
# +---+---+---+-----+-----+-----+
# | a| b| c|a_sum|b_sum|c_sum|
# +---+---+---+-----+-----+-----+
# | 45| 58|125| 1337| 3778| 6270|
# | 9| 99|143| 1337| 3778| 6270|
# | 33| 91|146| 1337| 3778| 6270|
# | 21| 85|118| 1337| 3778| 6270|
# | 30| 55|101| 1337| 3778| 6270|
# +---+---+---+-----+-----+-----+
# only showing top 5 rows
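If the goal is instead a per-row sum across the listed columns (as in the original My_Sum attempt), a minimal sketch is to reduce over the Column objects, which sidesteps any shadowing of Python's built-in sum by pyspark's sum:
from functools import reduce
from operator import add
from pyspark.sql import functions as func

col_list = ['a', 'b', 'c']

# adds the Column objects pairwise, producing one per-row total
data_sdf = data_sdf.withColumn('My_Sum', reduce(add, [func.col(c) for c in col_list]))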
I have a spark dataframe : df :
|id | year | month |
-------------------
| 1 | 2020 | 01 |
| 2 | 2019 | 03 |
| 3 | 2020 | 01 |
I have a sequence year_month = Seq((2019, 1), (2020, 1), (2021, 1)).
The year_month sequence gets generated dynamically each time the code runs.
I want to filter the dataframe df based on the year_month sequence, keeping rows where $"year" === pair._1 && $"month" === pair._2 for each value pair in year_month.
You can achieve this by
Create a dataframe from year_month
Perform an inner join on year_month with your original dataframe on month and year
Choose distinct records
The resulting dataframe will contain the matched rows
Working Example
Setup
import spark.implicits._
val dfData = Seq((1,2020,1),(2,2019,3),(3,2020,1))
val df = dfData.toDF()
.selectExpr("_1 as id"," _2 as year","_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
.selectExpr("_1 as year","_2 as month" )
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
var dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step 3
var dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB: If you are interested in finding the matching records irrespective of the id, you could update the Spark SQL to the following (o.id has been removed):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the following result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
So I have one pyspark dataframe like so, let's call it dataframe a:
+-------------------+---------------+----------------+
| reg| val1| val2 |
+-------------------+---------------+----------------+
| N110WA| 1590030660| 1590038340000|
| N876LF| 1590037200| 1590038880000|
| N135MH| 1590039060| 1590040080000|
And another like this, let's call it dataframe b:
+-----+-------------+-----+-----+---------+----------+---+----+
| reg| postime| alt| galt| lat| long|spd| vsi|
+-----+-------------+-----+-----+---------+----------+---+----+
|XY679|1590070078549| 50| 130|18.567169|-69.986343|132|1152|
|HI949|1590070091707| 375| 455| 18.5594|-69.987804|148|1344|
|JX784|1590070110666| 825| 905|18.544968|-69.990414|170|1216|
Is there some way to create a numpy array or pyspark dataframe where, for each row in dataframe a, all the rows in dataframe b with the same reg and a postime between val1 and val2 are included?
You can try the solution below -- and let us know if it works or if anything else is expected.
I have modified the inputs a little in order to showcase the working solution.
Input here
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',1590030660,1590038340000), ('N110WA',1590070078549,1590070078559)],[ "reg","val1","val2"])
df_b = spark.createDataFrame([('N110WA',1590070078549)],[ "reg","postime"])
df_a.show()
df_a
+------+-------------+-------------+
| reg| val1| val2|
+------+-------------+-------------+
|N110WA| 1590030660|1590038340000|
|N110WA|1590070078549|1590070078559|
+------+-------------+-------------+
df_b
+------+-------------+
| reg| postime|
+------+-------------+
|N110WA|1590070078549|
+------+-------------+
Solution here
from pyspark.sql import types as T
from pyspark.sql import functions as F
# join on reg first so that postime is available alongside val1 and val2
df_a = df_a.join(df_b, 'reg', 'left')
df_a = df_a.withColumn('condition_col', F.when(((F.col('postime') >= F.col('val1')) & (F.col('postime') <= F.col('val2'))), '1').otherwise('0'))
df_a = df_a.filter(F.col('condition_col') == 1).drop('condition_col')
df_a.show()
Final Output
+------+-------------+-------------+-------------+
| reg| val1| val2| postime|
+------+-------------+-------------+-------------+
|N110WA|1590070078549|1590070078559|1590070078549|
+------+-------------+-------------+-------------+
Yes, assuming df_a and df_b are both pyspark dataframes, you can use an inner join in pyspark:
delta = 0  # optional tolerance, in the same units as the timestamps
df = df_a.join(df_b, [
    df_a.reg == df_b.reg,
    df_b.postime >= df_a.val1 - delta,
    df_b.postime <= df_a.val2 + delta
], "inner")
This will keep only the rows of df_b whose postime falls within the [val1, val2] window of a matching df_a row.
I'm trying to add an id to every single group of dates using Spark Scala.
For example, if the input was:
date
2019-01-29
2019-01-29
2019-07-31
2019-01-29
2019-07-31
The output would be:
id, date
ABC1, 2019-01-29
ABC1, 2019-01-29
ABC1, 2019-01-29
ABC2, 2019-07-31
ABC2, 2019-07-31
Can anyone help me with this?
I was successful with adding sequential line numbers for each partition, but I would like a constant value for each partition.
df.withColumn(lineNumColName, row_number().over(Window.partitionBy(partitionByCol).orderBy(orderByCol))).repartition(1).orderBy(orderByCol, lineNumColName)
Option 1 (small dataset):
If your dataset is not too large, you can use Window and dense_rank as shown next:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat,lit, dense_rank}
val df = Seq(("2019-01-29"),
("2019-01-29"),
("2019-07-31"),
("2019-01-29"),
("2019-07-31")).toDF("date")
val w = Window.orderBy($"date")
val d_rank = dense_rank().over(w)
df.withColumn("id", concat(lit("ABC"), d_rank)).show(false)
Output:
+----------+----+
|date |id |
+----------+----+
|2019-01-29|ABC1|
|2019-01-29|ABC1|
|2019-01-29|ABC1|
|2019-07-31|ABC2|
|2019-07-31|ABC2|
+----------+----+
Since we don't specify any value for partitionBy, this will use only one partition and will therefore be very inefficient.
Option 2 (large dataset):
A more efficient approach would be to assign ids to a large dataset using the zipWithIndex function:
val df_d = df.distinct.rdd.zipWithIndex().map{ r => (r._1.getString(0), r._2 + 1) }.toDF("date", "id")
df_d.show
// Output:
+----------+---+
| date| id|
+----------+---+
|2019-01-29| 1|
|2019-07-31| 2|
+----------+---+
First we get the unique dates of the dataframe with distinct, then we call zipWithIndex to create a unique id for each date record.
Finally we join the two datasets:
df.join(df_d, Seq("date"))
.withColumn("id", concat(lit("ABC"), $"id"))
.show
// Output:
+----------+----+
| date| id|
+----------+----+
|2019-01-29|ABC1|
|2019-01-29|ABC1|
|2019-01-29|ABC1|
|2019-07-31|ABC2|
|2019-07-31|ABC2|
+----------+----+
I am new to the Spark API. I am trying to extract the weekday number from a column, say col_date (containing a datetime stamp, e.g. '13AUG15:09:40:15'), which is a string, and add another column with the weekday as an integer. I have not been able to do it successfully.
The approach below worked for me, using a 'one line' udf, similar but different to the one above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
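For comparison, a UDF-free sketch using the built-in dayofweek function, assuming Spark 2.3+ (note its numbering is 1 = Sunday through 7 = Saturday, which differs from '%w'):
from pyspark.sql import functions as F

# cast the 'yyyy-MM-dd' strings to date and let dayofweek do the work
df = df.withColumn('weekDay', F.dayofweek(F.col('date').cast('date')))
df.sort(F.desc('date')).show()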
Well, this is quite simple.
This simple function does all the work and returns the weekday as a number (Sunday = 0, Monday = 1):
from time import time
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
    # v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
    v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
    return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform a python list to an RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see what's in the RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: toWeekDay(x)) # apply the function toWeekDay to each element of the RDD
result.take(2) # let's see the results
> ['4', '3']
Please see Python documentation for further details on datetime processing.