How to generate hourly timestamps between two dates in PySpark? - pyspark

Consider this sample dataframe:
import datetime as dt

data = [(dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
+-------------------+-------------------+
| minDate| maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+
I would like to explode those two dates into an hourly time series like:
+-------------------+-------------------+
| minDate| maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 16:00:00|
|2000-01-01 16:01:00|2000-01-01 17:00:00|
|2000-01-01 17:01:00|2000-01-01 18:00:00|
|2000-01-01 18:01:00|2000-01-01 19:00:00|
|2000-01-01 19:01:00|2000-01-01 19:12:22|
+-------------------+-------------------+
Do you have any suggestions on how to achieve this without using UDFs?
Thanks

This is how I finally solved it.
Input data
data = [
    (dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22)),
    (dt.datetime(2001,1,1,15,20,37), dt.datetime(2001,1,1,18,12,22))
]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
which results in
+-------------------+-------------------+
| minDate| maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|
+-------------------+-------------------+
Transformed data
import pyspark.sql.functions as fn

# Compute hours between min and max date
df = df.withColumn(
    'hour_diff',
    fn.ceil((fn.col('maxDate').cast('long') - fn.col('minDate').cast('long')) / 3600)
)

# Duplicate rows a number of times equal to hour_diff
df = df.withColumn("repeat", fn.expr("split(repeat(',', hour_diff), ',')"))\
    .select("*", fn.posexplode("repeat").alias("idx", "val"))\
    .drop("repeat", "val")\
    .withColumn('hour_add', (fn.col('minDate').cast('long') + fn.col('idx')*3600).cast('timestamp'))
# Create the new start and end date according to the boundaries
df = (df
    .withColumn(
        'start_dt',
        fn.when(
            fn.col('idx') > 0,
            (fn.floor(fn.col('hour_add').cast('long') / 3600)*3600).cast('timestamp')
        ).otherwise(fn.col('minDate'))
    ).withColumn(
        'end_dt',
        fn.when(
            fn.col('idx') != fn.col('hour_diff'),
            (fn.ceil(fn.col('hour_add').cast('long') / 3600)*3600 - 60).cast('timestamp')
        ).otherwise(fn.col('maxDate'))
    ).drop('hour_diff', 'idx', 'hour_add'))
df.show()
Which results in
+-------------------+-------------------+-------------------+-------------------+
| minDate| maxDate| start_dt| end_dt|
+-------------------+-------------------+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 15:20:37|2000-01-01 15:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 16:00:00|2000-01-01 16:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 17:00:00|2000-01-01 17:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 18:00:00|2000-01-01 18:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 19:00:00|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 15:20:37|2001-01-01 15:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 16:00:00|2001-01-01 16:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 17:00:00|2001-01-01 17:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 18:00:00|2001-01-01 18:12:22|
+-------------------+-------------------+-------------------+-------------------+
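As an aside, on Spark 2.4+ there is a shorter sketch using the built-in sequence function, which can step by an interval. The name input_df below stands for the original minDate/maxDate dataframe, and the clamping of the first and last rows to the exact minDate/maxDate boundaries from the solution above is not reproduced here:
import pyspark.sql.functions as fn

# Generate one truncated hour start per hour between the two timestamps
hourly = input_df.withColumn(
    "hour_start",
    fn.explode(
        fn.expr("sequence(date_trunc('hour', minDate), date_trunc('hour', maxDate), interval 1 hour)")
    )
)
hourly.show(truncate=False)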

Related

How to select elements of a column of a dataframe with respect to a column of another dataframe?

How can I use two dataframes and select elements of df2 if a column in df1 is included in a column in df2, and NA otherwise?
df2:
name
summer
winter
water
play
df1:
col1
play ground
winter cold
something
work
output:
col1         name
play ground  play
winter cold  winter
something    NA
work         NA
# Create a match column by splitting col1 into single words
from pyspark.sql.functions import col, explode, split

df1 = df1.alias('df1').withColumn('col_new', explode(split('col1', r'\s')))
new = (df1.join(df2, how='left', on=df1.col_new == df2.name)  # merge on the match column
       .drop('col_new')                     # drop the match column introduced
       .orderBy([df2.name.desc(), 'name'])  # order the df
       .drop_duplicates(['col1'])           # eliminate duplicates
       ).show()
+-----------+------+
| col1| name|
+-----------+------+
|play ground| play|
| something| null|
|winter cold|winter|
| work| null|
+-----------+------+
It is recommended to use a contains condition directly in the join.
df = df1.join(df2, on=[df1.col1.contains(df2.name)], how='left')
df.show(truncate=False)
Another approach is to split and explode both columns and join on the common word:
df1 = spark.createDataFrame([("play ground",), ("winter cold",), ("something",), ("work",)], ['col1'])
df2 = spark.createDataFrame([("summer",), ("winter",), ("play bc",), ("play",)], ['name'])
# Split & explode column 'col1' of df1.
df1 = df1.withColumn('common_word', explode(split(col('col1'), r'\s')))
# Also split & explode column 'name' of df2.
df2 = df2.withColumn('common_word', explode(split(col('name'), r'\s')))
(
    df1
    .join(df2, ['common_word'], "left")
    .sort('col1')
    .fillna("NA")
    .show()
)
+-----------+-----------+-------+
|common_word| col1| name|
+-----------+-----------+-------+
| ground|play ground| NA|
| play|play ground|play bc|
| play|play ground| play|
| something| something| NA|
| cold|winter cold| NA|
| winter|winter cold| winter|
| work| work| NA|
+-----------+-----------+-------+
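Note that the split-and-explode join above can return more than one match per col1 (play appears twice). A minimal sketch for collapsing back to one row per col1, keeping one matched name; picking the first non-null match with F.first is my own arbitrary choice here, not part of the answer above:
from pyspark.sql import functions as F

(df1
    .join(df2, ['common_word'], 'left')
    .groupBy('col1')
    .agg(F.first('name', ignorenulls=True).alias('name'))
    .fillna('NA')
    .show())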

How to add the incremental date value with respect to first row value in spark dataframe

Input:
+------+--------+
|Test  |01-12-20|
|Ravi  |    null|
|Son   |    null|
+------+--------+
Expected output:
+------+--------+
|Test  |01-12-20|
|Ravi  |02-12-20|
|Son   |03-12-20|
+------+--------+
I tried .withColumn(col("dated"),date_add(col("dated"),1)); but this results in NULL for all the column values.
Could you please help me with getting the incremental values in the second (date) column?
Here is a working solution for you.
Input
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df = spark.createDataFrame([("Test", "01-12-20"), ("Ravi", None), ("Son", None)], ["col1", "col2"])
df.show()
df = df.withColumn("col2", F.to_date(F.col("col2"), "dd-MM-yy"))
# A dummy column used only to define a single global window
df = df.withColumn("del_col", F.lit(0))
_w = W.partitionBy(F.col("del_col")).orderBy(F.col("del_col").desc())
df = df.withColumn("rn_no", F.row_number().over(_w) - 1)
# Carry the date from the first row to every row
df = df.withColumn("dated", F.first("col2").over(_w))
df = df.selectExpr('*', 'date_add(dated, rn_no) as next_date')
df.show()
DF
+----+--------+
|col1| col2|
+----+--------+
|Test|01-12-20|
|Ravi| null|
| Son| null|
+----+--------+
Final Output
+----+----------+-------+-----+----------+----------+
|col1| col2|del_col|rn_no| dated| next_date|
+----+----------+-------+-----+----------+----------+
|Test|2020-12-01| 0| 0|2020-12-01|2020-12-01|
|Ravi| null| 0| 1|2020-12-01|2020-12-02|
| Son| null| 0| 2|2020-12-01|2020-12-03|
+----+----------+-------+-----+----------+----------+
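A hypothetical simplification of the same idea, assuming col2 has already been parsed with to_date as above: order one global window by a generated id instead of the dummy del_col column.
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Single global window; like the answer above, this pulls all rows into one partition.
w = W.orderBy(F.monotonically_increasing_id())
out = (df
    .withColumn("rn_no", F.row_number().over(w) - 1)
    .withColumn("dated", F.first("col2", ignorenulls=True).over(w))
    .selectExpr("col1", "date_add(dated, rn_no) as next_date"))
out.show()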

How to delete/filter the specific rows from a spark dataframe

I want to delete specific records from a Spark dataframe:
Sample Input:
Expected output:
Discarded Rows:
I have written the below code to filter the dataframe (which is incorrect):
val Name = List("Rahul","Mahesh","Gaurav")
val Age =List(20,55)
val final_pub_df = df.filter(!col("Name").isin(Name:_*) && !col("Age").isin(Age:_*))
So my question is: how do I filter the dataframe on more than one column with specific filter criteria?
The dataframe should be filtered based on the combination of the Name and Age fields.
Here's the solution. Based on your dataset, I formulated the problem as follows:
the dataframe below has some incorrect entries; I want to remove all incorrect records and keep only the correct records -
val Friends = Seq(
  ("Rahul", "99", "AA"),
  ("Rahul", "20", "BB"),
  ("Rahul", "30", "BB"),
  ("Mahesh", "55", "CC"),
  ("Mahesh", "88", "DD"),
  ("Mahesh", "44", "FF"),
  ("Ramu", "30", "FF"),
  ("Gaurav", "99", "PP"),
  ("Gaurav", "20", "HH")
).toDF("Name", "Age", "City")
Lists for filtering -
val Name = List("Rahul", "Mahesh", "Gaurav")
val IncorrectAge = List(20, 55)
Dataops -
Friends.filter(!(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))).show
Here's the output -
+------+---+----+
| Name|Age|City|
+------+---+----+
| Rahul| 99| AA|
| Rahul| 30| BB|
|Mahesh| 88| DD|
|Mahesh| 44| FF|
| Ramu| 30| FF|
|Gaurav| 99| PP|
+------+---+----+
You can also do it with the help of joins.
Create a Badrecords df -
val badrecords = Friends.filter(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))
Use a left_anti join to select Friends minus badrecords -
Friends.alias("left").join(badrecords.alias("right"), Seq("Name", "Age"), "left_anti").show
Here's the output -
+------+---+----+
| Name|Age|City|
+------+---+----+
| Rahul| 99| AA|
| Rahul| 30| BB|
|Mahesh| 88| DD|
|Mahesh| 44| FF|
| Ramu| 30| FF|
|Gaurav| 99| PP|
+------+---+----+
I think you may want to flip the not condition; filter on a dataframe is an alias for the where clause in SQL.
So you want the query to be:
df.filter(col("Name").isin(Name:_*) && col("Age").isin(Age:_*))
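If the rows to discard are specific (Name, Age) pairs rather than every combination from the two lists, a left_anti join against a small dataframe of those pairs is another sketch; the pairs below are made up for illustration:
// Hypothetical: drop exact (Name, Age) pairs instead of the Name x Age cross product
val badPairs = Seq(("Rahul", "20"), ("Mahesh", "55")).toDF("Name", "Age")
Friends.join(badPairs, Seq("Name", "Age"), "left_anti").show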

How to find quantiles inside agg() function after groupBy in Scala SPARK

I have a dataframe in which I want to group by column A and then find different stats like mean, min, max, std dev and quantiles.
I am able to find min, max and mean using the following code:
df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)
But I am unable to find the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but they give the following error:
error: not found: value approxQuantile
If you have Hive on the classpath, you can use many UDAFs like percentile_approx and stddev_samp, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
You can call these functions using callUDF:
import ss.implicits._
import org.apache.spark.sql.functions.{callUDF, lit}

val df = Seq(1.0, 2.0, 3.0).toDF("x")
df.groupBy()
  .agg(
    callUDF("percentile_approx", $"x", lit(0.5)).as("median"),
    callUDF("stddev_samp", $"x").as("stdev")
  )
  .show()
Here is code that I have tested on Spark 3.1, where percentile_approx is available as a built-in DataFrame function:
import org.apache.spark.sql.functions._
import spark.implicits._

val simpleData = Seq(
  ("James", "Sales", "NY", 90000, 34, 10000),
  ("Michael", "Sales", "NY", 86000, 56, 20000),
  ("Robert", "Sales", "CA", 81000, 30, 23000),
  ("Maria", "Finance", "CA", 90000, 24, 23000),
  ("Raman", "Finance", "CA", 99000, 40, 24000),
  ("Scott", "Finance", "NY", 83000, 36, 19000),
  ("Jen", "Finance", "NY", 79000, 53, 15000),
  ("Jeff", "Marketing", "CA", 80000, 25, 18000),
  ("Kumar", "Marketing", "NY", 91000, 50, 21000)
)
val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")
df.show()

df.groupBy($"department")
  .agg(
    percentile_approx($"salary", lit(0.5), lit(10000))
  )
  .show(false)
Output
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
+----------+-------------------------------------+
|department|percentile_approx(salary, 0.5, 10000)|
+----------+-------------------------------------+
|Sales |86000 |
|Finance |83000 |
|Marketing |80000 |
+----------+-------------------------------------+
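To get several quantiles (0.25, 0.5, 0.75) in one aggregation, percentile_approx also accepts an array of percentiles; a short sketch on the same dataframe (column names as above):
// Returns an array column with one approximate quantile per requested percentile
df.groupBy($"department")
  .agg(expr("percentile_approx(salary, array(0.25, 0.5, 0.75), 10000) as salary_quantiles"))
  .show(false)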

Spark Dataframe implementation similar to Oracle's LISTAGG function - Unable to Order with in the group

I want to implement a function similar to Oracle's LISTAGG function.
Equivalent oracle code is
select KEY,
listagg(CODE, '-') within group (order by DATE) as CODE
from demo_table
group by KEY
Here is my Spark Scala dataframe implementation, but I am unable to order the values within each group.
Input:
val values = List(
  List("66", "PL", "2016-11-01"), List("66", "PL", "2016-12-01"),
  List("67", "JL", "2016-12-01"), List("67", "JL", "2016-11-01"), List("67", "PL", "2016-10-01"),
  List("67", "PO", "2016-09-01"), List("67", "JL", "2016-08-01"),
  List("68", "PL", "2016-12-01"), List("68", "JO", "2016-11-01")
).map(row => (row(0), row(1), row(2)))
val df = values.toDF("KEY", "CODE", "DATE")
df.show()
+---+----+----------+
|KEY|CODE| DATE|
+---+----+----------+
| 66| PL|2016-11-01|
| 66| PL|2016-12-01|----- group 1
| 67| JL|2016-12-01|
| 67| JL|2016-11-01|
| 67| PL|2016-10-01|
| 67| PO|2016-09-01|
| 67| JL|2016-08-01|----- group 2
| 68| PL|2016-12-01|
| 68| JO|2016-11-01|----- group 3
+---+----+----------+
UDF implementation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf
val listAgg = udf((xs: Seq[String]) => xs.mkString("-"))
df.groupBy("KEY")
.agg(listAgg(collect_list("CODE")).alias("CODE"))
.show(false)
+---+--------------+
|KEY|CODE |
+---+--------------+
|68 |PL-JO |
|67 |JL-JL-PL-PO-JL|
|66 |PL-PL |
+---+--------------+
Expected output (ordered by DATE within each group):
+---+--------------+
|KEY|CODE |
+---+--------------+
|68 |JO-PL |
|67 |JL-PO-PL-JL-JL|
|66 |PL-PL |
+---+--------------+
Use the struct inbuilt function to combine the CODE and DATE columns, and use that new struct column in the collect_list aggregation function. Then, in the udf function, sort by DATE and collect the CODE values into a "-"-separated string.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def sortAndStringUdf = udf((codeDate: Seq[Row]) =>
  codeDate.sortBy(row => row.getAs[Long]("DATE")).map(row => row.getAs[String]("CODE")).mkString("-"))

df.withColumn("codeDate", struct(col("CODE"), col("DATE").cast("timestamp").cast("long").as("DATE")))
  .groupBy("KEY").agg(sortAndStringUdf(collect_list("codeDate")).as("CODE"))
which should give you
+---+--------------+
|KEY| CODE|
+---+--------------+
| 68| JO-PL|
| 67|JL-PO-PL-JL-JL|
| 66| PL-PL|
+---+--------------+
I hope the answer is helpful.
Update
I am sure this will be faster than using a udf function:
df.withColumn("codeDate", struct(col("DATE").cast("timestamp").cast("long").as("DATE"), col("CODE")))
.groupBy("KEY")
.agg(concat_ws("-", expr("sort_array(collect_list(codeDate)).CODE")).alias("CODE"))
.show(false)
which should give you the same result as above
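For Spark 2.4+, a sketch of the same idea with higher-order functions is also possible; it sorts the struct on the raw DATE string, which works here because yyyy-MM-dd dates sort lexicographically in chronological order:
df.groupBy("KEY")
  .agg(concat_ws("-",
    expr("transform(array_sort(collect_list(struct(DATE, CODE))), x -> x.CODE)")
  ).alias("CODE"))
  .show(false)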