In PySpark, how do you draw a histogram from grouped-by data?

I have a dataframe that looks like the following,
+-------+--------+
|Charges| Status|
+-------+--------+
| 495.6| Denied|
|1806.28| Denied|
| 261.3|Accepted|
| 8076.5|Accepted|
|1041.24| Denied|
| 507.88| Denied|
| 208.0|Accepted|
| 152.49| Denied|
| 146.0|Accepted|
|2794.14|Accepted|
+-------+--------+
How do I get two histograms based on groupBy('Status'), using Databricks' display() function? Thank you.
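One way to approach this (a sketch, assuming the table above is loaded as a DataFrame named df): filter per Status value and either compute the bins yourself with RDD.histogram, or hand each filtered frame to Databricks' display() and pick "Histogram" as the plot type in the chart options.
import pyspark.sql.functions as F

# Compute histogram bins per Status group with RDD.histogram (10 buckets here)
for status in [r['Status'] for r in df.select('Status').distinct().collect()]:
    charges = df.filter(F.col('Status') == status).select('Charges')
    bins, counts = charges.rdd.flatMap(lambda x: x).histogram(10)
    print(status, list(zip(bins, counts)))

# Or, in a Databricks notebook, display() each filtered DataFrame and choose
# "Histogram" as the plot type under the chart options
display(df.filter(F.col('Status') == 'Denied').select('Charges'))
display(df.filter(F.col('Status') == 'Accepted').select('Charges'))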

Related

What is the best practice to handle a non-datetime timestamp column within a pandas dataframe?

Let's say I have the following dataframe with a non-standard timestamp column that is not in datetime format. I need to add a new column that converts it into a 24-hourly-based timestamp for time-series visualization purposes:
df['timestamp(24hrs)'] = round(df['timestamp(sec)'] / (24 * 3600))
and get this:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Now I noticed that some records' timestamps are missing, and I need to impute the missing data so that:
timestamp(24hrs) is in continuous order
the count value is filled with 0
Expected output:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|5.0 |U100|0 |
|6.0 |U100|0 |
|7.0 |U100|0 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|12.0 |U100|0 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|15.0 |U100|0 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|268.0 |U100|0 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Any idea how I can do this? Based on this answer for a standard timestamp, I imagine I need to create a new timestamp(24hrs) column covering the start and end of the existing one and do a left join() & crossJoin(), but I couldn't manage it yet.
I've tried the following unsuccessfully:
import pyspark.sql.functions as F

all_dates_df = df.selectExpr(
    "sequence(min(timestamp(24hrs)), max(timestamp(24hrs)), interval 1 hour) as hour"
).select(F.explode("timestamp(24hrs)").alias("timestamp(24hrs)"))
all_dates_df.show()

result_df = all_dates_df.crossJoin(
    df.select("UserName").distinct()
).join(
    df,
    ["count", "timestamp(24hrs)"],
    "left"
).fillna(0)
result_df.show()
The sequence function works with integers. It doesn't work for the double type, so you need to cast to integer, build the range, and then cast back to double (if you want to keep it as a double).
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, DoubleType

# sequence needs integer bounds: cast to int, build the full range, then cast back to double
df_seq = (df.withColumn('time_int', F.col('timestamp(24hrs)').cast(IntegerType()))
            .select(F.explode(F.sequence(F.min('time_int'), F.max('time_int'))).alias('timestamp(24hrs)'))
            .select(F.col('timestamp(24hrs)').cast(DoubleType())))

# cross join the full range with the distinct users, left join the original data back, and fill gaps with 0
df = (df_seq.crossJoin(df.select("User").distinct())
            .join(df, on=['User', 'timestamp(24hrs)'], how='left')
            .fillna(0))

Pyspark: remove duplicates based on a string value in another column

I have this dataframe below:
+--------+----------+----------+
|SID |Date |Attribute |
+--------+----------+----------+
|1001 |2021-01-01|Y |
|1001 |2021-05-31|N |
|1001 |2021-05-15|N |
|1002 |2021-05-31|N |
|1002 |2021-04-06|N |
|1003 |2021-01-01|Y |
|1003 |2021-02-01|N |
|1004 |2021-03-30|N |
+--------+----------+----------+
I'm trying to get a result like the one below.
+--------+----------+----------+
|SID |Date |Attribute |
+--------+----------+----------+
|1001 |2021-01-01|Y |
|1002 |2021-05-31|N |
|1002 |2021-04-06|N |
|1003 |2021-01-01|Y |
|1004 |2021-03-30|N |
+--------+----------+----------+
I want to exclude a SID's N records when that SID also has a Y in one of its rows for Attribute, but keep all of the records for a SID if it only has N in Attribute.
I think a window partition with a filter can help, but I'm not sure how to set up the conditions I mentioned. Is there any way this can be achieved in PySpark? I saw a similar post, but it was for Scala SQL and not for PySpark.
from pyspark.sql import Window
import pyspark.sql.functions as F

# Create a window over each SID group, ordered by Date and spanning the whole partition
h = Window.partitionBy('SID').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
(
    # Broadcast the first Attribute value (by Date) within each SID to a helper column
    df.withColumn('filt', F.first('Attribute').over(h))
      # Keep only rows whose Attribute matches that value, then drop the helper column
      .filter(F.col('Attribute') == F.col('filt')).drop('filt')
).show()
+----+----------+---------+
| SID| Date|Attribute|
+----+----------+---------+
|1001|2021-01-01| Y|
|1002|2021-04-06| N|
|1002|2021-05-31| N|
|1003|2021-01-01| Y|
|1004|2021-03-30| N|
+----+----------+---------+
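If you'd rather not rely on the Y row happening to have the earliest Date, a small variant (a sketch over the same dataframe) compares each row against the maximum Attribute per SID; since 'Y' sorts after 'N', the maximum is 'Y' exactly when the group contains a Y.
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('SID')
(df.withColumn('max_attr', F.max('Attribute').over(w))   # 'Y' if the SID has any Y row, else 'N'
   .filter(F.col('Attribute') == F.col('max_attr'))      # keep the Y rows when present, otherwise all N rows
   .drop('max_attr')
   .show())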

How to keep the maximum value of a column along with other columns in a pyspark dataframe?

Consider that I have this dataframe in pyspark:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-03-2014 07:33| 4.5| 90041 |
| 00236|11-04-2014 05:43| 7.2| 90024 |
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-06-2014 05:55| 5.6| 90037 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
| 00235|11-02-2014 05:33| 4.3| 90029 |
| 00235|11-09-2014 05:33| 4.2| 90047 |
+--------+----------------+---------+---------+
How can I write a pyspark script to keep the maximum value of the range column along with the other columns in this pyspark dataframe? The output should look like this:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
+--------+----------------+---------+---------+
Using Window and row_number():
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc, col

w = Window.partitionBy("DeviceID")
df.withColumn("rank", row_number().over(w.orderBy(desc("range"))))\
  .filter(col("rank") == 1)\
  .drop("rank").show()
Output:
+--------+----------------+-----+-------+
|DeviceID| TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
| 00236|11-05-2014 05:43| 8.5| 90026|
| 00234|11-01-2014 05:55| 9.2| 90032|
| 00235|11-05-2014 05:33| 4.3| 90082|
+--------+----------------+-----+-------+
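An alternative without a window function (a sketch over the same columns) takes the max of a struct whose first field is range; max compares struct fields in order, so the row with the largest range per DeviceID is kept whole.
import pyspark.sql.functions as F

(df.groupBy("DeviceID")
   .agg(F.max(F.struct("range", "TimeStamp", "zipcode")).alias("m"))
   .select("DeviceID", "m.TimeStamp", "m.range", "m.zipcode")
   .show())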

Forward Fill New Row to Account for Missing Dates

I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the rows with the prior row that maps to the variable in column x.
I've seen some solutions to similar problems using pandas, but ideally I would like to understand how best to approach this with a PySpark UDF.
I'd initially thought about something like the following with pandas, but as a first pass I also struggled to implement this to just fill while ignoring the aggregator:
df = df.set_index(keys=[df.timestamp]).resample('1H', fill_method='ffill')
But ideally I'd like to avoid using pandas.
In the example below i have two missing rows of hourly data (labeled as MISSING).
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| MISSING | MISSING |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| MISSING | MISSING |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
The expected output here would be the following:
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| 2018-12-27T11:00:00Z | A |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| 2018-12-27T12:00:00Z | B |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
Appreciate the help.
Thanks.
Here is a solution to fill in the missing hours, using a window, lag and a UDF. With a little modification it can be extended to days as well.
from pyspark.sql.window import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
from dateutil.relativedelta import relativedelta
def missing_hours(t1, t2):
    # hours strictly between the previous timestamp t2 and the current timestamp t1
    return [t1 + relativedelta(hours=-x) for x in range(1, t1.hour - t2.hour)]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))

df = spark.read.csv('dates.csv', header=True, inferSchema=True)
window = Window.partitionBy("aggregator").orderBy("timestamp")

# For each row, look at the previous timestamp in its group and explode the missing hours in between
df_missing = df.withColumn("prev_timestamp", lag(col("timestamp"), 1, None).over(window))\
    .filter(col("prev_timestamp").isNotNull())\
    .withColumn("timestamp", explode(missing_hours_udf(col("timestamp"), col("prev_timestamp"))))\
    .drop("prev_timestamp")

df.union(df_missing).orderBy("aggregator", "timestamp").show()
which results in:
+-------------------+----------+
| timestamp|aggregator|
+-------------------+----------+
|2018-12-27 09:00:00| A|
|2018-12-27 10:00:00| A|
|2018-12-27 11:00:00| A|
|2018-12-27 12:00:00| A|
|2018-12-27 13:00:00| A|
|2018-12-27 09:00:00| B|
|2018-12-27 10:00:00| B|
|2018-12-27 11:00:00| B|
|2018-12-27 12:00:00| B|
|2018-12-27 13:00:00| B|
|2018-12-27 14:00:00| B|
+-------------------+----------+
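On Spark 2.4+ you can skip the UDF and let sequence() enumerate the hours between consecutive rows (a sketch, assuming timestamp is already a TimestampType column); unlike the hour arithmetic in the UDF above, this also works across day boundaries.
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w = Window.partitionBy("aggregator").orderBy("timestamp")
missing = (df.withColumn("prev_timestamp", F.lag("timestamp").over(w))
             .filter(F.col("prev_timestamp").isNotNull())
             # enumerate every hour from the previous row through the current one
             .withColumn("hour", F.explode(F.expr("sequence(prev_timestamp, timestamp, interval 1 hour)")))
             # keep only the hours strictly between the two existing rows
             .filter((F.col("hour") != F.col("prev_timestamp")) & (F.col("hour") != F.col("timestamp")))
             .select(F.col("hour").alias("timestamp"), "aggregator"))

df.union(missing).orderBy("aggregator", "timestamp").show()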

Sum of single column across rows based on a condition in Spark Dataframe

Consider the following dataframe:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 2 |
| 123| 2017-06-14| 1 |
+-------+-----------+-------+
I need to sum the count column across rows that have the same createdon and rid.
The resultant dataframe should therefore be as follows:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 3 |
+-------+-----------+-------+
I am using Spark 2.0.2.
I have tried agg, conditions inside select, etc., but couldn't find a solution. Can anyone help me?
Try this:
import org.apache.spark.sql.{functions => func}
df.groupBy($"rid", $"createdon").agg(func.sum($"count").alias("count"))
This should do what you want:
import org.apache.spark.sql.functions.sum
df
  .groupBy($"rid", $"createdon")
  .agg(sum($"count").as("count"))
  .show
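For reference, a PySpark equivalent of the same groupBy/sum (a sketch, assuming df holds the dataframe shown above):
import pyspark.sql.functions as F

df.groupBy("rid", "createdon").agg(F.sum("count").alias("count")).show()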