How to plot a graph with data gaps in Zeppelin? - scala

The dataframe was extracted to a temp table in order to plot the data density per time unit (1 day):
val dailySummariesDf =
  getDFFromJdbcSource(
      SparkSession.builder().appName("test").master("local").getOrCreate(),
      s"SELECT * FROM values WHERE time > '2020-06-06' and devicename='Voltage' limit 100000000")
    .persist(StorageLevel.MEMORY_ONLY_SER)
    .groupBy($"digital_twin_id", window($"time", "1 day")).count().as("count")
    .withColumn("windowstart", col("window.start"))
    .withColumn("windowstartlong", unix_timestamp(col("window.start")))
    .orderBy("windowstart")

dailySummariesDf.registerTempTable("bank")
Then I plot it with the %sql interpreter:
%sql
select windowstart, count
from bank
and
%sql
select windowstartlong, count
from bank
What I get is a densely plotted chart: October days appear right after the August ones, with no gap for September, even though there were days with no data at all. My expectation was to see gaps for those days.
How can I force these graphs to display the gaps and respect the real X-axis values?

Indeed, grouping a dataset by a window column won't produce any rows for intervals that contained no original rows.
One way to deal with that is to add a bunch of fake rows ("manually fill in the gaps" in the raw dataset) and only then apply the groupBy/window. For your case, that can be done by creating a trivial one-column dataset containing all the dates within the range you're interested in, and then joining it to your original dataset.
Here is my quick attempt:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Define sample data
val df = Seq(
  ("a", "2021-12-01"),
  ("b", "2021-12-01"),
  ("c", "2021-12-01"),
  ("a", "2021-12-02"),
  ("b", "2021-12-17")
).toDF("c", "d").withColumn("d", to_timestamp($"d"))

// Define a dummy dataframe for the range 2021-12-01 to 2021-12-30
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

val start = DateTime.parse("2021-12-01", DateTimeFormat.forPattern("yyyy-MM-dd")).getMillis / 1000
val end = start + 30 * 24 * 60 * 60
val temp = spark.range(start, end, 24 * 60 * 60).toDF()
  .withColumn("tc", to_timestamp($"id".cast(TimestampType)))
  .drop($"id")

// Fill the gaps in the original dataframe
val nogaps = temp.join(df, temp.col("tc") === df.col("d"), "left")

// Aggregate counts by a tumbling 1-day window
val result = nogaps
  .groupBy(window($"tc", "1 day", "1 day", "5 hours"))
  .agg(sum(when($"c".isNotNull, 1).otherwise(0)).as("count"))

result
  .withColumn("windowstart", to_date(col("window.start")))
  .select("windowstart", "count")
  .orderBy("windowstart")
  .show(false)
+-----------+-----+
|windowstart|count|
+-----------+-----+
|2021-12-01 |3 |
|2021-12-02 |1 |
|2021-12-03 |0 |
|2021-12-04 |0 |
|2021-12-05 |0 |
|2021-12-06 |0 |
|2021-12-07 |0 |
|2021-12-08 |0 |
|2021-12-09 |0 |
|2021-12-10 |0 |
|2021-12-11 |0 |
|2021-12-12 |0 |
|2021-12-13 |0 |
|2021-12-14 |0 |
|2021-12-15 |0 |
|2021-12-16 |0 |
|2021-12-17 |1 |
|2021-12-18 |0 |
|2021-12-19 |0 |
|2021-12-20 |0 |
+-----------+-----+
For illustration purposes only :)
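On Spark 2.4 and later, the calendar dataframe can also be built with the built-in sequence function instead of spark.range and joda-time; here is a rough sketch against the same sample df (the variable names are just illustrative):
// Sketch only, assumes Spark 2.4+: generate one row per calendar day with sequence() + explode,
// then reuse the same left join and conditional count as above.
val calendar = spark.sql(
  "SELECT explode(sequence(to_date('2021-12-01'), to_date('2021-12-30'), interval 1 day)) AS tc")

val result2 = calendar
  .join(df, calendar("tc") === to_date(df("d")), "left")
  .groupBy("tc") // tc is already at a daily grain
  .agg(sum(when(col("c").isNotNull, 1).otherwise(0)).as("count"))
  .orderBy("tc")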

Related

Get last value of previous partition/group in pyspark

I have a dataframe looking like this (just some example values):
| id | timestamp | mode | trip | journey | value |
|----|----------------------------|---------|------|---------|-------|
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.04  |
I want to create new columns which I fill with the previous and next mode. Something like this:
| id | timestamp | mode | trip | journey | value | prev | next |
|----|----------------------------|---------|------|---------|-------|---------|---------|
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  | walking | walking |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.0   | walking | driving |
I have tried to partition by id, trip, journey and mode, ordered by timestamp, and then to use lag() and lead(), but I am not sure these work across partitions. I came across Window.unboundedPreceding and Window.unboundedFollowing; however, I am not sure I completely understand how they work. In my mind, if I partition the data as explained above, I will always just need the last value of mode from the previous partition; to fill next, I could reorder the partition from ascending to descending on the timestamp and do the same. However, I am unsure how to get the last value of the previous partition.
I have tried this:
from pyspark.sql import Window
from pyspark.sql.functions import col, first
w = Window.partitionBy("id", "journey", "trip").orderBy(col("timestamp").asc())
w_prev = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("prev", first("mode").over(w_prev))
Code examples and explanations using PySpark would be much appreciated!
So, based on what I could understand, you could do something like this:
Create a partition based on id and journey (within each journey there are multiple trips), order by trip and then by timestamp, and simply use lead and lag to get the output.
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window().partitionBy('id', 'journey').orderBy('trip', 'timestamp')
df.withColumn('prev', F.lag('mode', 1).over(w)) \
  .withColumn('next', F.lead('mode', 1).over(w)) \
  .show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:19.717000|walking|1 |1 |1.21 |null |walking|
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |walking|driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
EDIT:
Okay, as the OP asked, you can do this to achieve it:
# Used for taking the latest record from same id, trip, journey
w = Window().partitionBy('id', 'trip', 'journey').orderBy(F.col('timestamp').desc())
# Used to calculate prev and next mode
w1 = Window().partitionBy('id', 'journey').orderBy('trip')
# First take only the latest rows for a particular combination of id, trip, journey
# Second, use the filtered rows to get prev and next modes
df2 = df.withColumn('rn', F.row_number().over(w)) \
.filter(F.col('rn') == 1) \
.withColumn('prev', F.lag('mode', 1).over(w1)) \
.withColumn('next', F.lead('mode', 1).over(w1)) \
.drop('rn')
df2.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |null |driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
# Finally, join the calculated DF with the original DF to get prev and next mode
final_df = df.alias('a').join(df2.alias('b'), ['id', 'trip', 'journey'], how='left') \
.select('a.*', 'b.prev', 'b.next')
final_df.show(truncate=False)
Output:
+---+----+-------+--------------------------+-------+-----+-------+-------+
|id |trip|journey|timestamp |mode |value|prev |next |
+---+----+-------+--------------------------+-------+-----+-------+-------+
|1 |1 |1 |2021-09-12 23:59:19.717000|walking|1.21 |null |driving|
|1 |1 |1 |2021-09-12 23:59:38.617000|walking|1.36 |null |driving|
|1 |2 |1 |2021-09-12 23:59:38.617000|driving|1.65 |walking|null |
|2 |4 |6 |2021-09-11 23:52:09.315000|walking|1.04 |null |null |
+---+----+-------+--------------------------+-------+-----+-------+-------+

How can I apply QuantileDiscretizer on a groupBy object in PySpark?

I want to compute Quantiles on a dataframe after grouping it.
This is my sample dataframe:
|id |shop|amount|
|:--|:--:|-----:|
|1  |A   |100   |
|2  |B   |200   |
|3  |A   |125   |
|1  |A   |25    |
|2  |B   |220   |
|3  |A   |110   |
I want to bin the amount into low, medium and high, based on each shop.
So, I would group my dataframe like this:
shop_groups = df.groupBy('shop')
The mistake I made originally was applying the QuantileDiscretizer to the whole amount column as is, without grouping it by shop.
How can I do this on shop_groups?
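One possible direction, sketched in Scala (the PySpark Window and ntile APIs are analogous): QuantileDiscretizer cannot be applied per group directly, but per-shop quantile buckets can be computed with the ntile window function instead. Here df and the low/medium/high labels refer to the sample above.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: 3 quantile buckets per shop via ntile, instead of a single global QuantileDiscretizer.
val byShop = Window.partitionBy("shop").orderBy("amount")
val binned = df
  .withColumn("bucket", ntile(3).over(byShop)) // 1 = lowest third within the shop, 3 = highest
  .withColumn("level",
    when(col("bucket") === 1, "low")
      .when(col("bucket") === 2, "medium")
      .otherwise("high"))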

Spark window function with condition on current row

I am trying to count, for a given order_id, how many orders there were in the past 365 days that had a payment. That part is not the problem: I use a window function.
Where it gets tricky is: I don't want to count orders in this time window whose payment_date is after the order_date of the current order_id.
Currently, I have something like this:
val window: WindowSpec = Window
.partitionBy("customer_id")
.orderBy("order_date")
.rangeBetween(-365*days, -1)
and
df.withColumn("paid_order_count", count("*") over window)
which would count all orders for the customer within the last 365 days before his current order.
How can I now incorporate a condition for the counting that takes the order_date of the current order into account?
Example:
+---------+-----------+-------------+------------+
|order_id |order_date |payment_date |customer_id |
+---------+-----------+-------------+------------+
|1        |2017-01-01 |2017-01-10   |A           |
|2        |2017-02-01 |2017-02-10   |A           |
|3        |2017-02-02 |2017-02-20   |A           |
+---------+-----------+-------------+------------+
The resulting table should look like this:
+---------+-----------+-------------+------------+-----------------+
|order_id |order_date |payment_date |customer_id |paid_order_count |
+---------+-----------+-------------+------------+-----------------+
|1        |2017-01-01 |2017-01-10   |A           |0                |
|2        |2017-02-01 |2017-02-10   |A           |1                |
|3        |2017-02-02 |2017-02-20   |A           |1                |
+---------+-----------+-------------+------------+-----------------+
For order_id = 3 the paid_order_count should not be 2 but 1 as order_id = 2 is paid after order_id = 3 is placed.
I hope that I explained my problem well and look forward to your ideas!
Very good question!
A couple of remarks: using rangeBetween creates a fixed frame that is based on the number of rows in it and not on values, so it will be problematic in two cases:
1. the customer does not have orders every single day, so a 365-row window might contain rows with an order_date well before one year ago
2. if the customer has more than one order per day, it will mess with the one-year coverage
3. a combination of 1 and 2
Also, rangeBetween does not work with Date and Timestamp datatypes.
To solve it, it is possible to use a window function with lists and a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.sparkContext.parallelize(Seq(
(1, "2017-01-01", "2017-01-10", "A")
, (2, "2017-02-01", "2017-02-10", "A")
, (3, "2017-02-02", "2017-02-20", "A")
)
).toDF("order_id", "order_date", "payment_date", "customer_id")
.withColumn("order_date_ts", to_timestamp($"order_date", "yyyy-MM-dd").cast("long"))
.withColumn("payment_date_ts", to_timestamp($"payment_date", "yyyy-MM-dd").cast("long"))
// df.printSchema()
// df.show(false)
val window = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(Window.unboundedPreceding, -1)
val count_filtered_dates = udf( (days: Int, top: Long, array: Seq[Long]) => {
  // Spark timestamps are in seconds; compute the timestamp `days` ago as the lower bound
  val bottom = top - (days * 60 * 60 * 24).toLong
  array.count(v => v >= bottom && v < top)
})
val res = df.withColumn("paid_orders", collect_list("payment_date_ts") over window)
.withColumn("paid_order_count", count_filtered_dates(lit(365), $"order_date_ts", $"paid_orders"))
res.show(false)
Output:
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|order_id|order_date|payment_date|customer_id|order_date_ts|payment_date_ts|paid_orders |paid_order_count|
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|1 |2017-01-01|2017-01-10 |A |1483228800 |1484006400 |[] |0 |
|2 |2017-02-01|2017-02-10 |A |1485907200 |1486684800 |[1484006400] |1 |
|3 |2017-02-02|2017-02-20 |A |1485993600 |1487548800 |[1484006400, 1486684800]|1 |
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
Converting dates to Spark timestamps in seconds makes the lists more memory efficient.
This is the easiest code to implement, but not the most optimal: the lists will take up some memory. A custom UDAF would be better, but requires more coding; I might do that later. This approach will still work if you have thousands of orders per customer.
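On Spark 2.4+ the UDF can also be avoided with the built-in higher-order functions filter and size; a minimal sketch reusing the window and the *_ts columns defined above:
// Sketch only, assumes Spark 2.4+: count the qualifying payments with filter() + size()
// instead of a UDF, reusing the same collect_list window as above.
val resNoUdf = df
  .withColumn("paid_orders", collect_list("payment_date_ts") over window)
  .withColumn("paid_order_count",
    expr("size(filter(paid_orders, p -> p >= order_date_ts - 365L * 24 * 60 * 60 AND p < order_date_ts))"))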

Spark Dataframe - Push a particular Row to the last in a Dataframe

I have been trying to push a particular row in a Spark dataframe to the end of the dataframe.
This is what I have tried so far.
Input Dataframe:
+-------------+-------+------------+
|expected_date|count |Downstream |
+-------------+-------+------------+
|2018-08-26 |1 |abc |
|2018-08-26 |6 |Grand Total |
|2018-08-26 |3 |xyy |
|2018-08-26 |2 |xxx |
+-------------+-------+------------+
Code:
df.withColumn("Downstream_Hierarchy", when(col("Downstream") === "Grand Total", 2)
.otherwise(1))
.orderBy(col("Downstream_Hierarchy").asc)
.drop("Downstream_Hierarchy")
Output Dataframe:
+-------------+-------+------------+
|expected_date|count |Downstream |
+-------------+-------+------------+
|2018-08-26 |1 |abc |
|2018-08-26 |3 |xyy |
|2018-08-26 |2 |xxx |
|2018-08-26 |6 |Grand Total |
+-------------+-------+------------+
Is there a simpler way to do this?
Going through your comments: since the end result is needed in HDFS, you can write it as CSV to HDFS twice (a sketch of the write step is at the end of this answer):
1. First write the dataframe to HDFS without the "Grand Total" row.
2. Then write the "Grand Total" row alone with save mode "append".
DataFrame except the required row:
val df1 = df.filter(col("Downstream") =!= "Grand Total")
DataFrame with the required row:
val df2 = df.filter(col("Downstream") === "Grand Total")
Required DataFrame:
val df_final = df1.union(df2)
This might not be the best solution, but it avoids the expensive orderBy operation.
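A minimal sketch of the two-write variant described at the top of this answer; the HDFS path is only illustrative:
// Sketch only: write everything except "Grand Total" first, then append that single row.
df1.write.mode("overwrite").csv("hdfs:///tmp/report")
df2.write.mode("append").csv("hdfs:///tmp/report")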
You can try the straightforward steps below.
val lastRowDf = df.filter("Downstream='Grand Total'")
val remainDf = df.filter("Downstream !='Grand Total'")
remainDf.unionAll(lastRowDf).show

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

I am trying to populate the grouper column as below. In the table below, X signifies the start of a new record, so each X, Y, Z needs to be grouped. In MySQL, I would accomplish this like:
select @x:=1;
update table set grouper=if(column_1='X',@x:=@x+1,@x);
I am trying to see if there is a way to do this without using a loop, using .withColumn or something similar.
What I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
A simple window function with the row_number() built-in function should get you your desired output:
val df = Seq(
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z"),
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the last orderBy is only there to match the expected output and is just for visualization; in a real cluster and real processing, an orderBy like that doesn't make sense.
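As a side note, if the real data has an explicit ordering column (the toy sample above does not), the grouper can also be computed as a running sum of the "X" markers, which mirrors the MySQL counter more directly. A sketch, assuming a hypothetical ordering column order_col on an input dataframe dfWithOrder:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: order_col and dfWithOrder are hypothetical; without a partitionBy,
// Spark will warn that this window runs in a single partition.
val runningSpec = Window
  .orderBy("order_col")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val grouped = dfWithOrder.withColumn(
  "grouper",
  sum(when(col("column_1") === "X", 1).otherwise(0)).over(runningSpec))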