How can I apply QuantileDiscretizer on a groupBy object in PySpark? - pyspark

I want to compute Quantiles on a dataframe after grouping it.
This is my sample dataframe:
|id |shop|amount|
|:--|:--:|---:|
|1 |A. |100|
|2 |B. |200|
|3. |A. |125|
|1. |A |25 |
|2 |B |220|
|3. |A. |110|
I want to bin the amount into low, medium and high, based on each shop.
So, I would group my dataframe like this.
shop_groups= df.groupBy('shop')
The mistake I did originally, was I applied the QuantileDiscretizer on the whole amount set as is, without grouping it by shop.
How can I do this on the shop_groups ?

Related

How to get the two nearest values in spark scala DataFrame

Hi EveryOne I'm new in Spark scala. I want to find the nearest values by partition using spark scala. My input is something like this:
first row for example: value 1 is between 2 and 7 in the value2 columns
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |1 |
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |2 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
|3 |5 |7 |
|3 |5 |8 |
My output should like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
Can someone guide me how to resolve this please.
Instead of providing a code answer as you appear to want to learn I've provided you pseudo code and references to allow you to find the answers for yourself.
Group the elements (select id, value1) (aggregate on value2
with collect_list) so you can collect all the value2 into an
array.
select (id, and (add(concat) value1 to the collect_list array)) Sorting the array .
find( array_position ) value1 in the array.
splice the array. retrieving value before and value after
the result of (array_position)
If the array is less than 3 elements do error handling
now the last value in the array and the first value in the array are your 'closest numbers'.
You will need window functions for this.
val window = Window
.partitionBy("id", "value1")
.orderBy(asc("value2"))
val result = df
.withColumn("prev", lag("value2").over(window))
.withColumn("next", lead("value2").over(window))
.withColumn("dist_prev", col("value2").minus(col("prev")))
.withColumn("dist_next", col("next").minus(col("value2")))
.withColumn("min", min(col("dist_prev")).over(window))
.filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
.drop("prev", "next", "dist_prev", "dist_next", "min")
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group, and add it as min column (note, that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.

How to plot a graph with data gaps in Zeppelin?

Dataframe was extracted to a temp table to plot the data density per time unit (1 day):
val dailySummariesDf =
getDFFromJdbcSource(SparkSession.builder().appName("test").master("local").getOrCreate(), s"SELECT * FROM values WHERE time > '2020-06-06' and devicename='Voltage' limit 100000000")
.persist(StorageLevel.MEMORY_ONLY_SER)
.groupBy($"digital_twin_id", window($"time", "1 day")).count().as("count")
.withColumn("windowstart", col("window.start"))
.withColumn("windowstartlong", unix_timestamp(col("window.start")))
.orderBy("windowstart")
dailySummariesDf.
registerTempTable("bank")
Then I plot it with %sql processor
%sql
select windowstart, count
from bank
and
%sql
select windowstartlong, count
from bank
What I get is shown below:
So, my expectation is to have gaps in this graph, as there were days with no data at all. But instead I see it being plotted densely, with October days plotted right after August, not showing a gap for September.
How can I force those graphs to display gaps and regard the real X axis values?
Indeed, grouping a dataset by window column won't produce any rows for the intervals that did not contain any original rows within those intervals.
One way to deal with that I can think of, is to add a bunch of fake rows ("manually fill in the gaps" in raw dataset), and only then apply a groupBy/window. For your case, that can be done by creating a trivial one-column dataset containing all the dates within a range you're interested in, and then joining it to your original dataset.
Here is my quick attempt:
import spark.implicits._
import org.apache.spark.sql.types._
// Define sample data
val df = Seq(("a","2021-12-01"),
("b","2021-12-01"),
("c","2021-12-01"),
("a","2021-12-02"),
("b","2021-12-17")
).toDF("c","d").withColumn("d",to_timestamp($"d"))
// Define a dummy dataframe for the range 12/01/2021 - 12/30/2021
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
val start = DateTime.parse("2021-12-01",DateTimeFormat.forPattern("yyyy-MM-dd")).getMillis/1000
val end = start + 30*24*60*60
val temp = spark.range(start,end,24*60*60).toDF().withColumn("tc",to_timestamp($"id".cast(TimestampType))).drop($"id")
// Fill the gaps in original dataframe
val nogaps = temp.join(df, temp.col("tc") === df.col("d"), "left")
// Aggregate counts by a tumbling 1-day window
val result = nogaps.groupBy(window($"tc","1 day","1 day","5 hours")).agg(sum(when($"c".isNotNull,1).otherwise(0)).as("count"))
result.withColumn("windowstart",to_date(col("window.start"))).select("windowstart","count").orderBy("windowstart").show(false)
+-----------+-----+
|windowstart|count|
+-----------+-----+
|2021-12-01 |3 |
|2021-12-02 |1 |
|2021-12-03 |0 |
|2021-12-04 |0 |
|2021-12-05 |0 |
|2021-12-06 |0 |
|2021-12-07 |0 |
|2021-12-08 |0 |
|2021-12-09 |0 |
|2021-12-10 |0 |
|2021-12-11 |0 |
|2021-12-12 |0 |
|2021-12-13 |0 |
|2021-12-14 |0 |
|2021-12-15 |0 |
|2021-12-16 |0 |
|2021-12-17 |1 |
|2021-12-18 |0 |
|2021-12-19 |0 |
|2021-12-20 |0 |
+-----------+-----+
For illustration purposes only :)

How to set a dynamic where clause using pyspark

I have a dataset within which there are multiple groups. I have a rank column which incrementally counts counts each entry per group. An example of this structure is shown below:
+-----------+---------+---------+
| equipment| run_id|run_order|
+-----------+---------+---------+
|1 |430032589| 1|
|1 |430332632| 2|
|1 |430563033| 3|
|1 |430785715| 4|
|1 |431368577| 5|
|1 |431672148| 6|
|2 |435497596| 1|
|1 |435522469| 7|
Each group (equipment) has a different amount of runs. Shown above equipment 1 has 7 runs whilst equipment 2 has 1 run. I would like to select the first and last n runs per equipment. To select the first n runs is straightforward:
df.select("equipment", "run_id").distinct().where(df.run_order <= n).orderBy("equipment").show()
The distinct is in the query because each row is equivalent to a timestep and therefore each row will log sensor readings associated with that timestep. Therefore there will be many rows with the same equipment, run_id and run_order, which should be preserved in the end result and not aggregated.
As the number of runs is unique to each equipment I can't do an equivalent select query with a where clause (I think) to get the last n runs:
df.select("equipment", "run_id").distinct().where(df.rank >= total_runs - n).orderBy("equipment").show()
I can run a groupBy to get the highest run_order for each equipment
+-----------+----------------+
| equipment| max(run_order) |
+-----------+----------------+
|1 | 7|
|2 | 1|
But I am unsure if there is a way I can construct a dynamic where clause that works like this. So that I get the last n runs (including all timestep data for each run).
You can add a column of the max rank for each equipment and do a filter based on that column:
from pyspark.sql import functions as F, Window
n = 3
df2 = df.withColumn(
'max_run',
F.max('run_order').over(Window.partitionBy('equipment'))
).where(F.col('run_order') >= F.col('max_run') - n)

Join with uneven columns

I have two dataframes structured the following way:
|Source|#Users|#Clicks|Hour|Type
and
Type|Total # Users|Hour
I'd like to join these columns based on hour however the first dataframe is at a deeper granularity in the second and therefore has more rows. Basically I want a dataframe where I have
|Source|#Users|#Clicks|Hour|Type|Total # Users
where the total # users is from the second dataframe. Any suggestions? I think I maybe want to use map?
Edit:
Here's an example
DF1
|Source|#Users|#Clicks|Hour|Type
|Prod1 |50 |3 |01 |Internet
|Prod2 |10 |2 |07 |iOS
|Prod3 |1 |50 |07 |Internet
|Prod2 |3 |2 |07 |Internet
|Prod3 |8 |2 |05 |Internet
DF2
|Type |Total #Users|Hour
|Internet|100 |01
|iOS |500 |01
|Internet|300 |07
|Internet|15 |05
|iOS |20 |07
Result
|Source|#Users|#Clicks|Hour|Type |Total #Users
|Prod1 |50 |3 |01 |Internet|100
|Prod2 |10 |2 |07 |iOS |20
|Prod3 |1 |50 |07 |Internet|300
|Prod2 |3 |2 |07 |Internet|300
|Prod3 |8 |2 |05 |Internet|15
That's a left join you're trying to do :
df1.join(df2, (df1.Hour === df2.Hour) & (df1.Type === df2.Type), "left_outer")
Short version : a left join keep all the rows from df1 and join on condition with matching rows of df2 if there is a match (null if not, duplicate if multiple matches).
More info on Pyspark join
More info on SQL Joins types

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

Trying to populate the grouper column like below. In the table below, X signifies the start of a new record. So, Each X,Y,Z needs to be grouped. In MySQL, I would accomplish like:
select #x:=1;
update table set grouper=if(column_1='X',#x:=#x+1,#x);
I am trying to see if there is a way to do this without using a loop using . With column or something similar.
what I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
Simple window function and row_number() inbuilt function should get you your desired output
val df = Seq(
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z"),
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: The last orderBy is just to match the expected output and just for visualization. In real cluster and processing orderBy like that doesn't make sense