Time series with Scala and Spark. Rolling window - scala

I'm trying to work on the following exercise using Scala and Spark.
Given a file containing two columns: a time in seconds and a value.
Example:
|---------|-------|
| seconds | value |
|---------|-------|
| 225     | 1,5   |
| 245     | 0,5   |
| 300     | 2,4   |
| 319     | 1,2   |
| 320     | 4,6   |
|---------|-------|
and given a value V to be used as the size of the rolling window, this output should be created:
Example with V=20:
|---------|-------|-------------------|-----------------------|
| seconds | value | num_row_in_window | sum_values_in_windows |
|---------|-------|-------------------|-----------------------|
| 225     | 1,5   | 1                 | 1,5                   |
| 245     | 0,5   | 2                 | 2                     |
| 300     | 2,4   | 1                 | 2,4                   |
| 319     | 1,2   | 2                 | 3,6                   |
| 320     | 4,6   | 3                 | 8,2                   |
|---------|-------|-------------------|-----------------------|
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying with the sliding function and with the SQL API, but it's unclear to me which is the best way to tackle this problem, considering that I'm a Spark/Scala novice.

This is a perfect application for window functions. By using rangeBetween you can set your sliding window to 20 seconds. Note that in the example below no partitioning is specified (no partitionBy), so all rows end up in a single partition; without a partitioning column this code will not scale:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import ss.implicits._ // ss is the SparkSession

val df = Seq(
  (225, 1.5),
  (245, 0.5),
  (300, 2.4),
  (319, 1.2),
  (320, 4.6)
).toDF("seconds", "value")

// Range window over `seconds`: the current row plus everything up to 20 seconds back
val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here

df
  .withColumn("num_row_in_window", sum(lit(1)).over(window))
  .withColumn("sum_values_in_window", sum($"value").over(window))
  .show()
+-------+-----+-----------------+--------------------+
|seconds|value|num_row_in_window|sum_values_in_window|
+-------+-----+-----------------+--------------------+
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
+-------+-----+-----------------+--------------------+
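If the data has a natural grouping key, the same window can be partitioned so the work is distributed across executors. A sketch, assuming a hypothetical sensor_id column that is not part of the original data:

// Sketch only: `sensor_id` is a hypothetical grouping column.
// With partitionBy, each sensor's rows are windowed independently and in parallel.
val partitionedWindow = Window
  .partitionBy($"sensor_id")
  .orderBy($"seconds")
  .rangeBetween(-20L, 0L)

df
  .withColumn("num_row_in_window", sum(lit(1)).over(partitionedWindow))
  .withColumn("sum_values_in_window", sum($"value").over(partitionedWindow))
  .show()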

Related

PySpark Relate multiple rows

I have a problem where I need to relate rows to each other. I have tried many things (partitioning, lag, group bys) but I am now completely stuck.
The rows below an ID 26 row will relate to the MPAN of that 26 row.
ID | MPAN     | Value
----------------------
26 | 12345678 | Hello
27 | 99900234 | Bye
30 | 77563820 | Help
33 | 89898937 | Stuck
26 | 54877273 | Need a genius
29 | 54645643 | So close
30 | 22222222 | Thanks
e.g.
ID | MPAN     | Value    | Relation
------------------------------------
26 | 12345678 | Hello    | NULL
27 | 99900234 | Bye      | 12345678
30 | 77563820 | Help     | 12345678
33 | 89898937 | Stuck    | 12345678
26 | 54877273 | Genius   | NULL
29 | 54645643 | So close | 54877273
30 | 22222222 | Thanks   | 54877273
The code below only works against the previous row, not a lag back to the last ID 26 record:
from pyspark.sql import functions as F, Window

df = spark.read.load('abfss://Files/', format='parquet')
df = df.withColumn("identity", F.monotonically_increasing_id())
win = Window.orderBy("identity")
condition = F.col("Prop_0") != '026'
df = df.withColumn("FlagY", F.when(condition, mpanlookup))  # mpanlookup is defined elsewhere
df.show()
As I said in my comment, you need a column to maintain the order. In your example, you used monotonically_increasing_id to create that "ordering" column, but that is not a reliable way to do it, because as the documentation notes:
The function is non-deterministic because its result depends on partition IDs.
Assuming you have a proper "ordering" column:
df.show()
+---+---+--------+-------------+
|idx| ID| MPAN| Value|
+---+---+--------+-------------+
| 1| 26|12345678|Hello |
| 2| 27|99900234|Bye |
| 3| 30|77563820|Help |
| 4| 33|89898937|Stuck |
| 5| 26|54877273|Need a genius|
| 6| 29|54645643|So close |
| 7| 30|22222222|Thanks |
+---+---+--------+-------------+
you can simply do that with the last function:
from pyspark.sql import functions as F, Window

# Carry forward the MPAN of the most recent ID == 26 row, ignoring nulls in between.
df.withColumn(
    "Relation",
    F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
        Window.orderBy("idx")
    ),
).show()
+---+---+--------+-------------+--------+
|idx| ID| MPAN| Value|Relation|
+---+---+--------+-------------+--------+
| 1| 26|12345678|Hello |12345678|
| 2| 27|99900234|Bye |12345678|
| 3| 30|77563820|Help |12345678|
| 4| 33|89898937|Stuck |12345678|
| 5| 26|54877273|Need a genius|54877273|
| 6| 29|54645643|So close |54877273|
| 7| 30|22222222|Thanks |54877273|
+---+---+--------+-------------+--------+
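For reference, the same "running last non-null value" trick translates directly to the Scala API. A sketch, assuming the same idx ordering column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

// Tag the MPAN only on ID == 26 rows, then carry the latest non-null tag forward.
df.withColumn(
  "Relation",
  last(when(col("ID") === 26, col("MPAN")), ignoreNulls = true)
    .over(Window.orderBy("idx"))
).show()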

Create a DataFrame based on ranges of another DataFrame

I have a Spark DataFrame containing ranges of numbers (a start column and an end column), and a column containing the type of each range.
I want to create a new DataFrame with two columns: the first one lists every number of each range (incremented by one), and the second one lists the range's type.
To explain more, this is the input DataFrame:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result :
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?
Try this.
import spark.implicits._
import org.apache.spark.sql.functions.{explode, sequence}

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
The solution provided by #Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions.{explode, udf}

// Mimic the built-in sequence(start, end) for Spark < 2.4
val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)

Mean with different columns ignoring Null values, Spark Scala

I have a dataframe with several columns, and what I am trying to do is compute the mean of these columns, ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions._

// explode puts one value per row; mean() then ignores the nulls automatically
val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
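If you would rather avoid the explode + join and stay close to your original foldLeft idea, a row-wise sketch: coalesce nulls to 0 for the numerator and divide by the number of non-null columns:

import org.apache.spark.sql.functions._

val aCols = Seq(col("Power"), col("Vision"), col("KXD"))

// Nulls count as 0 in the numerator and are excluded from the denominator.
val sumExpr   = aCols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val countExpr = aCols.map(c => when(c.isNotNull, 1).otherwise(0)).reduce(_ + _)

df.withColumn("MEAN", sumExpr / countExpr).show()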

How can I add one column to other columns in PySpark?

I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns, but it takes respectively 30s, 50s, 80s... (keeps increasing) per column, so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many others?
By using the rdd API, it can be done in this way (the last column, Average, is subtracted from every other column):
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
    .toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+

How do I transform a Spark dataframe so that my values become column names? [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:
+------+------+-------+
| Name | Type | Count |
+------+------+-------+
| a | 0 | 5 |
| a | 1 | 4 |
| a | 5 | 5 |
| a | 4 | 5 |
| a | 2 | 1 |
| b | 0 | 2 |
| b | 1 | 4 |
| b | 3 | 5 |
| b | 4 | 5 |
| b | 2 | 1 |
| c | 0 | 5 |
| c | ... | ... |
+------+------+-------+
I want to get a new dataframe structured like this where the Type column values have become new columns:
+------+---+-----+---+---+---+---+
| Name | 0 | 1 | 2 | 3 | 4 | 5 | <- Number columns are types from input
+------+---+-----+---+---+---+---+
| a | 5 | 4 | 1 | 0 | 5 | 5 |
| b | 2 | 4 | 1 | 5 | 5 | 0 |
| c | 5 | ... | | | | |
+------+---+-----+---+---+---+---+
The columns here are [Name,0,1,2,3,4,5].
You can do this by using the pivot function in Spark:
val df2 = df.groupBy("Name").pivot("Type").sum("Count")
Here, if the name and the type are the same for two rows, the count values are simply added together, but other aggregations are possible as well.
Resulting dataframe when using the example data in the question:
+----+---+----+----+----+----+----+
|Name| 0| 1| 2| 3| 4| 5|
+----+---+----+----+----+----+----+
| c| 5|null|null|null|null|null|
| b| 2| 4| 1| 5| 5|null|
| a| 5| 4| 1|null| 5| 5|
+----+---+----+----+----+----+----+
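To match the desired output exactly (zeros instead of nulls), and to skip the extra job Spark runs to discover the distinct Type values, you can pass the pivot values explicitly and fill the gaps. A sketch, assuming the types are always 0 through 5:

// Sketch only: assumes Type is numeric with known values 0..5
// (use Seq("0", "1", ...) instead if Type is a string column).
val df2 = df
  .groupBy("Name")
  .pivot("Type", Seq(0, 1, 2, 3, 4, 5)) // avoids the distinct-values scan
  .sum("Count")
  .na.fill(0) // missing Name/Type combinations become 0, as in the desired output

df2.show()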