Create a DataFrame based on ranges of another DataFrame - Scala

I have a Spark DataFrame containing ranges of numbers (a start column and an end column) and a column containing the type of each range.
I want to create a new DataFrame with two columns: the first lists every number in each range (incremented by one), and the second lists the range's type.
To explain further, this is the input DataFrame:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result:
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?

Try this (it requires Spark 2.4+ for the sequence function):
import org.apache.spark.sql.functions.{explode, sequence}
import spark.implicits._

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
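As a usage note, sequence also accepts an explicit step column (Spark 2.4+). The call below is equivalent to the default step of 1 and could be adapted if the ranges should be expanded more coarsely:
import org.apache.spark.sql.functions.{explode, lit, sequence}

// Same result as above; the third argument just makes the step explicit.
df.withColumn("nbr", explode(sequence($"start", $"end", lit(1))))
  .drop("start", "end")
  .show(false)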

The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions.{explode, udf}

val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)

Related

Check if a value is between two columns, Spark Scala

I have two dataframes: one with my data and another to compare against. What I want to do is check whether a value falls within the range defined by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want inequality in one of the comparisons:
val result = df_player.join(
  df_check,
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")

result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
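As an aside, Df_Check is a small lookup table here, so hinting a broadcast join keeps the range join map-side and avoids a shuffle. A minimal sketch under that assumption, reusing the same dataframe names:
import org.apache.spark.sql.functions.broadcast

// Broadcast the small range table; the join condition itself is unchanged.
val resultBroadcast = df_player.join(
  broadcast(df_check),
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")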

Use different dataframes to create a new one with information (Scala Spark)

I have one dataframe with games and three ratings for each game from different reviewers; every rating is translated to a numeric equivalence in another dataframe, as you can see:
Df_reviews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+--------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_rev1
+----------+-------------+
| review_1 | Equivalence |
+----------+-------------+
|XX+ | 9 |
|Y | 6 |
|Z- | 3 |
Df_rev2
+----------+-------------+
| review_2 | Equivalence |
+----------+-------------+
|K2 | 7 |
|K1+ | 6 |
|K3 | 10 |
Df_rev3
+----------+-------------+
| review_3 | Equivalence |
+----------+-------------+
|L3 | 10 |
|L2 | 9 |
|L1 | 8 |
I have to translate it into a new dataframe with the ratings converted to their equivalences, and add a column with the second-best rating. For this example the result would be:
Df_output
+--------+---------+---------+----------+-------------+
|Game | rev_1_t | rev_2_t | rev_3_t | second_best |
+--------+---------+---------+----------+-------------+
|CA | 9 | 7 | 8 | 8 |
|FT | 3 | 6 | 10 | 6 |
To translate it, I am trying a left join, but I am quite lost. How can I deal with this?
####### Second Part ######
How can I translate several columns of one dataframe using a single lookup column from another dataframe, i.e. joining multiple columns against one? For example:
Df_reviews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+--------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_equiv
+--------+-------+
|Valorat | num |
+--------+-------+
|X |3 |
|XX+ |5 |
|Z |7 |
|Z- |6 |
|K1+ |6 |
|K2 |4 |
|L1 |5 |
|L2 |6 |
|L3 |7 |
Output
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+--------+-------+-------+--------+
|CA |5 | 4 | 5 |
|FT |6 | 6 | 7 |
I am doing this as you can see:
val joined = df_reviews
  .join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
    && df_reviews("rev_3") === df_equiv("num"), "left")
  .select(df_reviews("Game"),
    df_equiv("num").as("rev_1_t"),
    df_equiv("num").as("rev_2_t"),
    df_equiv("num").as("rev_3_t")
  )
Thanks in advance!
You can do some left joins and then take the second-highest value using sort_array:
import org.apache.spark.sql.functions._

val joined = df_reviews
  .join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
  .join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
  .join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
  .select(df_reviews("Game"),
    df_rev1("Equivalence").as("rev_1_t"),
    df_rev2("Equivalence").as("rev_2_t"),
    df_rev3("Equivalence").as("rev_3_t")
  )
// Sort the three ratings in descending order; element (1) is then the second best.
// sort_array(..., asc = false) places nulls at the end, and coalesce falls back to
// greatest when fewer than two ratings are present.
val result = joined.withColumn(
  "second_best",
  coalesce(
    sort_array(
      array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int")),
      asc = false
    )(1),
    greatest(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
  )
)

result.show
+----+-------+-------+-------+-----------+
|Game|rev_1_t|rev_2_t|rev_3_t|second_best|
+----+-------+-------+-------+-----------+
| CA| 9| 7| 8| 8|
| FT| 3| 6| 10| 6|
+----+-------+-------+-------+-----------+
For your second question:
val joined = df_reviews.as("r1")
  .join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
  .selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
  .as("r2")
  .join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
  .selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
  .as("r3")
  .join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
  .selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")
joined.show
+----+-----+-----+-----+
|Game|rev_1|rev_2|rev_3|
+----+-----+-----+-----+
| CA| 5| 4| 5|
| FT| 6| 6| 7|
+----+-----+-----+-----+
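The same translation can also be written once and folded over the review columns, which avoids repeating the join three times. A minimal sketch, assuming the df_reviews and df_equiv names from the question:
val revCols = Seq("rev_1", "rev_2", "rev_3")

val translated = revCols.foldLeft(df_reviews) { (acc, c) =>
  // Rename the lookup columns so the join key matches the current review column.
  val equiv = df_equiv.withColumnRenamed("Valorat", c).withColumnRenamed("num", s"${c}_t")
  acc.join(equiv, Seq(c), "left")
    .drop(c)
    .withColumnRenamed(s"${c}_t", c)
}

translated.select("Game", "rev_1", "rev_2", "rev_3").show()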

Mean across different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and what I am trying to do is compute the mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output have to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values in the MEAN column:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions._

val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
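If you would rather avoid the join, a join-free alternative is to sum the non-null values and divide by the count of non-null columns. A minimal sketch, assuming the same df and column names as above:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

val cols = Seq("Power", "Vision", "KXD").map(col)
// Treat nulls as 0 in the sum, and count only the columns that are not null.
val sumNonNull   = cols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val countNonNull = cols.map(c => when(c.isNotNull, 1).otherwise(0)).reduce(_ + _)

df.withColumn("MEAN", sumNonNull / countNonNull).show()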

Time series with Scala and Spark: rolling window

I'm trying to work through the following exercise using Scala and Spark.
Given a file containing two columns: a time in seconds and a value
Example:
|---------------------|------------------|
| seconds | value |
|---------------------|------------------|
| 225 | 1,5 |
| 245 | 0,5 |
| 300 | 2,4 |
| 319 | 1,2 |
| 320 | 4,6 |
|---------------------|------------------|
and given a value V to be used as the rolling window size, this output should be created:
Example with V=20
|--------------|---------|--------------------|----------------------|
| seconds | value | num_row_in_window |sum_values_in_windows |
|--------------|---------|--------------------|----------------------|
| 225 | 1,5 | 1 | 1,5 |
| 245 | 0,5 | 2 | 2 |
| 300 | 2,4 | 1 | 2,4 |
| 319 | 1,2 | 2 | 3,6 |
| 320 | 4,6 | 3 | 8,2 |
|--------------|---------|--------------------|----------------------|
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying the sliding function and the SQL API, but it's a bit unclear to me which is the best way to tackle this problem, considering that I'm a Spark/Scala novice.
This is a perfect application for window functions. By using rangeBetween you can set your sliding window to 20 seconds. Note that in the example below no partitioning is specified (no partitionBy); without partitioning, this code will not scale:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._

val df = Seq(
  (225, 1.5),
  (245, 0.5),
  (300, 2.4),
  (319, 1.2),
  (320, 4.6)
).toDF("seconds", "value")

val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here

df
  .withColumn("num_row_in_window", sum(lit(1)).over(window))
  .withColumn("sum_values_in_window", sum($"value").over(window))
  .show()
+-------+-----+-----------------+--------------------+
|seconds|value|num_row_in_window|sum_values_in_window|
+-------+-----+-----------------+--------------------+
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
+-------+-----+-----------------+--------------------+
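To make this scale, give the window a partitioning key. A minimal sketch, assuming a hypothetical grouping column sensor_id:
// Hypothetical "sensor_id" column: each group gets its own 20-second rolling window.
val partitionedWindow = Window
  .partitionBy($"sensor_id")
  .orderBy($"seconds")
  .rangeBetween(-20L, 0L)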

How to transpose a sub-matrix of a Spark RDD iteratively?

For example,
From:
+-----+-----+
|Date |val_1|
+-----+-----+
| 1-1 | 1.1|
| 1-2 | 1.2|
| 1-3 | 1.3|
| 1-4 | 1.4|
| 1-5 | 1.5|
| 1-6 | 1.6|
| 1-7 | 1.7|
| 1-8 | 1.8|
| 1-9 | 1.9|
| ...| ...|
To:
+-----+-----+-----+-------+
| Date | D-3 | D-2 | D-1 |
+-----+-----+-----+-------+
| 1-4 | 1.1 | 1.2 | 1.3 |
| 1-5 | 1.2 | 1.3 | 1.4 |
| 1-6 | 1.3 | 1.4 | 1.5 |
| 1-7 | 1.4 | 1.5 | 1.6 |
| 1-8 | 1.5 | 1.6 | 1.7 |
| 1-9 | 1.6 | 1.7 | 1.8 |
| ... | ... | ... | ... |
Thanks a lot in advance.
Your question is not entirely clear, in particular with respect to the iterative solution you are after. However, for the example data provided:
df = sc.parallelize([('1-1', 1.1), ('1-2', 1.2), ('1-3', 1.3), ('1-4', 1.4), ('1-5', 1.5), ('1-6', 1.6),('1-7', 1.7),('1-8', 1.8),('1-9', 1.9)]).toDF(["Date", "val_1"])
You can use lag in combination with a Window to retrieve D-3, D-2 and D-1
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy(col("Date"))
dfl = df.select("Date",
                lag("val_1", 3).over(w).alias("D-3"),
                lag("val_1", 2).over(w).alias("D-2"),
                lag("val_1", 1).over(w).alias("D-1")).na.drop()
dfl.show()
This results in the following output:
+----+---+---+---+
|Date|D-3|D-2|D-1|
+----+---+---+---+
| 1-4|1.1|1.2|1.3|
| 1-5|1.2|1.3|1.4|
| 1-6|1.3|1.4|1.5|
| 1-7|1.4|1.5|1.6|
| 1-8|1.5|1.6|1.7|
| 1-9|1.6|1.7|1.8|
+----+---+---+---+
Thanks to Jaco for the inspiration.
Here is the Scala version:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import spark.implicits._

val df = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5),
  ("1-6", 1.6), ("1-7", 1.7), ("1-8", 1.8), ("1-9", 1.9))).toDF("Date", "val_1")

val w = Window.partitionBy().orderBy("Date")

// Use the default null for missing lags so na.drop() removes the first three rows.
val res = df
  .withColumn("D-3", lag("val_1", 3).over(w))
  .withColumn("D-2", lag("val_1", 2).over(w))
  .withColumn("D-1", lag("val_1", 1).over(w))
  .na.drop()
Result:
+----+-----+---+---+---+
|Date|val_1|D-3|D-2|D-1|
+----+-----+---+---+---+
| 1-4| 1.4|1.1|1.2|1.3|
| 1-5| 1.5|1.2|1.3|1.4|
| 1-6| 1.6|1.3|1.4|1.5|
| 1-7| 1.7|1.4|1.5|1.6|
| 1-8| 1.8|1.5|1.6|1.7|
| 1-9| 1.9|1.6|1.7|1.8|
+----+-----+---+---+---+
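Since the question mentions RDDs, another option is the sliding helper shipped with spark-mllib, which builds fixed-size windows directly on an RDD. A minimal sketch (it assumes spark-mllib is on the classpath and that the RDD is already ordered by date):
import org.apache.spark.mllib.rdd.RDDFunctions._

// Each window of 4 consecutive rows yields one output row: (Date, D-3, D-2, D-1).
val rows = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5),
  ("1-6", 1.6), ("1-7", 1.7), ("1-8", 1.8), ("1-9", 1.9)))

val lagged = rows.sliding(4).map { w =>
  (w(3)._1, w(0)._2, w(1)._2, w(2)._2)
}

lagged.toDF("Date", "D-3", "D-2", "D-1").show()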