Create a DataFrame based on ranges of another DataFrame - Scala

I have a Spark DataFrame containing ranges of numbers (a start column and an end column) and a column containing the type of each range.
I want to create a new DataFrame with two columns: the first lists every number in each range (incremented by one), and the second lists the range's type.
To explain further, this is the input DataFrame:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result:
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?

Try this (it requires Spark 2.4+ for the sequence function):
import org.apache.spark.sql.functions.{explode, sequence}
import spark.implicits._

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
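As a usage note, sequence also accepts an explicit step column (Spark 2.4+). The call below is equivalent to the default step of 1 and could be adapted if the ranges should be expanded more coarsely:
import org.apache.spark.sql.functions.{explode, lit, sequence}

// Same result as above; the third argument just makes the step explicit.
df.withColumn("nbr", explode(sequence($"start", $"end", lit(1))))
  .drop("start", "end")
  .show(false)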

The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions.{explode, udf}

val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)

Related

Check if a value is between two columns, Spark Scala

I have two dataframes: one with my data and another to compare against. What I want to do is check whether a value falls within the range defined by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want inequality in one of the comparisons:
val result = df_player.join(
  df_check,
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")

result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
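As an aside, Df_Check is a small lookup table here, so hinting a broadcast join keeps the range join map-side and avoids a shuffle. A minimal sketch under that assumption, reusing the same dataframe names:
import org.apache.spark.sql.functions.broadcast

// Broadcast the small range table; the join condition itself is unchanged.
val resultBroadcast = df_player.join(
  broadcast(df_check),
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")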

Use different dataframes to create a new one with information (Scala Spark)

I have one dataframe with games and three ratings for each game from different reviewers; every rating is translated to a numeric equivalence in another dataframe, as you can see:
Df_reviews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+--------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_rev1
+----------+-------------+
| review_1 | Equivalence |
+----------+-------------+
|XX+ | 9 |
|Y | 6 |
|Z- | 3 |
Df_rev2
+----------+-------------+
| review_2 | Equivalence |
+----------+-------------+
|K2 | 7 |
|K1+ | 6 |
|K3 | 10 |
Df_rev3
+----------+-------------+
| review_3 | Equivalence |
+----------+-------------+
|L3 | 10 |
|L2 | 9 |
|L1 | 8 |
I have to translate it into a new dataframe with the ratings converted to their equivalences, and add a column with the second-best rating. For this example the result would be:
Df_output
+--------+---------+---------+----------+-------------+
|Game | rev_1_t | rev_2_t | rev_3_t | second_best |
+--------+---------+---------+----------+-------------+
|CA | 9 | 7 | 8 | 8 |
|FT | 3 | 6 | 10 | 6 |
To translate it, I am trying a left join, but I am quite lost. How can I deal with this?
####### Second Part ######
How can I translate several columns of one dataframe using a single lookup column from another dataframe, i.e. joining multiple columns against one? For example:
Df_reviews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+--------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_equiv
+--------+-------+
|Valorat | num |
+--------+-------+
|X |3 |
|XX+ |5 |
|Z |7 |
|Z- |6 |
|K1+ |6 |
|K2 |4 |
|L1 |5 |
|L2 |6 |
|L3 |7 |
Output
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+--------+-------+-------+--------+
|CA |5 | 4 | 5 |
|FT |6 | 6 | 7 |
I am doing this as you can see:
val joined = df_reviews
  .join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
    && df_reviews("rev_3") === df_equiv("num"), "left")
  .select(df_reviews("Game"),
    df_equiv("num").as("rev_1_t"),
    df_equiv("num").as("rev_2_t"),
    df_equiv("num").as("rev_3_t")
  )
Thanks in advance!
You can do some left joins and then take the second-highest value using sort_array:
import org.apache.spark.sql.functions._

val joined = df_reviews
  .join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
  .join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
  .join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
  .select(df_reviews("Game"),
    df_rev1("Equivalence").as("rev_1_t"),
    df_rev2("Equivalence").as("rev_2_t"),
    df_rev3("Equivalence").as("rev_3_t")
  )
// Sort the three ratings in descending order; element (1) is then the second best.
// sort_array(..., asc = false) places nulls at the end, and coalesce falls back to
// greatest when fewer than two ratings are present.
val result = joined.withColumn(
  "second_best",
  coalesce(
    sort_array(
      array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int")),
      asc = false
    )(1),
    greatest(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
  )
)

result.show
+----+-------+-------+-------+-----------+
|Game|rev_1_t|rev_2_t|rev_3_t|second_best|
+----+-------+-------+-------+-----------+
| CA| 9| 7| 8| 8|
| FT| 3| 6| 10| 6|
+----+-------+-------+-------+-----------+
For your second question:
val joined = df_reviews.as("r1")
  .join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
  .selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
  .as("r2")
  .join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
  .selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
  .as("r3")
  .join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
  .selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")
joined.show
+----+-----+-----+-----+
|Game|rev_1|rev_2|rev_3|
+----+-----+-----+-----+
| CA| 5| 4| 5|
| FT| 6| 6| 7|
+----+-----+-----+-----+
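The same translation can also be written once and folded over the review columns, which avoids repeating the join three times. A minimal sketch, assuming the df_reviews and df_equiv names from the question:
val revCols = Seq("rev_1", "rev_2", "rev_3")

val translated = revCols.foldLeft(df_reviews) { (acc, c) =>
  // Rename the lookup columns so the join key matches the current review column.
  val equiv = df_equiv.withColumnRenamed("Valorat", c).withColumnRenamed("num", s"${c}_t")
  acc.join(equiv, Seq(c), "left")
    .drop(c)
    .withColumnRenamed(s"${c}_t", c)
}

translated.select("Game", "rev_1", "rev_2", "rev_3").show()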

Mean across different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and what I am trying to do is compute the mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output have to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values in the MEAN column:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions._

val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
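If you would rather avoid the join, a join-free alternative is to sum the non-null values and divide by the count of non-null columns. A minimal sketch, assuming the same df and column names as above:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

val cols = Seq("Power", "Vision", "KXD").map(col)
// Treat nulls as 0 in the sum, and count only the columns that are not null.
val sumNonNull   = cols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val countNonNull = cols.map(c => when(c.isNotNull, 1).otherwise(0)).reduce(_ + _)

df.withColumn("MEAN", sumNonNull / countNonNull).show()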

Time series with Scala and Spark: rolling window

I'm trying to work through the following exercise using Scala and Spark.
Given a file containing two columns: a time in seconds and a value
Example:
|---------------------|------------------|
| seconds | value |
|---------------------|------------------|
| 225 | 1,5 |
| 245 | 0,5 |
| 300 | 2,4 |
| 319 | 1,2 |
| 320 | 4,6 |
|---------------------|------------------|
and given a value V to be used as the rolling window size, this output should be created:
Example with V=20
|--------------|---------|--------------------|----------------------|
| seconds | value | num_row_in_window |sum_values_in_windows |
|--------------|---------|--------------------|----------------------|
| 225 | 1,5 | 1 | 1,5 |
| 245 | 0,5 | 2 | 2 |
| 300 | 2,4 | 1 | 2,4 |
| 319 | 1,2 | 2 | 3,6 |
| 320 | 4,6 | 3 | 8,2 |
|--------------|---------|--------------------|----------------------|
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying the sliding function and the SQL API, but it's a bit unclear to me which is the best way to tackle this problem, considering that I'm a Spark/Scala novice.
This is a perfect application for window functions. By using rangeBetween you can set your sliding window to 20 seconds. Note that in the example below no partitioning is specified (no partitionBy); without partitioning, this code will not scale:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._

val df = Seq(
  (225, 1.5),
  (245, 0.5),
  (300, 2.4),
  (319, 1.2),
  (320, 4.6)
).toDF("seconds", "value")

val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here

df
  .withColumn("num_row_in_window", sum(lit(1)).over(window))
  .withColumn("sum_values_in_window", sum($"value").over(window))
  .show()
+-------+-----+-----------------+--------------------+
|seconds|value|num_row_in_window|sum_values_in_window|
+-------+-----+-----------------+--------------------+
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
+-------+-----+-----------------+--------------------+
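To make this scale, give the window a partitioning key. A minimal sketch, assuming a hypothetical grouping column sensor_id:
// Hypothetical "sensor_id" column: each group gets its own 20-second rolling window.
val partitionedWindow = Window
  .partitionBy($"sensor_id")
  .orderBy($"seconds")
  .rangeBetween(-20L, 0L)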

How to transpose a sub-matrix of a Spark RDD iteratively?

For example,
From:
+-----+-----+
|Date |val_1|
+-----+-----+
| 1-1 | 1.1|
| 1-2 | 1.2|
| 1-3 | 1.3|
| 1-4 | 1.4|
| 1-5 | 1.5|
| 1-6 | 1.6|
| 1-7 | 1.7|
| 1-8 | 1.8|
| 1-9 | 1.9|
| ...| ...|
To:
+-----+-----+-----+-------+
| Date | D-3 | D-2 | D-1 |
+-----+-----+-----+-------+
| 1-4 | 1.1 | 1.2 | 1.3 |
| 1-5 | 1.2 | 1.3 | 1.4 |
| 1-6 | 1.3 | 1.4 | 1.5 |
| 1-7 | 1.4 | 1.5 | 1.6 |
| 1-8 | 1.5 | 1.6 | 1.7 |
| 1-9 | 1.6 | 1.7 | 1.8 |
| ... | ... | ... | ... |
Thanks a lot in advance.
Your question is not entirely clear, in particular with respect to the iterative solution you are after. However, for the example data provided:
df = sc.parallelize([('1-1', 1.1), ('1-2', 1.2), ('1-3', 1.3), ('1-4', 1.4), ('1-5', 1.5), ('1-6', 1.6),('1-7', 1.7),('1-8', 1.8),('1-9', 1.9)]).toDF(["Date", "val_1"])
You can use lag in combination with a Window to retrieve D-3, D-2 and D-1
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy(col("Date"))
dfl = df.select("Date",
                lag("val_1", 3).over(w).alias("D-3"),
                lag("val_1", 2).over(w).alias("D-2"),
                lag("val_1", 1).over(w).alias("D-1")).na.drop()
dfl.show()
This results in the following output:
+----+---+---+---+
|Date|D-3|D-2|D-1|
+----+---+---+---+
| 1-4|1.1|1.2|1.3|
| 1-5|1.2|1.3|1.4|
| 1-6|1.3|1.4|1.5|
| 1-7|1.4|1.5|1.6|
| 1-8|1.5|1.6|1.7|
| 1-9|1.6|1.7|1.8|
+----+---+---+---+
Thanks to Jaco for the inspiration.
Here is the Scala version:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import spark.implicits._

val df = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5),
  ("1-6", 1.6), ("1-7", 1.7), ("1-8", 1.8), ("1-9", 1.9))).toDF("Date", "val_1")

val w = Window.partitionBy().orderBy("Date")

// Use the default null for missing lags so na.drop() removes the first three rows.
val res = df
  .withColumn("D-3", lag("val_1", 3).over(w))
  .withColumn("D-2", lag("val_1", 2).over(w))
  .withColumn("D-1", lag("val_1", 1).over(w))
  .na.drop()
Result:
+----+-----+---+---+---+
|Date|val_1|D-3|D-2|D-1|
+----+-----+---+---+---+
| 1-4| 1.4|1.1|1.2|1.3|
| 1-5| 1.5|1.2|1.3|1.4|
| 1-6| 1.6|1.3|1.4|1.5|
| 1-7| 1.7|1.4|1.5|1.6|
| 1-8| 1.8|1.5|1.6|1.7|
| 1-9| 1.9|1.6|1.7|1.8|
+----+-----+---+---+---+
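Since the question mentions RDDs, another option is the sliding helper shipped with spark-mllib, which builds fixed-size windows directly on an RDD. A minimal sketch (it assumes spark-mllib is on the classpath and that the RDD is already ordered by date):
import org.apache.spark.mllib.rdd.RDDFunctions._

// Each window of 4 consecutive rows yields one output row: (Date, D-3, D-2, D-1).
val rows = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5),
  ("1-6", 1.6), ("1-7", 1.7), ("1-8", 1.8), ("1-9", 1.9)))

val lagged = rows.sliding(4).map { w =>
  (w(3)._1, w(0)._2, w(1)._2, w(2)._2)
}

lagged.toDF("Date", "D-3", "D-2", "D-1").show()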