Create another column using the value of another column - Scala

I have a dataframe in which I need to add another column based on grouping logic.
Dataframe
id|x_id|y_id|val_id|
1| 2 | 3 | 4 |
10| 2 | 3 | 40 |
1| 12 | 13 | 14 |
I need to add another column parent_id which will be based on this rule:
over each group of x_id and y_id, select the maximum value in col val_id and use its corresponding id value.
The final frame will look like this:
id|x_id|y_id|val_id| parent_id
1| 2 | 3 | 4 | 10 (coming from row 2)
10| 2 | 3 | 40 | 10 (coming from row 2)
1| 12 | 13 | 14 | 14
I have tried using withColumn, but I could only mark, within each group, the row whose id should become the parent; I could not propagate that id to every row of the group.
Explanation: Here parent_id is 10 because it comes from the id column. Row 2 was chosen because it has the maximum val_id within the group defined by x_id and y_id.
I am using Scala.

Use a Window partitioned by x_id and y_id, order each partition by val_id in descending order, and take the first id over that window.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first
import spark.implicits._ // for the 'symbol column syntax
val w = Window.partitionBy('x_id, 'y_id).orderBy('val_id.desc)
df.withColumn("parent_id", first('id).over(w))
  .show(false)
The result is:
+---+----+----+------+---------+
|id |x_id|y_id|val_id|parent_id|
+---+----+----+------+---------+
|10 |2 |3 |40 |10 |
|1 |2 |3 |4 |10 |
|1 |12 |13 |14 |1 |
+---+----+----+------+---------+
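An alternative sketch, in case you prefer an aggregation over a window function (not tested against your exact schema): aggregate the (val_id, id) pair per (x_id, y_id) group with max over a struct, then join the winning id back onto the original df.
import org.apache.spark.sql.functions.{col, max, struct}
// max over a struct compares field by field, so this picks the highest val_id
// per (x_id, y_id) group (ties broken by the higher id)
val parents = df
  .groupBy("x_id", "y_id")
  .agg(max(struct(col("val_id"), col("id"))).as("winner"))
  .select(col("x_id"), col("y_id"), col("winner.id").as("parent_id"))
val withParent = df.join(parents, Seq("x_id", "y_id"))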

Related

Fixing hierarchy data with table transformation (Hive, Scala, Spark)

I have a task that involves working with hierarchical data, but the source data contains errors in the hierarchy, namely that some parent-child links are broken. I have an algorithm for re-establishing such connections, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Schematically, the connections to C1 and D3 are lost here.
In order to restore the connections, I need to apply the following algorithm to this table:
if for some NAME the ID is not present in the PARENTID column (like ID = 18, 10), then create a 'parent' row with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), taking the ID and NAME from a node on the level above whose ID is less than the current ID.
Result must be like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Rows marked with # are the newly created rows.
Are there any ideas on how to implement this algorithm in Spark/Scala? Thanks!
You can build a createdRows dataframe from your current dataframe and union it with the current dataframe to obtain your final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVEL) that are not in the PARENTID column. You can use a self left anti join to do that.
Then, rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then, take the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, apply your condition ID < PARENTID.
You end up with the following code, where dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col

val createdRows = dataframe
  // if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
  .select("LEVEL", "ID")
  .filter(col("LEVEL") > 1) // remove the root node from the created rows
  .join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
  // then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
  .withColumnRenamed("ID", "PARENTID")
  .withColumn("LEVEL", col("LEVEL") - 1)
  // and take ID and NAME
  .join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
  // such that the current ID < ID of the node from the LEVEL above
  .filter(col("ID") < col("PARENTID"))

val result = dataframe
  .unionByName(createdRows)
  .orderBy("NAME", "PARENTID") // optional, if you want an ordered result
And in the result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
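For reference, a minimal sketch of how the sample dataframe could be built to try the code above (column types are assumed: NAME as string, IDs and LEVEL as nullable integers, and a SparkSession named spark in scope):
import spark.implicits._

// Option[Int] makes PARENTID nullable, matching the NULL values in the sample
val dataframe = Seq(
  ("A1", 1,  Some(2),  1),
  ("B1", 2,  Some(3),  2),
  ("C1", 18, Some(4),  3),
  ("C2", 3,  Some(5),  3),
  ("D1", 4,  None,     4),
  ("D2", 5,  None,     4),
  ("D3", 10, Some(11), 4),
  ("E1", 11, None,     5)
).toDF("NAME", "ID", "PARENTID", "LEVEL")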

Delete values lower than cummax on multiple Spark dataframe columns in Scala

I have a data frame as shown below. There are more than 100 signals, so there will be more than 100 columns in the data frame.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|......
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |1 |
|050|2021-01-15 |null |4 |2 |
|050|2021-02-02 |2 |3 |3 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |2 |null |null |
|051|2021-02-02 |3 |3 |2 |
|051|2021-02-03 |1 |3 |1 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |3 |3 |null |
|052|2021-03-06 |2 |null |2 |
|052|2021-03-16 |3 |5 |5 |.......
+-------------------------------------------+
I have to find the cumulative max (cummax) of each signal, compare it with the respective signal column, and delete the signal values that are lower than the cummax, as well as the null values.
Step 1: find the cumulative max for each signal, partitioned by the id column.
Step 2: delete the records that have a value lower than the cummax for each signal.
Step 3: take the count of records whose cummax is less than the signal value (nulls excluded) for each signal, with respect to id.
After the count, the final output should be as shown below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|.....
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 | 3 | 1 |
|050|2021-01-15 |null | null | 2 |
|050|2021-02-02 |2 | 3 | 3 |
|
|051|2021-01-14 |1 | 3 | 0 |
|051|2021-01-15 |2 | null | null |
|051|2021-02-02 |3 | 3 | 2 |
|051|2021-02-03 |null | 3 | null |
|
|052|2021-03-03 |1 | 3 | 0 |
|052|2021-03-05 |3 | 3 | null |
|052|2021-03-06 |null | null | 2 |
|052|2021-03-16 |3 | 5 | 5 | ......
+----------------+--------+--------+--------+
I have tried using a window function as below and it worked for almost all records.
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}
import scala.collection.mutable.ListBuffer

val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val signalList01 = ListBuffer[Column]()
signalList01 ++= Seq(col("id"), col("date"))
for (column <- signalColumns) {
  // applying the max (null-ignoring) aggregate function on each signal column
  signalList01 ++= Seq(col(column), max(column).over(w).alias(column + "_cummax"))
}
val cumMaxDf = df.select(signalList01: _*)
But I am getting erroneous values in the cummax column for a few records.
Does anyone have an idea why these erroneous values appear in the cummax column? Any leads appreciated!
Just giving out hints here (as you suggested) to help you unblock the situation, but --WARNING-- I haven't tested the code!
The code you provided in the comments looks good; it will get you your cummax column (note max, not sum):
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, max, udf}

val nw_df = original_df.withColumn("signal01_cummax", max(col("signal01")).over(windowCodedSO))
Now you need to be able to compare the two values in "signal01" and "signal01_cummax". A function like this, maybe (boxed Integer so nulls coming from the dataframe can be handled):
def takeOutRecordsLessThanCummax(signal: Integer, signal_cummax: Integer): Integer =
  if (signal == null || signal < signal_cummax) null
  else signal_cummax
Since we'll be applying it to columns, we'll wrap it up in a UDF:
val takeOutRecordsLessThanCummaxUDF: UserDefinedFunction = udf {
  (i: Integer, j: Integer) => takeOutRecordsLessThanCummax(i, j)
}
And then you can combine everything above so it is applicable to your original dataframe, folding over the signal columns only (not id or date). Something like this could work:
val signal_cummax_suffix = "_cummax"
val result = signalColumns.foldLeft(original_df)(
  (dfac, colname) => dfac
    .withColumn(colname.concat(signal_cummax_suffix),
      max(col(colname)).over(windowCodedSO))
    .withColumn(colname.concat("output"),
      takeOutRecordsLessThanCummaxUDF(col(colname), col(colname.concat(signal_cummax_suffix))))
)
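As a side note, the same idea can be sketched without a UDF by combining when with the running max (this assumes the signalColumns list and the df from the question; untested):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

val w = Window.partitionBy("id").orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// keep a value only when it equals the running max so far;
// lower values (and nulls) become null
val cleaned = signalColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) >= max(col(c)).over(w), col(c)))
}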

Select max common Date from different DataFrames (Scala Spark)

I have different dataframes and I want to select the max common Date across these DFs. For example, I have the following dataframes:
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2012-12-21 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
The selected date would be 2016-09-02 because it is the max date that exists in all 3 DFs (the date 2017-11-19 is not in the third DF).
I am trying to do it with agg(max), but this way I only get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
You can do semi joins to get the common dates, and aggregate the maximum date. No need to group by date because you want to get its maximum.
val result = df1.join(df2, Seq("Date"), "left_semi").join(df3, Seq("Date"), "left_semi").agg(max("Date"))
You can also use intersect:
val result = df1.select("Date").intersect(df2.select("Date")).intersect(df3.select("Date")).agg(max("Date"))

Compare two values with Scala Spark

I have the following parquet file:
+--------------+------------+-------+
|gf_cutoff | country_id |gf_mlt |
+--------------+------------+-------+
|2020-12-14 |DZ |5 |
|2020-08-06 |DZ |4 |
|2020-07-03 |DZ |4 |
|2020-12-14 |LT |1 |
|2020-08-06 |LT |1 |
|2020-07-03 |LT |1 |
As you can see, it is partitioned by country_id and ordered by gf_cutoff DESC. What I want to do is compare gf_mlt to check whether the value has changed. To do that, I want to compare the most recent gf_cutoff with the second most recent one.
An example of this case would be comparing:
2020-12-14 DZ 5
with
2020-08-06 DZ 4
If the value on the most recent date is different from the value on the second most recent date, I want to keep the most recent value (5 for DZ), put the old value in a new column, and put True in another column if the value has changed or False if it has not.
After doing this comparison, delete the older rows.
For DZ the value has changed, and for LT it hasn't because it is always 1.
So the output would be like this:
+--------------+------------+-------+------------+-----------+
|gf_cutoff | country_id |gf_mlt | Has_change | old_value |
+--------------+------------+-------+------------+-----------+
|2020-12-14 |DZ |5 | True | 4 |
|2020-12-14 |LT |1 | False | 1 |
If you need more explanation, just tell me.
You can use lag over an appropriate window to get the most recent value, and then filter the most recent rows using a row_number over another appropriate window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, lag, row_number}
import spark.implicits._ // for the $ column syntax

val df2 = df.withColumn(
  "last_value",
  lag("gf_mlt", 1).over(Window.partitionBy("country_id").orderBy("gf_cutoff"))
).withColumn(
  "rn",
  row_number().over(Window.partitionBy("country_id").orderBy(desc("gf_cutoff")))
).filter("rn = 1").withColumn(
  "changed",
  $"gf_mlt" =!= $"last_value" // true when the value differs from the previous one
).drop("rn")
df2.show
+----------+----------+------+----------+-------+
| gf_cutoff|country_id|gf_mlt|last_value|changed|
+----------+----------+------+----------+-------+
|2020-12-14|        DZ|     5|         4|   true|
|2020-12-14|        LT|     1|         1|  false|
+----------+----------+------+----------+-------+
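If you want the exact column names from the question, a small follow-up sketch on top of df2:
val finalDf = df2
  .withColumnRenamed("last_value", "old_value")
  .withColumnRenamed("changed", "Has_change")
  .select("gf_cutoff", "country_id", "gf_mlt", "Has_change", "old_value")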

Spark lag with default value as another column

Let's say we have
|bin | min | end | start |
|1 | 5 | 10 |
|2 | 12 | 24 |
|3 | 28 | 36 |
|4 | 40 | 50 |
|5 | null| null |
I want to populate start with the previous row's end to make continuous bin values. Where that is missing (the first row), I would like to fill in with the current row's min instead. For the null row I am considering treating it separately.
What lag gives us would be:
df.withColumn("start", F.lag(F.col("end"), 1, ***default_value***).over(Window.orderBy(F.col("bin"))))
|bin | min | end | start |
|1 | 5 | 10 | (5 wanted)
|2 | 12 | 24 | 10
|3 | 28 | 36 | 24
|4 | 40 | 50 | 36
|5 | null| null | null
My questions:
1/ What do we put in default_value so that lag takes another column of the current row, in this case min?
2/ Is there a way to treat the null row at the same time without separating it out? I intend to filter the non-null rows, perform the lag, then union back with the null rows. How will the answer differ if the null is the first row (bin 1) or the last (bin 5)?
Use coalesce to fall back to another column's value when lag returns null, e.g. for the first row in a group.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn("start", F.coalesce(F.lag(F.col("end"), 1).over(Window.orderBy("bin")), F.col("min")))
lag currently doesn't support ignorenulls option, so you might have to separate out the null rows, compute the start column for non-null rows and union the data frames.
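For a Scala version of the same coalesce + lag idea (a sketch; note that Window.orderBy without a partition column moves all data to a single partition, so on a real dataset you would add a partitionBy):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag}

val w = Window.orderBy("bin")

// fall back to the current row's min when there is no previous end
val withStart = df.withColumn("start", coalesce(lag(col("end"), 1).over(w), col("min")))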