new column based on another column and a value change in spark - scala

I have a Spark DataFrame and I want to create a new column RESULT based on the ID and DAT_TIMESTAMP columns, depending on the value in the STATUS column. When the STATUS value becomes A for the first time, I want to take the value of DAT_TIMESTAMP, concatenate it with the value of ID, and keep using that same DAT_TIMESTAMP value for the subsequent rows until STATUS becomes A for the first time again.
I hope the sample data below helps to explain what I want to do.
+--------+--------------+-------+----------------------+
| ID| DAT_TIMESTAMP| STATUS| RESULT|
+--------+--------------+-------+----------------------+
| ID_1111| 1617214599502| D| ID_1111_1617214600304|
| ID_1111| 1617214600002| D| ID_1111_1617214600304|
| ID_1111| 1617214600502| A| ID_1111_1617214600502| // first appearance of A
| ID_1111| 1617214601003| A| ID_1111_1617214600502|
| ID_1111| 1617214601503| A| ID_1111_1617214600502|
| ID_1111| 1617214602003| B| ID_1111_1617214600502|
| ID_1111| 1617214602503| B| ID_1111_1617214600502|
| ID_1111| 1617214603004| C| ID_1111_1617214600502|
| ID_1111| 1617214603504| C| ID_1111_1617214600502|
| ID_1111| 1617214604004| C| ID_1111_1617214600502|
| ID_1111| 1617214604504| C| ID_1111_1617214600502|
| ID_1111| 1617214605003| D| ID_1111_1617214600502|
| ID_1111| 1617214605506| D| ID_1111_1617214600502|
| ID_1111| 1617214606003| A| ID_1111_1617214606003| // first appearance of A, again
| ID_1111| 1617214606504| A| ID_1111_1617214606003|
| ID_1111| 1617214607004| A| ID_1111_1617214606003|
+--------+--------------+-------+----------------------+
I have been trying to do it using the when function, but I am really unsure about what I am doing. Can this be done in Spark? Thanks.
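For reference, a minimal sketch to build the input frame above (RESULT is the column to compute); the SparkSession name spark is an assumption, and only a subset of the rows is shown:
import spark.implicits._

val df = Seq(
  ("ID_1111", 1617214599502L, "D"),
  ("ID_1111", 1617214600002L, "D"),
  ("ID_1111", 1617214600502L, "A"),
  ("ID_1111", 1617214601003L, "A"),
  ("ID_1111", 1617214602003L, "B"),
  ("ID_1111", 1617214603004L, "C"),
  ("ID_1111", 1617214605003L, "D"),
  ("ID_1111", 1617214606003L, "A"),
  ("ID_1111", 1617214606504L, "A")
).toDF("ID", "DAT_TIMESTAMP", "STATUS")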

I finally did it; maybe not the most elegant or best way to do it...
I can detect when the value in the STATUS column becomes A for the first time by using the lag function to get the previous value of STATUS, then checking whether the current value is A and the previous value is not A.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

// load data...

// add the new column
val df2 = df.orderBy("ID", "DAT_TIMESTAMP")
  .withColumn("RESULT",
    last(
      when(
        $"STATUS" === "A" &&
          // previous value of STATUS, for the comparison
          lag($"STATUS", 1).over(Window.orderBy("ID", "DAT_TIMESTAMP")).notEqual("A"),
        concat_ws("_", $"ID", $"DAT_TIMESTAMP"))
        .otherwise(null),
      true) // ignoreNulls: carry the last non-null value forward
      .over(Window.orderBy("ID", "DAT_TIMESTAMP")))
df2 will look like below. The first two rows of RESULT are null since there is no preceding A to reference.
df2.show(false)
+-------+-------------+------+---------------------+
|ID |DAT_TIMESTAMP|STATUS|RESULT |
+-------+-------------+------+---------------------+
|ID_1111|1617214599502|D |null |
|ID_1111|1617214600002|D |null |
|ID_1111|1617214600502|A |ID_1111_1617214600502|
|ID_1111|1617214601003|A |ID_1111_1617214600502|
|ID_1111|1617214601503|A |ID_1111_1617214600502|
|ID_1111|1617214602003|B |ID_1111_1617214600502|
|ID_1111|1617214602503|B |ID_1111_1617214600502|
|ID_1111|1617214603004|C |ID_1111_1617214600502|
|ID_1111|1617214603504|C |ID_1111_1617214600502|
|ID_1111|1617214604004|C |ID_1111_1617214600502|
|ID_1111|1617214604504|C |ID_1111_1617214600502|
|ID_1111|1617214605003|D |ID_1111_1617214600502|
|ID_1111|1617214605506|D |ID_1111_1617214600502|
|ID_1111|1617214606003|A |ID_1111_1617214606003|
|ID_1111|1617214606504|A |ID_1111_1617214606003|
|ID_1111|1617214607004|A |ID_1111_1617214606003|
+-------+-------------+------+---------------------+
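If the frame holds more than one ID, a variant that partitions the windows by ID (instead of ordering globally) keeps each ID's marker from leaking into the next ID. This is only a sketch built on the same last/when/lag idea:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Both windows are partitioned by ID and ordered by DAT_TIMESTAMP only.
// The default value "" in lag means an ID whose very first row is already A
// still gets its own marker.
val w = Window.partitionBy("ID").orderBy("DAT_TIMESTAMP")

val df2 = df.withColumn("RESULT",
  last(
    when(col("STATUS") === "A" && lag(col("STATUS"), 1, "").over(w).notEqual("A"),
      concat_ws("_", col("ID"), col("DAT_TIMESTAMP"))),
    true // ignoreNulls
  ).over(w))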

Related

In df2, find value that is equal to or is the next greatest closest value to the value in col in df1 without join

I have two DataFrames that cannot be joined to each other. For each value in the item column of df1, I am interested in finding the two values in the item column of df2 that are equal to it or are the closest next greatest values. I would like to create a column that stores each matching df2 item and item_name as a list.
As an example:
df1 =
+--------+-----+
|customer| item|
+--------+-----+
|       A|  124|
|       B|  139|
|       C|  184|
|       A|  315|
+--------+-----+
df2 =
+----+---------+
|item|item_name|
+----+---------+
| 123|    boots|
| 124|    socks|
| 125|    shirt|
| 188|    socks|
| 111|    pants|
| 142|   shorts|
+----+---------+
expected output =
+--------+----+-------------+
|customer|item|    items_df2|
+--------+----+-------------+
|       A| 124| [124, socks]|
|       A| 124| [125, shirt]|
|       B| 139|[142, shorts]|
|       B| 139| [188, socks]|
|       C| 184| [188, socks]|
+--------+----+-------------+
I have found examples where this is possible using window functions and rank, but in each of these examples a join is possible. Thank you in advance.
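One possible approach, given that no join is allowed: if df2 is small enough, collect and broadcast it and do the lookup in a UDF. This is a hedged sketch; it assumes an active SparkSession named spark and that item is an integer column in both frames:
import org.apache.spark.sql.functions._

// Collect df2 (assumed small) as (item, item_name) pairs, sorted by item.
val df2Items = df2
  .select(col("item").cast("int"), col("item_name"))
  .collect()
  .map(r => (r.getInt(0), r.getString(1)))
  .sortBy(_._1)

val itemsBc = spark.sparkContext.broadcast(df2Items)

// For each df1 item, keep the two smallest df2 items that are >= it.
val closestTwo = udf { (target: Int) =>
  itemsBc.value.filter(_._1 >= target).take(2).map { case (i, n) => Seq(i.toString, n) }
}

// explode drops rows with no candidate at all (e.g. item 315 above).
val result = df1.withColumn("items_df2", explode(closestTwo(col("item").cast("int"))))
result.show(false)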

Scala Spark use Window function to find max value

I have a data set that looks like this:
+-------------------+----+
|          timestamp|zone|
+-------------------+----+
|2019-01-01 00:05:00|   A|
|2019-01-01 00:05:00|   A|
|2019-01-01 00:05:00|   B|
|2019-01-01 01:05:00|   C|
|2019-01-01 02:05:00|   B|
|2019-01-01 02:05:00|   B|
+-------------------+----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+----+----+---+
|hour|zone|max|
+----+----+---+
|   0|   A|  2|
|   1|   C|  1|
|   2|   B|  2|
+----+----+---+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two consecutive window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
You can also use window functions together with a groupBy on DataFrames.
In your case you can use the rank() over (partition by) window function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// first, group by hour and zone
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank by hour, ordered by max in descending order
val df_rank = df_group
  .select(col("hour"),
    col("zone"),
    col("max"),
    rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// filter the rows where rank = 1
df_rank
  .select(col("hour"), col("zone"), col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/
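For completeness, an aggregate-only sketch that produces the same result without the second window pass; it assumes df is the frame from the question, and note that ties resolve to the lexicographically largest zone, which may differ from what row_number or rank would pick:
import org.apache.spark.sql.functions._

// Count per (hour, zone), then keep the zone with the highest count per hour
// by taking the max of a (cnt, zone) struct.
df
  .groupBy(hour(col("timestamp")).as("hour"), col("zone"))
  .agg(count("*").as("cnt"))
  .groupBy(col("hour"))
  .agg(max(struct(col("cnt"), col("zone"))).as("top"))
  .select(col("hour"), col("top.zone").as("zone"), col("top.cnt").as("max"))
  .orderBy(col("hour"))
  .show()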

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset that has userid and index columns.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and assign an index to the newly added rows.
The userids are unique, and the existing data frame will not contain the user ids of DataFrame 2.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output, with the newly added userids and indexes:
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by passing the max index value and starting the index of the second DataFrame from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df_merge = df1.select('userid').union(df2.select('userid'))
w = Window.orderBy('userid')
df_result = df_merge.withColumn('indexid', F.row_number().over(w))
EDIT: after discussion in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame([('a', 100), ('ab', 50), ('ba', 300), ('ced', 60), ('d', 500)],
                                schema=['userid', 'index'])
df1 = sqlContext.createDataFrame([('fgh', 100), ('ff', 50), ('fe', 300), ('er', 60), ('fi', 500)],
                                 schema=['userid', 'dummy'])

#%% Merge the two dataframes, with a null column as the index for the new rows
df1 = df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))

#%% Define a window that puts the newly added rows last and orders them by userid
#%% (the userids, even though random strings, can be ordered).
#%% If possible add a partition column here, otherwise all the data ends up in
#%% one partition; consider salting.
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))

#%% For the newly added rows, define the index as the last existing index value
#%% plus a running count of rows over the window
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index'))
     .otherwise(F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w)))

#%% If the number of rows in the main dataframe is huge, add an offset in the line above
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
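A Scala sketch that answers the literal question (take the max index and start the second frame right after it); existing and newUsers are stand-in names for the two DataFrames, and the single-partition caveat about an unpartitioned window applies here too:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Current maximum index, brought to the driver as a single value.
val maxIndex = existing.agg(max(col("index")).cast("long")).first().getLong(0)

// Number the new user ids starting right after the current maximum.
val newIndexed = newUsers
  .withColumn("index", row_number().over(Window.orderBy("userid")) + lit(maxIndex))

val result = existing.unionByName(newIndexed).orderBy("index")
result.show()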

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length values in a Spark dataframe. Every column is of string type.
The dataframe looks like:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|    |
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|    |   d|
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
The expectation is:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
I can't figure out how to do this easily with Spark...
Thanks in advance
Note: this approach handles any addition or deletion of columns in the DataFrame without a code change.
It works by concatenating all columns except the first (id) and computing the length of the result, then keeping only the rows with the maximum length per id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val output = input
  .withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
Alternatively, since the duplicate rows only differ by blank cells, you can group by id and collapse each column with collect_set:
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(
     |   concat_ws("", collect_set(col("A"))).alias("A"),
     |   concat_ws("", collect_set(col("B"))).alias("B"),
     |   concat_ws("", collect_set(col("C"))).alias("C"),
     |   concat_ws("", collect_set(col("D"))).alias("D")
     | ).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+

spark withColumn: create a column duplicating a value from an existing column

I am having a problem figuring this out. Here is the problem statement:
Let's say I have a dataframe. I want to select the value in column C where column B's value is foo, then create a new column D and repeat that value (3) for all rows.
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
I want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value (i.e. 3) for the row where column B is foo:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value with withColumn to create a new column D using the lit function:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
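A defensive variant, in case the filter may match no rows (first() would throw on an empty result); this is only a sketch of the same idea:
import org.apache.spark.sql.functions._

// head(1) returns an empty Array instead of throwing when nothing matches "foo".
val maybeValue = df.filter(col("B") === "foo").select("C").head(1).headOption.map(_.get(0))

maybeValue match {
  case Some(v) => df.withColumn("D", lit(v)).show(false)
  case None    => df.show(false) // no matching row; leave the frame unchanged
}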