Create a column to accumulate the data in an array - PySpark

I need to create a field to accumulate the data in an array.
I have the following dataframe:
+----------------+-------------------+----------+---------------------+
|localSymbol_drop|end_window_drop |detect_DRM|last_detect_price_DRM|
+----------------+-------------------+----------+---------------------+
|BABA |2021-06-15 16:36:30|NO |NA |
|BABA |2021-06-15 16:37:00|NO |NA |
|BABA |2021-06-15 16:37:30|YES |211.85 |
|BABA |2021-06-15 16:38:00|NO |NA |
|BABA |2021-06-15 16:38:30|NO |NA |
|BABA |2021-06-15 16:40:30|NO |NA |
|BABA |2021-06-15 16:41:00|YES |211.91 |
|BABA |2021-06-15 16:42:00|NO |NA |
|BABA |2021-06-15 16:42:30|YES |211.83 |
+----------------+-------------------+----------+---------------------+
and the result will be:
+----------------+-------------------+----------+---------------------+----------------------------------------+
|localSymbol_drop|end_window_drop |detect_DRM|last_detect_price_DRM|accum_array |
+----------------+-------------------+----------+---------------------+----------------------------------------+
|BABA |2021-06-15 16:36:30|NO |NA |[NA] |
|BABA |2021-06-15 16:37:00|NO |NA |[NA,NA] |
|BABA |2021-06-15 16:37:30|YES |211.85 |[NA,NA,211.85] |
|BABA |2021-06-15 16:38:00|NO |NA |[NA,NA,211.85,NA] |
|BABA |2021-06-15 16:38:30|NO |NA |[NA,NA,211.85,NA,NA] |
|BABA |2021-06-15 16:40:30|NO |NA |[NA,NA,211.85,NA,NA,NA] |
|BABA |2021-06-15 16:41:00|YES |211.91 |[NA,NA,211.85,NA,NA,NA,211.91] |
|BABA |2021-06-15 16:42:00|NO |NA |[NA,NA,211.85,NA,NA,NA,211.91,NA] |
|BABA |2021-06-15 16:42:30|YES |211.83 |[NA,NA,211.85,NA,NA,NA,211.91,NA,211.83]|
+----------------+-------------------+----------+---------------------+----------------------------------------+
Any idea? Thank you!!

For my solution you first need to create an index on your dataframe:
1)
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window.orderBy("last_detect_price_DRM")
df = df.withColumn("index", row_number().over(w))
Once you have an index on your dataframe, you need to collect all values from the column you want to accumulate and sort that list (so that it is in the same order as your dataframe):
2)
from pyspark.sql import functions as f

my_list = df.select(f.collect_list('last_detect_price_DRM')).first()[0]
my_list.sort()
Now you just need to create a user-defined function (UDF) that takes the index as input and returns all elements of the list up to that index. After that, apply it to your dataframe with withColumn('columnName', udf(...)):
3)
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, ArrayType
def custom_func(index):
    return my_list[0:index]

custom_func = udf(custom_func, ArrayType(StringType()))
df = df.withColumn('acc', custom_func(col('index')))
That will accumulate all values in a given column.
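For illustration, the three numbered outputs below were produced on a small toy dataframe. A minimal sketch of how such a frame could be built (assuming an active SparkSession named spark; _1 and _2 are the default column names Spark gives to unnamed tuple fields):

# Toy dataframe for the step-by-step illustration below.
df = spark.createDataFrame([("Java", 20000), ("Python", 100000), ("Scala", 3000)])
df.show()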
1.
+------+------+
| _1| _2|
+------+------+
| Java| 20000|
|Python|100000|
| Scala| 3000|
+------+------+
2.
+------+------+-----+
| _1| _2|index|
+------+------+-----+
|Python|100000| 1|
| Java| 20000| 2|
| Scala| 3000| 3|
+------+------+-----+
3.
+------+------+-----+--------------------+
| _1| _2|index| acc|
+------+------+-----+--------------------+
|Python|100000| 1| [100000]|
| Java| 20000| 2| [100000, 20000]|
| Scala| 3000| 3|[100000, 20000, 3...|
+------+------+-----+--------------------+

Another solution, without a UDF.
First of all I calculate an index:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

win = Window.partitionBy('localSymbol_drop').orderBy('end_window_drop')
historical_trends = historical_trends.withColumn("index", f.row_number().over(win))
Then I get the ordered list:
list_last_detect_price_DRM = historical_trends\
    .orderBy('end_window_drop')\
    .groupby('localSymbol_drop')\
    .agg(
        f.collect_list("last_detect_price_DRM").alias('list_last_detect_price_DRM')
    ).select(
        f.col('list_last_detect_price_DRM'),
        f.col('localSymbol_drop').alias('localSymbol_other_drop')
    )
Once I have the ordered list, I join it back to the original dataframe and use the slice function:
historical_trends = historical_trends \
    .join(list_last_detect_price_DRM,
          on=[
              (historical_trends.localSymbol_drop == list_last_detect_price_DRM.localSymbol_other_drop)
          ],
          how='left').distinct()\
    .withColumn(
        "correct_list_last_detect_price_DRM", f.expr("slice(list_last_detect_price_DRM,1,index)")
    ).drop('list_last_detect_price_DRM', 'localSymbol_other_drop')
and the result is:
+----------------+-------------------+----------+---------------------+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
|localSymbol_drop|end_window_drop |detect_DRM|last_detect_price_DRM|index|correct_list_last_detect_price_DRM |
+----------------+-------------------+----------+---------------------+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
|BABA |2021-06-15 16:36:30|NO |NA |1 |[NA] |
|BABA |2021-06-15 16:37:00|NO |NA |2 |[NA, NA] |
|BABA |2021-06-15 16:37:30|YES |211.85 |3 |[NA, NA, 211.85] |
|BABA |2021-06-15 16:38:00|NO |NA |4 |[NA, NA, 211.85, NA] |
|BABA |2021-06-15 16:38:30|NO |NA |5 |[NA, NA, 211.85, NA, NA] |
|BABA |2021-06-15 16:39:00|NO |NA |6 |[NA, NA, 211.85, NA, NA, NA] |
|BABA |2021-06-15 16:39:30|NO |NA |7 |[NA, NA, 211.85, NA, NA, NA, NA] |
|BABA |2021-06-15 16:40:00|NO |NA |8 |[NA, NA, 211.85, NA, NA, NA, NA, NA] |
|BABA |2021-06-15 16:40:30|NO |NA |9 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA] |
|BABA |2021-06-15 16:41:00|YES |211.91 |10 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91] |
|BABA |2021-06-15 16:41:30|NO |NA |11 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA] |
|BABA |2021-06-15 16:42:00|NO |NA |12 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA] |
|BABA |2021-06-15 16:42:30|YES |211.83 |13 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83] |
|BABA |2021-06-15 16:43:00|NO |NA |14 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA] |
|BABA |2021-06-15 16:43:30|YES |211.75 |15 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75] |
|BABA |2021-06-15 16:44:00|NO |NA |16 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA] |
|BABA |2021-06-15 16:44:30|NO |NA |17 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA] |
|BABA |2021-06-15 16:45:00|NO |NA |18 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA] |
|BABA |2021-06-15 16:45:30|YES |211.72 |19 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72] |
|BABA |2021-06-15 16:46:00|NO |NA |20 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA] |
|BABA |2021-06-15 16:46:30|NO |NA |21 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA] |
|BABA |2021-06-15 16:47:00|NO |NA |22 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA] |
|BABA |2021-06-15 16:47:30|YES |211.81 |23 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA, 211.81] |
|BABA |2021-06-15 16:48:00|NO |NA |24 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA, 211.81, NA] |
|BABA |2021-06-15 16:48:30|NO |NA |25 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA, 211.81, NA, NA] |
|BABA |2021-06-15 16:49:00|YES |211.93 |26 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA, 211.81, NA, NA, 211.93] |
|BABA |2021-06-15 16:49:30|NO |NA |27 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA, 211.81, NA, NA, 211.93, NA] |
|BABA |2021-06-15 16:50:00|NO |NA |28 |[NA, NA, 211.85, NA, NA, NA, NA, NA, NA, 211.91, NA, NA, 211.83, NA, 211.75, NA, NA, NA, 211.72, NA, NA, NA, 211.81, NA, NA, 211.93, NA, NA]|
+----------------+-------------------+----------+---------------------+-----+--------------------------------------------------------------------------------------------------------------------------------------------+
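As a side note, on recent Spark versions the same running accumulation can be expressed directly with collect_list over a window frame, avoiding both the UDF and the self-join. A minimal sketch, assuming the dataframe and column names from the question, and assuming the 'NA' entries are literal strings (as the expected output suggests), so collect_list keeps them:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Running frame from the first row of each symbol up to the current row, ordered by time.
win = (Window.partitionBy("localSymbol_drop")
             .orderBy("end_window_drop")
             .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = df.withColumn("accum_array", f.collect_list("last_detect_price_DRM").over(win))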

Related

How to re-assign session_id to items when we want to create another session after every null value in items?

I have a pyspark dataframe-
df1 = spark.createDataFrame([
    ("s1", "i1", 0),
    ("s1", "i2", 1),
    ("s1", "i3", 2),
    ("s1", None, 3),
    ("s1", "i5", 4),
    ],
    ["session_id", "item_id", "pos"])
df1.show(truncate=False)
pos is the position or rank of the item in the session.
Now I want to create new sessions without any null values in them. I want to do this by starting a new session after every null item. Basically I want to break existing sessions into multiple sessions, removing the null item_id in the process.
The expected output would look something like:
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 | s1_0|
|s1 |i2 |1 | s1_0|
|s1 |i3 |2 | s1_0|
|s1 |null |3 | None|
|s1 |i5 |4 | s1_4|
+----------+-------+---+--------------+
How do I achieve this?
I'm not sure about the config of your Spark job, but to avoid using a collect action to build the reference of your "new" sessions in a Python built-in data structure, I would use built-in Spark SQL functions to build the new-session reference. Based on your example, and assuming you have already sorted the dataframe:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
df = spark.createDataFrame(
    [("s1", "i1", 0), ("s1", "i2", 1), ("s1", "i3", 2), ("s1", None, 3), ("s1", None, 4), ("s1", "i6", 5), ("s2", "i7", 6), ("s2", None, 7), ("s2", "i9", 8), ("s2", "i10", 9), ("s2", "i11", 10)],
    ["session_id", "item_id", "pos"]
)
df.show(20, False)
+----------+-------+---+
|session_id|item_id|pos|
+----------+-------+---+
|s1 |i1 |0 |
|s1 |i2 |1 |
|s1 |i3 |2 |
|s1 |null |3 |
|s1 |null |4 |
|s1 |i6 |5 |
|s2 |i7 |6 |
|s2 |null |7 |
|s2 |i9 |8 |
|s2 |i10 |9 |
|s2 |i11 |10 |
+----------+-------+---+
Step 1: As the data is already sorted, we can use a lag function to shift the data to the next record:
df2 = df\
    .withColumn('lag_item', func.lag('item_id', 1).over(Window.partitionBy('session_id').orderBy('pos')))
df2.show(20, False)
+----------+-------+---+--------+
|session_id|item_id|pos|lag_item|
+----------+-------+---+--------+
|s1 |i1 |0 |null |
|s1 |i2 |1 |i1 |
|s1 |i3 |2 |i2 |
|s1 |null |3 |i3 |
|s1 |null |4 |null |
|s1 |i6 |5 |null |
|s2 |i7 |6 |null |
|s2 |null |7 |i7 |
|s2 |i9 |8 |null |
|s2 |i10 |9 |i9 |
|s2 |i11 |10 |i10 |
+----------+-------+---+--------+
Step 2: After applying the lag function, we can see whether the item_id in the previous record is NULL or not. Therefore, we can find the boundaries of each new session by filtering on that, and build the reference:
reference = df2\
    .filter((func.col('item_id').isNotNull()) & (func.col('lag_item').isNull()))\
    .groupby('session_id')\
    .agg(func.collect_set('pos').alias('session_id_set'))
reference.show(100, False)
+----------+--------------+
|session_id|session_id_set|
+----------+--------------+
|s1 |[0, 5] |
|s2 |[6, 8] |
+----------+--------------+
Step 3: Join the reference back to the data and write a simple UDF to find which new session each row should belong to:
@func.udf(returnType=IntegerType())
def udf_find_session(item_id, pos, session_id_set):
    r_val = None
    if item_id is not None:
        for item in sorted(session_id_set):  # collect_set gives no ordering guarantee, so sort the boundaries
            if pos >= item:
                r_val = item
            else:
                break
    return r_val
df3 = df2.select('session_id', 'item_id', 'pos')\
    .join(reference, on='session_id', how='inner')
df4 = df3.withColumn('new_session_id', udf_find_session(func.col('item_id'), func.col('pos'), func.col('session_id_set')))
df4.show(20, False)
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 |0 |
|s1 |i2 |1 |0 |
|s1 |i3 |2 |0 |
|s1 |null |3 |null |
|s1 |null |4 |null |
|s1 |i6 |5 |5 |
|s2 |i7 |6 |6 |
|s2 |null |7 |null |
|s2 |i9 |8 |8 |
|s2 |i10 |9 |8 |
|s2 |i11 |10 |8 |
+----------+-------+---+--------------+
The last step is just to concatenate the strings you want to show in the new session id.
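A minimal sketch of that last concatenation step, assuming df4 from above and the s1_0-style ids shown in the question (the new_session_id produced by the UDF is an integer, so it is cast to string first):

df5 = df4.withColumn(
    'new_session_id',
    func.when(
        func.col('new_session_id').isNotNull(),
        func.concat_ws('_', func.col('session_id'), func.col('new_session_id').cast('string'))
    )
)
df5.show(20, False)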

Pyspark get predecessor value

I have a dataset similar to this one
+---+---+---+------+-----+
|exp|pid|mat|pskey |order|
+---+---+---+------+-----+
|1  |CR |P  |1-CR-P|1    |
|1  |M  |C  |1-M-C |2    |
|1  |CR |C  |1-CR-C|3    |
|1  |PP |C  |1-PP-C|4    |
|2  |CR |P  |2-CR-P|1    |
|2  |CR |P  |2-CR-P|1    |
|2  |M  |C  |2-M-C |2    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|3  |M  |C  |3-M-C |2    |
|4  |CR |P  |4-CR-P|1    |
|4  |M  |C  |4-M-C |2    |
|4  |CR |C  |4-CR-C|3    |
|4  |PP |C  |4-PP-C|4    |
+---+---+---+------+-----+
What I need is to get the pskey of the predecessor within the same exp, given the following relation:
order 1 -> no predecessor
order 2 -> no predecessor
order 3 -> [1,2]
order 4 -> [3]
And add those values to a new column called predecessor
The expected result would be like:
+---+---+---+------+-----+----------------------------------------+
|exp|pid|mat|pskey |order|predecessor |
+---+---+---+------+-----+----------------------------------------+
|1 |CR |P |1-CR-P|1 |null |
|1 |M |C |1-M-C |2 |null |
|1 |CR |C |1-CR-C|3 |[1-CR-P, 1-M-C ] |
|1 |PP |C |1-PP-C|4 |[1-CR-C] |
|3 |M |C |3-M-C |2 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |M |C |2-M-C |2 |null |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|4 |CR |P |4-CR-P|1 |null |
|4 |M |C |4-M-C |2 |null |
|4 |CR |C |4-CR-C|3 |[4-CR-P, 4-M-C] |
|4 |PP |C |4-PP-C|4 |[4-CR-C] |
+---+---+---+------+-----+----------------------------------------+
I am quite new to pyspark so I have no clue how to manage it.
The different cases of order are handled with when. You aggregate the values with collect_set to get unique identifiers:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    "predecessor",
    F.when(
        F.col("order") == 3,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    ).when(
        F.col("order") == 4,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),
)
The result:
df2.show(truncate=False)
+---+---+---+------+-----+----------------+
|exp|pid|mat|pskey |order|predecessor |
+---+---+---+------+-----+----------------+
|1 |CR |P |1-CR-P|1 |null |
|1 |M |C |1-M-C |2 |null |
|1 |CR |C |1-CR-C|3 |[1-CR-P, 1-M-C ]|
|1 |PP |C |1-PP-C|4 |[1-CR-C] |
|3 |M |C |3-M-C |2 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |M |C |2-M-C |2 |null |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|4 |CR |P |4-CR-P|1 |null |
|4 |M |C |4-M-C |2 |null |
+---+---+---+------+-----+----------------+
only showing top 20 rows
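To try this locally, the input dataframe can be rebuilt from a few of the question's rows. A minimal sketch (assuming an active SparkSession named spark, and keeping order numeric so that rangeBetween behaves as shown):

df = spark.createDataFrame(
    [(1, "CR", "P", "1-CR-P", 1),
     (1, "M", "C", "1-M-C", 2),
     (1, "CR", "C", "1-CR-C", 3),
     (1, "PP", "C", "1-PP-C", 4)],
    ["exp", "pid", "mat", "pskey", "order"],
)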

How to Split the row by nth delimiter in Spark Scala

I have below data which is stored in a csv file
1|Roy|NA|2|Marry|4.6|3|Richard|NA|4|Joy|NA|5|Joe|NA|6|Jos|9|
Now I want to read the file and store it in a Spark dataframe; before storing it into the dataframe I want to split at every 3rd | and store each piece as a row.
Output Expected :
1|Roy|NA|
2|Marry|4.6|
3|Richard|NA|
4|Joy|NA|
5|Joe|NA|
6|Jos|9|
Could anyone help me get the output like the above?
Start by reading your csv file
val df = spark.read.option("delimiter", "|").csv(file)
This will give you this dataframe
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4  |_c5|_c6|_c7    |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
|1  |Roy|NA |2  |Marry|4.6|3  |Richard|NA |4  |Joy |NA  |5   |Joe |NA  |6   |Jos |9   |null|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+----+
The last column is created because of the trailing delimiter in your csv file, so we get rid of it:
val dataframe = df.drop(df.schema.last.name)
dataframe.show(false)
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|_c0|_c1|_c2|_c3|_c4 |_c5|_c6|_c7 |_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
|1 |Roy|NA |2 |Marry|4.6|3 |Richard|NA |4 |Joy |NA |5 |Joe |NA |6 |Jos |9 |
+---+---+---+---+-----+---+---+-------+---+---+----+----+----+----+----+----+----+----+
Then you need to create an array containing the column names you want in your final dataframe:
val names : Array[String] = Array("colOne", "colTwo", "colThree")
Last, you need a function that reads the columns three at a time:
def splitCSV(dataFrame: DataFrame, columnNames: Array[String], sparkSession: SparkSession): DataFrame = {
  import sparkSession.implicits._
  val columns = dataFrame.columns
  var finalDF: DataFrame = Seq.empty[(String, String, String)].toDF(columnNames: _*)
  for (order <- 0 until (columns.length) - 3 by (3)) {
    finalDF = finalDF.union(dataFrame.select(col(columns(order)).as(columnNames(0)), col(columns(order + 1)).as(columnNames(1)), col(columns(order + 2)).as(columnNames(2))))
  }
  finalDF
}
Then we apply this function to the dataframe:
val finalDF = splitCSV(dataframe, names, sparkSession)
finalDF.show(false)
+------+-------+--------+
|colOne|colTwo |colThree|
+------+-------+--------+
|1 |Roy |NA |
|1 |Roy |NA |
|1 |Roy |NA |
|2 |Marry |4.6 |
|2 |Marry |4.6 |
|2 |Marry |4.6 |
|3 |Richard|NA |
|3 |Richard|NA |
|3 |Richard|NA |
|4 |Joy |NA |
|4 |Joy |NA |
|4 |Joy |NA |
|5 |Joe |NA |
|5 |Joe |NA |
|5 |Joe |NA |
+------+-------+--------+
You can use regex for most of it. There's no straightforward regex for "split at nth matching occurrence", so we work around it by using a match to pick out the pattern, then insert a custom splitter that we can then use.
ds
  .withColumn("value",
    regexp_replace('value, "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)\\|", "$1|$2|$3||")) // 1
  .withColumn("value", explode(split('value, "\\|\\|"))) // 2
  .where(length('value) > 0) // 3
Explanation:
1. Replace every group of 3 |'s with its components, then terminate with ||
2. Split on each || and use explode to move each piece to a separate row
3. Unfortunately, the split picks up the empty group at the end, so we filter it out
Output for your given input:
+------------+
|value |
+------------+
|1|Roy|NA |
|2|Marry|4.6 |
|3|Richard|NA|
|4|Joy|NA |
|5|Joe|NA |
|6|Jos|9 |
+------------+
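For readers working in PySpark rather than Scala, a hedged sketch of the same regex trick, assuming a single-column dataframe df with a string column named value:

from pyspark.sql import functions as F

result = (
    df
    # 1. rewrite every group of three fields "a|b|c|" as "a|b|c||"
    .withColumn("value", F.regexp_replace("value", r"([^|]*)\|([^|]*)\|([^|]*)\|", "$1|$2|$3||"))
    # 2. split on the inserted "||" marker and explode each piece into its own row
    .withColumn("value", F.explode(F.split("value", r"\|\|")))
    # 3. drop the trailing empty group produced by the split
    .where(F.length("value") > 0)
)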

assign patient to nearest facility in spark,scala

I have a data frame as below
+-----------+----------+-----------+--------+-------------+
|LOCATION_ID|PATIENT_ID|FACILITY_ID|DISTANCE|rank_distance|
+-----------+----------+-----------+--------+-------------+
|LOC0001 |P1 |FAC003 |54 |2 |
|LOC0001 |P1 |FAC002 |45 |1 |
|LOC0001 |P2 |FAC003 |54 |2 |
|LOC0001 |P2 |FAC002 |45 |1 |
|LOC0010 |P3 |FAC006 |12 |1 |
|LOC0010 |P3 |FAC003 |54 |4 |
|LOC0010 |P3 |FAC005 |23 |2 |
|LOC0010 |P3 |FAC002 |45 |3 |
|LOC0010 |P4 |FAC002 |45 |3 |
|LOC0010 |P4 |FAC005 |23 |2 |
|LOC0010 |P4 |FAC003 |54 |4 |
|LOC0010 |P4 |FAC006 |12 |1 |
|LOC0010 |P5 |FAC006 |12 |1 |
|LOC0010 |P5 |FAC002 |45 |3 |
|LOC0010 |P5 |FAC005 |23 |2 |
|LOC0010 |P5 |FAC003 |54 |4 |
|LOC0010 |P6 |FAC006 |12 |1 |
|LOC0010 |P6 |FAC005 |23 |2 |
|LOC0010 |P6 |FAC002 |45 |3 |
|LOC0010 |P6 |FAC003 |54 |4 |
|LOC0043 |P7 |FAC004 |42 |1 |
|LOC0054 |P8 |FAC002 |24 |2 |
|LOC0054 |P8 |FAC006 |12 |1 |
|LOC0054 |P8 |FAC005 |76 |3 |
|LOC0054 |P8 |FAC007 |100 |4 |
|LOC0065 |P9 |FAC006 |32 |1 |
|LOC0065 |P9 |FAC005 |54 |2 |
|LOC0065 |P10 |FAC006 |32 |1 |
|LOC0065 |P10 |FAC005 |54 |2 |
+-----------+----------+-----------+--------+-------------+
For each patient I have to assign the facility with the least rank. My output map should be as below:
P1 ---> FAC002 (because its rank is least)
P2 ---> FAC002 (because its rank is least)
Note: each facility has a capacity of just 2, except FAC003 which has a capacity of 3.
So for P3, P4, P5 and P6 the output should be:
P3 ----> FAC006 (because its rank is 1)
P4 ----> FAC006 (because its rank is 1)
P5 ----> FAC005 (because FAC006 has filled its capacity of 2, and now the least rank is of FAC005)
P6 ----> FAC005 (because FAC005 has one capacity slot left)
P7 ----> FAC004
Please check the code below.
It extracts the data for each patient from the DataFrame and checks the capacity condition against the previously extracted data; if the condition is false it skips that record and tries the next rank, otherwise it unions those records with the previously extracted records.
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
  import spark.implicits._
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.expressions._

  val df = Seq(("LOC0001","P1","FAC003","54","2"),("LOC0001","P1","FAC002","45","1"),("LOC0001","P2","FAC003","54","2"),("LOC0001","P2","FAC002","45","1"),("LOC0010","P3","FAC006","12","1"),("LOC0010","P3","FAC003","54","4"),("LOC0010","P3","FAC005","23","2"),("LOC0010","P3","FAC002","45","3"),("LOC0010","P4","FAC002","45","3"),("LOC0010","P4","FAC005","23","2"),("LOC0010","P4","FAC003","54","4"),("LOC0010","P4","FAC006","12","1"),("LOC0010","P5","FAC006","12","1"),("LOC0010","P5","FAC002","45","3"),("LOC0010","P5","FAC005","23","2"),("LOC0010","P5","FAC003","54","4"),("LOC0010","P6","FAC006","12","1"),("LOC0010","P6","FAC005","23","2"),("LOC0010","P6","FAC002","45","3"),("LOC0010","P6","FAC003","54","4"),("LOC0043","P7","FAC004","42","1"),("LOC0054","P8","FAC002","24","2"),("LOC0054","P8","FAC006","12","1"),("LOC0054","P8","FAC005","76","3"),("LOC0054","P8","FAC007","100","4"),("LOC0065","P9","FAC006","32","1"),("LOC0065","P9","FAC005","54","2"),("LOC0065","P10","FAC006","32","1"),("LOC0065","P10","FAC005","54","2")).toDF("location_id","patient_id","facility_id","distance","rank_distance").withColumn("facility_new",first($"facility_id").over(Window.partitionBy($"patient_id").orderBy($"rank_distance".asc))).orderBy(substring($"patient_id",2,3).cast("int").asc)

  val patients = df.select("patient_id").distinct.orderBy(substring($"patient_id",2,3).cast("int").asc).collect.map(_.getAs[String](0)) // Taking all patients into a collection.
  val facilities = Map("FAC004" -> 2, "FAC003" -> 3, "FAC007" -> 2, "FAC002" -> 2, "FAC006" -> 2, "FAC005" -> 2) // Facility ids & max allowed capacity, for checking the condition.

  case class Config(newDF: DataFrame, oldDF: DataFrame, facilities: Map[String, Int]) // Holds the accumulated assignments, the original data and the capacities.

  def fetchFacilityId(config: Config) = {
    config.newDF.select("facility_id").distinct.orderBy(substring($"facility_id",5,6).cast("int").asc)
      .except(config.oldDF.select("facility_new").distinct.orderBy(substring($"facility_new",5,6).cast("int").asc)).head.getAs[String](0)
  } // Getting the required facility id.

  def findFacilityId(config: Config, patientId: String, index: Int): (Boolean, String, Int) = {
    val max_distance = config.oldDF.filter($"patient_id" === patientId).select("rank_distance").orderBy($"rank_distance".desc).head.getAs[String](0).toInt
    (index < max_distance) match {
      case true =>
        val fac = config.oldDF.filter($"patient_id" === patientId && $"rank_distance" === index).select("facility_id").distinct.map(_.getAs[String](0)).collect.head
        val bool = config.newDF.filter($"facility_new" === lit(fac) && $"rank_distance" === index).select("patient_id").distinct.count < config.facilities(fac)
        (bool, fac, index)
      case false => (true, fetchFacilityId(config), 0)
    }
  } // Finding the required facility id.

  def process(config: Config, patientId: String, index: Int): DataFrame = findFacilityId(config, patientId, index) match {
    case (true, fac, _) =>
      config.newDF.union(config.oldDF.filter($"patient_id" === patientId).withColumn("facility_new", lit(fac)))
    case (false, _, rank) =>
      process(config, patientId, index + 1)
  } // Retrying with the next rank when a facility is already full.

  val config = Config(newDF = df.limit(0), oldDF = df, facilities = facilities)
  val updatedDF = patients.foldLeft(config) { case (cfg, patientId) =>
    cfg.newDF.count match {
      case 0L => cfg.copy(newDF = cfg.newDF.union(cfg.oldDF.filter($"patient_id" === patientId)))
      case _  => cfg.copy(newDF = process(cfg, patientId, 1))
    }
  }.newDF.drop("facility_id").withColumnRenamed("facility_new", "facility_id")

  updatedDF.show(31, false)
}
// Exiting paste mode, now interpreting.
+-----------+----------+--------+-------------+-----------+
|location_id|patient_id|distance|rank_distance|facility_id|
+-----------+----------+--------+-------------+-----------+
|LOC0001 |P1 |45 |1 |FAC002 |
|LOC0001 |P1 |54 |2 |FAC002 |
|LOC0001 |P2 |54 |2 |FAC002 |
|LOC0001 |P2 |45 |1 |FAC002 |
|LOC0010 |P3 |12 |1 |FAC006 |
|LOC0010 |P3 |54 |4 |FAC006 |
|LOC0010 |P3 |23 |2 |FAC006 |
|LOC0010 |P3 |45 |3 |FAC006 |
|LOC0010 |P4 |45 |3 |FAC006 |
|LOC0010 |P4 |23 |2 |FAC006 |
|LOC0010 |P4 |54 |4 |FAC006 |
|LOC0010 |P4 |12 |1 |FAC006 |
|LOC0010 |P5 |12 |1 |FAC005 |
|LOC0010 |P5 |45 |3 |FAC005 |
|LOC0010 |P5 |23 |2 |FAC005 |
|LOC0010 |P5 |54 |4 |FAC005 |
|LOC0010 |P6 |12 |1 |FAC005 |
|LOC0010 |P6 |23 |2 |FAC005 |
|LOC0010 |P6 |45 |3 |FAC005 |
|LOC0010 |P6 |54 |4 |FAC005 |
|LOC0043 |P7 |42 |1 |FAC003 |
|LOC0054 |P8 |24 |2 |FAC003 |
|LOC0054 |P8 |12 |1 |FAC003 |
|LOC0054 |P8 |76 |3 |FAC003 |
|LOC0054 |P8 |100 |4 |FAC003 |
|LOC0065 |P9 |32 |1 |FAC003 |
|LOC0065 |P9 |54 |2 |FAC003 |
|LOC0065 |P10 |32 |1 |FAC003 |
|LOC0065 |P10 |54 |2 |FAC003 |
+-----------+----------+--------+-------------+-----------+
Time taken: 31009 ms
scala>

How to replace null values with above/below not null value on same column in Data-frame using spark?

I'm trying to replace null or invalid values present in a column with the nearest non-null value above or below in the same column. For example:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
In this case, I try to replace all the NULL values in the column "Name": the 1st NULL will be replaced with 'a' and the 2nd NULL with 'c'; in the column "Place" the NULL is replaced with 'a2'.
When we try to replace the 8th-row NULL of the 'Place' column, it should also be replaced with its nearest non-null value, 'a2'.
Required Result:
If we select the 8th-row NULL of the 'Place' column for replacement, then the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
|a2 |8
d |c1 |9
If we select the 4th-row NULL of the 'Name' column for replacement, then the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
a |d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
Window functions come in handy to solve this. For the sake of simplicity, I'm focusing on just the name column. If the previous row's value is null, I use the next row's value; you can change this order according to your needs. The same approach needs to be applied to the other columns as well (a PySpark sketch for the place column follows the output below).
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  ("a", "a1", "1"),
  ("a", "a2", "2"),
  ("a", "a2", "3"),
  ("d1", null, "4"),
  ("b", "a2", "5"),
  ("c", "a2", "6"),
  (null, null, "7"),
  (null, null, "8"),
  ("d", "c1", "9")).toDF("name", "place", "row_count")

val window = Window.orderBy("row_count")
val lagNameWindowExpression = lag('name, 1).over(window)
val leadNameWindowExpression = lead('name, 1).over(window)

val nameConditionExpression = when($"name".isNull.and('previous_name_col.isNull), 'next_name_col)
  .when($"name".isNull.and('previous_name_col.isNotNull), 'previous_name_col).otherwise($"name")

df.select($"*", lagNameWindowExpression as 'previous_name_col, leadNameWindowExpression as 'next_name_col)
  .withColumn("name", nameConditionExpression).drop("previous_name_col", "next_name_col")
  .show(false)
Output
+----+-----+---------+
|name|place|row_count|
+----+-----+---------+
|a |a1 |1 |
|a |a2 |2 |
|a |a2 |3 |
|d1 |null |4 |
|b |a2 |5 |
|c |a2 |6 |
|c |null |7 |
|d |null |8 |
|d |c1 |9 |
+----+-----+---------+
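The same pattern applies to the place column. For reference, a hedged PySpark sketch of the same lag/lead/when idea (the original answer is in Scala; dataframe and column names are assumed from the example above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("row_count")
prev_place = F.lag("place", 1).over(w)
next_place = F.lead("place", 1).over(w)

# If place is null, prefer the previous non-null value, otherwise fall back to the next one.
df = df.withColumn(
    "place",
    F.when(F.col("place").isNull() & prev_place.isNull(), next_place)
     .when(F.col("place").isNull() & prev_place.isNotNull(), prev_place)
     .otherwise(F.col("place")),
)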