Set union over window in PySpark

I am trying to get the list of all unique values that appear for a certain group of IDs, all over a window. In the data below, for each address I want to find all ids that share a common value, and gather the set of all values that appear for that group of ids.
For example, with window.partitionBy('address'): for address 1, ids A, B and C share the common value x. I treat them as connected and want to create value_set with all values that correspond to ids A, B, C, which is x, y, z.
id D does not share a value with any other id, so its value_set includes only the values of id D.
My data
+-------+---+-----+
|address| id|value|
+-------+---+-----+
| 1| A| x|
| 1| A| y|
| 1| B| x|
| 1| C| x|
| 1| C| z|
| 1| D| v|
| 2| E| m|
| 2| E| n|
| 2| F| m|
| 2| F| p|
+-------+---+-----+
What I want
+-------+---+-----+---------+
|address| id|value|value_set|
+-------+---+-----+---------+
| 1| A| x| x,y,z|
| 1| A| y| x,y,z|
| 1| B| x| x,y,z|
| 1| C| x| x,y,z|
| 1| C| z| x,y,z|
| 1| D| v| v|
| 2| E| m| m,n,p|
| 2| E| n| m,n,p|
| 2| F| m| m,n,p|
| 2| F| p| m,n,p|
+-------+---+-----+---------+

Something like this?

from pyspark.sql import Window
from pyspark.sql.functions import collect_set

dfn = df.withColumn('set_collect', collect_set(df.value).over(Window.partitionBy('address')))
Output is:
+-------+---+-----+------------+
|address| id|value| set_collect|
+-------+---+-----+------------+
| 1| A| x|[y, v, z, x]|
| 1| A| y|[y, v, z, x]|
| 1| B| x|[y, v, z, x]|
| 1| C| x|[y, v, z, x]|
| 1| C| z|[y, v, z, x]|
| 1| D| v|[y, v, z, x]|
| 2| E| m| [n, m, p]|
| 2| E| n| [n, m, p]|
| 2| F| m| [n, m, p]|
| 2| F| p| [n, m, p]|
+-------+---+-----+------------+
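Note that this windowed collect_set pools every value in the address, so id D also gets [y, v, z, x], which differs from the desired output above. Producing the desired value_set means first grouping ids that share at least one value, which is a connected-components problem. As a sketch of that grouping logic only, here is a plain-Python union-find version (illustrative, not Spark code; in Spark this would typically need something like GraphFrames' connectedComponents or iterative self-joins):

```python
# Pure-Python sketch: ids within an address are "connected" when they share
# at least one value; each id's value_set is the union of values over its
# connected component. This is an assumed interpretation of the question.

def find(parent, x):
    # Path-compressing find for a simple union-find.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def value_sets(rows):
    # rows: list of (address, id, value) tuples.
    # Returns {(address, id): sorted list of values in that id's component}.
    parent = {}
    by_key = {}  # (address, value) -> first id seen with that value
    values = {}  # (address, id) -> set of the id's own values
    for addr, i, v in rows:
        node = (addr, i)
        parent.setdefault(node, node)
        values.setdefault(node, set()).add(v)
        key = (addr, v)
        if key in by_key:
            # Union this id with the first id seen for (address, value).
            ra, rb = find(parent, by_key[key]), find(parent, node)
            if ra != rb:
                parent[rb] = ra
        else:
            by_key[key] = node
    # Collect the union of values per connected component.
    comp = {}
    for node, vs in values.items():
        comp.setdefault(find(parent, node), set()).update(vs)
    return {node: sorted(comp[find(parent, node)]) for node in values}

rows = [(1, 'A', 'x'), (1, 'A', 'y'), (1, 'B', 'x'), (1, 'C', 'x'), (1, 'C', 'z'),
        (1, 'D', 'v'), (2, 'E', 'm'), (2, 'E', 'n'), (2, 'F', 'm'), (2, 'F', 'p')]
result = value_sets(rows)
# result[(1, 'A')] == ['x', 'y', 'z'], result[(1, 'D')] == ['v']
```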

Related

create a new column to increment value when value resets to 1 in another column in pyspark

In a PySpark DataFrame, consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2]. I want to create a new column that increments every time the value resets to 1.
Expected output is [1,1,1,1,2,2,3,4,4,4,5,5,6,7,7].
I am a bit new to PySpark; any help would be great.
I have written the logic like below:

def sequence(row_num):
    results = [1]
    flag = 1
    for col in range(0, len(row_num) - 1):
        if row_num[col][0] >= row_num[col + 1][0]:
            flag += 1
        results.append(flag)
    return results

but I am not able to pass a column through a UDF. Please help me with this.
Your DataFrame:

df = spark.createDataFrame(
    [
        ('1', 'a'),
        ('2', 'b'),
        ('3', 'c'),
        ('4', 'd'),
        ('1', 'e'),
        ('2', 'f'),
        ('1', 'g'),
        ('1', 'h'),
        ('2', 'i'),
        ('3', 'j'),
        ('1', 'k'),
        ('2', 'l'),
        ('1', 'm'),
        ('1', 'n'),
        ('2', 'o')
    ],
    ['group', 'label']
)
+-----+-----+
|group|label|
+-----+-----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 1| e|
| 2| f|
| 1| g|
| 1| h|
| 2| i|
| 3| j|
| 1| k|
| 2| l|
| 1| m|
| 1| n|
| 2| o|
+-----+-----+
You can create a flag and use a window function to calculate the cumulative sum. No need for a UDF:

from pyspark.sql import Window as W
from pyspark.sql import functions as F

w = W.partitionBy().orderBy('label').rowsBetween(W.unboundedPreceding, 0)
df\
    .withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
    .withColumn('Output', F.sum('Flag').over(w))\
    .show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
| 1| a| 1| 1|
| 2| b| 0| 1|
| 3| c| 0| 1|
| 4| d| 0| 1|
| 1| e| 1| 2|
| 2| f| 0| 2|
| 1| g| 1| 3|
| 1| h| 1| 4|
| 2| i| 0| 4|
| 3| j| 0| 4|
| 1| k| 1| 5|
| 2| l| 0| 5|
| 1| m| 1| 6|
| 1| n| 1| 7|
| 2| o| 0| 7|
+-----+-----+----+------+
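For intuition, the flag-and-cumulative-sum trick above can be emulated in plain Python (illustrative only; the window version is what runs at scale):

```python
from itertools import accumulate

groups = [1, 2, 3, 4, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2]

# Flag each reset (value == 1), then take a running sum of the flags.
flags = [1 if g == 1 else 0 for g in groups]
output = list(accumulate(flags))
# output == [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7, 7]
```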

How to take row_number() based on a condition in spark with scala

I have the below data frame -
+----+-----+---+
| val|count| id|
+----+-----+---+
| a| 10| m1|
| b| 20| m1|
|null| 30| m1|
| b| 30| m2|
| c| 40| m2|
|null| 50| m2|
+----+-----+---+
created by -
val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")
I am trying to compute a rank with the help of row_number() and a window function as below.
df1.withColumn("rannk_num", row_number() over Window.partitionBy("id").orderBy("count")).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| 3|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| 3|
+----+-----+---+---------+
But I need the records with null values in column val to be left out of the ranking (their rank should be null).
Expected output --
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| NULL|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| NULL|
+----+-----+---+---------+
wondering if this is possible with minimal change. Also there can be 'n' number of values for the columns val and count.
Filter those rows with null val, assign them a null row number, and union back to the original dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")

df1.filter("val is not null").withColumn(
  "rannk_num", row_number() over Window.partitionBy("id").orderBy("count")
).union(
  df1.filter("val is null").withColumn("rannk_num", lit(null))
).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 30| m1| null|
|null| 50| m2| null|
+----+-----+---+---------+
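The rank-then-union logic can be sketched in plain Python (illustrative only, not Spark code): rank only the non-null rows per id, ordered by count, and give null rows a null rank.

```python
from collections import defaultdict

rows = [('a', 10, 'm1'), ('b', 20, 'm1'), (None, 30, 'm1'),
        ('b', 30, 'm2'), ('c', 40, 'm2'), (None, 50, 'm2')]

ranked = []
counters = defaultdict(int)  # per-id row_number counter
for val, count, id_ in sorted(rows, key=lambda r: (r[2], r[1])):
    if val is None:
        # Null val: keep the row but assign no rank.
        ranked.append((val, count, id_, None))
    else:
        counters[id_] += 1
        ranked.append((val, count, id_, counters[id_]))
# ('a', 10, 'm1') -> rank 1; (None, 30, 'm1') -> rank None
```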

Pyspark Joining two dataframes with a collect_list

suppose I have the following DataFrames.
How can I perform a join between the two of them so that the resulting column (value_2) contains, for each row, the values accumulated up to that row's ranking?
import pyspark.sql.functions as f
from pyspark.sql.window import Window

l = [(9, 1, 'A'),
     (9, 2, 'B'),
     (9, 3, 'C'),
     (9, 4, 'D'),
     (10, 1, 'A'),
     (10, 2, 'B')]
df = spark.createDataFrame(l, ['prod', 'rank', 'value'])
+----+----+-----+
|prod|rank|value|
+----+----+-----+
| 9| 1| A|
| 9| 2| B|
| 9| 3| C|
| 9| 4| D|
| 10| 1| A|
| 10| 2| B|
+----+----+-----+
sh = [(9, ['A', 'B', 'C', 'D']),
      (10, ['A', 'B'])]
sh = spark.createDataFrame(sh, ['prod', 'conc'])
+----+------------+
|prod| conc|
+----+------------+
| 9|[A, B, C, D]|
| 10| [A, B]|
+----+------------+
Final desidered output:
+----+----+-----+---------+
|prod|rank|value| value_2 |
+----+----+-----+---------+
| 9| 1| A| A |
| 9| 2| B| A,B |
| 9| 3| C| A,B,C |
| 9| 4| D| A,B,C,D|
| 10| 1| A| A |
| 10| 2| B| A,B |
+----+----+-----+---------+
You can use a window function and do this without any join. In Spark 2.4+:
df.select('*',
    f.array_join(
        f.collect_list(df.value).over(Window.partitionBy('prod').orderBy('rank')),
        ','
    ).alias('value_2')
).show()
+----+----+-----+-------+
|prod|rank|value|value_2|
+----+----+-----+-------+
| 9| 1| A| A|
| 9| 2| B| A,B|
| 9| 3| C| A,B,C|
| 9| 4| D|A,B,C,D|
| 10| 1| A| A|
| 10| 2| B| A,B|
+----+----+-----+-------+
Or, if you don't need to join the array into a string:

df.select('*',
    f.collect_list(df.value).over(Window.partitionBy('prod').orderBy('rank')).alias('value_2')
).show()
+----+----+-----+------------+
|prod|rank|value| value_2|
+----+----+-----+------------+
| 9| 1| A| [A]|
| 9| 2| B| [A, B]|
| 9| 3| C| [A, B, C]|
| 9| 4| D|[A, B, C, D]|
| 10| 1| A| [A]|
| 10| 2| B| [A, B]|
+----+----+-----+------------+
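The running collect_list over the rank ordering is effectively a per-prod prefix list; a plain-Python sketch of the same idea (illustrative, not Spark code):

```python
from collections import defaultdict

rows = [(9, 1, 'A'), (9, 2, 'B'), (9, 3, 'C'), (9, 4, 'D'),
        (10, 1, 'A'), (10, 2, 'B')]

seen = defaultdict(list)  # per-prod running list of values
out = []
for prod, rank, value in sorted(rows, key=lambda r: (r[0], r[1])):
    seen[prod].append(value)
    out.append((prod, rank, value, ','.join(seen[prod])))
# out[3] == (9, 4, 'D', 'A,B,C,D')
```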

Dynamic column selection in Spark (based on another column's value)

With the given Spark DataFrame:
> df.show()
+---+-----+---+---+---+---+
| id|delay| p1| p2| p3| p4|
+---+-----+---+---+---+---+
| 1| 3| a| b| c| d|
| 2| 1| m| n| o| p|
| 3| 2| q| r| s| t|
+---+-----+---+---+---+---+
How can I select a column dynamically so that the new col column contains the value of the existing p{delay} column?
> df.withColumn("col", /* ??? */).show()
+---+-----+---+---+---+---+----+
| id|delay| p1| p2| p3| p4| col|
+---+-----+---+---+---+---+----+
| 1| 3| a| b| c| d| c| // col = p3
| 2| 1| m| n| o| p| m| // col = p1
| 3| 2| q| r| s| t| r| // col = p2
+---+-----+---+---+---+---+----+
The simplest solution I can think of is to use an array indexed by delay - 1:
import org.apache.spark.sql.functions.array
df.withColumn("col", array($"p1", $"p2", $"p3", $"p4")($"delay" - 1))
One option is to create a map from numbers to column names, and then use foldLeft to update the col column with the corresponding values:

import org.apache.spark.sql.functions.{lit, when}

val cols = (1 to 4).map(i => i -> s"p$i")
cols.foldLeft(df.withColumn("col", lit(null))) {
  case (df, (k, v)) => df.withColumn("col", when(df("delay") === k, df(v)).otherwise(df("col")))
}.show
+---+-----+---+---+---+---+---+
| id|delay| p1| p2| p3| p4|col|
+---+-----+---+---+---+---+---+
| 1| 3| a| b| c| d| c|
| 2| 1| m| n| o| p| m|
| 3| 2| q| r| s| t| r|
+---+-----+---+---+---+---+---+
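Both answers implement the same lookup; in plain Python, the array-indexing trick looks like this (illustrative only, not Spark code):

```python
# Rows modeled as dicts; 'delay' picks which of p1..p4 feeds the new column.
rows = [{'id': 1, 'delay': 3, 'p1': 'a', 'p2': 'b', 'p3': 'c', 'p4': 'd'},
        {'id': 2, 'delay': 1, 'p1': 'm', 'p2': 'n', 'p3': 'o', 'p4': 'p'},
        {'id': 3, 'delay': 2, 'p1': 'q', 'p2': 'r', 'p3': 's', 'p4': 't'}]

# Build the [p1..p4] list per row and index it with delay - 1.
for row in rows:
    row['col'] = [row['p1'], row['p2'], row['p3'], row['p4']][row['delay'] - 1]
# rows[0]['col'] == 'c'  (delay 3 selects p3)
```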

Scala Spark - Map function referencing another dataframe

I have two dataframes:
df1:
+---+------+----+
| id|weight|time|
+---+------+----+
| A| 0.1| 1|
| A| 0.2| 2|
| A| 0.3| 4|
| A| 0.4| 5|
| B| 0.5| 1|
| B| 0.7| 3|
| B| 0.8| 6|
| B| 0.9| 7|
| B| 1.0| 8|
+---+------+----+
df2:
+---+---+-------+-----+
| id| t|t_start|t_end|
+---+---+-------+-----+
| A| t1| 0| 3|
| A| t2| 4| 6|
| A| t3| 7| 9|
| B| t1| 0| 2|
| B| t2| 3| 6|
| B| t3| 7| 9|
+---+---+-------+-----+
My desired output identifies the t for each timestamp in df1, based on the [t_start, t_end] ranges in df2.
df_output:
+---+------+----+---+
| id|weight|time| t |
+---+------+----+---+
| A| 0.1| 1| t1|
| A| 0.2| 2| t1|
| A| 0.3| 4| t2|
| A| 0.4| 5| t2|
| B| 0.5| 1| t1|
| B| 0.7| 3| t2|
| B| 0.8| 6| t2|
| B| 0.9| 7| t3|
| B| 1.0| 8| t3|
+---+------+----+---+
My understanding so far is that I must create a UDF that takes the columns id and time as inputs, and maps each row by referring to df2.filter(df2.id == df1.id, df1.time >= df2.t_start, df1.time <= df2.t_end) to get the corresponding df2.t.
I'm very new to Scala and Spark, so I am wondering if this solution is even possible?
You cannot use a UDF for that, but you don't need one: just reuse the filter condition you already defined to join the two frames:
df1.join(
  df2,
  df2("id") === df1("id") && df1("time").between(df2("t_start"), df2("t_end"))
)
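Conceptually this join is a nested loop over the two tables with the range predicate; a plain-Python sketch (illustrative, not Spark code):

```python
# df1: (id, weight, time); df2: (id, t, t_start, t_end)
df1 = [('A', 0.1, 1), ('A', 0.2, 2), ('A', 0.3, 4), ('A', 0.4, 5),
       ('B', 0.5, 1), ('B', 0.7, 3), ('B', 0.8, 6), ('B', 0.9, 7), ('B', 1.0, 8)]
df2 = [('A', 't1', 0, 3), ('A', 't2', 4, 6), ('A', 't3', 7, 9),
       ('B', 't1', 0, 2), ('B', 't2', 3, 6), ('B', 't3', 7, 9)]

# Match on id and time within [t_start, t_end], keeping the label t.
out = [(id1, w, t, label)
       for id1, w, t in df1
       for id2, label, lo, hi in df2
       if id1 == id2 and lo <= t <= hi]
# out[0] == ('A', 0.1, 1, 't1')
```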