How would you generate a new array column over a window? - pyspark

I'm trying to generate a new column that is an array over a window. However, it appears that the array function does not work over a window, and I'm struggling to find an alternative method.
Code snippet:
df = df.withColumn('array_output', F.array(df.things_to_agg_in_array).over(Window.partitionBy("aggregate_over_this")))
Ideally what I'd like is an output that looks like the following table:
+---------------------+------------------------+--------------+
| Aggregate Over This | Things to Agg in Array | Array Output |
+---------------------+------------------------+--------------+
| 1                   | C                      | [C,F,K,L]    |
| 1                   | F                      | [C,F,K,L]    |
| 1                   | K                      | [C,F,K,L]    |
| 1                   | L                      | [C,F,K,L]    |
| 2                   | A                      | [A,B,C]      |
| 2                   | B                      | [A,B,C]      |
| 2                   | C                      | [A,B,C]      |
+---------------------+------------------------+--------------+
For further context, this is part of an explode which will then be rejoined onto another table based on 'aggregate over this', and as a result only one instance of array_output will be returned.
Thanks

This solution uses collect_list(); I'm not sure if it fulfills your requirement.
myValues = [(1,'C'),(1,'F'),(1,'K'),(1,'L'),(2,'A'),(2,'B'),(2,'C')]
df = sqlContext.createDataFrame(myValues,['Aggregate_Over_This','Things_to_Agg_in_Array'])
df.show()
+-------------------+----------------------+
|Aggregate_Over_This|Things_to_Agg_in_Array|
+-------------------+----------------------+
|                  1|                     C|
|                  1|                     F|
|                  1|                     K|
|                  1|                     L|
|                  2|                     A|
|                  2|                     B|
|                  2|                     C|
+-------------------+----------------------+
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select Aggregate_Over_This, Things_to_Agg_in_Array, '
    'collect_list(Things_to_Agg_in_Array) over (partition by Aggregate_Over_This) as array_output '
    'from table_view'
)
df1.show()
+-------------------+----------------------+------------+
|Aggregate_Over_This|Things_to_Agg_in_Array|array_output|
+-------------------+----------------------+------------+
|                  1|                     C|[C, F, K, L]|
|                  1|                     F|[C, F, K, L]|
|                  1|                     K|[C, F, K, L]|
|                  1|                     L|[C, F, K, L]|
|                  2|                     A|   [A, B, C]|
|                  2|                     B|   [A, B, C]|
|                  2|                     C|   [A, B, C]|
+-------------------+----------------------+------------+
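If you prefer the DataFrame API to the SQL string, roughly the same thing can be written with collect_list over a window (a sketch, assuming the same df as above; with no orderBy on the window, every row sees the whole partition):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# collect the whole partition into an array for every row
w = Window.partitionBy('Aggregate_Over_This')
df2 = df.withColumn('array_output', F.collect_list('Things_to_Agg_in_Array').over(w))
df2.show()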

Related

PySpark Column Creation by queuing filtered past rows

In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
Index | user_name | text | label |
0 | u1 | t0 | 0 |
1 | u1 | t1 | 1 |
2 | u2 | t2 | 0 |
3 | u1 | t3 | 1 |
4 | u2 | t4 | 0 |
5 | u2 | t5 | 1 |
6 | u2 | t6 | 1 |
7 | u1 | t7 | 0 |
8 | u1 | t8 | 1 |
9 | u1 | t9 | 0 |
The table after adding the new column (text_list) should be as follows, storing the last K = 2 messages for each user.
Index | user_name | text | label | text_list |
0 | u1 | t0 | 0 | [] |
1 | u1 | t1 | 1 | [] |
2 | u2 | t2 | 0 | [] |
3 | u1 | t3 | 1 | [t1] |
4 | u2 | t4 | 0 | [] |
5 | u2 | t5 | 1 | [] |
6 | u2 | t6 | 1 | [t5] |
7 | u1 | t7 | 0 | [t3, t1] |
8 | u1 | t8 | 1 | [t3, t1] |
9 | u1 | t9 | 0 | [t8, t3] |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
If you are using spark version >= 2.4, there is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
1. Get a list of structs of the text and label columns over a window using collect_list.
2. Filter the array where label = 1 and keep the text values, sort the array in descending order using sort_array, and take the first two elements using slice.
It would look something like this:
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window : first row to row before current row
w = Window.partitionBy('user_name').orderBy('index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
      # step 1: collect (text, label) structs from the preceding rows of the same user
      .withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
      # step 2: keep entries with label = 1, take their text, sort descending, keep the first two
      .withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
)
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks to the solution posted here. I modified it slightly (since it assumed the text field can be sorted) and was finally able to arrive at a working solution. Here it is:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, lit, collect_list, slice, reverse

K = 2

# window over the user's past rows only (everything strictly before the current row)
windowPast = Window.partitionBy("user_name").orderBy("Index") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

(df
 # collect_list drops nulls, so only texts with label == 1 are gathered
 .withColumn("text_list",
             collect_list(when(col("label") == 1, col("text")).otherwise(lit(None)))
             .over(windowPast))
 # most recent first, then keep the last K
 .withColumn("text_list", slice(reverse(col("text_list")), 1, K))
 .sort(col("Index"))
 .show())

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want to add one to all rows of the other one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
// every distinct id paired with every name, then the known rates left-joined back in
val df = df1.select("id").distinct()
  .crossJoin(df2)
  .join(df1, Seq("name", "id"), "left")
  .orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
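Since the parent question on this page is tagged pyspark, here is a rough PySpark equivalent of the same cross-join-then-left-join idea (a sketch; df1 is assumed to be the id/name/rate frame and df2 the single-column name frame, as in the Scala snippet):
# every distinct id paired with every name, then the known rates joined back in;
# combinations with no rating come out as null
result = (
    df1.select("id").distinct()
       .crossJoin(df2)
       .join(df1, ["name", "id"], "left")
       .orderBy("id", "name")
)
result.show()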

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length value in a Spark dataframe. Each column is of string type.
The dataframe is like:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|    |
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|    |   d|
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
The expectation is:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
I can't figure out how to do this easily using Spark...
Thanks in advance
Note: This approach handles any addition or deletion of columns in the DataFrame without needing code changes.
It can be done by first finding the length of all columns (except the first) concatenated together, then filtering out every row except the one with the maximum length per id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input
  .withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
scala> df.show
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|    |
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|    |   d|
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(concat_ws("",collect_set(col("A"))).alias("A"),concat_ws("",collect_set(col("B"))).alias("B"),concat_ws("",collect_set(col("C"))).alias("C"),concat_ws("",collect_set(col("D"))).alias("D")).show
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
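For readers working from the parent pyspark question, the window-based approach above translates roughly as follows (a sketch; it assumes the same input frame df with an id column, string-typed value columns, and empty strings rather than nulls in the blank cells, since concat returns null if any input is null):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# concatenate every non-id column, measure the total length,
# and keep only the longest row within each id
value_cols = [c for c in df.columns if c != "id"]
result = (
    df.withColumn("rowLength", F.length(F.concat(*[F.col(c) for c in value_cols])))
      .withColumn("maxLength", F.max("rowLength").over(Window.partitionBy("id")))
      .filter(F.col("rowLength") == F.col("maxLength"))
      .drop("rowLength", "maxLength")
)
result.show()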

Create a Vertical Table in Spark 2 [duplicate]

This question already has answers here: How to melt Spark DataFrame? (6 answers). Closed 4 years ago.
How do I create a vertical table in Spark 2 SQL?
I am building an ETL using Spark 2 / SQL / Scala. I have data in a normal table structure like:
Input Table:
| ID | A | B | C | D |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
Output Table:
| ID | Key | Val |
| 1 | A | A1 |
| 1 | B | B1 |
| 1 | C | C1 |
| 1 | D | D1 |
| 2 | A | A2 |
| 2 | B | B2 |
| 2 | C | C2 |
| 2 | D | D2 |
This could do the trick as well:
Input Data:
+---+---+---+---+---+
|ID |A |B |C |D |
+---+---+---+---+---+
|1 |A1 |B1 |C1 |D1 |
|2 |A2 |B2 |C2 |D2 |
|3 |A3 |B3 |C3 |D3 |
+---+---+---+---+---+
Zip the column headers with their indices (Range over the number of columns to include):
val cols = Seq("A", "B", "C", "D") zip Range(0, 4, 1)

df.flatMap(r => cols.map(i => (r.getString(0), i._1, r.getString(i._2 + 1))))
  .toDF("ID", "KEY", "VALUE")
  .show()
Result should look like this:
+---+---+-----+
| ID|KEY|VALUE|
+---+---+-----+
| 1| A| A1|
| 1| B| B1|
| 1| C| C1|
| 1| D| D1|
| 2| A| A2|
| 2| B| B2|
| 2| C| C2|
| 2| D| D2|
| 3| A| A3|
| 3| B| B3|
| 3| C| C3|
| 3| D| D3|
+---+---+-----+
Good Luck!!
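For readers following the parent pyspark question, the same melt can be written in PySpark with the SQL stack generator (a sketch, assuming the same ID/A/B/C/D columns):
# stack(4, ...) emits four (KEY, VALUE) rows per input row,
# pairing each column-name literal with that column's value
vertical = df.selectExpr(
    "ID",
    "stack(4, 'A', A, 'B', B, 'C', C, 'D', D) as (KEY, VALUE)"
)
vertical.show()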

Joining data in Scala using array_contains() method

I have below data in Scala in Spark environment -
val abc = Seq(
  (Array("A"), 0.1),
  (Array("B"), 0.11),
  (Array("C"), 0.12),
  (Array("A", "B"), 0.24),
  (Array("A", "C"), 0.27),
  (Array("B", "C"), 0.30),
  (Array("A", "B", "C"), 0.4)
).toDF("channel_set", "rate")
abc.show(false)
abc.createOrReplaceTempView("abc")
val df = abc.withColumn("totalChannels",size(col("channel_set"))).toDF()
df.show()
scala> df.show
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A]| 0.1| 1|
| [B]|0.11| 1|
| [C]|0.12| 1|
| [A, B]|0.24| 2|
| [A, C]|0.27| 2|
| [B, C]| 0.3| 2|
| [A, B, C]| 0.4| 3|
+-----------+----+-------------+
val oneChannelDF = df.filter($"totalChannels" === 1)
oneChannelDF.show()
oneChannelDF.createOrReplaceTempView("oneChannelDF")
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A]| 0.1| 1|
| [B]|0.11| 1|
| [C]|0.12| 1|
+-----------+----+-------------+
val twoChannelDF = df.filter($"totalChannels" === 2)
twoChannelDF.show()
twoChannelDF.createOrReplaceTempView("twoChannelDF")
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A, B]|0.24| 2|
| [A, C]|0.27| 2|
| [B, C]| 0.3| 2|
+-----------+----+-------------+
I want to join oneChannel and twoChannel dataframes so that I can see my resultant data as below -
+-----------+----+-------------+------------+-------+
|channel_set|rate|totalChannels|channel_set | rate |
+-----------+----+-------------+------------+-------+
| [A]| 0.1| 1| [A,B] | 0.24 |
| [A]| 0.1| 1| [A,C] | 0.27 |
| [B]|0.11| 1| [A,B] | 0.24 |
| [B]|0.11| 1| [B,C] | 0.30 |
| [C]|0.12| 1| [A,C] | 0.27 |
| [C]|0.12| 1| [B,C] | 0.30 |
+-----------+----+-------------+------------+-------+
Basically, I need all the rows where a record from the oneChannel dataframe is present in the twoChannel dataframe.
I have tried:
spark.sql("""select * from oneChannelDF one inner join twoChannelDF two on array_contains(one.channel_set,two.channel_set)""").show()
However, I am facing this error -
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(one.`channel_set`, two.`channel_set`)' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members; line 1 pos 62;
I think I figured out the error. I need to pass a single element, not an array, as the second argument to the array_contains() method. Since every channel_set array in oneChannelDF has size 1, the code below gets me the correct dataframe:
scala> spark.sql("""select * from oneChannelDF one inner join twoChannelDF two where array_contains(two.channel_set,one.channel_set[0])""").show()
+-----------+----+-------------+-----------+----+-------------+
|channel_set|rate|totalChannels|channel_set|rate|totalChannels|
+-----------+----+-------------+-----------+----+-------------+
| [A]| 0.1| 1| [A, B]|0.24| 2|
| [A]| 0.1| 1| [A, C]|0.27| 2|
| [B]|0.11| 1| [A, B]|0.24| 2|
| [B]|0.11| 1| [B, C]| 0.3| 2|
| [C]|0.12| 1| [A, C]|0.27| 2|
| [C]|0.12| 1| [B, C]| 0.3| 2|
+-----------+----+-------------+-----------+----+-------------+
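For readers following the parent pyspark question, the same join can be sketched through the DataFrame API (a sketch, assuming Python dataframes named like the Scala ones above; expr is used for the join condition, which sidesteps any version differences in the Python array_contains helper):
from pyspark.sql import functions as F

# inner join: keep pairs where the single channel on the left
# appears inside the channel_set array on the right
result = oneChannelDF.alias("one").join(
    twoChannelDF.alias("two"),
    F.expr("array_contains(two.channel_set, one.channel_set[0])"),
    "inner",
)
result.show()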