How to iterate over an array column in PySpark while joining - pyspark

In PySpark, I have dataframe_a with:
+--------+-----------------+
| str1   | array_of_str    |
+--------+-----------------+
| John   | [mango, apple]  |
| Tom    | [mango, orange] |
| Matteo | [apple, banana] |
+--------+-----------------+
and dataframe_b with
+--------+-------+
| key    | value |
+--------+-------+
| mango  | 1     |
| apple  | 2     |
| orange | 3     |
+--------+-------+
and I want to create a new array column, joined_result, that maps each element of array_of_str (in dataframe_a) to its value in dataframe_b, dropping elements with no match (such as banana):
+--------+-----------------+---------------+
| str1   | array_of_str    | joined_result |
+--------+-----------------+---------------+
| John   | [mango, apple]  | [1, 2]        |
| Tom    | [mango, orange] | [1, 3]        |
| Matteo | [apple, banana] | [2]           |
+--------+-----------------+---------------+
I'm not sure how to do it. I know I can use a UDF with a lambda function, but I can't manage to make it work :( Help!
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType

# START EXTRACT OF CODE
ret = (df
    .select(['str1', 'array_of_str'])
    .withColumn('joined_result', F.udf(
        map(lambda x: ??????, ArrayType(StringType))
    ))
)
return ret
# END EXTRACT OF CODE

My answer to your question:
from pyspark.sql.types import IntegerType
lookup_list = map(lambda row: row.asDict(), dataframe_b.collect())
lookup_dict = {lookup['key']: lookup['value'] for lookup in lookup_list}
def mapper(keys):
    # drop keys that have no match in dataframe_b (e.g. "banana")
    return [lookup_dict[key] for key in keys if key in lookup_dict]
dataframe_a = dataframe_a.withColumn(
    'joined_result', F.udf(mapper, ArrayType(IntegerType()))('array_of_str'))
It works as you want :-)
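If dataframe_b is large, collecting it to the driver may not be practical. Here is a join-based sketch of the same result (an alternative approach, assuming the column names above; note that collect_list does not guarantee the order of the collected values):
from pyspark.sql import functions as F

# explode the array so each (str1, key) pair becomes a row
exploded = dataframe_a.select('str1', F.explode('array_of_str').alias('key'))
# the inner join drops keys with no match (e.g. "banana"), then re-collect per str1
values = (exploded.join(dataframe_b, on='key', how='inner')
          .groupBy('str1')
          .agg(F.collect_list('value').alias('joined_result')))
result = dataframe_a.join(values, on='str1', how='left')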

Related

How to explode a string column based on specific delimiter in Spark Scala

I want to explode a string column based on a specific delimiter (| in my case).
I have a dataset like this:
+-----+--------------+
|Col_1|Col_2         |
+-----+--------------+
| 1   | aa|bb        |
| 2   | cc           |
| 3   | dd|ee        |
| 4   | ff           |
+-----+--------------+
I want an output like this:
+-----+--------------+
|Col_1|Col_2         |
+-----+--------------+
| 1   | aa           |
| 1   | bb           |
| 2   | cc           |
| 3   | dd           |
| 3   | ee           |
| 4   | ff           |
+-----+--------------+
Use the explode and split functions, and use \\ to escape the |, since split takes a regular expression.
val df1 = df.select(col("Col_1"), explode(split(col("Col_2"),"\\|")).as("Col_2"))
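Since this page is about PySpark, here is a PySpark equivalent as a sketch (assuming the same column names); the pipe still needs to be escaped because split takes a regular expression:
from pyspark.sql import functions as F

df1 = df.select(
    F.col("Col_1"),
    F.explode(F.split(F.col("Col_2"), r"\|")).alias("Col_2"))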

Compare consecutive rows and extract words(excluding the subsets) in spark

I am working on a Spark dataframe. The input dataframe looks like Table 1 below. I need to write logic to get the keywords with the maximum length for each session id; there can be multiple keywords in the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column, and later tried to compare each row with its subsequent row to see if it has the maximum length. But this solution only works partially.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id", $"value", $"Timestamp").withColumn("col_length", length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// within each session, ordered by Timestamp, look at the next typed value
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  // keep a row if the next value is shorter (a new word started) or there is no next value
  .filter((length(col("lead")) < length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show
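For completeness, the same lead-based comparison as a PySpark sketch, assuming the column names from Table 1:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# within each session, compare every value with the next one in time order
w = Window.partitionBy("session_id").orderBy("Timestamp")
result = (df
    .withColumn("lead", F.lead("value", 1).over(w))
    .filter((F.length("lead") < F.length("value")) | F.col("lead").isNull())
    .drop("lead"))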

Mean with different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and what I am trying to do is take the mean of these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3,33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
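The same explode-and-aggregate idea as a PySpark sketch, assuming the column names above; mean ignores the nulls produced by missing values, so John's MEAN is (5 + 10) / 2 = 7.5:
from pyspark.sql import functions as F

result = df.join(
    df.select("Baller", F.explode(F.array("Power", "Vision", "KXD")))
      .groupBy("Baller")
      .agg(F.mean("col").alias("MEAN")),  # exploded values land in a column named "col"
    on="Baller")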

How to combine pyspark dataframes with different shapes and different columns

I have two dataframes in PySpark. One has more than 1000 rows and the other only 4 rows. The columns also do not match.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.
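A rough sketch of that approach, assuming df1 and df2 are named as above: pivot df2 into a single wide row first, then crossJoin it onto df1 so every row receives the same four new columns (a_col_c, a_col_d, b_col_c, b_col_d):
from pyspark.sql import functions as F

# one wide row: pivot on "key", keeping col_c and col_d for each key
df2_wide = (df2.groupBy()
    .pivot("key")
    .agg(F.first("col_c").alias("col_c"), F.first("col_d").alias("col_d")))
result = df1.crossJoin(df2_wide)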

Convert result of aggregation into 3 separate fields of col name, aggregate function and value

I have a dataframe in the form
+-----+--------+-------+
| id | label | count |
+-----+--------+-------+
| id1 | label1 | 5 |
| id1 | label1 | 2 |
| id2 | label2 | 3 |
+-----+--------+-------+
and I would like the resulting output to look like
+-----+--------+----------+----------+-------+
| id | label | col_name | agg_func | value |
+-----+--------+----------+----------+-------+
| id1 | label1 | count | avg | 3.5 |
| id1 | label1 | count | sum | 7 |
| id2 | label2 | count | avg | 3 |
| id2 | label2 | count | sum | 3 |
+-----+--------+----------+----------+-------+
First, I created a list of aggregate functions using the code below. I then apply these functions to the original dataframe to get the aggregation results in separate columns.
val f = org.apache.spark.sql.functions
val aggCols = Seq("col_name")
val aggFuncs = Seq("avg", "sum")
val aggOp = for (func <- aggFuncs) yield {
  aggCols.map(x => f.getClass.getMethod(func, x.getClass).invoke(f, x).asInstanceOf[Column])
}
val aggOpFlat = aggOp.flatten
df.groupBy("id", "label").agg(aggOpFlat.head, aggOpFlat.tail: _*).na.fill(0)
I get to the format
+-----+--------+---------------+----------------+
| id | label | avg(col_name) | sum(col_name) |
+-----+--------+---------------+----------------+
| id1 | label1 | 3.5 | 7 |
| id2 | label2 | 3 | 3 |
+-----+--------+---------------+----------------+
but cannot think of the logic to get to what I want.
A possible solution might be to wrap all the aggregate values inside a map and then use the explode function.
Something like this (it shouldn't be an issue to make it dynamic):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = List(("id1", "label1", 5), ("id1", "label1", 2), ("id2", "label2", 3)).toDF("id", "label", "count")
df
  .groupBy("id", "label")
  .agg(avg("count").as("avg"), sum("count").as("sum"))
  .withColumn("map", map(lit("avg"), col("avg"), lit("sum"), col("sum")))
  .select(col("id"), col("label"), explode(col("map")))
  .show
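A PySpark sketch of the same map + explode idea that also renames the exploded columns and adds the literal col_name, to match the requested output exactly (the column names id, label and count from the question are assumed):
from pyspark.sql import functions as F

result = (df.groupBy("id", "label")
    .agg(F.avg("count").alias("avg"), F.sum("count").alias("sum"))
    .withColumn("map", F.create_map(
        F.lit("avg"), F.col("avg"), F.lit("sum"), F.col("sum")))
    .select("id", "label",
            F.lit("count").alias("col_name"),
            F.explode("map").alias("agg_func", "value")))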