Remove null values and shift values from the next column in pyspark - pyspark

I need to port a Python script to PySpark, and I'm finding it difficult.
I'm trying to remove null values from a dataframe (without removing the entire column or row) and shift each following value into the preceding column. Example:
CLIENT| ANIMAL_1 | ANIMAL_2 | ANIMAL_3| ANIMAL_4
ROW_1 1 | cow | frog | null | dog
ROW_2 2 | pig | null | cat | null
My goal is to have:
CLIENT| ANIMAL_1 | ANIMAL_2 | ANIMAL_3| ANIMAL_4
ROW_1 1 | cow | frog | dog | null
ROW_2 2 | pig | cat | null | null
The code I'm using in Python (which I found here on Stack Overflow) is:
df_out = df.apply(lambda x: pd.Series(x.dropna().to_numpy()), axis=1)
Then I rename the columns. But I have no idea how to do this in PySpark.

Here's a way to do this for Spark version 2.4+:
Create an array of the columns you want and sort it by the following conditions:
Sort non-null values first
Sort values in the order they appear in the columns
We can do the sorting with array_sort. To sort on multiple conditions, use arrays_zip to combine them into an array of structs, which are compared field by field. To make it easy to extract the value you want (i.e. the animal in this example), zip the column values in as well.
from pyspark.sql.functions import array, array_sort, arrays_zip, col, lit
animal_cols = df.columns[1:]
N = len(animal_cols)
df_out = df.select(
    df.columns[0],
    array_sort(
        arrays_zip(
            array([col(c).isNull() for c in animal_cols]),
            array([lit(i) for i in range(N)]),
            array([col(c) for c in animal_cols])
        )
    ).alias('sorted')
)
df_out.show(truncate=False)
#+------+----------------------------------------------------------------+
#|CLIENT|sorted |
#+------+----------------------------------------------------------------+
#|1 |[[false, 0, cow], [false, 1, frog], [false, 3, dog], [true, 2,]]|
#|2 |[[false, 0, pig], [false, 2, cat], [true, 1,], [true, 3,]] |
#+------+----------------------------------------------------------------+
Now that things are in the right order, you just need to extract the value. In this case, that's the field named '2' of the struct at index i of the sorted column.
df_out = df_out.select(
    df.columns[0],
    *[col("sorted")[i]['2'].alias(animal_cols[i]) for i in range(N)]
)
df_out.show(truncate=False)
#+------+--------+--------+--------+--------+
#|CLIENT|ANIMAL_1|ANIMAL_2|ANIMAL_3|ANIMAL_4|
#+------+--------+--------+--------+--------+
#|1 |cow |frog |dog |null |
#|2 |pig |cat |null |null |
#+------+--------+--------+--------+--------+
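For intuition, the sort key used above can be modelled in plain Python without Spark. This is only a sketch of the ordering logic, not a replacement for the DataFrame code; the helper name shift_nulls_left and the rows dictionary are made up for illustration.

```python
def shift_nulls_left(values):
    """Return values with None entries moved to the end, order otherwise kept."""
    # Same three-part record as the arrays_zip struct: (is_null, position, value).
    keyed = [(value is None, position, value) for position, value in enumerate(values)]
    # Sort on (is_null, position) only, so the values themselves are never compared.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [value for _, _, value in keyed]


# Sample rows from the question, keyed by CLIENT (hypothetical in-memory stand-in).
rows = {1: ["cow", "frog", None, "dog"], 2: ["pig", None, "cat", None]}
shifted = {client: shift_nulls_left(animals) for client, animals in rows.items()}
print(shifted)
```

Because False sorts before True, every non-null entry comes first, and the position breaks ties so the original column order is preserved.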

Related

Spark/Scala:Finding count of delimited values in a column eliminating duplicates

I've a column like
+-----------------+----------------------------+
|Race_Track | EngineType |
+----------------------------------------------+
|800-RDUO | 881,652,EWQ,300x,652,PXZ |
+----------------------------------------------+
I need to remove one specific value, say EWQ, and all duplicates, like below:
+-----------------+----------------------------+
|Race_Track | EngineType |
+----------------------------------------------+
|800-RDUO | 881,300x,652,PXZ |
+----------------------------------------------+
How to achieve this in Scala?
You can achieve your desired output by combining split, filter, array_distinct and concat_ws as below (assuming data is your dataset; the filter function requires Spark 3.0+):
import org.apache.spark.sql.functions._
val result = data
  .withColumn("EngineType", array_distinct(
    filter(split(col("EngineType"), ","), x => x.notEqual("EWQ"))))
  .withColumn("EngineType", concat_ws(",", col("EngineType")))
Final output:
+----------+----------------+
|Race_Track|EngineType |
+----------+----------------+
|800-RDUO |881,652,300x,PXZ|
+----------+----------------+
Good luck!
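The same split, filter, de-duplicate, join pipeline can be checked quickly in plain Python. This sketch only models the string logic on a single value; the function name clean_engine_types is hypothetical, and it assumes duplicates are dropped keeping the first occurrence, which is what the output above shows.

```python
def clean_engine_types(raw, unwanted="EWQ"):
    """Drop one unwanted token and any duplicates from a comma-separated string."""
    # split + filter: keep every token except the unwanted one.
    parts = [part for part in raw.split(",") if part != unwanted]
    # dict.fromkeys keeps the first occurrence of each token, in order.
    return ",".join(dict.fromkeys(parts))


cleaned = clean_engine_types("881,652,EWQ,300x,652,PXZ")
print(cleaned)
```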

pyspark column type casting in pivot

I have a dataframe from which I want to build a pivot table using two columns. The question header column is pivoted into new columns such as age and age_numeric, and the answer header column supplies the values.
I collect the answer header values into a list with collect_list. Depending on the question type column, I want a new column like age_numeric to be a list of ints and a column like age to be a list of strings, but my code always gives me a list of strings. Any idea how to solve this problem?
This is the code:
y = (output.groupby("sessionId")
     .pivot("questionHeader")
     .agg(collect_list(
         when(col("questionType") == "numericAnswer", col("answerHeader").cast("float"))
         .when(col("questionType") != "numericAnswer", col("answerHeader")))))
This is what I get:
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | ["20"]
| 3 | ["20-25 years"] | ["20"]
This is what i want
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | [20]
| 3 | ["20-25 years"] | [20]
If you want the output as in the last two rows, you do not need a pivot; a groupBy with collect_list on each of the two columns is enough. To get a list of integers for Age_numeric, apply .cast("array<int>") to the collected list, or change the type of the Age_numeric column before calling collect_list().
Replicate the data
import pyspark.sql.functions as F
data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
df = spark.createDataFrame(data, schema=["session_id", "Age", "Age_numeric"])
Replicate the output
df_out = (df.groupBy("session_id")
          .agg(F.collect_list("Age").alias("Age"),
               F.collect_list("Age_numeric")
               .cast("array<int>")
               .alias("Age_numeric")))
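The second option (converting each value before collecting) can be modelled in plain Python as below. This is a sketch of the grouping and casting logic only, using the sample rows from above; collect_by_session is a hypothetical helper, not a Spark function.

```python
from collections import defaultdict


def collect_by_session(rows):
    """Group rows by session id, keeping Age as strings and Age_numeric as ints."""
    grouped = defaultdict(lambda: {"Age": [], "Age_numeric": []})
    for session_id, age, age_numeric in rows:
        grouped[session_id]["Age"].append(age)
        # Convert before appending, so the collected list already holds ints.
        grouped[session_id]["Age_numeric"].append(int(age_numeric))
    return dict(grouped)


data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
collected = collect_by_session(data)
print(collected)
```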

complex logic on pyspark dataframe including previous row existing value as well as previous row value generated on the fly

I have to apply logic to a Spark dataframe or RDD (preferably a dataframe) that generates two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. Columns A and B are available in the dataframe; columns C and D are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to one partition, convert it to an RDD, and then apply a Python function (containing all the calculation logic) through the mapPartitions API.
However, this approach may put all the load on one executor.
Mathematically, D1-C1 with D1 = C1-B1 becomes C1-B1-C1 = -B1, so each value of C is just the negated B of the previous row.
In PySpark, the lag function has a parameter called default, which should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1,100),(2,200),(3,300),(4,400),(5,500)],['a','b'])
w=Window.orderBy('a')
df_lag =df.withColumn('c',F.lag((F.col('b')*-1),default=1000).over(w))
df_final = df_lag.withColumn('d',F.col('c')-F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
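The simplification can be sanity-checked without Spark. The sketch below compares a row-by-row evaluation of the original definition with the lag-style shortcut (C is the negated previous B) on the sample B values; both function names are invented for this illustration.

```python
def sequential_c_d(b_values, default_c=1000):
    """Row-by-row version: C is the previous row's D minus the previous row's C."""
    rows, prev = [], None
    for b in b_values:
        c = default_c if prev is None else prev[1] - prev[0]  # D_prev - C_prev
        d = c - b
        rows.append((c, d))
        prev = (c, d)
    return rows


def closed_form_c_d(b_values, default_c=1000):
    """Lag-style version: C is the negated previous B, or the default on the first row."""
    rows = []
    for i, b in enumerate(b_values):
        c = default_c if i == 0 else -b_values[i - 1]
        rows.append((c, c - b))
    return rows


b_column = [100, 200, 300, 400, 500]
sequential = sequential_c_d(b_column)
closed_form = closed_form_c_d(b_column)
print(sequential == closed_form, closed_form)
```

The two agree only because D - C cancels down to -B; for an operation where no such cancellation exists, the row-by-row dependency remains and the shortcut does not apply.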
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    .withColumn("C",
                F.when(F.lag("C").over(w).isNotNull(),
                       F.lag("D").over(w) - F.lag("C").over(w))
                .otherwise(F.col("C")))
    .withColumn("D", F.col("C") - F.col("B"))
)

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':0}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda / udf to merge the data column, but I'm not sure how to do that during a join.
Alternatively, I could break the properties from the JSON into columns, something like this. Might that be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more spark-esque way to do this, but incorporating accepted answer, this appears to work for question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i+1)]['df']
    # joins here...
    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
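To make the merge rule itself explicit (concatenate a_list, sum count, one row per id), here is a small plain-Python model that works for any number of inputs. It is a sketch on in-memory lists of (id, data) pairs rather than DataFrames, and merge_tables, df1_rows and df2_rows are hypothetical names.

```python
import json
from collections import defaultdict


def merge_tables(*tables):
    """Merge (id, data) rows: concatenate a_list and sum count per id."""
    merged = defaultdict(lambda: {"a_list": [], "count": 0})
    for table in tables:
        for record_id, data in table:
            merged[record_id]["a_list"].extend(data["a_list"])
            merged[record_id]["count"] += data["count"]
    # Serialise each merged record back to a JSON string, one per id.
    return {record_id: json.dumps(merged[record_id]) for record_id in sorted(merged)}


df1_rows = [(42, {"a_list": ["foo"], "count": 1}), (43, {"a_list": ["scrog"], "count": 0})]
df2_rows = [(42, {"a_list": ["bar"], "count": 2}), (44, {"a_list": ["baz"], "count": 4})]
merged_rows = merge_tables(df1_rows, df2_rows)
for record_id, data in merged_rows.items():
    print(record_id, data)
```

Ids that appear in only one input pass through unchanged, which matches the requirement to keep unmatched rows as-is.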

Checking datetime format of a column in dataframe

I have an input dataframe, which has the data below:
id date_column
1 2011-07-09 11:29:31+0000
2 2011-07-09T11:29:31+0000
3 2011-07-09T11:29:31
4 2011-07-09T11:29:31+0000
I want to check whether the format of date_column matches the format "%Y-%m-%dT%H:%M:%S+0000". If the format matches, I want to add a column with the value 1, otherwise 0.
Currently, I have defined a UDF to do this operation:
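from datetime import datetime

def date_pattern_matching(value, pattern):
    # Return "1" if value parses with the given strptime pattern, else "0".
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except (ValueError, TypeError):
        return "0"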
It generates below output dataframe:
id date_column output
1 2011-07-09 11:29:31+0000 0
2 2011-07-09T11:29:31+0000 1
3 2011-07-09T11:29:31 0
4 2011-07-09T11:29:31+0000 1
Execution through UDF takes a lot of time, is there an alternate way to achieve it?
Try the regex pyspark.sql.Column.rlike operator with a when otherwise block
from pyspark.sql import functions as F
data = [[1, '2011-07-09 11:29:31+0000'],
[1,"2011-07-09 11:29:31+0000"],
[2,"2011-07-09T11:29:31+0000"],
[3,"2011-07-09T11:29:31"],
[4,"2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])
# ^ and $ anchor the match to the whole value; [+-][0-9]{4} accepts any numeric UTC offset.
regex = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{4}$"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))
df_w_output.show()
Output
+---+------------------------+------+
|id |date_column |output|
+---+------------------------+------+
|1 |2011-07-09 11:29:31+0000|0 |
|1 |2011-07-09 11:29:31+0000|0 |
|2 |2011-07-09T11:29:31+0000|1 |
|3 |2011-07-09T11:29:31 |0 |
|4 |2011-07-09T11:29:31+0000|1 |
+---+------------------------+------+
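One caveat with replacing strptime by a regex is that the regex only checks the shape of the string, not whether it is a real date. The sketch below compares the two on the sample values using only the Python standard library; it assumes the literal +0000 suffix from the question, and the helper names matches_regex and matches_strptime are made up for illustration.

```python
import re
from datetime import datetime

# Shape-only check for "%Y-%m-%dT%H:%M:%S+0000" (assumes a literal +0000 suffix).
PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+0000")


def matches_regex(value):
    """Return 1 if the whole string has the expected shape, else 0."""
    return 1 if PATTERN.fullmatch(value) else 0


def matches_strptime(value):
    """Return 1 if the string parses as a real datetime in that format, else 0."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S+0000")
        return 1
    except ValueError:
        return 0


samples = [
    "2011-07-09 11:29:31+0000",
    "2011-07-09T11:29:31+0000",
    "2011-07-09T11:29:31",
    "2011-07-09T11:29:31+0000",
]
regex_flags = [matches_regex(s) for s in samples]
strptime_flags = [matches_strptime(s) for s in samples]
print(regex_flags, strptime_flags)

# Month 13 has the right shape but is not a valid date.
impossible = "2011-13-09T11:29:31+0000"
print(matches_regex(impossible), matches_strptime(impossible))
```

On the sample data both checks agree, so the regex is a reasonable substitute as long as impossible dates such as month 13 are not a concern.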