Flatten JSON keys and values in PySpark

I have a table with 2 columns and 2 records:
Record 1: column my_col has the value {"XXX": ["123","456"],"YYY": ["246","135"]} and column ID has the value A123
Record 2: column my_col has the value {"XXX": ["123","456"],"YYY": ["246","135"], "ZZZ":["333","444"]} and column ID has the value B222
I need to parse/flatten this using PySpark.
Expected output:
Key   Value   ID
XXX   123     A123
XXX   456     A123
YYY   246     A123
YYY   135     A123
ZZZ   333     B222
ZZZ   444     B222

If your column is a string, you can use from_json with a custom schema to convert it to a MapType before using explode to extract it into the desired results. I assumed that your initial column is named my_col and that your data is in a dataframe named input_df.
An example is shown below
Approach 1: Using pyspark api
from pyspark.sql import functions as F
from pyspark.sql import types as T

custom_schema = T.MapType(T.StringType(), T.ArrayType(T.StringType()))

output_df = (
    input_df.select(
        F.from_json(F.col('my_col'), custom_schema).alias('my_col_json')
    )
    .select(F.explode('my_col_json'))
    .select(
        F.col('key'),
        F.explode('value')
    )
)
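The snippet above returns only the exploded key/value pairs. Since your expected output also includes the ID column, here is a minimal sketch that carries it through as well (assuming the second column is literally named ID, as in your example):

from pyspark.sql import functions as F
from pyspark.sql import types as T

custom_schema = T.MapType(T.StringType(), T.ArrayType(T.StringType()))

output_df = (
    input_df
    # parse the JSON string and keep the ID column alongside it
    .select(F.col('ID'), F.from_json(F.col('my_col'), custom_schema).alias('my_col_json'))
    # explode the map into key and value columns, still keeping ID
    .select('ID', F.explode('my_col_json'))
    # explode each array of values into one row per value
    .select(F.col('key').alias('Key'), F.explode('value').alias('Value'), 'ID')
)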
Approach 2: Using spark sql
# Step 1: Create a temporary view that may be queried
input_df.createOrReplaceTempView("input_df")

# Step 2: Run the following SQL on your Spark session
output_df = sparkSession.sql("""
    SELECT
        key,
        EXPLODE(value)
    FROM (
        SELECT
            EXPLODE(from_json(my_col, "MAP<STRING,ARRAY<STRING>>"))
        FROM
            input_df
    ) t
""")
For a column that is already a map
If my_col is already a MapType column (i.e. the JSON is already parsed), the from_json step can be skipped:
from pyspark.sql import functions as F

output_df = (
    input_df.select(F.explode('my_col'))
    .select(
        F.col('key'),
        F.explode('value')
    )
)
or
# Step 1: Create a temporary view that may be queried
input_df.createOrReplaceTempView("input_df")

# Step 2: Run the following SQL on your Spark session
output_df = sparkSession.sql("""
    SELECT
        key,
        EXPLODE(value)
    FROM (
        SELECT
            EXPLODE(my_col)
        FROM
            input_df
    ) t
""")
Let me know if this works for you.

Related

PySpark Code Modification to Remove Nulls

I received help with the following PySpark code to prevent errors when doing a Merge in Databricks; see here:
Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way
I was wondering if I could get help modifying the code to drop NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code that you are using does not remove all of the rows where P_key is null. The window assigns a row number to the null P_key rows as well, and the null row that gets rn = 1 is not removed by the filter.
You can use df.na.drop instead to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
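For example, a minimal sketch combining the null drop with your existing de-duplication (using the partdf, P_key and Id names from your question):

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df2 = (
    partdf.na.drop(subset=["P_key"])  # drop rows where P_key is null first
          .withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
)
df3 = df2.filter("rn = 1").drop("rn")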
If you want to keep your original approach, you can make it work as follows: add a dummy row with the lowest possible id value, store that id in a variable, run the same window code, and add an extra condition to the filter, as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_Key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove added id row.
df3 = df2.filter((df2.rn == 1) & (df2.P_key!=dup_id)).drop("rn")
df3.show()

How to append 'explode'd columns to a dataframe keeping all existing columns?

I'm trying to add exploded columns to a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))

schema = StructType().add("a", MapType(StringType(), IntegerType()))

events = jsonToDataFrame("""
{
  "a": {
    "b": 1,
    "c": 2
  }
}
""", schema)

display(
    events.withColumn("a", explode("a").alias("x", "y"))
)
However, I'm hitting the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got a
Any ideas?
In the end, I used the following:
display(
    events.select(explode("a").alias("x", "y"), *[c for c in events.columns])
)
This approach uses select to specify the columns to return.
The first argument explodes the data:
explode("a").alias("x", "y")
The second argument specifies that all existing columns should be included in the select:
*[c for c in events.columns]
Note that I'm prefixing the list with * - this sends each column name as a separate parameter.
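As a tiny self-contained illustration of that * unpacking (plain Python, nothing Spark-specific):

cols = ["x", "y", "z"]
print(*cols)   # equivalent to print("x", "y", "z") - each item is a separate argument
print(cols)    # without the *, the whole list is passed as a single argument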
Simpler Method
The API docs for select specify:
Parameters: cols – str, Column, or list
column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
We can simplify the first approach by passing in "*" to select all the columns:
display(
    events.select("*", explode("a").alias("x", "y"))
)

Pyspark fill null value of a column based on value of another column

I have a dataframe with 2 columns: col1 and col2:
col1   col2
aaa    111
       222
ccc    333
I want to fill the null values (here the 2nd row of col1).
Here for example the logic I want to use is: if col2 is 222 and col1 is null, use the arbitrary string "zzz". For each possibility in col2, I have an arbitrary string I want to fill col1 if it's null (if it's not, I just want to get the value that is already in col1).
My idea was to do something like this:
mapping = {"222":"zzz", "444":"fff"}
df = df.select(F.when(F.col('col1').isNull(), mapping[F.col('col2')]).otherwise(F.col('col1')))
I know F.col() is actually a column object and I can't simply do this.
What is the simplest way to achieve the result I want with PySpark, please?
This should work:
from pyspark.sql.functions import col, create_map, lit, when
from itertools import chain

mapping = {"222": "zzz", "444": "fff"}
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df = df.select(
    when(col('col1').isNull(), mapping_expr[col('col2')]).otherwise(col('col1')).alias('col1'),
    'col2'
)
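For reference, a minimal runnable sketch of the same idea on the sample data from your question; coalesce is an equivalent shorthand for the when/otherwise pattern here (the column names and the spark session name are assumptions):

from itertools import chain
from pyspark.sql.functions import coalesce, col, create_map, lit

df = spark.createDataFrame(
    [("aaa", "111"), (None, "222"), ("ccc", "333")],
    ["col1", "col2"],
)

mapping = {"222": "zzz", "444": "fff"}
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

# keep col1 when it is non-null, otherwise fall back to the value mapped from col2
df = df.withColumn("col1", coalesce(col("col1"), mapping_expr[col("col2")]))
df.show()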

When merging two queries in Power BI, can I exact match on one key and fuzzy match on a second key?

I am merging two tables in Power BI where I want to have an exact match on one field, then a fuzzy match on a second field.
In the example below, I want an exact match on the "Key" columns in Table 1 and Table 2. In Table 2, the "Key" column is not a unique identifier and can have multiple names associated with a key, so I then want to fuzzy match on the name column. Is there a way to do this in Power BI?
Table 1
Key  Name1    info_a
1    Michael  a
2    Robert   b
Table 2
Key  Name2     info_b
1    Mike      aa
1    Andrea    cc
2    Robbie    bb
2    Michelle  dd
Result
Key  Name1    Name2   info_a  info_b
1    Michael  Mike    a       aa
2    Robert   Robbie  b       bb
I ended up using a Python script to solve this problem.
I merged Table 1 and Table 2 on the field ("Key") where an exact match was required.
Then I added this Python script:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

def get_fuzz_score(
    df: pd.DataFrame, col1: str, col2: str, scorer=fuzz.token_sort_ratio
) -> pd.Series:
    """
    Parameters
    ----------
    df: pd.DataFrame
    col1: str, name of column from df
    col2: str, name of column from df
    scorer: fuzzywuzzy scorer (e.g. fuzz.ratio, fuzz.WRatio, fuzz.partial_ratio, fuzz.token_sort_ratio)

    Returns
    -------
    scores: pd.Series
    """
    scores = []
    for _, row in df.iterrows():
        if row[col1] in [np.nan, None] or row[col2] in [np.nan, None]:
            scores.append(None)
        else:
            scores.append(scorer(row[col1], row[col2]))
    return scores

dataset['fuzzy_score'] = get_fuzz_score(dataset, 'Name1', 'Name2', fuzz.WRatio)
dataset['MatchRank'] = dataset.groupby(['Key'])['fuzzy_score'].rank('first', ascending=False)
Then I could just consider the matches where MatchRank = 1
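For instance, in pandas that last filtering step could be as simple as the following (dataset being the merged table from the previous step):

best_matches = dataset[dataset["MatchRank"] == 1]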

How to generate cumulative concatenation in Spark SQL

My Input for spark is below:
Col_1  Col_2  Amount
1      0      35/310320
1      1      35/5
1      1      180/-310350
17     1      0/1000
17     17     0/-1000
17     17     74/314322
17     17     74/5
17     17     185/-3142
I want to generate the below Output using spark SQL:
Output
35/310320
35/310320/35/5
35/310320/35/5/180/-310350
0/1000
0/1000/0/-1000
0/1000/0/-1000/74/314322
0/1000/0/-1000/74/314322/74/5
0/1000/0/-1000/74/314322/74/5/185/-3142
Conditions & procedure: if the col_1 and col_2 values are not the same, take the current Amount value for the new Output column; if they are the same, concatenate all previous Amount values with /.
For example, in the first row for col_1 = 17 the col_1 and col_2 values differ, so the current amount 0/1000 is taken. In the next row both column values are the same, so the value becomes 0/1000/0/-1000, and so on. I need to build this logic for dynamic data in Spark SQL or Spark Scala.
You can use concat_ws on the list of amounts obtained from collect_list over an appropriate window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, concat_ws}

val df2 = df.withColumn(
  "output",
  concat_ws(
    "/",
    collect_list("amount").over(
      Window.partitionBy("col_1")
        .orderBy("col_2")
        .rowsBetween(Window.unboundedPreceding, 0)
    )
  )
)
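If you are working in Python instead, a PySpark sketch equivalent to the Scala snippet above, using the same column names:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# cumulative window: all rows from the start of the partition up to the current row
w = (
    Window.partitionBy("col_1")
          .orderBy("col_2")
          .rowsBetween(Window.unboundedPreceding, 0)
)

df2 = df.withColumn("output", F.concat_ws("/", F.collect_list("amount").over(w)))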