I have the below spark dataframe.
Name age subject parts
xxxx 21 Maths,Physics I
yyyy 22 English,French I,II
I am trying to explode the above dataframe on both subject and parts, as shown below.
Expected output:
Name age subject parts
xxxx 21 Maths I
xxxx 21 Physics I
yyyy 22 English I
yyyy 22 English II
yyyy 22 French I
yyyy 22 French II
I tried using arrays_zip on subject and parts and then exploding the temp column, but I am getting null values where there is only one part.
Is there a way to achieve this in PySpark?
You simply need to use both split and explode:
Data Sample
df.show()
+----+---+--------------+-----+
|Name|age| subject|parts|
+----+---+--------------+-----+
|xxxx| 21| Maths,Physics| I|
|yyyy| 22|English,French| I,II|
+----+---+--------------+-----+
df.printSchema()
root
|-- Name: string (nullable = true)
|-- age: long (nullable = true)
|-- subject: string (nullable = true)
|-- parts: string (nullable = true)
Data Transformation
from pyspark.sql import functions as F
df.withColumn(
    "subject", F.explode(F.split("subject", ","))
).withColumn(
    "parts", F.explode(F.split("parts", ","))
).show()
+----+---+-------+-----+
|Name|age|subject|parts|
+----+---+-------+-----+
|xxxx| 21| Maths| I|
|xxxx| 21|Physics| I|
|yyyy| 22|English| I|
|yyyy| 22|English| II|
|yyyy| 22| French| I|
|yyyy| 22| French| II|
+----+---+-------+-----+
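As a side note on the arrays_zip attempt described in the question: arrays_zip pads the shorter array with nulls when the two arrays have different lengths, which is where the null parts came from. A minimal sketch of that behaviour (assuming the same sample df):
from pyspark.sql import functions as F

df.withColumn(
    "zipped", F.arrays_zip(F.split("subject", ","), F.split("parts", ","))
).withColumn(
    "zipped", F.explode("zipped")
).show(truncate=False)
# for the xxxx row, Physics is paired with a null part,
# because the parts array has only one element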
You can split them separately, then join them back together.
Split subjects
df1 = (df
.select('name', 'age', 'subject')
.withColumn('subject', F.explode(F.split('subject', ',')))
)
# +----+---+-------+
# |name|age|subject|
# +----+---+-------+
# |xxxx| 21| Maths|
# |xxxx| 21|Physics|
# |yyyy| 22|English|
# |yyyy| 22| French|
# +----+---+-------+
Split parts
df2 = (df
.select('name', 'age', 'parts')
.withColumn('parts', F.explode(F.split('parts', ',')))
)
# +----+---+-----+
# |name|age|parts|
# +----+---+-----+
# |xxxx| 21| I|
# |yyyy| 22| I|
# |yyyy| 22| II|
# +----+---+-----+
Join back
df1.join(df2, on=['name', 'age'])
# +----+---+-------+-----+
# |name|age|subject|parts|
# +----+---+-------+-----+
# |xxxx| 21| Maths| I|
# |xxxx| 21|Physics| I|
# |yyyy| 22|English| I|
# |yyyy| 22|English| II|
# |yyyy| 22| French| I|
# |yyyy| 22| French| II|
# +----+---+-------+-----+
I did this by passing the columns as a list to a for loop and exploding the dataframe for every element in the list.
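A minimal sketch of that loop-based approach (assuming the same df and that both columns are comma-separated strings; the list name is illustrative):
from pyspark.sql import functions as F

cols_to_explode = ['subject', 'parts']  # assumed list of columns to explode
for c in cols_to_explode:
    df = df.withColumn(c, F.explode(F.split(c, ',')))
df.show()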
I have below scenario.
li = ['g1','g2','g3']
df1 = id name goal
1 raj g1
2 harsh g3/g1
3 ramu g1
Above you can see the dataframe df1 and the list li.
I want to filter records in df1 based on the list values in li, but the values in the goal column first need to be split on the / delimiter, and I could not get this to work. I tried:
df1 = df1.filter(~df1.goal.isin(li))
but this is not returning any records...
Is there any way to get the records?
Using this example:
from pyspark.sql import functions as F
from pyspark.sql.types import *
li = ['g1','g2','g3']
df1 = spark.createDataFrame(
    [
        ('1', 'raj', 'g1'),
        ('2', 'harsh', 'g3/g1'),
        ('3', 'ramu', 'g1'),
        ('4', 'luiz', 'g2/g4')
    ],
    ["id", "name", "goal"]
)
df1.show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# | 4| luiz|g2/g4|
# +---+-----+-----+
You can use split to split the goal column and then array_except to find which records are not in your list:
result = df1\
    .withColumn('goal_split', F.split(F.col('goal'), "/"))\
    .withColumn('li', F.array([F.lit(x) for x in li]))\
    .withColumn("test", F.array_except('goal_split', 'li'))\
    .filter(F.col('test') == F.array([]))

result.show()
# +---+-----+-----+----------+------------+----+
# | id| name| goal|goal_split| li|test|
# +---+-----+-----+----------+------------+----+
# | 1| raj| g1| [g1]|[g1, g2, g3]| []|
# | 2|harsh|g3/g1| [g3, g1]|[g1, g2, g3]| []|
# | 3| ramu| g1| [g1]|[g1, g2, g3]| []|
# +---+-----+-----+----------+------------+----+
Then, select the columns you want for the result:
result.select('id', 'name', 'goal').show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# +---+-----+-----+
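As a small variation (not from the original answer), the same filter can be written with F.size, which avoids comparing against an empty array literal; a hedged sketch:
result = df1\
    .withColumn('goal_split', F.split(F.col('goal'), "/"))\
    .filter(F.size(F.array_except('goal_split', F.array([F.lit(x) for x in li]))) == 0)\
    .drop('goal_split')

result.show()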
I have a dataset that has null values
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 0|null|
| 1|null| 0|
|null| 1| 0|
| 1| 0| 0|
| 1| 0| 0|
|null| 0| 1|
| 1| 1| 0|
| 1| 1|null|
|null| 1| 0|
+----+----+----+
I wrote a function to count the percentage of null values in each column of the dataset and remove those columns from the dataset. Below is the function:
import pyspark.sql.functions as F

def calc_null_percent(df, strength=None):
    if strength is None:
        strength = 80
    total_count = df.count()
    null_cols = []
    df2 = df.select([F.count(F.when(F.col(c).contains('None') |
                                    F.col(c).contains('NULL') |
                                    (F.col(c) == '') |
                                    F.col(c).isNull() |
                                    F.isnan(c), c)).alias(c)
                     for c in df.columns])
    for i in df2.columns:
        get_null_val = df2.first()[i]
        if (get_null_val / total_count) * 100 > strength:
            null_cols.append(i)
    df = df.drop(*null_cols)
    return df
I am using a for loop to get the columns based on the condition. Can we use map, or is there any other way to optimise the for loop in PySpark?
Here's a way to do it with list comprehension.
import pyspark.sql.functions as func

data_ls = [
    (1, 0, 'blah'),
    (0, None, 'None'),
    (None, 1, 'NULL'),
    (1, None, None)
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id1', 'id2', 'id3'])
# +----+----+----+
# | id1| id2| id3|
# +----+----+----+
# | 1| 0|blah|
# | 0|null|None|
# |null| 1|NULL|
# | 1|null|null|
# +----+----+----+
Now, calculate the percentage of nulls in a dataframe and collect() it for further use.
# total row count
tot_count = data_sdf.count()

# percentage of null records per column
data_null_perc_sdf = data_sdf. \
    select(*[(func.sum((func.col(k).isNull() | func.upper(k).isin(['NONE', 'NULL'])).cast('int')) / tot_count).alias(k + '_nulls_perc')
             for k in data_sdf.columns])
# +--------------+--------------+--------------+
# |id1_nulls_perc|id2_nulls_perc|id3_nulls_perc|
# +--------------+--------------+--------------+
# | 0.25| 0.5| 0.75|
# +--------------+--------------+--------------+
# collection of the dataframe for list comprehension
data_null_perc = data_null_perc_sdf.collect()
# [Row(id1_nulls_perc=0.25, id2_nulls_perc=0.5, id3_nulls_perc=0.75)]
threshold = 0.5
# retain columns of `data_sdf` that have more null records than aforementioned threshold
cols2drop = [k for k in data_sdf.columns if data_null_perc[0][k+'_nulls_perc'] >= threshold]
# ['id2', 'id3']
Use the cols2drop variable to drop the columns from data_sdf in the next step:
new_data_sdf = data_sdf.drop(*cols2drop)
# +----+
# | id1|
# +----+
# | 1|
# | 0|
# |null|
# | 1|
# +----+
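If you prefer to keep the function-style interface from the question, the steps above can be packaged into a helper analogous to calc_null_percent; a minimal sketch (the function name and the percentage-based strength argument are assumptions):
import pyspark.sql.functions as func

def drop_mostly_null_cols(df, strength=None):
    # strength is a percentage threshold, as in the original function
    if strength is None:
        strength = 80
    tot_count = df.count()
    null_perc = df.select(
        *[(func.sum((func.col(k).isNull() | func.upper(k).isin(['NONE', 'NULL'])).cast('int')) / tot_count).alias(k)
          for k in df.columns]
    ).collect()[0]
    cols2drop = [k for k in df.columns if null_perc[k] * 100 >= strength]
    return df.drop(*cols2drop)

# usage on the sample data above
drop_mostly_null_cols(data_sdf, strength=50).show()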
I have a Spark dataframe where I need to calculate a running total based on the current and previous row sums of the amount, grouped by col_x. Whenever a negative amount occurs in col_y, I should break the running total of the previous records and start the running total again from the current row.
Sample dataset:
The expected output should be like:
How can I achieve this with dataframes using PySpark?
Another way
Create Index
import sys
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')#.show()
Create groups bounded by negative values
w = Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize, 0)
df = df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y = Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y), 2)).sort('index').drop('cat', 'index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
I am hoping that in a real scenario you will have a timestamp column to order the data by; here I am ordering the data by line number using zipWithIndex for the purpose of explanation.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
    ("ID1", -17.9),
    ("ID1", 21.9),
    ("ID1", 236.9),
    ("ID1", 4.99),
    ("ID1", 610.2),
    ("ID1", -35.8),
    ("ID1", 21.9),
    ("ID1", 17.9)
]
schema = StructType([
    StructField('Col_x', StringType(), True),
    StructField('Col_y', FloatType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of the intermediate dataset and the final result
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+
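The same flag-and-group logic can also be written with the DataFrame API only, without temp views; a minimal sketch, assuming the indexed frame df_1 built above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_all = Window.partitionBy('Col_x').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)

df_api = (df_1
    .select(F.col('value.Col_x').alias('Col_x'),
            F.round('value.Col_y', 1).alias('Col_y'),
            'index')
    # flag rows where a negative value starts a new group
    .withColumn('indicator', (F.col('Col_y') < 0).cast('int'))
    # the running sum of the flag yields the group id
    .withColumn('group', F.sum('indicator').over(w_all)))

w_grp = Window.partitionBy('Col_x', 'group').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)
df_api.withColumn('Col_z', F.round(F.sum('Col_y').over(w_grp), 1)) \
      .select('Col_x', 'Col_y', 'Col_z').show()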
I have a dataframe containing around 20,000,000 rows, and I'd like to drop duplicates based on two columns if those columns have the same values, even if the values are in reverse order.
For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the column as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates() method only removes duplicates if the values are in the same order.
I followed the accepted answer to the question Pandas: remove reverse duplicates from dataframe, but it took too much time.
You can use this:
Note: in col3, 'D' will be removed instead of 'C', because 'C' is positioned before 'D'.
Hope this helps.
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack2.csv', header = 'True')
df2 = df.select(F.least(df.col1,df.col2).alias('col1'),F.greatest(df.col1,df.col2).alias('col2'),df.col3)
df2.dropDuplicates(['col1','col2']).show()
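For a self-contained run without the CSV file, the same approach can be tried on data built inline (the sample values are taken from the question):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('1', '1', 'A'), ('1', '1', 'B'), ('2', '1', 'C'), ('1', '2', 'D'),
     ('3', '5', 'E'), ('3', '4', 'F'), ('4', '3', 'G')],
    ['col1', 'col2', 'col3']
)
# normalise the pair order, then deduplicate on the normalised pair
df2 = df.select(F.least(df.col1, df.col2).alias('col1'),
                F.greatest(df.col1, df.col2).alias('col2'),
                df.col3)
df2.dropDuplicates(['col1', 'col2']).show()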