I have the below spark dataframe.
Name age subject parts
xxxx 21 Maths,Physics I
yyyy 22 English,French I,II
I am trying to explode the above dataframe on both subject and parts, as shown below.
Expected output:
Name age subject parts
xxxx 21 Maths I
xxxx 21 Physics I
yyyy 22 English I
yyyy 22 English II
yyyy 22 French I
yyyy 22 French II
I tried using arrays_zip on subject and parts and then exploding the temp column, but I am getting null values where there is only one part.
Is there a way to achieve this in PySpark?
You simply need to use both split and explode:
Data Sample
df.show()
+----+---+--------------+-----+
|Name|age| subject|parts|
+----+---+--------------+-----+
|xxxx| 21| Maths,Physics| I|
|yyyy| 22|English,French| I,II|
+----+---+--------------+-----+
df.printSchema()
root
|-- Name: string (nullable = true)
|-- age: long (nullable = true)
|-- subject: string (nullable = true)
|-- parts: string (nullable = true)
Data Transformation
from pyspark.sql import functions as F
df.withColumn(
    "subject", F.explode(F.split("subject", ","))
).withColumn(
    "parts", F.explode(F.split("parts", ","))
).show()
+----+---+-------+-----+
|Name|age|subject|parts|
+----+---+-------+-----+
|xxxx| 21| Maths| I|
|xxxx| 21|Physics| I|
|yyyy| 22|English| I|
|yyyy| 22|English| II|
|yyyy| 22| French| I|
|yyyy| 22| French| II|
+----+---+-------+-----+
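As a side note on the arrays_zip attempt described in the question: arrays_zip pads the shorter array with nulls when the two arrays have different lengths, which is where the null parts came from. A minimal sketch of that behaviour (assuming the same sample df):
from pyspark.sql import functions as F

df.withColumn(
    "zipped", F.arrays_zip(F.split("subject", ","), F.split("parts", ","))
).withColumn(
    "zipped", F.explode("zipped")
).show(truncate=False)
# for the xxxx row, Physics is paired with a null part,
# because the parts array has only one element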
You can split them separately, then join them back together.
Split subjects
df1 = (df
.select('name', 'age', 'subject')
.withColumn('subject', F.explode(F.split('subject', ',')))
)
# +----+---+-------+
# |name|age|subject|
# +----+---+-------+
# |xxxx| 21| Maths|
# |xxxx| 21|Physics|
# |yyyy| 22|English|
# |yyyy| 22| French|
# +----+---+-------+
Split parts
df2 = (df
.select('name', 'age', 'parts')
.withColumn('parts', F.explode(F.split('parts', ',')))
)
# +----+---+-----+
# |name|age|parts|
# +----+---+-----+
# |xxxx| 21| I|
# |yyyy| 22| I|
# |yyyy| 22| II|
# +----+---+-----+
Join back
df1.join(df2, on=['name', 'age'])
# +----+---+-------+-----+
# |name|age|subject|parts|
# +----+---+-------+-----+
# |xxxx| 21| Maths| I|
# |xxxx| 21|Physics| I|
# |yyyy| 22|English| I|
# |yyyy| 22|English| II|
# |yyyy| 22| French| I|
# |yyyy| 22| French| II|
# +----+---+-------+-----+
I did this by passing the columns as a list to a for loop and exploding the dataframe for every element in the list.
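A minimal sketch of that loop-based approach (assuming the same df and that both columns are comma-separated strings; the list name is illustrative):
from pyspark.sql import functions as F

cols_to_explode = ['subject', 'parts']  # assumed list of columns to explode
for c in cols_to_explode:
    df = df.withColumn(c, F.explode(F.split(c, ',')))
df.show()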
I have below scenario.
li = ['g1','g2','g3']
df1 = id name goal
1 raj g1
2 harsh g3/g1
3 ramu g1
Above you can see the dataframe df1 and the list li.
I want to filter records in df1 based on the list values in li, but the values in the goal column first need to be split on the / delimiter, and I could not get this to work. I tried:
df1 = df1.filter(~df1.goal.isin(li))
but this is not returning any records...
Is there any way to get the records?
Using this example:
from pyspark.sql import functions as F
from pyspark.sql.types import *
li = ['g1','g2','g3']
df1 = spark.createDataFrame(
    [
        ('1', 'raj', 'g1'),
        ('2', 'harsh', 'g3/g1'),
        ('3', 'ramu', 'g1'),
        ('4', 'luiz', 'g2/g4')
    ],
    ["id", "name", "goal"]
)
df1.show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# | 4| luiz|g2/g4|
# +---+-----+-----+
You can use split to split the goal column and then array_except to find which records are not in your list:
result = df1\
    .withColumn('goal_split', F.split(F.col('goal'), "/"))\
    .withColumn('li', F.array([F.lit(x) for x in li]))\
    .withColumn("test", F.array_except('goal_split', 'li'))\
    .filter(F.col('test') == F.array([]))

result.show()
# +---+-----+-----+----------+------------+----+
# | id| name| goal|goal_split| li|test|
# +---+-----+-----+----------+------------+----+
# | 1| raj| g1| [g1]|[g1, g2, g3]| []|
# | 2|harsh|g3/g1| [g3, g1]|[g1, g2, g3]| []|
# | 3| ramu| g1| [g1]|[g1, g2, g3]| []|
# +---+-----+-----+----------+------------+----+
Then, select the columns you want for the result:
result.select('id', 'name', 'goal').show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# +---+-----+-----+
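As a small variation (not from the original answer), the same filter can be written with F.size, which avoids comparing against an empty array literal; a hedged sketch:
result = df1\
    .withColumn('goal_split', F.split(F.col('goal'), "/"))\
    .filter(F.size(F.array_except('goal_split', F.array([F.lit(x) for x in li]))) == 0)\
    .drop('goal_split')

result.show()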
I have a dataset that has null values
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 0|null|
| 1|null| 0|
|null| 1| 0|
| 1| 0| 0|
| 1| 0| 0|
|null| 0| 1|
| 1| 1| 0|
| 1| 1|null|
|null| 1| 0|
+----+----+----+
I wrote a function to count the percentage of null values in each column of the dataset and remove those columns from the dataset. Below is the function:
import pyspark.sql.functions as F

def calc_null_percent(df, strength=None):
    if strength is None:
        strength = 80
    total_count = df.count()
    null_cols = []
    df2 = df.select([F.count(F.when(F.col(c).contains('None') |
                                    F.col(c).contains('NULL') |
                                    (F.col(c) == '') |
                                    F.col(c).isNull() |
                                    F.isnan(c), c)).alias(c)
                     for c in df.columns])
    for i in df2.columns:
        get_null_val = df2.first()[i]
        if (get_null_val / total_count) * 100 > strength:
            null_cols.append(i)
    df = df.drop(*null_cols)
    return df
I am using a for loop to get the columns based on the condition. Can we use map, or is there any other way to optimise the for loop in PySpark?
Here's a way to do it with list comprehension.
import pyspark.sql.functions as func

data_ls = [
    (1, 0, 'blah'),
    (0, None, 'None'),
    (None, 1, 'NULL'),
    (1, None, None)
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id1', 'id2', 'id3'])
# +----+----+----+
# | id1| id2| id3|
# +----+----+----+
# | 1| 0|blah|
# | 0|null|None|
# |null| 1|NULL|
# | 1|null|null|
# +----+----+----+
Now, calculate the percentage of nulls in a dataframe and collect() it for further use.
# total row count
tot_count = data_sdf.count()

# percentage of null records per column
data_null_perc_sdf = data_sdf. \
    select(*[(func.sum((func.col(k).isNull() | func.upper(k).isin(['NONE', 'NULL'])).cast('int')) / tot_count).alias(k + '_nulls_perc')
             for k in data_sdf.columns])
# +--------------+--------------+--------------+
# |id1_nulls_perc|id2_nulls_perc|id3_nulls_perc|
# +--------------+--------------+--------------+
# | 0.25| 0.5| 0.75|
# +--------------+--------------+--------------+
# collection of the dataframe for list comprehension
data_null_perc = data_null_perc_sdf.collect()
# [Row(id1_nulls_perc=0.25, id2_nulls_perc=0.5, id3_nulls_perc=0.75)]
threshold = 0.5
# retain columns of `data_sdf` that have more null records than aforementioned threshold
cols2drop = [k for k in data_sdf.columns if data_null_perc[0][k+'_nulls_perc'] >= threshold]
# ['id2', 'id3']
Use the cols2drop variable to drop the columns from data_sdf in the next step:
new_data_sdf = data_sdf.drop(*cols2drop)
# +----+
# | id1|
# +----+
# | 1|
# | 0|
# |null|
# | 1|
# +----+
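If you prefer to keep the function-style interface from the question, the steps above can be packaged into a helper analogous to calc_null_percent; a minimal sketch (the function name and the percentage-based strength argument are assumptions):
import pyspark.sql.functions as func

def drop_mostly_null_cols(df, strength=None):
    # strength is a percentage threshold, as in the original function
    if strength is None:
        strength = 80
    tot_count = df.count()
    null_perc = df.select(
        *[(func.sum((func.col(k).isNull() | func.upper(k).isin(['NONE', 'NULL'])).cast('int')) / tot_count).alias(k)
          for k in df.columns]
    ).collect()[0]
    cols2drop = [k for k in df.columns if null_perc[k] * 100 >= strength]
    return df.drop(*cols2drop)

# usage on the sample data above
drop_mostly_null_cols(data_sdf, strength=50).show()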
I have a Spark dataframe where I need to calculate a running total based on the current and previous row sums of the amount, grouped by col_x. Whenever a negative amount occurs in col_y, I should break the running total of the previous records and start the running total again from the current row.
Sample dataset:
The expected output should be like:
How can I achieve this with dataframes using PySpark?
Another way
Create Index
import sys
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')#.show()
Create groups bounded by negative values
w = Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize, 0)
df = df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y = Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y), 2)).sort('index').drop('cat', 'index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
I am hoping that in a real scenario you will have a timestamp column to order the data by; here I am ordering the data by line number using zipWithIndex for the purpose of explanation.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
    ("ID1", -17.9),
    ("ID1", 21.9),
    ("ID1", 236.9),
    ("ID1", 4.99),
    ("ID1", 610.2),
    ("ID1", -35.8),
    ("ID1", 21.9),
    ("ID1", 17.9)
]
schema = StructType([
    StructField('Col_x', StringType(), True),
    StructField('Col_y', FloatType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of the intermediate dataset and the final result
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+
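The same flag-and-group logic can also be written with the DataFrame API only, without temp views; a minimal sketch, assuming the indexed frame df_1 built above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_all = Window.partitionBy('Col_x').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)

df_api = (df_1
    .select(F.col('value.Col_x').alias('Col_x'),
            F.round('value.Col_y', 1).alias('Col_y'),
            'index')
    # flag rows where a negative value starts a new group
    .withColumn('indicator', (F.col('Col_y') < 0).cast('int'))
    # the running sum of the flag yields the group id
    .withColumn('group', F.sum('indicator').over(w_all)))

w_grp = Window.partitionBy('Col_x', 'group').orderBy('index').rowsBetween(Window.unboundedPreceding, 0)
df_api.withColumn('Col_z', F.round(F.sum('Col_y').over(w_grp), 1)) \
      .select('Col_x', 'Col_y', 'Col_z').show()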
I have a dataframe containing around 20,000,000 rows, and I'd like to drop duplicates based on two columns if those columns have the same values, even if the values are in reverse order.
For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the column as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates() method only removes duplicates if the values are in the same order.
I followed the accepted answer to the question Pandas: remove reverse duplicates from dataframe, but it took too much time.
You can use this:
Note: in col3, 'D' will be removed instead of 'C', because 'C' is positioned before 'D'.
Hope this helps.
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack2.csv', header = 'True')
df2 = df.select(F.least(df.col1,df.col2).alias('col1'),F.greatest(df.col1,df.col2).alias('col2'),df.col3)
df2.dropDuplicates(['col1','col2']).show()
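For a self-contained run without the CSV file, the same approach can be tried on data built inline (the sample values are taken from the question):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('1', '1', 'A'), ('1', '1', 'B'), ('2', '1', 'C'), ('1', '2', 'D'),
     ('3', '5', 'E'), ('3', '4', 'F'), ('4', '3', 'G')],
    ['col1', 'col2', 'col3']
)
# normalise the pair order, then deduplicate on the normalised pair
df2 = df.select(F.least(df.col1, df.col2).alias('col1'),
                F.greatest(df.col1, df.col2).alias('col2'),
                df.col3)
df2.dropDuplicates(['col1', 'col2']).show()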