PySpark: set values based on a column's condition

I have this dataframe:
import pandas as pd

data = [['tom', 'kik', 1], ['nick', 'ken', 1], ['juli', 'ryan', 2]]
df = pd.DataFrame(data, columns=['Name', 'Name2', 'stat'])
df = spark.createDataFrame(df)
I need to apply this transformation to the two columns: if stat == 1, then both Name and Name2 should become 'toto'. The expected result is:
data = [['toto', 'toto', 1], ['toto', 'toto', 1], ['juli', 'ryan', 2]]
df = pd.DataFrame(data, columns=['Name', 'Name2', 'stat'])
df = spark.createDataFrame(df)

from pyspark.sql.functions import col, when

condition = (col("stat") == 1)
# otherwise() keeps the original value where stat != 1 (without it those rows become null)
new_df = df.withColumn("Name", when(condition, "toto").otherwise(col("Name"))) \
           .withColumn("Name2", when(condition, "toto").otherwise(col("Name2")))
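With the otherwise clauses in place, rows where stat is not 1 keep their original values, so new_df.show() should produce something like:
+----+-----+----+
|Name|Name2|stat|
+----+-----+----+
|toto| toto|   1|
|toto| toto|   1|
|juli| ryan|   2|
+----+-----+----+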

Related

Convert each row of a PySpark DataFrame column to a JSON string

How do I create a column with a JSON structure based on other columns of a PySpark dataframe?
For example, I want to achieve the below in a PySpark dataframe. I am able to do this on a pandas dataframe as shown below, but how do I do the same on a PySpark dataframe?
import pandas as pd

df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state': ['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns=['Address', 'zip', 'state'])
lst = ['Address', 'zip']
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis=1)
Expected output: a new_col column holding the JSON string of Address and zip for each row.
Assuming your PySpark dataframe is named df, use the struct function to build a struct from the desired columns, then use the to_json function to convert it to a JSON string.
import pyspark.sql.functions as F

# df is your existing PySpark dataframe
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
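For the sample rows above, new_col should contain JSON strings along the lines of:
{"Address":"abc","zip":34567}
{"Address":"dvf","zip":12345}
{"Address":"bgh","zip":78905}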

Split a PySpark dataframe by number of records

I'm working on a PySpark dataframe with around 100000 records and I want to create new dataframes of around 20000 records each. How can I achieve this?
It can be made dynamic, but here is a lazy way to do it:
# Creates a random DF with 100000 rows
from pyspark.sql import functions as F
df = spark.range(0, 100000).withColumn('rand_col', F.rand()).drop('id')

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Window ordered by a constant literal, just to get a sequential row number
w = Window().orderBy(lit('A'))
df = df.withColumn("index", row_number().over(w))  # creates an index column to split the DF
df1 = df.filter(F.col('index') < 20001)
df2 = df.filter((F.col('index') >= 20001) & (F.col('index') < 40001))
df3 = df.filter((F.col('index') >= 40001) & (F.col('index') < 60001))
df4 = df.filter((F.col('index') >= 60001) & (F.col('index') < 80001))
df5 = df.filter((F.col('index') >= 80001) & (F.col('index') < 100001))
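If the total number of rows is not known in advance, a more dynamic variant is to compute the boundaries in a loop (a sketch reusing df, F and the index column from the snippet above; chunk_size and the chunks list are assumptions, not part of the original answer):
# Sketch: split df into pieces of at most chunk_size rows using the index column
chunk_size = 20000
total_rows = df.count()
chunks = []
for start in range(1, total_rows + 1, chunk_size):
    chunks.append(df.filter((F.col('index') >= start) & (F.col('index') < start + chunk_size)))
# chunks[0], chunks[1], ... each hold at most chunk_size rows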

Alternative to df.withColumn in PySpark?

I am converting some long-type columns to datetime using withColumn and the code below, but I notice that using withColumn inside a for loop leads to poor query-planning performance.
Is there any other way I can achieve the below?
from pyspark.sql import functions as F

list_of_columns = [...]  # list of 20 columns
for col in list_of_columns:
    df = df.withColumn(col, F.from_utc_timestamp(F.from_unixtime(df[col] / 1000), "UTC"))
You can put all of that in a single select statement:
no_change_columns = [col for col in df.columns if col not in list_of_columns]
df = df.select(
    *no_change_columns,
    *(
        F.from_utc_timestamp(F.from_unixtime(F.col(col) / 1000), "UTC").alias(col)
        for col in list_of_columns
    )
)
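If you are on Spark 3.3 or newer, another option is DataFrame.withColumns, which adds all of the columns in a single call rather than chaining withColumn in a loop (a sketch, assuming the same list_of_columns as above):
# Sketch: DataFrame.withColumns (PySpark 3.3+) takes a dict of column name -> expression
df = df.withColumns(
    {c: F.from_utc_timestamp(F.from_unixtime(F.col(c) / 1000), "UTC") for c in list_of_columns}
)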

Join parameterization in PySpark

I need to parameterize the join condition; the columns to join on should be passed from the CLI as arguments to my PySpark script.
My code is:
x1 = col(argv[1])
x2 = col(argv[2])
df = df1.join(df2, (df1.x1 == df2.x2))
I call the script like this:
join.py empid empid
but I get an error saying that the dataframe has no such columns.
Any ideas on how to solve this?
Follow this approach: keep the column names as plain strings and look them up with df1[x1] instead of df1.x1 (attribute access looks for a column literally named x1). It also works when both dataframes join on a column with the same name.
# Simulating the command-line arguments here; in the real script they come from the CLI
argv = ['join.py', 'empid', 'empid']
x1 = argv[1]
x2 = argv[2]
df1 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
df2 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
df = df1.join(df2, df1[x1] == df2[x2])
df.show()
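In the actual script, the hard-coded argv above would come from the command line; a minimal sketch using the standard sys module (the spark-submit invocation is an example, not part of the original answer):
# join.py -- read the join columns from the command-line arguments
import sys
x1, x2 = sys.argv[1], sys.argv[2]
df = df1.join(df2, df1[x1] == df2[x2])
df.show()
# run it with: spark-submit join.py empid empid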

Adding a column to a PySpark dataframe based on conditions from other PySpark dataframes

I am currently learning PySpark and working on adding columns to PySpark dataframes using multiple conditions.
I have tried working with UDFs but I am getting errors like:
TypeError: 'object' object has no attribute '__getitem__'
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType, StringType, FloatType
from pyspark.sql.functions import pandas_udf, PandasUDFType

# first dataframe (superset)
df1 = simple_example1
# second dataframe
df = diff_cols.dropna()

def func(x, y):
    z = df1[(df1['a'] == x) & (df1['b'] <= (y + 10000000000)) & (df1['b'] >= (y - 10000000000))]
    z = z[(z["c"] == 1) | (z["d"] == 1)]
    z = z[(z["e"] != 0) | (z["f"] != 0) | (z["g"] != 0) | (z["h"] != 0)]
    return 1 if z.count() > 3 else 0

udf_func = udf(func, IntegerType())
df = df.withColumn('status', udf_func(df['a'], df['b']))
What I am trying to do is as follows (see the sketch after this list):
1. For each row of df, filter df1 to rows where column a equals the value of a in df and column b lies in a window around the value of b in df (b ± 10000000000 in the code above).
2. Then filter that data further to rows where either c or d equals 1.
3. Then filter further to rows where any of the columns e, f, g, h is non-zero.
4. Then count the number of rows in the resulting subset and assign 0 or 1.
5. Return this 0/1 in a status column of df.
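As a rough sketch (an assumption, not the original poster's solution), steps 1-5 can usually be expressed as a join plus an aggregation instead of a row-by-row UDF; the column names a-h and the ±10000000000 window are taken from the code above:
# Sketch only: join + aggregation version of steps 1-5, assuming df has columns a and b
# and df1 has columns a..h as in the code above.
from pyspark.sql import functions as F

matches = (
    df.alias("l")
    .join(
        df1.alias("r"),
        (F.col("r.a") == F.col("l.a"))
        & (F.col("r.b") >= F.col("l.b") - 10000000000)
        & (F.col("r.b") <= F.col("l.b") + 10000000000)
        & ((F.col("r.c") == 1) | (F.col("r.d") == 1))
        & ((F.col("r.e") != 0) | (F.col("r.f") != 0) | (F.col("r.g") != 0) | (F.col("r.h") != 0)),
        "left",
    )
    .groupBy("l.a", "l.b")
    .agg(F.count("r.a").alias("n_matches"))  # count() ignores the nulls from unmatched rows
    .withColumn("status", F.when(F.col("n_matches") > 3, 1).otherwise(0))
    .select("a", "b", "status")
)

df = df.join(matches, on=["a", "b"], how="left")  # attach the 0/1 status to each row of df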