Split PySpark dataframes by number of records - pyspark

I'm working on a PySpark dataframe with around 100000 records and I want to create new dataframes of around 20000 records each. How can I achieve this?

It can be made dynamic, but here is a lazy way to do it:
# Creates a random DF with 100000 rows
from pyspark.sql import functions as F
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window
df = spark.range(0, 100000).withColumn('rand_col', F.rand()).drop('id')
w = Window().orderBy(lit('A'))
df = df.withColumn("index", row_number().over(w))  # creates an index column used to split the DF
df1 = df.filter(F.col('index') < 20001)
df2 = df.filter((F.col('index') >= 20001) & (F.col('index') < 40001))
df3 = df.filter((F.col('index') >= 40001) & (F.col('index') < 60001))
df4 = df.filter((F.col('index') >= 60001) & (F.col('index') < 80001))
df5 = df.filter((F.col('index') >= 80001) & (F.col('index') < 100001))
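To make it dynamic, the same index column can be reused to build the splits in a loop; a minimal sketch, assuming a target chunk size of 20000 (chunk_size, n_chunks and splits are illustrative names):
from pyspark.sql import functions as F
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

chunk_size = 20000  # assumed target number of rows per split
w = Window().orderBy(lit('A'))
indexed = df.withColumn("index", row_number().over(w))
n_chunks = (indexed.count() + chunk_size - 1) // chunk_size  # ceiling division
# one dataframe per chunk of `chunk_size` consecutive index values
splits = [
    indexed.filter(
        (F.col("index") > i * chunk_size) & (F.col("index") <= (i + 1) * chunk_size)
    )
    for i in range(n_chunks)
]
df.randomSplit([1.0] * 5) is another option when the pieces only need to be roughly equal in size.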

Related

Pyspark set values based on column's condition

I have this dataframe:
import pandas as pd
data = [['tom','kik',1], ['nick','ken', 1], ['juli','ryan', 2]]
df = pd.DataFrame(data, columns=['Name','Name2', 'stat'])
df = spark.createDataFrame(df)
I need to apply this transformation to the two columns (if stat == 1, then Name and Name2 should both become 'toto'), so that the result looks like:
data = [['toto','toto', 1], ['toto','toto',1], ['juli','juli', 2]]
df = pd.DataFrame(data, columns=['Name','Name2', 'stat'])
df = spark.createDataFrame(df)
from pyspark.sql.functions import col, when
condition = (col("stat") == 1)
# .otherwise(...) keeps the original value when stat != 1; without it those rows become null
new_df = df.withColumn("Name", when(condition, "toto").otherwise(col("Name"))) \
           .withColumn("Name2", when(condition, "toto").otherwise(col("Name2")))
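If the same rule ever has to cover more columns, a single select keeps it to one projection; a small sketch, assuming only Name and Name2 need rewriting (cols_to_set is an illustrative name):
from pyspark.sql.functions import col, when

cols_to_set = ["Name", "Name2"]  # columns the rule applies to (assumption)
new_df = df.select(
    *[when(col("stat") == 1, "toto").otherwise(col(c)).alias(c) for c in cols_to_set],
    "stat",
)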

Alternative for df.withColumn in PySpark?

I am converting some long columns to timestamps using withColumn and the code below, but I notice that calling withColumn inside a for loop leads to poor query planning performance.
Is there another way I can achieve this?
from pyspark.sql import functions as F

list_of_columns = [...]  # list of 20 columns
for col in list_of_columns:
    df = df.withColumn(col, F.from_utc_timestamp(F.from_unixtime(df[col] / 1000), "UTC"))
You can put that into a single select statement:
no_change_columns = [col for col in df.columns if col not in list_of_columns]
df = df.select(
    *no_change_columns,
    *(
        F.from_utc_timestamp(F.from_unixtime(F.col(col) / 1000), "UTC").alias(col)
        for col in list_of_columns
    ),
)
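If you are on Spark 3.3 or later, DataFrame.withColumns accepts a dict of column expressions and applies them all in a single projection, which also avoids the plan growth of repeated withColumn calls; a sketch under that version assumption:
from pyspark.sql import functions as F

# Spark 3.3+: all columns replaced in one projection
df = df.withColumns(
    {
        c: F.from_utc_timestamp(F.from_unixtime(F.col(c) / 1000), "UTC")
        for c in list_of_columns
    }
)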

join parameterization in pyspark

I need to parameterize the join condition, and the joining columns should be passed in from the CLI (I'm reading them from the command prompt in PySpark).
My code is:
x1 = col(argv[1])
x2 = col(argv[2])
df = df1.join(df2, (df1.x1 == df2.x2))
This is how I call the script:
join.py empid empid
I get this error:
df has no such columns.
Any ideas on how to solve this?
Follow this approach; it will work even if your dataframes join on columns with the same name. The problem with df1.x1 is that Spark looks for a column literally named x1, so look the column up by the variable's value with df1[x1] instead.
# Simulating the CLI arguments: join.py empid empid
argv = ['join.py', 'empid', 'empid']
x1 = argv[1]
x2 = argv[2]
df1 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
df2 = spark.createDataFrame([(1, "A"), (2, "B")], ("empid", "c2"))
# join on the columns named by the variables, not on literal columns x1/x2
df = df1.join(df2, df1[x1] == df2[x2])
df.show()
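When the two arguments name the same column, joining on the column name itself keeps a single copy of that column in the result instead of a duplicate pair; a small sketch on top of the answer above:
# join on the shared name when both arguments are equal, otherwise on the expression
if x1 == x2:
    df = df1.join(df2, on=x1)
else:
    df = df1.join(df2, df1[x1] == df2[x2])
df.show()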

Adding column to pyspark dataframe based on conditions from other pyspark dataframes

I am currently learning PySpark and am working on adding columns to PySpark dataframes based on conditions from other dataframes.
I have tried working with UDFs but I am getting errors like:
TypeError: 'object' object has no attribute '__getitem__'
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType, StringType, FloatType
from pyspark.sql.functions import pandas_udf, PandasUDFType

# first dataframe (superset)
df1 = simple_example1
# second dataframe
df = diff_cols.dropna()

def func(x, y):
    z = df1[(df1['a'] == x) & (df1['b'] <= (y + 10000000000)) & (df1['b'] >= (y - 10000000000))]
    z = z[(z["c"] == 1) | (z["d"] == 1)]
    z = z[(z["e"] != 0) | (z["f"] != 0) | (z["g"] != 0) | (z["h"] != 0)]
    return 1 if z.count() > 3 else 0

udf_func = udf(func, IntegerType())
df = df.withColumn('status', udf_func(df['a'], df['b']))
What I am trying to do is as follows:
1. For each row of df, filter df1 to rows where column a equals the value of a in df and column b is between b - 10 and b + 10
2. Then filter that data further to rows where either c or d equals 1
3. Then filter further to rows where any of e, f, g, h is non-zero
4. Then count the number of rows in the resulting subset and assign 0/1
5. Return this 0/1 in a status column of df
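A Spark DataFrame such as df1 cannot be referenced from inside a UDF (the function runs on the executors, where no DataFrame operations are available), which is why this approach fails. One way to express the five steps without a UDF is a join followed by an aggregation; a minimal sketch, assuming columns a and b together identify a row of df and using the same 10000000000 window as the posted code:
from pyspark.sql import functions as F

# steps 2 and 3 do not depend on df, so df1 can be pre-filtered once
df1_f = (
    df1.filter((F.col("c") == 1) | (F.col("d") == 1))
       .filter((F.col("e") != 0) | (F.col("f") != 0) | (F.col("g") != 0) | (F.col("h") != 0))
)

# step 1: match on a and keep df1 rows whose b is within the window around df.b
joined = df.alias("l").join(
    df1_f.alias("r"),
    (F.col("l.a") == F.col("r.a"))
    & (F.col("r.b").between(F.col("l.b") - 10000000000, F.col("l.b") + 10000000000)),
    "left",
)

# steps 4 and 5: count matches per df row and turn the count into a 0/1 status
counts = joined.groupBy("l.a", "l.b").agg(F.count("r.a").alias("n_matches"))
result = (
    df.join(counts, on=["a", "b"], how="left")
      .withColumn("status", F.when(F.col("n_matches") > 3, 1).otherwise(0))
      .drop("n_matches")
)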

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two Spark Scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparisons, as in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))
The solution for this already exists in a PySpark version, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same logic in Spark Scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns

// build one equality per column pair, then AND them all together
val joinExprs = columnsdf1
  .zip(columnsdf2)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

val dfJoinRes = df1.join(df2, joinExprs)
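For completeness, the same map/reduce idea in PySpark can be written with functools.reduce; a sketch, assuming the columns to compare line up positionally between the two dataframes:
from functools import reduce
from pyspark.sql import functions as F

# build one equality per column pair and AND them together
join_expr = reduce(
    lambda acc, pair: acc & (F.col(f"l.{pair[0]}") == F.col(f"r.{pair[1]}")),
    zip(df1.columns, df2.columns),
    F.lit(True),
)
df_join_res = df1.alias("l").join(df2.alias("r"), join_expr)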