Union list of pyspark dataframes - pyspark

Let's say I have a list of pyspark dataframes: [df1, df2, ...], what I want is to union them (so actually do df1.union(df2).union(df3).... What's the best practice to achieve that?

you could use the reduce and pass the union function along with the list of dataframes.
import pyspark
from functools import reduce
list_of_sdf = [df1, df2, ...]
final_sdf = reduce(pyspark.sql.dataframe.DataFrame.unionByName, list_of_sdf)
the final_sdf will have the appended data.

Related

Where are the PySpark docs' DataFrames df, df2, df3 etc. defined?

In the PySpark docs, I see many examples working on sample DataFrames like df4 here.
Where are they defined? I'd like to see them in full to better understand the docs.
They are defined in _test() method in Class GroupedData(...)
from pyspark.sql import Row
df4 = sc.parallelize([Row(course="dotNET", year=2012, earnings=10000),
Row(course="Java", year=2012, earnings=20000),
Row(course="dotNET", year=2012, earnings=5000),
Row(course="dotNET", year=2013, earnings=48000),
Row(course="Java", year=2013, earnings=30000)]).toDF()

How to get the common value by comparing two pyspark dataframes

I am migrating the pandas dataframe to pyspark. I have two dataframes in pyspark with different counts. The below code I am able to achieve in pandas but not in pyspark. How to compare the 2 dataframes values in pyspark and put the value as new column in df2.
def impute_value (row,df_custom):
for index,row_custom in df_custom.iterrows():
if row_custom["Identifier"] == row["IDENTIFIER"]:
row["NEW_VALUE"] = row_custom['CUSTOM_VALUE']
return row["NEW_VALUE"]
df2['VALUE'] = df2.apply(lambda row: impute_value(row, df_custom),axis =1)
How can I convert this particular function to pyspark dataframe? In pyspark, I cannot pass the row wise value to the function(impute_value).
I tried the following.
df3= df2.join(df_custom, df2["IDENTIFIER"]=df_custom["Identifier"],"left")
df3.WithColumnRenamed("CUSTOM_VALUE","NEW_VALUE")
This is not giving me the result.
the left join itself should do the needful
import pyspark.sql.functions as f
df3= df2.join(df_custom.withColumnRenamed('Identifier','Id'), df2["IDENTIFIER"]=df_custom["Id"],"left")
df3=df3.withColumn('NEW_VALUE',f.col('CUSTOM_VALUE')).drop('CUSTOM_VALUE','Id')

Spark Data frame select nothing

How to retrieve nothing out of a spark dataframe.
I need something like this,
df.where("1" === "2")
I needed this so that I can do a left join with another dataframe.
Basically I am trying to avoid the data skewing while joining two dataframes by splitting the null and not null key columns and joining them separately and then do a union them.
df1 has 300M records out of which 200M records has Null keys.
df2 has another 300M records.
So to join them, I am splitting the df1 containing null and not null keys separately and then join them with df2. so to join the null key dataframe with df2, I don't need any records from df2.
I can just add the columns from df2 to null key df1,
but curious to see if we have something like this in spark
df.where("1" === "2")
As we do in RDBMS SQLs.
There many different ways, like limit:
df.limit(0)
where with Column:
import org.apache.spark.sql.functions._
df.where(lit(false))
where with String expression:
df.where("false")
1 = 2 expressed as
df.where("1 = 2")
or
df.where(lit(1) === lit(2))
would work as well, but are more verbose than required.
where function calls filter function at the internal level so you can use filter as
import org.apache.spark.sql.functions._
df.filter(lit(1) === lit(2))
or
import org.apache.spark.sql.functions._
df.filter(expr("1 = 2"))
or
df.filter("1 = 2")
or
df.filter("false")
or
import org.apache.spark.sql.functions._
df.filter(lit(false))
Any expression that would return false in the filter function would work.

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here Joining two DataFrames in Spark SQL and selecting columns of only one
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by their numbered index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
val df2cols = df2.columns
val desiredDf2Col = df2cols(1) // the second column
val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
.select($"df1.*",$"df2.$desiredDf2Col")
df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select methods that you want to define in reduce function is similar to Joining two DataFrames in Spark SQL and selecting columns of only one Then you should do the following :
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the second column i.e. Seq(1) should not be same as any of the dataframes column names.
You can select multiple columns as well but remember the bold note above
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

Add a column to an existing dataframe with random fixed values in Pyspark

I'm new to Pyspark and I'm trying to add a new column to my existing dataframe. The new column should contain only 4 fixed values (e.g. 1,2,3,4) and I'd like to randomly pick one of the values for each row.
How can I do that?
Pyspark dataframes are immutable, so you have to return a new one (e.g. you can't just assign to it the way you can with Pandas dataframes). To do what you want use a udf:
from pyspark.sql.functions import udf
import numpy as np
df = <original df>
udf_randint = udf(np.random.randint(1, 4))
df_new = df.withColumn("random_num": udf_randint)