Where are the PySpark docs' DataFrames df, df2, df3 etc. defined? - pyspark

In the PySpark docs, I see many examples working on sample DataFrames like df4 here.
Where are they defined? I'd like to see them in full to better understand the docs.

They are defined in the _test() method of the GroupedData class:
from pyspark.sql import Row
df4 = sc.parallelize([Row(course="dotNET", year=2012, earnings=10000),
                      Row(course="Java", year=2012, earnings=20000),
                      Row(course="dotNET", year=2012, earnings=5000),
                      Row(course="dotNET", year=2013, earnings=48000),
                      Row(course="Java", year=2013, earnings=30000)]).toDF()

Related

Union list of pyspark dataframes

Let's say I have a list of pyspark dataframes: [df1, df2, ...]. What I want is to union them (so actually do df1.union(df2).union(df3)...). What's the best practice to achieve that?
You could use reduce and pass the union function along with the list of dataframes:
import pyspark
from functools import reduce
list_of_sdf = [df1, df2, ...]
final_sdf = reduce(pyspark.sql.dataframe.DataFrame.unionByName, list_of_sdf)
The final_sdf will contain the appended data.
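A self-contained sketch of the same pattern, with small made-up DataFrames standing in for df1, df2, ... (the column names and values are assumptions for illustration). Note that unionByName matches columns by name, whereas union matches them by position:

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example frames; replace with your own list.
df1 = spark.createDataFrame([(1, "a")], ["id", "value"])
df2 = spark.createDataFrame([(2, "b")], ["id", "value"])
df3 = spark.createDataFrame([(3, "c")], ["id", "value"])

list_of_sdf = [df1, df2, df3]
final_sdf = reduce(DataFrame.unionByName, list_of_sdf)
final_sdf.show()  # rows from all three frames, appended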

Is it possible to reference a PySpark DataFrame using its rdd id?

If I "overwrite" a df using the same naming convention in PySpark such as in the example below, am I able to reference it later on using the rdd id?
df = spark.createDataFrame([('Abraham','Lincoln')], ['first_name', 'last_name'])
df.checkpoint()
print(df.show())
print(df.rdd.id())
from pyspark.sql.functions import *
df = df.select(df.first_name, df.last_name, concat_ws(' ', df.first_name, df.last_name).alias('full_name'))
df.checkpoint()
print(df.show())
print(df.rdd.id())

How to get the common value by comparing two pyspark dataframes

I am migrating a pandas dataframe to pyspark. I have two dataframes in pyspark with different row counts. The code below does what I need in pandas, but I have not been able to reproduce it in pyspark. How can I compare the values of the two dataframes in pyspark and add the matched value as a new column in df2?
def impute_value(row, df_custom):
    for index, row_custom in df_custom.iterrows():
        if row_custom["Identifier"] == row["IDENTIFIER"]:
            row["NEW_VALUE"] = row_custom['CUSTOM_VALUE']
            return row["NEW_VALUE"]

df2['VALUE'] = df2.apply(lambda row: impute_value(row, df_custom), axis=1)
How can I convert this particular function to work with a pyspark dataframe? In pyspark, I cannot pass the row-wise value to the function (impute_value).
I tried the following:
df3 = df2.join(df_custom, df2["IDENTIFIER"] == df_custom["Identifier"], "left")
df3.withColumnRenamed("CUSTOM_VALUE", "NEW_VALUE")
This is not giving me the result.
The left join itself should do what you need. Note that after withColumnRenamed you have to join against the renamed frame, not the original df_custom:
import pyspark.sql.functions as f
df_custom_renamed = df_custom.withColumnRenamed('Identifier', 'Id')
df3 = df2.join(df_custom_renamed, df2["IDENTIFIER"] == df_custom_renamed["Id"], "left")
df3 = df3.withColumn('NEW_VALUE', f.col('CUSTOM_VALUE')).drop('CUSTOM_VALUE', 'Id')
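A minimal runnable sketch of the join-based approach, using small hypothetical frames for df2 and df_custom (the sample identifiers and values are assumptions for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the real frames.
df2 = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["IDENTIFIER", "VALUE"])
df_custom = spark.createDataFrame([("A", 100), ("C", 300)], ["Identifier", "CUSTOM_VALUE"])

df_custom_renamed = df_custom.withColumnRenamed("Identifier", "Id")
df3 = (df2.join(df_custom_renamed, df2["IDENTIFIER"] == df_custom_renamed["Id"], "left")
          .withColumn("NEW_VALUE", f.col("CUSTOM_VALUE"))
          .drop("CUSTOM_VALUE", "Id"))
df3.show()  # NEW_VALUE is null for rows with no match in df_custom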

scala: select column where not contains elements into dataframe

I have this line of code that should create a dataframe from a list of columns that do not contain a string. I tried this but it doesn't work:
val exemple = hiveObj.sql("show tables in database").select("tableName")!==="ABC".collect()
Try using the filter method:
import org.apache.spark.sql.functions._
import spark.implicits._
val exemple = hiveObj.sql("your query here").filter($"columnToFilter" =!= "ABC").show
NOTE: the inequality operator =!= is only available in Spark 2.0.0+. If you're using an older version, you must use !== instead. See the Spark documentation for details.
If you need to filter several columns you can do so:
.filter($"columnToFilter" =!= "ABC" and $"columnToFilter2" =!= "ABC")
Another alternative answer to my question:
val exemple1 = hiveObj.sql("show tables in database").filter(!$"tableName".contains("ABC")).show()

how to use createDataFrame to create a pyspark dataframe?

I know this is probably a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
You have to create a SparkSession instance using the builder pattern and use it to create the dataframe; see
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark = SparkSession.builder.getOrCreate()
(In your snippet, createDataFrame is called on the SparkSession class itself, so rows is bound to the self parameter and the data argument really is missing.)
Below are the steps to create a pyspark dataframe using createDataFrame.
Create a SparkSession:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create the data and column names:
columns = ["language", "users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
First approach, creating the DataFrame from an RDD:
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
Second approach, creating the DataFrame directly from the data:
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong, createDataFrame() takes two lists as input: the first list is the data and the second list is the column names. The data must be a list of tuples, where each tuple is a row of the dataframe.
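Putting the pieces together, a minimal sketch that fixes the original snippet: instantiate a SparkSession first, and pass each row as a one-element tuple (the column name "value" is an assumption for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [(1,), (2,), (3,)]                     # each tuple is one row
df = spark.createDataFrame(rows, ["value"])   # second argument gives the column name(s)
df.printSchema()
df.show()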