PySpark UDF function with data frame query?

I have another solution (using pandas), but I would prefer to do it with PySpark 2.3.
I have a two dimensional PySpark data frame like this:
Date | ID
---------- | ----
08/31/2018 | 10
09/31/2018 | 10
09/01/2018 | null
09/01/2018 | null
09/01/2018 | 12
I want to replace null ID values by looking for the closest previous non-null value, or, if that is also null, by looking forward (and if it is still null, by setting a default value).
My idea was to add a new column with .withColumn and use a UDF that queries the data frame itself.
Something like this in pseudocode (not perfect, but it is the main idea):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def return_value(value, date):
    if value is not None:
        return value
    # look backward: take the value of the closest row at or before this date
    value1 = df.filter(df['date'] <= date).select(df['value']).collect()
    if value1[0][0] is not None:
        return value1[0][0]
    # otherwise look forward
    value2 = df.filter(df['date'] >= date).select(df['value']).collect()
    return value2[0][0]

value_udf = udf(return_value, StringType())
new_df = df.withColumn("new_value", value_udf(df.value, df.date))
But it does not work. Am I going about this completely the wrong way? Is it even possible to query a Spark data frame inside a UDF? Did I miss an easier solution?

Create a new dataframe that has one column, the unique list of all dates:
datesDF = yourDF.select('Date').distinct()
Create another one that consists of dates and IDs, but only the rows where there are no nulls. Also, let's keep only the first (whichever comes first) occurrence of ID for each date (judging from your example you can have multiple rows per date):
noNullsDF = yourDF.dropna().dropDuplicates(subset=['Date'])
Let's now join those two so that we have the list of all dates with whatever value we have for each (or null):
joinedDF = datesDF.join(noNullsDF, 'Date', 'left')
Now, for every date, get the value of ID from the previous date and the next date using window functions, and also let's rename our ID column so that later there will be fewer problems with the join:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
w = Window.orderBy('Date')
joinedDF = (joinedDF
    .withColumn('previousID', f.lag('ID').over(w))
    .withColumn('nextID', f.lead('ID').over(w))
    .withColumnRenamed('ID', 'newID'))
Now let's join it back to our original dataframe by date:
yourDF = yourDF.join(joinedDF, 'Date', 'left')
Now our dataframe has 4 ID columns:
the original ID
newID - the ID of some non-null row for the given date, if there is one, otherwise null
previousID - the ID from the previous date (non-null if there is one, otherwise null)
nextID - the ID from the next date (non-null if there is one, otherwise null)
Now we need to combine them into finalID in order:
the original value, if not null
the value for the current date, if any non-nulls are present (this contrasts with your question, but your pandas code suggests you use <= on the date check), if that result is not null
the value for the previous date, if it's not null
the value for the next date, if it's not null
some default value
We do it simply by coalescing:
default = 0
finalDF = yourDF.select('Date',
                        'ID',
                        f.coalesce('ID',
                                   'newID',
                                   'previousID',
                                   'nextID',
                                   f.lit(default)).alias('finalID'))
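Putting the steps together, here is a minimal end-to-end sketch of the above (same column names as in the example; treat it as untested):
from pyspark.sql.window import Window
from pyspark.sql import functions as f

default = 0
w = Window.orderBy('Date')

datesDF = yourDF.select('Date').distinct()
noNullsDF = yourDF.dropna().dropDuplicates(subset=['Date'])

joinedDF = (datesDF.join(noNullsDF, 'Date', 'left')
    .withColumn('previousID', f.lag('ID').over(w))
    .withColumn('nextID', f.lead('ID').over(w))
    .withColumnRenamed('ID', 'newID'))

finalDF = (yourDF.join(joinedDF, 'Date', 'left')
    .select('Date', 'ID',
            f.coalesce('ID', 'newID', 'previousID', 'nextID',
                       f.lit(default)).alias('finalID')))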

Related

PySpark Code Modification to Remove Nulls

I received help with the following PySpark code to prevent errors when doing a Merge in Databricks; see here:
Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way
I was wondering if I could get help to modify the code to drop NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code that you are using does not completely delete the rows where P_key is null. It applies the row number to the null values as well, and wherever the row number is 1 for a null P_key, that row does not get deleted.
You can use df.na.drop instead to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
To make your approach work, you can do the following. Add a row with the least possible unique id value, store this id in a variable, use the same code, and add an additional condition to the filter as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_Key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove added id row.
df3 = df2.filter((df2.rn == 1) & (df2.P_key!=dup_id)).drop("rn")
df3.show()
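For reference, a minimal sketch that simply drops the null-keyed rows first and then applies your original deduplication (same column names as in your question; untested):
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df3 = (partdf.na.drop(subset=["P_key"])
       .withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
       .filter("rn = 1")
       .drop("rn"))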

Dynamic boolean join in pyspark

I have two pyspark dataframes with the same schema, as below:
df_source:
id, name, age
df_target:
id,name,age
"id" is primary column in both the tables and rest are attribute columns
i am accepting primary and attribute column list from user as below-
primary_columns = ["id"]
attribute_columns = ["name","age"]
I need to join the above two dataframes dynamically, as below:
df_update = df_source.join(
    df_target,
    (df_source["id"] == df_target["id"]) &
    ((df_source["name"] != df_target["name"]) | (df_source["age"] != df_target["age"])),
    how="inner"
).select([df_source[col] for col in df_source.columns])
How can I build this join condition dynamically in pyspark, since the number of attribute and primary key columns can change per the user input? Please help.
IIUC, you can achieve the desired output using an inner join on the primary_columns and a where clause that loops over the attribute_columns.
Since the two DataFrames have the same column names, use alias to differentiate column names after the join.
from functools import reduce
from pyspark.sql.functions import col
df_update = (df_source.alias("s")
    .join(df_target.alias("t"), on=primary_columns, how="inner")
    .where(
        reduce(
            lambda a, b: a | b,
            [col("s." + c) != col("t." + c) for c in attribute_columns]
        )
    )
    .select("s.*"))
Use reduce to apply the bitwise OR operation on the columns in attribute_columns.
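For example, with attribute_columns = ["name", "age"], the reduce call builds the same condition you wrote by hand (a small illustrative sketch):
from functools import reduce
from pyspark.sql.functions import col

attribute_columns = ["name", "age"]
cond = reduce(lambda a, b: a | b,
              [col("s." + c) != col("t." + c) for c in attribute_columns])
# cond is equivalent to:
# (col("s.name") != col("t.name")) | (col("s.age") != col("t.age"))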

Pyspark remove columns with 10 null values

I am new to PySpark.
I have read a parquet file. I only want to keep columns that have at least 10 non-null values.
I have used describe to get the count of non-null records for each column.
How do I now extract the column names that have fewer than 10 values and then drop those columns before writing to a new file?
df = spark.read.parquet(file)
col_count = df.describe().filter("summary == 'count'")
You can convert it into a dictionary and then filter out the keys (column names) based on their values (count < 10; the count is a StringType() which needs to be converted to int in the Python code):
# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')
# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()
# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]
# drop bad columns
df_new = df.drop(*bad_cols)
Some notes:
Use #pault's approach if the information cannot be retrieved directly from df.describe() or df.summary(), etc.
You need to drop() rather than select() columns, since describe()/summary() only include numeric and string columns; selecting columns from a list derived from df.describe() would lose columns of TimestampType(), ArrayType(), etc.
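If you do need non-null counts for every column type (which describe()/summary() won't give you), a possible sketch is to aggregate count() per column directly; this is an assumed alternative, not the approach referenced above:
from pyspark.sql import functions as F

# F.count() ignores nulls and works for columns of any type
counts = df.agg(*[F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
bad_cols = [c for c, v in counts.items() if v < 10]
df_new = df.drop(*bad_cols)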

variable not binding value in Spark

I am passing a variable, but it is not passing the value.
I populate the variable value here:
val Temp = sqlContext.read.parquet("Tabl1.parquet")
Temp.registerTempTable("temp")
val year = sqlContext.sql("""select value from Temp where name="YEAR"""")
year.show()
Here year.show() displays the proper value.
I am passing the parameter here, in the code below:
val data = sqlContext.sql("""select count(*) from Table where Year='$year' limit 10""")
data.show()
The value year is a DataFrame, not a specific value (Int or Long). So when you use it inside a string interpolation, you get the result of DataFrame.toString, which isn't something you can compare values to (the toString returns a string representation of the DataFrame's schema).
If you can assume the year DataFrame has a single Row with a single column of type Int, and you want to get the value of that column, you can use first().getAs[Int](0) to get that value and then use it to construct your next query:
val year: DataFrame = sqlContext.sql("""select value from Temp where name="YEAR"""")
// get the first column of the first row:
val actualYear: Int = year.first().getAs[Int](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")
If the value column in the Temp table has a different type (String, Long), just replace Int with that type.

Select values from a dataframe column

I would like to calculate the difference between two values from within the same column. Right now I just want the difference between the last value and the first value; however, using last(column) returns a null result. Is there a reason last() would not be returning a value? Is there a way to pass the position of the values I want as variables; e.g. the 10th and the 1st, or the 7th and the 6th?
Current code
Using Spark 1.4.0 and Scala 2.11.6
myDF = some dataframe with n rows by m columns
def difference(col: Column): Column = {
  last(col) - first(col)
}

def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1"),
    difference($"Column2"),
    difference($"Column3"),
    difference($"Column4")
  )
}
When I run diffCalcs(myDF) it returns a null result. If I modify difference to only have first(col), it does return the first value for the four columns. However, if I change it to last(col), it returns null. If I call myDF.show(), I can see that all of the columns have Double values on every row; there are no null values in any of the columns.
After updating to Spark 1.5.0, I was able to use the code snippet provided in the question and it worked; updating Spark was what ultimately fixed it. Just for completeness, I have included the code that I used after updating the Spark version:
def difference(col: Column): Column = {
  last(col) - first(col)
}

def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1").alias("newColumn1"),
    difference($"Column2").alias("newColumn2"),
    difference($"Column3").alias("newColumn3"),
    difference($"Column4").alias("newColumn4")
  )
}