How does a UDF work in PySpark with dates as arguments?

I started in the PySpark world some time ago and I'm racking my brain with an algorithm. Initially I want to create a function that calculates the difference in months between two dates. I know there is a built-in function for that (months_between), but it works a little differently from what I want: I want to extract the months from the two dates and subtract them without taking the days into account, only the month and the year. The point is, I can do this by manipulating the DataFrame, creating new columns with the months and subtracting, but I want to do it as a UDF, like below:
from datetime import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

base_study = spark.createDataFrame([("1", "2009-01-31", "2007-01-31"), ("2", "2009-01-31", "2011-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A", f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B", f.to_date(base_study["B"], 'yyyy-MM-dd'))

def intckSasFunc(RecentDate, PreviousDate):
    RecentDate = f.month("RecentDate")
    PreviousDate = f.month("PreviousDate")
    months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
    return months_diff

intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())

base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A')))
What am I doing wrong?
Another question: when I pass parameters to a UDF, are they sent one by one, or is the entire column passed? And is that column a Series?
Thank you!

I found a solution and upgraded it to handle missing values too.
from datetime import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

base_study = spark.createDataFrame([("1", None, "2015-01-01"), ("2", "2015-01-31", "2015-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A", f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B", f.to_date(base_study["B"], 'yyyy-MM-dd'))

def intckSasFunc(RecentDate, PreviousDate):
    if RecentDate is not None and PreviousDate is not None:
        return (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
    else:
        return None

intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())

display(base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A'))))
For those who had the same doubts I had: the UDF processes one record at a time, as if it were a normal Python function. I couldn't use pyspark.sql functions inside the UDF; it raises an error. It seems those functions operate only on PySpark columns, while inside the UDF the transformation happens row by row.
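Since the UDF receives plain Python datetime.date objects one row at a time, the per-row logic can be checked outside Spark entirely. A minimal stand-alone sketch of what the UDF does (no Spark needed; the function name is illustrative):

```python
from datetime import date

def intck_months(recent, previous):
    # Calendar-month difference, ignoring the day component
    if recent is None or previous is None:
        return None
    return (recent.year - previous.year) * 12 + (recent.month - previous.month)

print(intck_months(date(2009, 1, 31), date(2007, 1, 31)))  # 24
print(intck_months(None, date(2015, 1, 1)))                # None
```

As a side note, the same result can usually be computed without a UDF at all, e.g. with the column expression (f.year('B') - f.year('A')) * 12 + (f.month('B') - f.month('A')), which avoids the row-by-row Python overhead.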

Related

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function that applies OLS regression and returns just the model parameters. I used groupby and applyInPandas, but it's taking too much time. Is there a more efficient way to work around this?
Note: I didn't really need a groupby, but since I cannot use applyInPandas without one, I created a dummy feature 'group' with the same value 1 for every row.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)

result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])

def ols(pdf):
    y_column = ['z']
    x_column = ['x', 'y']
    y = pdf[y_column]
    X = pdf[x_column]
    model = sm.OLS(y, X).fit()
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table

# adding a new column to apply groupby
df = df.withColumn('group', lit(1))
# applying the function
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628
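As a sanity check on those coefficients: with only two features and no intercept (sm.OLS without add_constant), the fit reduces to solving the 2x2 normal equations, which can be done in plain Python. This is a verification sketch only, not a Spark replacement:

```python
x = [3, 6, 2, 0, 1, 5, 2, 3, 4, 5]
y = [0, 1, 2, 0, 1, 5, 2, 3, 4, 5]
z = [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]

# Normal equations for z ~ a*x + b*y (no intercept):
# [sum(x*x) sum(x*y)] [a]   [sum(x*z)]
# [sum(x*y) sum(y*y)] [b] = [sum(y*z)]
sxx = sum(i * i for i in x)
sxy = sum(i * j for i, j in zip(x, y))
syy = sum(j * j for j in y)
sxz = sum(i * k for i, k in zip(x, z))
syz = sum(j * k for j, k in zip(y, z))

det = sxx * syy - sxy * sxy
a = (sxz * syy - sxy * syz) / det
b = (sxx * syz - sxy * sxz) / det
print(round(a, 9), round(b, 9))  # 0.183246073 0.770680628
```

This reproduces the sample output above exactly, which confirms the applyInPandas pipeline is computing a plain no-intercept OLS per group.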

Pyspark dataframe filter based on in between values

I have a Pyspark dataframe with below values -
[Row(id='ABCD123', score='28.095238095238095'), Row(id='EDFG456', score='36.2962962962963'), Row(id='HIJK789', score='37.56218905472637'), Row(id='LMNO1011', score='36.82352941176471')]
I want only the rows from the DF whose score falls between the input score value and the input score value + 1. Say the input score value is 36; then I want the output DF with only two ids, EDFG456 and LMNO1011, as their scores fall between 36 and 37. I achieved this as follows:
from pyspark.sql.functions import substring

input_score_value = 36
input_df = my_df.withColumn("score_num", substring(my_df.score, 1, 2))
output_matched = input_df.filter(input_df.score_num == input_score_value)
print(output_matched.take(5))
The above code gives the output below, but it takes too long to process 2 million rows. Is there a better way to do this to reduce the response time?
[Row(id='EDFG456', score='36.2962962962963'), Row(id='LMNO1011',score='36.82352941176471')]
You can use the function floor.
from pyspark.sql.functions import floor

output_matched = my_df.filter(floor(my_df.score) == input_score_value)
print(output_matched.take(5))
It should be much faster than substring. Let me know.
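The idea behind the floor comparison can be checked in plain Python on the sample rows from the question (no Spark required; math.floor plays the role of pyspark's floor):

```python
import math

rows = [("ABCD123", 28.095238095238095), ("EDFG456", 36.2962962962963),
        ("HIJK789", 37.56218905472637), ("LMNO1011", 36.82352941176471)]
input_score_value = 36

# floor(score) == 36 is exactly "36 <= score < 37"
matched = [rid for rid, score in rows if math.floor(score) == input_score_value]
print(matched)  # ['EDFG456', 'LMNO1011']
```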

How to get the time difference between 2 timestamps

I have tried the code below and it gives me -55 as the difference between the two timestamps above. It should give me 5 minutes. Is there any direct function to get the correct time difference in PySpark?
import pyspark.sql.functions as F
# ts1 := 2019-11-07T22:00:00.000+0000
# ts2 := 2019-11-07T21:55:00.000+0000
df.withColumn("time_diff", F.minute("time_stamp") - F.minute("time_stamp2"))
It doesn't give me the correct answer. Please help.
You can use the following function to get the time difference in seconds:
from pyspark.sql.functions import *
diff_secs_col = col("time_stamp").cast("long") - col("time_stamp2").cast("long")
Then do the math to get in minutes.
Can you try this?
import pyspark.sql.functions as F
import pyspark.sql.types as Types
df = df.withColumn('t1_unix', F.unix_timestamp(df.t1, "yyyy-MM-dd'T'HH:mm:ss.SSS"))
df = df.withColumn('t2_unix', F.unix_timestamp(df.t2, "yyyy-MM-dd'T'HH:mm:ss.SSS"))
df = df.withColumn('diff', ((df.t1_unix-df.t2_unix)/60).cast(Types.IntegerType()))
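Both answers above boil down to the same idea: convert each timestamp to seconds since the epoch, subtract, then divide by 60. That logic can be verified in plain Python with the two timestamps from the question (timezone offset dropped for brevity):

```python
from datetime import datetime

ts1 = datetime.fromisoformat("2019-11-07T22:00:00")
ts2 = datetime.fromisoformat("2019-11-07T21:55:00")

# Same idea as unix_timestamp()/cast("long"): seconds difference, then minutes
diff_minutes = int((ts1 - ts2).total_seconds() / 60)
print(diff_minutes)  # 5
```

The original code fails because F.minute extracts only the minute-of-hour component (0 and 55 here), so subtracting gives -55 instead of the elapsed time.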

How to get keys from pyspark SparseVector

I conducted a tf-idf transform and now I want to get the keys and values from the result.
I am using the following udf code to get values:
def extract_values_from_vector(vector):
    return vector.values.tolist()

extract_values_from_vector_udf = udf(lambda vector: extract_values_from_vector(vector), ArrayType(DoubleType()))
extract = rescaledData.withColumn("extracted_values", extract_values_from_vector_udf("features"))
So if the sparsevector looks like:
features=SparseVector(123241, {20672: 4.4233, 37393: 0.0, 109847: 3.7096, 118474: 5.4042})
extracted_values in my extract will look like:
[4.4233, 0.0, 3.7096, 5.4042]
My question is, how can I get the keys in the SparseVector dictionary? Such as keys = [20672, 37393, 109847, 118474] ?
I am trying the following code, but it doesn't work:
def extract_keys_from_vector(vector):
    return vector.indices.tolist()

extract_keys_from_vector_udf = spf.udf(lambda vector: extract_keys_from_vector(vector), ArrayType(DoubleType()))
The result it gave me is: [null,null,null,null]
Can someone help?
Many thanks in advance!
Since the answer is in the comments above, I thought that I would take this time (while waiting to write a parquet of course) to write down the answer.
from pyspark.sql.types import *
from pyspark.sql import functions as F

def extract_keys_from_vector(vector):
    return vector.indices.tolist()

# Declaring the element type as IntegerType is the fix: the indices are
# integers, and declaring them as DoubleType is what produced the nulls.
feature_extract = F.UserDefinedFunction(lambda vector: extract_keys_from_vector(vector), ArrayType(IntegerType()))
df = df.withColumn("features", feature_extract(F.col("features")))
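Conceptually, the SparseVector printed above is an {index: value} mapping, so the keys/values split can be pictured in plain Python. This is a stand-in dict for illustration, not the real pyspark.ml.linalg type:

```python
# Stand-in for the SparseVector's {index: value} contents from the question
features = {20672: 4.4233, 37393: 0.0, 109847: 3.7096, 118474: 5.4042}

keys = sorted(features)               # analogous to vector.indices.tolist()
values = [features[k] for k in keys]  # analogous to vector.values.tolist()
print(keys)    # [20672, 37393, 109847, 118474]
print(values)  # [4.4233, 0.0, 3.7096, 5.4042]
```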

Using PySpark Imputer on grouped data

I have a Class column which can be 1, 2 or 3, and another column Age with some missing data. I want to impute the average Age of each Class group.
I want to do something along:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imputer.fit(grouped_data)
Is there any workaround to that?
Thanks for your time
Using Imputer, you can filter down the dataset to each Class value, impute the mean, and then join them back, since you know ahead of time what the values can be:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

subsets = []
for i in range(1, 4):
    imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
    subset_df = df.filter(col('Class') == i)
    imputed_subset = imputer.fit(subset_df).transform(subset_df)
    subsets.append(imputed_subset)

# Union them together
# If you only have 3, just do it without a loop
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, get the average values for each group as a DataFrame, and then coalesce join that back onto your original dataframe.
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))
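The group-mean-then-coalesce logic is easy to verify outside Spark. A plain-Python sketch with made-up sample rows (column names match the question):

```python
rows = [
    {"Class": 1, "Age": 20.0}, {"Class": 1, "Age": None}, {"Class": 1, "Age": 30.0},
    {"Class": 2, "Age": 40.0}, {"Class": 2, "Age": None},
]

# Per-class average of the non-missing ages (the groupBy/avg step)
averages = {}
for c in {r["Class"] for r in rows}:
    ages = [r["Age"] for r in rows if r["Class"] == c and r["Age"] is not None]
    averages[c] = sum(ages) / len(ages)

# Fill each missing Age from its class average (the coalesce step)
imputed = [r["Age"] if r["Age"] is not None else averages[r["Class"]] for r in rows]
print(imputed)  # [20.0, 25.0, 30.0, 40.0, 40.0]
```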
You need to transform your dataframe with the fitted model, then take the average of the filled data:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)

transformed_df \
    .groupBy('Class') \
    .agg(F.avg('Age'))