Using PySpark Imputer on grouped data - pyspark

I have a Class column which can be 1, 2 or 3, and another column Age with some missing data. I want to Impute the average Age of each Class group.
I want to do something along:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imputer.fit(grouped_data)
Is there any workaround to that?
Thanks for your time

Using Imputer, you can filter down the dataset to each Class value, impute the mean, and then join them back, since you know ahead of time what the values can be:
subsets = []
for i in range(1, 4):
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
subset_df = df.filter(col('Class') == i)
imputed_subset = imputer.fit(subset_df).transform(subset_df)
subsets.append(imputed_subset)
# Union them together
# If you only have 3 just do it without a loop
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, get the average values for each group as a DataFrame, and then coalesce join that back onto your original dataframe.
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))

You need to transform your dataframe with fitted model. Then take average of filled data:
from pyspark.sql import functions as F
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)
transformed_df \
.groupBy('Class') \
.agg(F.avg('Age'))

Related

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function for applying OLS regression and just getting the model parameters. I used groupby and applyInPandas but it's taking too much of time. Is there are more efficient way to work around this?
Note: I din't had to use groupby as all features have many levels but as I cannot use applyInPandas without it so I created a dummy feature as 'group' having the same value as 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
pdf = pd.DataFrame({
'x':[3,6,2,0,1,5,2,3,4,5],
'y':[0,1,2,0,1,5,2,3,4,5],
'z':[2,1,0,0,0.5,2.5,3,4,5,6]})
df = sqlContext.createDataFrame(pdf)
result_schema =StructType([
StructField('index',StringType()),
StructField('coef',DoubleType())
])
def ols(pdf):
y_column = ['z']
x_column = ['x', 'y']
y = pdf[y_column]
X = pdf[x_column]
model = sm.OLS(y, X).fit()
param_table =pd.DataFrame(model.params, columns = ['coef']).reset_index()
return param_table
#adding a new column to apply groupby
df = df.withColumn('group', lit(1))
#applying function
data = df.groupby('group').applyInPandas(ols, schema = result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628

PySpark Cosine Similarity between two vectors of TF-IDF values the Cosine Similarity using SparseMatrix + koalas or Pandas API on Spark

I do try to implement this Name Matching Cosine Similarity approach/functions get_matches_df in pyspark and pandas_on_spark(koalas) and struggling with optimizing this function (I do try to avoid conversion toPandas() for dataframes because will overload driver so I want to optimize this function and to scale it, so basically a batch approach will work perfect as in this example, or use pandas_udfs or simple UDFs that takes 1 vector and 2 dataframes:
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
... return pdf[pdf.a > 1] # allow arbitrary length
...
>>> psdf.pandas_on_spark.apply_batch(pandas_plus)
this is the function I do work on optimizing (everything else I converted and created custom tfidfvectorizer, scaling cosine, pyspark sparsematrix generator and all I have left to optimize is this part (because uses loc and not sure how does work, I don't mind to have it behave as pandas aka all dataframe to driver but ideally will be
def get_matches_df(sparse_matrix, name_vector, top=100):
non_zeros = sparse_matrix.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]
if top:
nr_matches = top
else:
nr_matches = sparsecols.size
left_side = np.empty([nr_matches], dtype=object)
right_side = np.empty([nr_matches], dtype=object)
similairity = np.zeros(nr_matches)
for index in range(0, nr_matches):
left_side[index] = name_vector[sparserows[index]]
right_side[index] = name_vector[sparsecols[index]]
similairity[index] = sparse_matrix.data[index]
return pd.DataFrame({'left_side': left_side,
'right_side': right_side,
'similairity': similairity})

How UDF function works in pyspark with dates as arguments?

I started in the pyspark world some time ago and I'm racking my brain with an algorithm, initially I want to create a function that calculates the difference of months between two dates, I know there is a function for that (months_between), but it works a little bit different from what I want, I want to extract the months from two dates and subtract without taking into account the days, only the month and the year, the point is, I can do this by manipulating base, creating new columns with the months and subtracting , but I want to do this as a UDF function, like below:
from datetime import datetime
import pyspark.sql.functions as f
base_study = spark.createDataFrame([("1", "2009-01-31", "2007-01-31"),("2","2009-01-31","2011-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A",f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B",f.to_date(base_study["B"], 'yyyy-MM-dd'))
def intckSasFunc(RecentDate, PreviousDate):
RecentDate = f.month("RecentDate")
PreviousDate = f.month("PreviousDate")
months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
return months_diff
intckSasFuncUDF = f.udf(intckSasFunc, IntegerType())
base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A') ))
What I'm doing wrong ?
Another question: When I pass parameters in a UDF function, they sent one by one or it pass entire column? And this column is a series?
Thank you!
I found a solution and upgraded it to handle missings too.
from datetime import datetime
import pyspark.sql.functions as f
base_study = spark.createDataFrame([("1", None, "2015-01-01"),("2","2015-01-31","2015-01-31")], ['ID', 'A', 'B'])
base_study = base_study.withColumn("A",f.to_date(base_study["A"], 'yyyy-MM-dd'))
base_study = base_study.withColumn("B",f.to_date(base_study["B"], 'yyyy-MM-dd'))
def intckSasFunc(RecentDate, PreviousDate):
if (PreviousDate and RecentDate) is not None:
months_diff = (RecentDate.year - PreviousDate.year) * 12 + (RecentDate.month - PreviousDate.month)
return months_diff
else:
return None
intckSasFuncUDF = f.udf(lambda x,y:intckSasFunc(x,y) , IntegerType())
display(base_study.withColumn('Result', intckSasFuncUDF(f.col('B'), f.col('A'))))
for those who have doubts, as I had, the function treats one record at a time, as if it were a normal python function, I couldn't use pyspark.sql functions inside this UDF, it gives an error, it seems, these functions are used only in pypsark columns, and inside the UDF the transformation is row by row.

Percentile rank in pyspark using QuantileDiscretizer

I am wondering if it's possible to obtain the result of percentile_rank using the QuantileDiscretizer transformer in pyspark.
The purpose is that I am trying to avoid computing the percent_rank over the entire column, as it generates the following error:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma
X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count()+1,\
inputCol="x",\
outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),\
F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values.__getitem__("maxval"), agg_values.__getitem__("minval")
normalize = F.udf(lambda x: (x-xmin)/(xmax-xmin))
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer")- F.col("perc_rank")),6) )
print(df.select(F.max("error")).show())
df.show(5)
However, it seems that increasing the number of datapoints the error grows, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percentile_rank?
Alternatively is there a way to compute percentile_rank over an entire column in an efficient way?
Well you can use the below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
but nonetheless, the data would still be moved to a single worker, I don't think so there is any better alternative to avoid the shuffle here since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.

Imputing the dataset with mean of class label causing crash in filter operation

I have a csv file that contains numeric values.
val row = withoutHeader.map{
line => {
val arr = line.split(',')
for (h <- 0 until arr.length){
if(arr(h).trim == ""){
val abc = avgrdd.filter {case ((x,y),z) => x == h && y == arr(dependent_col_index).toDouble} //crashing here
arr(h) = //imputing with the value above
}
}
arr.mkString(",")
}
}
This is a snippet of the code where I am trying to impute the missing values with the mean of class labels.
avgrdd contains the average for the key value pairs where key is column index and the class label value. This avgrdd is calculated using the combiners which I see is calculating the results correctly.
dependent_col_index is the column containing the class labels.
The line with filter is crashing with the null pointer exception.
On removing this line the original array is the output (comma separated).
I am confused why the filter operation is causing a crash.
Please suggest on how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
mean for class 1 is 84/4 = 21 and for class 0 is 29/2 = 14.5
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks !!
You are trying to execute a RDD transformation inside of another RDD transformation. Remember that you cannot use RDD inside of another RDD transformation, this would cause an error.
The way to proceed is the following:
Transform the source RDD withoutHeader to the RDD of pairs <Class, Value> of the corrent type (Long in your case). Cache it
Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>
Join withoutHeader RDD and avgrdd together - this way for each row you would have a structure <Class, <Value, AvgValue>>
Execute map on top of the result to replace missing Value with AvgValue
Another option might be to split the RDD in two parts on step 3 (one part - RDD with missing values, second one - RDD with non-missing values), join the avgrdd only with the RDD containing only missing values and after that make a union between this two parts. It would be faster if you have a small fraction of missing values