Getting model parameters from a regression in PySpark in an efficient way for large data

I have created a function that applies OLS regression and returns only the model parameters. I used groupby and applyInPandas, but it is taking too much time. Is there a more efficient way to do this?
Note: I didn't need groupby, since all features have many levels, but applyInPandas cannot be used without it, so I created a dummy column 'group' that has the constant value 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)
result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])
def ols(pdf):
    y_column = ['z']
    x_column = ['x', 'y']
    y = pdf[y_column]
    X = pdf[x_column]
    # sm.OLS does not add an intercept unless a constant column is supplied
    model = sm.OLS(y, X).fit()
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table
# adding a new column to apply groupby
df = df.withColumn('group', lit(1))
# applying function
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628
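One possible direction, not taken from the original post: since sm.OLS(y, X) here fits without an intercept, the coefficients solve the normal equations (X'X) b = X'z, and both X'X and X'z are plain sums over rows. In Spark those sums could presumably be computed with ordinary aggregations, so that only a handful of numbers reach the driver instead of the whole dataset landing in one pandas group. The sketch below only checks the arithmetic in plain Python on the toy data above; the helper name ols_from_sums is hypothetical, and it assumes exactly two regressors.

```python
# Sketch: OLS without an intercept, computed from five running sums.
# Assumption: two regressors (x, y) and target z, as in the toy data above.
rows = list(zip(
    [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],       # x
    [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],       # y
    [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6],   # z
))

def ols_from_sums(rows):
    """Solve the 2x2 normal equations (X'X) b = X'z via Cramer's rule."""
    sxx = sxy = syy = sxz = syz = 0.0
    for x, y, z in rows:
        sxx += x * x
        sxy += x * y
        syy += y * y
        sxz += x * z
        syz += y * z
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("regressors are collinear; X'X is singular")
    coef_x = (sxz * syy - sxy * syz) / det
    coef_y = (sxx * syz - sxy * sxz) / det
    return {"x": coef_x, "y": coef_y}

params = ols_from_sums(rows)
print(params)
```

On this data the result agrees with the coefficients in the output sample above. With more regressors the same idea applies, but the small X'X system would be solved with a linear algebra routine rather than by hand.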

Related

Inconsistent behavior of pyspark code depending on order of line execution

I am running the code through Jupyter on EMR, with PySpark version 3.3.0.
I have two dataframes that I have preprocessed with the pyspark.ml.feature functions (OneHotEncoder, StringIndexer, VectorAssembler). The first dataframe, let's call it df_good, has 5 features; the second dataframe, let's call it df_bad, omits 2 of the features from df_good. The underlying dataset used to generate the two datasets is the same, and the code that generates them is identical (other than the two features not being included in the VectorAssembler inputCols for df_bad).
Below is the code I am using to train the model:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.classification import LogisticRegression
def split_array(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)
def train_model(df):
    train_df = df.selectExpr("label as label", "features as features")
    logit = LogisticRegression()
    logit = logit.setFamily("multinomial")
    logit_mod = logit.fit(train_df)
    df = logit_mod.transform(df)
    df = df.withColumn("pred", split_array(F.col("probability"))[0])
    return df
Here is where things get weird.
If I run the code below it works and runs in 10-20 seconds each:
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
If I change the order, the code completely hangs on df_bad:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
The data is unchanged, the code is the same, the behavior is always the same.
Any thoughts are appreciated.

PySpark cosine similarity between two vectors of TF-IDF values using SparseMatrix + koalas or Pandas API on Spark

I am trying to implement the get_matches_df function from this name-matching cosine similarity approach in PySpark and pandas-on-Spark (koalas), and I am struggling to optimize it. I want to avoid converting dataframes with toPandas(), because that will overload the driver, so I want to optimize this function and make it scale. A batch approach like the one in this example would work perfectly, as would pandas_udfs or simple UDFs that take 1 vector and 2 dataframes:
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> psdf.pandas_on_spark.apply_batch(pandas_plus)
This is the function I am working on optimizing. I have converted everything else and created a custom tfidfvectorizer, scaling cosine, and a pyspark sparsematrix generator; all I have left to optimize is this part, because it uses loc and I am not sure how that works. I don't mind if it behaves like pandas (i.e. brings the whole dataframe to the driver), but ideally will be
import numpy as np
import pandas as pd
def get_matches_df(sparse_matrix, name_vector, top=100):
    # row and column indices of the non-zero similarity entries
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})

Percentile rank in pyspark using QuantileDiscretizer

I am wondering if it's possible to obtain the result of percentile_rank using the QuantileDiscretizer transformer in pyspark.
The reason is that I am trying to avoid computing percent_rank over the entire column, as it generates the following warning:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
The method I am following is to first use QuantileDiscretizer then normalize to [0,1]:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma
X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count() + 1,
                         inputCol="x",
                         outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max(df["q_discretizer"]).alias("maxval"),
                    F.min(df["q_discretizer"]).alias("minval")).collect()[0]
xmax, xmin = agg_values["maxval"], agg_values["minval"]
# rescale the bucket index to [0, 1]
normalize = F.udf(lambda x: (x - xmin) / (xmax - xmin))
df = df.withColumn("perc_discretizer", normalize("q_discretizer"))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer") - F.col("perc_rank")), 6))
df.select(F.max("error")).show()
df.show(5)
However, the error seems to grow as the number of data points increases, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percentile_rank?
Alternatively is there a way to compute percentile_rank over an entire column in an efficient way?
Well you can use the below to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
Nonetheless, the data would still be moved to a single worker. I don't think there is a better alternative that avoids the shuffle here, since the complete column needs to be rank-ordered.
In case you have multiple windows over the same column, you can try to pre-partition the data and then apply the ranking functions.
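To make the "complete column needs to be rank-ordered" point concrete, here is a small plain-Python sketch of the usual percent_rank definition: (rank - 1) / (n - 1), where tied values share the lowest rank. The function name is only illustrative and this is not Spark code; it just shows that each row's result depends on the position of its value in the ordering of the whole column.

```python
from bisect import bisect_left

def percent_rank(values):
    """Return (rank - 1) / (n - 1) for each value; ties share the lowest rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        # a single row has nothing to be ranked against
        return [0.0 for _ in values]
    # bisect_left counts the values strictly smaller than v, which is rank - 1
    return [bisect_left(ordered, v) / (n - 1) for v in values]

ranks = percent_rank([10, 20, 20, 40])
print(ranks)
```

The global sort is the expensive step; once the column is sorted, each row's value is a cheap lookup.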

Using PySpark Imputer on grouped data

I have a Class column which can be 1, 2 or 3, and another column Age with some missing data. I want to impute the average Age of each Class group.
I want to do something along these lines:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imputer.fit(grouped_data)
Is there any workaround to that?
Thanks for your time
Using Imputer, you can filter down the dataset to each Class value, impute the mean, and then join them back, since you know ahead of time what the values can be:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col
subsets = []
for i in range(1, 4):
    imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
    subset_df = df.filter(col('Class') == i)
    imputed_subset = imputer.fit(subset_df).transform(subset_df)
    subsets.append(imputed_subset)
# Union them together
# If you only have 3 just do it without a loop
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, get the average value for each group as a DataFrame, join that back onto your original dataframe, and coalesce the two columns.
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))
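As a rough illustration of what the groupBy, join and coalesce steps compute, here is the same logic in plain Python on a few made-up rows. The sample data is hypothetical and this is not Spark code.

```python
from collections import defaultdict

# Hypothetical sample rows; None marks a missing Age.
rows = [
    {"Class": 1, "Age": 20.0},
    {"Class": 1, "Age": None},
    {"Class": 1, "Age": 30.0},
    {"Class": 2, "Age": 40.0},
    {"Class": 2, "Age": None},
]

# Step 1: per-class average over the non-missing ages (the groupBy/avg step).
totals = defaultdict(lambda: [0.0, 0])
for r in rows:
    if r["Age"] is not None:
        totals[r["Class"]][0] += r["Age"]
        totals[r["Class"]][1] += 1
averages = {cls: total / count for cls, (total, count) in totals.items()}

# Step 2: keep the original Age when present, otherwise fall back to the
# class average (the join + coalesce step). A class with no known ages
# stays missing.
imputed = [
    {**r, "imputedAge": r["Age"] if r["Age"] is not None else averages.get(r["Class"])}
    for r in rows
]
print([r["imputedAge"] for r in imputed])
```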
You need to transform your dataframe with the fitted model, then take the average of the filled data:
from pyspark.sql import functions as F
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)
transformed_df \
.groupBy('Class') \
.agg(F.avg('Age'))

How to give predicted and label columns in BinaryClassificationMetrics evaluation for Naive Bayes model

I am confused about the inputs to BinaryClassificationMetrics (MLlib). As of Apache Spark 1.6.0, we need to pass predictedandlabel of type RDD[(Double, Double)], built from the transformed DataFrame, which has predicted, probability (vector) and rawPrediction (vector) columns.
I have created an RDD[(Double, Double)] from the predicted and label columns. After running the BinaryClassificationMetrics evaluation on the NaiveBayesModel, I am able to retrieve ROC, PR, etc., but the values are limited and I am not able to plot the curve from them: ROC contains 4 values and PR contains 3 values.
Is this the right way of preparing predictedandlabel, or do I need to use the rawPrediction or probability column instead of the predicted column?
Prepare like this:
import org.apache.spark.mllib.linalg.Vector
// fit/transform on a DataFrame come from the ml package, not mllib
import org.apache.spark.ml.classification.NaiveBayes
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)
val preds = predictions.select("probability", "label").rdd.map(row =>
  (row.getAs[Vector](0)(0), row.getAs[Double](1)))
And evaluate:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
new BinaryClassificationMetrics(preds, 10).roc
If the predictions are only 0 or 1, the number of buckets can be lower, as in your case. Try more complex data like this:
val anotherPreds = df1.select(rand(), $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc
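The reason hard 0/1 predictions give so few curve points can be sketched in plain Python (not Scala, and not Spark's implementation): a ROC curve has at most one point per distinct score used as a threshold, plus whatever endpoints the library adds. The function name and sample data below are hypothetical.

```python
def roc_points(score_and_labels):
    """Return (FPR, TPR) points: (0, 0) plus one point per distinct score."""
    pos = sum(1 for _, label in score_and_labels if label == 1.0)
    neg = len(score_and_labels) - pos
    points = [(0.0, 0.0)]
    # each distinct score acts as a threshold: predict positive if score >= t
    for t in sorted({score for score, _ in score_and_labels}, reverse=True):
        tp = sum(1 for s, label in score_and_labels if s >= t and label == 1.0)
        fp = sum(1 for s, label in score_and_labels if s >= t and label != 1.0)
        points.append((fp / neg, tp / pos))
    return points

# Hard 0/1 predictions: only two distinct thresholds are available.
hard = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
# Probability-like scores: one threshold per distinct score.
soft = [(0.9, 1.0), (0.7, 0.0), (0.4, 1.0), (0.2, 0.0)]

hard_curve = roc_points(hard)
soft_curve = roc_points(soft)
print(len(hard_curve), len(soft_curve))
```

With hard predictions the curve collapses to a few points no matter how many rows there are, which is why passing the probability column as the score gives a curve that can actually be plotted.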