Issue/Bug when loading and applying MultilayerPerceptronClassifier in Spark Version 3.0.0 - pyspark

IllegalArgumentException: MultilayerPerceptronClassifier_... parameter solver given invalid value auto
I believe I have discovered a bug when loading MultilayerPerceptronClassificationModel in spark 3.0.0, scala 2.1.2 which I have tested and can see is not there in at least Spark 2.4.3, Scala 2.11. .
I am using pyspark on a databricks cluster and importing the library “from pyspark.ml.classification import MultilayerPerceptronClassificationModel”
When running model=MultilayerPerceptronClassificationModel.(“load”) and then model. transform (df) I get the following error: IllegalArgumentException: MultilayerPerceptronClassifier_8055d1368e78 parameter solver given invalid value auto.
This issue can be easily replicated by running the example given on the spark documents: http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier
Then adding a save model, load model and transform statement as such:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Load training data
data = spark.read.format("libsvm")\
.load("data/mllib/sample_multiclass_classification_data.txt")
# Split the data into train and test
splits = data.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 3 (classes)
layers = [4, 5, 4, 3]
# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
# train the model
model = trainer.fit(train)
# compute accuracy on the test set
result = model.transform(test)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
from pyspark.ml.classification import MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel
model.save(Save_location)
model2. MultilayerPerceptronClassificationModel.load(Save_location)
result_from_loaded = model2.transform(test)

Bug has been confirmed Jira opened: : https://issues.apache.org/jira/browse/SPARK-32232

Related

How to predict label for new input values using artificial neural network in python

I am new in machine learning. I am making a Streamlit app for multiclass classification using artificial neural network. My question is about the ANN model, not about the Streamlit. I know I can use MLPClassifier, but I want to build and train my own model. So, I used the following code to analyze the following data.-
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dropout
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import plot_roc_curve, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV
df=pd.read_csv("./Churn_Modelling.csv")
#Drop Unwanted features
df.drop(columns=['Surname','RowNumber','CustomerId'],inplace=True)
df.head()
#Label Encoding of Categ features
df['Geography']=df['Geography'].map({'France':0,'Spain':1,'Germany':2})
df['Gender']=df['Gender'].map({'Male':0,'Female':1})
#Input & Output selection
X=df.drop('Exited',axis=1)
Y = df['Exited']
Y = df['Exited'].map({'yes':1, 'no':2, 'maybe':3})
#train test split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=12,stratify=Y)
#scaling
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
Y_train = ss.fit_transform(Y_train)
X_test=ss.transform(X_test)
# build a model
#build ANN
model=Sequential()
model.add(Dense(units=30,activation='relu',input_shape=(X.shape[1],)))
model.add(Dropout(rate = 0.2))
model.add(Dense(units=18,activation='relu'))
model.add(Dropout(rate = 0.1))
model.add(Dense(units=1,activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
#create callback : -
cb=EarlyStopping(
monitor="val_loss", #val_loss means testing error
min_delta=0.00001, #value of lambda
patience=15,
verbose=1,
mode="auto", #minimize loss #maximize accuracy
baseline=None,
restore_best_weights=False
)
trained_model=model.fit(X_train,Y_train,epochs=10,
validation_data=(X_test,Y_test),
callbacks=cb,
batch_size=10
)
model.evaluate(X_train,Y_train)
print("Training accuracy :",model.evaluate(X_train,Y_train)[1])
print("Training loss :",model.evaluate(X_train,Y_train)[0])
model.evaluate(X_test,Y_test)
print("Testing accuracy :",model.evaluate(X_test,Y_test)[1])
print("Testing loss :",model.evaluate(X_test,Y_test)[0])
y_pred_prob=model.predict(X_test)
y_pred=np.argmax(y_pred_cv, axis=-1)
print(classification_report(Y_test,y_pred))
print(confusion_matrix(Y_test,y_pred))
plt.figure(figsize=(7,5))
sns.heatmap(confusion_matrix(Y_test,y_pred),annot=True,cmap="OrRd_r",
fmt="d",cbar=True,
annot_kws={"fontsize":15})
plt.xlabel("Actual Result")
plt.ylabel("Predicted Result")
plt.show()
Then, I will save the model either by using pickle as follows-
# pickle_out = open("./my_model.pkl", mode = "wb")
# pickle.dump(my_model, pickle_out)
# pickle_out.close()
or as follows-
model.save('./my_model.h5')
Now, I want to predict the label (i.e. 'yes', 'no', 'maybe' etc.) of output variable 'Existed' based on new input values (as shown in the following table) that will be provided by an user -
.
My question is that how should I save and load the model followed by predicting the labels for 'Existed' variable, so that it will automatically fill up the empty cell of Exited column with respective labels (i.e. 'yes', 'no', 'maybe' etc.).
I will appreciate your insightful comments on this post.
Once you have your model trained, you can simply run model.predict with the data you wish to predict on. Tricky parts of this process involve making sure this data is the right shape and that the indices match up.
I typically use this recipe:
Note that the features need to be in the exact same shape and order that the model was trained with.
to_predict = df[features]
predictions = model.predict(
to_predict.to_numpy().reshape(-1, len(features))
)
predictions should be the same length as to_predict and it will be an np.array. You can get this back into a DataFrame with the same indices as to_predict by using
predictions = pd.DataFrame(
predictions,
columns="predicted_value", # Anything you want
index=to_predict.index,
)
In your case, this should give values of 0, 1, 2. You will need to map these values back to 'yes', 'no', 'maybe'. To avoid overcomplicating things, you can just use a map on this new DataFrame:
predictions["predicted_value"] = predictions["predicted_value"].map({0: 'yes', 1: 'no', 2: 'maybe'})
Now we need to merge these predictions back with the original df:
df = df.merge(
predictions, left_index=True, right_index=True, how="outer"
)

RandomForestClassifier has no attribute transform, so how to get predictions?

How do you get predictions out of a RandomForestClassifier? Loosely following the latest docs here, my code looks like...
# Split the data into training and test sets (30% held out for testing)
SPLIT_SEED = 64 # some const seed just for reproducibility
TRAIN_RATIO = 0.75
(trainingData, testData) = df.randomSplit([TRAIN_RATIO, 1-TRAIN_RATIO], seed=SPLIT_SEED)
print(f"Training set ({trainingData.count()}):")
trainingData.show(n=3)
print(f"Test set ({testData.count()}):")
testData.show(n=3)
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="labels", featuresCol="features", numTrees=36)
rf.fit(trainingData)
#print(rf.featureImportances)
preds = rf.transform(testData)
When running this, I get the error
AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Examining the python api docs, I see nothing that look like it relates to generating predictions from the trained model (nor feature importance for that matter). Not much experience with mllib, so not sure what to make of this. Anyone with more experience know what to do here?
by looking closely to the documentation
>>> model = rf.fit(td)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
you will notice the rf.fit return fitted models which is different than the original RandomForestClassifier class.
And the model will have the method to transform and also feature importance
so in your code
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="labels", featuresCol="features", numTrees=36)
model = rf.fit(trainingData)
#print(rf.featureImportances)
preds = model.transform(testData)

Deep decision tree in PySpark

I am using PySpark for machine learning and I want to train decision tree classifier, random forest and gradient boosted trees. I want to try out different maximum depth values and select the best one via grid search and cross-validation. However, Spark is telling me that DecisionTree currently only supports maxDepth <= 30. What is the reason to limit it to 30? Is there a way to increase it? I am using it with text data and my feature vectors are TF-IDFs, so I want to try higher values for the maximum depth. Sample code from the Spark website with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
featuresCol="indexedFeatures", numTrees=500)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
outputCol="predictedLabel",
labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
paramGrid_rf = ParamGridBuilder() \
.addGrid(rf.maxDepth, [50,100,150,250,300]) \
.build()
crossval_rf = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid_rf,
evaluator=BinaryClassificationEvaluator(),
numFolds= 5)
cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html
...the current implmentation imposes a restriction of maxDepth <= 30:
https://github.com/apache/spark/blob/ca6955858cec868c878a2fd8528dbed0ef9edd3f/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L137
You could ask to increase that limit in github forum!

PySpark PCA: avoiding NotConvergedException

I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) using PCA through the ml.linalg method as follows:
1) Named my columns as one list:
features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in{'indi_nbr','label'}]).columns
2) Imported the necessary libraries
from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector
3) Collapsed the features to a DenseVector
indi_feat = indi_prep_df.rdd.map(lambda x: (x[0], x[-1], DenseVector(x[1:-2]))).toDF(['indi_nbr','label','features'])
4) Dropped everything but the features to retain index:
dftest = indi_feat.drop('indi_nbr','label')
5) Instantiated the PCA object
dfPCA = PCAML(k=3, inputCol="features", outputCol="pcafeats")
6) And attempted to fit the model
PCAout = dfPCA.fit(dftest)
But my model fails to converge (error below).
Things I've tried:
- Mean-filling or zero-filling NA and Null values (as appropriate)
- Reducing the number of features (to 25, then I switched to SKlearn's PCA)
Py4JJavaError: An error occurred while calling o2242.fit.
: breeze.linalg.NotConvergedException:
at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:110)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:40)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.svd$.apply(svd.scala:23)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:389)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
My configuration is for 50 executors with 6GB/executor, so I don't think it's a matter of not having enough resources (and I don't see anything about resources here).
My input factors are a mixture of percentages, integers and 2-decimal floats, all positive and all ordinal. Could that be causing difficulty with convergence?
I had no trouble with the SKLearn method converging, and quickly, once I converted the PySpark DF to a Pandas DF.

How to give predicted and label columns in BinaryClassificationMetrics evaluation for Naive Bayes model

I have a confusion regarding BinaryClassificationMetrics (Mllib) inputs. As per Apache Spark 1.6.0, we need to pass predictedandlabel of Type (RDD[(Double,Double)]) from transformed DataFrame that having predicted, probability(vector) & rawPrediction(vector).
I have created RDD[(Double,Double)] from Predicted and label columns. After performing BinaryClassificationMetrics evaluation on NavieBayesModel, I'm able to retrieve ROC, PR etc. But the values are limited, I can't able plot the curve using the value generated from this. Roc contains 4 values and PR contains 3 value.
Is it the right way of preparing PredictedandLabel or do I need to use rawPrediction column or Probability column instead of Predicted column?
Prepare like this:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)
val preds = predictions.select("probability", "label").rdd.map(row =>
(row.getAs[Vector](0)(0), row.getAs[Double](1)))
And evaluate:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
new BinaryClassificationMetrics(preds, 10).roc
If predictions are only 0 or 1 number of buckets can be lower like in your case. Try more complex data like this:
val anotherPreds = df1.select(rand(), $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc