How to get support numbers in pyspark.ml like we get in sklearn classification report?

I am using PySpark and am able to get metrics like accuracy, f1, precision, and recall from MulticlassClassificationEvaluator, but I am not sure how to get the support numbers like we get in sklearn's classification report. rfc_pred in my case holds the group of each class that I run in a loop. So, will rfc_pred.count() do the trick?
Below is my current code:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(rfc_pred, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(rfc_pred, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedRecall"})
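In sklearn, the support column of classification_report is just the number of true instances of each class, and rfc_pred.count() only gives the total row count across all classes. A minimal sketch of one way to get per-class support in PySpark (assuming rfc_pred is the predictions DataFrame with the label column used above):
support_by_class = rfc_pred.groupBy("label").count()
support_by_class.show()
# rfc_pred.count() returns the total row count, i.e. the sum of all supports
total = rfc_pred.count()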

Related

How to predict label for new input values using artificial neural network in python

I am new to machine learning. I am making a Streamlit app for multiclass classification using an artificial neural network. My question is about the ANN model, not about Streamlit. I know I can use MLPClassifier, but I want to build and train my own model. So, I used the following code to analyze the following data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dropout
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import plot_roc_curve, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV
df=pd.read_csv("./Churn_Modelling.csv")
#Drop Unwanted features
df.drop(columns=['Surname','RowNumber','CustomerId'],inplace=True)
df.head()
#Label Encoding of Categ features
df['Geography']=df['Geography'].map({'France':0,'Spain':1,'Germany':2})
df['Gender']=df['Gender'].map({'Male':0,'Female':1})
#Input & Output selection
X=df.drop('Exited',axis=1)
Y = df['Exited'].map({'yes':1, 'no':2, 'maybe':3})
#train test split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=12,stratify=Y)
#scaling
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)  # the target labels should not be scaled
# build the ANN
model = Sequential()
model.add(Dense(units=30, activation='relu', input_shape=(X.shape[1],)))
model.add(Dropout(rate=0.2))
model.add(Dense(units=18, activation='relu'))
model.add(Dropout(rate=0.1))
model.add(Dense(units=1, activation='sigmoid'))
# a single sigmoid output pairs with binary cross-entropy rather than categorical
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# create the early-stopping callback
cb = EarlyStopping(
    monitor="val_loss",        # stop based on the validation loss
    min_delta=0.00001,         # minimum change that counts as an improvement
    patience=15,
    verbose=1,
    mode="auto",               # minimize loss / maximize accuracy
    baseline=None,
    restore_best_weights=False
)
trained_model = model.fit(X_train, Y_train, epochs=10,
                          validation_data=(X_test, Y_test),
                          callbacks=[cb],  # callbacks are passed as a list
                          batch_size=10)
train_loss, train_acc = model.evaluate(X_train, Y_train)
print("Training accuracy :", train_acc)
print("Training loss :", train_loss)
test_loss, test_acc = model.evaluate(X_test, Y_test)
print("Testing accuracy :", test_acc)
print("Testing loss :", test_loss)
y_pred_prob = model.predict(X_test)
# threshold the sigmoid probabilities at 0.5; argmax over a single output column would always return 0
y_pred = (y_pred_prob > 0.5).astype(int)
print(classification_report(Y_test, y_pred))
print(confusion_matrix(Y_test, y_pred))
plt.figure(figsize=(7,5))
sns.heatmap(confusion_matrix(Y_test, y_pred), annot=True, cmap="OrRd_r",
            fmt="d", cbar=True,
            annot_kws={"fontsize": 15})
# rows of the confusion matrix are the actual classes, columns the predictions
plt.xlabel("Predicted Result")
plt.ylabel("Actual Result")
plt.show()
Then, I will save the model either by using pickle as follows:
# import pickle
# pickle_out = open("./my_model.pkl", mode="wb")
# pickle.dump(model, pickle_out)
# pickle_out.close()
or as follows:
model.save('./my_model.h5')
Now, I want to predict the label (i.e. 'yes', 'no', 'maybe' etc.) of the output variable 'Exited' based on new input values (as shown in the following table) that will be provided by a user:
(table of new input values omitted)
My question is: how should I save and load the model and then predict the labels for the 'Exited' variable, so that it automatically fills the empty cells of the Exited column with the respective labels (i.e. 'yes', 'no', 'maybe' etc.)?
I will appreciate your insightful comments on this post.
Once you have your model trained, you can simply run model.predict with the data you wish to predict on. Tricky parts of this process involve making sure this data is the right shape and that the indices match up.
I typically use this recipe:
Note that the features need to be in the exact same shape and order that the model was trained with.
to_predict = df[features]
predictions = model.predict(
    to_predict.to_numpy().reshape(-1, len(features))
)
predictions will be an np.array of the same length as to_predict. You can get it back into a DataFrame with the same indices as to_predict by using:
predictions = pd.DataFrame(
    predictions,
    columns=["predicted_value"],  # any column name you want
    index=to_predict.index,
)
In your case, this should give values of 0, 1, 2. You will need to map these values back to 'yes', 'no', 'maybe'. To avoid overcomplicating things, you can just use a map on this new DataFrame:
predictions["predicted_value"] = predictions["predicted_value"].map({0: 'yes', 1: 'no', 2: 'maybe'})
Now we need to merge these predictions back with the original df:
df = df.merge(
    predictions, left_index=True, right_index=True, how="outer"
)

Can I draw a bipartite graph from every dataset?

I am trying to draw a bipartite graph for my data set, which is like below:
source target weight
reduce energy 25
reduce consumption 25
energy pennsylvania 4
energy natural 4
consumption balancing 4
The code that I am using to plot the graph is below:
C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target','weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
But when I check whether it is bipartite with the code below, I get False:
nx.is_bipartite(C_2021)
Could you please advise what the issue is?
The previous issue is resolved, but when I plot the bipartite graph with the steps below, I do not get a proper result. I would appreciate it if someone could help:
top_nodes_2021 = set(n for n,d in C_2021.nodes(data=True) if d['bipartite']==0)
top_nodes_2021
the output of the above is:
{'reduce'}
bottom_nodes_2021 = set(C_2021) - top_nodes_2021
bottom_nodes_2021
the output of the above is:
{'balancing', 'consumption', 'energy', 'natural', 'pennsylvania '}
then plot it by:
pos = nx.bipartite_layout(C_2021,top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021, pos, node_color='#A0CBE2', edge_color='black', width=0.2,
        edge_cmap=plt.cm.Blues, with_labels=True)
and the result is not a proper bipartite plot (image omitted).
It works for me using your code. nx.is_bipartite(C_2021) returns True. Check the example below:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
data = StringIO('''source;target;weight
reduce;energy;25
reduce;consumption;25
energy;pennsylvania ;4
energy;natural;4
consumption;balancing;4
''')
df_final_2014 = pd.read_csv(data, sep=";")
C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target','weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
nx.is_bipartite(C_2021)
Finally, to draw the graph, get the bipartite sets. The bipartite attribute you set during creation is unreliable here (i.e. bipartite=0 and bipartite=1): nodes such as 'energy' and 'consumption' appear as both sources and targets, so the second add_nodes_from call overwrites their attribute.
Use the following commands:
from networkx.algorithms import bipartite
top_nodes_2021, bottom_nodes_2021 = bipartite.sets(C_2021)
pos = nx.bipartite_layout(C_2021, top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021, pos, node_color='#A0CBE2', edge_color='black', width=0.2,
        edge_cmap=plt.cm.Blues, with_labels=True)
With the following result: the two node sets drawn in separate columns (image omitted).

Inconsistent behavior of pyspark code depending on order of line execution

I am running the code through jupyter on EMR, pyspark version 3.3.0.
I have two dataframes that I have preprocessed with the pyspark.ml.feature functions (OneHotEncoder, StringIndexer, VectorAssembler). The first dataframe, let's call it df_good, has 5 features; the second dataframe, let's call it df_bad, omits 2 of the features from df_good. The underlying dataset used to generate the two datasets is the same, and the code to generate them is identical (other than the two features not being included in the VectorAssembler inputCols for df_bad).
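For reference, here is a minimal sketch of that kind of preprocessing (my reconstruction, not the asker's actual code; the column names are hypothetical):
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# hypothetical columns: 'cat_col' is categorical, 'num1' and 'num2' are numeric
indexer = StringIndexer(inputCol="cat_col", outputCol="cat_idx")
encoder = OneHotEncoder(inputCols=["cat_idx"], outputCols=["cat_vec"])
assembler = VectorAssembler(inputCols=["cat_vec", "num1", "num2"], outputCol="features")
pipeline = Pipeline(stages=[indexer, encoder, assembler])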
Below is the code I am using to train the model:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.classification import LogisticRegression

def split_array(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

def train_model(df):
    train_df = df.selectExpr("label as label", "features as features")
    logit = LogisticRegression()
    logit = logit.setFamily("multinomial")
    logit_mod = logit.fit(train_df)
    df = logit_mod.transform(df)
    df = df.withColumn("pred", split_array(F.col("probability"))[0])
    return df
Here is where things get weird.
If I run the code below, it works, and each block runs in 10-20 seconds:
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
If I change the order, the code completely hangs on df_bad:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
The data is unchanged, the code is the same, the behavior is always the same.
Any thoughts are appreciated.
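No answer is shown for this thread. One common first debugging step when Spark results depend on execution order (my suggestion, not from the thread) is to force each DataFrame to materialize before fitting, which rules out lazy-evaluation and caching interactions:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = df_bad.cache()
df_bad.count()  # force the read and cache before training
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()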

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function for applying OLS regression and just getting the model parameters. I used groupby and applyInPandas, but it's taking too much time. Is there a more efficient way to work around this?
Note: I didn't really have anything to group by, but since applyInPandas cannot be used without a groupby, I created a dummy feature 'group' with the constant value 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)

result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])

def ols(pdf):
    y_column = ['z']
    x_column = ['x', 'y']
    y = pdf[y_column]
    X = pdf[x_column]
    model = sm.OLS(y, X).fit()
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table

# adding a dummy column so groupby can be applied
df = df.withColumn('group', lit(1))
# applying the function
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628
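No answer is shown for this thread either. If only the coefficients are needed, one option worth trying (my suggestion, not from the original post) is Spark's own distributed LinearRegression, which avoids funnelling every row through a single applyInPandas group:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# assemble the predictors into a single 'features' vector column
assembler = VectorAssembler(inputCols=['x', 'y'], outputCol='features')
train_df = assembler.transform(df).select('features', 'z')

# fitIntercept=False to match sm.OLS called without an added constant
lr = LinearRegression(featuresCol='features', labelCol='z', fitIntercept=False)
lr_model = lr.fit(train_df)
print(lr_model.coefficients)  # one coefficient each for x and y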

Deep decision tree in PySpark

I am using PySpark for machine learning and I want to train a decision tree classifier, a random forest, and gradient-boosted trees. I want to try out different maximum depth values and select the best one via grid search and cross-validation. However, Spark tells me that DecisionTree currently only supports maxDepth <= 30. What is the reason for limiting it to 30? Is there a way to increase it? I am using it with text data and my feature vectors are TF-IDFs, so I want to try higher values for the maximum depth. Here is sample code from the Spark website with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
                             outputCol="indexedLabel").fit(data)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                  maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
                            featuresCol="indexedFeatures", numTrees=500)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
                               outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [50, 100, 150, 250, 300]) \
    .build()

crossval_rf = CrossValidator(estimator=pipeline,
                             estimatorParamMaps=paramGrid_rf,
                             evaluator=BinaryClassificationEvaluator(),
                             numFolds=5)

cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html
...the current implementation imposes a restriction of maxDepth <= 30:
https://github.com/apache/spark/blob/ca6955858cec868c878a2fd8528dbed0ef9edd3f/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L137
You could ask for that limit to be increased on GitHub!
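Until that changes, the grid has to stay within the cap. A minimal adaptation of the code above that searches only the allowed range:
# maxDepth is capped at 30 in Spark's tree implementation,
# so search within the allowed range instead
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [10, 20, 30]) \
    .build()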