pyspark extract ROC curve?

Is there a way to get the points on an ROC curve from Spark ML in PySpark? In the documentation I see an example for Scala but not Python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html
Is that right? I can certainly think of ways to implement it, but I have to imagine it's faster if there's a pre-built function. I'm working with 3 million scores and a few dozen models, so speed matters.

For a more general solution that works for models besides Logistic Regression (like Decision Trees or Random Forests, which lack a model summary), you can get the ROC curve using BinaryClassificationMetrics from Spark MLlib.
Note that the PySpark version doesn't implement all of the methods that the Scala version does, so you'll need to use the .call(name) function from JavaModelWrapper. It also seems that py4j doesn't support parsing scala.Tuple2 instances, so they have to be processed manually.
Example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt

# Create a Pipeline estimator and fit on train DF, predict on test DF
model = estimator.fit(train)
predictions = model.transform(test)

# BinaryClassificationMetrics expects an RDD of (score, label) pairs
preds = predictions.select('label', 'probability') \
    .rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))

# Returns the curve as a list of (false positive rate, true positive rate) points
points = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x_val, y_val)
BinaryClassificationMetrics in Scala implements several other useful methods as well:
metrics = CurveMetrics(preds)
metrics.get_curve('fMeasureByThreshold')
metrics.get_curve('precisionByThreshold')
metrics.get_curve('recallByThreshold')

Since the ROC curve is a plot of FPR against TPR, you can extract the needed values as follows:
your_model.summary.roc.select('FPR').collect()
your_model.summary.roc.select('TPR').collect()
Where your_model could be for example a model you got from something like this:
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression()
your_model = log_reg.fit(df)
Now you just need to plot FPR against TPR, using for example matplotlib.
P.S.
Here is a complete example for plotting the ROC curve using a model named your_model (and anything else!). I've also plotted a reference "random guess" line inside the ROC plot.
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(your_model.summary.roc.select('FPR').collect(),
         your_model.summary.roc.select('TPR').collect())
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
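If you also want the scalar area under the curve alongside the plot, the summary of a binary LogisticRegression model exposes it directly:
print(your_model.summary.areaUnderROC)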

To get ROC metrics for the training data (trained model), we can use your_model.summary.roc, which is a DataFrame with columns FPR and TPR. See Andrea's answer above.
For an ROC evaluated on arbitrary test data, we can pass the label and probability columns to sklearn's roc_curve to get FPR and TPR. Here we assume a binary classification problem where the y score is the probability of predicting 1. See also: How to split Vector into columns - using PySpark, How to convert a pyspark dataframe column to numpy array.
Example
from pyspark.ml.functions import vector_to_array  # Spark 3.0+
from sklearn.metrics import roc_curve

# lr is a pyspark.ml estimator, e.g. LogisticRegression
model = lr.fit(train_df)
test_df_predict = model.transform(test_df)

# Probability of the positive class is element 1 of the probability vector
y_score = test_df_predict.select(vector_to_array("probability")[1]).rdd.keys().collect()
y_true = test_df_predict.select("label").rdd.keys().collect()

fpr, tpr, thresholds = roc_curve(y_true, y_score)
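For completeness, a minimal sketch of plotting the resulting curve with matplotlib, in the same style as the earlier answers:
import matplotlib.pyplot as plt

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'r--')  # random-guess reference line
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()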

Related

How do you predict an outcome using a single value in a multiple logistic regression using statsmodels?

#import the needed pandas module
import pandas as pd
import statsmodels.formula.api as smf
#Upload the contents of an excel file to a DataFrame
df= pd.read_excel("C:/Users/ME/OneDrive/Desktop/weather.xlsx")
#Create a multiple logistic regression model
logRegModel = smf.logit('sunny ~ temp + barom', data = df)
#Fit the data in df into the model
results = logRegModel.fit()
#Print the results summary
print(results.summary())
#plot the scatterplot with the actual data
z = df.sunny
x = df.temp
y = df.barom
#make a prediction for a given temp x and barometer y reading
prediction = results.predict(pd.DataFrame({'temp': [21],'barom':[12]})
prediction.summary_frame(alpha=0.05)
# Creating figure
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (10, 7))
ax = plt.axes(projection ="3d")
# Creating plot
ax.scatter3D(x, y, z, color = "blue")
plt.title("3D scatter plot")
# show plot
plt.show()
I ran the code above. Everything works fine until it hits the code for making a prediction using a single x and a single y value. When I run the code to include:
#make a prediction for a given temp x and barometer y reading
prediction = results.predict(pd.DataFrame({'temp': [21],'barom':[12]})
prediction.summary_frame(alpha=0.05)
I receive the following error:
File "<ipython-input-78-b26a4bf65d01>", line 36
from mpl_toolkits import mplot3d
^
SyntaxError: invalid syntax
This is so incredibly odd. Why does it run perfectly without the two prediction lines above, and then when I include them it tells me a simple import statement is a syntax error? It is my understanding from reading the statsmodels docs that in order to make a prediction for a multiple logistic regression model I need to pass a DataFrame into the predict function. Wasn't this done correctly above? My logistic regression is trying to predict if there is a sunny day from temperature and barometer readings. When I comment out the import statement above and run it, I receive another error on another import statement. This is so strange. Why does it not accept my import statements? I ran the code on multiple IDEs and receive the same results. Thank you everyone in advance.
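The culprit is the unclosed parenthesis in the predict call: Python keeps reading the following lines as part of the same expression, so the SyntaxError is reported at the next import statement. A minimal sketch of the fixed lines (note also that predict() on a fitted Logit results object returns the predicted probabilities directly, so summary_frame is likely not available on its result; that method belongs to results.get_prediction(...) instead):
#make a prediction for a given temp x and barometer y reading
#(note the added closing parenthesis on the predict call)
prediction = results.predict(pd.DataFrame({'temp': [21], 'barom': [12]}))
print(prediction)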

How to apply clustering on sentences embeddings?

I would like to create a summary with the major points of the original document. To do this, I made sentence embeddings with the Universal Sentence Encoder (https://tfhub.dev/google/universal-sentence-encoder/2). Afterwards, I would like to apply clustering to my vectors.
I've tried with the sklearn library:
import numpy as np
from sklearn.cluster import KMeans
n_clusters = np.ceil(len(encoded)**0.5)
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)
But I get an error message:
'numpy.float64' object cannot be interpreted as an integer
The problem is caused by this line:
n_clusters = np.ceil(len(encoded)**0.5)
KMeans expects an integer as the number of clusters, so simply cast the result:
n_clusters = int(np.ceil(len(encoded)**0.5))
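Putting it together, a minimal corrected sketch (assuming encoded is the array of sentence embeddings from the question):
import numpy as np
from sklearn.cluster import KMeans

# KMeans requires an int, so cast the square-root heuristic
n_clusters = int(np.ceil(len(encoded) ** 0.5))
kmeans = KMeans(n_clusters=n_clusters).fit(encoded)
labels = kmeans.labels_  # cluster assignment for each sentence embedding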

Binary classification always outputs 1

I'm taking my first steps with Keras and I'm trying to do binary classification on the cancer dataset available in scikit-learn.
# load dataset
from sklearn import datasets
cancer = datasets.load_breast_cancer()
cancer.data
# dataset into pd.dataframe
import pandas as pd
donnee = pd.concat([pd.DataFrame(data = cancer.data, columns = cancer.feature_names),
                    pd.DataFrame(data = cancer.target, columns = ["target"])],
                   axis = 1)
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(donnee.loc[:, donnee.columns != "target"], donnee.target, test_size = 0.25, random_state = 1)
I'm trying to follow keras' tutorial here : https://keras.io/#getting-started-30-seconds-to-keras
The thing is, I always get the same loss value (6.1316862406430541) and the same accuracy (0.61538461830232527), because the predictions are always 1.
I'm not sure if it's because of a code error:
I don't know, maybe the shape of X_train is wrong?
Or maybe I'm doing something wrong with epochs and/or batch_size.
Or if it's because of the network itself:
if I'm not mistaken, all-1 predictions are possible if the layers have no biases, and I don't know yet how they're initialized
But maybe it's something else; maybe one layer only is too few? (if so, I wonder why keras' tutorial uses only one layer...)
Here is my code, if you have any idea :
import keras
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=30))
model.add(Dense(units=1, activation='sigmoid'))
model.summary()

model.compile(loss = keras.losses.binary_crossentropy,
              optimizer = 'rmsprop',
              metrics = ['accuracy'])

model.fit(X_train.as_matrix(), y_train.as_matrix().reshape(426, -1), epochs=5, batch_size=32)

loss_and_metrics = model.evaluate(X_test.as_matrix(), y_test.as_matrix(), batch_size=128)
loss_and_metrics

classes = model.predict(X_test.as_matrix(), batch_size=128)
classes
This is a very common case. If you check the histogram of your data you will see that there are data points in your dataset whose coordinates span from 0 to 100. When you feed such data to a neural network, the input to the sigmoid might be so big that it saturates. In order to scale the data, you could use either MinMaxScaler or StandardScaler, which will bring your data into a range suitable for neural network computations.
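A minimal sketch of the suggested fix, assuming X_train, X_test and y_train/y_test from the question; the scaler is fit on the training set only, and its statistics are reused for the test set:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train data only
X_test_scaled = scaler.transform(X_test)        # apply the same transform to test data

model.fit(X_train_scaled, y_train.values, epochs=5, batch_size=32)
loss_and_metrics = model.evaluate(X_test_scaled, y_test.values, batch_size=128)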

PySpark MLLIB Random Forest: prediction always 0

Using ml, Spark 2.0 (Python) and a 1.2 million row dataset, I am trying to create a model that predicts purchase tendency with a Random Forest Classifier. However, when applying the transformation to the split test dataset, the prediction is always 0.
The dataset looks like:
[Row(tier_buyer=u'0', N1=u'1', N2=u'0.72', N3=u'35.0', N4=u'65.81', N5=u'30.67', N6=u'0.0'....
tier_buyer is the field used as a label indexer. The rest of the fields contain numeric data.
Steps
1.- Load the parquet file, and fill possible null values:
parquet = spark.read.parquet('path_to_parquet')
parquet.createOrReplaceTempView("parquet")
dfraw = spark.sql("SELECT * FROM parquet").dropDuplicates()
df = dfraw.na.fill(0)
2.- Create features vector:
features = VectorAssembler(
    inputCols = ['N1','N2'...],
    outputCol = 'features')
3.- Create string indexer:
label_indexer = StringIndexer(inputCol = 'tier_buyer', outputCol = 'label')
4.- Split the train and test datasets:
(train, test) = df.randomSplit([0.7, 0.3])
(Screenshots of the resulting train and test datasets omitted.)
5.- Define the classifier:
classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')
6.- Pipeline the stages and fit the train model:
pipeline = Pipeline(stages=[features, label_indexer, classifier])
model = pipeline.fit(train)
7.- Transform the test dataset:
predictions = model.transform(test)
8.- Output the test result, grouped by prediction:
predictions.select("prediction", "label", "features").groupBy("prediction").count().show()
As you can see, the outcome is always 0. I have tried multiple feature variations in hopes of reducing the noise, also trying different sources and inferring the schema, with no luck and the same results.
Questions
Is the current setup, as described above, correct?
Could the null value filling on the original DataFrame be a source of failure to effectively perform the prediction?
In the screenshot shown above it looks like some features are in the form of a tuple and others of a list; why? I'm guessing this could be a possible source of error. (They are representations of Dense and Sparse Vectors.)
It seems your features [N1, N2, ...] are strings. You may want to cast all your features to FloatType() or something along those lines. It may be prudent to fillna() after type casting, as in the sketch below.
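A minimal sketch of that casting step, assuming the column names from the question (run it before the VectorAssembler):
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

numeric_cols = ['N1', 'N2', 'N3', 'N4', 'N5', 'N6']  # extend with the rest of your feature columns
for c in numeric_cols:
    dfraw = dfraw.withColumn(c, col(c).cast(FloatType()))
df = dfraw.na.fill(0)  # fill nulls after casting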

How to define a function and pass training and test datasets in Scala?

I want to define a function in Scala to which I can pass my training and test datasets; it should then run a simple machine learning algorithm and return some statistics. How should I do that? What should the parameters' data types be?
Imagine you need to define a function which, given training and test datasets, performs a simple classification algorithm and then returns the accuracy.
What I expect to have is like as follow:
val data = MLUtils.loadLibSVMFile(sc, datadir + "/example.txt");
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L);
val training = splits(0).cache();
val test = splits(1);
val results1 = SVMFunction(training, test)
val results2 = RegressionFunction(training, test)
val results3 = ClassificationFunction(training, test)
I need just the declaration of the functions and not the code that produce the results1, results2, and results3.
def SVMFunction ("I need help here"){
//I know how to work with the training and test datasets to generate the results.
//So no need to discuss what should be here
}
Thanks.
In case you're using supervised learning you should opt for LabeledPoint. An excerpt from the MLlib docs:
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....
An example:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
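So the declarations you asked for would take RDD[LabeledPoint] parameters, which is what MLUtils.loadLibSVMFile and randomSplit give you. A minimal sketch, with the Double return type for the accuracy as an assumption:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Takes training and test sets, returns e.g. the accuracy
def SVMFunction(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Double = {
  // train on `training`, evaluate on `test` as in your existing code
  ???
}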