How do you predict an outcome using a single value in a multiple logistic regression using statsmodels?

# import pandas and the statsmodels formula API
import pandas as pd
import statsmodels.formula.api as smf
# load the contents of an Excel file into a DataFrame
df = pd.read_excel("C:/Users/ME/OneDrive/Desktop/weather.xlsx")
#Create a multiple logistic regression model
logRegModel = smf.logit('sunny ~ temp + barom', data = df)
#Fit the data in df into the model
results = logRegModel.fit()
#Print the results summary
print(results.summary())
# pull out the columns for the 3D scatter plot of the actual data
z = df.sunny
x = df.temp
y = df.barom
#make a prediction for a given temp x and barometer y reading
prediction = results.predict(pd.DataFrame({'temp': [21],'barom':[12]})
prediction.summary_frame(alpha=0.05)
# Creating figure
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (10, 7))
ax = plt.axes(projection ="3d")
# Creating plot
ax.scatter3D(x, y, z, color = "blue")
plt.title("3D scatter plot")
# show plot
plt.show()
I ran the code above. Everything works fine until it hits the code for making a prediction using a single x and a single y value. When I run the code including:
#make a prediction for a given temp x and barometer y reading
prediction = results.predict(pd.DataFrame({'temp': [21],'barom':[12]})
prediction.summary_frame(alpha=0.05)
I receive the following error:
File "<ipython-input-78-b26a4bf65d01>", line 36
from mpl_toolkits import mplot3d
^
SyntaxError: invalid syntax
This is so incredibly odd. Why does it run perfectly without the two prediction lines above, and then when I include them it tells me a simple import statement is a syntax error? It is my understanding from reading the statsmodels docs that, in order to make a prediction for a multiple logistic regression model, I need to pass a DataFrame into the predict function. Wasn't this done correctly above? My logistic regression is trying to predict whether a day is sunny from temperature and barometer readings. When I comment out the import statement above and run it, I receive another error on another import statement. This is so strange. Why does it not accept my import statements? I ran the code in multiple IDEs and received the same results. Thank you everyone in advance.
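For what it's worth, the predict call above is missing its closing parenthesis, so the parser keeps reading the following lines as part of the same expression and reports the SyntaxError at the first statement it cannot absorb, which is the import. A minimal sketch of the call with balanced parentheses (note that predict on a fitted Logit result returns the predicted probabilities directly, so the summary_frame call would likely fail next anyway):
new_obs = pd.DataFrame({'temp': [21], 'barom': [12]})
# one closing parenthesis for pd.DataFrame(...), one for predict(...)
prediction = results.predict(new_obs)  # predicted probability of a sunny day
print(prediction)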

Related

Modifying Python code and running PCA in Tableau

I am a beginner, and this is my first time using Tableau. I want to perform PCA from Python code in Tableau Desktop. I get the main ideas behind the process, and TabPy is installed.
My dataset is really big, with around 1,000+ columns.
I took a look at modifying my Python code (included at the end) to be able to run it in Tableau.
My question is: in my case, how can I specify _arg1, _arg2, _arg3, ..., given that I used dataset.drop('Class', axis=1) to define X and dataset['Class'] to define Y?
Thank you in advance.
# importing the libraries and loading the dataset
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('NL_undivided.xlsx')
# distributing the dataset into two components X and Y
X = dataset.drop('Class', axis=1)
Y = dataset['Class']
# standardize the features before PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaled_data = scaler.transform(X)
# project onto the first two principal components
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
fig, ax = plt.subplots(figsize=(20, 10))
scatter = ax.scatter(x_pca[:, 0], x_pca[:, 1], c=Y, cmap='rainbow')
# produce a legend with the unique colors from the scatter
legend1 = ax.legend(*scatter.legend_elements(), loc="best", title="Cohorts")
ax.add_artist(legend1)
plt.show()
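As a rough sketch of how the _argN names map: TabPy hands each field you pass to a SCRIPT_* calculation over to the embedded Python as a plain list named _arg1, _arg2, ... in order, so with 1,000+ columns you would need one argument per column, which may not scale comfortably. The function and field names below are made up for illustration, not taken from the question's dataset:
# Hypothetical sketch of the Python that would sit inside a Tableau
# SCRIPT_REAL calculation, e.g.:
#   SCRIPT_REAL("<body below>", SUM([Feature1]), SUM([Feature2]), SUM([Feature3]))
# Each field arrives as a plain list named _arg1, _arg2, ... in order.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def pca_first_component(_arg1, _arg2, _arg3):
    # rebuild a feature frame from the per-field lists
    X = pd.DataFrame({'f1': _arg1, 'f2': _arg2, 'f3': _arg3})
    scaled = StandardScaler().fit_transform(X)
    # SCRIPT_REAL expects a list of floats back, one per row
    return PCA(n_components=2).fit_transform(scaled)[:, 0].tolist()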

Curve fitting of sine function in python using scipy is not yielding desired output

I'm trying to fit a sine function to my data. No errors are shown, but it doesn't seem to work.
from scipy.optimize import curve_fit as cf
import numpy as np
import matplotlib.pyplot as plt

def sin_fun(x, a, b):
    return a * np.sin(b * x)

p_opt, p_cov = cf(sin_fun, xdata, ydata)
print(p_opt)
plt.plot(xdata, sin_fun(xdata, *p_opt))
plt.scatter(xdata, ydata)
plt.show()
This is the output I am getting:
I have simulated your data. There are two problems with your code that explain why it isn't doing what you want. The first is that your sin_fun needs a y-offset parameter; otherwise the function will always be symmetrical about y = 0. The second is that the fit works better if you provide curve_fit with a reasonable guess, which is done using the p0 argument. Have a look here:
from scipy.optimize import curve_fit as cf
import numpy as np
from matplotlib import pyplot as plt
# simulate your data
xdata = np.linspace(0, 25000, 256)
ydata = 15000 * np.sin(xdata/2000) + 22000
# add some noise
ydata += np.random.rand(xdata.size) * 2000
# sin function needs a y-offset -> c
def sin_fun(x, a, b, c):
    return a * np.sin(b * x) + c
# need a reasonable guess -> note that the guess is not quite right but curve_fit still works
p_opt,p_cov=cf(sin_fun,xdata,ydata, p0=(10000, 1/2500, 15000))
print(p_opt)
plt.plot(xdata,sin_fun(xdata,*p_opt))
plt.plot(xdata,ydata, 'r.', ms=1)
plt.show()
With these fixes you can get a good fit. You could also add a phase parameter to your function to help fit other sinusoids.
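For instance, a sketch of the same model with a phase term added (d below is the new parameter; p0 would then need a fourth guess value):
# same model with a phase offset d
def sin_fun(x, a, b, c, d):
    return a * np.sin(b * x + d) + c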

Python sklearn- gaussian.mixture how to get the samples/points in each clusters

I am using a GMM to cluster my dataset into K groups. The model runs well, but I cannot find a way to get the raw data belonging to each cluster. Can you suggest some ideas to solve this problem? Thank you so much.
You can do it like this (look at d0, d1, & d2).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn import datasets
from sklearn.mixture import GaussianMixture
# load the iris dataset
iris = datasets.load_iris()
# select first two columns
X = iris.data[:, 0:2]
# turn it into a dataframe
d = pd.DataFrame(X)
# plot the data
plt.scatter(d[0], d[1])
gmm = GaussianMixture(n_components = 3)
# Fit the GMM model for the dataset
# which expresses the dataset as a
# mixture of 3 Gaussian Distribution
gmm.fit(d)
# Assign a label to each sample
labels = gmm.predict(d)
d['labels'] = labels
d0 = d[d['labels'] == 0]
d1 = d[d['labels'] == 1]
d2 = d[d['labels'] == 2]
# here is a possible solution for you:
d0
d1
d2
# plot the three clusters in the same plot
plt.scatter(d0[0], d0[1], c='r')
plt.scatter(d1[0], d1[1], c='yellow')
plt.scatter(d2[0], d2[1], c='g')
# print the converged log-likelihood value
print(gmm.lower_bound_)
# print the number of iterations needed
# for the log-likelihood value to converge
print(gmm.n_iter_)
# it needed 8 iterations for the log-likelihood to converge.
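If the number of components is not fixed at three, here is a small sketch of the same idea that collects every cluster in one pass (assuming the fitted gmm and labelled DataFrame d from above):
# one sub-DataFrame of raw rows per cluster label
clusters = {k: d[d['labels'] == k] for k in range(gmm.n_components)}
print(clusters[0].head())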

How to apply clustering on sentences embeddings?

I would like to create a summary with the major points of the original document. To do this, I made sentence embeddings with the Universal Sentence Encoder (https://tfhub.dev/google/universal-sentence-encoder/2). Afterwards, I would like to apply clustering to my vectors.
I've tried with the sklearn library:
import numpy as np
from sklearn.cluster import KMeans
n_clusters = np.ceil(len(encoded)**0.5)
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)
But I get an error message:
'numpy.float64' object cannot be interpreted as an integer
The problem is caused in this line:
n_clusters = np.ceil(len(encoded)**0.5)
KMeans expects to receive an integer as the number of clusters, so simply cast it:
n_clusters = int(np.ceil(len(encoded)**0.5))
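A self-contained sketch of the fixed snippet, with random vectors standing in for the encoder output (the 40 sentences and the 512-dimensional embeddings are made up for illustration):
import numpy as np
from sklearn.cluster import KMeans

encoded = np.random.rand(40, 512)               # stand-in for sentence embeddings
n_clusters = int(np.ceil(len(encoded) ** 0.5))  # cast: KMeans needs an int
kmeans = KMeans(n_clusters=n_clusters).fit(encoded)
print(kmeans.labels_)                           # cluster index per sentence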

pyspark extract ROC curve?

Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html
Is that right? I can certainly think of ways to implement it but I have to imagine it’s faster if there’s a pre-built function. I’m working with 3 million scores and a few dozen models so speed matters.
For a more general solution that works for models besides Logistic Regression (like Decision Trees or Random Forest, which lack a model summary), you can get the ROC curve using BinaryClassificationMetrics from Spark MLlib.
Note that the PySpark version doesn't implement all of the methods that the Scala version does, so you'll need to use the .call(name) function from JavaModelWrapper. It also seems that py4j doesn't support parsing scala.Tuple2 classes, so they have to be manually processed.
Example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
# Create a Pipeline estimator and fit on train DF, predict on test DF
model = estimator.fit(train)
predictions = model.transform(test)
# Returns as a list (false positive rate, true positive rate)
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x_val, y_val)
BinaryClassificationMetrics in Scala implements several other useful methods as well:
metrics = CurveMetrics(preds)
metrics.get_curve('fMeasureByThreshold')
metrics.get_curve('precisionByThreshold')
metrics.get_curve('recallByThreshold')
Since the ROC curve is just a plot of TPR against FPR, you can extract the needed values as follows:
your_model.summary.roc.select('FPR').collect()
your_model.summary.roc.select('TPR').collect()
Where your_model could be for example a model you got from something like this:
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression()
your_model = log_reg.fit(df)
Now you should just plot FPR against TPR, using for example matplotlib.
P.S.
Here is a complete example for plotting the ROC curve using a model named your_model (or any other model with a summary.roc attribute). I've also plotted a reference "random guess" line inside the ROC plot.
import matplotlib.pyplot as plt
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(your_model.summary.roc.select('FPR').collect(),
         your_model.summary.roc.select('TPR').collect())
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
To get ROC metrics for train data (trained model), we can use your_model.summary.roc which is a DataFrame with columns FPR and TPR. See Andrea's answer.
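For a quick look at that DataFrame (assuming your_model is a fitted LogisticRegressionModel as in the previous answer):
# first few points of the training ROC curve
your_model.summary.roc.show(5)  # columns: FPR, TPR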
For ROC evaluated on arbitrary test data, we can use label and probability columns to pass to sklearn's roc_curve to get FPR and TPR. Here we assume a binary classification problem where the y score is the probability of predicting 1. See also How to split Vector into columns - using PySpark, How to convert a pyspark dataframe column to numpy array
Example
from sklearn.metrics import roc_curve
from pyspark.ml.functions import vector_to_array  # needed for the "probability" vector column (Spark 3.0+)
model = lr.fit(train_df)
test_df_predict = model.transform(test_df)
y_score = test_df_predict.select(vector_to_array("probability")[1]).rdd.keys().collect()
y_true = test_df_predict.select("label").rdd.keys().collect()
fpr, tpr, thresholds = roc_curve(y_true, y_score)
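The same two arrays also feed sklearn's scalar AUC directly, as a quick sanity check on the curve:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_true, y_score))  # area under the ROC curve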