Modifying Python code and running PCA in Tableau - tableau-api

I am a beginner and this is my first time using Tableau. I want to perform PCA from Python code in Tableau Desktop. I understand the main ideas behind the process, and TabPy is installed.
My dataset is really big, with more than 1000 columns.
I have looked at modifying my Python code (included at the end) so that it can run in Tableau.
My question is: in my case, how can I specify _arg1, _arg2, _arg3, ..., given that I used dataset.drop('Class', 1) to define X and dataset['Class'] to define Y?
Thank you in advance.
import pandas as pd
import matplotlib.pyplot as plt
# importing or loading the dataset
dataset = pd.read_excel('NL_undivided.xlsx')
# distributing the dataset into two components X and Y
X = dataset.drop('Class', 1)
Y = dataset['Class']
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaled_data = scaler.transform(X)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
plt.figure(figsize=(20,10))
fig, ax = plt.subplots(figsize=(20, 10))
scatter = ax.scatter(x_pca[:,0],x_pca[:,1],c=Y,cmap='rainbow',)
# produce a legend with the unique colors from the scatter
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="best", title="Cohorts")
ax.add_artist(legend1)
plt.figure(figsize=(15,8))
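One way to picture the _arg1, _arg2, ... mechanism is sketched below. Tableau script calculations hand each field to Python as a list, and the script is expected to return a list with one value per row. The function name first_component and the three sample columns are hypothetical, and PCA is done here with a plain NumPy SVD rather than scikit-learn, so treat this as a sketch of the argument handling and not as the required TabPy setup. Note that Y ('Class') is only used to color the scatter plot in the code above, so it would not need to be passed to the PCA step.

```python
import numpy as np

def first_component(*columns):
    """Return the first principal-component score for each row.

    Each positional argument is one column of values (a list), which is
    how _arg1, _arg2, ... would arrive from a Tableau script calculation.
    """
    X = np.column_stack([np.asarray(c, dtype=float) for c in columns])
    # Standardize each column (the same idea as StandardScaler)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    # PCA via SVD: the rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[0]
    return scores.tolist()

# Hypothetical stand-ins for _arg1, _arg2, _arg3
arg1 = [1.0, 2.0, 3.0, 4.0, 5.0]
arg2 = [2.1, 3.9, 6.2, 8.1, 9.8]
arg3 = [5.0, 3.0, 4.0, 1.0, 2.0]
pc1 = first_component(arg1, arg2, arg3)
print(len(pc1))  # one score per input row
```

Because the function accepts any number of columns, the same body works whether three or many fields are passed. With 1000+ columns, listing every field as a separate argument is likely to be unwieldy, so reducing the column set beforehand may be worth considering.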

Related

How do you predict an outcome using a single value in a multiple logistic regression using statsmodels?

#import the needed pandas module
import pandas as pd
import statsmodels.formula.api as smf
#Upload the contents of an excel file to a DataFrame
df= pd.read_excel("C:/Users/ME/OneDrive/Desktop/weather.xlsx")
#Create a multiple logistic regression model
logRegModel = smf.logit('sunny ~ temp + barom', data = df)
#Fit the data in df into the model
results = logRegModel.fit()
#Print the results summary
print(results.summary())
#plot the scatterplot with the actual data
z = df.sunny
x = df.temp
y = df.barom
#make a prediction for a given temp x and barometer y reading
prediction = results.predict(pd.DataFrame({'temp': [21],'barom':[12]})
prediction.summary_frame(alpha=0.05)
# Creating figure
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (10, 7))
ax = plt.axes(projection ="3d")
# Creating plot
ax.scatter3D(x, y, z, color = "blue")
plt.title("3D scatter plot")
# show plot
plt.show()
I ran the code above. Everything works fine until it hits the code for making a prediction using a single x and a single y value. When I run the code to include:
#make a prediction for a given temp x and barometer y reading
prediction = results.predict(pd.DataFrame({'temp': [21],'barom':[12]})
prediction.summary_frame(alpha=0.05)
I receive the following error:
File "<ipython-input-78-b26a4bf65d01>", line 36
from mpl_toolkits import mplot3d
^
SyntaxError: invalid syntax
This is incredibly odd. Why does it run perfectly without the two prediction lines above and then, when I include them, report a simple import statement as a syntax error? It is my understanding from reading the statsmodels docs that, in order to make a prediction with a multiple logistic regression model, I need to pass a DataFrame into the predict function. Wasn't this done correctly above? My logistic regression is trying to predict whether a day is sunny from temperature and barometer readings. When I comment out the import statement above and run it, I receive another error on another import statement. This is so strange. Why does it not accept my import statements? I ran the code in multiple IDEs and received the same results. Thank you everyone in advance.
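A likely explanation, offered as a sketch and not a confirmed diagnosis: the results.predict( line opens two parentheses but closes only one, so the parser keeps reading into the next statement and reports the error there (older Python versions point at the following line, newer ones at the unclosed parenthesis). The snippet below only parses the two variants with the standard library's ast module; nothing is imported or executed, and the helper name parses is hypothetical.

```python
import ast

# The line as posted: predict( is never closed
broken = (
    "prediction = results.predict(pd.DataFrame({'temp': [21], 'barom': [12]})\n"
    "from mpl_toolkits import mplot3d\n"
)
# The same line with the missing closing parenthesis added
fixed = (
    "prediction = results.predict(pd.DataFrame({'temp': [21], 'barom': [12]}))\n"
    "from mpl_toolkits import mplot3d\n"
)

def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

broken_ok = parses(broken)
fixed_ok = parses(fixed)
print(broken_ok, fixed_ok)
```

This would also explain why commenting out one import simply moves the error to the next statement: whatever follows the unclosed call is read as part of it.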

Applying scipy.stats.gaussian_kde to 3D point cloud

I have a set of about 33K (x,y,z) points in a csv file and would like to convert this to a grid of density values using scipy.stats.gaussian_kde. I have not been able to find a way to convert this point cloud array into an appropriate input format for the gaussian_kde function (and then take the output of this and convert it into a density value grid). Can anyone provide sample code?
Here's an example with some comments which may be of use. gaussian_kde wants the data and points to be row stacked, i.e. (# ndim, # num values), as per the docs. In your case you would row_stack([x, y, z]) such that the shape is (3, 33000).
from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt
# simulate some data
n = 33000
x = np.random.randn(n)
y = np.random.randn(n) * 2
# data must be stacked as (# ndim, # n values) as per docs.
data = np.row_stack((x, y))
# perform KDE
kernel = gaussian_kde(data)
# create grid over which to evaluate KDE
s = np.linspace(-8, 8, 128)
grid = np.meshgrid(s, s)
# again KDE needs points to be row_stacked
grid_points = np.row_stack([g.ravel() for g in grid])
# evaluate KDE and reshape result correctly
Z = kernel(grid_points)
Z = Z.reshape(grid[0].shape)
# plot KDE as image and overlay some data points
fig, ax = plt.subplots()
ax.matshow(Z, extent=(s.min(), s.max(), s.min(), s.max()))
ax.plot(x[::10], y[::10], 'w.', ms=1, alpha=0.3)
ax.set_xlim(s.min(), s.max())
ax.set_ylim(s.min(), s.max())
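The example above is two-dimensional. A three-dimensional variant, closer to the (x, y, z) point cloud in the question, might look like the sketch below. The data is simulated, the point count and grid resolution are deliberately small so it runs quickly, and np.vstack is used because it stacks the arrays the same way as np.row_stack. These values are assumptions for illustration, not recommendations for the real 33K-point file.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n = 2000  # simulated stand-in for the real point cloud
x = rng.normal(size=n)
y = rng.normal(size=n) * 2
z = rng.normal(size=n) * 0.5

# stack as (# ndim, # n values) -> shape (3, n)
data = np.vstack([x, y, z])
kernel = gaussian_kde(data)

# regular 3D grid over which to evaluate the density
s = np.linspace(-4, 4, 16)
gx, gy, gz = np.meshgrid(s, s, s, indexing="ij")
grid_points = np.vstack([gx.ravel(), gy.ravel(), gz.ravel()])  # (3, 16**3)

# evaluate and reshape back into a (16, 16, 16) density grid
density = kernel(grid_points).reshape(gx.shape)
print(density.shape)
```

Evaluation cost grows with the number of data points times the number of grid points, so a fine grid over 33K points may take noticeably longer.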

Why is k-means better for clustering than topic modelling algorithms like LDA?

I want to know about the advantages of k-means in clustering essays to discover their topics. There are many algorithms for this, such as k-medoids, x-means, LDA, LSA, etc. Please give me a full description of the reasons to select the k-means algorithm.
I don't think you can draw parallels between all these things. I would highly recommend doing some well-defined Googling on your side, and coming back here with a more refined question, or questions. In the meantime, I'll share with you what little I know about these topics. First, let's look at PCA & LDA...
import numpy as np
import pandas as pd
# Importing the Dataset
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
#dataset = pd.read_csv(url, names=names)
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop('species', axis=1)
y = dataset['species']
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# As mentioned earlier, PCA performs best with a normalized feature set. We will perform standard scalar normalization to normalize our feature set. To do this, execute the following code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Performance Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[11 0 0]
[ 0 12 1]
[ 0 1 5]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
Accuracy 0.9333333333333333
# Results with 2 & 3 principal components
#from sklearn.decomposition import PCA
#pca = PCA(n_components=5)
#X_train = pca.fit_transform(X_train)
#X_test = pca.transform(X_test)
# https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
# LINEAR DISCRIMINANT ANALYSIS
# Data Preprocessing
# Once dataset is loaded into a pandas data frame object, the first step is to divide dataset into features and corresponding labels and then divide the resultant dataset into training and test sets. The following code divides data into labels and feature set:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
# As was the case with PCA, we need to perform feature scaling for LDA too. Execute the following script to do so:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[11 0 0]
[ 0 13 0]
[ 0 0 6]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
Result:
Accuracy 1.0
# https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/
Does that make sense? Hopefully it does. Now, let's move on to KMeans and PCA...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop('species', axis=1)
y = dataset['species']
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
ax = sns.scatterplot(x="sepal_length", y="sepal_width", hue="sepal_length", data=dataset)
ax = sns.scatterplot(x="petal_length", y="petal_width", hue="petal_length", data=dataset)
# ordinarily, when you don't have the actual labels, you might use
# silhouette analysis to determine a good number of clusters k to use.
# i.e. you would just run that same code for different values of k and print the value for
# the silhouette score.
# let's see what that value is for the case we just did, k=3.
from sklearn import metrics
score = metrics.silhouette_score(X_scaled, y_cluster_kmeans)
score
# Result:
# 0.45994823920518646
# note that this is the mean over all the samples - there might be some clusters
# that are well separated and others that are closer together.
# so let's look at the distribution of silhouette scores...
scores = metrics.silhouette_samples(X_scaled, y_cluster_kmeans)
sns.histplot(scores, kde=True);  # distplot is deprecated in recent seaborn versions
# so you can see that the blue species have higher silhouette scores
# (the legend doesn't show the colors though... so the pandas plot is more useful).
# note that if we used the best mean silhouette score to try to find the best
# number of clusters k, we'd end up with 2 clusters, because the mean silhouette
# score in that case would be largest, since the clusters would be better separated.
# but, that's using k-means - gmm might give better results...
# so that was clustering on the original 4d data.
# if you have a lot of features it can be helpful to do some feature reduction
# to avoid the curse of dimensionality (i.e. needing exponentially more data
# to do accurate predictions as the number of features grows).
# you can do this with Principal Component Analysis (PCA), which remaps the data
# to a new (smaller) coordinate system which tries to account for the
# most information possible.
# you can *also* use PCA to visualize the data by reducing the
# features to 2 dimensions and making a scatterplot.
# it kind of mashes the data down into 2d, so can lose
# information - but in this case it's just going from 4d to 2d,
# so not losing too much info.
# so let's just use it to visualize the data...
# mash the data down into 2 dimensions
from sklearn.decomposition import PCA
ndimensions = 2
pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
# Result:
#           PC1       PC2
# 90   0.279078 -1.120029
# 26  -2.051151  0.242164
# 83   1.061095 -0.633843
# 135  2.798770  0.856803
# 54   1.075475 -0.208421
# so that gives us new 2d coordinates for each data point.
# at this point, if you don't have labelled data,
# you can add the k-means cluster ids to this table and make a
# colored scatterplot.
# we do actually have labels for the data points, but let's imagine
# we don't, and use the predicted labels to see what the predictions look like.
df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
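# y_id_array is not defined earlier in this snippet; integer species codes are assumed here
y_id_array = pd.Categorical(y).codes
df_plot['SpeciesId'] = y_id_array # also add actual labels so we can use it in later plots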
df_plot.sample(5)
# Result:
#           PC1       PC2  ClusterKmeans  SpeciesId
# 132  1.862703 -0.178549              0          2
# 85   0.429139  0.845582              0          1
# 139  1.852045  0.676128              0          2
# 33  -2.446177  2.150728              1          0
# 147  1.521170  0.269069              0          2
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn
def plotData(df, groupby):
    "make a scatterplot of the first two principal components of the data, colored by the groupby field"
    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure,
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize = (7,7))
    # color map (plt.get_cmap is used because mpl.cm.get_cmap was removed in newer matplotlib)
    cmap = plt.get_cmap('prism')
    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax = ax, # need to pass this so all scatterplots are on same graph
                     kind = 'scatter',
                     x = 'PC1', y = 'PC2',
                     color = cmap(i/(nclusters-1)), # cmap maps a number to a color
                     label = "%s %i" % (groupby, i),
                     s=30) # dot size
    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset");
# plot the clusters each datapoint was assigned to
plotData(df_plot, 'ClusterKmeans')
# so those are the *predicted* labels - what about the *actual* labels?
plotData(df_plot, 'SpeciesId')
# so the k-means clustering *did not* find the correct clusterings!
# q. so what do these dimensions mean?
# they're the principal components, which pick out the directions
# of maximal variation in the original data.
# PC1 finds the most variation, PC2 the second-most.
# the rest of the data is basically thrown away when the data is reduced down to 2d.
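To put a number on how much is "thrown away" when reducing to 2D, the explained variance ratio can be computed directly from the singular values of the standardized data. This is the same quantity scikit-learn exposes as pca.explained_variance_ratio_. The sketch below uses a synthetic four-feature table built from two underlying factors as a stand-in, not the iris data, so the printed figure says nothing about iris itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 4 features driven by 2 underlying factors plus small noise
base = rng.normal(size=(150, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=150),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=150),
])

# Standardize, then take the SVD; squared singular values are
# proportional to the variance captured by each principal component
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, _ = np.linalg.svd(Z, full_matrices=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

kept_2d = explained[:2].sum()  # share of variance kept by PC1 and PC2
print(explained, kept_2d)
```

If the share kept by the first two components is low, a 2D scatterplot of the clusters can be misleading, since points that overlap in the plot may be well separated in the discarded dimensions.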

pyspark extract ROC curve?

Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html
Is that right? I can certainly think of ways to implement it but I have to imagine it’s faster if there’s a pre-built function. I’m working with 3 million scores and a few dozen models so speed matters.
For a more general solution that works for models besides Logistic Regression (like Decision Trees or Random Forest which lack a model summary) you can get the ROC curve using BinaryClassificationMetrics from Spark MLlib.
Note that the PySpark version doesn't implement all of the methods that the Scala version does, so you'll need to use the .call(name) function from JavaModelWrapper. It also seems that py4j doesn't support parsing scala.Tuple2 classes, so they have to be manually processed.
Example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)
    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points
    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
# Create a Pipeline estimator and fit on train DF, predict on test DF
model = estimator.fit(train)
predictions = model.transform(test)
# Returns as a list (false positive rate, true positive rate)
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
BinaryClassificationMetrics in Scala implements several other useful methods as well:
metrics = CurveMetrics(preds)
metrics.get_curve('fMeasureByThreshold')
metrics.get_curve('precisionByThreshold')
metrics.get_curve('recallByThreshold')
Since the ROC curve is a plot of TPR against FPR, you can extract the needed values as follows:
your_model.summary.roc.select('FPR').collect()
your_model.summary.roc.select('TPR').collect()
Where your_model could be for example a model you got from something like this:
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression()
your_model = log_reg.fit(df)
Now you should just plot FPR against TPR, using for example matplotlib.
P.S.
Here is a complete example for plotting the ROC curve using a model named your_model (and anything else!). I've also plotted a reference "random guess" line inside the ROC plot.
import matplotlib.pyplot as plt
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(your_model.summary.roc.select('FPR').collect(),
your_model.summary.roc.select('TPR').collect())
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
To get ROC metrics for train data (trained model), we can use your_model.summary.roc which is a DataFrame with columns FPR and TPR. See Andrea's answer.
For ROC evaluated on arbitrary test data, we can use label and probability columns to pass to sklearn's roc_curve to get FPR and TPR. Here we assume a binary classification problem where the y score is the probability of predicting 1. See also How to split Vector into columns - using PySpark, How to convert a pyspark dataframe column to numpy array
Example
from sklearn.metrics import roc_curve
from pyspark.ml.functions import vector_to_array  # available in Spark 3.0+
model = lr.fit(train_df)
test_df_predict = model.transform(test_df)
y_score = test_df_predict.select(vector_to_array("probability")[1]).rdd.keys().collect()
y_true = test_df_predict.select("label").rdd.keys().collect()
fpr, tpr, thresholds = roc_curve(y_true, y_score)
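To make concrete what the (FPR, TPR) points in the answers above represent, here is a small plain-Python sketch that builds them from (score, label) pairs by sweeping the threshold from high to low. The function name roc_points and the sample data are hypothetical, and it assumes binary labels with at least one positive and one negative. It is meant for understanding the output and would not be suitable for millions of scores.

```python
def roc_points(scores, labels):
    """Return a list of (false positive rate, true positive rate) points."""
    # Sort by score descending so lowering the threshold admits one group at a time
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(1 for _, label in pairs if label == 1)
    neg = len(pairs) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example tied at this score before emitting a point
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
```

Each distinct score contributes one point, which is why the full curve can have as many points as there are predictions unless the scores are binned.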

Chaco - Getting multiple data series to use the same axes and maps

I am trying to plot several collections of data on a single plot.
Each dataset can be represented as an x-series (index) and several y-series (values). The ranges of x and y data series may be different in each data set. I want to have several of these data sets display on one plot. However, when I simply add the second plot object to the first (see below) it makes a second axis for it that is nested inside the plot.
I want both plots to share the same axis and for the axis bounds to be updated to fit all the data. What is the best way to achieve this? I am struggling to find topics on this in the documentation.
Thanks for your help. The code below highlights my problem.
# Major library imports
from numpy import linspace
from scipy.special import jn
from chaco.example_support import COLOR_PALETTE
# Enthought library imports
from enable.api import Component, ComponentEditor
from traits.api import HasTraits, Instance
from traitsui.api import Item, Group, View
# Chaco imports
from chaco.api import ArrayPlotData, Plot
from chaco.tools.api import BroadcasterTool, PanTool, ZoomTool
from chaco.api import create_line_plot, add_default_axes
def _create_plot_component():
    # Create some x-y data series to plot
    x = linspace(-2.0, 10.0, 100)
    x2 =linspace(-5.0, 10.0, 100)
    pd = ArrayPlotData(index = x)
    for i in range(5):
        pd.set_data("y" + str(i), jn(i,x))
    #slightly different plot data
    pd2 = ArrayPlotData(index = x2)
    for i in range(5):
        pd2.set_data("y" + str(i), 2*jn(i,x2))
    # Create some line plots of some of the data
    plot1 = Plot(pd)
    plot1.plot(("index", "y0", "y1", "y2"), name="j_n, n<3", color="red")
    # Tweak some of the plot properties
    plot1.title = "My First Line Plot"
    plot1.padding = 50
    plot1.padding_top = 75
    plot1.legend.visible = True
    plot2 = Plot(pd2)
    plot2.plot(("index", "y0", "y1"), name="j_n, n<3", color="green")
    plot1.add(plot2)
    # Attach some tools to the plot
    broadcaster = BroadcasterTool()
    broadcaster.tools.append(PanTool(plot1))
    broadcaster.tools.append(PanTool(plot2))
    for c in (plot1, plot2):
        zoom = ZoomTool(component=c, tool_mode="box", always_on=False)
        broadcaster.tools.append(zoom)
    plot1.tools.append(broadcaster)
    return plot1
# Attributes to use for the plot view.
size=(900,500)
title="Multi-Y plot"
# # Demo class that is used by the demo.py application.
#===============================================================================
class Demo(HasTraits):
    plot = Instance(Component)
    traits_view = View(
        Group(
            Item('plot', editor=ComponentEditor(size=size),
                 show_label=False),
            orientation = "vertical"),
        resizable=True, title=title,
        width=size[0], height=size[1]
    )
    def _plot_default(self):
        return _create_plot_component()
demo = Demo()
if __name__ == "__main__":
    demo.configure_traits()
One of the warts in Chaco (and indeed many plotting libraries) is the overloading of terms---especially the word "plot".
You're creating two different (capital-"P") Plots, but (I believe) you really only want one. Plot is the container that holds all of your individual line ... umm ... plots. The Plot.plot method returns a list of LinePlot instances (this "plot" is also called a "renderer" sometimes). That renderer is what you want to add to your (capital-"P") Plot container. The plot method actually creates the LinePlot instance and adds it to the Plot container for you. (Yup, that's three different uses of "plot": The container, the renderer, and the method on the container that adds/returns the renderer.)
Here's a simpler version of _create_plot_component that does roughly what you want. Note that only a single (capital-"P") Plot container is created.
def _create_plot_component():
    # Create some x-y data series to plot
    x = linspace(-2.0, 10.0, 100)
    x2 =linspace(-5.0, 10.0, 100)
    pd = ArrayPlotData(x=x, x2=x2)
    for i in range(3):
        pd.set_data("y" + str(i), jn(i,x))
    # slightly different plot data
    for i in range(3, 5):
        pd.set_data("y" + str(i), 2*jn(i,x2))
    # Create some line plots of some of the data
    canvas = Plot(pd)
    canvas.plot(("x", "y0", "y1", "y2"), name="plot 1", color="red")
    canvas.plot(("x2", "y3", "y4"), name="plot 2", color="green")
    return canvas
Edit: An earlier response fixed the issue with a two-line modification, but it wasn't the ideal way to solve the problem.