Scatterplot in python - cluster-analysis

I am trying to make a scatter plot with 2-3 variables represented in different colors. I have plotted it, but I need a circle drawn as a boundary around each of the scatter clusters, just for presentation, something like the image below. How can I do that?

You could use K-Means clustering, which is one of the easiest ways to create these kinds of clusters, and scikit-learn makes it straightforward. I have also applied PCA for you; if you don't want PCA you can remove that step, but I urge you to keep it.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np

def KMeansModel(train_data, n):
    # Reduce the data to 2-D so the clusters can be drawn in a plane
    pca = PCA(n_components=2)
    reduced_train_data = pca.fit_transform(train_data)
    KM = KMeans(n_clusters=n)
    KM.fit(reduced_train_data)
    plt.plot(reduced_train_data[:, 0], reduced_train_data[:, 1], 'k.', markersize=2)
    centroids = KM.cluster_centers_
    for ind, centroid in enumerate(centroids):
        # Use the distance to the farthest member of the cluster as the radius
        class_inds = np.where(KM.labels_ == ind)[0]
        max_dist = np.max(metrics.pairwise_distances(
            centroid.reshape(1, -1), reduced_train_data[class_inds]))
        print(max_dist)
        plt.gca().add_artist(plt.Circle(centroid, max_dist, fill=False))
    plt.show()
The n here is the number of clusters you want to make.
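For example, assuming train_data is your own feature matrix, you would call it like this (the array here is a made-up stand-in just to show the call):

train_data = np.random.rand(300, 5)  # stand-in for your own data
KMeansModel(train_data, n=3)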

Well, there will be two steps: clustering and constructing a shape around the clusters.
You appear to have the clustering sorted. For the shape, I recommend bounding the clusters with ellipsoids. There are a few ways of constructing such ellipsoids.
A simple one is to construct Legendre ellipsoids, that is, ellipsoids centered at the mean
$$
c = \frac{1}{n} \sum_{k=1}^{n} x_k
$$
and defined by
$$
(x - c)^\top A^{-1} (x - c) \le 1,
$$
where
$$
A_{ij} = \frac{d+2}{n} \sum_{k=1}^{n} (x_k - c)_i \,(x_k - c)_j
$$
and $d$ is the dimension of the data.
You may need to expand the ellipsoids, as Legendre ellipsoids are not guaranteed to bound all the points.
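As an illustration (my own sketch, not part of the original answer), here is how you could compute a 2-D Legendre ellipse from the formula above and draw it with matplotlib. The helper name legendre_ellipse and the inflation factor s = 1.5 are my inventions:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def legendre_ellipse(points):
    """Center, full axis lengths and angle (degrees) of the Legendre ellipse."""
    n, d = points.shape
    c = points.mean(axis=0)
    A = (d + 2) / n * (points - c).T @ (points - c)
    # The ellipse (x - c)^T A^{-1} (x - c) <= 1 has semi-axes sqrt(eigvals(A))
    eigvals, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))
    width, height = 2 * np.sqrt(eigvals[::-1])    # major axis first
    return c, width, height, angle

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.0, 1.0]])
c, w, h, ang = legendre_ellipse(pts)

s = 1.5  # inflate, since the Legendre ellipse need not contain every point
ax = plt.gca()
ax.scatter(pts[:, 0], pts[:, 1], s=5)
ax.add_patch(Ellipse(c, s * w, s * h, angle=ang, fill=False, color='r'))
ax.set_aspect('equal')
plt.show()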

Related

Using numerical methods to plot solution to first-order nonlinear differential equation in Matlab

I have a question about plotting x(t), the solution to the differential equation below, given that dx/dt equals the expression shown and that x = 0 at t = 0.
syms x
dxdt = -(1.0*(6.84e+45*x^2 + 5.24e+32*x - 2.49e+42))/(2.47e+39*x + 7.12e+37)
I want to plot the solution of this first-order nonlinear differential equation. The analytical solution involves complex numbers, which is not physically meaningful since the equation models a real-life process, so Matlab should solve the equation numerically and plot the result instead. Can someone please suggest how to do this?
In Matlab, try this:
tspan = [0 10];
x0 = 0;
[t,x] = ode45(@(t,x) -(1.0*(6.84e+45*x^2 + 5.24e+32*x - 2.49e+42))/(2.47e+39*x + 7.12e+37), tspan, x0);
plot(t,x,'b')
I tried it and got the plot shown. Hope that helps.
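If you would rather stay in Python, a minimal equivalent sketch using scipy.integrate.solve_ivp (assuming the same tspan and initial value as above) would be:

import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def dxdt(t, x):
    return -(1.0*(6.84e45*x**2 + 5.24e32*x - 2.49e42))/(2.47e39*x + 7.12e37)

sol = solve_ivp(dxdt, [0, 10], [0.0], dense_output=True)
t = np.linspace(0, 10, 500)
plt.plot(t, sol.sol(t)[0], 'b')
plt.show()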
I have written an example of how to do this in Python with SymPy and matplotlib. SymPy can calculate both definite and indefinite integrals. By calculating the indefinite integral and adding a constant, you can make it evaluate to 0 at t = 0. Once you have the integral, it is just a matter of plotting: define an array from a starting point to an endpoint with 1000 points in between (it could likely be fewer), calculate the value of the integral plus the constant at each point, and plot the result with matplotlib. There are plenty of other questions on how to customize plots with matplotlib.
This displays a basic plot of the indefinite integral of the function dxdt with the assumption that x(t) = 0 at t = 0. The tuple passed when running Plotting() sets the range of x values to plot; 1000 data points are plotted between the minimum and maximum values given when calling the function.
For more information on customizing the plot, I recommend the matplotlib documentation. Documentation on the integral can be found in the SymPy documentation.
from sympy import integrate, lambdify
from sympy.abc import x
import matplotlib.pyplot as plt
import numpy as np

def Plotting(xValues, dxdt):
    # Calculate the indefinite integral of dxdt
    xt = integrate(dxdt, x)
    # Convert the SymPy expression to a numerical function
    f = lambdify(x, xt)
    # Constant of integration chosen so the curve evaluates to 0 at x = 0
    C = -f(0)
    # Define x values; the last number in linspace is the number of points to plot
    xValues = np.linspace(xValues[0], xValues[1], 1000)
    yValues = [f(v) + C for v in xValues]
    # Initialize figure
    fig = plt.figure(figsize=(4, 3))
    ax = fig.add_axes([0, 0, 1, 1])
    # Plot data
    ax.plot(xValues, yValues)
    plt.show()
    plt.close("all")

# Define the function
dxdt = -(1.0*(6.84e45*x**2 + 5.24e32*x - 2.49e42))/(2.47e39*x + 7.12e37)
# Run the Plotting function, with the left- and right-most points given
# as a tuple, and the function as the second argument
Plotting((-0.025, 0.05), dxdt)

Can there be overlap in k-means clusters?

I am unclear about why k-means clustering can have overlap in clusters. From Chen (2018) I saw the following definition:
"..let the observations be a sample set to be partitioned into K disjoint clusters"
However I see an overlap in my plots, and am not sure why this is the case.
For reference, I am trying to cluster a multi-dimensional dataset with three variables (Recency, Frequency, Revenue). To visualize clustering, I can project 3D data into 2D using PCA and run k-means on that. Below is the code and plot I get:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# tx_user is the original customer dataframe
df1 = tx_user[["Recency", "Frequency", "Revenue"]]

# Standardize
names = df1.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit the data on the Scaler object
scaled_df1 = scaler.fit_transform(df1)
df1 = pd.DataFrame(scaled_df1, columns=names)
df1.head()
del scaled_df1

sklearn_pca = PCA(n_components=2)
X1 = sklearn_pca.fit_transform(df1)
X1 = X1[:, ::-1]  # flip axes for better plotting

kmeans = KMeans(3, random_state=0)
labels = kmeans.fit(X1).predict(X1)
plt.scatter(X1[:, 0], X1[:, 1], c=labels, s=40, cmap='viridis')
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)
    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    #ax.set_ylim(-5000,7000)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X1)
My questions are:
1. Why is there an overlap? Is my clustering wrong if there is one?
2. How does k-means decide cluster assignment in case there is an overlap?
Thank you
Reference:
Chen, L., Xu, Z., Wang, H., & Liu, S. (2018). An ordered clustering algorithm based on K-means and the PROMETHEE method. International Journal of Machine Learning and Cybernetics, 9(6), 917-926.
K-means computes k clusters by average approximation. Each cluster is defined by its computed center and is therefore unique by definition.
Each sample is assigned to the cluster with the closest center, which is also unique by definition. So in this sense there is NO OVERLAP.
However, for a given distance d > 0, a sample may lie within distance d of more than one cluster center. This is what you see when you say "overlap". Still, the sample is assigned only to the closest cluster, not to all of them, so there is no overlap.
NOTE: In the case where a sample is exactly the same distance from more than one closest cluster center, an arbitrary assignment can be made among the tied clusters; this changes nothing important in the algorithm or the results, since clusters are re-computed after assignment.
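A tiny sketch (my own illustration, not from the answer above) of the assignment rule just described: each sample goes to the single nearest center, and argmin breaks an exact tie by picking one cluster:

import numpy as np

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
samples = np.array([[1.0, 0.0], [2.0, 0.0], [3.5, 0.0]])  # the middle sample is equidistant
d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
labels = d.argmin(axis=1)  # one label per sample, even on a tie
print(labels)  # -> [0 0 1]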
The k-means algorithm is iterative: it tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to exactly one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all data points belonging to that cluster) is minimized. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
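To make that concrete, here is a small sketch (my addition) showing that scikit-learn's KMeans gives every sample exactly one label and that its inertia_ attribute equals the within-cluster sum of squared distances described above:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 2)) for m in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(len(km.labels_) == len(X))  # hard assignment: one label per sample
# Recompute the objective by hand to confirm the definition
d2 = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(d2, km.inertia_)  # the two values agree up to floating point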
Perhaps you did something wrong... I don't have your data, so I can't test it. You can add boundaries, and check those. See the sample code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi

def voronoi_finite_polygons_2d(vor, radius=None):
    """
    Reconstruct infinite voronoi regions in a 2D diagram to finite
    regions.

    Parameters
    ----------
    vor : Voronoi
        Input diagram
    radius : float, optional
        Distance to 'points at infinity'.

    Returns
    -------
    regions : list of tuples
        Indices of vertices in each revised Voronoi regions.
    vertices : list of tuples
        Coordinates for revised Voronoi vertices. Same as coordinates
        of input vertices, with 'points at infinity' appended to the
        end.
    """
    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")

    new_regions = []
    new_vertices = vor.vertices.tolist()

    center = vor.points.mean(axis=0)
    if radius is None:
        radius = vor.points.ptp().max()*2

    # Construct a map containing all ridges for a given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points, vor.ridge_vertices):
        all_ridges.setdefault(p1, []).append((p2, v1, v2))
        all_ridges.setdefault(p2, []).append((p1, v1, v2))

    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]

        if all([v >= 0 for v in vertices]):
            # finite region
            new_regions.append(vertices)
            continue

        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]

        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue

            # Compute the missing endpoint of an infinite ridge
            t = vor.points[p2] - vor.points[p1]  # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal

            midpoint = vor.points[[p1, p2]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + direction * radius

            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())

        # sort region counterclockwise
        vs = np.asarray([new_vertices[v] for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(vs[:,1] - c[1], vs[:,0] - c[0])
        new_region = np.array(new_region)[np.argsort(angles)]

        # finish
        new_regions.append(new_region.tolist())

    return new_regions, np.asarray(new_vertices)

# make up data points
np.random.seed(1234)
points = np.random.rand(15, 2)

# compute Voronoi tesselation
vor = Voronoi(points)

# plot
regions, vertices = voronoi_finite_polygons_2d(vor)
print("--")
print(regions)
print("--")
print(vertices)

# colorize
for region in regions:
    polygon = vertices[region]
    plt.fill(*zip(*polygon), alpha=0.4)

plt.plot(points[:,0], points[:,1], 'ko')
plt.axis('equal')
plt.xlim(vor.min_bound[0] - 0.1, vor.max_bound[0] + 0.1)
plt.ylim(vor.min_bound[1] - 0.1, vor.max_bound[1] + 0.1)
Great resource here.
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html

how to draw classifier in SVM in Matlab

Suppose I have w and b; how do I draw the classifier in Matlab? Suppose the nodes are in 2-D space, that is, x = (x1, x2).
I've tried several methods, but can't draw the classifier y = w^T x + b. Any help?
In that case, w is a 2d vector as well.
Say that you have the following data:
datawX = [1,1,3,2,0,0,0]
datawY = [2,0,1,4,2,1,0]
databX = [1,0,2,5,4,4,2]
databY = [1,0,2,2,4,4,4]
Then you calculate the classifier using the Support-vector machine method.
Say the result is w=[3,1] and b=1.5. The decision boundary is perpendicular to w, so its direction is d=[-w(2),w(1)], and you can define two points p1 and p2 on it.
After performing the SVM:
w = [3,1];
b = 1.5;
d = [-w(2), w(1)];      % direction along the boundary (perpendicular to w)
p1 = -b*w/norm(w)^2;    % a point on the line w*x' + b = 0
p2 = p1 + d;            % a second point on the same line
scatter(datawX, datawY, 'r')
hold on
scatter(databX, databY, 'g')
plot([p1(1) p2(1)], [p1(2) p2(2)], 'b');
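As a quick numeric sanity check of the geometry (a sketch in Python, since the algebra is language-independent): any point p on the boundary must satisfy w . p + b = 0.

import numpy as np

w = np.array([3.0, 1.0])
b = 1.5
d = np.array([-w[1], w[0]])   # direction perpendicular to w
p1 = -b * w / np.dot(w, w)    # a point on the line w . x + b = 0
p2 = p1 + d                   # a second point on the same line
print(np.dot(w, p1) + b, np.dot(w, p2) + b)  # both print 0.0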

How to use rp, rs, and Wn parameters in scipy.signal.filter_design.ellip?

I'd like to try out the elliptic filter design function from SciPy in scipy.signal.filter_design.ellip. I'm familiar with the filter design functions in Octave, but I'm not sure how to use this:
From the documentation at http://www.scipy.org/doc/api_docs/SciPy.signal.filter_design.html
ellip(N, rp, rs, Wn, btype = 'low', analog = 0, output = 'ba')
Elliptic (Cauer) digital and analog filter design.
Description:
Design an Nth order lowpass digital or analog elliptic filter and return the filter coefficients in (B,A) or (Z,P,K) form.
See also ellipord.
I understand N (order), btype (low or high), analog (true/false), and output (ba vs. zpk).
What are rp, rs, and Wn and how are they supposed to work?
From my experience with Octave, I'm guessing that rp and rs have to do with the maximum allowed ripple in the pass and stop bands, and that Wn is a weight or controls the cutoff frequency, but how these work isn't documented and I can't find any examples.
I believe HYRY is correct. From my experience, these SciPy functions mirror their Matlab counterparts well, apart from the sparse documentation. Rp is the maximum allowable ripple in the passband and Rs is the minimum required attenuation in the stopband, both in dB. Wn is the digital cutoff, or edge frequency, normalized so that 1 corresponds to the Nyquist frequency.
So here's some code showing how to use it to replicate the filter that MathWorks uses as an example:
import scipy.signal
import matplotlib.pyplot as plt
import numpy as np

# 6th order, 3 dB passband ripple, 50 dB stopband attenuation,
# cutoff at 300/500 of the Nyquist frequency
b, a = scipy.signal.ellip(6, 3, 50, 300.0/500.0)

fig = plt.figure()
plt.title('Digital filter frequency response')
ax1 = fig.add_subplot(111)

# freqz returns the frequencies w and the complex response h
w, h = scipy.signal.freqz(b, a)
plt.semilogy(w, np.abs(h), 'b')
plt.ylabel('Amplitude (dB)', color='b')
plt.xlabel('Frequency (rad/sample)')
plt.grid()

ax2 = ax1.twinx()
angles = np.unwrap(np.angle(h))
plt.plot(w, angles, 'g')
plt.ylabel('Angle (radians)', color='g')
plt.show()
Sorry the formatting is so plain, but it works! You'll notice that the frequency scale is different from what Matlab shows; that's just cosmetic. This is what you get:
I think this function works the same as in Octave or MATLAB, so you can read the MATLAB documentation for it.
http://www.mathworks.com/help/toolbox/signal/ref/ellip.html
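One more note (my addition, prompted by the "See also ellipord" line in the docs): ellipord can pick the minimal order N and the cutoff Wn for you from passband/stopband specs in dB, which you then feed to ellip:

import scipy.signal

# Passband edge 0.3 and stopband edge 0.4 (normalized to Nyquist),
# at most 3 dB passband ripple, at least 50 dB stopband attenuation
N, Wn = scipy.signal.ellipord(0.3, 0.4, 3, 50)
b, a = scipy.signal.ellip(N, 3, 50, Wn)
print(N, Wn)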

Using scipy.stats.gaussian_kde with 2 dimensional data

I'm trying to use the scipy.stats.gaussian_kde class to smooth out some discrete data collected with latitude and longitude information, so it shows up as somewhat similar to a contour map in the end, where the high densities are the peak and low densities are the valley.
I'm having a hard time putting a two-dimensional dataset into the gaussian_kde class. I've played around to figure out how it works with 1 dimensional data, so I thought 2 dimensional would be something along the lines of:
from scipy import stats
from numpy import array
data = array([[1.1, 1.1],
[1.2, 1.2],
[1.3, 1.3]])
kde = stats.gaussian_kde(data)
kde.evaluate([1,2,3],[1,2,3])
which is saying that I have 3 points at [1.1, 1.1], [1.2, 1.2], and [1.3, 1.3], and I want the kernel density estimate evaluated from 1 to 3, with a width of 1, on both the x and y axes.
When creating the gaussian_kde, it keeps giving me this error:
raise LinAlgError("singular matrix")
numpy.linalg.linalg.LinAlgError: singular matrix
Looking into the source code of gaussian_kde, I realize that the way I'm thinking about the dataset is completely different from how the dimensionality is calculated, but I could not find any sample code showing how multi-dimensional data works with the module. Could someone show me some sample ways to use gaussian_kde with multi-dimensional data?
This example seems to be what you're looking for:
import numpy as np
import scipy.stats as stats
from matplotlib.pyplot import imshow

# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2, scale=1, size=(2000, 1)),
                stats.norm.rvs(loc=0, scale=3, size=(2000, 1)),
                axis=1)

kde = stats.kde.gaussian_kde(rvs.T)

# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:, 0].min():rvs[:, 0].max():128j]
y_flat = np.r_[rvs[:, 1].min():rvs[:, 1].max():128j]
x, y = np.meshgrid(x_flat, y_flat)
grid_coords = np.append(x.reshape(-1, 1), y.reshape(-1, 1), axis=1)

z = kde(grid_coords.T)
z = z.reshape(128, 128)

imshow(z, aspect=x_flat.ptp()/y_flat.ptp())
Axes need fixing, obviously.
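For instance (my suggestion, reusing the names from the block above), passing origin='lower' and an extent in data coordinates to imshow puts proper axis labels on the plot:

imshow(z, origin='lower', aspect='auto',
       extent=[x_flat.min(), x_flat.max(), y_flat.min(), y_flat.max()])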
You can also do a scatter plot of the data with
scatter(rvs[:,0],rvs[:,1])
I think you are mixing up kernel density estimation with interpolation, or maybe with kernel regression. KDE estimates the distribution of your data from a sample of points.
I'm not sure which interpolation you want, but either the splines or rbf in scipy.interpolate will be more appropriate.
If you want one-dimensional kernel regression, then you can find a version in scikits.statsmodels with several different kernels.
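For instance, a minimal sketch of RBF interpolation on scattered 2-D samples (the data here are invented purely for illustration):

import numpy as np
from scipy.interpolate import Rbf

rng = np.random.default_rng(0)
xs, ys = rng.random(50), rng.random(50)
vs = np.sin(6 * xs) * np.cos(6 * ys)  # made-up sample values

rbf = Rbf(xs, ys, vs)  # build the interpolant from the scattered samples
print(rbf(0.5, 0.5))   # evaluate it at an arbitrary point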
update: here is an example (if this is what you want)
>>> data = 2 + 2*np.random.randn(2, 100)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 0.02573917, 0.02470436, 0.03084282])
gaussian_kde has variables in rows and observations in columns, which is the reverse of the usual orientation in stats. In your example, all three points lie on a line, so they are perfectly correlated. That is, I guess, the reason for the singular matrix.
Adjusting the array orientation and adding a small amount of noise, the example works, but it still looks very concentrated; for example, you don't have any sample point near (3,3):
>>> data = np.array([[1.1, 1.1],
...                  [1.2, 1.2],
...                  [1.3, 1.3]]).T
>>> data = data + 0.01*np.random.randn(2,3)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 7.70204299e+000, 1.96813149e-044, 1.45796523e-251])
I found it difficult to understand the SciPy manual's description of how gaussian_kde works with 2-D data. Here is an explanation intended to complement @endolith's example above. I divided the code into several steps with comments to explain the less intuitive bits.
First, the imports:
import numpy as np
import scipy.stats as st
from matplotlib.pyplot import imshow, show
Create some dummy data: these are 1-D arrays of the "X" and "Y" point coordinates.
np.random.seed(142) # for reproducibility
x = st.norm.rvs(loc=2, scale=1, size=2000)
y = st.norm.rvs(loc=0, scale=3, size=2000)
For 2-D density estimation the gaussian_kde object has to be initialised with an array with two rows containing the "X" and "Y" datasets. In NumPy terminology, we "stack them vertically":
xy = np.vstack((x, y))
so the "X" data is in the first row xy[0,:] and the "Y" data are in the second row xy[1,:] and xy.shape is (2, 2000). Now create the gaussian_kde object:
dens = st.gaussian_kde(xy)
We will evaluate the estimated 2-D density PDF on a 2-D grid. There is more than one way of creating such a grid in NumPy. I show here an approach which is different from (but functionally equivalent to) @endolith's method:
gx, gy = np.mgrid[x.min():x.max():128j, y.min():y.max():128j]
gxy = np.dstack((gx, gy)) # shape is (128, 128, 2)
gxy is a 3-D array; the [i, j]-th element of gxy contains a 2-element array of the corresponding "X" and "Y" values: gxy[i, j] is [ gx[i, j], gy[i, j] ].
We have to invoke dens() (or dens.pdf() which is the same thing) on each of the 2-D grid points. NumPy has a very elegant function for this purpose:
z = np.apply_along_axis(dens, 2, gxy)
In words, the callable dens (it could have been dens.pdf as well) is invoked along axis=2 (the third axis) of the 3-D array gxy, and the values are returned as a 2-D array. The only glitch is that the shape of z will be (128, 128, 1) and not (128, 128) as I expected. Note that the documentation says:
The shape of out [the return value, L.D.] is identical to the shape of arr, except along the
axis dimension. This axis is removed, and replaced with new dimensions
equal to the shape of the return value of func1d. So if func1d returns
a scalar out will have one fewer dimensions than arr.
Most likely dens() returned a 1-element array rather than the scalar I was hoping for. I didn't investigate the issue any further, because this is easy to fix:
z = z.reshape(128, 128)
after which we can generate the image:
imshow(z, aspect=gx.ptp() / gy.ptp())
show() # needed if you try this in PyCharm
Here is the image. (Note that I implemented @endolith's version as well and got an image indistinguishable from this one.)
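As an aside (my addition, not part of the original walkthrough): instead of np.apply_along_axis you can evaluate dens on a (2, N) array of flattened grid coordinates, which is the calling convention the other answers use and is typically much faster:

coords = np.vstack((gx.ravel(), gy.ravel()))  # shape (2, 128*128)
z = dens(coords).reshape(gx.shape)            # back to the (128, 128) grid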
The example posted in the top answer didn't work for me. I had to tweak it a little bit, and it works now:
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt

# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2, scale=1, size=(2000, 1)),
                stats.norm.rvs(loc=0, scale=3, size=(2000, 1)),
                axis=1)

kde = stats.kde.gaussian_kde(rvs.T)

# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:, 0].min():rvs[:, 0].max():128j]
y_flat = np.r_[rvs[:, 1].min():rvs[:, 1].max():128j]
x, y = np.meshgrid(x_flat, y_flat)
grid_coords = np.append(x.reshape(-1, 1), y.reshape(-1, 1), axis=1)

z = kde(grid_coords.T)
z = z.reshape(128, 128)

plt.imshow(z, aspect=x_flat.ptp()/y_flat.ptp())
plt.show()