How to visualize rownames of specific data points within a cluster plot in R?

I have calculated clusters on a big dataset (1) and found four clusters, which I plotted. Now I have 30 new data points (2) that I want to plot on top of the existing clusters in order to see which of the new data points are closest to the original cluster centroids (of the big dataset (1)).
What I did so far:
#I have combined both data sets (1. my old big data set and 2. my 30 new data points) and added an indicator variable in order to distinguish between the old and new data:
# I only chose variables that are needed for the cluster calculations as well as the indicator
combined.ind <- combined[, c(1752:1757, 1759:1762, 1942)]
#I created a factor variable that indicates "new" and "old" observations:
combined.ind$indicator <- factor(combined.ind$indicator,
levels = c(0,1),
labels = c("new", "old"))
#Then I ran a hierarchical cluster analysis with Ward's method and used the resulting cluster centroids as starting points for a k-means clustering:
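(Aside: CLU4_1 is assumed to already hold a 4-cluster Ward assignment; as a hedged sketch, such a column could be produced along these lines with hclust and cutree:)
d.ward <- dist(combined[, c("Z1","Z2","Z3","Z4","Z5","Z6","Z7","Z8","Z9","Z10")])
hc.ward <- hclust(d.ward, method = "ward.D2")
combined$CLU4_1 <- cutree(hc.ward, k = 4)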
#calculate ward-centroids:
combined.ward.cent <- aggregate(cbind(Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8, Z9, Z10)~CLU4_1,combined,mean)
combined.ward.cent2 <- combined.ward.cent[, c(2:11)]
#apply kmeans with ward centroids as initial starting points:
kmeans <- kmeans(combined.ind[1:(length(combined.ind)-1)], centers = combined.ward.cent2)
#Then I have plotted the results and tried to highlight the new data points:
#Plot the results
fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind) - 1)])
#I changed the colors with scale_color_manual in order to see the new data points.
fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind) - 1)], geom = c("point", "text"), ellipse = TRUE) +
  geom_point(aes(color = combined.ind$indicator)) + ggtitle("My Beautiful Graph") +
  scale_color_manual("Old vs New", values = c("new" = "black", "old" = "red"))
Since the first dataset is huge, I cannot read the rownames of the new data points because they all overlap. When I add repel = TRUE to the call (see below), only the rownames of the data points at the edges are shown, which does not help me because I am trying to visualize only the rownames of the new data points.
fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind) - 1)], geom = c("point", "text"), repel = TRUE, ellipse = TRUE) +
  geom_point(aes(color = combined.ind$indicator)) + ggtitle("My Beautiful Graph") +
  scale_color_manual("Old vs New", values = c("new" = "black", "old" = "red"))
How can I solve this problem?
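One possible approach, as a minimal untested sketch (assuming fviz_cluster() keeps the reduced coordinates in the data slot of the ggplot it returns, with columns x, y and name in the same row order as the input): draw everything as points only, then label just the rows flagged as new with ggrepel.
library(ggrepel)
p <- fviz_cluster(kmeans, data = combined.ind[, 1:(length(combined.ind) - 1)],
                  geom = "point", ellipse = TRUE)
#keep only the coordinates of the new observations
new.coords <- p$data[combined.ind$indicator == "new", ]
p + geom_text_repel(data = new.coords, aes(x = x, y = y, label = name)) +
  ggtitle("My Beautiful Graph")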

Related

How to deal with KML3D results in R?

I have run kml3d in R on longitudinal medical data with two outcome measures (EQ-5D score and Oxford score) measured at three time points.
After pre-processing the data, I used the following code to build the model:
trajectory_knee <- clusterLongData3d(data, timeInData = list(oxford = c(17, 19, 21), eq5d = c(18, 20, 22)), varNames = c("Oxford score", "Eq5d score"))
kml3d(trajectory_knee, nbClusters = 2:5, nbRedrawing = 4, toPlot = "both")
The goal of my kml3d model is to identify distinct clusters of observations. However, running kml3d only gave me the Calinski-Harabasz criterion plot, while my goal is to:
obtain the cluster labels generated by the model
plot the clusters
use the BIC + plot trajectories to find the optimal numbers of clusters
However, I do not know how to reach the above goals. Is there someone who can help me / point me in the right direction?
Thanks!
I tried using NbClust, plotAllCriterion, and choice(trajectory), but none of these give me information about the goals formulated above...
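For what it's worth, a hedged sketch of how these goals are usually reached with the kml/kml3d API (assuming the helpers from the companion kml and longitudinalData packages behave the same for ClusterLongData3d objects):
#cluster labels of, e.g., the 4-cluster partition, as a factor
labels <- getClusters(trajectory_knee, 4)
#plot the trajectories of that partition
plot(trajectory_knee, 4)
#compare quality criteria (Calinski-Harabasz, BIC, ...) across cluster numbers
plotAllCriterion(trajectory_knee)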

N-dimensional GP Regression

I'm trying to use GPflow for a multidimensional regression. But I'm confused by the shapes of the mean and variance.
For example: a 2-dimensional input space X of shape (20,20) is supposed to be predicted. My training samples have shape (8,2), which means 8 training samples overall for the two dimensions. The y-values have shape (8,1), which of course means one ground-truth value per combination of the two input dimensions.
If I now use model.predict_y(X), I would expect to receive a mean of shape (20,20) but obtain one of shape (20,1). The same goes for the variance. I think this problem comes from the shape of the y-values, but I have no idea how to fix it.
bound = 3
num = 20
X = np.random.uniform(-bound, bound, (num,num)) # (20,20)
print(X_sample.shape) # (8,2)
print(Y_sample.shape) # (8,1)
k = gpflow.kernels.RBF(input_dim=2)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
gpflow.train.ScipyOptimizer().minimize(m)
mean, var = m.predict_y(X)
print(mean.shape) # (20, 1)
print(var.shape) # (20, 1)
It sounds like you may be confusing the shape of a grid of input positions with the shape of the numpy arrays: if you want to predict on a 20 x 20 grid in two dimensions, you have 400 points in total, each with 2 values. So X (the one that you pass to m.predict_y()) should have shape (400, 2). (Note that the second dimension needs to match that of X_sample!)
To construct this array of shape (400,2) you can use np.meshgrid (e.g., see What is the purpose of meshgrid in Python / NumPy?).
m.predict_y(X) only predicts the marginal variance at each test point, so the returned mean and var both have shape (400,1) (same length as X). You can of course reshape them to the 20 x 20 values on your grid.
(It is also possible to compute the full covariance; for the latent f this is implemented as m.predict_f_full_cov, which for X of shape (400,2) would return a 400x400 matrix. This is relevant if you want consistent samples from the GP, but I suspect that goes well beyond this question.)
I was indeed making the mistake of not flattening the arrays, which in turn produced the error. Thank you for the fast response, STJ!
Here is an example of the working code:
# Generate data
bound = 3.
x1 = np.linspace(-bound, bound, num)
x2 = np.linspace(-bound, bound, num)
x1_mesh,x2_mesh = np.meshgrid(x1, x2)
X = np.dstack([x1_mesh, x2_mesh]).reshape(-1, 2)
z = f(x1_mesh, x2_mesh) # evaluation of the function on the grid
# Draw samples from feature vectors and function by a given index
size = 2
np.random.seed(1991)
index = np.random.choice(range(len(x1)), size=(size,X.ndim), replace=False)
samples = utils.sampleFeature([x1,x2], index)
X1_sample = samples[0]
X2_sample = samples[1]
X_sample = np.column_stack((X1_sample, X2_sample))
Y_sample = utils.samplefromFunc(f=z, ind=index)
# Change noise parameter
sigma_n = 0.0
# Construct models with initial guess
k = gpflow.kernels.RBF(2,active_dims=[0,1], lengthscales=1.0,ARD=True)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
#print(X.shape)
mean, var = m.predict_y(X)
mean_square = mean.reshape(x1_mesh.shape) # Shape: (num,num)
var_square = var.reshape(x1_mesh.shape) # Shape: (num,num)
# Plot mean
fig = plt.figure(figsize=(16, 12))
ax = plt.axes(projection='3d')
ax.plot_surface(x1_mesh, x2_mesh, mean_square, cmap=cm.viridis, linewidth=0.5, antialiased=True, alpha=0.8)
cbar = ax.contourf(x1_mesh, x2_mesh, mean_square, zdir='z', offset=offset, cmap=cm.viridis, antialiased=True)
ax.scatter3D(X1_sample, X2_sample, offset, marker='o',edgecolors='k', color='r', s=150)
fig.colorbar(cbar)
for t in ax.zaxis.get_major_ticks(): t.label.set_fontsize(fontsize_ticks)
ax.set_title("$\mu(x_1,x_2)$", fontsize=fontsize_title)
ax.set_xlabel("\n$x_1$", fontsize=fontsize_label)
ax.set_ylabel("\n$x_2$", fontsize=fontsize_label)
ax.set_zlabel('\n\n$\mu(x_1,x_2)$', fontsize=fontsize_label)
plt.xticks(fontsize=fontsize_ticks)
plt.yticks(fontsize=fontsize_ticks)
plt.xlim(left=-bound, right=bound)
plt.ylim(bottom=-bound, top=bound)
ax.set_zlim3d(offset,np.max(z))
which produces the 3D surface plot (red dots are the sample points drawn from the function). Note: code not refactored whatsoever :)

create training and testing set with ground truth for Hyper spectral satellite imagery

I am trying to create training and testing sets out of my ground truth (observation) data, which comes in a tif (raster) format.
Actually, I have a hyperspectral (satellite) image with 200 dimensions (channels/bands), along with the corresponding labels (17 classes), which are stored in another image. Now, my goal is to implement a classification algorithm and then check the accuracy on the testing dataset.
My problem is that I do not know how to tell my algorithm which pixel belongs to which class, and then how to split them into training and testing sets.
I have a rough idea of my goal, illustrated by the example below. But I do not want to do it this way, since my image is 145 x 145 pixels, so it is not easy to locate these pixels and manually assign them to their corresponding classes.
Note that the following example is for a 3-band (RGB) image, while mine has 200 bands; and since I already have the labels (ground truth), I do not need to specify them as in the following code, but I do want to assign them to their member pixels.
% Assigning pixel(by their location)to different groups.
tpix=[1309,640 ,1;... % Group 1
1218,755 ,1;...
1351,1409,2;... % Group 2
673 ,394 ,2;...
285 ,1762,3;... % Group 3
177 ,1542,3;...
538 ,1754,4;... % Group 4
432 ,1811,4;...
1417,2010,5;... % Group 5
163 ,1733,5;...
652 ,677 ,6;... % Group 6
864 ,1032,6];
row=tpix(:,1); % y-value
col=tpix(:,2); % x-value
group=tpix(:,3); % group number
ngroup=max(group);
% create trainingset
train=[];
for i=1:length(group)
train=[train; r(row(i),col(i)), g(row(i),col(i)), b(row(i),col(i))];
end %for
Do I understand this right? In the second-to-last line, the train variable gets the values it has so far plus the pixels' red, green and blue values? Like, you want them to be displayed only in red, green and blue? Only certain ones or all of them? I could imagine that we define an image matrix and then place the values in the image's red, green and blue layers. Would that help? I'd write you the code if this is your issue :)
Edit: Solution
%download the .mats from the website and put them in folder of script
load 'Indian_pines_corrected.mat';
load 'Indian_pines_gt.mat';
ipc = indian_pines_corrected;
gt = indian_pines_gt;
%initiating cell
train = cell(16,1);
%loop to search class number of the x and y pixel coordinates
for c = 1:16
for i = 1:145
for j = 1:145
% if the classnumber is equal to the number in the gt pixel,
% then place the pixel from ipc(x,y,:) it in the train{classnumber}(x,y,:)
if gt(i,j) == c
train{c}(i,j,:) = ipc(i,j,:);
end %if
end %for j
end %for i
end %for c
Now you get the train cell, which has a matrix in each cell. Each cell is one class and contains only the pixels that you want. You can check for yourself whether the classes correspond to the shape.
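For example, to eyeball one class (say class 5, using its first band), something like:
figure; imagesc(train{5}(:,:,1)); axis image; title('class 5, band 1');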
Eventually, I was able to solve my problem. The following code reshapes the matrix (raster) to a vector, and then I index the ground truth data to find the corresponding pixels' locations in the hyperspectral image.
Note that I am looking for an efficient way to construct Training and Testing set.
GT = indian_pines_gt;
data = indian_pines_corrected;
data_vec=reshape(data, 145*145,200);
GT_vec = reshape(GT,145*145,1);
[GT_vec_sort,idx] = sort(GT_vec);
%INDEXING.
index = find(and(GT_vec_sort>0,GT_vec_sort<=16));
classes_num = GT_vec_sort(index);
%length(index)
for k = 1: length(index)
classes(k,:) = data_vec(idx(index(k)),:);
end
figure(1)
plot(GT_vec_sort)
Update:
I have done the following for creating training and testing sets for hyperspectral images (Indian Pines dataset). No need to use a for loop.
clear all
load('Indian_pines_corrected.mat');
load Indian_pines_gt.mat;
GT = indian_pines_gt;
data = indian_pines_corrected;
%Convert image from raster to vector.
data_vec = reshape(data, 145*145, 200);
%Provide location of the desired classes.
GT_loc = find(and(GT>0,GT<=16));
GT_class = GT(GT_loc);
data_value = data_vec(GT_loc,:);
% explanatory variables plus response variable
% [200 (variables/channels) + 1 (labels) = 201 columns]
dat = [data_value, GT_class];
% create random Test and Training set.
[m,n] = size(dat);
P = 0.70 ;
idx = randperm(m);
Train = dat(idx(1:round(P*m)),:);
Test = dat(idx(round(P*m)+1:end),:);
X_train = Train(:,1:200); y_train = Train(:, 201);
X_test = Test(:,1:200); y_test = Test(:, 201);
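As a hedged alternative (untested; requires the Statistics and Machine Learning Toolbox), cvpartition can draw the same 70/30 hold-out split stratified by class, so the proportions of the 16 classes are preserved in both sets:
% stratified hold-out split on the class labels
c = cvpartition(GT_class, 'HoldOut', 0.30);
X_train = data_value(training(c), :); y_train = GT_class(training(c));
X_test = data_value(test(c), :); y_test = GT_class(test(c));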

Distinguish classes with different colors Networkx

I have a huge dataset of 80,000 rows. I want to draw a meaningful graph in networkx using two dataframes (nodes and edges).
In "nodes", I have : actor1 , category_id(int :numerical value from 0 - 7 describe the type , and fatalities (float representing the number of injured or killed people))
In "edges" : "actor1", "actor2", "interaction: float 64"
my aim is to draw a graph with different colors according to category_id and different sizes based on number of fatalities
I started with this code, which ran perfectly until I tried to retrieve interaction and fatalities to calculate the weights of nodes, as follows:
nodes = ACLED_to_graph[['actor1','category_id','fatalities']]
edges = ACLED_to_graph[['actor1','actor2','interaction']]
# Initiate the graph
G4 = nx.Graph()
for index, row in nodes.iterrows():
G4.add_node(row['actor1'], group=row['category_id'], nodesize=row['fatalities'])
for index, row in edges.iterrows():
G4.add_weighted_edges_from([(row['actor1'], row['actor2'], row['interaction'])])
#greater_than_ = [x for x in G.nodes(data=True) if x[2]['nodesize']>15]
# Sort nodes by degree
sorted(G4.degree, key=lambda x: x[1], reverse=True)
# remove nodes whose degree is >= 200 or < 4
cond1 = [node for node,degree in G4.degree() if degree>=200]
cond2 = [node for node,degree in G4.degree() if degree<4]
remove = cond1+cond2
G4.remove_nodes_from(remove)  # removing nodes also removes their incident edges
# Customize the layout
pos=nx.spring_layout(G4, k=0.25, iterations=50)
# Define color map for classes
color_map = {0:'#f1f0c0',1:'#f09494', 2:'#eebcbc', 3:'#72bbd0', 4:'#91f0a1', 5:'#629fff', 6:'#bcc2f2',
7:'#eebcbc' }
plt.figure(figsize=(25,25))
options = {
'edge_color': '#FFDEA2',
'width': 1,
'with_labels': True,
'font_weight': 'regular',
}
colors = [color_map[G4.node[node]['category_id']] for node in G4.node]
#sizes = [G.node[node]['interaction'] for node in G]
"""
Using the spring layout :
- k controls the distance between the nodes and varies between 0 and 1
- iterations is the number of times simulated annealing is run
default k=0.1 and iterations=50
"""
#node_color=colors,
#node_size=sizes,
nx.draw(G4,node_color=colors, **options,cmap=plt.get_cmap('jet'))
ax = plt.gca()
ax.collections[0].set_edgecolor("#555555")
I am also removing nodes with degrees of at least 200 or less than 4 to simplify the graph and make it more appealing.
I am getting the following error :
colors = [color_map[G4.node[node]['category_id']] for node in G4.node]
KeyError: 'category_id'
Without the input data it is a bit hard to tell for sure, but it looks as if you are not constructing the graph nodes with a 'category_id' property. In the for index, row in nodes.iterrows(): loop you assign the category data to the property 'group'.
You can confirm this to be the case by checking which keys are set for an example node in your graph, e.g. print(G4.node['actor1'].keys()).
To fix this, either
a) change the assignment
for index, row in nodes.iterrows():
G4.add_node(row['actor1'], category_id=row['category_id'], nodesize=row['fatalities'])
or b) change the lookup
colors = [color_map[G4.node[node]['group']] for node in G4.node]
Solving mathematical operations using node attributes can be summarized as follows:
1 - After subsetting the dataframe, we initialize the graph
nodes = ACLED_to_graph[['actor1','category_id','interaction']]
edges = ACLED_to_graph[['actor1','actor2','fatalities']]
# Initiate the graph
G8 = nx.Graph()
2 - Add the edge attributes first (I emphasize the use of from_pandas_edgelist; no loop is needed, since it consumes the whole dataframe at once)
G8 = nx.from_pandas_edgelist(edges, 'actor1', 'actor2', ['fatalities'])
3 - Next, we add the node attributes using add_node; other techniques such as set_node_attributes didn't work for me with pandas
for index, row in nodes.iterrows():
G8.add_node(row['actor1'], category_id=row['category_id'], interaction=row['interaction'])
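(As a hedged aside: in networkx 2.x, nx.set_node_attributes does accept a plain dict keyed by node, so building that dict from the dataframe first should also work:)
# hypothetical alternative to the loop above
category_map = nodes.set_index('actor1')['category_id'].to_dict()
nx.set_node_attributes(G8, category_map, 'category_id')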
4 - Sort nodes by degree to select the most connected nodes (I am keeping nodes with a degree of at least 3 and less than 200)
sorted(G8.degree, key=lambda x: x[1], reverse=True)
# remove nodes whose degree is >= 200 or < 3
cond1 = [node for node,degree in G8.degree() if degree>=200]
cond2 = [node for node,degree in G8.degree() if degree<3]
remove = cond1+cond2
G8.remove_nodes_from(remove)  # removing nodes also removes their incident edges
5 - Set the node color based on degree (calling G8.degree)
node_color = [G8.degree(v) for v in G8]
6 - Set the edge width based on fatalities
edge_width = [0.15*G8[u][v]['fatalities'] for u,v in G8.edges()]
7 - Set the node size based on interaction
node_size = list(nx.get_node_attributes(G8, 'interaction').values())
I used get_node_attributes instead of pandas to access the features, which allowed me to list the attribute dictionary's values, ready to compute.
8 - Select the most important edges based on fatalities
large_edges = [x for x in G8.edges(data=True) if x[2]['fatalities']>=3.0]
9 - Finally, draw the network and the highlighted edges separately
pos = nx.spring_layout(G8, k=0.25, iterations=50)  # layout, as in the question
nx.draw_networkx(G8, pos, node_size=node_size, node_color=node_color, alpha=0.7, with_labels=False, width=edge_width, edge_color='.4', cmap=plt.cm.Blues)
nx.draw_networkx_edges(G8, pos, edgelist=large_edges, edge_color='r', alpha=0.4, width=6)

Plot portfolio composition map in Julia (or Matlab)

I am optimizing a portfolio of N stocks over M levels of expected return. So after doing this I get the weights (i.e. an M x N matrix, where each row is a combination of stock weights for a particular level of expected return). Weights add up to 1.
Now I want to plot something called a portfolio composition map (the right plot in the picture), which is a plot of these stock weights over all levels of expected return: each stock has a distinct color, and its width (at every level of return) is proportional to its weight.
My question is how to do this in Julia (or MATLAB)?
I came across this and the accepted solution seemed so complex. Here's how I would do it:
using Plots
@userplot PortfolioComposition
@recipe function f(pc::PortfolioComposition)
weights, returns = pc.args
weights = cumsum(weights,dims=2)
seriestype := :shape
for c=1:size(weights,2)
sx = vcat(weights[:,c], c==1 ? zeros(length(returns)) : reverse(weights[:,c-1]))
sy = vcat(returns, reverse(returns))
@series Shape(sx, sy)
end
end
# fake data
tickers = ["IBM", "Google", "Apple", "Intel"]
N = 10
D = length(tickers)
weights = rand(N,D)
weights ./= sum(weights, dims=2)
returns = sort!((1:N) + D*randn(N))
# plot it
portfoliocomposition(weights, returns, labels = tickers)
matplotlib has a pretty powerful polygon plotting capability, e.g. this link on plotting filled polygons:
plotting filled polygons in python
You can use this from Julia via the excellent PyPlot.jl package.
Note that the syntax for certain things changes; see the PyPlot.jl README and e.g. this set of examples.
You "just" need to calculate the coordinates from your matrix and build up a set of polygons to plot the portfolio composition graph. It would be nice to see the code if you get this working!
So I was able to draw it, and here's my code:
using PyPlot
using PyCall
@pyimport matplotlib.patches as patch
N = 10
D = 4
weights = Array(Float64, N,D)
for i in 1:N
w = rand(D)
w = w/sum(w)
weights[i,:] = w
end
weights = [zeros(Float64, N) weights]
weights = cumsum(weights,2)
returns = sort!([linspace(1,N, N);] + D*randn(N))
##########
# Plot #
##########
polygons = Array(PyObject, 4)
colors = ["red","blue","green","cyan"]
labels = ["IBM", "Google", "Apple", "Intel"]
fig, ax = subplots()
fig[:set_size_inches](5, 7)
title("Problem 2.5 part 2")
xlabel("Weights")
ylabel("Return (%)")
ax[:set_autoscale_on](false)
ax[:axis]([0,1,minimum(returns),maximum(returns)])
for i in 1:(size(weights,2)-1)
xy=[weights[:,i] returns;
reverse(weights[:,(i+1)]) reverse(returns)]
polygons[i] = matplotlib[:patches][:Polygon](xy, true, color=colors[i], label = labels[i])
ax[:add_artist](polygons[i])
end
legend(polygons, labels, bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0)
show()
# savefig("CompositionMap.png",bbox_inches="tight")
Can't say that this is the best way to do this, but at least it is working.