Why does the confusion matrix show different results from the random undersampling class distribution?

I have an imbalanced dataset that consists of 17 numerical features and 3 output classes. I applied random undersampling and then computed the confusion matrix shown below.
My question: when random undersampling leaves 33 samples for each class, why does the confusion matrix show more than 33?
#Raw Data Distribution
layers_counts=y.value_counts()
layers_counts
#Output
2 498
1 116
0 39
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy="not minority")
X_rus, y_rus = rus.fit_resample(Xtrain, ytrain)
y_rus.value_counts()
#Output
0 33
1 33
2 33
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_rus, y_rus)
ypred = classifier.predict(Xtest)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest, ypred)
cm_df2 = pd.DataFrame(cm,
                      index=['VCS', 'VSG', 'VG'],
                      columns=['VCS', 'VSG', 'VG'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df2, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()
When the RandomUnderSampler produced 33 samples for each class, I expected the counts in the confusion matrix to match that 33, but they do not. I am confused about that point; could you help me understand?
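For concreteness, here is a minimal sketch (reusing the variable names above, which are assumed to already exist) that prints the two distributions being compared; the resampler was only fitted on the training split, while the confusion matrix is computed from ytest:
# Compare the resampled training distribution with the test distribution;
# the confusion matrix rows always sum to the class counts of ytest.
print(y_rus.value_counts())   # 33 per class: only the training data were undersampled
print(ytest.value_counts())   # original (imbalanced) counts of the test split
print(cm.sum(axis=1))         # row sums of cm equal the test-set class counts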

Scipy sparse matrix dimension issue

I was working on a simple MultiOutputRegressor model with KNeighborsRegressor. My X_train, X_test, y_train, y_test are in Scipy sparse matrix data type. I have 1185 features and 46 targets to predict.
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor
kreg = MultiOutputRegressor(KNeighborsRegressor())
# fit model
kreg.fit(X_train, y_train)
>>> MultiOutputRegressor(estimator=KNeighborsRegressor())
kreg.predict(X_test)
After kreg.predict(X_test) I got error messages; the last one says:
~/opt/anaconda3/envs/data3/lib/python3.8/site-packages/scipy/sparse/_index.py in __getitem__(self, key)
62 return self._get_arrayXint(row, col)
63 elif isinstance(col, slice):
---> 64 raise IndexError('index results in >2 dimensions')
65 elif row.shape[1] == 1 and col.ndim == 1:
66 # special case for outer indexing
IndexError: index results in >2 dimensions
Where did I go wrong?
Thanks.
It turns out I shouldn't have had my labels, i.e. y_train and y_test, in a sparse matrix data type. Once I kept them as NumPy arrays, the code worked.
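A minimal sketch of that fix, assuming the same variable names as above: keep the (sparse) feature matrices as they are, but convert the label matrices to dense NumPy arrays before fitting and predicting.
from scipy import sparse
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor

# Labels must be dense; the sparse feature matrices can stay sparse.
if sparse.issparse(y_train):
    y_train = y_train.toarray()
if sparse.issparse(y_test):
    y_test = y_test.toarray()

kreg = MultiOutputRegressor(KNeighborsRegressor())
kreg.fit(X_train, y_train)
preds = kreg.predict(X_test)   # now returns a dense (n_samples, 46) array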

K-means sort labels

Assume I have a matrix A and I perform k-means clustering on it in MATLAB. I get the following:
A=
1 20 5
1 30 10
2 60 20
5 100 45
kmeans(A,4) results in the following labels:
2
4
3
1
Now I permute rows of A and I get matrix B:
B =
2 60 20
1 30 10
5 100 45
1 20 5
and after applying kmeans the labels are B1 = [3 1 2 4], which seems to be a random assignment. For example, the second row of matrix A is in cluster 4, but the second row of matrix B, which is identical to the second row of A, is in cluster 1.
How can I get kmeans labels such that rows with the highest values always get the same label, for example 3, and rows with the lowest values always get 1?
For example, the last row of A gets label 3, so the third row of B should also get label 3.
Each label is tied to the mean of a cluster. To sort the labels, you sort the means in e.g. order of appearance along a given axis (x-axis in this example). Here's an implementation in Python:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
np.random.seed(1)
def rearrange_labels(X, cluster_labels, sort_on_column=0):
    # centre of every cluster, in old-label order
    ctrs = []
    for i in range(len(set(cluster_labels))):
        ctrs.append(np.mean(X[cluster_labels == i], axis=0))
    ctrs = np.vstack(ctrs)
    # sort on x column
    new_order = ctrs[:, sort_on_column].argsort()
    ctrs_new = ctrs[new_order]
    # map every point's old label to its position in the sorted order
    cluster_labels = np.argsort(new_order)[cluster_labels]
    return cluster_labels, ctrs_new
X, _ = make_blobs(n_samples=500, centers=10, n_features=2)
clf = KMeans(n_clusters=10)
cluster_labels = clf.fit_predict(X)
cluster_labels, ctrs = rearrange_labels(X=X, cluster_labels=cluster_labels)
fig, ax = plt.subplots()
for i, m in enumerate(ctrs):
    ax.annotate(
        str(i),
        xy=(m[0], m[1]),
        bbox=dict(boxstyle="square", fc="w", ec="grey", alpha=0.9),
    )
ax.scatter(X[:, 0], X[:, 1], c=cluster_labels)
plt.show()
The cluster numbers assigned by k-means do not have an order, so don't treat them as if they did. They are numbers just for convenience; they might as well be A, B, C, D.
If you want to impose an order on them, you can relabel them however you want, for example by sorting the centers by their x coordinate and relabeling accordingly, as in the sketch below. It's not the job of k-means to do this; you need to do it yourself.
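A minimal sketch of that relabelling idea with scikit-learn, assuming a fitted KMeans estimator clf and data X as in the code above:
# Old labels ordered by the x coordinate of their centers
order = np.argsort(clf.cluster_centers_[:, 0])
# Inverse permutation: old label -> new, sorted label
remap = np.argsort(order)
sorted_labels = remap[clf.predict(X)]   # label 0 is now the left-most cluster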

How to derive meaning/values from cluster analysis results

For my Master's thesis I would like to create a simulation model that simulates the walking behavior of older adults. To make the simulation model simpler, I want to form groups based on a cluster analysis, so that I can easily assign a certain walking behavior to an older person if they belong to a certain group (for example, if you belong to group 1, your walking time will be approximately 20 minutes).
However, I am not that familiar with cluster analysis. I have a big dataset containing many characteristics of the older adults (variables of both discrete and continuous nature); currently the following characteristics are used, based on the literature:
age, gender, scorehealth, education category, income category, occupation, socialnetwork, living in a pleasant neighbourhood (yes/no), feeling safe in the neighbourhood (yes/no), distance to green, having a dog, and walking time in minutes.
After using the daisy function and the silhouette method to determine the ideal number of clusters/groups, I got my clusters. However, now I am wondering how I should derive meaning from them. I find it difficult to use statistical functions such as means, since I am dealing with categories. So what can I do to derive useful meaning/statistical conclusions from each cluster group, such as: if you belong to cluster group 1, your income level should on average be around income group 10, your age should be around 70, and your walking time should be around 20 minutes? Ideally I would also like to have the standard deviation of each variable in each cluster group.
That way I can easily use these values in my simulation model to assign a certain walking behavior to older adults.
@Joy, you should first determine the relevant variables; this will also help with dimensionality reduction. Since you have not given a sample dataset to work with, I am creating my own. Also note that before cluster analysis it is important to obtain clusters that are pure. By purity, I mean the cluster must contain only those variables that account for the maximum variance in the data. Variables that show little to negligible variance are best removed, because they do not contribute to the cluster model. Once you have these (statistically) significant variables, cluster analysis will be meaningful.
Theoretical concepts
Clustering is a preprocessing algorithm. It is imperative to derive statistically significant variables in order to extract pure clusters. In a classification task the derivation of these significant variables is called feature selection, whereas in a clustering task it is done via Principal Components (PCs). Historically, PCs are known to work only for continuous variables. To derive PCs from categorical variables there is a method called Correspondence Analysis (CA), and for nominal categorical variables Multiple Correspondence Analysis (MCA) can be used.
Practical implementation
Let's create a data frame containing mixed variables (i.e. both categorical and continuous) like,
R> digits = 0:9
# set seed for reproducibility
R> set.seed(17)
# function to create random string
R> createRandString <- function(n = 5000) {
     a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
     paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
   }
R> df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                    studLoc = sample(createRandString(10)),
                    finalmark = sample(c(0:100), 10),
                    subj1mark = sample(c(0:100), 10),
                    subj2mark = sample(c(0:100), 10))
R> str(df)
'data.frame': 10 obs. of 6 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10
$ name : Factor w/ 10 levels "a","b","c","d",..: 2 9 4 6 3 7 1 8 10 5
$ studLoc : Factor w/ 10 levels "APBQD6181U","GOSWE3283C",..: 5 3 7 9 2 1 8 10 4 6
$ finalmark: int 53 73 95 39 97 58 67 64 15 81
$ subj1mark: int 63 18 98 83 68 80 46 32 99 19
$ subj2mark: int 90 40 8 14 35 82 79 69 91 2
I will inject random missing values into the data so that it is more similar to real-world datasets.
# add random NA values
R> df<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
R> colSums(is.na(df))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 2 2 0
As you can see, the missing values are in the continuous variables finalmark and subj1mark. I chose median imputation over mean imputation because the median is more robust than the mean.
# Create a function to impute the missing values
R> ImputeMissing <- function(data = df){
     # check that the input is a data frame
     if(!(is.data.frame(data))){
       data <- as.data.frame(data)
     }
     # loop through the columns of the data frame
     for(i in seq_along(data)){
       if(class(data[,i]) %in% c("numeric","integer")){
         # missing continuous data are replaced by the column median
         data[is.na(data[,i]),i] <- median(data[,i], na.rm = TRUE)
       } # end inner-if
     } # end for
     return(data)
   } # end function
# Remove the missing values
R> df.complete<- ImputeMissing(df)
# check missing values
R> colSums(is.na(df.complete))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 0 0 0
Now we can apply the method FAMD() from the FactoMineR package to the cleaned dataset. You can type ??FactoMineR::FAMD in the R console to look at the vignette for this method. From the vignette: FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This balances the influence of the continuous and categorical variables in the analysis, meaning that both types of variables are on an equal footing in determining the dimensions of variability.
R> df.princomp <- FactoMineR::FAMD(df.complete, graph = FALSE)
Thereafter we can visualize the PCs using a scree plot, shown in fig1:
R> factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
A Scree Plot (as shown in fig1) is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 44.5% of total variance. The question now naturally arises, "What are these variables?".
To extract the contribution of the variables, I've used fviz_contrib(), shown in fig2:
R> factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
Fig2 above visualizes the contribution of rows/columns to the results of the principal component analysis. From here I can see that the variables studLoc, name, subj2mark and finalmark are the most important and can be used for further analysis.
Now, you can proceed with cluster analysis.
# extract the important variables and store in a new dataframe
R> df.princomp.impvars<- df.complete[,c(2:3,6,4)]
# make the distance matrix
R> gower_dist <- cluster::daisy(df.princomp.impvars,
metric = "gower",
type = list(logratio = 3))
R> gower_mat <- as.matrix(gower_dist)
#make a hierarchical cluster model
R> model<-hclust(gower_dist)
#plotting the hierarchy
R> plot(model)
#cutting the tree at your decided level
R> clustmember<-cutree(model,3)
#adding the cluster member as a column to your data
R> df.clusters<-data.frame(df.princomp.impvars,cluster=clustmember)
R> df.clusters
name studLoc subj2mark finalmark cluster
1 b POTYQ0002N 90 53 1
2 i LWMTW1195I 40 73 1
3 d VTUGO1685F 8 95 2
4 f YCGGS5755N 14 70 1
5 c GOSWE3283C 35 97 2
6 g APBQD6181U 82 58 1
7 a VUJOG1460V 79 67 1
8 h YXOGP1897F 69 64 1
9 j NFUOB6042V 91 70 1
10 e QYTHG0783G 2 81 3
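The per-cluster summaries asked about (a mean and standard deviation for every variable in every cluster group) can then be computed by grouping on the cluster column; in R itself, aggregate() or dplyr's group_by()/summarise() do the same. As a minimal sketch in Python/pandas, assuming df.clusters has been exported from R, e.g. with write.csv(df.clusters, "clusters.csv", row.names = FALSE) (hypothetical file name):
import pandas as pd

clusters = pd.read_csv("clusters.csv")     # hypothetical export of df.clusters

# Mean and standard deviation of the continuous variables per cluster
print(clusters.groupby("cluster")[["subj2mark", "finalmark"]].agg(["mean", "std"]))

# For categorical variables, per-cluster frequency tables are more meaningful
print(pd.crosstab(clusters["cluster"], clusters["studLoc"]))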

Extract the same part of slices of a 3D matrix by using linear index

Indeed, my problem is a follow-up to my previous problem:
1) Extract submatrices, 2) vectorize and then 3) put back
Thanks to Dan, whose idea works perfectly for that purpose.
My new problem is this:
If I have a 3D matrix, 8 by 8 by 12, e.g. A = randn(8,8,12).
Let's see the linear index of the first slice:
From Dan's solution, I understand that A(4:6, 4:6, :) can extract the corresponding part of every slice.
However, going back to my real situation, extracting parts by explicitly counting rows and columns does not suit my purpose, because my matrix is huge and I have many sub-matrices to extract.
So I prefer to work with linear indices, and I want to ask whether there is a way to do this.
Here is my trial:
By defining sub_group = [28 29 30 36 37 38 44 45 46], A(sub_group) extracts the sub-matrix from the first slice of the 3D matrix A.
I understand that A(sub_group + 8*8*(n-1)) can extract the sub-matrix from the nth slice.
I aim to only work with my sub_group and then extract the same part of every slice.
Most importantly, I have to put back the sub-matrices after updating their values.
So, is there any quick MATLAB syntax that works for my purpose?
I appreciate your help.
Approach #1
For cases like this when you need to calculate linear indices, you can use bsxfun as shown here -
%// Store number of rows in A as a variable
M = size(A,1)
%// Get start and offset linear indices for the first slice and thus sub_group
start_idx = (colstart-1)*M + rowstart
offset_idx = bsxfun(@plus,[0:rowstop - rowstart]', [0:colstop-colstart]*M) %//'
sub_group = reshape(start_idx + offset_idx,1,[])
%// Calculate sub_groups for all 3D slices
all_sub_groups = bsxfun(@plus,sub_group',[0:size(A,3)-1]*numel(A(:,:,1)))
Sample run -
A(:,:,1) =
0.096594 0.52368 0.76285 0.83984 0.27019
0.84588 0.65035 0.57569 0.42683 0.4008
0.9094 0.38515 0.63192 0.63162 0.55425
0.011341 0.6493 0.2782 0.83347 0.44387
A(:,:,2) =
0.090384 0.037262 0.38325 0.89456 0.89451
0.74438 0.9758 0.88445 0.39852 0.21417
0.032615 0.52234 0.25502 0.62502 0.0038592
0.42974 0.90963 0.90905 0.5676 0.88058
rowstart =
2
rowstop =
4
colstart =
3
colstop =
5
sub_group =
10 11 12 14 15 16 18 19 20
all_sub_groups =
10 30
11 31
12 32
14 34
15 35
16 36
18 38
19 39
20 40
Approach #2
For a quick syntax based solution, sub2ind could be suggested here. The implementation would look something like this -
[X,Y] = ndgrid(rowstart:rowstop,colstart:colstop);
sub_group = sub2ind(size(A(:,:,1)),X,Y);
[X,Y,Z] = ndgrid(rowstart:rowstop,colstart:colstop,1:size(A,3));
all_sub_groups = sub2ind(size(A),X,Y,Z);
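For readers more comfortable in Python, the same column-major linear-index arithmetic can be cross-checked with NumPy; this is only an illustration of the indexing formula from Approach #1, not a port of the MATLAB code:
import numpy as np

M = 8                                    # number of rows of A
rows = np.arange(4, 7).reshape(-1, 1)    # MATLAB rows 4:6
cols = np.arange(4, 7).reshape(1, -1)    # MATLAB cols 4:6

# MATLAB's 1-based, column-major linear index: (col-1)*M + row
sub_group = ((cols - 1) * M + rows).ravel(order='F')
print(sub_group)                 # [28 29 30 36 37 38 44 45 46]
print(sub_group + 8 * 8 * 1)     # the same block in the 2nd slice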

How can I get an SVM that has been trained on a bigger matrix to classify a different-size matrix?

I am training a one-vs-all SVM classifier. I used a 200 by 459 matrix to train the classifier with the VLFeat SVM trainer (http://www.vlfeat.org/matlab/vl_svmtrain.html).
[W B] = vl_svmtrain(train_image_feats', tmp', .00001);
where train_image_feats' is a 200 by 459 matrix and tmp' is the label vector, which is 1 by 459.
The above command trains the SVM with no problem, but I then get an error when computing the scores on the test matrix. The test matrix is obviously not the same size as the training matrix.
scores(i, :) = W'*test_image_feats' + B;
where test_image_feats' is a 200 by 90 matrix and scores is a 9 by 459 matrix: 9 because there are 9 categories (labels) to classify, and 459 because that is the number of training images.
The above command gives the error:
Subscripted assignment dimension mismatch.
Error in svm_classify (line 56): scores(i, :) = W'*test_image_feats' + B;
Edit: full code added.
categories = unique(train_labels);
num_categories = length(categories);
scores = zeros([num_categories size(train_labels, 1)]); % train_labels is 459 by 1
for i = 1:num_categories % there are 9 categories
    tmp = strcmp(train_labels, categories{i});
    tmp = tmp - (1 - tmp);
    [W, B] = vl_svmtrain(train_image_feats', tmp', .00001);
    scores(i, :) = W'*test_image_feats' + B;
end
predicted_categories = cell(size(train_labels));
parfor i = 1:size(test_image_feats, 1)
    image_scores = scores(:, i);
    label_index = find(image_scores == max(image_scores));
    predicted_categories{i} = categories(label_index);
end
Conceptually you are training a model with 459 training samples to predict the scores of 90 test samples.
scores = zeros([num_categories size(train_labels, 1)]);
isn't right, as it sizes scores by the training set. In fact, you don't have to care at all about the size of the training set; you could train the model with 20 or 20,000 images and the prediction step wouldn't be any different.
scores has to be defined with the test set in mind:
scores = zeros([num_categories size(test_labels, 1)]);
Using 459 for both would only work if size(test_labels, 1) happened to equal size(train_labels, 1).
The problem is not with the right-hand side of the assignment, but with scores(i, :): you are trying to assign a 1-by-90 result into a row of scores that has 459 columns; it simply won't fit.
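To illustrate the shape argument with hypothetical arrays (a NumPy sketch, not the VLFeat API): the score array must have one entry per test sample per category, so its size is fixed by the test set, not the training set.
import numpy as np

n_feat, n_train, n_test, n_cat = 200, 459, 90, 9       # sizes from the question
rng = np.random.default_rng(0)
W = rng.standard_normal((n_feat, n_cat))                # one weight vector per category
b = rng.standard_normal(n_cat)
test_feats = rng.standard_normal((n_test, n_feat))      # hypothetical test features

scores = test_feats @ W + b         # shape (90, 9): one row per *test* image
pred = scores.argmax(axis=1)        # predicted category index for each test image
print(scores.shape, pred.shape)     # (90, 9) (90,)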