The result of FCM clustering for my test set in MATLAB [duplicate] - matlab

After finishing the cluster analysis, when I input some new data, how do I know which cluster the data belongs to?
data(freeny)
library(RSNNS)
options(digits=2)
year<-as.integer(rownames(freeny))
freeny<-cbind(freeny,year)
freeny = freeny[sample(1:nrow(freeny),length(1:nrow(freeny))),1:ncol(freeny)]
freenyValues= freeny[,1:5]
freenyTargets=decodeClassLabels(freeny[,6])
freeny = splitForTrainingAndTest(freenyValues,freenyTargets,ratio=0.15)
km<-kmeans(freeny$inputsTrain,10,iter.max = 100)
kclust=km$cluster

kmeans returns an object containing the coordinates of the cluster centers in $centers. You want to find the cluster to which the new object is closest (in terms of the sum of squares of distances):
v <- freeny$inputsTrain[1,] # just an example
which.min( sapply( 1:10, function( x ) sum( ( v - km$centers[x,])^2 ) ) )
The above returns 8 - same as the cluster to which the first row of freeny$inputsTrain was assigned.
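As an aside, this nearest-centre assignment is exactly what scikit-learn's KMeans.predict does; here is a minimal Python sketch for comparison (my addition, on random stand-in data, not part of the original R workflow):
import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.rand(100, 5)            # stand-in for freeny$inputsTrain
km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(X_train)

v = X_train[0]                              # "new" observation, just an example
# manual nearest-centre assignment, same idea as the R sapply above
manual = np.argmin(((km.cluster_centers_ - v) ** 2).sum(axis=1))
# built-in equivalent
built_in = km.predict(v.reshape(1, -1))[0]
assert manual == built_in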
In an alternative approach, you can first create a clustering and then use supervised machine learning to train a model that you then use for prediction. However, the quality of the model will depend on how well the clustering really represents the data structure and how much data you have. I have inspected your data with PCA (my favorite tool):
pca <- prcomp( freeny$inputsTrain, scale.= TRUE )
library( pca3d )
pca3d( pca )
My impression is that you have at most 6-7 clear classes to work with.
However, one should run more k-means diagnostics (elbow plots etc.) to determine the optimal number of clusters:
wss <- sapply( 1:10, function( x ) { km <- kmeans(freeny$inputsTrain,x,iter.max = 100 ) ; km$tot.withinss } )
plot( 1:10, wss )
This plot suggests 3-4 classes as the optimum. For a more complex and informative approach, consult clustergrams: http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
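As for the cluster-then-classify idea mentioned at the start of this answer, here is a minimal Python/scikit-learn sketch of that workflow (my illustration on synthetic data, not the original R code; the choice of RandomForestClassifier is arbitrary):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(200, 5)                     # training inputs
X_new = np.random.rand(10, 5)                        # new, unseen observations

# 1. unsupervised step: cluster the training data
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

# 2. supervised step: learn a model that predicts the cluster labels
clf = RandomForestClassifier(random_state=0).fit(X_train, km.labels_)

# 3. assign new data to clusters with the trained classifier
print(clf.predict(X_new))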

Related

Hierarchical Agglomerative clustering for Spark

I am working on a project using Spark and Scala, and I am looking for a hierarchical clustering algorithm similar to scipy.cluster.hierarchy.fcluster or sklearn.cluster.AgglomerativeClustering that is usable for large amounts of data.
Spark's MLlib implements bisecting k-means, which needs the number of clusters as input. Unfortunately, in my case I don't know the number of clusters, and I would prefer to use some distance threshold as an input parameter, as is possible in the two Python implementations above.
If anyone would know the answer, I would be very grateful.
I had the same problem and, after looking high and low, found no answer, so I will post what I did here in the hope that it helps someone else and that maybe someone will build on it.
The basic idea of what I did was to use bisecting k-means recursively to keep splitting clusters in half until all points in a cluster were within a specified distance of the centroid. I was using GPS data, so I have a little bit of extra machinery to deal with that.
The first step is to create a model that will cut the data in half. I used bisecting k-means, but I think this would work with any of the PySpark clustering methods as long as you can get the distance to the centroid.
import pyspark.sql.functions as f
from pyspark import SparkContext, SQLContext
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import FloatType       # used by the distance UDF below
from pyspark.sql.window import Window         # used when re-indexing the final clustering
bkm = BisectingKMeans().setK(2).setSeed(1)
assembler = VectorAssembler(inputCols=['lat','long'], outputCol="features")
adf = assembler.transform(locAggDf)  # locAggDf contains my location info
model = bkm.fit(adf)
# predictions will have the original data plus the assembled "features" column and a "prediction" column with the assigned cluster number
predictions = model.transform(adf)
predictions.persist()
The next step is our recursive function. The idea here is that we specify some distance from the centroid, and if any point in a cluster is farther than that distance, we cut the cluster in half. When a cluster is tight enough to meet the condition, I add it to a result array that I use to build the final clustering.
def bisectToDist(model, predictions, bkm, precision, result=[]):
    centers = model.clusterCenters()
    # row[0] is predictedClusterNum, row[1] is unit, row[2] is point lat, row[3] is point long
    # centers[row[0]] is the lat/long of the center: centers[row[0]][0] = lat, centers[row[0]][1] = long
    distUdf = f.udf(
        lambda row: getDistWrapper((centers[row[0]][0], centers[row[0]][1], row[1]), (row[2], row[3], row[1])),
        FloatType())  # getDistWrapper is how I calculate the lat/long distance, but you can define any distance metric
    predictions = predictions.withColumn('dist', distUdf(
        f.struct(predictions.prediction, predictions.encodedPrecisionUnit, predictions.lat, predictions.long)))

    # create a df of all rows that were in clusters that had a point outside of the threshold
    toBig = predictions.join(
        predictions.groupby('prediction').agg({"dist": "max"}).filter(f.col('max(dist)') > precision).select(
            'prediction'), ['prediction'], 'leftsemi')

    # this could probably be improved
    # get all cluster numbers that were too big
    listids = toBig.select("prediction").distinct().rdd.flatMap(lambda x: x).collect()

    # if all data points are within the specified distance of the centroid we can return the clustering
    if len(listids) == 0:
        return predictions

    # assuming binary classes for now, k must be 2
    # if one of the two clusters was small enough there will be no further recursion call for that cluster,
    # so we must save it and return it at this depth; the cluster that was too big is cut in half in the loop below
    if len(listids) == 1:
        ok = predictions.join(
            predictions.groupby('prediction').agg({"dist": "max"}).filter(
                f.col('max(dist)') <= precision).select(
                'prediction'), ['prediction'], 'leftsemi')

    for clusterId in listids:
        # get all of the pieces that were too big
        part = toBig.filter(toBig.prediction == clusterId)
        # we now need to refit this subset of the data
        assembler = VectorAssembler(inputCols=['lat', 'long'], outputCol="features")
        adf = assembler.transform(part.drop('prediction').drop('features').drop('dist'))
        model = bkm.fit(adf)
        # predictions now holds the new subclustering and we are ready for recursion
        predictions = model.transform(adf)
        result.append(bisectToDist(model, predictions, bkm, precision, result=result))

    # return anything that was given and is already good
    if len(listids) == 1:
        return ok
Finally, we can call the function and build the resulting dataframe:
result = []
bisectToDist(model, predictions, bkm, precision, result=result)  # precision is your distance threshold

# drop any Nones (these can happen in recursive, non-top-level calls)
result = [r for r in result if r]
r = result[0]
r = r.withColumn('subIdx', f.lit(0))
result = result[1:]
idx = 1
for r1 in result:
    r1 = r1.withColumn('subIdx', f.lit(idx))
    r = r.unionByName(r1)
    idx = idx + 1

# each subclustering only uses labels 0 and 1, so to renumber the clusters 0..n I added the following
r = r.withColumn('delta', r.subIdx * 100 + r.prediction)
r = r.withColumn('delta', r.delta - f.lag(r.delta, 1).over(Window.orderBy("delta"))).fillna(0)
r = r.withColumn('ddelta', f.when(r.delta != 0, 1).otherwise(0))
r = r.withColumn('spacialLocNum', f.sum('ddelta').over(Window.orderBy(['subIdx', 'prediction'])))
# spacialLocNum should be the final clustering
Admittedly this is quite convoluted and slow, but it does get the job done. Hope this helps!

How to perform fuzzy clustering method on Qualitative Bankruptcy dataset

We are required to build a fuzzy system in MATLAB on the Qualitative_Bankruptcy data set, and we were advised to implement a fuzzy clustering method on it.
There are 7 attributes (6+1) in the dataset (250 instances), and each independent attribute has 3 possible values: Positive, Average, and Negative. Please refer to the dataset for more.
From our understanding, clustering is about grouping instances that exhibit similar properties by calculating the distances between the parameters. So the data could look like the sample below (this is just dummy data, not relevant to my project).
The question is: how is it possible to implement a cluster analysis on a dataset like this?
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,N,A,N,A,B
N,N,N,P,N,N,B
N,N,N,N,N,P,B
N,N,N,N,N,A,B
Since you asked about fuzzy clustering, you are contradicting yourself.
In fuzzy clustering, every object belongs to every cluster, just to a varying degree (the cluster assignment is "fuzzy").
It's mostly used with numerical data, where you can assume the measurements are not precise either but come with a fuzzy error, too. So I don't think it makes as much sense on categorical data.
Categorical data tends to cluster really badly beyond counting duplicates; its resolution is simply too coarse. People do all kinds of crazy hacks like running k-means on dummy variables, and never seem to question what they actually compute/optimize by doing this, nor do they test their results...
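To illustrate what "belongs to every cluster to a varying degree" means, here is a minimal NumPy sketch of the standard fuzzy c-means membership update, using made-up 2-D numeric points and fixed centers (numeric on purpose; with your categorical attributes this is exactly where things break down):
import numpy as np

points = np.array([[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]])    # toy numeric data
centers = np.array([[0.0, 0.0], [1.0, 1.0]])               # two fixed cluster centers
m = 2.0                                                     # fuzzifier

# distances from every point to every center, shape (n_points, n_centers)
d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
d = np.fmax(d, np.finfo(float).eps)                         # avoid division by zero

# fuzzy membership degrees: each row sums to 1, so every point
# belongs to every cluster to some degree
u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
print(u)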
Well, let's start from reading your data:
clear();
clc();
close all;
opts = detectImportOptions('Qualitative_Bankruptcy.data.txt');
opts.DataLine = 1;
opts.MissingRule = 'omitrow';
opts.VariableNamesLine = 0;
opts.VariableNames = {'IR' 'MR' 'FF' 'CR' 'CO' 'OR' 'Class'};
opts.VariableTypes = repmat({'categorical'},1,7);
opts = setvaropts(opts,'Categories',{'P' 'A' 'N'});
opts = setvaropts(opts,'Class','Categories',{'B' 'NB'});
data = readtable('Qualitative_Bankruptcy.data.txt',opts);
data = rmmissing(data);
data_len = height(data);
Now, since the kmeans function accepts only numeric values, we need to convert the table of categorical values into a matrix:
x = double(table2array(data));
And finally, we apply the function:
[idx,c] = kmeans(x,number_of_clusters);
Now comes the problem. The k-means clustering can be performed using a wide variety of distance measures together with a wide variety of options. You have to play with those parameters in order to obtain the clustering that better approximates your available output.
Since k-means clustering organizes your data into n clusters, this means that your output defines more than 3 clusters because 46 + 71 + 61 = 178... and since your data contains 250 observations, 72 of them are assigned to one or more clusters that are unknown to me (and maybe to you too).
If you want to replicate that output, or to find the clustering that best approximates your output... you have to find, if available, an algorithm that minimizes the error... or alternatively you can try to brute-force it, for example:
% ...
x = double(table2array(data));
cl1_targ = 46;
cl2_targ = 71;
cl3_targ = 61;
dist = {'sqeuclidean' 'cityblock' 'cosine' 'correlation'};
res = cell(16,3);
res_off = 1;
for i = 1:numel(dist)
    dist_curr = dist{i};
    for j = 3:6
        idx = kmeans(x,j,'Distance',dist_curr); % start parameter needed
        cl1 = sum(idx == 1);
        cl2 = sum(idx == 2);
        cl3 = sum(idx == 3);
        err = abs(cl1 - cl1_targ) + abs(cl2 - cl2_targ) + abs(cl3 - cl3_targ);
        res(res_off,:) = {dist_curr j err};
        res_off = res_off + 1;
    end
end
[min_val,min_idx] = min([res{:,3}]);
best = res(min_idx,1:2);
Keep in mind that the kmeans function uses a randomly chosen starting configuration, so it will end up delivering different solutions for different starting points. Define fixed starting points (means) using the Start parameter; otherwise a different result will be produced every time you run the kmeans function.

Function fitting neural network in for loop in MATLAB

I'm using MATLAB R2014a version.
I have ten clusters of X and y data.
I want to fit a model to each of these 10 data sets using the neural network fitting tool in MATLAB, and I want to save the 10 different models somewhere.
For each cluster, I need to design an implementation to determine the correct number of hidden layers, save the model into an array or something like that, and then continue with the next cluster.
For this aim, I have developed this algorithm:
for q = 1:z % number of clusters
    mdl = fitnet( 10 );
    mdl = train( mdl, X( classes == q ), y( classes == q ) );
    view( mdl );
    yy = net( X( classes == q ) );
    perf = perform( net, yy, y( classes == q ) );
    model( q ).mdl = mdl;
    clear mdl;
end
When I run this code, I get this error:
Error using view (line 67)
Invalid input arguments
Error in Main (line 97)
view(mdl);
How can I fix the problem?
Thanks,
Contrary to what was mentioned in the comments, view() is the right function to choose here, because it has been overloaded to also show a sketch of a neural network (see here: http://www.mathworks.com/help/nnet/ref/view.html).
So the problem obviously is not view() itself but your mdl network, which means you should:
go there with a debugger and check if it really is a neural network and if it contains values
check those values because X and y might not be the vectors you want (which you should also check)
...and/or post more information about what's going on in your code.

How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) best work for the kind of data I'm using (news articles).
I have some testing data which consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV but don't understand how (or whether) it can be applied in this case, since it needs the test data to be split, but I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data.
I have been trying to specify a scoring function which compares the estimator's labels to the true labels, but of course it doesn't work because only a sample of the data has been clustered, not all of it.
What's an appropriate approach here?
The following function for DBSCAN might help. I've written it to iterate over the hyperparameters eps and min_samples and included optional arguments for min and max clusters. As DBSCAN is unsupervised, I have not included an evaluation parameter.
# Imports needed by the function below
from collections import Counter
from sklearn.cluster import DBSCAN


def dbscan_grid_search(X_data, lst, clst_count, eps_space=0.5,
                       min_samples_space=5, min_clust=0, max_clust=10):
    """
    Performs a hyperparameter grid search for DBSCAN.

    Parameters:
        * X_data            = data used to fit the DBSCAN instance
        * lst               = a list to store the results of the grid search
        * clst_count        = a list to store the number of non-noise clusters
        * eps_space         = the range of values for the eps parameter
        * min_samples_space = the range of values for the min_samples parameter
        * min_clust         = the minimum number of clusters required after each search
                              iteration in order for a result to be appended to lst
        * max_clust         = the maximum number of clusters allowed after each search
                              iteration in order for a result to be appended to lst

    Example:
        # Loading libraries
        from sklearn import datasets
        from sklearn.preprocessing import StandardScaler
        import numpy as np

        # Loading the iris dataset
        iris = datasets.load_iris()
        X = iris.data[:, :]
        y = iris.target

        # Scaling X data
        dbscan_scaler = StandardScaler()
        dbscan_scaler.fit(X)
        dbscan_X_scaled = dbscan_scaler.transform(X)

        # Setting empty lists in the global environment
        dbscan_clusters = []
        cluster_count = []

        # Inputting function parameters
        dbscan_grid_search(X_data=dbscan_X_scaled,
                           lst=dbscan_clusters,
                           clst_count=cluster_count,
                           eps_space=np.arange(0.1, 5, 0.1),
                           min_samples_space=np.arange(1, 50, 1),
                           min_clust=3,
                           max_clust=6)
    """
    # Starting a tally of total iterations
    n_iterations = 0

    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:
            dbscan_grid = DBSCAN(eps=eps_val, min_samples=samples_val)

            # Fit and predict cluster labels
            clusters = dbscan_grid.fit_predict(X=X_data)

            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)

            # Saving the number of non-noise clusters (noise points are labelled -1)
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)

            # Increasing the iteration tally with each run of the loop
            n_iterations += 1

            # Appending to the lists each time the n_clusters criteria are met
            if min_clust <= n_clusters <= max_clust:
                lst.append([eps_val, samples_val, n_clusters])
                clst_count.append(cluster_count)

    # Printing grid search summary information
    print(f"Search complete. \nYour list is now of length {len(lst)}.")
    print(f"Hyperparameter combinations checked: {n_iterations}. \n")
Have you considered implementing the search yourself?
It's not particularly hard to implement a for loop. Even if you want to optimize two parameters it's still fairly easy.
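For example, here is a minimal sketch of such a manual search over eps for DBSCAN, assuming the full dataset is in X, your pre-labeled clusters are in y_true, and using the adjusted Rand index as the score (the scoring choice is my assumption, not something prescribed in this answer):
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

def manual_eps_search(X, y_true, eps_values, min_samples=5):
    """Fit DBSCAN on the whole dataset for each eps and score against the reference labels."""
    results = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        results.append((eps, adjusted_rand_score(y_true, labels)))
    # return the eps that agrees best with the pre-labeled clusters
    return max(results, key=lambda r: r[1])

# usage, e.g.: best_eps, best_score = manual_eps_search(X, y_true, np.arange(0.1, 2.0, 0.1))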
For both DBSCAN and MeanShift I do however advise to first understand your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure instead of parameter optimization to match some labels (which has a high risk of overfitting).
In other words, at which distance are two articles supposed to be clustered?
If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF is standard on text, but mostly in a retrieval context; it may work much worse in a clustering context.
Also beware that MeanShift (similar to k-means) needs to recompute coordinates. On text data, this may yield undesired results, where the updated coordinates actually get worse instead of better.

MATLAB: Naive Bayes with Univariate Gaussian

I am trying to implement a Naive Bayes classifier using a dataset published by the UCI machine learning team. I am new to machine learning and trying to understand techniques to use for my work-related problems, so I thought it's better to get the theory understood first.
I am using the Pima dataset (Link to Data - UCI-ML), and my goal is to build a naive Bayes univariate Gaussian classifier for a K-class problem (the data is only there for K=2). I have split the data and calculated the mean for each class, the standard deviation, and the priors for each class, but after this I am stuck because I am not sure what I should do next and how. I have a feeling that I should be calculating the posterior probability.
Here is my code. I am using percent as a vector because I want to see the behavior as I increase the training data size within the 80:20 split. Basically, if you pass [10 20 30 40], it takes those percentages of the 80% training portion, e.g. it uses 10% of the 80% as training data.
function [classMean] = naivebayes(file, iter, percent)
dm = load(file);
for i = 1:iter
    idx = randperm(size(dm.data,1))
    % Using same idx for data and labels
    shuffledMatrix_data = dm.data(idx,:);
    shuffledMatrix_label = dm.labels(idx,:);
    percent_data_80 = round((0.8) * length(shuffledMatrix_data));
    % Doing 80-20 split
    train = shuffledMatrix_data(1:percent_data_80,:);
    test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
    train_labels = shuffledMatrix_label(1:percent_data_80,:)
    test_labels = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
    % Getting the array of percents
    for pRows = 1:length(percent)
        percentOfRows = round((percent(pRows)/100) * length(train));
        new_train = train(1:percentOfRows,:)
        new_trin_label = shuffledMatrix_label(1:percentOfRows)
        % get unique labels in training
        numClasses = size(unique(new_trin_label),1)
        classMean = zeros(numClasses,size(new_train,2));
        for kclass = 1:numClasses
            classMean(kclass,:) = mean(new_train(new_trin_label == kclass,:))
            std(new_train(new_trin_label == kclass,:))
            priorClassforK = length(new_train(new_trin_label == kclass))/length(new_train)
            priorClassforK_1 = 1 - priorClassforK
        end
    end
end
end
First, compute the probability of every class label based on frequency counts. Then, for a given sample and a given class in your data set, compute the probability of every feature. After that, multiply the conditional probabilities of all features in the sample with each other and with the probability of the considered class label. Finally, compare the values across all class labels and choose the label of the class with the maximum probability (the Bayes classification rule).
For computing the conditional probability of each feature, you can simply use the normal distribution density with the per-class mean and standard deviation you already computed.
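As a concrete illustration of that recipe, here is a small Python/NumPy sketch (not MATLAB, and the variable names are my own) that scores one test sample given per-class means, standard deviations, and priors:
import numpy as np

def gaussian_naive_bayes_predict(x, class_means, class_stds, priors):
    """Return the most probable class index for one sample x.

    x           : 1-D array with the feature values of a single test sample
    class_means : (K, D) array of per-class feature means
    class_stds  : (K, D) array of per-class feature standard deviations
    priors      : length-K array of class prior probabilities
    """
    # log of the univariate normal density for every class and feature
    log_likelihoods = (-0.5 * np.log(2 * np.pi * class_stds ** 2)
                       - (x - class_means) ** 2 / (2 * class_stds ** 2))
    # naive independence assumption: sum the per-feature log-likelihoods and
    # add the log prior (equivalent to multiplying the probabilities)
    log_posterior = np.log(priors) + log_likelihoods.sum(axis=1)
    return np.argmax(log_posterior)   # Bayes classification rule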