K-means Clustering in IDL

I am an IDL beginner and I was wondering if I could get some help with clustering in IDL. I found a good example on Harris Geospatial that explains the method; however, I am confused about how to run the clustering on my own data (ASCII) to perform the K-means analysis. How can I use my data instead of the RANDOMN function that generates random numbers?
Below is the code I found on Harris:
; Generate three clusters of n random 3-D points each,
; offset along different axes so they are separable:
n = 50
c1 = RANDOMN(seed, 3, n)
c1[0:1,*] -= 3
c2 = RANDOMN(seed, 3, n)
c2[0,*] += 3
c2[1,*] -= 3
c3 = RANDOMN(seed, 3, n)
c3[1:2,*] += 3
; Combine the clusters into one 3 x 150 array of samples:
array = [[c1], [c2], [c3]]
; Compute cluster weights, using three clusters:
weights = CLUST_WTS(array, N_CLUSTERS = 3)
; Compute the classification of each sample:
result = CLUSTER(array, weights, N_CLUSTERS = 3)
Thank you.

You'll need to get your data into IDL first. If it's a comma-separated (or other delimited) file, you can just use READ_CSV. Or you could try READ_ASCII, but then you need to know the specific format. Either way, you just need to use one of the read routines.
https://www.harrisgeospatial.com/docs/READ_CSV.html
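As a minimal sketch (assuming a file named 'mydata.csv' with three numeric columns; the filename and the FIELD1..FIELD3 tags are placeholders you would adapt to your own data):
; READ_CSV returns a structure with one tag per column (FIELD1, FIELD2, ...):
data = READ_CSV('mydata.csv')
; Combine the columns to cluster into one 3 x n_samples array,
; the same layout as the Harris example above:
array = TRANSPOSE([[data.FIELD1], [data.FIELD2], [data.FIELD3]])
weights = CLUST_WTS(array, N_CLUSTERS = 3)
result = CLUSTER(array, weights, N_CLUSTERS = 3)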

Related

How to perform a fuzzy clustering method on the Qualitative Bankruptcy dataset

We are required to build a fuzzy system with MATLAB on the Qualitative_Bankruptcy Data Set, and we were advised to implement a fuzzy clustering method on it.
There are 7 attributes (6+1) in the dataset (250 instances), and each independent attribute has 3 possible values: Positive, Average, and Negative. Please refer to the dataset for more.
From our understanding, clustering is about grouping instances that exhibit similar properties by calculating the distances between their attributes, so the data could be grouped like the dummy sample below (not the real project data).
The question is: how can a cluster analysis be implemented on a dataset like this?
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,N,A,N,A,B
N,N,N,P,N,N,B
N,N,N,N,N,P,B
N,N,N,N,N,A,B
Since you asked about fuzzy clustering, you are contradicting yourself.
In fuzzy clustering, every object belongs to every cluster, just to a varying degree (the cluster assignment is "fuzzy").
It's mostly used with numerical data, where you can assume the measurements are not precise either, but come with a fuzzy error of their own. So I don't think it makes much sense on categorical data.
Categorical data tends to cluster really badly beyond counting duplicates; its resolution is simply too coarse. People do all kinds of crazy hacks, like running k-means on dummy variables, and never seem to question what they actually compute/optimize by doing this, nor do they test their results...
Well, let's start by reading your data:
% Reset the workspace:
clear();
clc();
close all;
% Describe the file layout: data starts on line 1, no header line,
% 7 categorical columns, rows with missing values dropped:
opts = detectImportOptions('Qualitative_Bankruptcy.data.txt');
opts.DataLine = 1;
opts.MissingRule = 'omitrow';
opts.VariableNamesLine = 0;
opts.VariableNames = {'IR' 'MR' 'FF' 'CR' 'CO' 'OR' 'Class'};
opts.VariableTypes = repmat({'categorical'},1,7);
opts = setvaropts(opts,'Categories',{'P' 'A' 'N'});
opts = setvaropts(opts,'Class','Categories',{'B' 'NB'});
data = readtable('Qualitative_Bankruptcy.data.txt',opts);
data = rmmissing(data);
data_len = height(data);
Now, since the kmeans function accepts only numeric values, we need to convert the table of categorical values into a numeric matrix:
x = double(table2array(data));
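double() on a categorical array returns its 1-based category codes in the declared order, so with Categories {'P' 'A' 'N'} you get P -> 1, A -> 2, N -> 3. A quick sketch:
% double() yields the category codes in the declared order {'P','A','N'}:
c = categorical({'P';'N';'A'}, {'P','A','N'});
double(c) % returns [1; 3; 2]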
And finally, we apply the function:
number_of_clusters = 3; % pick the number of clusters you expect
[idx,c] = kmeans(x,number_of_clusters);
Now comes the problem. K-means clustering can be performed with a wide variety of distance measures and a wide variety of options, and you have to play with those parameters in order to obtain the clustering that best approximates your available output.
Since k-means clustering organizes your data into n clusters, your output must define more than 3 clusters, because 46 + 71 + 61 = 178... and since your data contains 250 observations, 72 of them are assigned to one or more clusters that are unknown to me (and maybe to you too).
If you want to replicate that output, or to find the clustering that best approximates it, you have to find, if available, an algorithm that minimizes the error... or alternatively you can try to brute-force it, for example:
% ...
x = double(table2array(data));
% Target cluster sizes taken from the desired output:
cl1_targ = 46;
cl2_targ = 71;
cl3_targ = 61;
% Distance metrics supported by kmeans:
dist = {'sqeuclidean' 'cityblock' 'cosine' 'correlation'};
res = cell(16,3); % 4 metrics x 4 cluster counts
res_off = 1;
for i = 1:numel(dist)
    dist_curr = dist{i};
    for j = 3:6 % try 3 to 6 clusters
        idx = kmeans(x,j,'Distance',dist_curr); % fixed 'Start' parameter needed for repeatability
        % Compare the first three cluster sizes against the targets:
        cl1 = sum(idx == 1);
        cl2 = sum(idx == 2);
        cl3 = sum(idx == 3);
        err = abs(cl1 - cl1_targ) + abs(cl2 - cl2_targ) + abs(cl3 - cl3_targ);
        res(res_off,:) = {dist_curr j err};
        res_off = res_off + 1;
    end
end
% Pick the metric / cluster-count pair with the smallest error:
[min_val,min_idx] = min([res{:,3}]);
best = res(min_idx,1:2);
Keep in mind that the kmeans function uses a randomly-chosen starting configuration, so it can deliver a different solution for different starting points. Define fixed starting means using the 'Start' parameter; otherwise a different result may be produced every time you run kmeans.
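A minimal sketch of fixing the start (choosing the first three observations as initial means is arbitrary; pick values that are sensible for your data):
% Fix the initial centroids so repeated runs are reproducible:
start_means = x(1:3,:); % arbitrary choice of starting means
idx = kmeans(x, 3, 'Distance', 'sqeuclidean', 'Start', start_means);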

Applying SVM on a dataset

I have written the following code to examine svmtrain.
a = 5*[randn(200, 1) + 5, randn(200, 1) + 5];
b = 5*[randn(200, 1) + 5, randn(200, 1) - 5];
all_data = [a;b];
plot(a(:,1) , a(:,2),'b.'); hold on
plot(b(:,1) , b(:,2),'r.');
group = ['r';'b'];
svmStruct = svmtrain(all_data, group,'ShowPlot',true);
I have created two normally distributed datasets (a and b) and then combined them into a single 2D array. Now I want to separate these two areas using svmtrain, but I don't know what I should do with the Group parameter. As the MATLAB help states, I can use a 2x1 matrix of characters to indicate the labels of these two areas. I did so, but I don't know why my code is not working.
You should provide a class label for each instance, one per row of all_data. So use:
group = [repmat('r',200,1); repmat('b',200,1)]; % first 200 rows are class 'r', next 200 are class 'b'
svmStruct = svmtrain(all_data, group, 'ShowPlot',true);
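Note that svmtrain has been removed from recent MATLAB releases; a rough equivalent using the newer fitcsvm (a sketch reusing the same all_data and group variables) would be:
% Sketch with the current fitcsvm API (svmtrain's replacement):
mdl = fitcsvm(all_data, group); % train a binary SVM classifier
sv = mdl.SupportVectors; % support vectors, e.g. for plotting
plot(sv(:,1), sv(:,2), 'ko'); % mark them on the existing scatter plot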

How to perform multi-label learning with LSTM using Theano?

I have some text data with multiple labels for each document. I want to train an LSTM network using Theano for this dataset. I came across http://deeplearning.net/tutorial/lstm.html but it only facilitates a binary classification task. If anyone has any suggestions on which method to proceed with, that would be great. I just need an initial feasible direction that I can work on.
Thanks,
Amit
1) Change the last layer of the model. I.e.
pred = tensor.nnet.softmax(tensor.dot(proj, tparams['U']) + tparams['b'])
should be replaced by some other layer, e.g. sigmoid:
pred = tensor.nnet.sigmoid(tensor.dot(proj, tparams['U']) + tparams['b'])
2) The cost should also be changed.
I.e.
cost = -tensor.log(pred[tensor.arange(n_samples), y] + off).mean()
should be replaced by some other cost, e.g. cross-entropy:
one = numpy.float32(1.0)
pred = tensor.clip(pred, 0.0001, 0.9999)  # keep predictions away from exact 0/1 so the log stays finite
cost = -tensor.sum(y * tensor.log(pred) + (one - y) * tensor.log(one - pred), axis=1)  # sum over all labels
cost = tensor.mean(cost, axis=0)  # mean over samples
3) In the function build_model(tparams, options), you should replace:
y = tensor.vector('y', dtype='int64')
by
y = tensor.matrix('y', dtype='int64') # Each row of y is one sample's label e.g. [1 0 0 1 0]. sklearn.preprocessing.MultiLabelBinarizer() may be handy.
4) Change pred_error() so that it supports multilabel (e.g. using some metrics like accuracy or F1 score from scikit-learn).
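For step 3, a minimal sketch of building such a 0/1 label matrix with scikit-learn (the label sets below are made up for illustration):
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets: sample 0 has labels 1 and 3, sample 1 has label 0, ...
labels = [[1, 3], [0], [0, 2, 3]]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels).astype('int64')  # shape (n_samples, n_labels)
# y is now [[0 1 0 1], [1 0 0 0], [1 0 1 1]]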
You can change the last layer of the model so that it outputs a vector of targets, where each element is 0 or 1 depending on whether the corresponding label is present or not.

Basic help using HMM to classify a sequence

I am very new to MATLAB, hidden Markov models and machine learning, and am trying to classify a given sequence of signals. Please let me know if the approach I have followed is correct:
1. create an N by N transition matrix and fill it with random values which sum to 1 for each row (N will be the number of states)
2. create an N by M emission/observation matrix and fill it with random values which sum to 1 for each row
3. convert different instances of the sequence (i.e. each instance will be saying the word 'hello') into one long stream and feed each stream to the hmmtrain function such that:
[new_transition_matrix, new_emission_matrix] = hmmtrain(sequence, old_transition_matrix, old_emission_matrix)
4. give the final transition and emission matrices to hmmdecode with an unknown sequence to get the probability,
i.e. [posterior_states, logarithmic_probability] = hmmdecode(sequence, final_transition_matrix, final_emission_matrix)
1. and 2. are correct. You have to be careful that your initial transition and emission matrices are not completely uniform; they should be slightly randomized for the training to work.
3. I would just feed in the 'Hello' sequences separately rather than concatenating them to form a single long sequence.
Let's say this is the sequence for Hello: [1,0,1,1,0,0]. If you form one long sequence from 3 'Hello' sequences, you would get:
data = [1,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1,0,0]
This is not ideal, instead you should feed the sequences in separately like:
data = [1,0,1,1,0,0; 1,0,1,1,0,0; 1,0,1,1,0,0].
Since you are using MATLAB, I would recommend using the HMM Toolbox by Kevin Murphy. It has a demo of how you can train an HMM with multiple observation sequences:
M = 3;
N = 2;
% "true" parameters
prior0 = normalise(rand(N,1));
transmat0 = mk_stochastic(rand(N,N));
obsmat0 = mk_stochastic(rand(N,M));
% training data: a 5x6 matrix, e.g. 5 different 'Hello' sequences of length 6
number_of_seq = 5;
seq_len = 6;
data = dhmm_sample(prior0, transmat0, obsmat0, number_of_seq, seq_len);
% initial guess of parameters
prior1 = normalise(rand(N,1));
transmat1 = mk_stochastic(rand(N,N));
obsmat1 = mk_stochastic(rand(N,M));
% improve the guess of the parameters using EM (Baum-Welch)
[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
LL % log-likelihood trace over the EM iterations
4. What you say is correct; below is how you calculate the log probability in the HMM toolbox:
% use model to compute log[P(Obs|model)]
loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)
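To actually classify an unknown sequence, train one model per class and pick the class whose model gives the highest log-likelihood. A sketch (the *_hello and *_other parameters stand for two models trained as above; the variable names are made up):
% Score an unknown sequence under two trained models, pick the likelier class:
loglik_hello = dhmm_logprob(unknown_seq, prior_hello, transmat_hello, obsmat_hello);
loglik_other = dhmm_logprob(unknown_seq, prior_other, transmat_other, obsmat_other);
if loglik_hello > loglik_other
    disp('classified as: hello');
else
    disp('classified as: other');
end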
Finally, have a look at Rabiner's classic tutorial paper on HMMs if any of the mathematics is unclear.
Hope this helps.

Extract features, SIFT detector

I'm a little confused about Andrea Vedaldi's implementation of the algorithm. I'm trying to extract features with the SIFT algorithm from the toolbox.
I'm using the command [frames,descriptors] = sift(image, 'Verbosity', 1); so I've got the frames, which is a 4xK matrix, and the descriptors, which is 128xK. I want to use a vector as a feature. Which of the two matrices should I use as a feature? Does anyone have an idea?
The descriptors are what you compare in order to determine matches.
% Convert to grayscale doubles in [0,1) (note: divide AFTER the double()
% conversion, otherwise the uint8 integer division destroys the image):
I1 = double(rgb2gray(imread('image1.png'))) / 256 ;
I2 = double(rgb2gray(imread('image2.png'))) / 256 ;
[frames1,descriptors1] = sift(I1, 'Verbosity', 1) ;
[frames2,descriptors2] = sift(I2, 'Verbosity', 1) ;
matches = siftmatch(descriptors1, descriptors2) ;
You now have a matrix of matched features between the two images.
To visualize the results, add the following line to the above:
plotsiftmatches(I1,I2,frames1,frames2,matches);
Vedaldi's accompanying report explains the implementation in detail.
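To answer the original question directly: each column of descriptors is one 128-dimensional feature vector, and the corresponding column of frames gives that keypoint's position, scale and orientation. For example:
% Each column of descriptors1 is the 128-D feature vector of one keypoint;
% the matching column of frames1 holds [x; y; scale; orientation]:
k = 1; % pick the first keypoint
feature_vec = descriptors1(:,k); % 128x1 vector to feed to a matcher/classifier
keypoint = frames1(:,k); % where that descriptor was computed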