How to cluster X to maximize explained variance in Y? - cluster-analysis

This is a problem I've encountered more than once, and I have pseudo-solution in mind but it's more or less a Monte Carlo method rather than anything clever.
What I'm trying to do is essentially convert a continuous variable into a categorical variable in such a way that each category has a significantly different mean in the response variable.
So let's say I am trying to model rates of depression against Age Groups. In my model, I want to have up to N Age Groups, and the bounds for each group can be arbitrarily sized (5-10, 11-27, 28-30, 31-64...etc). The question is, how to choose the bounds such that with N groups, the explained variance in depression rates can be maximized?
# Monte Carlo approach using iris dataset as an example
n_clust = 5
best_bounds = rep(0, n_clust)
best_groups = NULL
bestSSE = Inf
X_var = iris$Petal.Length
Y_var = iris$Sepal.Width
min_x = min(X_var)
max_x = max(X_var)
range_x = max_x - min_x
for (i in 1:10000){
b = sort(runif(n_clust-1))
b = cumsum(b / sum(b))
bounds = min_x + b * range_x
groups = cut(X_var, breaks = c(-Inf,bounds,Inf))
model = lm(Y_var~groups)
SSE = sum(model$residuals^2)
if (SSE < bestSSE){
print(SSE)
best_bounds = bounds
best_groups = groups
bestSSE = SSE
}
}
g = aggregate(Y_var, list(best_groups), mean)
names(g) = c("Cluster", "y_mean")
g$Cluster=c(best_bounds)
plot(X_var, Y_var, col='blue', pch=20)
abline(lm(Y_var~X_var), col='darkgray', lty=2)
for (i in 1:(nrow(g))){
x0 = ifelse(i == 1, min_x-max_x, g[i-1,"Cluster"])
x1 = ifelse(i < nrow(g), g[i,"Cluster"], 2*max_x)
segments(x0,g[i,"y_mean"],x1,g[i,"y_mean"], col='red')
}
R_cont = summary(lm(Y_var~X_var))$r.squared
R_cat = summary(lm(Y_var~best_groups))$r.squared
title(paste("R^2:", round(R_cont,4),"vs",round(R_cat,4)))
Note: I don't care about interpretation, only predictive ability.

I would not look at this from a clustering perspective. Instead, treat it as an optimization problem. Then do gradient descent to optimize, or any other search.
Another option would be piecewise linear regression, but you want a "piecewise constant regression" actually.

Related

Parameter estimation using Particle Filter in MATLAB

After reading the docs about "stateEstimatorPF" I get a little confused about how to create the StateTransitionFcn for my case. In my case I have 10 sensors measurments that decay exponentially and I want to find the best parameters for my function model.
The function model is x = exp(B*deltaT)*x_1, where x are the hypotheses, deltaT is the constant time delta in my measurments and x_1 is the true previous state. I would like to use the particle filter to estimate the parameter B. If I guess right, B should be the particles and the weighted mean of this particles should be what I'm looking for.
How can I write the StateTransitionFcn and use the "stateEstimatorPF" to solve this problem?
The code below is what I get so far (and it does not work):
pf = robotics.ParticleFilter
pf.StateTransitionFcn = #stateTransitionFcn
pf.StateEstimationMethod = 'mean';
pf.ResamplingMethod = 'systematic';
initialize(pf,5000,[0.9],1);
measu = [1.0, 0.9351, 0.8512, 0.9028, 0.7754, 0.7114, 0.6830, 0.6147, 0.5628, 0.7090]
states = []
for i=1:10
[statePredicted,stateCov] = predict(pf);
[stateCorrected,stateCov] = correct(pf,measu(i));
states(i) = getStateEstimate(pf)
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function predictParticles = stateTransitionFcn(pf, prevParticles,x_1)
predictParticles = exp(prevParticles)*x_1 %how to properly use x_1?%;
end

Inconsistency when estimating AR model coefficients in MATLAB

I'm trying to estimate the coefficients of an AR[2] model
x(t) = a_1*x(t-1) + a_2*x(t-2) + e(t), e(t) ~ N(0, sigma^2)
in MATLAB. For a_1 = 2*cos(2*pi/T)*exp(-1/tau), a_2 = -exp(-2/tau), the AR[2] model corresponds to a linear damped oscillator with period T and relaxation time tau. I simulated some data for this process with T = 30 and tau = 100 which corresponds to a_1 = 1.9368, a_2 = -0.9802:
T = 30; tau = 100;
a_1 = 2*cos(2*pi/T)*exp(-1/tau); a_2 = -exp(-2/tau);
simuMdl = arima(2,0,0);
simuMdl.Constant = 0;
simuMdl.Variance = 1e-1;
simuMdl.AR{1} = a_1;
simuMdl.AR{2} = a_2;
data = simulate(simuMdl, 600);
data = data(501:end);
plot(data)
I only take the last 100 timepoints to make sure the system is not influenced by the initial conditions any more. Now, when trying to estimate the parameters, everything works just fine when using the estimate command that uses maximum likelihood estimation:
ToEstMdl = arima(2,0,0); ToEstMdl.Constant = 0;
EstMdl = estimate(ToEstMdl, data);
EstMdl.AR
%'[1.9319] [-0.9745]'
However, when I use the Yule-Walker-Equations implemented in aryule, I get a completely different result that does not match the true parameter values at all:
aryule(data, 2)
%'1.0000 -1.4645 0.5255'
Does anyone have an idea why the Yule-Walker-equations have such shortcomings to the MLE approach?
Yule-Walker (YW) is a method of moment based method. As such its estimate would get better with increasing data points. You can check it in this example by using all 600 data points to see what is the 'best' YW estimate you can get had you used all the data points and the MLE would still be better than it. You can also increase the data points to say 5000 instead of 600 and you will see in this case the best YW (the one that uses all 5000 points) would start to approach the MLE estimate.

Using PCA before classification

I am using PCA to reduce number of features before training Random Forest. I first used around 70 principal components out of 125 which were around 99% of the energy (according to eigen values). I got much worse results after training Random Forests with new transformed features. After that I used all the principal components and I got the same results as when I used 70. This made no sense to me since that is the same feature space only in difirent base (the space has only be rotated so that should not affect the boundary).
Does anyone have the idea what may be the problem here?
Here is my code
clc;
clear all;
close all;
load patches_training_256.txt
load patches_testing_256.txt
Xtr = patches_training_256(:,2:end);
Xtr = Xtr';
Ytr = patches_training_256(:,1);
Ytr = Ytr';
Xtest = patches_testing_256(:,2:end);
Xtest = Xtest';
Ytest = patches_testing_256(:,1);
Ytest = Ytest';
data_size = size(Xtr, 2);
feature_size = size(Xtr, 1);
mu = mean(Xtr,2);
sigma = std(Xtr,0,2);
mu_mat = repmat(mu,1,data_size);
sigma_mat = repmat(sigma,1,data_size);
cov = ((Xtr - mu_mat)./sigma_mat) * ((Xtr - mu_mat)./sigma_mat)' / data_size;
[v d] = eig(cov);
%[U S V] = svd(((Xtr - mu_mat)./sigma_mat)');
k = 124;
%Ureduce = U(:,1:k);
%XtrReduce = ((Xtr - mu_mat)./sigma_mat) * Ureduce;
XtrReduce = v'*((Xtr - mu_mat)./sigma_mat);
B = TreeBagger(300, XtrReduce', Ytr', 'Prior', 'Empirical', 'NPrint', 1);
data_size_test = size(Xtest, 2);
mu_test = repmat(mu,1,data_size_test);
sigma_test = repmat(sigma,1,data_size_test);
XtestReduce = v' * ((Xtest - mu_test) ./ sigma_test);
Ypredict = predict(B,XtestReduce');
error = sum(Ytest' ~= (double(cell2mat(Ypredict)) - 48))
Random forest heavily depends on the choice of the base. It is not a linear model, which is (up to normalization) rotation invariant, RF completely changes behaviour once you "rotate the space". The reason behind it lies in the fact that it uses decision trees as base classifiers which analyze each feature completely independently, so as the result it fails to find any linear combination of features. Once you rotate your space you change "meaning" of features. There is nothing wrong with that, simply tree based classifiers are rather bad choice to apply after such transformations. Use features selection methods instead (methods which select which features are valuable without creating any linear combinations). In fact, RFs themselves can be used for such task due to their internal "feature importance" computation,
There is already a matlab function princomp which would do pca for you. I would suggest not to fall in numerical error loops. They have done it for us..:)

Writing a matlab program for combinations of large numbers

Because for combinations of large numbers at times matlab replies NaN, the assignment is to write a program to compute combinations of 200 objects taken 90 at a time. Once this works we are to make it into a function y = comb(n,k).
This is what I have so far based on an example we were given of the probability that 2 people in a class have the same birthday.
This is the example:
nMax = 70; %maximum number of people in classroom
nArray = 1:nMax;
prevPnot = 1; %initialize probability
for iN = 1:nMax
Pnot = prevPnot*(365-iN+1)/365; %probability that no birthdays are the same
P(iN) = 1-Pnot; %probability that at least two birthdays are the same
prevPnot = Pnot;
end
plot(nArray, P, '.-')
xlabel('nb. of people')
ylabel('prob. that at least two have same birthday')
grid on
At this point I'm having trouble because I'm more familiar with java. This is what I have so far, and it isn't coming out at all.
k = 90;
n = 200;
nArray = 1:k;
prevPnot = 1;
for counter = 1:k
Pnot = (n-counter+1)/(prevPnot*(n-k-counter+1);
P(iN) = Pnot;
prevPnot = Pnot;
end
The point of the loop I wrote is to separate out each term
i.e. n/k*(n-k), times (n-counter)/(k-counter)*(n-k-counter), and so forth.
I'm also not entirely sure how to save a loop as a function in matlab.
To compute the number of combinations of n objects taken k at a time, you can use gammaln to compute the logarithm of the factorials in order to avoid overflow:
result = exp(gammaln(n+1)-gammaln(k+1)-gammaln(n-k+1));
Another approach is to remove terms that will cancel and then compute the result:
result = prod((n-k+1:n)./(1:k));

Naive Bayse Classifier for Multiclass: Getting Same Error Rate

I have implemented the Naive Bayse Classifier for multiclass but problem is my error rate is same while I increase the training data set. I was debugging this over an over but wasn't able to figure why its happening. So I thought I ll post it here to find if I am doing anything wrong.
%Naive Bayse Classifier
%This function split data to 80:20 as data and test, then from 80
%We use incremental 5,10,15,20,30 as the test data to understand the error
%rate.
%Goal is to compare the plots in stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
function[tPercent] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
%Getting the index common to test and train data
idx = randperm(size(dm.data,1))
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the label data from the 80:20 split
train_labels = shuffledMatrix_label(1:percent_data_80,:);
test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents [5 10 15..]
percent_tracker = zeros(length(percent), 2);
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:);
new_train_label = train_labels(1:percentOfRows);
%get unique labels in training
numClasses = size(unique(new_train_label),1);
classMean = zeros(numClasses,size(new_train,2));
classStd = zeros(numClasses, size(new_train,2));
priorClass = zeros(numClasses, size(2,1));
% Doing the K class mean and std with prior
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
end
error = 0;
p = zeros(numClasses,1);
% Calculating the posterior for each test row for each k class
for testRow=1:length(test)
c=0; k=0;
for class=1:numClasses
temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
end
%Take the max of posterior
[c,k] = max(p(1,:));
if test_labels(testRow) ~= k
error = error + 1;
end
end
avgError = error/length(test);
percent_tracker(pRows,:) = [avgError percent(pRows)];
tPercent = percent_tracker;
plot(percent_tracker)
end
end
end
Here is the dimentionality of my data
x =
data: [768x8 double]
labels: [768x1 double]
I am using Pima data set from UCI
What are the results of your implementation of the training data itself? Does it fit it at all?
It's hard to be sure but there are couple things that I noticed:
It is important for every class to have training data. You can't really train a classifier to recognize a class if there was no training data.
If possible number of training examples shouldn't be skewed towards some of classes. For example if in 2-class classification number of training and cross validation examples for class 1 constitutes only 5% of the data then function that always returns class 2 will have error of 5%. Did you try checking precision and recall separately?
You're trying to fit normal distribution to each feature in a class and then use it for posterior probabilities. I'm not sure how it plays out in terms of smoothing. Could you try to re-implement it with simple counting and see if it gives any different results?
It also could be that features are highly redundant and bayes method overcounts probabilities.