How to setup fitensemble for binary imbalanced datasets? - matlab

I've been trying to test matlab's ensemble methods with randomly generated imbalance dataset and no matter what I set the prior/cost/weight parameters the method never predicts close to the label ratio.
Below is an example of the tests I did.
prob = 0.9; %set label ratio to 90% 1 and 10% 0
y = (rand(100,1) < prob);
X = rand(100,3); %generate random training data with three features
X_test = rand(100,3); %generate random test data
%A few parameter sets I've tested
B = TreeBagger(100,X,y);
B2 = TreeBagger(100,X,y,'Prior','Empirical');
B3 = TreeBagger(100,X,y,'Cost',[0,9;1,0]);
B4 = TreeBagger(100,X,y,'Cost',[0,1;9,0]);
B5 = fitensemble(X,y,'RUSBoost', 20, 'Tree', 'Prior', 'Empirical');
Here I tried to predict the trained classifiers on random test data. My assumption is that since the classifier is trained on random data, it should on average predict close to the dataset ratio (1/9) if it takes the prior into account. But each of the classifiers predicted 98-100% in favor of '1' instead of ~90% that I am looking for.
l1 = predict(B,X_test);
l2 = predict(B2,X_test);
l3 = predict(B3,X_test);
l4 = predict(B4,X_test);
l5 = predict(B5,X_test);
How do I get the ensemble method to take the prior into account? Or is there a fundamental misunderstanding on my part?

I don't think it can work like you think.
Thats because as i understood your training and test data is random. So how should your classifier find any relation between features and your label?
lets take the accuracy as a mesurement and make an example.
class A: 900 datarows.
class B: 100 datarows.
Classify 100% as A:
0.9*/(0.1+0.9) = 0.9
gets you 90% Accuracy.
if your classifier does something different, means trying to classify some datarows to B he will by chance get 9 times more wrongly classified A datarows
Lets say 20 B datarows are correctly classified you will get around 180 wrong a classified A datarows
B: 20 correct, 80 incorrect
A: 720 correct, 180 wrong
740/(740+260) = 0.74
Accuracy goes down to 74 %. And thats not something your classifying algorithms want.
Long story short: Your classifier will allways tend to classify allmost 100% class A if you dont get any information into your Data

Related

Problem understanding Loss function behavior using Flux.jl. in Julia

So. First of all, I am new to Neural Network (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and its corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # ensamble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say)Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector is normalized to unity (represents x,y,z coordinates), and the following 60 numbers are also normalized to unity and corresponds to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly, however, it comes to my attention that as I train my model with an increasing data set(Nₜᵣ = 10,15,25, etc...), the loss function seams to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening?. I can not see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tryed changing the activ. functions
I have not tryed changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly?. Say, If every single data vector is made of 63 numbers, is it correctly to group them in an array? and then pile them into an ´´´Array{Array{Float64,1},1}´´´?. I have no experience using NN and flux. How can I made a data set of 10000 I/O vectors differently? Can this be the issue?. (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong?. Is the iteration loop really iterations or are they epochs. I am struggling to put(differentiate) this concept of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.

Bicoin price prediction using spark and scala [duplicate]

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
split = map(sanitize, line.split(','))
rev = split.pop(-2)
return LabeledPoint(rev, split)
def sanitize(value):
return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This seems to not work for your use case. I'd recommend you to try different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, numIterartions = 10, stepSize = 0.001)

Bag of words not correctly labeling responses

I am trying to implement Bag of Words in opencv and has come with the implementation below. I am using Caltech 101 database. However, since its my first time and not being familiar, I have planned to used two image sets from the database, the chair image set and the soccer ball image set. I have coded for the svm using this.
Everything went allright, except when I call classifier.predict(descriptor) , I do not get the label vale as intended. I always get a0 instead of '1', irrespective of my test image. The number of images in the chair dataset is 10 and in the soccer ball dataset is 10. I labelled chair as 0 and soccer ball as 1 . The links represent the samples of each categories, the top 10 is of chairs, the bottom 10 is of soccer balls
function hello
clear all; close all; clc;
detector = cv.FeatureDetector('SURF');
extractor = cv.DescriptorExtractor('SURF');
links = {
'http://i.imgur.com/48nMezh.jpg'
'http://i.imgur.com/RrZ1i52.jpg'
'http://i.imgur.com/ZI0N3vr.jpg'
'http://i.imgur.com/b6lY0bJ.jpg'
'http://i.imgur.com/Vs4TYPm.jpg'
'http://i.imgur.com/GtcwRWY.jpg'
'http://i.imgur.com/BGW1rqS.jpg'
'http://i.imgur.com/jI9UFn8.jpg'
'http://i.imgur.com/W1afQ2O.jpg'
'http://i.imgur.com/PyX3adM.jpg'
'http://i.imgur.com/U2g4kW5.jpg'
'http://i.imgur.com/M8ZMBJ4.jpg'
'http://i.imgur.com/CinqIWI.jpg'
'http://i.imgur.com/QtgsblB.jpg'
'http://i.imgur.com/SZX13Im.jpg'
'http://i.imgur.com/7zVErXU.jpg'
'http://i.imgur.com/uUMGw9i.jpg'
'http://i.imgur.com/qYSkqEg.jpg'
'http://i.imgur.com/sAj3pib.jpg'
'http://i.imgur.com/DMPsKfo.jpg'
};
N = numel(links);
trainer = cv.BOWKMeansTrainer(100);
train = struct('val',repmat({' '},N,1),'img',cell(N,1), 'pts',cell(N,1), 'feat',cell(N,1));
for i=1:N
train(i).val = links{i};
train(i).img = imread(links{i});
if ndims(train(i).img > 2)
train(i).img = rgb2gray(train(i).img);
end;
train(i).pts = detector.detect(train(i).img);
train(i).feat = extractor.compute(train(i).img,train(i).pts);
end;
for i=1:N
trainer.add(train(i).feat);
end;
dictionary = trainer.cluster();
extractor = cv.BOWImgDescriptorExtractor('SURF','BruteForce');
extractor.setVocabulary(dictionary);
for i=1:N
desc(i,:) = extractor.compute(train(i).img,train(i).pts);
end;
a = zeros(1,10)';
b = ones(1,10)';
labels = [a;b];
classifier = cv.SVM;
classifier.train(desc,labels);
test_im =rgb2gray(imread('D:\ball1.jpg'));
test_pts = detector.detect(test_im);
test_feat = extractor.compute(test_im,test_pts);
val = classifier.predict(test_feat);
disp('Value is: ')
disp(val)
end
These are my test samples:
Soccer Ball
(source: timeslive.co.za)
Chair
Searching through this site I think that my algorithm is okay, even though I am not quite confident about it. If anybody can help me in finding the bug, it will be appreciable.
Following Amro's code , this was my result:
Distribution of classes:
Value Count Percent
1 62 49.21%
2 64 50.79%
Number of training instances = 61
Number of testing instances = 65
Number of keypoints detected = 38845
Codebook size = 100
SVM model parameters:
svm_type: 'C_SVC'
kernel_type: 'RBF'
degree: 0
gamma: 0.5063
coef0: 0
C: 62.5000
nu: 0
p: 0
class_weights: 0
term_crit: [1x1 struct]
Confusion matrix:
ans =
29 1
1 34
Accuracy = 96.92 %
Your logic looks fine to me.
Now I guess you'll have to tweak the various parameters if you want to improve the classification accuracy. This includes the clustering algorithm parameters (such as the vocabulary size, clusters initialization, termination criteria, etc..), the SVM parameters (kernel type, the C coefficient, ..), the local features algorithm used (SIFT, SURF, ..).
Ideally, whenever you want to perform parameter selection, you ought to use cross-validation. Some methods already have such mechanism embedded (CvSVM::train_auto for instance), but for the most part you'll have to do this manually...
Finally you should follow general machine learning guidelines; see the whole bias-variance tradeoff dilemma. The online Coursera ML class discusses this topic in detail in week 6, and explains how to perform error analysis and use learning curves to decide what to try next (do we need to add more instances, increase model complexity, and so on..).
With that said, I wrote my own version of the code. You might wanna compare it with your code:
% dataset of images
% I previously saved them as: chair1.jpg, ..., ball1.jpg, ball2.jpg, ...
d = [
dir(fullfile('images','chair*.jpg')) ;
dir(fullfile('images','ball*.jpg'))
];
% local-features algorithm used
detector = cv.FeatureDetector('SURF');
extractor = cv.DescriptorExtractor('SURF');
% extract local features from images
t = struct();
for i=1:numel(d)
% load image as grayscale
img = imread(fullfile('images', d(i).name));
if ~ismatrix(img), img = rgb2gray(img); end
% extract local features
pts = detector.detect(img);
feat = extractor.compute(img, pts);
% store along with class label
t(i).img = img;
t(i).class = find(strncmp(d(i).name,{'chair','ball'},4));
t(i).pts = pts;
t(i).feat = feat;
end
% split into training/testing sets
% (a better way would be to use cvpartition from Statistics toolbox)
disp('Distribution of classes:')
tabulate([t.class])
tTrain = t([1:7 11:17]);
tTest = t([8:10 18:20]);
fprintf('Number of training instances = %d\n', numel(tTrain));
fprintf('Number of testing instances = %d\n', numel(tTest));
% build visual vocabulary (by clustering training descriptors)
K = 100;
bowTrainer = cv.BOWKMeansTrainer(K, 'Attempts',5, 'Initialization','PP');
clust = bowTrainer.cluster(vertcat(tTrain.feat));
fprintf('Number of keypoints detected = %d\n', numel([tTrain.pts]));
fprintf('Codebook size = %d\n', K);
% compute histograms of visual words for each training image
bowExtractor = cv.BOWImgDescriptorExtractor('SURF', 'BruteForce');
bowExtractor.setVocabulary(clust);
M = zeros(numel(tTrain), K);
for i=1:numel(tTrain)
M(i,:) = bowExtractor.compute(tTrain(i).img, tTrain(i).pts);
end
labels = vertcat(tTrain.class);
% train an SVM model (perform paramter selection using cross-validation)
svm = cv.SVM();
svm.train_auto(M, labels, 'SvmType','C_SVC', 'KernelType','RBF');
disp('SVM model parameters:'); disp(svm.Params)
% evaluate classifier using testing images
actual = vertcat(tTest.class);
pred = zeros(size(actual));
for i=1:numel(tTest)
descs = bowExtractor.compute(tTest(i).img, tTest(i).pts);
pred(i) = svm.predict(descs);
end
% report performance
disp('Confusion matrix:')
confusionmat(actual, pred)
fprintf('Accuracy = %.2f %%\n', 100*nnz(pred==actual)./numel(pred));
Here are the output:
Distribution of classes:
Value Count Percent
1 10 50.00%
2 10 50.00%
Number of training instances = 14
Number of testing instances = 6
Number of keypoints detected = 6300
Codebook size = 100
SVM model parameters:
svm_type: 'C_SVC'
kernel_type: 'RBF'
degree: 0
gamma: 0.5063
coef0: 0
C: 312.5000
nu: 0
p: 0
class_weights: []
term_crit: [1x1 struct]
Confusion matrix:
ans =
3 0
1 2
Accuracy = 83.33 %
So the classifier correctly labels 5 out of 6 images from the test set, which is not bad for a start :) Obviously you'll get different results each time you run the code due to the inherent randomness of the clustering step.
What is the number of images you are using to build your dictionary i.e. what is N? From your code, it seems that you are only using a 10 images (those listed in links). I hope this list is truncated down for this post else that would be too few. Typically you need in the order of 1000 or much more images to build the dictionary and the images need not be restricted to only these 2 classes that you are classifying. Otherwise, with only 10 images and 100 clusters your dictionary is likely to be messed up.
Also, you might want to use SIFT as a first choice as it tends to perform better than the other descriptors.
Lastly, you can also debug by checking the detected keypoints. You can get OpenCV to draw the keypoints. Sometimes your keypoint detector parameters are not set properly, resulting in too few keypoints getting detected, which in turn gives poor feature vectors.
To understand more about the BOW algorithm, you can take a look at these posts here and here. The second post has a link to a free pdf for an O'Reilley book on computer vision using python. The BOW model (and other useful stuff) is described in more details inside that book.
Hope this helps.

MATLAB: Naive Bayes with Univariate Gaussian

I am trying to implement Naive Bayes Classifier using a dataset published by UCI machine learning team. I am new to machine learning and trying to understand techniques to use for my work related problems, so I thought it's better to get the theory understood first.
I am using pima dataset (Link to Data - UCI-ML), and my goal is to build Naive Bayes Univariate Gaussian Classifier for K class problem (Data is only there for K=2). I have done splitting data, and calculate the mean for each class, standard deviation, priors for each class, but after this I am kind of stuck because I am not sure what and how I should be doing after this. I have a feeling that I should be calculating posterior probability,
Here is my code, I am using percent as a vector, because I want to see the behavior as I increase the training data size from 80:20 split. Basically if you pass [10 20 30 40] it will take that percentage from 80:20 split, and use 10% of 80% as training.
function[classMean] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
idx = randperm(size(dm.data,1))
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
train_labels = shuffledMatrix_label(1:percent_data_80,:)
test_labels = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:)
new_trin_label = shuffledMatrix_label(1:percentOfRows)
%get unique labels in training
numClasses = size(unique(new_trin_label),1)
classMean = zeros(numClasses,size(new_train,2));
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_trin_label == kclass,:))
std(new_train(new_trin_label == kclass,:))
priorClassforK = length(new_train(new_trin_label == kclass))/length(new_train)
priorClassforK_1 = 1 - priorClassforK
end
end
end
end
First, compute the probability of evey class label based on frequency counts. For a given sample of data and a given class in your data set, you compute the probability of evey feature. After that, multiply the conditional probability for all features in the sample by each other and by the probability of the considered class label. Finally, compare values of all class labels and you choose the label of the class with the maximum probability (Bayes classification rule).
For computing conditonal probability, you can simply use the Normal distribution function.

using precomputed kernels with libsvm

I'm currently working on classifying images with different image-descriptors. Since they have their own metrics, I am using precomputed kernels. So given these NxN kernel-matrices (for a total of N images) i want to train and test a SVM. I'm not very experienced using SVMs though.
What confuses me though is how to enter the input for training. Using a subset of the kernel MxM (M being the number of training images), trains the SVM with M features. However, if I understood it correctly this limits me to use test-data with similar amounts of features. Trying to use sub-kernel of size MxN, causes infinite loops during training, consequently, using more features when testing gives poor results.
This results in using equal sized training and test-sets giving reasonable results. But if i only would want to classify, say one image, or train with a given amount of images for each class and test with the rest, this doesn't work at all.
How can i remove the dependency between number of training images and features, so i can test with any number of images?
I'm using libsvm for MATLAB, the kernels are distance-matrices ranging between [0,1].
You seem to already have figured out the problem... According to the README file included in the MATLAB package:
To use precomputed kernel, you must include sample serial number as
the first column of the training and testing data.
Let me illustrate with an example:
%# read dataset
[dataClass, data] = libsvmread('./heart_scale');
%# split into train/test datasets
trainData = data(1:150,:);
testData = data(151:270,:);
trainClass = dataClass(1:150,:);
testClass = dataClass(151:270,:);
numTrain = size(trainData,1);
numTest = size(testData,1);
%# radial basis function: exp(-gamma*|u-v|^2)
sigma = 2e-3;
rbfKernel = #(X,Y) exp(-sigma .* pdist2(X,Y,'euclidean').^2);
%# compute kernel matrices between every pairs of (train,train) and
%# (test,train) instances and include sample serial number as first column
K = [ (1:numTrain)' , rbfKernel(trainData,trainData) ];
KK = [ (1:numTest)' , rbfKernel(testData,trainData) ];
%# train and test
model = svmtrain(trainClass, K, '-t 4');
[predClass, acc, decVals] = svmpredict(testClass, KK, model);
%# confusion matrix
C = confusionmat(testClass,predClass)
The output:
*
optimization finished, #iter = 70
nu = 0.933333
obj = -117.027620, rho = 0.183062
nSV = 140, nBSV = 140
Total nSV = 140
Accuracy = 85.8333% (103/120) (classification)
C =
65 5
12 38