Decision Level Fusion of SVR outputs - matlab

I have two sets of features predicting the same outputs. But instead of training everything at once, I would like to train them separately and fuse the decisions. In SVM classification, we can take the probability values for the classes which can be used to train another SVM. But in SVR, how can we do this?
Any ideas?
Thanks :)

There are a couple of choices here . The two most popular ones would be:
ONE)
Build the two models and simply average the results.
It tends to work well in practice.
TWO)
You could do it in a very similar fashion as when you have probabilities. The problem is, you need to control for over fitting .What I mean is that it is "dangerous" to produce a score with one set of features and apply to another where the labels are exactly the same as before (even if the new features are different). This is because the new applied score was trained on these labels and therefore over fits in it (hyper-performs).
Normally you use a Cross-validation
In your case you have
train_set_1 with X1 features and label Y
train_set_2 with X2 features and same label Y
Some psedo code:
randomly split 50-50 both train_set_1 and train_set_2 at exactly the same points along with the Y (output array)
so now you have:
a.train_set_1 (50% of training_set_1)
b.train_set_1 (the rest of 50% of training_set_1)
a.train_set_2 (50% of training_set_2)
b.train_set_2 (the rest of 50% of training_set_2)
a.Y (50% of the output array that corresponds to the same sets as a.train_set_1 and a.train_set_2)
b.Y (50% of the output array that corresponds to the same sets as b.train_set_1 and b.train_set_2)
here is the key part
Build a svr with a.train_set_1 (that contains X1 features) and output a.Y and
Apply that model's prediction as a feature to b.train_set_2 .
By this I mean, you score the b.train_set_2 base on your first model. Then you take this score and paste it next to your a.train_set_2 .So now this set will have X2 features + 1 more feature, the score produced by the first model.
Then build your final model on b.train_set_2 and b.Y
The new model , although uses the score produced from training_set_1, still it does so in an unbiased way , since the later was never trained on these labels!
You might also find this paper quite useful

Related

ANN multiple vs single outputs

I recently started studying ANN, and there is something that I've been trying to figure out that I can't seem to find an answer to (probably because it's too trivial or because I'm searching for the wrong keywords..).
When do you use multiple outputs instead of single outputs? I guess in simplest case of 1/0-classification its the easiest to use the "sign" as the output activiation function. But in which case do you use several outputs? Is it if you have for instance a multiple classification problem, so you want to classify something as, say for instance, A, B or C and you choose 1 output neuron for each class? How do you determine which class it belongs to?
In a classification context, there are a couple of situations where using multiple output units can be helpful: multiclass classification, and explicit confidence estimation.
Multiclass
For the multiclass case, as you wrote in your question, you typically have one output unit in your network for each class of data you're interested in. So if you're trying to classify data as one of A, B, or C, you can train your network on labeled data, but convert all of your "A" labels to [1 0 0], all your "B" labels to [0 1 0], and your "C" labels to [0 0 1]. (This is called a "one-hot" encoding.) You also probably want to use a logistic activation on your output units to restrict their activation values to the interval (0, 1).
Then, when you're training your network, it's often useful to optimize a "cross-entropy" loss (as opposed to a somewhat more intuitive Euclidean distance loss), since you're basically trying to teach your network to output the probability of each class for a given input. Often one uses a "softmax" (also sometimes called a Boltzmann) distribution to define this probability.
For more info, please check out http://www.willamette.edu/~gorr/classes/cs449/classify.html (slightly more theoretical) and http://deeplearning.net/tutorial/logreg.html (more aimed at the code side of things).
Confidence estimation
Another cool use of multiple outputs is to use one output as a standard classifier (e.g., just one output unit that generates a 0 or 1), and a second output to indicate the confidence that this network has in its classification of the input signal (e.g., another output unit that generates a value in the interval (0, 1)).
This could be useful if you trained up a separate network on each of your A, B, and C classes of data, but then also presented data to the system later that came from class D (or whatever) -- in this case, you'd want each of the networks to indicate that they were uncertain of the output because they've never seen something from class D before.
Have a look at softmax layer for instance. Maximum output of this layer is your class. And it has got nice theoretical justification.
To be concise : you take previous layer's output and interpret it as a vector in m dimensional space. After that you fit K gaussians to it, which are sharing covariance matrices. If you model it and write out equations it amounts to softmax layer. For more details see "Machine Learning. A Probabilistic Perspective" by Kevin Murphy.
It is just an example of using last layer for multiclass classification. You can as well use multiple outputs for something else. For instance you can train ANN to "compress" your data, that is calculate a function from N dimensional to M dimensional space that minimizes loss of information (this model is called autoencoder)

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). But why I ask is because in other questions regarding one class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): It is proved that nu is an upper bound on the fraction of training errors and
a lower bound of the fraction of support vectors. In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace can by should.
So when you say that the resulting model does yield performance somewhat in line with what I would expect ... ofcourse it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as an outlier.

Rapidminer - neural net operator - output confidence

I have feed-forward neural network with six inputs, 1 hidden layer and two output nodes (1; 0). This NN is learned by 0;1 values.
When applying model, there are created variables confidence(0) and confidence(1), where sum of this two numbers for each row is 1.
My question is: what do these two numbers (confidence(0) and confidence(1)) exactly mean? Are these two numbers probabilities?
Thanks for answers
In general
The confidence values (or scores, as they are called in other programs) represent a measure how, well, confident the model is that the presented example belongs to a certain class. They are highly dependent on the general strategy and the properties of the algorithm.
Examples
The easiest example to illustrate is the majority classifier, who just assigns the same score for all observations based on the proportions in the original testset
Another is example the k-nearest-neighbor-classifier, where the score for a class i is calculated by averaging the distance to those examples which both belong to the k-nearest-neighbors and have class i. Then the score is sum-normalized across all classes.
In the specific example of NN, I do not know how they are calculated without checking the code. I guess it is just the value of output node, sum-normalized across both classes.
Do the confidences represent probabilities ?
In general no. To illustrate what probabilities in this context mean: If an example has probability 0.3 for class "1", then 30% of all examples with similar feature/variable values should belong to class "1" and 70% should not.
As far as I know, his task is called "calibration". For this purpose some general methods exist (e.g. binning the scores and mapping them to the class-fraction of the corresponding bin) and some classifier-dependent (like e.g. Platt Scaling which has been invented for SVMs). A good point to start is:
Bianca Zadrozny, Charles Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates
The confidence measures correspond to the proportion of outputs 0 and 1 that are activated in the initial training dataset.
E.g. if 30% of your training set has outputs (1;0) and the remaining 70% has outputs (0; 1), then confidence(0) = 30% and confidence(1) = 70%

Principal component analysis

I have to write a classificator (gaussian mixture model) that I use for human action recognition.
I have 4 dataset of video. I choose 3 of them as training set and 1 of them as testing set.
Before I apply the gm model on the training set I run the pca on it.
pca_coeff=princomp(trainig_data);
score = training_data * pca_coeff;
training_data = score(:,1:min(size(score,2),numDimension));
During the testing step what should I do? Should I execute a new princomp on testing data
new_pca_coeff=princomp(testing_data);
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
or I should use the pca_coeff that I compute for the training data?
score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
The classifier is being trained on data in the space defined by the principle components of the training data. It doesn't make sense to evaluate it in a different space - therefore, you should apply the same transformation to testing data as you did to training data, so don't compute a different pca_coef.
Incidently, if your testing data is drawn independently from the same distribution as the training data, then for large enough training and test sets, the principle components should be approximately the same.
One method for choosing how many principle components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:
[pca_coeff score eigenvalues] = princomp(data);
The eigenvalues variable will then be an array where each element describes the amount of variance accounted for by the corresponding principle component. If you do:
plot(eigenvalues);
you should see that the first eigenvalue will be the largest, and they will rapidly decrease (this is called a "Scree Plot", and should look like this: http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though your one may have up to 800 points instead of 12).
Principle components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value, and then select all principle components where the eigenvalue is above that threshold. An informal way of picking the threshold is to look at the Scree plot and choose the threshold to be just after the line 'levels out' - in the image I linked earlier, a good value might be ~0.8, selecting 3 or 4 principle components.
IIRC, you could do something like:
proportion_of_variance = sum(eigenvalues(1:k)) ./ sum(eigenvalues);
to calculate "the proportion of variance described by the low dimensional data".
However, since you are using the principle components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs with the Scree plot is just to try classification with various numbers of principle components and see what the best number is empirically.

Process for comparing two datasets

I have two datasets at the time (in the form of vectors) and I plot them on the same axis to see how they relate with each other, and I specifically note and look for places where both graphs have a similar shape (i.e places where both have seemingly positive/negative gradient at approximately the same intervals). Example:
So far I have been working through the data graphically but realize that since the amount of the data is so large plotting each time I want to check how two sets correlate graphically it will take far too much time.
Are there any ideas, scripts or functions that might be useful in order to automize this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity and the more precisely you can describe what you want for "similar" to mean in your problem the easiest it will be to implement it regardless of the programming language.
Having said that, here is some of the thing you could look at :
correlation of the two datasets
difference of the derivative of the datasets (but I don't think it would be robust enough)
spectral analysis as mentionned by #thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data either with a simple averaging filter (Matlab 'smooth'), or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace.
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.