Multitask learning in GPFlow with missing inputs? - gpflow

Is it possible to do multitask learning with GPFlow where some inputs are missing? Specifically, I am trying to fit spatial data from several related individuals, but the data are not at identical input locations for all individuals. I know I ought to be doing hierarchical GPs here, but they tend not to scale well. I was hoping that multitask learning might be used instead, though the use case does not map exactly onto the typical application of this method.

Is it possible to do multitask learning with GPFlow where some inputs are missing?
Absolutely, yes.
The way GPflow does this is by stacking the output index onto the input. For example, suppose you have two outputs (0 and 1) observed at locations [0.1, 0.2, 0.3] and [0.3, 0.4, 0.5]; you'd construct the "input matrix"
[0.1 0]
[0.2 0]
[0.3 0]
[0.3 1]
[0.4 1]
[0.5 1]
Then, specify how the kernel acts on this matrix using "active_dims". The simplest kernel that could act on these inputs would be:
k = gpflow.kernels.Matern32(1, active_dims=[0]) * gpflow.kernels.Coregion(1, 2, 2, active_dims=[1])
This is the intrinsic coregionalization model (see Alvarez et al. [1]). You can find a more detailed demo in the GPflow documentation.
Note that you probably want a more powerful model than intrinsic coregionalization: the Linear Model of Coregionalization is more expressive and still straightforward to implement.
[1] http://eprints.whiterose.ac.uk/114503/1/1106.6251v2.pdf

There is currently no model in GPflow that does this out of the box. However, GPflow does provide the tools to implement it easily. Two suggestions:
Use a multioutput kernel, and set the observation variance to infinity for the missing datapoints.
Define a multioutput kernel, and specify a custom Kuf for which the requested outputs are passed together with the corresponding inputs.

Related

Replacing three integrators with one

I need to calculate a linear growth model (in simulink) with continuous-time signal, described as:
x' = ax
with at least three different real parameters "a".
I've managed to do it using three integrators as you can see in the image below:
I've been told that there is a way to do it using only one integrator, but I can't figure it out.
You can give your gain blocks a vector, not just a scalar.
You can use a gain of [1 0.8 1.2] for a single gain block (with the multiplication mode set to Element-wise) instead of having three separate gain blocks set to 1, 0.8, and 1.2.
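For illustration, here is a text-based analogue of the single-integrator idea (a sketch using ode45 rather than Simulink, with the example gains [1 0.8 1.2] from above): integrating one state vector is equivalent to running one integrator block on a vector signal.
% Sketch: one integrator for all three growth models, x' = a.*x
a = [1; 0.8; 1.2];                  % the three real parameters "a"
f = @(t,x) a.*x;                    % element-wise gain, like the vector gain block
[t,x] = ode45(f, [0 5], [1; 1; 1]); % a single vector-valued state
plot(t,x)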

How do I obtain the step response of this PID controller in Matlab?

I'm relatively new to control systems. I'm trying to obtain a graph for the step response of a PID controller of the form kc*(1 + 1/(Ti*s) + Td*s) (see the code below).
Is this possible to plot in MATLAB? I get an error that the function cannot plot the step response of a system with more zeros than poles.
Is there any way to plot this system without the whole infinity issue so that I can observe the characteristics of its step response?
I'm sorry if I'm asking a dumb question that may seem obvious but any help or explanation would be greatly appreciated.
This is my MATLAB code for my PID controller:
%3.PID Control,Td=0.001, 0.01, 0.05, 0.1
a=tf([0 0 -10],[0 1 10]);
b=tf([0 -1 -5],[1 3.5 6]);
kc=5;
Ti=1;
Td=0.001;
k1=tf([0 Td 0],[0 0 1]); %derivative control
k2=tf([0 1],[Ti 0]); %integral control
G=kc*(k1+k2+1); % the controller
G1=series(a,b);
y=feedback(G,G1,-1);
subplot(2,2,1),stepplot(y),title('kc=5,Ti=1,Td=0.001');
As thewaywewalk mentioned, MATLAB can only deal with proper systems, and a pure derivative isn't proper, so you need to use an approximate derivative in your transfer function. It's never good practice to use pure derivatives anyway, as they tend to amplify noise.
Look at the documentation on the PID Controller block in Simulink to see how to implement a PID controller with approximate derivative. In short, you need to replace Kd*s by Kd*s/(1+a*s) where a is small compared to the dominant time constant of the system.
EDIT:
The best way to create your PID is to use the actual pid function from the Control System Toolbox. It implements a first-order derivative filter on the derivative term.
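As a minimal sketch (assuming the Control System Toolbox; the filter constant Tf = Td/10 is an arbitrary choice of mine, not a recommendation), the controller above could be rebuilt with pid so that the derivative term is proper:
% Sketch: same gains as the original code, with a filtered derivative
kc = 5; Ti = 1; Td = 0.001;
Tf = Td/10;                       % derivative filter time constant (assumed)
C  = pid(kc, kc/Ti, kc*Td, Tf);   % Kp + Ki/s + Kd*s/(Tf*s + 1)
a  = tf(-10, [1 10]);
b  = tf([-1 -5], [1 3.5 6]);
G1 = series(a, b);
y  = feedback(C, G1, -1);         % same loop structure as the original code
step(y)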

Best way to compare two signals in Matlab

I have a signal I made in MATLAB that I want to compare to another signal (call them y and z). What I am looking for is a way to assign a value or percentage to how similar the two signals are.
I was trying to use corrcoef, but I get very poor values (corrcoef(y,z) = -0.1141), yet when I look at the FFTs of the two signals superimposed on each other, I would have visually said they are very similar. Taking the corrcoef of the FFT magnitudes of the two signals looks a lot more promising: corrcoef(abs(fft(y)),abs(fft(z))) = 0.9955. But I am not sure that is the best way to go about it, since the two signals in their raw form appear to be uncorrelated.
Does anyone have a recommendation of how to compare two signals in Matlab as described?
Thanks!
The question is impossible to answer without a clearer definition of what you mean by "similar".
If by "similar" you mean "correlated frequency responses", then, well, you're one step ahead of the game!
In general, defining the proper metric is highly application specific; you need to answer why you want to know how similar these two signals are to know how to measure how similar they are. Will they be input to the same system? Do they need to be detected by the same algorithm?
In the meantime, your idea to use the frequency-domain correlation is not bad. But you might also consider:
http://en.wikipedia.org/wiki/Dynamic_time_warping
Or the likelihood of the time-series under various statistical models:
http://en.wikipedia.org/wiki/Hidden_Markov_model
http://en.wikipedia.org/wiki/Autoregressive_model
http://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model
Or any number of other models...
I should add: In general, the correlation coefficient between two time-series is a very poor metric of the time-series' similarity, except under very specific circumstances (e.g., no shifts in phase)
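For instance, here is a sketch of the dynamic-time-warping suggestion above (assuming the Signal Processing Toolbox's dtw function, available since R2016a; the signals are made up for illustration):
% Sketch: dtw as a shift-tolerant complement to corrcoef
t = linspace(0, 2*pi, 200);
y = sin(2*t);
z = [zeros(1,25), sin(2*t(1:175))];  % delayed copy of y
corrcoef(y, z)                       % low: plain correlation punishes the delay
d = dtw(y, z)                        % small distance: the warped shapes match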
Pete is right that you need to define a notion of similarity before progressing further. You might, however, find the normalized maximum cross-correlation magnitude to be a useful notion of similarity for your circumstances:
norm_max_xcorr_mag = @(x,y) max(abs(xcorr(x,y)))/(norm(x,2)*norm(y,2));
x = randn(1, 200); y = randn(1, 200); % two random signals
norm_max_xcorr_mag(x,y)
ans = 0.1636
y = [zeros(1, 30), 3.*x]; % y is delayed, multiplied version of x
norm_max_xcorr_mag(x,y)
ans = 1
This notion of similarity is similar to plain correlation of the two sequences, but is invariant to time delays.

how to calculate roc curves?

I wrote a classifier (a Gaussian mixture model) to classify five human actions. For every observation the classifier computes the posterior probability of belonging to a cluster.
I want to evaluate the performance of my system, parameterized with a threshold taking values from 0 to 100. For every threshold value and every observation, if the probability of belonging to one of the clusters is greater than the threshold, I accept the result of the classifier; otherwise I discard it.
For every threshold value I compute the number of true positives, true negatives, false positives, and false negatives.
Then I compute the two functions sensitivity and specificity as
sensitivity = TP/(TP+FN);
specificity=TN/(TN+FP);
In matlab:
plot(1-specificity,sensitivity);
to get the ROC curve. But the result isn't what I expect.
This is the plot of the discards, errors, corrects, sensitivity, and specificity as functions of the threshold for one action.
This is the plot of the ROC curve for one action.
This is the stem plot of the ROC curve for the same action.
I am doing something wrong, but I don't know where. Perhaps I am calculating FP, FN, TP, and TN incorrectly, especially when the classifier's output is below the threshold and I discard the observation. What should I increment when there is a discard?
Background
I am answering this because I need to work through the content, and a question like this is a great excuse. Thank you for the good opportunity.
I use data from the built-in Fisher iris dataset:
http://archive.ics.uci.edu/ml/datasets/Iris
I also use code snippets from the Mathworks tutorial on classification, and from the documentation for plotroc:
http://www.mathworks.com/products/demos/statistics/classdemo.html
http://www.mathworks.com/help/nnet/ref/plotroc.html?searchHighlight=plotroc
Problem Description
There is a clearer boundary within the domain to classify "setosa", but there is overlap for "versicolor" vs. "virginica". This is a two-dimensional plot, and some of the other information has been discarded to produce it. The ambiguity in the classification boundaries is a useful thing in this case.
%load data
load fisheriris
%show raw data
figure(1); clf
gscatter(meas(:,1), meas(:,2), species,'rgb','osd');
xlabel('Sepal length');
ylabel('Sepal width');
axis equal
axis tight
title('Raw Data')
Analysis
Let's say that we want to determine the bounds for a linear classifier that defines "virginica" versus "non-virginica". We could look at "self vs. not-self" for the other classes, but they would have their own boundaries and curves.
So now we make some linear discriminants and plot the ROC for them:
%load data
load fisheriris                 % meas (150x4 measurements), species (labels)
load iris_dataset               % irisInputs, irisTargets (one-hot, 3x150)
irisTargets = irisTargets(3,:); % 1 where the sample is virginica, 0 otherwise
%resubstitution classification with five discriminant types
ldaClass1 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'linear')';
ldaClass2 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'diaglinear')';
ldaClass3 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'quadratic')';
ldaClass4 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'diagquadratic')';
ldaClass5 = classify(meas(:,1:2),meas(:,1:2),irisTargets,'mahalanobis')';
%stack targets and outputs so plotroc draws one curve per classifier
myinput  = repmat(irisTargets,5,1);
myoutput = [ldaClass1;ldaClass2;ldaClass3;ldaClass4;ldaClass5];
whos % inspect variable sizes
plotroc(myinput,myoutput)
The result is shown in the following figure, though producing it took deleting repeated copies of the diagonal:
You can see in the code that I stack "myinput" and "myoutput" and feed them as inputs to the "plotroc" function. You should take the targets and actual outputs of your classifier and you will get similar results: plotroc compares the actual output of your classifier against the ideal output given by your target values.
So this will give you "built-in" ROC, which is useful for quick work, but does not make you learn every step in detail.
Questions you can ask at this point include:
Which classifier is best? How do I determine what "best" means in this case?
What is the convex hull of the classifiers? Is there some mixture of classifiers that is more informative than any pure method? Bagging perhaps?
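If you would rather build the curve from scores than from hard class labels, here is a minimal sketch using perfcurve (assuming the Statistics Toolbox; the 'linear' discriminant and the use of its posterior probabilities are my choices for illustration):
%sketch: ROC from posterior probabilities with perfcurve
load fisheriris
labels = strcmp(species, 'virginica');  % binary ground truth
[~,~,posterior] = classify(meas(:,1:2), meas(:,1:2), labels, 'linear');
[X,Y,~,AUC] = perfcurve(labels, posterior(:,2), true); % column 2 = P(virginica)
plot(X,Y); xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC, AUC = %.2f', AUC))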
You are trying to draw the curves of precision vs. recall as a function of the classifier's threshold parameter. The definitions of precision and recall are:
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
You can check the definition of these parameters in:
http://en.wikipedia.org/wiki/Precision_and_recall
There are some curves here:
http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
Are you dividing your dataset into training, cross-validation, and test sets? (If you do not divide the data, it is normal for your precision-recall curve to look weird.)
EDITED: I think there are two possible sources for your problem:
When you train a classifier for 5 classes, usually you have to train 5 distinct one-vs-rest classifiers: one classifier for (class A = class 1, class B = class 2, 3, 4 or 5), a second classifier for (class A = class 2, class B = class 1, 3, 4 or 5), ..., and a fifth for (class A = class 5, class B = class 1, 2, 3 or 4).
As you said, to select the output of your "compound" classifier you pass your new (test) datapoint through the five classifiers and choose the one with the biggest probability.
Then you would have 5 thresholds defining weighting values that may prioritize selecting one classifier over the others. You should check how the MATLAB implementation uses these thresholds, but their effect is that you do not choose the class with the highest probability, but the class with the best weighted probability.
As you say, maybe you are not calculating TP, TN, FP, and FN correctly. Your test data should contain datapoints belonging to all the classes, so that testdata(i,:) and classtestdata(i) are the feature vector and ground-truth class of datapoint i. When you evaluate the classifier you obtain classifierOutput(i) = 1, 2, 3, 4 or 5. You should then calculate the confusion matrix, which is the way to obtain TP, TN, FP, and FN when you have multiple classes (> 2):
http://en.wikipedia.org/wiki/Confusion_matrix
http://www.mathworks.com/help/stats/confusionmat.html
(note the relation between TP, TN, FP, FN that you are calculating for the multiclass problem)
I think you can obtain the TP, TN, FP, and FN data of each subclassifier (remember that you are effectively calculating 5 separate classifiers, even if you do not realize it) from the confusion matrix. I am not sure, but you should be able to draw the precision-recall curve for each subclassifier.
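As a minimal sketch (using confusionmat from the link above, and the classtestdata/classifierOutput notation of this answer; k is whichever class you are scoring):
%sketch: per-class TP/FP/FN/TN from a multiclass confusion matrix
C  = confusionmat(classtestdata, classifierOutput); % rows = true, cols = predicted
k  = 3;                                             % class of interest (1..5)
TP = C(k,k);
FP = sum(C(:,k)) - TP;   % predicted as k but actually another class
FN = sum(C(k,:)) - TP;   % actually k but predicted as another class
TN = sum(C(:)) - TP - FP - FN;
sensitivity = TP/(TP+FN);
specificity = TN/(TN+FP);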
Also check these slides: http://www.slideserve.com/MikeCarlo/multi-class-and-structured-classification
I don't know what the ROC curve is; I will check it out, because machine learning is a really interesting subject for me.
Hope this helps,

Neural networks - input values

I have a question that may be trivial, but it's not described anywhere I've looked. I'm studying neural networks, and everywhere I look there's some theory and some trivial example with 0s and 1s as inputs. I'm wondering: do I have to put only one value as the input for one neuron, or can it be a vector of, let's say, 3 values (an RGB colour, for example)?
The above answers are technically correct, but don't explain the simple truth: there is never a situation where you'd need to give a vector of numbers to a single neuron.
From a practical standpoint this is because (as one of the earlier solutions has shown) you can just have a neuron for each number in a vector and then have all of those be the input to a single neuron. This should get you your desired behavior after training, as the second layer neuron can effectively make use of the entire vector.
From a mathematical standpoint, there is a fundamental theorem of coding theory that states that any vector of numbers can be represented as a single number. Thus, if you really don't want an extra layer of neurons, you could simply encode the RGB values into a single number and input that to the neuron. Though, this coding function would probably make most learning problems more difficult, so I doubt this solution would be worth it in most cases.
To summarize: artificial neural networks are used without giving a vector to an input unit, but lose no computational power because of this.
When dealing with multi-dimensional data, I believe a two-layer neural network is said to give better results.
In your case:
R[0..1] => (N1)-----\
                     \
G[0..1] => (N2)------(N4) => Result[0..1]
                     /
B[0..1] => (N3)-----/
As you can see, the N4 neuron can handle 3 inputs.
The [0..1] interval is a convention, but a good one IMO. That way, you can easily code a set of generic neuron classes that can take an arbitrary number of inputs (personally, I had template C++ classes with the number of inputs as a template parameter). So you code the logic of your neurons once, then you play with the structure of the network and/or the combinations of functions within your neurons.
Generally, the input for a single neuron is a value between 0 and 1. That convention is not just for ease of implementation: normalizing the input values to the same range ensures that each input carries a similar weighting. (If some of your images use 3-bit color with pixel values between 0 and 7 and others use 8-bit color with pixel values between 0 and 255, you probably wouldn't want to favor the higher-bit-depth images just because their numerical values are higher. Similarly, you will probably want your images to be the same dimensions.)
As far as using pixel values as inputs goes, it is very common to gather a higher-level representation of the image than its raw pixels. For example, given a 5 x 5 (normalized) grayscale image:
[1 1 1 1 1]
[0 0 1 0 0]
[0 0 1 0 0]
[0 0 1 0 0]
[0 0 1 0 0]
We could use the following feature matrices to help discover horizontal, vertical, and diagonal features of the images. See Python Haar face detection for more information.
[1 1] [0 0] [1 0] [0 1] [1 0], [0 1]
[0 0], [1 1], [1 0], [0 1], [0 1], [1 0]
To build the input vector, v, for this image, take the first 2x2 feature matrix and "apply" it with element-wise multiplication to the first position in the image. Applying,
[1 1] (the first feature matrix) to [1 1] (the first position in the image)
[0 0] [0 0]
will result in 2 because 1*1 + 1*1 + 0*0 + 0*0 = 2. Append 2 to the back of your input vector for this image. Then move this feature matrix to the next position, one to the right, and apply it again, adding the result to the input vector. Do this repeatedly for each position of the feature matrix and for each of the feature matrices. This will build your input vector for a single image. Be sure that you build the vectors in the same order for each image.
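As a minimal sketch of that sliding multiply-and-sum: the sum of element-wise products at every position is exactly 2-D correlation, so MATLAB's filter2 computes it in one call (the variable names here are mine):
%sketch: building the feature-response vector for the example image
img = [1 1 1 1 1;
       0 0 1 0 0;
       0 0 1 0 0;
       0 0 1 0 0;
       0 0 1 0 0];
features = {[1 1;0 0], [0 0;1 1], [1 0;1 0], [0 1;0 1], [1 0;0 1], [0 1;1 0]};
v = [];
for i = 1:numel(features)
    r = filter2(features{i}, img, 'valid'); % response at every 2x2 position
    v = [v, r(:)'];  % append (column-major order; any fixed order works)
end
v(1)  % = 2, matching the worked example above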
In this case the image is black and white, but with RGB values you could extend the algorithm to do the same computation but add 3 values to the input vector for each pixel--one for each color. This should provide you with one input vector per image and a single input to each neuron. The vectors will then need to be normalized before running through the network.
Normally a single neuron takes multiple real numbers as its input and outputs a real number, typically calculated by applying the sigmoid function to the weighted sum of those inputs plus a constant offset.
If you want to put in, say, two RGB vectors (2 x 3 reals), you need to decide how you want to combine the values. If you add all the elements together and apply the sigmoid function, it is equivalent to getting in six reals "flat". On the other hand, if you process the R elements, then the G elements, and the B elements, all individually (e.g. sum or subtract the pairs), you have in practice three independent neurons.
So in short, no, a single neuron does not take in vector values.
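For concreteness, a minimal sketch of one such neuron (the weights and bias are made-up numbers; in practice they are learned):
%sketch: a single sigmoid neuron with three scalar inputs
x = [0.2; 0.5; 0.9];          % e.g. one normalized R, G, B triple
w = [0.4; -0.3; 0.8];         % weights (made-up values)
b = 0.1;                      % bias
y = 1/(1 + exp(-(w'*x + b))); % sigmoid of the weighted sum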
Use the light wavelength normalized to the visible spectrum as the input.
There are some approximate equations on the net; search for "RGB to wavelength conversion".
Or use the HSL color model and extract the Hue component, possibly using Saturation and Lightness as well.
It can be whatever you want, as long as you write your inner function accordingly.
The examples you mention use [0;1] as their domain, but you can use R, R², or whatever you want, as long as the function you use in your neurons is defined on this domain.
In your case, you can define your functions on R³ to allow RGB values to be handled.
A trivial example: use (x1,y1,z1),(x2,y2,z2) -> (a*x1+x2, b*y1+y2, c*z1+z2) as your function to transform two colors into one, with a, b, and c being your learning coefficients, which you determine during the learning phase.
Very detailed information (including the answer to your question) is available on Wikipedia.