How to take the difference between the resulting and the correct bucket of a one hot vector into account? - neural-network

Hi, I am using TensorFlow at my university to classify the steering angles of a simulation program using only the images the simulation produces.
The steering angles are values from -1 to 1, and I separated them into 50 "buckets". So the first value of my prediction vector would mean that the predicted steering angle is between -1 and -0.96.
The following shows the classification and optimization functions I am using.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y))
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)
y is a vector with 49 zeros and a single 1 for the correct bucket. My question now is:
How do I take into account that if, e.g., the correct bucket is at index 25, a prediction of 26 is much better than a prediction of 48?
I didn't post the actual network since it is just a couple of conv2d and maxpool layers with a fully connected layer at the end.

Since you are applying cross-entropy (negative log-likelihood), you are penalizing the system based on the predicted output and the ground truth.
Say your system predicted different scores on its 50 class outputs and the highest one was class 25, but your ground truth is class 26. The loss then looks at the value predicted for class 26 and adapts the parameters so that this output gets a higher score the next time the network sees this input.
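To make the concern in the question concrete, here is a minimal sketch in plain Python (not TensorFlow) of cross-entropy against a one-hot target. The helper `peaked` and the probability values are made up for illustration. It shows that the loss depends only on the probability assigned to the true bucket, so a prediction peaking at bucket 26 and one peaking at bucket 48 are penalized identically when both give the true bucket 25 the same probability.

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot target: only the true bucket's probability enters."""
    return -math.log(probs[true_index])

def peaked(n, peak, p_peak, p_true, true_index):
    """Hypothetical helper: put p_peak on `peak`, p_true on `true_index`,
    and spread the remaining mass evenly over the other buckets."""
    rest = (1.0 - p_peak - p_true) / (n - 2)
    probs = [rest] * n
    probs[peak] = p_peak
    probs[true_index] = p_true
    return probs

# True bucket is 25; one prediction peaks next to it, the other far away.
near = peaked(50, peak=26, p_peak=0.6, p_true=0.2, true_index=25)
far = peaked(50, peak=48, p_peak=0.6, p_true=0.2, true_index=25)

loss_near = cross_entropy(near, 25)
loss_far = cross_entropy(far, 25)
print(loss_near, loss_far)  # the two losses are identical
```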

You could do two basic things:
Change your y and prediction to be scalars in the range -1..1 and make the loss function (y-prediction)**2 or something similar. A very different model, but perhaps more reasonable than the one-hot.
Keep the one-hot target and loss, but have y = target*w, where w is a constant matrix, mostly zeros, with 1s on the diagonal and smaller values on the neighboring diagonals (e.g. y(i) = target(i) * 1. + target(i-1) * .5 + target(i+1) * .5 + ...); kind of gross, but it should converge to something reasonable.
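The second option can be sketched as follows in plain Python. The function name `soft_target` and the final normalization step are assumptions added here (the answer does not normalize); normalizing keeps the target a probability distribution, which is what a softmax cross-entropy loss usually expects.

```python
def soft_target(true_index, n_buckets=50, neighbor_weight=0.5):
    """Build a smoothed target: 1 on the true bucket, neighbor_weight on the
    adjacent buckets, 0 elsewhere, then rescale so the entries sum to 1.
    The rescaling is an assumption, not part of the original suggestion."""
    target = [0.0] * n_buckets
    target[true_index] = 1.0
    for j in (true_index - 1, true_index + 1):
        if 0 <= j < n_buckets:  # buckets at the edges have only one neighbor
            target[j] = neighbor_weight
    total = sum(target)
    return [t / total for t in target]

y = soft_target(25)
print(y[24], y[25], y[26])
```

With such a target, probability mass placed on bucket 26 is rewarded while mass on bucket 48 is not, which is the behavior the question asks for.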

Related

How to model scalar values with a neural network if besides direction the magnitude matters too

Say you want to predict temperature changes based on some input data. Temperature changes are positive or negative scalars with a mean of zero. If only the direction matters one could just use tanh as an activation function in the output layer. But say for delta-temperatures predicting the magnitude of the change is also important, not just the sign.
How would you model this output? Tanh doesn't seem to be a good choice because it gives values between -1 and 1. And say temperature changes have a Gaussian, or some other weird distribution, so hovering around the central quasi-linear domain of tanh(+-0) would be difficult for a neural network to learn. I'm worried that the sign would be good but the magnitude output would be useless.
How about having the network output one-hot vectors of length N, treat the argmax of this output vector as a temperature change on a pre-defined window. Say the window is -30 - +30 degrees, using N=60 long one-hot vector, if argmax( output )=45 that means the prediction is about 15 degrees.
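The index-to-temperature mapping described above can be written out as a small sketch. The function name `bucket_to_range` is hypothetical, and the assumption is that the window is split into N equal-width buckets.

```python
def bucket_to_range(index, low=-30.0, high=30.0, n_buckets=60):
    """Map a bucket index (e.g. the argmax of the output vector) to the
    temperature interval it covers, assuming equal-width buckets."""
    width = (high - low) / n_buckets
    return low + index * width, low + (index + 1) * width

# argmax(output) = 45 with a -30..+30 window and N = 60
print(bucket_to_range(45))  # (15.0, 16.0)
```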
I was actually not sure how to search for this topic.

How do I align two signals in MATLAB [duplicate]

I want to get the offset in samples between two datasets in Matlab (getting them synced in time), a quite common issue. Therefore I use the cross correlation function xcorr or the cross covariance function xcov (both provide similar results in most cases for this purpose). With artificial data it works fine, but I struggle with "real" data, even though it should be pretty much the same. Matlab always says the offset would be zero. I'm using this simple piece of code:
[crossCorr] = xcov(b, c);
[~, peakIndex] = max(crossCorr())
offset = peakIndex - length(b)
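For readers outside MATLAB, the same idea (take the lag at which the cross-covariance peaks) can be sketched in plain Python. This is an illustration under assumptions, not a port of xcov: the function name `best_lag` and the `max_lag` search window are made up here, and the sign convention is that a positive lag means b is delayed relative to c.

```python
def best_lag(b, c, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximizes the
    cross-covariance sum over n of (b[n + lag] - mean(b)) * (c[n] - mean(c))."""
    mean_b = sum(b) / len(b)
    mean_c = sum(c) / len(c)
    b0 = [v - mean_b for v in b]
    c0 = [v - mean_c for v in c]

    def score(lag):
        total = 0.0
        for n in range(len(c0)):
            m = n + lag
            if 0 <= m < len(b0):  # only use samples where both signals overlap
                total += b0[m] * c0[n]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)

# Toy data: b is c delayed by 3 samples.
c = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]
print(best_lag(b, c, 5))  # 3
```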
I've posted a fully runnable example m-file with a downsampled data excerpt on pastebin:
Code with data on pastebin
EDIT: The downsampled excerpt seems to be not fully suitable for evaluating the effect. Here's a much larger sample with the original frequency, please use this one instead. Unfortunately it was too big for pastebin.
As the plot shows it should be no problem at all to get the offset via cross covariance. I also tried to scale the data nicer in order to avoid numerical problems, but that didn't change anything at all.
Would be great if someone could tell me my mistake.
There's nothing wrong with your method in principle, I used exactly the same approach successfully for temporally aligning different audio recordings of the same signal.
However, it appears that for your time series, correlation (or covariance) is simply not the right measure to compare shifted versions – possibly because they contain components of a time scale comparable to the total length. An alternative is to use residual variance, i.e. the variance of the difference between shifted versions. Here is a (not particularly elegant) implementation of this idea:
lags = -1000 : 1000;
v = nan(size(lags));
for i = 1 : numel(lags)
lag = lags(i);
if lag >= 0
v(i) = var(b(1 + lag : end) - c(1 : end - lag));
else
v(i) = var(b(1 : end + lag) - c(1 - lag : end));
end
end
[~, ind] = min(v);
minlag = lags(ind);
For your (longer) data set, this results in minlag = 169. Plotting residual variance over lags gives:
Your data has a minor peak around 5 and a major peak around 101.
If I knew something about my data, then I might window around an acceptable range of offsets as shown below.
Code for initial exploration:
figure; clc;
subplot(2,1,1)
plot(1:numel(b), b);
hold on
plot(1:numel(c), c, 'r');
legend('b','c')
subplot(2,1,2)
plot(crossCorr,'.b-')
hold on
plot(peakIndex,crossCorr(peakIndex),'or')
legend('crossCorr','peak')
Initial Image:
If you zoom into the first peak you can see that it is not only high around 5, but it is polynomial "enough" to allow sub-element offsets. That is convenient.
Image showing the first peak, zoomed in:
Here is what the curve-fitting tool gives as the analytic for a cubic:
Linear model Poly3:
f(x) = p1*x^3 + p2*x^2 + p3*x + p4
Coefficients (with 95% confidence bounds):
p1 = 8.515e-013 (8.214e-013, 8.816e-013)
p2 = -3.319e-011 (-3.369e-011, -3.269e-011)
p3 = 2.253e-010 (2.229e-010, 2.277e-010)
p4 = -4.226e-012 (-7.47e-012, -9.82e-013)
Goodness of fit:
SSE: 2.799e-024
R-square: 1
Adjusted R-square: 1
RMSE: 6.831e-013
You can note that the SSE fits to roundoff.
To compute the location of the peak (the root of the derivative near x = 4), you can use the following MATLAB code:
% Coefficients
p1 = 8.515e-013
p2 = -3.319e-011
p3 = 2.253e-010
p4 = -4.226e-012
% Linear model Poly3:
syms('x')
f = p1*x^3 + p2*x^2 + p3*x + p4
% Find where the derivative of f is zero, starting the search at 4
xz1 = fzero(@(y) double(subs(diff(f), x, y)), 4)
and you get the analytic root at 4.01420240431444.
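Because the fit is a cubic, its derivative is a quadratic, so the same root can also be obtained in closed form. A minimal sketch in plain Python, using the fitted coefficients listed above (the function name `cubic_stationary_points` is made up for illustration, and p1 is assumed to be nonzero):

```python
import math

def cubic_stationary_points(p1, p2, p3):
    """Roots of f'(x) = 3*p1*x^2 + 2*p2*x + p3 for f(x) = p1*x^3 + p2*x^2 + p3*x + p4.
    Assumes p1 != 0. Returns the real roots in ascending order."""
    a, b, c = 3.0 * p1, 2.0 * p2, p3
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []  # no real stationary point
    r = math.sqrt(disc)
    return sorted([(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)])

p1, p2, p3 = 8.515e-13, -3.319e-11, 2.253e-10
roots = cubic_stationary_points(p1, p2, p3)
print(roots)  # the smaller root is close to 4.0142

# The second derivative 6*p1*x + 2*p2 is negative at the smaller root,
# so that root is the local maximum (the peak).
print(6.0 * p1 * roots[0] + 2.0 * p2 < 0)
```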
EDIT:
Hmmm. How about fitting a gaussian mixture model to the convolution? You sweep through a good range of component count, you do between 10 and 30 repeats, and you find which component count has the best/lowest BIC. So you fit a gmdistribution to the lower subplot of the first figure, then test the covariance at the means of the components in decreasing order.
I would try the offset at the means, and just look at sum squared error. I would then pick the offset that has the lowest error.
Procedure:
compute cross correlation
fit cross correlation to Gaussian Mixture model
sweep a reasonable range of components (start with 1-10)
use a reasonable number of repeats (10 to 30 depending on run-to-run variation)
compute Bayes Information Criterion (BIC) for each level, pick the lowest because it indicates a reasonable balance of error and parameter count
each component is going to have a mean, evaluate that mean as a candidate offset and compute sum-squared error (sse) when you offset like that.
pick the offset of the component that gives best SSE
Let me know how well that works.
If the two signals are misaligned by a non-integer number of samples, e.g. 3.7 samples, then the xcorr method may find the max value at 4 samples; it won't be able to find the accurate time shift. In this case, you should try a method called "unified change detection". The web-link for the paper is:
[http://www.phmsociety.org/node/1404/]
Good Luck.

How to make sense (handle) when computes logarithm of zero in prior information

I am working on image classification. I am using a piece of information called the prior probability (from Bayes' rule). It lies in the range [0,1], and I need to compute its logarithm. However, as you know, the logarithm of zero is -Inf.
For example, given a pixel x in an image I (size 3 by 3) with a cost function such as
Cost(x)=30+log(prior(x))
where prior is a 3 by 3 matrix
prior=[ 0 0 0.5;
1 1 0.2;
0.4 0 0]
I =[ 1 2 3;
4 5 6;
7 8 9]
I want to compute the cost of x=1, then
cost(x=1)=30+log(0)
Now, log(0) is -Inf, so the resulting cost(x=1) is also -Inf. My assumption is that prior=0 means the given pixel belongs to the background, and prior=1 means the given pixel belongs to the foreground.
My question is how to compute log(prior) in a way that satisfies my assumption.
I am using Matlab to do it. I think that log(0) should become a very negative value, so I just set it to -9, as in my code:
%% Handle with log(0)
prior(prior==0.0) = NaN;
%% Compute log
log_prior=log(prior);
%% Assume that e^-9 very near 0.
log_prior(isnan(log_prior)) = -9;
UPDATE: To make clear what I am doing, let us look at Bayes' rule. My task is to decide whether a given pixel x belongs to the background (BG) or the foreground (FG). That depends on the probability
P(x∈BG|x)=P(x|x∈BG)P(x∈BG)/P(x)
in which P(x|x∈BG) is the likelihood function, assumed to be approximated by a Gaussian distribution, P(x∈BG) is the prior term, and P(x) can be ignored because it is constant.
Using maximum a posteriori (MAP) estimation, we can map the above equation into log space (to get rid of the exponential in the Gaussian function):
Cost(x)=log(P(x∈BG|x))=log(P(x|x∈BG))+log(P(x∈BG))
To keep it simple, assume log(P(x|x∈BG))=30 and write log(P(x∈BG)) as log(prior); then my cost function can be rewritten as
Cost(x)=30+log(prior(x))
Now the problem is that prior lies within [0,1], so its logarithm is -Inf wherever prior is 0. As chepner said, we can add an eps value:
log(prior+eps)
However, log(eps) is a very large negative number. It will affect my cost function (which also becomes a very large negative number), and then the first term in my cost function (30) no longer matters. Based on my assumption, prior(x)=0 means the pixel x is BG and prior(x)=1 means it is FG. How should I handle log(prior) when I compute my cost function?
The correct thing to do, before fiddling with Matlab, is to try to understand your problem. Ask yourself "what does it mean for the prior probability to vanish?". The answer is given by Bayes theorem, one form of which is:
posterior = likelihood * prior / normalization
So places where the prior is nil are, by definition, places where you are certain that your events (the things whose probabilities you are computing) cannot happen, regardless of their apparent likelihood (i.e. "cost"). So they are not interesting for you. You just recognize that and skip them.
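One way to read "recognize that and skip them" is sketched below in plain Python, using the 3 by 3 prior from the question and the placeholder likelihood term of 30. The names `log_prior`, `cost`, and `feasible` are made up for illustration. Entries with a zero prior keep a cost of -inf and are flagged so that later steps can leave them out, instead of substituting an arbitrary constant such as -9.

```python
import math

def log_prior(p):
    """log(p) for p > 0; -inf for p == 0 (math.log(0) would raise ValueError)."""
    return math.log(p) if p > 0 else -math.inf

prior = [[0.0, 0.0, 0.5],
         [1.0, 1.0, 0.2],
         [0.4, 0.0, 0.0]]

likelihood_term = 30.0  # placeholder value taken from the question

cost = [[likelihood_term + log_prior(p) for p in row] for row in prior]

# Mask of pixels that are still candidates; zero-prior pixels are skipped.
feasible = [[math.isfinite(v) for v in row] for row in cost]

print(cost[0])
print(feasible[0])
```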

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower 1/4th of the 'S' shaped plot that is typical with logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, that means since your input data is one dimensional, that there is some value of x such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form (written here with a weight w and an offset b):
y = 1 / (1 + exp(-(w*x + b)))
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
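A quick way to check whether one-dimensional data is linearly separable in the sense described above is sketched here in plain Python. The function name `is_separable_1d` and the toy data are made up for illustration.

```python
def is_separable_1d(xs, ys):
    """True if some threshold puts every y == 0 point on one side
    and every y == 1 point on the other."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        return True  # only one class present
    return max(zeros) < min(ones) or max(ones) < min(zeros)

xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
print(is_separable_1d(xs, ys))  # True: any threshold between 0.3 and 0.7 works

# Flipping one label, as suggested above, breaks the separability.
ys_flipped = [0, 1, 0, 1, 1, 1]
print(is_separable_1d(xs, ys_flipped))  # False
```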

Normalization in neural network with (x, y) output

I built a backpropagation neural network to learn from a dataset that consists of 7 continuous inputs and 2 outputs (x, y coordinates). My implementation choice was to use one hidden layer with 7 neurons, but I did it in such a way that I can try different combinations of hidden layers with variable number of hidden nodes.
The error measurement is the usual mean squared error, calculated as follows:
MSE(x,y) = 1/N * sum((X - x)^2 + (Y - y)^2)
where X and Y are the target values, x and y the predictions. I also have to compute an accuracy measure which is the mean euclidean distance of each point from the target points, that's basically the same as the MSE, but the values inside sum get square-rooted.
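The two measures as defined above can be written out as a short sketch in plain Python. The function names and the toy points are made up for illustration.

```python
import math

def mse(targets, preds):
    """1/N * sum((X - x)^2 + (Y - y)^2) over all points."""
    return sum((X - x) ** 2 + (Y - y) ** 2
               for (X, Y), (x, y) in zip(targets, preds)) / len(targets)

def mean_euclidean_distance(targets, preds):
    """Same sum, but each term is square-rooted before averaging."""
    return sum(math.hypot(X - x, Y - y)
               for (X, Y), (x, y) in zip(targets, preds)) / len(targets)

targets = [(0.0, 0.0), (1.0, 1.0)]
preds = [(3.0, 4.0), (1.0, 1.0)]
print(mse(targets, preds))                      # 12.5
print(mean_euclidean_distance(targets, preds))  # 2.5
```

Note that both measures add the x and y errors on their raw scales, so the output with the larger range contributes more to the total.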
The input ranges are all between the interval [-2, +2], plus some outliers.
The output coordinates have completely unrelated distributions (x is normally distributed while y is uniformly distributed). The x range is small (say -1, +1 from the mean) while the y range varies more (say -10, +10 from the mean).
The behavior I get is that the net seems to predict the y output quite well, while the x "flattens" to y. I.e., the x values get closer to the y values, and the network doesn't adapt to predict x correctly.
My initial choice was to scale both inputs/outputs as a whole to the usual (0,1) interval but that didn't lead to good results. So I then chose to standardize each feature separately with their z-score, and scale the outputs in the (0,1) interval (I am using the sigmoid activation function so (0,1) seemed about right). But then this strange behavior appeared.
So my questions are, how would you normalize such inputs/outputs? Is there a way to deal with such uncorrelated outputs? I had even thought about using two separate networks to predict one single output discarding the other, is that a good choice?
Could you also point me to some reading where output normalization is discussed? The literature talks a lot about normalizing the inputs, but no one seems to care about the outputs.