I want to know how the gradient descent algorithm works in MATLAB network training and how the MSE is calculated. I have my own app, but it doesn't behave like the MATLAB neural network, and I want to know why.
My algorithm looks like this:
foreach epoch
    gradient_vector = 0 // accumulated gradient (a vector)
    rmse = 0            // accumulated squared error
    foreach sample in data set
        output = CalculateForward(sample.input)
        error = sample.target - output
        rmse += DotProduct(error, error)
        gradient_part = CalculateBackward(error)
        gradient_vector += (gradient_part / number_of_samples)
    end
    network.AddToWeights(gradient_vector * learning_rate)
    rmse = sqrt(rmse / number_of_samples)
end
Is this similar to what MATLAB does?
It appears close to what MATLAB does, but keep in mind that the toolbox is designed for a broad range of applications. Your algorithm presents each data entry to the network once per epoch and makes a single batch update at the end. MATLAB's toolbox can present the data multiple times per epoch, update the weights multiple times per epoch, and update in a number of different ways. Your exact method can be duplicated with the existing MATLAB toolbox, but it requires very specific settings, which you can find by digging through the help files for the training function you're using. Some training functions may be closer to what you're doing than others, so be discerning. Good luck!
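For instance, here is a minimal sketch of one such configuration: full-batch gradient descent with no data division, which is the closest match to the loop above. The hidden layer size and learning rate below are placeholders, not values taken from your post:
net = feedforwardnet(10, 'traingd');  % 'traingd' = batch gradient descent: one update per epoch
net.divideFcn  = 'dividetrain';       % use every sample for training (no validation/test split)
net.performFcn = 'mse';               % MATLAB's mse = mean((target - output).^2) over all outputs and samples
net.trainParam.lr = 0.01;             % learning rate (placeholder value)
net = train(net, inputs, targets);    % inputs/targets: one column per sample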
Encouraged by some success in MNIST classification, I wanted to solve a "real" problem with some neural networks.
The task seems quite easy:
We have:
some x-value (e.g. 1:1:100)
some y-values (e.g. x^2)
I want to train a network with 1 input (for 1 x-value) and one output (for 1 y-value). One hidden layer.
Here is my basic procedure:
1. Slice my x-values into batches (e.g. 10 elements per batch)
2. For each batch, calculate the outputs of the net, then apply backpropagation and calculate the weight and bias updates
3. After each batch, average the calculated weight and bias updates and actually update the weights and biases
4. Repeat steps 1-3 multiple times
This procedure worked fine for MNIST, but for the regression it totally fails.
I am wondering if I do something fundamentally wrong.
I tried different batch sizes, up to averaging over ALL x values.
Basically, the network does not train well. After manually tweaking the weights and biases (with 2 hidden neurons) I could approximate my y=f(x) quite well, but when the network is supposed to learn the parameters, it fails.
When I have just one element for x and one for y and I train the network, it trains well for this one specific pair.
Maybe somebody has a hint for me. Am I misunderstanding regression with neural networks?
So far I assume the code itself is okay, as it worked for MNIST and it works for the "one x/y pair example". I rather think my overall approach (see above) may not be suitable for regression.
Thanks,
Jim
ps: I will post some code tomorrow...
Here comes the code (MATLAB). As I said, it's one hidden layer, with two hidden neurons:
% init hyper-parameters
hidden_neurons = 2;
input_neurons  = 1;
output_neurons = 1;
learning_rate  = 0.5;
batchsize      = 50;
sigmoid = @(z) 1./(1+exp(-z)); % elementwise logistic (not defined elsewhere in the post)
% load data
training_data   = d(1:100)/100;
training_labels = v_start(1:100)/255;
% init weights
init_randomly = 1;
if init_randomly
    % initialize weights and biases with random numbers between -0.5 and +0.5
    w1 = rand(hidden_neurons,input_neurons)-0.5;
    b1 = rand(hidden_neurons,1)-0.5;
    w2 = rand(output_neurons,hidden_neurons)-0.5;
    b2 = rand(output_neurons,1)-0.5;
else
    % initialize with manually determined values
    w1 = [10;-10];
    b1 = [-3;-0.5];
    w2 = [0.2 0.2];
    b2 = 0;
end
for epochs = 1:2000 % looping over some epochs
    for i = 1:batchsize:length(training_data) % slice training data into batches
        % note the -1: without it, consecutive batches overlap by one sample
        batch_data   = training_data(i:min(i+batchsize-1,length(training_data)));   % training batch
        batch_labels = training_labels(i:min(i+batchsize-1,length(training_data))); % matching label batch
        % initialize weight updates for next batch
        w2_update = 0;
        b2_update = 0;
        w1_update = 0;
        b1_update = 0;
        for k = 1:length(batch_data) % looping over one single batch
            % extract one training sample and its expected output
            x = batch_data(k);
            y = batch_labels(k);
            % forward pass
            z1 = w1*x+b1;     % weighted sum of first layer
            a1 = sigmoid(z1); % activation of first layer (sigmoid)
            z2 = w2*a1+b2;    % weighted sum of second layer
            a2 = z2;          % activation of second layer (linear)
            % backward pass
            delta_2 = (a2-y); % delta of second layer, assuming quadratic cost; the derivative of the linear unit is 1 everywhere
            delta_1 = (w2'*delta_2) .* (a1.*(1-a1)); % delta of first layer
            % accumulate the weight and bias updates, averaging over one batch
            w2_update = w2_update + (delta_2*a1') * (1/length(batch_data));
            b2_update = b2_update + delta_2      * (1/length(batch_data));
            w1_update = w1_update + (delta_1*x') * (1/length(batch_data));
            b1_update = b1_update + delta_1      * (1/length(batch_data));
        end
        % actually update the weights; the updated weights are used in the next batch
        w2 = w2 - learning_rate * w2_update;
        b2 = b2 - learning_rate * b2_update;
        w1 = w1 - learning_rate * w1_update;
        b1 = b1 - learning_rate * b1_update;
    end
end
Here is the outcome with random initialization, showing the expected output, the output before training, and the output after training:
[figure: training with random init]
One could argue that the blue line is already closer to the target than the black one; in that sense the network has already improved the result. But I am not satisfied.
Here is the result with my manually tweaked values:
[figure: training with pre-init]
The black line is not bad for just two hidden neurons, but my expectation was that a line like this would be the outcome of training starting from random initialization.
Any suggestions what I am doing wrong?
Thanks!
Ok, after some research I found some interesting points:
The function I tried to learn seems particularly hard to learn (not sure why)
With the same setup I tried to learn some 3rd-degree polynomials, which was successful (cost < 1e-6)
Randomizing the training samples seems to improve learning (for the polynomial and my initial function). I know this is well known in the literature, but I had always skipped that part in my implementations. So I learned first-hand how important it is.
For learning "curvy/wiggly" functions, I found that sigmoid works better than ReLU. (The output layer is still "linear", as suggested for regression.)
A learning rate of 0.1 worked fine for the curve fitting I finally wanted to perform
A larger batch size smooths the cost-vs-epochs plot (surprise...)
Initializing the weights between -5 and +5 worked better than -0.5 and +0.5 for my application
In the end I got quite convincing results for what I intended to learn with the network :)
Have you tried a much smaller learning rate? Generally, learning rates around 0.001 are a good starting point; 0.5 is in most cases far too large.
Also note that your predefined weights put you in an extremely flat region of the sigmoid function (sigmoid(10) = 1, sigmoid(-10) = 0), with the derivative at both positions close to 0. That means that backpropagating from such a position (or getting to such a position) is extremely difficult; for exactly that reason, some people prefer to use ReLUs instead of sigmoid, since they only have a "dead" region for negative activations.
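To make this concrete, here is a quick check of the logistic derivative at those operating points (a generic sketch, not tied to the code in the question):
sig  = @(z) 1./(1+exp(-z));     % logistic function
dsig = @(z) sig(z).*(1-sig(z)); % its derivative
dsig([-10 0 10])                % approx. [4.5e-5 0.25 4.5e-5]: almost no gradient at +/-10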
Also, am I correct in seeing that you only have 100 training samples? You could try a smaller batch size, or increase the number of samples you take. Also, don't forget to shuffle your samples after each epoch. Plenty of reasons are given in the literature, for example here.
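As a sketch, using the variable names from your code, the reshuffle at the top of each epoch could look like this:
perm            = randperm(length(training_data)); % one fresh random permutation per epoch
training_data   = training_data(perm);             % shuffle the samples...
training_labels = training_labels(perm);           % ...and the labels with the same permutation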
I'm currently working with neural networks and I'm still a beginner. My purpose is to use an MLP to predict a flow time series (I know that NARX networks may be more suitable for time series prediction, but the requirement is an MLP).
For example, I want to predict the flow Q(t+x) from current and historical flows Q(t...t-n), precipitation P(t...t-m), etc.
The results of my net trainings (training, validation, and test of the network) and an additional validation period show relatively good quality (correlation and RMSE). But when I look closer at the outputs of the training and validation periods, there is a lag relative to the targets of the respective period. And my problem is that I don't know why.
The lag exactly corresponds to my forecast period x, no matter how large x is.
I use a standard MLP from the MATLAB toolbox with default settings (random division, trainlm, etc.), as in the graphical NN tool (but I have also tested other settings with my own code).
With a simple Q(t) NAR net the problem is the same. If I try it with regular data, like predicting sin(t+x) from sin(t...t-n), or the same with a rectangular function, there is no time shift; it's all fine.
Only if I use real-world data or irregular (but mostly constant) data like [0.12 0.14 0.13 0.1 0.1 0.1 ... (n times) 0.1 ... 0.1 0.1 0.14 0.15 0.12 ...] is there a shift between the target and the output. Although I train the network with the target Q(t+x), the actual training output is Q(t). I also tried some other input variable combinations, from less to more information. My time series spans more than 7 years at hourly resolution, but the problem also occurs at other resolutions.
Is there something wrong in my approach, or something I can try? I've read that others have this problem too, but found no solutions. I don't think it is a failure of my implementation, because I also tried the MATLAB tool and the sine function, and the outcomes are the same. And if I ignore the shift, the accuracy of the values is not bad (which is obviously why the correlation and RMSE look good).
I use MATLAB 2012.
Here is also a minimalistic code example with only the most important points, but it shows the problem very well.
%% minimalistic example
% but there is the same problem with more input variables
load Q
%% create net inputs and targets
% start point of t
t = 100;
% history data of Q -> Q(t-1), Q(t-2), Q(t-3)
inputs = [Q(t-1:end-1,1) Q(t-2:end-2,1) Q(t-3:end-3,1)]';
% timestep t that want to be predicted
targets = Q(t:end,1)';
%% create fitting net (MLP)
% but it is the same problem for NARnet
% and from here, you can also use the NN graphical tool
% number of hidden neurons
numHiddenNeurons = 6; % the described problem does not depend on this
% value, therefore it is freely chosen
net = fitnet(numHiddenNeurons); % same problem if choosing the old version newfit
% default MLP settings, no changes, but the problem even exist with other
% combinations of settings
% train net
[trained_net,tr] = train(net,inputs,targets);
% apply trained net with given data (create net outputs)
outputs = sim(trained_net,inputs);
figure(1)
hold on
bar(targets',0.6,'FaceColor','r','EdgeColor','none')
bar(outputs',0.2,'FaceColor','b','EdgeColor','none')
legend('observation','prediction')
% please zoom in very far to see single bars! the bar plot shows the
% time shift very well
% if you choose a longer forecasting horizon, the shift is even easier
% to see
%% the result: targets(1,1)=Q(t), outputs(1,1)=Q(t-1)
%% now try the sine function; the problem does not occur there
x = 1:1:1152;
SIN = sin(x);
inputs = [SIN(1,t-1:end-1);SIN(1,t-2:end-2);SIN(1,t-3:end-3)];
targets = SIN(1,t:end);
% start again from above, creating the net
I don't have enough reputation to upload two excerpts of the results of this code for one-step-ahead prediction.
Consider predicting not the absolute value of the flow but the change in flow from the previous period, using the recent changes from previous periods as inputs. As pointed out by Diphtong above, it may well be the case that the previous flow values simply are not predictive of (contain no useful information about) the next flow value.
Conceptually, this is similar to predicting the next value of a random walk. Imagine a situation where the next value of a function equals the current value plus some random number between -1.0 and +1.0. If you tried to predict the next value from the previous values, the best any function approximator/regressor could do to minimize the prediction error would be to output the current value as its prediction of the next value.
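A two-line toy illustration of that point (synthetic data, not the flow series):
rw = cumsum(2*rand(1000,1) - 1);  % random walk: next value = current value + U(-1,1)
plot(rw(1:end-1), rw(2:end), '.') % x(t) vs. x(t+1) hugs the diagonal: x(t) is the best predictor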
However, in your case it could still be possible that there is some information in the previous flow values. To prevent the current value from overwhelming the error term, deny the network the use of the current value as the predictor by feeding it the differences (the derivative) of the absolute flow values. If there is no useful information in those either, it should minimize the error by always predicting 0.
In summary, try:
Inputs: change in flow at [t-1], at [t-2], ... , [t-w]
Output: change in flow at [t]
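A minimal sketch of that input construction, reusing the variable names from the question's code (Q loaded as before):
dQ = diff(Q(:,1));                                      % change in flow: dQ(k) = Q(k+1,1) - Q(k,1)
t  = 100;                                               % same start point as in the question
inputs  = [dQ(t-1:end-1) dQ(t-2:end-2) dQ(t-3:end-3)]'; % recent changes as inputs
targets = dQ(t:end)';                                   % next change as the target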
This "time-shift" you are observing is exactly what #Diphtong mentions: your neural-network cannot resolve the relationship between the inputs and the output, so it bahaves like a "naive predictor" (look it up) where (in the financial stock market world) the best prediction for tomorrow's stock price is today's price.
Differencing may help, but I've seen deltas of the input time series, LOG(), and SQRT() all perform the same...
I have a time series of N data points of sunspots and would like to predict based on a subset of these points the remaining points in the series and then compare the correctness.
I'm just getting introduced to linear prediction in MATLAB, so I decided to use the following code segment within a loop, so that every point beyond the training set, up to the end of the given data, gets a prediction:
% x is the data; the training set is some subset of x starting from the beginning.
% 'unknown' is the number of points to extend the prediction over, starting from
% the end of the training set (i.e. the difference in length between the training
% set and the data vectors). x_pred is initialized to x.
p = length(training_set);
coeffs = lpc(training_set, p);
for i = 1:unknown
    % predict the next point from the p most recent (partly predicted) points
    nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
    x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) Is there a more efficient, single-line function in MATLAB that I can call to replace the loop and compute all the necessary predictions for me, given some subset of my overall data as a training set?
I tried looking through the MATLAB lpc tutorial, but it didn't seem to do the kind of prediction my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.
So after much deliberation and experimentation, I have found the above approach to be correct, and there does not appear to be any single MATLAB function to do this work. The large errors are reasonable, since I am applying a linear prediction algorithm to a problem (sunspot prediction) with inherently nonlinear behavior.
Hope this helps anyone else out there working on something similar.
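One caveat worth adding: if you only need one-step-ahead predictions (each point predicted from actual past samples rather than recursively from earlier predictions), that case does reduce to a single line, as in the lpc documentation:
est_x = filter([0 -coeffs(2:end)], 1, x); % one-step-ahead prediction of every sample of x
err   = x - est_x;                        % one-step prediction errors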
We recently studied the Naïve Bayesian Classifier in our Machine Learning class, and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read several literature resources that recommended using a Gaussian approximation to compute the probabilities of test data values, so I'm going with that in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted from the code. Though I think adding normalization should still yield proportionate results, and so far my attempts at normalizing have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi = @(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/(2*sig2)); % note: 2*sig2 must be parenthesized; writing /2*sig2 would multiply by sig2 instead of dividing
%***********************************************************%
for c = 1:3 % for the 3 classes in the training set
    clear y x mu sig2;
    index = 1;
    for i = 1:length(trainingSample)
        if trainingSample(i,5) == c
            y(index,:) = trainingSample(i,:); % filter current class samples
            index = index+1;                  % for conditional probabilities
        end
    end
    for j = 1:size(testData,1) % iterate over test samples
        clear pf p;
        for i = 1:4 % iterate over the 4 attribute columns
            x = testData(j,i);
            mu = mean(y(:,i));
            sig2 = var(y(:,i));
            pf(i) = Phi(mu,sig2,x); % conditional probability of attribute i given class c
        end
        % calc class score (proportional to the posterior): prior * likelihood
        %*****************************************************%
        pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
        %*****************************************************%
    end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i = 1:size(pc,1)
    [~,q] = max(pc(i,:));
    predicted(i) = q;
    actual(i) = testData(i,5);
end
Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) ∝ p(class) p(thing|class)
             = p(class) p(feature_1|class) p(feature_2|class) ... p(feature_N|class)
So when the parameters of each feature_i|class distribution are fitted, rescaling the features just rescales the fitted parameters (mu, sigma2) to the new "scale"; the relative class scores, and hence the predicted class, remain the same.
The MATLAB code is hard to read due to a lot of indexing and splitting of training/testing data etc., which is a possible source of problems.
You should try something with a lot less unnecessary stuff around it (I would recommend Python with scikit-learn, for example; it has a lot of helpers for splitting data and such: http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
The next step is to check the parameters, which is easiest done by either printing them out (sanity check) or,
for each feature, rendering the fitted Gaussian bell next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_range/total_number_of_samples).
Visualising the data and the model is really important to know what is happening.
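A minimal MATLAB sketch of that check, assuming y holds the current class's training samples as in the question's code and i is the feature column:
x    = y(:,i);                                    % feature i for the current class
mu   = mean(x);  sig2 = var(x);                   % fitted Gaussian parameters
histogram(x, 10, 'Normalization', 'pdf'); hold on % empirical density of the data
xs   = linspace(min(x), max(x), 200);
plot(xs, exp(-(xs-mu).^2/(2*sig2))/sqrt(2*pi*sig2), 'r', 'LineWidth', 2) % fitted Gaussian pdf
hold off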
I'm very much a novice at signal processing techniques, but I am trying to apply the fast Fourier transform to a daily time series to remove the seasonality present in the data. The example I am working with is from here:
http://www.mathworks.com/help/signal/ug/frequency-domain-linear-regression.html
While I understand how to implement the code as it is written in the example, I am having a hard time adapting it to my specific application. What I am trying to do is create a preprocessing function that deseasonalizes the training data using code similar to the example above. Then, using the same coefficients estimated from the in-sample data, I want to deseasonalize the out-of-sample data to preserve its independence from the in-sample data. Basically, once the coefficients are estimated, I will normalize each new data point using those same coefficients. I suspect this is akin to estimating a linear trend, removing it from the in-sample data, and then using the same linear model on unseen data to detrend it in the same manner.
Obviously, when I estimate the Fourier coefficients, the vector I get out is equal in length to the in-sample data. The out-of-sample data comprises far fewer observations, so applying the coefficients directly is impossible.
Is this sort of analysis possible with this technique, or am I going down a dead-end road? How should I approach it using the code in the example above?
What you want to do is certainly possible, and you are on the right track, but you seem to misunderstand a few points in the example. First, the example shows that the technique is equivalent to linear regression in the time domain; it exploits the FFT to perform, in the frequency domain, an operation with the same effect. Second, the trend that is removed is not linear: it is a sum of sinusoids, which is why the FFT is used to identify the relevant frequency components in a relatively tidy way.
In your case it seems you are interested in the residuals. The initial approach is therefore to proceed as in the example as follows:
(1) Perform a rough "detrending" by removing the DC component (the mean of the time-domain data)
(2) FFT and inspect the data, choose frequency channels that contain most of the signal.
You can then use those channels to generate a trend in the time domain and subtract that from the original data to obtain the residuals. You need not proceed by using IFFT, however. Instead you can explicitly sum over the cosine and sine components. You do this in a way similar to the last step of the example, which explains how to find the amplitudes via time-domain regression, but substituting the amplitudes obtained from the FFT.
The following code shows how you can do this:
tim = (time - time0)/timestep; % <-- acquisition times for your *new* data, normalized
NFpick = [2 7 13]; % <-- channels you picked to build the detrending baseline
% Compute the trend
mu = mean(ts);
tsdft = fft(ts-mu);
Nchannels = length(ts); % <-- size of time domain data
Mpick = 2*length(NFpick);
X(:,1:2:Mpick) = cos(2*pi*(NFpick-1)'/Nchannels*tim)';
X(:,2:2:Mpick) = sin(-2*pi*(NFpick-1)'/Nchannels*tim)';
% Generate the coefficient vector "bet" containing scaled amplitudes from the spectrum
bet = 2*tsdft(NFpick)/Nchannels;                         % complex amplitudes of the picked channels
bet = reshape([real(bet) imag(bet)].', numel(bet)*2, 1); % interleave cos (real) and sin (imag) coefficients
trend = X*bet + mu;
To remove the trend just do
detrended = dat - trend;
where dat is your new data acquired at times tim. Make sure you define the time origin consistently. In addition this assumes the data is real (not complex), as in the example linked to. You'll have to examine the code to make it work for complex data.
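As a sanity check before touching new data, you can rebuild the trend on the in-sample grid itself and inspect the residuals; this sketch reuses the variable names from the code above and assumes ts is real:
tim_in = (0:Nchannels-1)';                                % in-sample times on the FFT grid
Xin(:,1:2:Mpick) = cos(2*pi*tim_in*(NFpick-1)/Nchannels); % same basis construction as X above
Xin(:,2:2:Mpick) = sin(-2*pi*tim_in*(NFpick-1)/Nchannels);
resid = ts(:) - (Xin*bet + mu);                           % deseasonalized in-sample residuals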