Neural Network with softmax activation - matlab

edit:
A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each classification (there are 5 different classifications). Naturally, the sum of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different types of implementations
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to find local maximums in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain in that I have this correct for sigmoid. I am less certain with softmax (or whether I can use gradient descent at all), after a bit of researching, I couldn't find the answer and decided to compute the derivative myself and obtained softmax'(x) = softmax(x) - softmax(x)^2 (this returns an column vector of size n). I have also looked into the MATLAB NN toolkit, the derivative of softmax provided by the toolkit returned a square matrix of size nxn, where the diagonal coincides with the softmax'(x) that I calculated by hand; and I am not sure how to interpret the output matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient of descent is incorrectly done, but I have no idea how to fix this.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
http://archive.ics.uci.edu/ml/datasets/Heart+Disease

The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
(- \sum^N_{n=1}\sum^K_{k=1} t_{kn} log(y_{kn}))
where log is the natural logarithm, N depicts the number of training examples and K the number of classes (and thus units in the output layer). t_kn depicts the binary coding (0 or 1) of the k'th class in the n'th training example. y_kn the corresponding network output.
Showing that the gradient is correct might be a good exercise, I haven't done it myself, though.
To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:
(f'(x) = \frac{f(x - \epsilon) - f(x + \epsilon)}{2\epsilon} + O(\epsilon^2))

please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program.
For the Windows version, /mydesire/neural folder has several softmax classifiers, some with softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simplemcharacter-recognition task.
ASee also
Korn, G.A.: Advanced dynamic-system Simulation, Wiley 2007
GAK

look at the link:
http://www.youtube.com/watch?v=UOt3M5IuD5s
the softmax derivative is: dyi/dzi= yi * (1.0 - yi);

Related

Linear vs nonlinear neural network?

I'm new to machine learning and neural networks. I know how to build a nonlinear classification model, but my current problem has a continuous output. I've been searching for information on neural network regression, but all I encounter is information on linear regression - nothing about nonlinear cases. Which is odd, because why would someone use neural networks to solve a simple linear regression anyway? Isn't that like killing a fly with a nuclear bomb?
So my question is this: what makes a neural network nonlinear? (Hidden layers? Nonlinear activation function?) Or do I have a completely wrong understanding of the word "linear" - can a linear regression NN accurately model datasets that are more complex than y=aX+b? Is the word "linear" used just as the opposite of "logistic"?
(I'm planning to use TensorFlow, but the TensorFlow Linear Model Tutorial uses a binary classification problem as an example, so that doesn't help me either.)
For starters, a neural network can model any function (not just linear functions) Have a look at this - http://neuralnetworksanddeeplearning.com/chap4.html.
A Neural Network has got non linear activation layers which is what gives the Neural Network a non linear element.
The function for relating the input and the output is decided by the neural network and the amount of training it gets. If you supply two variables having a linear relationship, then your network will learn this as long as you don't overfit. Similarly, a complex enough neural network can learn any function.
WARNING: I do not advocate the use of linear activation functions only, especially in simple feed forward architectures.
Okay, I think I need to take some time and rewrite this answer explicitly because many people are misinterpreting the point I am trying to make.
First let me point out that we can talk about linearity in parameters or linearity in the variables.
The activation function is NOT necessarily what makes a neural network non-linear (technically speaking).
For example, notice that the following regression predicted values are considered linear predictions, despite non-linear transformations of the inputs because the output constitutes a linear combination of the parameters (although this model is non-linear in its variables):
Now for simplicity, let us consider a single neuron, single layer neural network:
If the transfer function is linear then:
As you have already probably noticed, this is a linear regression. Even if we were to add multiple inputs and neurons, each with a linear activation function, we would now only have an ensemble of regressions (all linear in their parameters and therefore this simple neural network is linear):
Now going back to (3), let's add two layers, so that we have a neural network with 3 layers, one neuron each (both with linear activation functions):
(first layer)
(second layer)
Now notice:
Reduces to:
Where and
Which means that our two layered network (each with a single neuron) is not linear in its parameters despite every activation function in the network being linear; however, it is still linear in the variables. Thus, once training has finished the model will be linear in both variables and parameters. Both of these are important because you cannot replicate this simple two layered network with a single regression and still capture all the effects of the model. Further, let me state clearly: if you use a model with multiple layers there is no guarantee that the output will be non-linear in it's variables (if you use a simple MLP perceptron and line activation functions your picture is still going to be a line).
That being said, let's take a look at the following statement from #Pawu regarding this answer:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear transformations, which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
While you could argue that what #Pawu is saying is technically true, I think they are implying:
The answer is very misleading and makes it sound, that we can learn non-linear relationships using only linear activation functions, which is simply not true.
I would argue that this modified statement is wrong and can easily be demonstrated incorrect. There is an implicit assumption being made about the architecture of the model. It is true that if you restrict yourself to using certain network architectures that you cannot introduce non-linearities without activation functions, but that is a arbitrary restriction and does not generalize to all network models.
Let me make this concrete. First take a simple xor problem. This is a basic classification problem where you are attempting to establish a boundary between data points in a configuration like so:
The kicker about this problem is that it is not linearly separable, meaning no single straight line will be able to perfectly classify. Now if you read anywhere on the internet I am sure they will say that this problem cannot be solved using only linear activation functions using a neural network (notice nothing is said about the architecture). This statement is only true in an extremely limited context and wrong generally.
Allow me to demonstrate. Below is a very simple hand written neural network. This network takes randomly generated weights between -1 and 1, an "xor_network" function which defines the architecture (notice no sigmoid, hardlims, etc. only linear transformations of the form mX or MX + B), and trains using standard backward propagation:
#%% Packages
import numpy as np
#%% Data
data = np.array([[0, 0, 0],[0, 1, 1],[1, 0, 1],[1, 1, 0]])
np.random.shuffle(data)
train_data = data[:,:2]
target_data = data[:,2]
#%% XOR architecture
class XOR_class():
def __init__(self, train_data, target_data, alpha=.1, epochs=10000):
self.train_data = train_data
self.target_data = target_data
self.alpha = alpha
self.epochs = epochs
#Random weights
self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b0 = np.random.uniform(low=-1, high=1, size=(1))
self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b2 = np.random.uniform(low=-1, high=1, size=(1))
#xor network (linear transfer functions only)
def xor_network(self, X0):
n0 = np.dot(X0, self.W0) + self.b0
X1 = n0*X0
a = np.dot(X1, self.W2) + self.b2
return(a, X1)
#Training the xor network
def train(self):
for epoch in range(self.epochs):
for i in range(len(self.train_data)):
# Forward Propagation:
X0 = self.train_data[i]
a, X1 = self.xor_network(X0)
# Backward Propagation:
e = self.target_data[i] - a
s_2 = -2*e
# Update Weights:
self.W0 = self.W0 - (self.alpha*s_2*X0)
self.b0 = self.b0 - (self.alpha*s_2)
self.W2 = self.W2 - (self.alpha*s_2*X1)
self.b2 = self.b2 - (self.alpha*s_2)
#Restart training if we get lost in the parameter space.
if np.isnan(a) or (a > 1) or (a < -1):
print('Bad initialization, reinitializing.')
self.W0 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b0 = np.random.uniform(low=-1, high=1, size=(1))
self.W2 = np.random.uniform(low=-1, high=1, size=(2)).T
self.b2 = np.random.uniform(low=-1, high=1, size=(1))
self.train()
#Predicting using the trained weights.
def predict(self, test_data):
for i in train_data:
a, X1 = self.xor_network(i)
#I cut off decimals past 12 for convienience, not necessary.
print(f'input: {i} - output: {np.round(a, 12)}')
Now let's take a look at the output:
#%% Execution
xor = XOR_class(train_data, target_data)
xor.train()
np.random.shuffle(data)
test_data = data[:,:2]
xor.predict(test_data)
input: [1 0] - output: [1.]
input: [0 0] - output: [0.]
input: [0 1] - output: [1.]
input: [1 1] - output: [0.]
And what do you know, I guess we can learn non-linear relationships using only linear activation functions and multiple layers (that's right classification with pure line activation functions, no sigmoid needed). . .
The only catch here is that I cut off all decimals past 12, but let's be honest 7.3 X 10^-16 is basically 0.
Now to be fair I am doing a little trick, where I am using the network connections to get the non-linear result, but that's the whole point I am trying to drive home: THE MAGIC OF NON-LINEARITY FOR NEURAL NETWORKS IS IN THE LAYERS, NOT JUST THE ACTIVATION FUNCTIONS.
Thus the answer to your question, "what makes a neural network non-linear" is: non-linearity in the parameters or, obviously, non-linearity in the variables.
This non-linearity in the parameters/variables comes about two ways: 1) having more than one layer with neurons in your network (as exhibited above), or 2) having activation functions that result in weight non-linearities.
For an example on non-linearity coming about through activation functions, suppose our input space, weights, and biases are all constrained such that they are all strictly positive (for simplicity). Now using (2) (single layer, single neuron) and the activation function , we have the following:
Which Reduces to:
Where , , and
Now, ignoring what issues this neural network has, it should be clear, that at the very least, it is non-linear in the parameters and variables and that non-linearity has been introduced solely by choice of the activation function.
Finally, yes neural networks can model complex data structures that cannot be modeled by using linear models (see xor example above).
EDIT:
As pointed out by #hH1sG0n3, non-linearity in the parameters does not follow directly from many common activation functions (e.g. sigmoid). This is not to say that common activation functions do not make neural networks nonlinear (because they are non-linear in the variables), but that the non-linearity introduced by them is degenerate without parameter non-linearity. For example, a single layered MLP with sigmoid activation functions will produce outputs that are non-linear in the variables in that the output is not proportional to the input, but in reality this is just an array of Generalized Linear Models. This should be especially obvious if we were to transform the targets by the appropriate link function, where now the activation functions would be linear. Now this is not to say that activation functions don't play an important role in the non-linearity of neural networks (clearly they do), but that their role is more to alter/expand the solution space. Said differently, non-linearities in the parameters (usually expressed through many layers/connections) are necessary for non-degenerate solutions that go beyond regression. When we have a model with non-linearity in the parameters we have a whole different beast than regression.
At the end of the day all I want to do with this post is point out that the "magic" of neural networks is also in the layers and to dispel the ubiquitous myth that a multilayered neural network with linear activation functions is always just a bunch of linear regressions.
When it comes to nonlinear regression, this is referring to how the weights affect the output. If a function is not linear with respect to the weights, then your problem is a nonlinear regression problem. So for example, let's look at a Feedforward Neural Network with one hidden layer where the activation functions in the hidden layer are some function and the output layer has linear activation functions. Given this, the mathematical representation can be:
where we assume can operator on scalars and vectors with this notation to make it easy. , , , and are the weight you are aiming to estimate with the regression. If this was linear regression, would equal z, because that would make y linearly dependent on & . But if is nonlinear, say like , then now y is nonlinearly dependent on the weights .
Now provided you understand all that, I am surprised you haven't seen discussion of the nonlinear case because that's pretty much all people talk about in textbooks and research. The use of things like stochastic gradient descent, Nonlinear Conjugate Gradient, RProp, and other methods are to help find local minima (and hopefully good local minima) for these nonlinear regression problems, even though a global optimum is not typically guaranteed.
Any non-linearity from the input to output makes the network non-linear. In the way we usually think about and implement neural networks, those non-linearities come from activation functions.
If we are trying to fit non-linear data and only have linear activation functions, our best approximation to the non-linear data will be linear since that's all we can compute. You can see an example of a neural network trying to fit non-linear data with only linear activation functions here.
However, if we change the linear activation function to something non-linear like ReLu, then we can see a better non-linear fitting of the data. You can see that here.
I do not have enough reputation to comment on itwasthekix post, but I want to share my insight.
Someone asked in the comments whether equation 8 was linear, and the
answer was, that if w1 were to be varied when all else is constant
we would move up and down a non-linear function. This is not true.
When we vary w1, we essentially only change the output of z1 = (w1*p + b1). Since z1 is linearly transformed later, we will still
move an a linear function. If we were to fix everything except w1
AND w2, then we would move on a non-linear function.
If a multi-layer ANN is non-linear in parameters, because we have a
multiplication of parameters. That does not mean it can learn non-linear relationships.
The answer is very misleading and makes it sound, that we can
learn non-linear relationships using only linear transformations,
which is simply not true. When we back-propagate, we take the derivative of a single weight w1 and fix everything else. Now as mentioned above, we are still moving on a linear function.
If we take the gradient of w1*w2 and perform gradient descent, we only know the joint gradient, there is no way to determine the influence of the separate parameters without fixing one of them. And if we fix one, we move on a linear function.
If we add an (non-linear) activation function, we linearly transform a non-linear output enabling us to learn non-linear relationships, since we do not move on a linear function anymore.
Lets look at the case z = w2 * g(w1 * p + b1) + b2 assuming g is a non-linear activation function. Then if we fix everything and vary w1, we will move on a non-linear function, since w1 * p + b1 is transformed by g.
Non-linearity means different things in communities of regression analysis and neural network machine learning.
In regression analysis, when we say a fitting model is nonlinear, we mean that the model is nonlinear in terms of its parameters (not in terms of the independent variables).
A multiple-layer neural network is usually nonlinear in terms of the weights even the activation function is linear. This is simple to see because the information propagating in the network corresponds to function composition: f3(f2(f1())), which generally gives nonlinear functions of weights. Therefore, in terms of regression analysis, all neural networks are nonlinear models.
However in the community of neural network, people talk about the linearity in terms of input variables, rather than the weights/biases. Therefore, they define a neutral network with linear activation functions as linear and that with nonlinear activation function as nonlinear.
I had the same struggle, most online courses use ANNs for classification, but you never actually solve a regression problem with them in the courses.
What does make an ANN non-linear? The activation function.
Even if you have an ANN with thousands of perceptrons and hidden units, if all the activations are linear (or not activated at all) you are just training a plain linear regression.
But be careful, some activations functions (like sigmoid), have a range of values that act as a linear function and you may get stuck with a linear model even with non-linear activations.
How to predict continuous output with an ANN? The same way as when you classify.
It is the same problem, you just backpropagate the error (label - prediction) and update the weights. But don't forget to CHANGE THE ACTIVATION FUNCTION of the output layer to a continuous function (maybe ReLu if all labels are positive or don't activate the output at all), the intermediate hidden layers can be activated however you wish.
For small regression problems with ANNs you may need to start with a veeeeeery small learning rate since there will be lots of variance since the error will be "unbounded" at first.
Hope this helps :)
I don't want to be impolite, but the current answers are all related to nonlinear ND-polynomials resulting from linear activation functions. That simply doesn't make sense in terms of this question.
I get the point because you will have a polynomial as the objective function to minimize with coefficients that are products of layer coefficients and a product is nonlinear. Anyway, such a system will never be able to converge and doesn't make sense at all without any extra constraints.
The described system is not only completely unnecessarily nonlinear, but also ill-posed. Don't argue about stuff that leads ad absurdum. The original question actually completely nailed it.
Build a "linear neural network" with layers and try to use it as usual... then you will realise that this goes nowhere and you wasted your time.
So unless there is good reasons to believe this kind of ill-posed stuff has been handled I would never ever consider using a linear activation function. If you have extra constraints this might make sense. If you use stochastic gradient descent then you will at least skip some bad properties of it.
That the objective function is nonlinear in its parameters gives an impression that is wrong and bogus. And if the writer would have known about optimisation problems connected to terms with a product of coefficients he would have never written anything like this.
Any objective function can be made nonlinear. If you just replace one linear coefficient with a product of two coefficients. But that is nonsense because you can never determine those coefficients. NEVER. There are infinitely many solutions! And that doesn't even depend on the amount of data.
Because the activation is w*x, which is linear operation, so you need to have extra elements to make it non-linear.

Can't approximate simple multiplication function in neural network with 1 hidden layer

I just wanted to test how good can neural network approximate multiplication function (regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer
(I have tested 5 /30 /100 neurons per hidden layer), no normalization. And default parameters
Learning rate - 0.005, Number of learning iterations - 200, The initial learning weigh - 0.1,
The momentum - 0 [description]. I got extremely bad accuracy, close to 0.
At the same time boosted Decision forest regression shows very good approximation.
What am I doing wrong? This task should be very easy for NN.
Big multiplication function gradient forces the net probably almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Devide by constant. We are just deviding everything before the learning and multiply after.
2) Make log-normalization. It makes multiplication into addition:
m = x*y => ln(m) = ln(x) + ln(y).
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside it's range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
This is how I do multiplication with neural network:
import numpy as np
from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])
model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
model.predict([[0.8, 0.5]])
It works.
"Two approaches: divide by constant, or make log normalization"
I'm tried both approaches. Certainly, log normalization works since as you rightly point out it forces an implementation of addition. Dividing by constant -- or similarly normalizing across any range -- seems not to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropogation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find the "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely until a certain range of numbers. This is the gist link
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0

Hyper-parameters of Gaussian Processes for Regression

I know a Gaussian Process Regression model is mainly specified by its covariance matrix and the free hyper-parameters act as the 'weights'of the model. But could anyone explain what do the 2 hyper-parameters (length-scale & amplitude) in the covariance matrix represent (since they are not 'real' parameters)? I'm a little confused on the 'actual' meaning of these 2 parameters.
Thank you for your help in advance. :)
First off I would like to point out that there are infinite number of kernels that could be used in a gaussian process. One of the most common however is the RBF (also referred to as squared exponential, the expodentiated quadratic, etc). This kernel is of the following form:
The above equation is of course for the simple 1D case. Here l is the length scale and sigma is the variance parameter (note they go under different names depending on the source). Effectively the length scale controls how two points appear to be similar as it simply magnifies the distance between x and x'. The variance parameter controls how smooth the function is. These are related but not the same.
The Kernel Cookbook give a nice little description and compares RBF kernels to other commonly used kernels.

Radial Basis Function

I am trying to make a simple radial basis function network (RBFN) for regression. I have a 20 dimensional (feature) dataset with over 600 samples. I need the final network to output 1 scalar value for each 20 dimensional sample.
Note: new to machine learning...and feel like I am missing an important concept here.
With the perceptron we can, and I have, trained a linear network until the prediction error is at a minimum using a small subset of the initial samples.
Is there a similar process with the RBFN?
Yes there is,
The main two differences between a multi-layer perceptron and a RBFN are the fact that a RBFN usually implies just one layer and that the activation function is a gaussian instead of a sigmoid.
The training phase can be done using gradient descend of the error loss function, so it is relatively simple to implement.
Keep in mind that RBFN is a linear combination of RBF units, so the range of the output is limited and you would need to transform it if you need an scalar outside of that range.
There is a few of resources that you could consult as reference:
[PDF] (http://scholar.lib.vt.edu/theses/available/etd-6197-223641/unrestricted/Ch3.pdf)
[Wikipedia] (http://en.wikipedia.org/wiki/Radial_basis_function_network)
[Wolfram] (http://reference.wolfram.com/applications/neuralnetworks/NeuralNetworkTheory/2.5.2.html)
Hope it helps,

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = #(z)crossval('mcr',cdata,grp,'Predfun', ...
#(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I will appreciate if you can give me the code for this section. Thanks.
I will like to know what does it mean when it says exp([rbf_sigma,boxconstraint])
rbf_sigma: The svm is using a gaussian kernel, the rbf_sigma set the standard deviation (~size) of the kernel. To understand how kernels work, the SVM is putting the kernel around every sample (so that you have a gaussian around every sample). Then the kernels are added up (sumed) for the samples of each category/type. At each point the type which sum is higher would be the "winner". For example if type A has a higher sum of these kernels at point X, then if you have a new datum to classify in point X, it will be classified as type A. (there are other configuration parameters that may change the actual threshold where a category is selected over another)
Fig. Analyze this figure from the webpage you gave us. You can see how by adding up the gaussian kernels on the red samples "sumA", and on the green samples "sumB"; it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
boxconstraint: it is a cost/penalty over miss-classified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm is using an error function to decide how to optimize the SVM parameters in an iterative fashion. The cost for a miss-classified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure that I am attaching the boundary is the inner blue circumference.
Taking into account BGreene indications and from what I understand of the tutorial:
In the tutorial they advice to try values for rbf_sigma and boxconstraint that are exponentially apart. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly apart). They advice this to try a wide range of values first. You can further optimize later FROM the first optimum that you obtained before.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. Probably you have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum value. I suppose that when you put exp(z(1)) in the definition of #minfn, then fminsearch will take 'exponentially' big steps.
In machine learning, always try to understand that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. Then the cross validation set is used to select the optimum value of the parameters rbf_sigma and boxconstraint. And finally the test data is used to obtain an idea of the performance of your classifier (the efficiency of the classifier is determined upon the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps