I have just gone through some online open course lectures by Andrew Ng on Coursera. At the end of the lectures on Neural Networks, he explained regularization, but I am afraid I missed something. With regularization, the value of the cost function is calculated as follows:
J(theta) = -1/m * jValMain + lambda/(2*m) * jValReg

jValMain is a sum over the targets y and the outputs of the NN. The second component, jValReg, applies the regularization and looks something like this (the lambda/(2*m) factor is the one already written in J above):

jValReg = sum( sum( sum( Theta(l)(i)(j)^2 ) ) )
Theta is the set of weights, m is the number of elements/cases in the database, and then there is lambda. What is lambda? Is it a scalar, a vector, or a matrix? How do we apply regularization via lambda? Does lambda regulate a particular j-th, i-th weight in the l-th layer, or does it regulate all weights with one single number? It somehow confuses me. If anyone is familiar with this concept, I will be grateful for any help.
Cheers!
lambda is the regularization parameter in your estimation. Think of it as a means to control the bias in your estimate. It is a scalar and is often used to prevent overfitting of the data. Here are a few lines taken from the notes of the Coursera assignments:
... the value of lambda can significantly affect the results of regularized polynomial regression on the training and cross validation set. In particular, a model without regularization (lambda = 0) fits the training set well, but does not generalize. Conversely, a model with too much regularization (lambda = 100) does not fit the training set and testing set well. A good choice of lambda (e.g., lambda = 1) can provide a good fit to the data.
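To make the scalar nature of lambda concrete, here is a minimal Python sketch (the weight matrices Theta1 and Theta2 and all the numbers are made up) of how one single lambda scales the sum of all squared weights at once; bias columns are conventionally excluded from the penalty:

import numpy as np

# hypothetical weights of a 3-layer network; first column of each = bias terms
Theta1 = np.random.randn(5, 4)   # hidden <- input
Theta2 = np.random.randn(3, 6)   # output <- hidden
m, lam = 100, 1.0                # number of training examples, scalar lambda

# sum of ALL squared (non-bias) weights, scaled by the single scalar lambda
jValReg = np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2)
penalty = lam / (2 * m) * jValReg   # one number added to the cost J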
I am taking an Econometrics course and have been trying to use Python rather than the proprietary STATA and EVIEWS they set the assignments in.
In one of the questions, I have consumption data over time. I am asked to compute it in two ways.
The first way is to estimate a model of the form consumption = A*exp(B*t), and the second way is to take logs of both sides and run OLS on log(consumption) = alpha + B*t.
I know how to do the second way. However, when I try the first way it goes wrong. Using statsmodels, I can exponentiate the time data (after normalising), but that estimates a regression of the form consumption = A*exp(t) + B, which is not what I want (I want to specify where the parameters go). In sklearn I could find polynomial regression, but not exponential.
Then I found scipy.optimize.curve_fit.
However this seems to have two problems:
(1) It seems to rely on initial guesses for the parameters, which means my output will end up being different from the proprietary software (whereas output for things like OLS is the same). I assume the initial guesses mean an iterative solution is used, which is helpful for very weird and wonderful functions, but I would expect fairly standard results to hold for exponential regression.
(2) Every time I try to implement it, it just returns the guess parameters.
Here is my code (imports included):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

consumption_data = pd.read_csv(r"...\consumption.csv")

def func(x, a, b):
    return a * np.exp(b * x)

xdata = consumption_data.YEAR
ydata = consumption_data.CONSUMPTION
# normalise the time data so that exp() does not overflow
xdata = (xdata - 1948) / 100

popt, pcov = curve_fit(func, xdata, ydata, p0=(1, 1))
print(popt)

plt.plot(xdata, func(xdata, *popt), 'g--')
plt.show()
The scipy.optimize code is basically just copy-pasted from their tutorial
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
Short answer: use statsmodels GLM.
statsmodels does not have nonlinear least squares. The best Python library for that is lmfit: https://pypi.org/project/lmfit/
curve_fit, lmfit, and nonlinear least squares algorithms in general find an iterative solution to the optimization problem. Even though we have to provide starting values, the solution is in many cases the same across packages, up to the convergence tolerance, e.g. 1e-5 or 1e-6.

Many standard models in statistics and econometrics have a single global optimum with well-behaved data. However, in other cases, like mixture models, there might be many local optima, and the estimation might converge to one of them.
To the specific case:
consumption = A exp(B t)
can be rewritten as
consumption = exp(a + B t), where a = log(A)

So this is just a single-index model, or a generalized linear model with an exponential mean function.
The general version has the expectation of the dependent variable as a nonlinear function of a linear combination of the explanatory variables:
E(y | x) = g(x b)
This can be estimated with statsmodels' GLM, using the Gaussian family and the log link.
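A minimal sketch of that estimation (the data here are simulated stand-ins, not the asker's consumption series):

import numpy as np
import statsmodels.api as sm

# simulated stand-in for the consumption series: E(y | t) = exp(a + B*t)
t = np.linspace(0.0, 0.7, 50)
y = np.exp(1.0 + 2.0 * t) + np.random.normal(scale=0.05, size=t.size)

X = sm.add_constant(t)  # design matrix with an intercept column
# recent statsmodels spells the link links.Log(); older versions use links.log()
model = sm.GLM(y, X, family=sm.families.Gaussian(link=sm.families.links.Log()))
res = model.fit()
print(res.params)  # [a, B]; the original A is exp(a)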
Aside: In econometrics, there is a literature on using Poisson quasi-likelihood as an estimator for exp models instead of taking the log of the dependent variable.
Poisson usually uses the log-link function as in the above.
However, using GLM allows us to use log-link, i.e. exponential mean function, with any of the supported distribution families. The main difference is in the underlying variance assumption. Gaussian assumes constant variance, Poisson assumes that the variance is proportional to the mean and Gamma assumes that the variance is quadratic in the mean.
If we use a robust sandwich covariance estimator for parameter inference, then standard errors and inference are correct even if the variance function is misspecified.
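Continuing the sketch above with the same simulated y and X, the Poisson quasi-likelihood version with a robust sandwich covariance might look like this:

# same exponential mean function, Poisson variance assumption,
# robust (HC0 sandwich) standard errors for the parameter inference
res_pois = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")
print(res_pois.params, res_pois.bse)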
I’m writing a neural network, but I have trouble training it using backpropagation, so I suspect there is a bug/mathematical mistake somewhere in my code. I’ve spent hours reading different literature on how the equations of backpropagation should look, but I’m a bit confused by it, since different books say different things, or at least use wildly confusing and contradictory notation. So, I was hoping that someone who knows with 100% certainty how it works could clear it up for me.
There are two steps in the backpropagation that confuse me. Let’s assume for simplicity that I only have a three-layer feed-forward net, so we have connections between input-hidden and hidden-output. I call the weighted sum that reaches a node z, and the same value after it has passed through the node's activation function a.
Apparently I’m not allowed to embed an image with the equations that my question concern so I will have to link it like this: https://i.stack.imgur.com/CvyyK.gif
Now, during backpropagation, when calculating the error in the nodes of the output layer, is it:

[Eq. 1] Delta_output = (output - target) * f'(a_output)

or is it

[Eq. 2] Delta_output = (output - target) * f'(z_output)

where f' is the derivative of the activation function? And during the error calculation of the nodes in the hidden layer, same thing, is it:

[Eq. 3] Delta_hidden = f'(a_h) * sum(w_h * Delta_output)

or is it

[Eq. 4] Delta_hidden = f'(z_h) * sum(w_h * Delta_output)
So the question is basically: when running a node's value through the derivative of the activation function during backpropagation, should the value be the one from before or from after it passed the activation function (z or a)?
Is the first or the second equation in the image correct and similarly is the third or fourth equation in the image correct?
Thanks.
You have to compute the derivative with the value before it has passed through the activation function. So the answer is "z".
Some activation functions simplify the computation of the derivative, like tanh:

a = tanh(z)
d/dz tanh(z) = 1.0 - tanh(z) * tanh(z) = 1.0 - a * a

This simplification can lead to the confusion you were talking about, since here the derivative at z can be computed from a directly. Here is another activation function without possible confusion:

a = sin(z)
d/dz sin(z) = cos(z)
You can find a list of activation functions and their derivatives on Wikipedia: activation function.

Some networks don't have an activation function on the output nodes, so there the derivative is 1.0, and delta_output = output - target or delta_output = target - output, depending on whether you add or subtract the weight change.

If you are using an activation function on the output nodes, then you'll have to provide targets that are in the range of that activation function, like [-1, 1] for tanh(z).
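To make the z-versus-a point concrete, here is a small NumPy sketch (shapes and weights are made up) of the delta computations for a three-layer tanh net; the derivative at z is computed via the 1 - a*a shortcut:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # input -> hidden (hypothetical)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)    # hidden -> output (hypothetical)
target = np.array([0.5, -0.5])

# forward pass: z is the weighted sum, a is the activation of z
z_h = W1 @ x + b1
a_h = np.tanh(z_h)
z_o = W2 @ a_h + b2
a_o = np.tanh(z_o)

# backward pass: tanh'(z) = 1 - tanh(z)^2 = 1 - a*a, i.e. the derivative AT z
delta_output = (a_o - target) * (1.0 - a_o * a_o)        # Eq. 2
delta_hidden = (1.0 - a_h * a_h) * (W2.T @ delta_output)  # Eq. 4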
I am processing a set of data using ridge regression. I found a very interesting phenomenon when applying the learned function to data. Namely, when the ridge parameter increases from zero, the test error keeps increasing. But if we penalize small coefficients (set the parameter < 0), the test error can even be smaller.
This is my MATLAB code:

for i = 1:100
    beta = ridgePolyRegression(ty_train, tX_train, lambda(i));
    sqridge_train_cost(i) = computePolyCostMSE(ty_train, tX_train, beta);
    sqridge_test_cost(i) = computePolyCostMSE(ty_valid, tX_valid, beta);
end
plot(lambda, sqridge_test_cost, 'color', 'b');

lambda is the ridge parameter. ty_train is the output of the training data, tX_train is the input of the training data. Also, we use a quadratic function regression here:

function [ beta ] = ridgePolyRegression( y, tX, lambda )
    % append squared features (skipping the intercept column)
    X = tX(:, 2:size(tX, 2));
    tX2 = [tX, X.^2];
    % solve the ridge normal equations (tX2'*tX2 + lambda*I) * beta = tX2'*y
    beta = (tX2'*tX2 + lambda * eye(size(tX2, 2))) \ (tX2'*y);
end
The plot of the test error against lambda shows the error attaining its minimum at a negative lambda.
Why is the error minimal when lambda is negative? Is it a sign of under-fitting?
You should not use negative lambdas.
From a (probabilistic) theoretical point of view, lambda relates to the inverse of the variance of the parameter's prior distribution, and a variance can't be negative.
From a computational point of view, a negative lambda can (once its magnitude exceeds the smallest eigenvalue of the covariance matrix) turn your positive-definite form into an indefinite one, which means you'll have not a maximum but a saddle point. It also means there are points where your target function is as small (or as big) as you want, so you can reduce the loss indefinitely and no minimum / maximum exists at all.
Your optimization algorithm gives you just a stationary point, which will be a global maximum if and only if the form is positive definite.
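A quick numerical illustration of that point, with random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
XtX = X.T @ X
s_min = np.linalg.eigvalsh(XtX).min()   # smallest eigenvalue of X'X

for lam in (1.0, 0.0, -1.1 * s_min):
    eigs = np.linalg.eigvalsh(XtX + lam * np.eye(3))
    print(f"lambda = {lam:9.3f} -> smallest eigenvalue = {eigs.min():9.3f}")
# once lambda < -s_min, X'X + lambda*I is indefinite: the ridge objective
# has no minimum and the normal-equation solution is only a saddle point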
Short Answer: When lambda is negative, you're actually overfitting your data. Hence, it's reasonable to get much less error.
Long Answer:
The regularization term (or the penalty term, as many statisticians call it) aims to penalize the weights (the betas in the equation below) for going too high (overfitting) and for going too low (underfitting). It gives you the power to control how your model behaves, and you usually aim for the "right-fitting" model.
For mathematical intuition, you can check the ridge criterion (as given in The Elements of Statistical Learning by Trevor Hastie et al.):

beta_ridge = argmin_beta [ sum_i ( y_i - beta_0 - sum_j x_ij * beta_j )^2 + lambda * sum_j beta_j^2 ]
When you make your lambda negative, the penalty term is turned into a utility term that helps to increase the weights (i.e., overfitting).
Overfitting is, simply, learning your data along with its features more closely than you should, because you do not have the whole population yet; therefore, what you learned so far may well be wrong on a different dataset.
So, you should never be using negative values of lambdas.
Thanks to the help I received and the research I did here, I was able to create a simple perceptron in C#; its code goes like this:
int Input1 = A;
int Input2 = B;

// weighted sum
double WSum = Input1 * W1 + Input2 * W2 + Bias;

// get the sign: -1 for negative, +1 for positive
int Sign = Math.Sign(WSum);

// perceptron learning rule
double error = Desired - Sign;

// updating weights
W1 += error * Input1 * 0.1; // 0.1 being the learning rate
W2 += error * Input2 * 0.1;

return Sign;
I do not use Sigmoid here and just get -1 or 1.
I would have two questions:
1) Is it correct that my weights get values like -5, etc.? When the input is e.g. 100, 50 it goes like: W1 += error * 100 * 0.1
2) I want to go deeper and create more connected neurons - I guess I would need at least two to provide inputs to a third. Is it correct that the third will be fed only with values in -1..1? I am aiming at simple pattern recognition, but so far I do not understand how it should work.
It is perfectly valid for the values of your weights to range from -Infinity to +Infinity. You should always use real numbers instead of integers (as mentioned above, double will work; 32-bit float precision is perfectly sufficient for neural networks).
Moreover, you should decay your learning rate with every learning step, e.g. reduce it by a factor of 0.99 after each update. Otherwise, your algorithm will oscillate when approaching an optimum.
If you want to go "deeper", you will need to implement a Multilayer Perceptron (MLP). There is a proof that a multilayer network of purely linear neurons always has an equivalent with only one layer (a composition of linear maps is itself linear), and single-layer threshold networks cannot solve non-linearly-separable problems. This is why, several decades ago, the research community temporarily abandoned the idea of artificial neural networks. In 1986, Geoffrey Hinton and his colleagues made the backpropagation algorithm popular. With it, you can train MLPs with multiple hidden layers.
To solve non-linear problems like XOR, or other complex problems like pattern recognition, you need to apply a non-linear activation function. Have a look at the logistic sigmoid activation function for a start: f(x) = 1. / (1. + exp(-x)). When doing this, you should normalize your input as well as your output values to the range [0.0, 1.0]. This is especially important for the output neurons, since the output of the logistic sigmoid is defined in exactly this range.
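For instance, a minimal Python sketch of the logistic sigmoid and the min-max normalization just described:

import numpy as np

def sigmoid(x):
    # logistic sigmoid: maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def normalize(values):
    # min-max scaling of inputs/targets into [0.0, 1.0]
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())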
A simple Python implementation of feed-forward MLPs using arrays can be found in this answer.
Edit: You also need at least 1 hidden layer to solve e.g. XOR.
Try declaring your weights as double.

Also, I think it's much better to work with arrays, especially in neural networks; for anything beyond a single perceptron it's practically the only way.

And you will need some for or while loops to achieve what you want, as in the sketch below.
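For illustration, a rough array-based Python sketch of the same perceptron update (the function name and the bias update are my own additions):

import numpy as np

def train_step(W, bias, x, desired, lr=0.1):
    sign = 1 if (x @ W + bias) >= 0 else -1   # threshold activation
    error = desired - sign
    W += lr * error * x        # update every weight in one vector operation
    bias += lr * error
    return W, bias

W, bias = np.zeros(2), 0.0
for inputs, desired in [((1, 1), 1), ((1, 0), -1), ((0, 1), -1), ((0, 0), -1)]:
    W, bias = train_step(W, bias, np.array(inputs, dtype=float), desired)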
I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it, but let's forget about those for now. The goal of the neural network is to learn an N-dimensional variable Y*, given an N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put into the neural network? (Or something close to it.)
A major part of the problem is that N is very large, typically in the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, and the weight matrix is non-singular, and the transfer function is invertible (e.g. leaky relu), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, bias addition, and transfer function. To invert it, you just apply the inverse of each operation in the reverse order: take the output, apply the inverse transfer function, subtract the bias, multiply by the inverse of the last weight matrix, apply the inverse transfer function, subtract the bias, multiply by the inverse of the second-to-last weight matrix, and so on and so forth.
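Here is a small sketch of that procedure (NumPy, leaky ReLU, square random weight matrices, which are non-singular with probability 1):

import numpy as np

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_inv(a, alpha=0.1):
    return np.where(a > 0, a, a / alpha)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)

def forward(x):
    return leaky_relu(W2 @ leaky_relu(W1 @ x + b1) + b2)

def inverse(y):
    # inverse of each operation, applied in reverse order
    h = np.linalg.solve(W2, leaky_relu_inv(y) - b2)
    return np.linalg.solve(W1, leaky_relu_inv(h) - b1)

x = rng.normal(size=3)
print(np.allclose(inverse(forward(x)), x))   # True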
This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: apply the inverse activation, subtract the bias, and take the pseudoinverse (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), det(W_i) != 0.
An example:
Y = tanh( W2 * tanh( W1 * X + b1 ) + b2 )

X = W1p * ( tanh^-1( W2p * ( tanh^-1(Y) - b2 ) ) - b1 )

where W2p and W1p are the pseudoinverse matrices of W2 and W1, respectively.
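A quick NumPy check of these two equations (random square weights, so the pseudoinverse coincides with the exact inverse; np.arctanh plays the role of tanh^-1):

import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
W1p, W2p = np.linalg.pinv(W1), np.linalg.pinv(W2)

X = 0.1 * rng.normal(size=4)
Y = np.tanh(W2 @ np.tanh(W1 @ X + b1) + b2)
X_rec = W1p @ (np.arctanh(W2p @ (np.arctanh(Y) - b2)) - b1)
print(np.allclose(X_rec, X))   # True in the square, non-singular case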
The following paper is a case study in inverting a function learned by neural networks. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternative way of approaching the task of getting a desired x that yields a desired y would be to start with a random x (or an input given as a seed), and then, through gradient descent (an algorithm similar to backpropagation, the difference being that instead of finding the derivatives with respect to the weights and biases, you find the derivatives with respect to x; also, mini-batching is not needed), repeatedly adjust x until it yields a y that is close to the desired y. This approach has the advantage that it allows a seed input (a starting x, if not randomly selected). Also, I have a hypothesis that the final x will have some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural network application.
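A rough sketch of that idea (a tiny fixed tanh network with made-up weights; the gradient is taken with respect to x rather than the weights):

import numpy as np

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
y_target = np.array([0.3, -0.2, 0.1])   # assumed attainable by the network

x = rng.normal(size=3)   # random seed x (or pass in a chosen seed)
lr = 0.1
for _ in range(5000):
    h = np.tanh(W1 @ x)
    y = np.tanh(W2 @ h)
    # backpropagate the loss 0.5*||y - y_target||^2 down to x itself
    delta_y = (y - y_target) * (1.0 - y * y)
    delta_h = (W2.T @ delta_y) * (1.0 - h * h)
    x -= lr * (W1.T @ delta_h)

print(np.round(np.tanh(W2 @ np.tanh(W1 @ x)) - y_target, 3))   # near zero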