How are pipelined neural networks trained? - neural-network

This is my first question on Stack Overflow. I would like to ask something about neural networks, and I just need an intuitive explanation. Suppose there is a complex neural network architecture (like in the image). There are four vectors G, B, R, P (Green, Blue, Red, Purple) and a scalar y. The dependencies are something like R = f(G; Weights), P = g(G, B; Weights) and y = h(R, G; Weights). The output is calculated as output = y*R + (1-y)*P. We also define a cost/loss function to calculate an error e. My question is how the error e is backpropagated to fix the weights of each feedforward network. I just need an intuitive explanation (not the mathematics of gradient descent). I mean, how does the error "fix" the scalar y, which then "fixes" R and G, while R also "fixes" G? I'm confused about how this kind of architecture works during training. Please help me. Thanks.
[Image: Complex Model Architecture]

Fit a quadratic function of two variables (Practitioner Black Scholes Deterministic Volatility Functions)

I am attempting to fit the parameters of a deterministic volatility function for use in the practitioner Black Scholes model.
The formula for which I want to estimate the "a" parameters is:
sig = a0 + a1*K + a2*K^2 + a3*T + a4*T^2 + a5*KT
Where sig, K and T are known; I have multiple observations of K, T and sig combinations but only want a single set of "a" parameters.
How might I go about this? My google searches and own attempts all failed, unfortunately.
Thank you!
The function lsqcurvefit allows you to define the function that you want to fit. It should be straightforward from there on.
http://se.mathworks.com/help/optim/ug/lsqcurvefit.html
Some Mathematics
Notation stuff: index your observations by i and add an error term.
sig_i = a0 + a1*K_i + a2*K_i^2 + a3*T_i + a4*T_i^2 + a5*KT_i + e_i
Something probably not insane to do would be to minimize the square of the error term:
minimize (over a) \sum_i e_i^2
The solution to least squares is a simple linear algebra problem. (See https://stats.stackexchange.com/questions/186196/understanding-linear-algebra-in-ordinary-least-squares-derivation/186289#186289 for a solution if you really care.) (Further note: e_i is a linear function of a. I'm not sure why you would need lsqcurvefit as another answer suggested?)
Matlab Code for OLS (Ordinary Least Squares)
Assuming sig, K, T, and KT are n by 1 vectors
y = sig;
X = [ones(length(sig),1), K, K.^2, T, T.^2, KT];
a = X \ y; %basically computes a = inv(X'*X)*(X'*y) but in a better way
This is an ordinary least squares regression of y on X.
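For readers without MATLAB, the same regression can be sketched in plain Python. This is only an illustration: the function names solve_linear and fit_dvf and the synthetic coefficients are made up here, and the code solves the normal equations X'X a = X'y directly for simplicity, whereas MATLAB's backslash uses a QR-based solve that is numerically more robust.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_dvf(K, T, sig):
    """Least-squares estimate of a0..a5 via the normal equations."""
    # Same column order as the MATLAB design matrix: 1, K, K^2, T, T^2, K*T.
    X = [[1.0, k, k * k, t, t * t, k * t] for k, t in zip(K, T)]
    p = 6
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * s for row, s in zip(X, sig)) for i in range(p)]
    return solve_linear(XtX, Xty)


# Synthetic, noise-free data generated from made-up coefficients.
true_a = [0.2, -0.1, 0.05, 0.03, -0.01, 0.02]
K = [k for k in (0.8, 0.9, 1.0, 1.1, 1.2) for _ in range(4)]
T = [t for _ in range(5) for t in (0.25, 0.5, 1.0, 2.0)]
sig = [true_a[0] + true_a[1] * k + true_a[2] * k * k
       + true_a[3] * t + true_a[4] * t * t + true_a[5] * k * t
       for k, t in zip(K, T)]

a_hat = fit_dvf(K, T, sig)
print(a_hat)  # should reproduce true_a up to rounding error
```

With real, noisy observations the recovered coefficients will of course not match any "true" values exactly; the point is only that one set of a parameters is estimated from all (K, T, sig) observations at once.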
Further Ideas
Depending on the distribution of your error terms, correlated errors, etc., regular OLS may be inefficient or possibly even inappropriate. I'm not familiar enough with the details of this problem to know. You may want to check what people do in practice.
E.g., a technique that's less sensitive to big outliers is to minimize the absolute value of the error.
minimize (over a) \sum_i |e_i|
If you have a good statistical model of how the data is generated, you could do maximum likelihood estimation. Anyway... this rapidly devolves into a multi-quarter statistics class.

TensorFlow or Theano: how do they know the loss function derivative based on the neural network graph?

In TensorFlow or Theano, you only tell the library what your neural network looks like and how the feed-forward pass should operate.
For instance, in TensorFlow, you would write:
with graph.as_default():
    _X = tf.constant(X)
    _y = tf.constant(y)
    hidden = 20
    w0 = tf.Variable(tf.truncated_normal([X.shape[1], hidden]))
    b0 = tf.Variable(tf.truncated_normal([hidden]))
    h = tf.nn.softmax(tf.matmul(_X, w0) + b0)
    w1 = tf.Variable(tf.truncated_normal([hidden, 1]))
    b1 = tf.Variable(tf.truncated_normal([1]))
    yp = tf.nn.softmax(tf.matmul(h, w1) + b1)
    loss = tf.reduce_mean(0.5*tf.square(yp - _y))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
I am using an L2-norm loss function, C=0.5*sum((y-yp)^2), and in the backpropagation step presumably the derivative will have to be computed, dC=sum(y-yp). See (30) in this book.
My question is: how can TensorFlow (or Theano) know the analytical derivative for backpropagation? Or do they do an approximation? Or somehow do not use the derivative?
I have done the deep learning Udacity course on TensorFlow, but I am still struggling to make sense of how these libraries work.
The differentiation happens in the final line:
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
When you execute the minimize() method, TensorFlow identifies the set of variables on which loss depends, and computes gradients for each of these. The differentiation is implemented in ops/gradients.py, and it uses "reverse accumulation". Essentially it searches backwards from the loss tensor to the variables, applying the chain rule at each operator in the dataflow graph. TensorFlow includes "gradient functions" for most (differentiable) operators, and you can see an example of how these are implemented in ops/math_grad.py. A gradient function can use the original op (including its inputs, outputs, and attributes) and the gradients computed for each of its outputs to produce gradients for each of its inputs.
Page 7 of Ilya Sutskever's PhD thesis has a nice explanation of how this process works in general.
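To make "reverse accumulation" concrete, here is a minimal scalar sketch in plain Python. It is not TensorFlow's implementation: the Node class and the add, sub, mul and backward helpers are hypothetical names invented for this illustration. Each operator records its inputs together with its local derivative (playing the role of a "gradient function"), and backward walks from the loss back to the variables applying the chain rule.

```python
class Node:
    """A scalar value in a dataflow graph that remembers how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input_node, d(output)/d(input))
        self.grad = 0.0         # d(loss)/d(this node), filled in by backward()


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def sub(a, b):
    return Node(a.value - b.value, ((a, 1.0), (b, -1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(loss):
    """Propagate d(loss)/d(node) from the loss back to every input."""
    order, seen = [], set()

    def visit(node):
        # Depth-first post-order: inputs are listed before their consumers.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(loss)
    loss.grad = 1.0
    # Reverse order guarantees a node's gradient is complete before it is
    # pushed further back to its own inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # chain rule


# loss = 0.5 * (w*x + b - y)^2 for a single training example
w, b = Node(2.0), Node(1.0)
x, y = Node(3.0), Node(10.0)
yp = add(mul(w, x), b)
err = sub(yp, y)
loss = mul(Node(0.5), mul(err, err))
backward(loss)

print(loss.value, w.grad, b.grad)  # 4.5 -9.0 -3.0
```

The gradients match the hand derivation dloss/dw = (yp - y)*x and dloss/db = (yp - y), yet no derivative of the loss was ever written down explicitly: only each operator's local derivative was supplied.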

Normalized cut: what does this code do?

I'm going through some MATLAB code for Normalized Cut for image segmentation, and I can't figure out what this code below does:
% degrees and regularization
d = sum(abs(W),2);
dr = 0.5 * (d - sum(W,2));
d = d + offset * 2;
dr = dr + offset;
W = W + spdiags(dr,0,n,n);
offset is defined to be 0.5.
W is a square, sparse, symmetric matrix (w_ij is defined by the similarity between pixels i and j).
W is then used to solve the eigenvalue problem D^(-1/2)(D-W)D^(-1/2) x = \lambda x, where D is the diagonal matrix with the degrees d on its diagonal.
The w_ij's are all positive because of the way the weights are defined, so dr is a vector of 0's.
What are the offsets for? How are they chosen? What's the reason behind offset*2? I have the feeling this is to avoid some potential pitfalls in certain cases. What could these be?
Any help would be really appreciated, thanks!
I believe you came across a piece of code written by Prof. Stella X. Yu.
Indeed, when W is positive this code has no effect and this is the usual case for NCuts.
However, in a CVPR 2001 paper Yu and Shi extend NCuts to handle negative interactions as well as positive ones. In these circumstances dr (r for "repulsion") plays a significant role.
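A small numeric check of this claim, sketched in plain Python: the function below mirrors only the first two lines of the MATLAB snippet from the question (before the offset is added), and both example matrices are made up for illustration.

```python
def degrees_and_repulsion(W):
    """Return (d, dr) as in the MATLAB snippet, before the offset is added.

    d[i]  = sum_j |W[i][j]|
    dr[i] = 0.5 * (d[i] - sum_j W[i][j])
    """
    d = [sum(abs(w) for w in row) for row in W]
    dr = [0.5 * (di - sum(row)) for di, row in zip(d, W)]
    return d, dr


# All weights non-negative: |w| == w, so dr vanishes.
W_pos = [[0.0, 1.0, 2.0],
         [1.0, 0.0, 3.0],
         [2.0, 3.0, 0.0]]

# Same magnitudes, but the (0, 2) interaction is repulsive (negative).
W_mixed = [[0.0, 1.0, -2.0],
           [1.0, 0.0, 3.0],
           [-2.0, 3.0, 0.0]]

d_pos, dr_pos = degrees_and_repulsion(W_pos)
d_mixed, dr_mixed = degrees_and_repulsion(W_mixed)

print(dr_pos)    # [0.0, 0.0, 0.0]
print(dr_mixed)  # [2.0, 0.0, 2.0]
```

So dr[i] is the total magnitude of the negative weights attached to node i, which is why it is zero for ordinary NCuts and only matters once repulsion is present.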
Speaking of negative weights, I must say that personally I do not agree with the approach of Yu and Shi.
I strongly believe that when there is repulsion information, Correlation Clustering is a far better objective function than the extended NCuts objective. Results of some image segmentation experiments I conducted with negative weights suggested that the Correlation Clustering objective is better than the extended NCuts objective.

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put in the neural network? (or something close to it)
A major part of the problem is that N is very large, typically on the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers, that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, and the weight matrix is non-singular, and the transfer function is invertible (e.g. leaky relu), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of a bias and a transfer function. To invert it, you just need to apply the inverse of each operation in the reverse order. I.e. take the output, apply the inverse transfer function, subtract the bias, multiply by the inverse of the last weight matrix, apply the inverse transfer function again, subtract that layer's bias, multiply by the inverse of the second-to-last weight matrix, and so on.
This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: subtract the bias and take the pseudoinverse (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), det(W_i) != 0
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*(tanh^-1(Y) - b2) ) -b1 ), where W2p and W1p represent the pseudoinverse matrices of W2 and W1 respectively.
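The formula above can be checked with a short round trip in plain Python. This is a sketch under stated assumptions: the 2x2 weights, biases and input are made-up numbers, and because the matrices are square and non-singular the pseudoinverse is simply the ordinary inverse, computed here by the hypothetical helper inv2.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def inv2(W):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, d) = W
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def forward(X, W1, b1, W2, b2):
    """Y = tanh(W2 * tanh(W1 * X + b1) + b2)"""
    h = [math.tanh(z + b) for z, b in zip(matvec(W1, X), b1)]
    return [math.tanh(z + b) for z, b in zip(matvec(W2, h), b2)]


def inverse(Y, W1, b1, W2, b2):
    """X = W1^-1 * (atanh(W2^-1 * (atanh(Y) - b2)) - b1)"""
    z2 = [math.atanh(y) - b for y, b in zip(Y, b2)]  # undo tanh, then bias
    h = matvec(inv2(W2), z2)                         # undo W2
    z1 = [math.atanh(v) - b for v, b in zip(h, b1)]  # undo tanh, then bias
    return matvec(inv2(W1), z1)                      # undo W1


# Made-up parameters; both weight matrices have non-zero determinant.
W1, b1 = [[0.5, -0.2], [0.1, 0.4]], [0.1, -0.3]
W2, b2 = [[0.3, 0.2], [-0.1, 0.6]], [0.05, 0.2]

X = [0.7, -0.4]
Y = forward(X, W1, b1, W2, b2)
X_rec = inverse(Y, W1, b1, W2, b2)

print(X_rec)  # should match X up to rounding error
```

Note that atanh is only defined on (-1, 1), so this exact inversion works for outputs the network can actually produce; an arbitrary target Y* outside the network's range has no preimage. For N in the tens of thousands, explicitly inverting each weight matrix is also likely to be expensive and sensitive to ill-conditioning.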
The following paper is a case study in inverting a function learned by neural networks. It is a case study from industry and looks like a good starting point for understanding how to go about setting up the problem.
An alternative way to get the desired x that yields the desired y is to start with a random x (or a seed input), then use gradient descent to repeatedly adjust x until it yields a y close to the desired y. The algorithm is similar to backpropagation, except that instead of taking derivatives with respect to the weights and biases, you take derivatives with respect to x; mini-batching is not needed. An advantage of this approach is that it accepts a seed (a starting x, if not randomly selected). I also have a hypothesis that the final x will have some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural network application.
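A minimal sketch of this idea in plain Python, assuming a single tanh layer y = tanh(W x + b) with made-up weights and a squared-error objective 0.5 * ||y - y_target||^2. The function names, learning rate and step count are illustrative choices, not part of the original answer.

```python
import math

# Made-up, fixed (already "trained") parameters of a one-layer network.
W = [[0.5, -0.2], [0.1, 0.4]]
b = [0.1, -0.3]


def forward(x):
    """y = tanh(W x + b)"""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def input_gradient(x, y_target):
    """Gradient of 0.5 * ||forward(x) - y_target||^2 with respect to x."""
    y = forward(x)
    # Backpropagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
    delta = [(yi - ti) * (1.0 - yi * yi) for yi, ti in zip(y, y_target)]
    # Backpropagate through the linear map: multiply by W transposed.
    return [sum(W[i][j] * delta[i] for i in range(len(W)))
            for j in range(len(x))]


def invert_by_descent(y_target, x0, lr=1.0, steps=2000):
    """Adjust the input (not the weights) until the output matches y_target."""
    x = list(x0)
    for _ in range(steps):
        g = input_gradient(x, y_target)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


x_true = [0.7, -0.4]
y_target = forward(x_true)
x_found = invert_by_descent(y_target, x0=[0.0, 0.0])

print(x_found)  # close to x_true for this small invertible example
```

In this toy case the network is invertible, so the search recovers the original input. For networks that are not invertible, the result generally depends on the starting point, which is consistent with the seed-dependence hypothesis above.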

MATLAB | calculating parameters of gamma dist based on mean and probability interval

I have a system of 2 equations in 2 unknowns that I want to solve using MATLAB but don't know exactly how to program. I've been given some information about a gamma distribution (mean of 1.86, 90% interval between 1.61 and 2.11) and ultimately want to get the mean and variance. I know that I could use the normal approximation but I'd rather solve for A and B, the shape and scale parameters of the gamma distribution, and find the mean and variance that way. In pseudo-MATLAB code I would want to solve this:
gamcdf(2.11, A, B) - gamcdf(1.61, A, B) = 0.90;
A*B = 1.86;
How would you go about solving this? I have the symbolic math toolbox if that helps.
The mean is A*B. So can you solve for perhaps A in terms of the mean(mu) and B?
A = mu/B
Of course, this does no good unless you knew B. Or does it?
Look at your first expression. Can you substitute?
gamcdf(2.11, mu/B, B) - gamcdf(1.61, mu/B, B) = 0.90
Does this get you any closer? Perhaps. There will be no useful symbolic solution available, except in terms of the incomplete gamma function itself. How do you solve a single equation numerically in one unknown in MATLAB? Use fzero.
Of course, fzero looks for a zero value. But by subtracting 0.90, that is resolved.
Can we define a function that fzero can use? Use a function handle.
>> mu = 1.86;
>> gamfun = @(B) gamcdf(2.11, mu/B, B) - gamcdf(1.61, mu/B, B) - 0.90;
So try it. Before we do that, I always recommend plotting things.
>> ezplot(gamfun)
Hmm. That plot suggests that it might be difficult to find a zero of your function. If you do try it, you will find that good starting values for fzero are necessary here.
Sorry about my first try. Better starting values for fzero, plus some more plotting, do give a gamma distribution that yields the desired shape.
>> B = fzero(gamfun,[.0000001,.1])
B =
0.0124760672290871
>> A = mu/B
A =
149.085442218805
>> ezplot(@(x) gampdf(x,A,B))
In fact this is a very "normal", i.e., Gaussian, looking curve.
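The same substitution-plus-root-finding approach can be sketched without MATLAB. The Python below is an illustration under assumptions: gammainc_lower is a hand-rolled regularized incomplete gamma function (series for small x, continued fraction otherwise), plain bisection stands in for fzero, and the bracket [0.005, 0.1] is narrower than the one used above so that the iterations stay cheap.

```python
import math


def gammainc_lower(a, x, eps=1e-14, max_iter=100000):
    """Regularized lower incomplete gamma function P(a, x)."""
    if x <= 0.0:
        return 0.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series: P = e^-x x^a / Gamma(a) * sum_n x^n / (a (a+1) ... (a+n))
        ap, term = a, 1.0 / a
        total = term
        for _ in range(max_iter):
            ap += 1.0
            term *= x / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        return total * math.exp(log_prefactor)
    # Continued fraction for Q = 1 - P, evaluated with the modified Lentz method.
    tiny = 1e-300
    bb = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / bb
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        bb += 2.0
        d = an * d + bb
        if abs(d) < tiny:
            d = tiny
        c = bb + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return 1.0 - h * math.exp(log_prefactor)


def gamcdf(x, A, B):
    """Gamma CDF with shape A and scale B (same convention as MATLAB's gamcdf)."""
    return gammainc_lower(A, x / B)


mu = 1.86


def gamfun(B):
    # A is eliminated through the mean constraint A*B = mu.
    return gamcdf(2.11, mu / B, B) - gamcdf(1.61, mu / B, B) - 0.90


def bisect(f, lo, hi, iterations=100):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


B = bisect(gamfun, 0.005, 0.1)
A = mu / B
print(B, A)  # should agree closely with the fzero result above
```

Since the mean of a gamma distribution is A*B and its variance is A*B^2, the variance the question ultimately asks for follows as mu*B once B is known.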