Choice of cost function in Michael Nielsen's book Neural Networks and Deep Learning - neural-network
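For reference, the cost function in question is the quadratic cost
C(w, b) = 1/(2n) * Σ_x || y(x) - a ||²
where y(x) is the desired output for training input x. The book describes the notation as follows: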

Here, w denotes the collection of all weights in the network, b all
the biases, n is the total number of training inputs, a is the vector
of outputs from the network when x is input, and the sum is over all
training inputs, x. Of course, the output a depends on x, w and b,
but to keep the notation simple I haven't explicitly indicated this
dependence.
Taken from Michael Nielsen's Neural Networks and Deep Learning
Does anyone know why he divides the sum by 2? I thought he was going to find the average by dividing by n; instead, he divides by 2n.

This is done so that, when the partial derivatives of C(w, b) are computed, the 1/2 cancels the factor of 2 produced by differentiating the quadratic term.
You are correct that we would normally divide by n; the extra factor of 2 is a trick done purely for computational convenience.
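To see the cancellation explicitly (a sketch, writing a for the network's output on input x):
∂C/∂w = ∂/∂w [ 1/(2n) * Σ_x || y(x) - a ||² ] = 1/n * Σ_x (a - y(x)) · ∂a/∂w
The factor of 2 from differentiating the square cancels the 1/2, leaving a clean average over the n training inputs with no stray constants.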

Related

MATLAB: How to compute the similarity of two signals and get the correct consistency or coherence metric

I was wondering about the consistency metric. Generally, it allows us to judge the agreement or similarity between two signals, right? If so, does a higher value (from 0.5 to 1) mean that there is a strong similarity between the signals? If the value is low (around 0.1-0.43), does that indicate poor coherence between the signals (poor similarity, i.e. the signals are probably different)? And if the metric is < 0, does that confirm the signals are totally different? I'm getting negative numbers, so is this interpretation plausible?
Can someone give me a clear understanding of the consistency metric of a signal? Here is my small code and figure. Thanks in advance.
s1 = signal3;
s2 = signal4;
C1 = xcorr(s1);               % auto-correlation of s1 (xcorr with a single input)
C2 = xcorr(s2);               % auto-correlation of s2
if ~isequal(s1, s2)
    signal_mix = C1 .* C2;    % mixing vector
else
    signal_mix = s2;          % signals are identical
end
signal_mix1 = signal_mix;
signal_mix2 = 0;
for i = 1:length(signal_mix1)
    % per-sample consistency score: ratio of the smaller to the larger value
    signal_mix1(i) = min(C1(i), C2(i)) / max(C1(i), C2(i));
    signal_mix2 = signal_mix2 + signal_mix1(i);   % accumulated score
end
Depending on your use case you might want to consider the dynamic time warping (DTW) distance as a similarity metric (Matlab has a built-in dtw function for this). One problem with using correlation as a metric is that it always compares the same timestep of the two signals, so two identical signals where one is time-delayed can produce a low correlation. The DTW distance addresses this by also comparing values at adjacent timesteps.
The downside of the DTW distance is that the distance itself can't be interpreted on its own, only relative to other distances. So you can tell that two signals A & B with a distance of 150 are more similar than A & C with a distance of 250, but the distance of 150 on its own doesn't tell you a lot.
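A minimal sketch of comparing the two approaches (this assumes the Signal Processing Toolbox for dtw; the signals here are made up for illustration):
t = 0:0.01:10;
x = sin(2*pi*0.5*t);                 % reference signal
y = sin(2*pi*0.5*(t - 0.5));         % the same signal, delayed by 0.5 s
r = corrcoef(x, y); r(1, 2)          % near zero: correlation punishes the misalignment
dAB = dtw(x, y)                      % relatively small: same shape, just shifted
dAC = dtw(x, randn(size(x)))         % larger: genuinely different signal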
First of all, you could use the xcorr function to calculate the cross-correlation between two signals.
from Matlab help:
r = xcorr(x,y) returns the cross-correlation of two discrete-time
sequences. Cross-correlation measures the similarity between a vector
x and shifted (lagged) copies of a vector y as a function of the lag.
If x and y have different lengths, the function appends zeros to the
end of the shorter vector so it has the same length as the other.
Additionally, you could use xcov:
xcov computes the mean of its inputs, subtracts the mean, and then
calls xcorr.
The result of xcov can be interpreted as an estimate of the covariance
between two random sequences or as the deterministic covariance
between two deterministic signals.
In your example you are calling xcorr with a single signal, so it computes the auto-correlation of the signal with lagged copies of itself.
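For reference, a minimal sketch of calling xcorr with two signals and reading off the lag of best alignment (the signals are made up for illustration):
x = randn(1, 200);                   % reference signal
y = [zeros(1, 15), x(1:end-15)];     % x delayed by 15 samples
[r, lags] = xcorr(x, y, 'coeff');    % 'coeff' normalizes so the peak is at most 1
[peak, idx] = max(r);
fprintf('peak correlation %.2f at lag %d samples\n', peak, lags(idx))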
Update:
Based on the comment, it seems you need linear correlation; it can be calculated with the corr function:
p = corr(x, y)
The value of p is 1 when x and y behave exactly like each other, and -1 when x and y behave exactly opposite to each other.
When p is 0, it means there is no (linear) correlation between the two signals.
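A minimal sketch of the three cases (corr needs the Statistics and Machine Learning Toolbox; corrcoef in base Matlab gives the same Pearson coefficient):
t = (0:0.01:10)';
x = sin(t);
p_same     = corr(x, 2*x + 1)           %  1: exactly linearly related
p_opposite = corr(x, -x)                % -1: exactly opposite
p_none     = corr(x, randn(size(x)))    % near 0: no linear relationship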

Is there a Matlab function for calculating std of a binomial distribution?

I have a binary vector V, in which each entry describes success (1) or failure (0) in the relevant trial out of a whole session.
(the length of the vector denotes the number of trials in the session).
I can easily calculate the success rate of the session (by taking the mean of the vector i.e. (sum(V)/length(V))).
However, I also need to know the variance or std of each session.
In order to calculate that, is it OK to use the Matlab std function (i.e. to take std(V)/length(V))?
Or, should I use something which is specifically suited for the binomial distribution?
Is there a Matlab std (or variance) function which is specific for a "success/failure" distribution?
Thanks
If you satisfy the assumptions of the Binomial distribution,
a fixed number of n independent Bernoulli trials,
each with constant success probability p,
then I'm not sure that's necessary, since the parameters n and p are available directly from your data.
Note that we model number of successes (in n trials) as a random variable distributed with the Binomial(n,p) distribution.
n = length(V);
p = mean(V); % equivalently, sum(V)/length(V)
% the mean is the maximum likelihood estimator (MLE) for p
% note: need large n or replication to get true p
Then the standard deviation of the number of successes in n independent Bernoulli trials with constant success probability p is just sqrt(n*p*(1-p)).
Of course you can assess this from your data if you have multiple samples. Note this is different from std(V). In your data formatting, it would require having multiple vectors, V1, V2, V3, etc. (replication); then the sample standard deviation of the number of successes would be obtained from the following.
% Given V1, V2, V3 sets of Bernoulli trials
std([sum(V1) sum(V2) sum(V3)])
If you already know your parameters: n, p
You can obtain it easily enough.
n = 10;
p = 0.65;
pd = makedist('Binomial', 'N', n, 'p', p)
std(pd) % 1.5083
or
sqrt(n*p*(1-p)) % 1.5083
as discussed earlier.
Does the standard deviation increase with n ?
The OP has asked:
Something is bothering me... if std = sqrt(n*p*(1-p)), then it increases with n. Shouldn't the std decrease when n increases?
Confirmation & Derivation:
Definitions: let X_1, ..., X_n be independent Bernoulli(p) trials, and let X = X_1 + ... + X_n be the number of successes, so that X ~ Binomial(n, p).
Then we know that E[X_i] = p and Var(X_i) = p*(1-p) for each trial.
Then, just from the definitions of expectation and variance (and the independence of the trials), Var(X) = Var(X_1) + ... + Var(X_n) = n*p*(1-p), so the variance increases with n.
Since the square root is a non-decreasing function, we know the same relationship holds for the standard deviation.
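Intuitively, it is the count of successes that spreads out as n grows; the proportion of successes X/n has standard deviation sqrt(p*(1-p)/n), which does shrink with n. A quick simulation sketch (the value of p and the number of simulated sessions are arbitrary):
p = 0.3;
for n = [10 100 1000]
    counts = sum(rand(n, 5000) < p, 1);   % 5000 sessions of n Bernoulli trials each
    fprintf('n = %4d: std(count) = %6.2f (theory %6.2f), std(rate) = %.4f (theory %.4f)\n', ...
        n, std(counts), sqrt(n*p*(1-p)), std(counts/n), sqrt(p*(1-p)/n))
end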

Computing the SVD of a rectangular matrix

I have a matrix M of size K x N, where K = 49152 is the dimension of the problem and N = 52 is the number of observations.
I have tried to use [U,S,V] = svd(M), but doing this I run out of memory.
I found another code which uses [U,S,V] = svd(cov(M)) and it works well. My questions are: what is the meaning of using the cov(M) command inside the svd, and what is the meaning of the resulting [U,S,V]?
Finding the SVD of the covariance matrix is a method to perform Principal Components Analysis, or PCA for short. I won't get into the mathematical details here, but PCA performs what is known as dimensionality reduction. If you'd like a more formal treatment of the subject, you can read up on my post about it here: What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?. Simply put, dimensionality reduction projects your data stored in the matrix M onto a lower dimensional surface with the least amount of projection error. In this matrix, we are assuming that each column is a feature or a dimension and each row is a data point.
I suspect the reason you are using more memory by applying the SVD to the actual data matrix M itself rather than to the covariance matrix is that you have a large number of data points with a small number of features. The covariance matrix finds the covariance between pairs of features. If M is an m x n matrix where m is the total number of data points and n is the total number of features, doing cov(M) gives you an n x n matrix, so you are applying the SVD to something that occupies far less memory than M.
As for the meaning of U, S and V for dimensionality reduction specifically: the columns of V are what are known as the principal components. The columns of V are ordered so that the first column is the axis of your data that describes the greatest amount of variability possible. As you move from the second column up to the nth column, you introduce more axes into your data and the amount of variability each one describes decreases. When you reach the nth column, you are essentially describing your data in its entirety without reducing any dimensions. The diagonal values of S denote what is called the variance explained, and they respect the same ordering as V. As you progress through the singular values, they tell you how much of the variability in your data is described by each corresponding principal component.
To perform the dimensionality reduction, you can either take U and multiply by S, or take your data that is mean subtracted and multiply by V. In other words, supposing X is the matrix M where each column has its mean computed and then subtracted from it, the following relationship holds:
US = XV
To actually perform the final dimensionality reduction, you take either US or XV and retain the first k columns, where k is the total number of dimensions you want to keep. The value of k depends on your application, but many people choose k to be the number of principal components that explain a certain percentage of the variability in your data.
For more information about the link between SVD and PCA, please see this post on Cross Validated: https://stats.stackexchange.com/q/134282/86678
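For concreteness, here is a minimal sketch of those steps using the SVD of the mean-subtracted data (which is what satisfies US = XV above); the 95% threshold is just an example choice:
% rows of M = data points, columns = features (the convention assumed above)
X = bsxfun(@minus, M, mean(M, 1));            % mean-subtract each feature
[U, S, V] = svd(X, 'econ');                   % economy SVD keeps memory manageable
varExplained = diag(S).^2 / sum(diag(S).^2);  % fraction of variance per component
k = find(cumsum(varExplained) >= 0.95, 1);    % keep enough components for ~95%
Z = X * V(:, 1:k);                            % reduced data; equivalently U(:,1:k)*S(1:k,1:k)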
Instead of [U, S, V] = svd(M), which tries to build a matrix U that is 49152 by 49152 (= 18 GB 😱!), do svd(M, 'econ'). That returns the “economy-class” SVD, where U will be 49152 by 52, S is 52 by 52, and V is also 52 by 52.
cov(M) will remove each dimension’s mean and evaluate the inner product, giving you a 52 by 52 covariance matrix. You can implement your own version of cov, called mycov, as
function C = mycov(M)
  M = bsxfun(@minus, M, mean(M, 1));  % subtract each dimension’s mean over all observations
  C = M' * M / (size(M, 1) - 1);      % normalize by N-1 to match cov's default
end
(You can verify this works by looking at mycov(randn(49152, 52)), which should be close to eye(52), since each element of that array is IID-Gaussian.)
There are a lot of magical linear-algebraic properties and relationships between the SVD and EVD (i.e., the singular value and eigenvalue decompositions): because the covariance matrix cov(M) is a Hermitian matrix, its left- and right-singular vectors are the same, and are in fact also cov(M)’s eigenvectors. Furthermore, cov(M)’s singular values are also its eigenvalues: so svd(cov(M)) is just an expensive way to get eig(cov(M)) 😂, up to ±1 and reordering.
As @rayryeng explains at length, usually people use svd(M, 'econ') because they want eig(cov(M)) without needing to evaluate cov(M), since you never want to compute cov(M) explicitly: it’s numerically unstable. I recently wrote an answer that showed, in Python, how to compute eig(cov(M)) using svd(M2, 'econ'), where M2 is the 0-mean version of M, used in the practical application of color-to-grayscale mapping, which might help you get more context.
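A quick numerical check of that SVD/eigendecomposition relationship (the data here is random, and the sizes are just illustrative):
M = randn(1000, 52);                       % rows = observations
X = bsxfun(@minus, M, mean(M, 1));         % remove each column's mean
[~, S, ~] = svd(X, 'econ');
evals = eig(cov(M));                       % eigenvalues of the covariance matrix
svals = diag(S).^2 / (size(M, 1) - 1);     % squared singular values of X, rescaled
max(abs(sort(evals) - sort(svals)))        % ~1e-15: same numbers up to ordering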

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put in the neural network? (or something close to it)
A major part of the problem is that N is very large, typically in the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, and the weight matrix is non-singular, and the transfer function is invertible (e.g. leaky relu), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of bias and transfer function. To invert, you just apply the inverse of each operation in the reverse order: take the output, apply the inverse transfer function, subtract the last bias, multiply by the inverse of the last weight matrix, apply the inverse transfer function again, subtract the previous bias, multiply by the inverse of the second-to-last weight matrix, and so on and so forth.
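A small numerical sketch of that layer-by-layer inversion, for a toy two-layer network with a leaky-ReLU transfer function (the sizes and random weights are made up for illustration; a random square matrix is almost surely non-singular):
N = 4;
W1 = randn(N); b1 = randn(N, 1);
W2 = randn(N); b2 = randn(N, 1);
a = 0.1;                                  % leaky ReLU slope
lrelu  = @(z) max(z, a*z);                % invertible transfer function
ilrelu = @(y) min(y, y/a);                % its inverse
x = randn(N, 1);
y = lrelu(W2 * lrelu(W1 * x + b1) + b2);  % forward pass
% inverse pass: undo each operation in reverse order
x_rec = W1 \ (ilrelu(W2 \ (ilrelu(y) - b2)) - b1);
max(abs(x - x_rec))                       % ~1e-15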
This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.
zenna is right.
If you are using bijective (invertible) activation functions you can invert layer by layer, subtract the bias and take the pseudoinverse (if you have the same number of neurons in every layer this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(W_i) ≠ 0.
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*(tanh^-1(Y) - b2) ) -b1 ), where W2p and W1p represent the pseudoinverse matrices of W2 and W1 respectively.
The following paper is a case study in inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternate way of approaching the task of getting the desired x that yields a desired y would be to start with a random x (or a seed input), then repeatedly adjust x through gradient descent until it yields a y that is close to the desired y. The algorithm is similar to backpropagation, except that instead of finding derivatives with respect to the weights and biases, you find derivatives with respect to x (and mini-batching is not needed). This approach has the advantage that it allows a seed input (a starting x, if not randomly selected). I also have a hypothesis that the final x will retain some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose an input, depending on the context of the neural network application.
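A rough sketch of that gradient-descent-on-the-input idea for a toy one-layer tanh network (the weights, step size and iteration count are arbitrary placeholders and may need tuning):
N = 5;
W = randn(N) / sqrt(N); b = randn(N, 1);  % a trained network would supply these
f = @(x) tanh(W*x + b);
y_target = f(randn(N, 1));                % a reachable target output
x  = randn(N, 1);                         % seed / random starting point
lr = 0.05;                                % step size
for it = 1:5000
    a    = f(x);
    grad = W' * ((1 - a.^2) .* (a - y_target));  % d/dx of 0.5*||f(x) - y_target||^2
    x    = x - lr * grad;                        % adjust the input, not the weights
end
norm(f(x) - y_target)                     % should be small if the descent converged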

Calculating an inverse matrix in Matlab

I'm running an optimization algorithm that requires calculation of the inverse of a matrix. The goal of the algorithm is to eliminate negative values from the matrix A and obtain the new matrix B. Basically, I start with known square matrices B and C of the same size.
I start by calculating the matrix A which is equal to:
A = B^-1 * C
Or in Matlab:
A = B\C;
I use this because Matlab told me B\C is more accurate than inv(B)*C.
The negative values in A are then divided by two and A is then normalised so that its rows have length 1. Using this new A, I calculate a new B with:
(1/N) * A * C' = B^-1
where N is just a scaling factor (# of columns in A). This new B would then be used again in the first step and these iterations continue until the negatives in A are gone.
My problem is I have to calculate B from the second equation and then normalise it.
invB = (1/N)*A*C';
B = inv(invB);
I've been calculating B using inv(B^-1) but after a few iterations I start getting messages that B^-1 is "close to singular or badly scaled."
This algorithm actually works for smaller matrices (around 70x70) but when it gets up to about 500x500 I start getting these messages.
Are there any better ways to calculate inv(B^-1)?
You should definitely heed warnings about singular matrices. Results in numerical linear algebra tend to break down as you move toward matrices with high condition numbers. The underlying idea is: if
A*b_1 = c
and we're actually solving the problem (because we are using approximate numbers when we use computers)
(A + matrix error)*b_2 = (c + vector error)
how close are b_1 and b_2 as a function of the matrix and vector errors? When A has small condition number b_1 and b_2 are close. When A has large condition number b_1 and b_2 are not close.
There is an informative piece of analysis you could do on your algorithm. At each iteration, after you've found B, use Matlab to find its condition number. This is
cond(B)
You will likely see the number climb rapidly. This indicates that every time you iterate your algorithm, you should trust your result for B less and less.
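A small illustration of why a growing condition number matters (the size and perturbation level here are arbitrary):
n = 500;
B = gallery('randsvd', n, 1e10);               % random matrix with condition number ~1e10
x_true = randn(n, 1);
c = B * x_true;
c_noisy = c + 1e-12 * norm(c) * randn(n, 1);   % tiny relative perturbation of the data
x = B \ c_noisy;
fprintf('cond(B) = %.1e, relative error in x = %.1e\n', ...
    cond(B), norm(x - x_true) / norm(x_true))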
Problems like this crop up all the time in numerical mathematics. If you'll be working with numerical algorithms frequently, you should take some time to familiarize yourself with the role of condition numbers in the field and with preconditioning techniques, as mentioned in the other answer. My preferred text for this is "Numerical Linear Algebra" by Lloyd Trefethen, but any text on numerical linear algebra should address some of these issues.
Best of luck,
Andrew
The main issue is that your matrix has a high condition number (i.e. really small rcond(B) in your case). This is due to the iterative structure in your algorithm, I guess. As you do each iteration your small singular values get smaller and smaller so your condition number grows exponentially. You should check preconditioning to avoid this kind of behavior.