Weights, How to write it in matrix form? - neural-network

In backpropagation for a neural network with a sigmoid activation function,
the weight update rule is given by:
NewWeight = OldWeight - alpha * D * A
where alpha is the learning rate, A is the activations from the previous layer, and
D = (Y - Y')Y'(1-Y') is the delta (error term),
where Y is the given target value and Y' is the value computed by the output layer of the network.
Y in my case is 4x1 = [0.3,0.2,0.4,0.1] and an instance of Y' is 4x1 = [0.2,0.1,0.1,0.2].
How do I compute D = (Y - Y')Y'(1-Y')?
(Y-Y') is 4x1 and Y' is 4x1, and 1 <> 4, hence matrix multiplication is not possible. Also, (1-Y') is 4x1.
How can I multiply {(Y - Y'),Y',(1-Y')} to obtain D? If I have to take a transpose, which matrix should I transpose so that the net effect is unchanged?
Or is it an elementwise multiplication?

It is indeed elementwise multiplication. You need to multiply each output error (Y-Y') by the derivative of the respective output with respect to its pre-activation input, which is Y'(1-Y') for the sigmoid activation. Think of it as a "corrected error signal". D * A is then a vector outer product, which gives you a matrix with the same size as the weights.
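As a minimal sketch of the computation described above, the following uses the Y and Y' values from the question. The previous-layer activations `A` (three units) are made up for illustration, and the weight matrix is assumed to have one row per output unit and one column per previous-layer unit.

```python
# Values from the question
Y = [0.3, 0.2, 0.4, 0.1]        # targets
Y_hat = [0.2, 0.1, 0.1, 0.2]    # network outputs (Y' in the question)

# Hypothetical activations of the previous layer (3 units, made up for illustration)
A = [0.5, 0.1, 0.9]

# Elementwise product: D[i] = (Y[i] - Y'[i]) * Y'[i] * (1 - Y'[i])
D = [(y - yh) * yh * (1.0 - yh) for y, yh in zip(Y, Y_hat)]

# Outer product of D and A: one row per output unit, one column per previous-layer unit
DA = [[d * a for a in A] for d in D]

print(D)
print(len(DA), "x", len(DA[0]))
```

No transpose of Y, Y' or (1-Y') is needed at any point; the only place where shapes matter is the outer product, which turns a 4-vector and a 3-vector into a 4x3 matrix.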

Related

why is 1 appended to the input layer of a neural network?

I am following this tutorial for making a neural network:
https://www.kaggle.com/antmarakis/another-neural-network-from-scratch
I do not understand the train part of this code, where 1 is appended to the input feature vector.
def Train(X, Y, lr, weights):
    layers = len(weights)
    for i in range(len(X)):
        x, y = X[i], Y[i]
        x = np.matrix(np.append(1, x)) # Augment feature vector
        activations = ForwardPropagation(x, weights, layers)
        weights = BackPropagation(y, activations, weights, layers)
    return weights
Any help in understanding this would be appreciated.
Forward propagation involves multiplying by the weights and adding a bias term. The equation is
y = X*W + b. This can be written in a more vectorised form as y = [1, X] * [b; W], where [b; W] stacks b on top of W as an extra row (* stands for matrix multiplication here).
In the code, the weights and biases seem to have been combined into a single weight matrix W, and x is turned into an augmented vector by prepending a one to it.
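A small sketch can confirm that the two forms give the same result. The layer sizes and all numbers below are made up for illustration, and `matvec_row` is a hypothetical helper, not part of the tutorial.

```python
def matvec_row(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical layer: 2 inputs, 3 outputs
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
b = [1.0, 2.0, 3.0]
x = [10.0, 20.0]

# Separate weights and bias: y = x*W + b
y_separate = [v + bj for v, bj in zip(matvec_row(x, W), b)]

# Augmented form: prepend 1 to x and put b as the first row of the weight matrix
x_aug = [1.0] + x
W_aug = [b] + W
y_augmented = matvec_row(x_aug, W_aug)

print(y_separate)
print(y_augmented)
```

The benefit of the augmented form is that the bias is learned by exactly the same update rule as every other weight, so the training code needs no special case for it.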

ifft and using a sum of square waves instead of the sum of sine waves to rebuild a signal

I know that ifft sums multiple sine waves using the data obtained from doing an fft on a signal. Is there a way to do an ifft using square waves instead of sine waves?
I'm not trying to get the original signal back, but trying to rebuild it using square waves from the data taken from the fft, instead of the normal sine wave summation process.
See the simple example below. The signals I will be using are human audio signals about 60 seconds long, so I'm trying to see if I can use or alter the ifft command in some way.
PS: I'm using Octave 4.0, which is similar to Matlab.
clear all,clf reset, clc,tic
Fs = 200; % Sampling frequency
t=linspace(0,1,Fs);
freq=2;
%1 create signal
ya = .5*sin(freq*pi*2*t+pi);
%2 create frequency domain
ya_fft = fft(ya);
%3 rebuild signal
mag = abs(ya_fft);
phase = unwrap(angle(ya_fft));
ya_newifft=ifft(mag.*exp(i*phase));
ifft_sig_combined_L1=ifft(mag.*exp(i*phase),Fs); %use Fs to get correct file length
% square wave
vertoffset=0.5;
A=1
T = 1/freq; % period of the signal
square = mod(t * A / T, A) > A / 2;
square = square - vertoffset;
subplot(3,1,1);
plot(t,ya,'r')
title('original signal')
subplot(3,1,2);
plot(t,ifft_sig_combined_L1)
title('rebuilt signal')
subplot(3,1,3);
plot(t,square)
title('rebuilt signal with square wave')
Define the basis vectors you want to use and let them be the columns of a matrix, A. If b is your signal, then just get the least squares solution to Ax = b. If A is square and full rank, then you will be able to represent b exactly.
Edit:
Think about what a matrix-vector product does: Each column of the matrix is multiplied by the corresponding element of the vector (i.e., the n^th column of the matrix is multiplied by the n^th element of the vector) and the resulting products are summed together. (This would be a lot easier to illustrate if this site supported latex.) In Matlab, a horrible but hopefully illustrative way to do this is
A = some_NxN_matrix;
x = some_Nx1_vector;
b = zeros( size(A,1), 1 );
for n = 1 : length(x)
b = b + A(:,n) * x(n);
end
(Of course, you would never actually do the above but rather b = A*x;.)
Now define whatever square waves you want to use and assign each to its own Nx1 vector. Call these vectors s_1, s_2, ..., s_M, where M is the number of square waves you are using. Now let
A = [s_1, s_2, ..., s_M];
According to your question, you want to represent your signal as a weighted sum of these square waves. (Note that this is exactly what a DFT does; it just uses orthogonal sinusoids rather than square waves.) To weight and sum these square waves, all you have to do is find the matrix-vector product A*x, where x is the vector of coefficients that weight each column (see the above paragraph). Now, if your signal is b and you want to find the x that will best sum the square waves in order to approximate b, then all you have to do is solve A*x=b. In Matlab, this is given by
x = A \ b;
The rest is just linear algebra. If a left-inverse of A exists (i.e., if A has dimensions M x N and rank N, with M > N, where M and N now denote the numbers of rows and columns), then, writing A^-1 for that left-inverse, (A^-1) * A is an identity matrix and
(A^-1) * A * x = (A^-1) * b,
which implies that x = (A^-1) * b, which is what x = A \ b; will return in Matlab (the least-squares solution). If A has dimensions M x N and rank M, with N > M, then the system is underdetermined and a left-inverse does not exist. In this case you have to use the pseudo-inverse to solve the system. Now suppose that A is NxN with rank N, so that both the left- and right-inverse exist. In this case, x will give an exact representation of b:
x = (A^-1) * b
A * x = A * (A^-1) * b = b
If you want an example of A that uses square waves to get an exact representation of the input signal, check out the Haar transform. There is a function available here.
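As a minimal sketch of the square, full-rank case, the following uses a Walsh-Hadamard matrix as the square-wave basis. This is a substitution chosen for brevity: it is neither the Haar transform mentioned above nor a general `A \ b` solve. Because its columns are mutually orthogonal, the coefficients can be read off with inner products. The signal values and the helper `hadamard` are made up for illustration.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of 2)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

N = 8
A = hadamard(N)   # each column is a +/-1 square-wave-like pattern

# Example signal b (arbitrary values, for illustration only)
b = [0.0, 0.35, 0.5, 0.35, 0.0, -0.35, -0.5, -0.35]

# Columns of A are mutually orthogonal with squared norm N, so the solution of
# A x = b is x = A' b / N (no general linear solve is needed in this special case)
x = [sum(A[i][j] * b[i] for i in range(N)) / N for j in range(N)]

# Rebuild the signal as a weighted sum of the square-wave columns: b_rebuilt = A x
b_rebuilt = [sum(A[i][j] * x[j] for j in range(N)) for i in range(N)]

print(x)
print(b_rebuilt)
```

For square waves that are not orthogonal (for example, plain square waves at several frequencies), the inner-product shortcut no longer applies and the coefficients have to come from the least-squares solve described above.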

MATLAB: The determinant of a covariance matrix is either 0 or inf

I have a 1500x1500 covariance matrix whose determinant I am trying to calculate for the EM-ML method. The covariance matrix is obtained by finding the SIGMA matrix and then passing it into the nearestSPD library to make the matrix positive definite. In this case the matrix is always singular. Another method I tried was manually generating a positive definite matrix using the A'*A technique (A was taken as a 1600x1500 matrix). This always gives me an infinite determinant. Any idea how I can get a positive definite matrix with a finite determinant?
Do you actually need the determinant, or the log of the determinant?
For example, if you are computing a log likelihood of Gaussians, then what enters into the log likelihood is the log of the determinant. In high dimensions a determinant may not fit in a double, but its log most likely will.
If you perform a Cholesky factorisation of the covariance C, with (lower triangular) factor L, say, so that
C = L*L'
then
det C = det(L) * det( L') = det(L) * det(L)
But the determinant of a lower triangular matrix is the product of its diagonal elements, so, taking logs above we get:
log(det(C)) = 2 * sum over i of log(L[i,i])
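The formula above can be sketched as follows. `cholesky_lower` is a plain textbook Cholesky written out for illustration, not a library routine, and the test matrices are made up.

```python
import math

def cholesky_lower(C):
    """Return lower-triangular L with C = L * L' (C must be symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def logdet_spd(C):
    """log(det(C)) computed as 2 * sum(log(diag(L)))."""
    L = cholesky_lower(C)
    return 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))

# Small check: det([[4, 2], [2, 3]]) = 8
print(logdet_spd([[4.0, 2.0], [2.0, 3.0]]), math.log(8.0))

# A 200 x 200 diagonal covariance 100 * I has det = 1e400, which overflows a double,
# but its log-determinant is just 200 * log(100)
n = 200
big = [[100.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
print(logdet_spd(big))
```

Summing the logs of the diagonal, rather than taking the log of their product, is what keeps the intermediate values inside the range of a double.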
(In response to a comment)
Even if you need to calculate a Gaussian pdf, it is better to calculate the log of that and exponentiate only when you need to. For example, a d-dimensional Gaussian with covariance C (which has a Cholesky factor L) and mean 0 (purely to save typing) is:
p(x) = exp( -0.5*x'*inv(C)*x ) / sqrt( pow(2*pi,d) * det(C) )
so
log p(x) = -0.5*x'*inv(C)*x - 0.5*d*log(2*pi) - 0.5*log(det(C))
which can also be written
log p(x) = -0.5*y'*y - 0.5*d*log(2*pi) - log(det(L))
where
y = inv(L)*x
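A minimal sketch of this last form, assuming the Cholesky factor L is already available. The 2x2 factor below is hard-coded for illustration, and y = inv(L)*x is obtained by forward substitution rather than by forming the inverse.

```python
import math

def forward_substitute(L, x):
    """Solve L * y = x for lower-triangular L."""
    y = []
    for i in range(len(x)):
        s = x[i] - sum(L[i][k] * y[k] for k in range(i))
        y.append(s / L[i][i])
    return y

def gaussian_logpdf_zero_mean(x, L):
    """Log-density of N(0, C) at x, where C = L * L' and L is lower triangular."""
    d = len(x)
    y = forward_substitute(L, x)
    log_det_L = sum(math.log(L[i][i]) for i in range(d))
    return -0.5 * sum(v * v for v in y) - 0.5 * d * math.log(2.0 * math.pi) - log_det_L

# Cholesky factor of C = [[4, 2], [2, 3]] (assumed precomputed)
L = [[2.0, 0.0],
     [1.0, math.sqrt(2.0)]]
print(gaussian_logpdf_zero_mean([1.0, 1.0], L))
```

Neither inv(C) nor det(C) is ever formed here, which avoids both the overflow problem and an unnecessary matrix inversion.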

Generating multivariate normally distributed random numbers in Matlab

This question is about the use of the covariance matrix in the multidimensional normal distribution:
I want to generate multi-dimensional random numbers x in Matlab with a given mean mu and covariance matrix Sigma. Assuming Z is a standard normally distributed random number (e.g. generated using randn), what is the correct code:
x = mu + chol(Sigma) * Z
or
x = mu + Sigma ^ 0.5 * Z
?
I am not sure about the use of the covariance matrix in the definition of the multidimensional normal distribution – whether the determinant in the denominator is of the square root or the Cholesky factor...
If by definition you refer to the density of the multivariate normal distribution,
f(x) = (2*pi)^(-p/2) * det(Σ)^(-1/2) * exp(-0.5 * (x - μ)' * inv(Σ) * (x - μ)),
it contains neither the Cholesky decomposition nor the matrix square root of Σ, but its inverse and the scalar square root of its determinant.
But for numerically generating random numbers from this distribution, the density is not helpful. It is not even the most general description of the multivariate normal distribution, since the density formula only makes sense for positive definite matrices Σ, while the distribution is also defined if there are zero eigenvalues; that just means that the variance is 0 in the direction of the respective eigenvector.
Your question follows the approach to start from standard multivariate normally distributed random numbers Z as produced by randn, and then apply a linear transformation. Assuming that mu is a p-dimensional row vector we want an nxp-dimensional random matrix (each row one observation, each column one variable):
Z = randn(n, p);
x = mu + Z * A;
We need a matrix A such that the covariance of x is Sigma. Since the covariance of Z is the identity matrix, the covariance of x is given by A' * A. A solution to this is given by the Cholesky decomposition, so the natural choice is
A = chol(Sigma);
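where A is an upper triangular matrix. (With column vectors instead, as in the question, the factor has to be transposed: x = mu + chol(Sigma)' * Z.)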
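The row-vector recipe can be sketched in a few lines. The 2x2 factor `R` below is hard-coded for illustration and plays the role of chol(Sigma), i.e. it is upper triangular with R' * R = Sigma. The mean, the sample size and the seed are likewise arbitrary choices.

```python
import math
import random

def sample_mvn_rows(mu, R, n, rng):
    """Draw n rows x = mu + z * R with z standard normal, so cov(x) = R' * R."""
    p = len(mu)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        rows.append([mu[j] + sum(z[k] * R[k][j] for k in range(p)) for j in range(p)])
    return rows

def sample_cov(rows):
    """Unbiased sample covariance of the columns."""
    n, p = len(rows), len(rows[0])
    m = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

# Upper-triangular factor of Sigma = [[4, 2], [2, 3]], i.e. R' * R = Sigma
R = [[2.0, 1.0],
     [0.0, math.sqrt(2.0)]]
mu = [1.0, -1.0]

rng = random.Random(0)          # fixed seed so the run is repeatable
x = sample_mvn_rows(mu, R, 20000, rng)
S = sample_cov(x)
print(S)                        # should be close to [[4, 2], [2, 3]]
```

The sample covariance only approximates Sigma, with an error that shrinks as the number of rows grows.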
However, we can also search for a Hermitian solution, A' = A, and then A' * A becomes A^2, the matrix square. A solution to this is given by a matrix square root, which is computed by replacing each eigenvalue of Sigma by its square root (or its negative); in general there are 2ⁿ possible solutions for n positive eigenvalues. The Matlab function sqrtm returns the principal matrix square root, which is the unique nonnegative-definite solution. Therefore,
A = sqrtm(Sigma)
works also. Sigma ^ 0.5 should in principle do the same.
Simulations using this code
p = 10;
n = 1000;
nr = 1000;
cp = nan(nr, 1);
sp = nan(nr, 1);
pp = nan(nr, 1);
for i = 1 : nr
    x = randn(n, p);
    Sigma = cov(x);
    cS = chol(Sigma);
    cp(i) = norm(cS' * cS - Sigma);
    sS = sqrtm(Sigma);
    sp(i) = norm(sS' * sS - Sigma);
    pS = Sigma ^ 0.5;
    pp(i) = norm(pS' * pS - Sigma);
end
mean([cp sp pp])
show that chol is more precise than the two other methods, and profiling shows that it is also much faster, for both p = 10 and p = 100.
The Cholesky decomposition does, however, have the disadvantage that it is only defined for positive-definite Σ, while the matrix square root merely requires Σ to be nonnegative-definite (sqrtm issues a warning for a singular input, but returns a valid result).

Basic SVM Implemented in MATLAB

Linearly Non-Separable Binary Classification Problem
First of all, this program isn't working correctly for RBF (gaussianKernel()) and I want to fix it.
It is a non-linear SVM demo to illustrate classifying 2 classes with a hard-margin application.
The problem is about 2-dimensional, radially distributed random data.
I used a quadratic programming solver to compute the Lagrange multipliers (alphas):
xn = input .* (output*[1 1]); % xiyi
phi = gaussianKernel(xn, sigma2); % Radial Basis Function
k = phi * phi'; % Symmetric Kernel Matrix For QP Solver
gamma = 1; % Adjusting the upper bound of alphas
f = -ones(2 * len, 1); % Coefficient of sum of alphas
Aeq = output'; % yi
beq = 0; % Sum(ai*yi) = 0
A = zeros(1, 2* len); % A * alpha <= b; this problem has no such inequality constraint
b = 0; % Unused (no inequality constraint)
lb = zeros(2 * len, 1); % Lower bound of alphas
ub = gamma * ones(2 * len, 1); % Upper bound of alphas
alphas = quadprog(k, f, A, b, Aeq, beq, lb, ub);
To solve this non-linear classification problem, I wrote some kernel functions, such as Gaussian (RBF) and homogeneous and non-homogeneous polynomial kernel functions.
For RBF, I implemented the function in the image below:
Using the Taylor series expansion, it yields:
And I separated the Gaussian kernel like this:
K(x, x') = phi(x)' * phi(x')
The implementation of this idea is:
function phi = gaussianKernel(x, Sigma2)
gamma = 1 / (2 * Sigma2);
featDim = 10; % Number of Taylor series terms; the terms decay toward 0, so the series is truncated rather than infinite-dimensional
phi = []; % Kernel Output, The Dimension will be (#Sample) x (featDim*2)
for k = 0 : (featDim - 1)
% Gaussian kernel trick using the Taylor series expansion
phi = [phi, exp( -gamma .* (x(:, 1)).^2) * sqrt(gamma^2 * 2^k / factorial(k)) .* x(:, 1).^k, ...
exp( -gamma .* (x(:, 2)).^2) * sqrt(gamma^2 * 2^k / factorial(k)) .* x(:, 2).^k];
end
end
*** I think my RBF implementation is wrong, but I don't know how to fix it. Please help me here.
Here is what I got as output:
where,
1) The first image : Samples of Classes
2) The second image : Marking The Support Vectors of Classes
3) The third image : Adding Random Test Data
4) The fourth image : Classification
Also, I implemented the homogeneous polynomial kernel "K(x, x') = (<x, x'>)^2"; the code is:
function phi = quadraticKernel(x)
% 2nd-order homogeneous polynomial kernel
phi = [x(:, 1).^2, sqrt(2).*(x(:, 1).*x(:, 2)), x(:, 2).^2];
end
And I got surprisingly nice output:
To sum up, the program works correctly with the homogeneous polynomial kernel, but when I use RBF it doesn't work correctly; there is something wrong with the RBF implementation.
If you know about RBF (the Gaussian kernel), please let me know how I can make it right.
Edit: If you have the same issue, use the RBF kernel directly as defined above and don't separate it into phi.
Why do you want to compute phi for the Gaussian kernel? Phi will be an infinite-dimensional vector, and you are truncating your Taylor series to 10 terms when we don't even know whether 10 is enough to approximate the kernel values! Usually, the kernel is computed directly instead of getting phi (and then computing k). For example, see [1].
Does this mean we should never compute phi for the Gaussian kernel? Not really, no, but we have to be slightly smarter about it. There have been recent works [2,3] which show how to compute phi for the Gaussian kernel so that you can compute approximate kernel matrices with just finite-dimensional phi's. Here [4] I give some very simple code to generate the approximate kernel using the trick from the paper. However, in my experiments I needed to generate anywhere from 100- to 10000-dimensional phi's to get a good approximation of the kernel (depending on the number of features the original input had, as well as the rate at which the eigenvalues of the original matrix taper off).
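For illustration, here is a minimal sketch of one such finite-dimensional approximation, random Fourier features in the style of [2]. It is not the code from [4]. The function names, the number of features, the value of gamma and the two test points are all assumptions made for this example. The kernel being approximated is exp(-gamma * ||x - y||^2).

```python
import math
import random

def make_random_features(dim, num_features, gamma, rng):
    """Sample frequencies w ~ N(0, 2*gamma*I) and phases b ~ U[0, 2*pi]."""
    std = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return W, b

def phi(x, W, b):
    """Finite-dimensional feature map: sqrt(2/D) * cos(w . x + b) for each of the D features."""
    scale = math.sqrt(2.0 / len(W))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bj)
            for w, bj in zip(W, b)]

def rbf(x, y, gamma):
    """Exact Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))

gamma = 0.5
rng = random.Random(0)          # fixed seed so the run is repeatable
W, b = make_random_features(dim=2, num_features=5000, gamma=gamma, rng=rng)

x1, x2 = [0.3, -0.2], [1.0, 0.5]
approx = sum(p * q for p, q in zip(phi(x1, W, b), phi(x2, W, b)))
exact = rbf(x1, x2, gamma)
print(approx, exact)
```

The inner product of the two feature vectors is only a Monte Carlo estimate of the kernel value, so its accuracy improves slowly with the number of features, which is consistent with the large dimensions reported above.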
For the moment, just use code similar to [1] to generate the Gaussian kernel and then observe the result of the SVM. Also, play around with the gamma parameter; a bad gamma parameter can result in really bad classification.
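As a rough sketch of what computing the kernel directly looks like (this is not the code from [1]; the sample points and gamma values are made up), the kernel matrix can be filled in entry by entry from the pairwise squared distances:

```python
import math

def rbf_kernel_matrix(X, gamma):
    """K[i][j] = exp(-gamma * ||X[i] - X[j]||^2), computed directly from the samples."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * sq_dist)
    return K

# Three made-up 2-D samples
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

sigma2 = 1.0
K = rbf_kernel_matrix(X, gamma=1.0 / (2.0 * sigma2))
for row in K:
    print(row)

# A very large gamma drives all off-diagonal entries toward 0 (K approaches the identity),
# a very small gamma drives them toward 1
K_sharp = rbf_kernel_matrix(X, gamma=50.0)
K_flat = rbf_kernel_matrix(X, gamma=1e-4)
```

The two extreme cases hint at why gamma matters so much: with K close to the identity every sample looks unrelated to every other, and with K close to all ones every sample looks the same.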
[1] https://github.com/ssamot/causality/blob/master/matlab-code/Code/mfunc/indep/HSIC/rbf_dot.m
[2] http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf
[3] http://www.eecs.berkeley.edu/~brecht/papers/08.rah.rec.nips.pdf
[4] https://github.com/aruniyer/misc/blob/master/rks.m
Since the Gaussian kernel is often described as mapping to infinitely many dimensions, I always have faith in its capacity. The problem here may be due to a bad parameter; keep in mind that a grid search is always needed for SVM training. Thus I suggest you take a look here, where you can find some tricks for parameter tuning. An exponentially increasing sequence is usually used for the candidate values.