Calculation detail of Batch Normalization - neural-network

I am implementing the BN algorithm in my CNN model in C++, but I am confused about how the mean and standard deviation are calculated. Consider the two candidate computations below.
As far as I know, a mean has to be computed over pixels, but I don't know which of the following computations is correct.
Let's assume that the BN's input is [B, H, W], where B is the batch size and (H, W) is the feature map size (suppose [B, H, W] = [3, 2, 2]).
Method 1: the mean of all pixels in one image, e.g. for image 0:
([0,0,0] + [0,0,1] + [0,1,0] + [0,1,1]) / 4
Here the mini-batch means are scalars (one per image).
Method 2: the mean of the same pixel location across the images in the batch (image index = batch index):
([0,0,0] + [1,0,0] + [2,0,0]) / 3,
([0,0,1] + [1,0,1] + [2,0,1]) / 3,
([0,1,0] + [1,1,0] + [2,1,0]) / 3,
([0,1,1] + [1,1,1] + [2,1,1]) / 3
Here the mini-batch means form an H x W array.
Which calculation is right?

Neither is correct. Method 1 merely normalizes each image; it has nothing to do with the batch as a whole. Method 2 will almost certainly damage your training, as it decouples adjacent pixels from one another.
You need to compute the mean and variance of all pixels in the batch. Then normalize each pixel with the scale and shift equation using those batch statistics.
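A minimal sketch of that computation (in MATLAB, to match the rest of this page; x, gamma, beta, and epsilon are assumed names). Note that with a channel dimension [B, C, H, W], the statistics are instead computed per channel, which is the standard BN formulation:

% x is a B-by-H-by-W array (one channel); gamma, beta, epsilon are scalars
mu     = mean(x(:));                         % one mean over all B*H*W pixels
sigma2 = var(x(:), 1);                       % biased variance, as in the BN paper
x_hat  = (x - mu) ./ sqrt(sigma2 + epsilon); % normalize every pixel
y      = gamma .* x_hat + beta;              % learned scale and shift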

Related

Why is the number of sample frequencies in `scipy.signal.stft()` tied to the hop size?

This question relates to SciPy's Short-time Fourier Transform function for signal processing.
For some reason I don't understand, the size of the output 'array of sample frequencies' is exactly equal to the hop size. From the documentation:
nperseg : int, optional
Length of each segment. Defaults to 256.
noverlap : int, optional
Number of points to overlap between segments. If None, noverlap = nperseg // 2. Defaults to None. When specified, the COLA constraint must be met (see Notes below).
f : ndarray
Array of sample frequencies.
hop size H = nperseg - noverlap
I'm new to signal processing and Fourier transforms, but as far as I understand an STFT just chops an audio file into segments ('time frames') and performs a Fourier transform on each. So if I do an STFT over 100 time frames, I'd expect the output to be a matrix of size 100 x F, where F is the number of measured frequencies ('measured' probably isn't the right word here, but you know what I mean).
This is kinda what SciPy's implementation does, but the size of f here is what bothers me. It's supposed to be an array describing the different frequencies, like [0 Hz, 500 Hz, 1000 Hz], and it is, but for some reason its size is exactly the same as the hop size. If the hop size is 700, the number of measured frequencies is 700.
The hop size is the number of samples (i.e. time) between each time frame, and is correctly calculated as H = nperseg - noverlap, but what does this have to do with the frequency array?
Edit: Related to this question
An FFT is a square matrix transform from one orthogonal basis to another of the same dimension. This is because N is exactly the number of orthogonal (i.e., mutually non-interfering) complex sinusoids that fit in a time-domain vector of length N.
A longer time vector can contain more frequency information (e.g. it's hard to tell 2 frequencies apart using just 3 sample points, but much easier with 3000 samples, etc.)
You can zero-pad your short time vector of length N to use a longer FFT, but that is equivalent to interpolating a smooth curve between the N original frequency points, which makes the extra FFT results interdependent rather than new information.
For many purposes (visualization, etc.) an STFT is overlapped, where the adjacent segments share some overlapped data instead of just being end-to-end. This gives better time locality (e.g. the segments can be spaced closer but still be long enough so that each one can provide the frequency resolution required).
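In other words, the size of f is set by the segment length, not by the hop. The near-match you observed comes from SciPy's default 50% overlap: hop = nperseg - nperseg//2, while the one-sided FFT of a length-nperseg segment has nperseg//2 + 1 bins. A minimal sketch of that count (in MATLAB with a plain fft, to keep it toolbox-free; fs and the segment are made up):

% A length-N segment yields N FFT bins (N/2 + 1 one-sided), regardless of hop
fs      = 8000;                          % assumed sample rate
nperseg = 1400;                          % segment length
hop     = 700;                           % nperseg - noverlap at 50% overlap
seg     = randn(nperseg, 1);             % stand-in for one windowed segment
spec    = fft(seg);                      % nperseg complex bins
f       = (0:nperseg/2) * fs / nperseg;  % one-sided frequency axis
numel(f)                                 % 701: set by nperseg, not by hop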

Scale correction for IFFT of smaller frequency space created by FFT

This might be considered a repost of this question; however, I am seeking a much deeper explanation of this matter and how to solve the problem properly.
I want to study the PSF/SRF of a voxel in a 44x44 matrix. For that I create a matrix 100x bigger (4400x4400), so one voxel in the smaller matrix corresponds to 100x100 voxels in the bigger one, and I set those 100^2 voxels to 1.
Now I do a FFT of the big matrix and an IFFT of only the center portion (44x44) of the frequency space. This is the code:
A = zeros(4400,4400);
A(2201:2300,2201:2300) = 1;                    % one small-grid voxel = 100x100 block
B = fftshift(fft2(A));                         % centered frequency space
C = ifft2(ifftshift(B(2179:2222,2179:2222)));  % inverse-transform the central 44x44
D = numel(C)/numel(B) * C;                     % attempted scale correction
figure, subplot(1,3,1), imshow(A), subplot(1,3,2), imshow(real(C)), subplot(1,3,3), imshow(real(D));
The problem is the following: I would expect the value in the voxels of the new 44x44 matrix to be 1. However, with this numel factor correction the values drop to about 0.35, and without it they blow up to huge values.
For starters, let me try to clarify the scaling issue: For the DFT/IDFT there are various scaling conventions regarding the input size. You either need a factor of 1/N in the DFT or a factor of 1/N in the IDFT or a factor of 1/sqrt(N) in both. All have pros and cons and all are equally valid.
Matlab uses the 1/N in the IDFT convention, as you can see in the documentation.
In your example, the forward DFT has size 4400 and the backward IDFT size 44. The IDFT therefore divides by 44 where matching the forward transform would require dividing by 4400, so your values come out a factor of 100 too large. Since you're doing a 2-D DFT/IDFT, the factor of 100 appears once per dimension, so your rescaling should be 100^2. Your numel(C)/numel(B) does exactly that; I've just tried to give you the explanation for it.
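A 1-D sketch of the same bookkeeping (sizes copied from the question, everything else made up): truncating the spectrum from length N to length M and calling ifft leaves the result a factor N/M too large, which is exactly what the numel correction undoes, once per dimension:

N = 4400; M = 44;
a = zeros(N,1); a(2201:2300) = 1;    % 1-D analogue of the big matrix
b = fftshift(fft(a));                % centered spectrum, length N
c = ifft(ifftshift(b(2179:2222)));   % keep only the central M bins
d = (M/N) * c;                       % match the 1/M IDFT to the size-N FFT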
A reason why you might not see the 1 is that you're plotting only the real part of the inverse DFT. Since you did some fftshifting you might have introduced a phase so that part of your signal is in the imaginary part.
Edit: Another reason is that you truncate B to the central 44x44 window before transforming back. Since A is not bandlimited, B also has energy outside this window; by truncating you lose part of it, so it is not surprising that the resulting amplitude is lower.
[Image: zoom on B, with a red square marking the retained 44x44 window; noticeable energy lies outside it.]
The red square is what you keep; everything else is truncated. By Parseval's theorem, the total energy in the image and Fourier domains is equal, so by truncating in frequency you must also reduce the energy of your signal in the image domain.
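A quick 1-D check of that energy argument (a sketch with a made-up signal; the 1/N factor is MATLAB's IDFT convention again):

% Parseval: energy matches across domains, so truncating the spectrum loses energy
a = randn(4400,1);
b = fft(a);
sum(abs(a).^2)                    % time-domain energy
sum(abs(b).^2) / numel(b)         % the same value, via Parseval
sum(abs(b(1:44)).^2) / numel(b)   % strictly less: truncation discards energy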

Should I perform data centering before applying SVD?

I have to use SVD in Matlab to obtain a reduced version of my data.
I've read that the function svds(X,k) performs the SVD and returns the k largest singular values and the corresponding singular vectors. There is no mention in the documentation of whether the data have to be normalized.
By normalization I mean both subtraction of the mean value and division by the standard deviation.
When I implemented PCA, I used to normalize in that way. But I know that it is not needed when using the MATLAB function pca(), because it computes the covariance matrix with cov(), which implicitly performs the centering.
So, the question is: I need the projection matrix to reduce my n-dim data to k-dim via SVD. Should I normalize the training data (and then apply the same normalization to new data before projecting them) or not?
Thanks
Essentially, the answer is yes, you should typically perform normalization. The reason is that features can have very different scalings, and we typically do not want to take scaling into account when considering the uniqueness of features.
Suppose we have two features x and y, both with variance 1, but where x has a mean of 1 and y has a mean of 1000. Then the matrix of samples will look like
n = 500; % samples
x = 1 + randn(n,1);
y = 1000 + randn(n,1);
svd([x,y])
But the problem with this is that the scale of y (without normalizing) essentially washes out the small variations in x. Specifically, if we just examine the singular values of [x,y], we might be inclined to say that x is a linear factor of y (since one of the singular values is much smaller than the other). But actually, we know that that is not the case since x was generated independently.
In fact, you will often find that you only see the "real" data in a signal once you remove the mean. At the extreme end, imagine that we have some feature
z = 1e6 + sin(t)
Now if somebody just gave you those numbers, you might look at the sequence
z = 1000001.54, 1000001.2, 1000001.4,...
and just think, "that signal is boring, it's basically just 1e6 plus some round-off terms...". But once we remove the mean, we see the signal for what it actually is... a very interesting and specific one indeed. So, long story short, you should always remove the means and scale.
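Continuing the x/y example above with a quick comparison (a sketch, not part of the original experiment): centering each column before the SVD makes the two independent features show up as two comparable singular values instead of one dominant one:

n  = 500;
x  = 1 + randn(n,1);
y  = 1000 + randn(n,1);
svd([x, y])                        % one huge singular value (driven by y's mean)
Xc = [x - mean(x), y - mean(y)];   % center each column
svd(Xc)                            % two comparable values: x and y are independent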
It really depends on what you want to do with your data. Centering and scaling can be helpful to obtain principal components that are representative of the shape of the variations in the data, irrespective of scaling. I would say it is mostly needed if you want to use the principal components themselves, particularly if you want to visualize them. It can also help during classification, since your scores will then be normalized, which may help your classifier. However, it depends on the application: in some applications the energy also carries useful information that one should not discard, so there is no general answer!
Now you write that all you need is "the projection matrix useful to reduce my n-dim data to k-dim ones by SVD". In this case, no need to center or scale anything:
[U,~] = svd(TrainingData);
ReducedData = U(:,1:k)' * TestData;
will do the job. svds may be worth considering when your TrainingData is huge in both dimensions, so that svd is too slow (if it is huge in only one dimension, just apply svd to the Gram matrix).
It depends!!!
A common use case in signal processing where it makes no sense to normalize is noise reduction via dimensionality reduction of correlated signals, where all the features are contaminated with Gaussian noise of the same variance. In that case, if the magnitude of a certain feature is twice as large, its SNR is also approximately twice as large, so normalizing the features makes no sense: it would amplify the parts with the worse SNR and shrink the parts with the better SNR. You also don't need to subtract the mean in that case (as you would for PCA); the mean (or DC component) isn't different from any other frequency.
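A hedged sketch of that use case (all signals, sizes, and the rank are invented): correlated channels sharing a few underlying components plus equal-variance Gaussian noise, denoised by truncating the SVD with no centering or scaling at all:

t = (0:999)';
S = [sin(0.02*t), cos(0.05*t), sin(0.11*t)];  % three underlying components
M = randn(3, 20);                             % mixing into 20 correlated channels
X = S*M + 0.1*randn(1000, 20);                % same noise variance everywhere
[U, D, V] = svd(X, 'econ');
r  = 3;                                       % keep the dominant subspace
Xd = U(:,1:r) * D(1:r,1:r) * V(:,1:r)';       % denoised channels, no normalization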

Explaining corr2 function in Matlab

Can someone explain the correlation function corr2 in MATLAB? I know that it compares the similarity of 2-D objects, but in the equation I am unsure what A and B are (probably the matrices being compared), and what Amn and Bmn are.
I'm not sure how MATLAB executes this function, because I have found several cases where the correlation is not computed over the entire image (matrix); instead the image is divided into blocks, and blocks of one picture are compared with blocks of the other.
MATLAB's documentation gives the corr2 equation without any reference explaining where it comes from, unlike other functions whose documentation cites the book in which the method is explained.
The correlation coefficient is a number representing the similarity between two images, based on their pixel intensities.
As you pointed out, this function calculates that coefficient:

r = ΣΣ (A_mn − Ā)(B_mn − B̄) / sqrt( ΣΣ (A_mn − Ā)² · ΣΣ (B_mn − B̄)² )

Here A and B are the images you are comparing, and the subscripts m and n index the pixel location. Basically, for every pixel location in both images, MATLAB computes the difference between the intensity at that pixel and the mean intensity of the whole image, denoted by the letter with a bar over it (Ā, B̄).
As Kostya pointed out, typing edit corr2 in the command window will show you the code used by Matlab to compute the correlation coefficient. The formula is basically this:
a = a - mean2(a);
b = b - mean2(b);
r = sum(sum(a.*b))/sqrt(sum(sum(a.*a))*sum(sum(b.*b)));
where:
a is the input image and b is the image you wish to compare to a.
If we break down the formula, we see that a - mean2(a) and b-mean2(b) are the elements in the numerator of the above equation. mean2(a) is equivalent to mean(mean(a)) or mean(a(:)), that is the mean intensity of the whole image. This is only calculated once.
The 3rd line of code calculates the coefficient. Here sum(sum(a.*b)) calculates the double-sum present in the formula element-wise, that is considering each pixel location separately. Be aware that using sum(a) calculates the sum in every column individually, hence in order to get a single value you need to apply sum twice.
Pretty much the same happens in the denominator, except that the calculations are performed on (a - mean2(a)).^2 and (b - mean2(b)).^2. You can see this as a kind of normalization that accounts for the spread of pixel intensities within each individual image.
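To convince yourself that those three lines match the built-in, you can compare them directly (a quick check with made-up test images; corr2 and mean2 require the Image Processing Toolbox):

a  = rand(64);  b = rand(64);    % two arbitrary test images
am = a - mean2(a);  bm = b - mean2(b);
r_manual  = sum(sum(am.*bm)) / sqrt(sum(sum(am.^2)) * sum(sum(bm.^2)));
r_builtin = corr2(a, b);         % agrees with r_manual to round-off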
As for your last comment, you can break down an image into small blocks and calculate the correlation coefficient on them; that might save some time for very large images but since everything is vectorized the calculation is quite fast. It might be useful in distributed processing I guess. Of course the correlation coefficient between 2 blocks of images is not necessarily identical to that of the whole image.
For the sake of curiosity you can look at this paper which highlights some caveats in using the correlation coefficient for image comparison.
Hope that makes things a bit clearer!

computing PCA matrix for set of sift descriptors

I want to compute a general PCA matrix for a dataset, and I will use it to reduce the dimensionality of SIFT descriptors. I have found some algorithms for computing it, but I couldn't find a way to do it in MATLAB.
Can someone help me?
[coeff, score] = princomp(X)
is the right thing to do, but knowing how to use it is a little tricky.
My understanding is that you did something like:
sift_image = sift_fun(img)
which gives you a binary image, sift_image?
(Even if not binary, this still works.)
Inputs, formulating X:
To use princomp/pca, formulate X so that each column is a numel(sift_image) x 1 vector (i.e. sift_image(:)).
Do this for all your images and line them up as columns in X. So X will be numel(sift_image) x num_images.
If your images aren't the same size (e.g. pixel dimensions different, more or less of a scene in the images), then you'll need to bring them into some common space, which is a whole different problem.
Unless your stuff is binary, you'll probably want to de-mean/normalize X, both in the column direction (i.e. normalizing each individual image) and the row direction (de-meaning the whole dataset), as sketched below.
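Put together, formulating X might look like this (a sketch under the assumptions above; image_files and the common image size are invented):

% Stack each image as one column of X
num_images = numel(image_files);
first      = im2double(imread(image_files{1}));
X          = zeros(numel(first), num_images);
for i = 1:num_images
    im     = im2double(imread(image_files{i}));
    X(:,i) = im(:);                 % vectorize: numel(im) x 1
end
X = X - mean(X, 2);                 % de-mean across the dataset (row direction)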
Outputs
score is the set of eigenvectors: it will be num_pixels x num_images.
To get, say, the first eigenvector back into image shape, do:
first_component = reshape(score(:,1),size(im));
And so on for the rest of the components. There are as many components as input images.
Each row of coeff is the num_images (equal to num_components) set of weights that can be applied to generate each input image. i.e.
input_image_1 = reshape(score * coeff(1,:)', size(original_im));
where input_image_1 is the correct, original shape
coeff(1,:)' is a vector (num_images x 1)
score is pixels x num_images
(Disclaimer: I may have the columns/rows mixed up, but the descriptions are correct.)
Does that help?
If you have access to Statistics Toolbox, you can use the command princomp, or in recent versions the command pca.
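For the original goal of compressing 128-dimensional SIFT descriptors, a minimal sketch using pca (descriptors, new_descriptors, and k are placeholder names; rows are observations):

% descriptors: num_descriptors x 128, one SIFT descriptor per row
k = 32;                                 % target dimensionality (placeholder)
[coeff, score] = pca(descriptors);      % pca centers the data internally
reduced = score(:, 1:k);                % k-dim training descriptors
% Project new descriptors with the same mean and basis:
mu          = mean(descriptors, 1);
new_reduced = (new_descriptors - mu) * coeff(:, 1:k);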