How to normalize FFT values for neural networks

I calculate the FFT for a given sound file and get an array of shape e.g. (100, 257), with 100 rows and 257 frequency bins. I want to use this as the input to a neural network, but first I want to normalize it with the librosa library:
https://librosa.github.io/librosa/generated/librosa.util.normalize.html#librosa.util.normalize
Should I normalize over axis=0 or axis=1? axis=0 normalizes each column, aggregated over the rows, and axis=1 normalizes each row. Or should I normalize over all values, independent of rows and columns?

How you normalize the FFT depends on your application and on the final performance you get. There isn't a general normalization scheme.
In one of my applications I didn't normalize at all and fed the raw FFT to the neural network. One common way to normalize is to take the logarithm, which reduces the dynamic range.
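To make the options from the question concrete, here is a minimal NumPy sketch. The helper `peak_normalize` is a hypothetical name, not a librosa function; it only imitates max-abs scaling along an axis, and the random array is a stand-in for real FFT magnitudes. Which variant works best is something to check against validation performance.

```python
import numpy as np

def peak_normalize(S, axis=None):
    """Scale S so the largest absolute value along `axis` is 1.

    axis=None uses a single scale factor for the whole array.
    (Hypothetical helper, written here only for illustration.)
    """
    peak = np.max(np.abs(S), axis=axis, keepdims=True)
    # Guard against division by zero for all-zero slices.
    return S / np.where(peak == 0, 1.0, peak)

rng = np.random.default_rng(0)
# Stand-in for an FFT magnitude array: 100 frames x 257 frequency bins.
mag = np.abs(rng.normal(size=(100, 257))) * 50.0

log_mag = np.log1p(mag)                       # log compression reduces the dynamic range
per_bin = peak_normalize(log_mag, axis=0)     # each column (frequency bin) scaled over all frames
per_frame = peak_normalize(log_mag, axis=1)   # each row (frame) scaled on its own
global_scaled = peak_normalize(log_mag)       # one scale factor for every value
```

With axis=1 the relative loudness between frames is discarded, while axis=0 and the global variant keep it, so the choice depends on whether loudness carries information for the task.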

Related

Is it better to individually normalize all inputs for a neural network?

I'm working on a neural network with Keras using TensorFlow as the backend right now, and my model takes 5 inputs, all normalized to 0 to 1. The inputs' units vary from m/s to meters to m/s/s. So, for example, one input could vary from 0 m/s to 30 m/s, while another input could vary from 5 m to 200 m in the training dataset.
Is it better to individually and independently normalize all inputs so that I have different scales for each unit/input? Or would normalizing all inputs to one scale (mapping 0-200 to 0-1 for the example above) be better for accuracy?
Normalize each input individually. If you normalize everything by dividing by 200, some inputs will affect your network less than others. If one input varies between 0 and 30, dividing by 200 gives a 0-0.15 scale, while the input that varies between 0 and 200 ends up on a 0-1 scale. The 0-30 input then covers a much smaller range of values, and you are telling your network that it is less relevant than the one with 0-200.
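A small NumPy sketch of the difference, using made-up values in the ranges from the question (the array `X_train` and its columns are assumptions for illustration only):

```python
import numpy as np

# Hypothetical training data: column 0 is a speed in m/s, column 1 a distance in m.
X_train = np.array([[0.0, 5.0],
                    [15.0, 100.0],
                    [30.0, 200.0]])

# Independent scaling: each column uses its own min and max.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_independent = (X_train - col_min) / (col_max - col_min)

# Shared scaling: everything divided by the global maximum (200).
X_shared = X_train / X_train.max()

print(X_independent[:, 0])  # speed spans the full 0..1 range
print(X_shared[:, 0])       # speed is squeezed into 0..0.15
```

The same `col_min` and `col_max` computed on the training set would then be reused for validation and test data.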

How to normalize an audio file so that the length (row) of the data is equal?

I want to compare 2 audio files (each one is a recording of someone saying "ba a ta") using the existing MATLAB function for Dynamic Time Warping (DTW). Before doing the dynamic time warping, I get an array/vector from MATLAB's Fast Fourier Transform (FFT) function. My code so far (MATLAB filename: test.m):
fftRecording1 = fft(audioread('C:\Users\handy\Documents\MATLAB\my_recording_1.wav'));
fftRecording2 = fft(audioread('C:\Users\handy\Documents\MATLAB\fajar.wav'));
dist = dtw(fftRecording1, fftRecording2);
When I try the DTW function I get an error because the lengths (rows) of the arrays/vectors for the 2 files are different. Error message:
Error using dtw (line 82)
The number of rows between X and Y must be equal when X and Y are matrices
Error in test (line 3)
dist = dtw(fftRecording1, fftRecording2);
contents of the fftRecording1 and fftRecording2 variables
My question is: before doing the FFT and DTW, how do I normalize, step by step, so that the lengths (rows) of the 2 audio files are equal? Or are there other ways to make the data lengths (rows) of the 2 audio files equal?
According to dtw's documentation:
To stretch the inputs, dtw repeats each element of x and y as many times as necessary. If x and y are matrices, then dist stretches them by repeating their columns. In that case, x and y must have the same number of rows.
In your case the columns represent the audio channels and the rows represent the quantity to be aligned (i.e. the reverse of what dtw expects). To set up the inputs the way dtw expects, simply transpose them:
dist = dtw(transpose(fftRecording1), transpose(fftRecording2));
Dynamic Time Warping does not need the input sequences to be of the same length. DTW is actually used to find the similarity between two sequences that are aligned differently in time.
No, they don't need to have the same length in the time sense. They need to have the same number of dimensions (2D signal, 3D signal, ...), which is equivalent to their number of rows. The whole idea of DTW is to match similar contents which might be stretched to different lengths, so there would be absolutely no point in requiring the inputs to have the same length.
Related to your question: just call dtw with the transposes of your signals and you will get a proper result.
dtw(signal1', signal2');
You should apply DTW to the original signals rather than to their Fourier transforms. The FFT transfers the signal from the time domain to the frequency domain, so instead of warping signal1 in time to match signal2, you are warping along the frequency axis when you use the FFT before DTW. The amplitude of the Fourier transform also depends on the number of points in the FFT time window. From my point of view there is absolutely no point in applying DTW to a Fourier transform.
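The point that DTW does not require equal-length inputs can be illustrated with a minimal textbook implementation in Python. This is a sketch under my own assumptions, not MATLAB's dtw: the function name `dtw_distance` and the absolute-difference local cost are choices made here, and the toy sequences are invented.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) DTW with absolute difference as local cost."""
    n, m = len(x), len(y)
    # D[i, j] holds the cheapest cost of aligning x[:i] with y[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0])  # same shape as a, stretched in time
c = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 2.0])            # constant signal of yet another length

print(dtw_distance(a, b))  # 0.0, because b is only a time-stretched copy of a
print(dtw_distance(a, c))  # larger, because the contents differ
```

The three sequences have lengths 5, 8 and 6, and the distance reflects the content rather than the length.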

Can I normalise subsets of training data for a neural network?

Say I have a training set with 50 vectors. I split this set into 5 sets each with 10 vectors and then I scale the vectors in each subset and normalise the subsets. Then I train my ANN with each vector from each subset.
After training is complete, I group my test set into subsets of 10 vectors each, scale the features of the vectors in each subset and normalise each subset and then feed it to the neural network to attempt to classify it.
Is this the right approach? Is it right to scale and normalise each subset, each with its own minimum, maximum, mean and standard deviation?

Making feature vector from Gabor filters for classification

My aim is to classify types of cars (Sedans,SUV,Hatchbacks) and earlier I was using corner features for classification but it didn't work out very well so now I am trying Gabor features.
code from here
The features are now extracted: when I give an image as input, for 5 scales and 8 orientations I get 2 [1x40] matrices.
1. 40 columns of squared Energy.
2. 40 columns of mean Amplitude.
The problem is that I want to use these two matrices for classification, and I have about 230 images of 3 classes (SUV, sedan, hatchback).
I do not know how to create an [N x 230] matrix which can be taken as vInputs by the neural network in MATLAB (where N is the total number of features of one image).
My question:
How do I create a one-dimensional image vector from the 2 [1x40] matrices for one image? (Should I append the mean Amplitude to the squared Energy matrix to get a [1x80] matrix, or something else?)
Should I be using these Gabor features for my classification purpose in the first place? If not, then what?
Thanks in advance
In general, there is nothing to think about: a simple neural network requires a one-dimensional feature vector and does not care about the ordering, so you can simply concatenate any number of feature vectors into one (even in a random order, as long as the same order is used for every image). In particular, if you have feature matrices, you can likewise concatenate their rows to get a vectorized format.
The only exception is when your data actually has some underlying geometrical dependencies, for example when the matrix is actually a matrix of pixels. In such a case, architectures like PyraNet, Convolutional Neural Networks and others, which apply some kind of receptive field based on this 2D structure, should work better. Those implementations simply accept a 2D feature array as input.
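The concatenation described above can be sketched in Python with NumPy (the question uses MATLAB, so treat this as an illustration of the layout only). The random arrays are placeholders for the real Gabor outputs, and the variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 230

features = []
for _ in range(n_images):
    # Placeholders for the two Gabor outputs of one image (5 scales x 8 orientations).
    squared_energy = rng.random((1, 40))
    mean_amplitude = rng.random((1, 40))
    # Append one to the other: a single 80-element vector per image.
    features.append(np.concatenate([squared_energy.ravel(), mean_amplitude.ravel()]))

# One column per image, one row per feature: shape (80, 230).
inputs = np.stack(features, axis=1)
print(inputs.shape)
```

Here N is 80, and every column keeps the same feature order (energy first, then amplitude).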

FFT: Match samples to frequency

Let us assume:
I have a vector t with the times in seconds of my samples. (These samples are not equally spaced in the time domain.)
I also have a vector data containing the sample values at the times t.
t and data have the same length.
If I plot the graph, I get some sort of periodic signal.
Now I could perform abs(fft(data)) to get my spectrum, which is then plotted against the index of the data points on the x-axis.
How can I obtain my spectrum with respect to the times in vector t and plot it?
I want to see which frequencies in 1/s, or which periods in s, my signal contains.
Thanks for your help.
[Not the OP's intention]: FFT will give you the spectrum (global) for any number of input data points. You cannot have a specific data point (in time) associated with parts (or the full) spectrum.
What you can do instead is use spectrogram and obtain the Short-Time Fourier Transform (STFT). This will give you an NxM discrete grid of time-frequency FT values (N: FT frequency bins, M: signal time windows).
By localizing the (overlapping) STFT windows on your data samples of interest you will get N frequency magnitude values, thus the distribution of short-term spectrum estimates as the signal changes in time.
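As a rough illustration of the NxM grid, here is a minimal STFT in Python with NumPy. It is a sketch, not MATLAB's spectrogram: the function name `stft_magnitude`, the window length, the hop size and the test signal are all assumptions, and it presumes evenly spaced samples.

```python
import numpy as np

def stft_magnitude(x, fs, win_len=256, hop=128):
    """Return (freqs, times, mag) where mag has shape (n_bins, n_frames)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    # Cut the signal into overlapping, windowed frames (one frame per row).
    frames = np.stack([x[i * hop:i * hop + win_len] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T           # N frequency bins x M time windows
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)          # bin frequencies in Hz
    times = (np.arange(n_frames) * hop + win_len / 2) / fs  # window centres in s
    return freqs, times, mag

fs = 1000.0
t = np.arange(2000) / fs
# Test signal: 50 Hz during the first second, 200 Hz during the second.
x = np.where(t < 1.0, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

freqs, times, mag = stft_magnitude(x, fs)
print(mag.shape)                                       # (129, 14)
print(freqs[mag[:, 0].argmax()], freqs[mag[:, -1].argmax()])  # peak near 50 Hz, then near 200 Hz
```

Each column of `mag` is the short-term spectrum around one entry of `times`, which is how the spectrum gets tied to positions in time.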
See also the possibly relevant answer here: https://stackoverflow.com/a/12085728/651951
EDIT/UPDATE:
For unevenly spaced data you need to consider the Non-Uniform DFT (and Non-uniform FFT implementations). See the relevant question/answer here https://scicomp.stackexchange.com/q/593
The primary approaches for NFFT or NUFFT are based on creating a uniform grid through local convolutions/interpolation, running an FFT on this grid, and undoing the convolutional effect of the interpolation filter.
You can read more:
A. Dutt and V. Rokhlin, Fast Fourier transforms for nonequispaced data, SIAM J. Sci. Comput., 14, 1993.
L. Greengard and J.-Y. Lee, Accelerating the Nonuniform Fast Fourier Transform, SIAM Review, 46 (3), 2004.
M. Pippig and D. Potts, Particle Simulation Based on Nonequispaced Fast Fourier Transforms, in: Fast Methods for Long-Range Interactions in Complex Systems, 2011.
For an implementation (with an interface to MATLAB) try NFFT and possibly its parallelized version PNFFT. You may find a nice walk-through on how to set it up and use it here.
You can resample or interpolate your sample points to get another set of sample points that are equally spaced in t. The chosen spacing (sample rate) of this second, equally spaced set then lets you assign frequencies to the bins of an FFT of that set.
The results may be noisy or include aliasing unless the initial data set is bandlimited to a sufficiently low frequency to allow interpolation. If bandlimited, then you might try something like cubic splines as an interpolation method.
Although it may look like one can get a high FFT bin frequency resolution by resampling to a larger number of data points, the actual useful resolution accuracy will be more related to the original number of samples.
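The resample-then-FFT idea can be sketched in Python with NumPy as follows. The sample times, the 2 Hz test signal and the chosen rate `fs` are all invented for illustration, and linear interpolation via `np.interp` is used instead of cubic splines only to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical unevenly spaced sample times over 10 s, sampling a 2 Hz signal.
t = np.sort(rng.uniform(0.0, 10.0, size=400))
data = np.sin(2 * np.pi * 2.0 * t)

# Resample onto an equally spaced grid.
fs = 40.0                                      # chosen sample rate in Hz (an assumption)
t_uniform = np.arange(t[0], t[-1], 1.0 / fs)
data_uniform = np.interp(t_uniform, t, data)   # linear interpolation

spectrum = np.abs(np.fft.rfft(data_uniform))
freqs = np.fft.rfftfreq(len(data_uniform), d=1.0 / fs)  # frequency axis in Hz (1/s)

peak = freqs[1:][spectrum[1:].argmax()]        # skip the DC bin
print(peak, 1.0 / peak)                        # dominant frequency in Hz and its period in s
```

Plotting `spectrum` against `freqs` gives the spectrum in 1/s, and `1.0 / freqs[1:]` gives the corresponding periods in seconds.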