Why is the number of sample frequencies in `scipy.signal.stft()` tied to the hop size?

This question relates to SciPy's Short-time Fourier Transform function for signal processing.
For some reason I don't understand, the size of the output 'array of sample frequencies' is exactly equal to the hop size. From the documentation:
nperseg : int, optional
Length of each segment. Defaults to 256.
noverlap : int, optional
Number of points to overlap between segments. If None, noverlap = nperseg // 2. Defaults to None. When specified, the COLA constraint must be met (see Notes below).
f : ndarray
Array of sample frequencies.
hop size H = nperseg - noverlap
I'm new to signal processing and Fourier transforms, but as far as I understand a STFT is just chopping an audio file into segments ('time frames') on which you perform a Fourier transform. So if I want to do a STFT on 100 time frames, I'd expect the output to be a matrix of size 100 x F, where F is an array of measured frequencies ('measured' probably isn't the right word here but you know what I mean).
This is kinda what SciPy's implementation does, but the size of f here is what bothers me. It's supposed to be an array describing the different frequencies, like [0Hz 500Hz 1000Hz], and it is, but for some reason its size is exactly the same as the hop size. If the hop size is 700, the number of measured frequencies is 700.
The hop size is the number of samples (i.e. time) between each time frame, and is correctly calculated as H = nperseg - noverlap, but what does this have to do with the frequency array?
Edit: Related to this question

An FFT is a square matrix transform from one orthogonal basis to another of the same dimension. This is because N is the exact number of orthogonal (i.e. non-interfering) complex sinusoids that fit in a time-domain vector of length N.
A longer time vector can contain more frequency information (e.g. it's hard to tell 2 frequencies apart using just 3 sample points, but much easier with 3000 samples, etc.)
You can zero-pad your short time vector of length N to use a longer FFT, but that is identical to interpolating a nice curve between N frequency points, which makes all the FFT results interdependent.
For many purposes (visualization, etc.) an STFT is overlapped, where the adjacent segments share some overlapped data instead of just being end-to-end. This gives better time locality (e.g. the segments can be spaced closer but still be long enough so that each one can provide the frequency resolution required).
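As a rough sketch of why the two numbers in the question look linked: for real input, the number of one-sided frequency bins follows from the FFT length (nfft, which defaults to nperseg), while the hop follows from nperseg - noverlap. With the default noverlap = nperseg // 2 the two land within one of each other, but changing the overlap breaks the coincidence. The helper name stft_sizes is hypothetical; it only mirrors the defaults quoted in the question and does not call SciPy.

```python
def stft_sizes(nperseg=256, noverlap=None, nfft=None):
    """Return (hop, n_freqs) for a real-valued, one-sided STFT."""
    if noverlap is None:
        noverlap = nperseg // 2      # documented default: 50% overlap
    if nfft is None:
        nfft = nperseg               # assumption: no extra zero-padding
    hop = nperseg - noverlap         # samples between adjacent frames
    n_freqs = nfft // 2 + 1          # bins from 0 Hz up to Nyquist
    return hop, n_freqs


default_case = stft_sizes(nperseg=1400)                  # 50% overlap
more_overlap = stft_sizes(nperseg=1400, noverlap=1050)   # 75% overlap
print(default_case, more_overlap)
```

In the second call the hop shrinks while the number of frequency bins stays the same, which suggests the frequency axis depends on the segment (FFT) length rather than on the hop.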

Related

Frequency domain phase shift, amplitude, hop size and non-linearity

I am trying to implement a frequency-domain phase shift, but there are a few points I am not sure about.
1- I am able to get a perfect reconstruction from a sine or sweep signal using a Hanning window with a hop size of 50%. However, how should I normalise my result when using a hop size > 50%?
2- When shifting the phase of low-frequency signals (f<100, window size<1024, fs=44100) I can clearly see some non-linearity in my result. Is this because the window size is too short for low frequencies?
Thank you very much for your help.
clear
freq=500;
fs=44100;
endTime=0.02;
t = 1/fs:1/fs:(endTime);
f1=linspace(freq,freq,fs*endTime);
x = sin(2*pi*f1.*t);
targetLength=numel(x);
L=1024;
w=hanning(L);
H=L*.50;% Hopsize of 50%
N=1024;
%match input length with window length
x=[zeros(L,1);x';zeros(L+mod(length(x),H),1)];
pend=length(x)- L ;
pin=0;
count=1;
X=zeros(N,1);
buffer0pad= zeros(N,1);
outBuffer0pad= zeros(L,1);
y=zeros(length(x),1);
delay=-.00001;
df = fs/N;
f= -fs/2:df:fs/2 - df;
while pin<pend
buffer = x(pin+1:pin+L).*w;
%append zero padding in the middle
buffer0pad(1:(L)/2)=buffer((L)/2+1: L);
buffer0pad(N-(L)/2+1:N)=buffer(1:(L)/2);
X = fft(buffer0pad,N);
% Phase modification
X = abs(X).*exp(1i*(angle(X))-(1i*2*pi*f'*delay));
outBuffer=real(ifft(X,N));
% undo zero padding----------------------
outBuffer0pad(1:L/2)=outBuffer(N-(L/2-1): N);
outBuffer0pad(L/2+1:L)=outBuffer(1:(L)/2);
%Overlap-add
y(pin+1:pin+L) = y(pin+1:pin+L) + outBuffer0pad;
pin=pin+H;
count=count+1;
end
%match output length with original input length
output=y(L+1:numel(y)-(L+mod(targetLength,H)));
figure(2)
plot(t,x(L+1:numel(x)-(L+mod(targetLength,H))))
hold on
plot(t,output)
hold off
Anything below 100 Hz fits only about two cycles or fewer in your FFT window (1024 samples at 44100 Hz last roughly 23 ms, and shorter windows hold even less). Note that a DFT or FFT represents any waveform, including a single non-integer-periodic sinusoid, by summing up a whole bunch of sinusoids of possibly very different frequencies, i.e. a lot more than just one. That's just how the math works.
For a von Hann window containing fewer than 2 cycles, these are often mostly completely different frequencies (possibly very far away, in percentage terms, from your low frequency). Shifting the phase of all those different frequencies may or may not shift your windowed low-frequency sinusoid by the desired amount (depending on how far your signal is from being integer-periodic).
Also, for low frequencies, the complex-conjugate mirror needs to be shifted in the opposite direction in phase in order to still represent a completely real result. So you end up mixing 2 overlapped and opposite phase changes, which again is mostly a problem if the low-frequency signal is far from being integer-periodic in the DFT aperture.
Using a window that is longer in time and samples allows more cycles of a given frequency to fit inside it (so that less energy from sinusoids of very different frequencies is needed to synthesize your low-frequency sinusoid), and the complex conjugate is farther away in terms of FFT result bin index, thus reducing interference.
A sequence using any hop of a von Hann window that is 50% / (some integer) of the window length is non-lossy (except for the very first or last window). A hop larger than 50% does not sum to a constant, so it modulates or destroys information and can't be normalized by a single constant for reconstruction.
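The overlap-add claim can be checked numerically with a small sketch. It assumes a periodic von Hann window (MATLAB's hanning(L) is the symmetric variant, so its sum is only approximately constant), and the function names are illustrative, not taken from the question's code.

```python
import math


def periodic_hann(length):
    # Periodic von Hann window: an assumption made for this sketch.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def overlap_sum_range(length, hop):
    """Min and max of the summed, hop-shifted windows in steady state."""
    w = periodic_hann(length)
    # Position n within one hop receives w[n], w[n + hop], w[n + 2*hop], ...
    totals = [sum(w[m] for m in range(n, length, hop)) for n in range(hop)]
    return min(totals), max(totals)


for hop in (512, 256, 768):
    lo, hi = overlap_sum_range(1024, hop)
    print(hop, round(lo, 6), round(hi, 6))
```

With a 1024-sample window, hops of 512 and 256 give a flat sum (1 and 2 respectively), so dividing by that constant restores the amplitude. A hop of 768 gives a sum that varies from sample to sample, which is why no single normalisation constant works there.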

Any good ways to obtain zero local means in audio signals?

I have asked this question on DSP.SE before, but my question has got no attention. Maybe it was not so related to signal processing.
I need to divide a discrete audio signal into segments in order to do some statistical processing and analysis on them. Segments with a fixed local mean would therefore be very helpful in my case. The segment length is predefined, e.g. 512 samples.
I have tried several things. I do use reshape() function to divide audio signal into segments, and then calculate means of every segment as:
L = 512; % Length of segment
N = floor(length(audio(:,1))/L); % Number of segments
seg = reshape(audio(1:N*L,1), L, N); % Reshape into LxN sized matrix
x = mean(seg); % Calculate mean of each column
Subtracting x(k) from each seg(:,k) would make each local mean zero, yet it would distort the audio signal a lot when the segments are joined back together.
So, since the mean of a Hanning window is almost 0.5, subtracting 2*x(k)*hann(L) from each seg(:,k) was the first thing I tried. But this time, multiplying by 2 (to make the mean of the Hanning window almost equal to 1) distorted the neighborhood of the midpoint of each segment.
Then I used convolution with a smaller Hanning window instead of multiplying directly, and subtracted the result (as shown in the figure below) from each seg(:,k).
This last step gives better results, yet it is still not very useful when the segments are smaller. I have seen many amazing approaches on this site for different problems, so I wonder if there are any clever ways or existing methods to obtain zero local means that distort an audio signal less. I read that this property is useful in some decompositions such as EMD. So maybe I need such a decomposition?
You can try to use a moving average filter:
x = cumsum(rand(15*512, 1)-0.5); % generate a random input signal
mean_filter = 1/512 * ones(1, 512); % moving-average (mean) filter over 512 samples
% filtfilt is used instead of filter to obtain a symmetric, zero-phase moving average
local_mean = filtfilt(mean_filter, 1, x); % named local_mean so the built-in mean() is not shadowed
% plot the result
figure
subplot(2,1,1)
plot(x);
hold on
plot(local_mean);
subplot(2,1,2)
plot(x - local_mean);
You can tune the filter by changing the length of the mean filter. A shorter filter results in lower means inside each interval, but it also removes more of the low frequencies from your signal.
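The same idea can be sketched without any toolbox. The version below is a plain centred moving average built from a running sum, not an equivalent of filtfilt (which filters forwards and backwards), and the window simply shrinks near the edges. The function name and the edge handling are assumptions made for illustration.

```python
def remove_local_mean(x, width):
    """Subtract a centred moving average of roughly `width` samples from x."""
    half = width // 2
    prefix = [0.0]
    for value in x:                      # running sum for O(1) window means
        prefix.append(prefix[-1] + value)
    out = []
    for i in range(len(x)):
        lo = max(0, i - half)            # window is clipped at the edges
        hi = min(len(x), i + half + 1)
        local_mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        out.append(x[i] - local_mean)
    return out


offset_signal = [3.0] * 50
ramp = [float(n) for n in range(100)]
flat = remove_local_mean(offset_signal, 11)
detrended = remove_local_mean(ramp, 11)
```

A constant offset is removed everywhere, and a slow linear drift is removed away from the edges, while faster content passes through largely unchanged. The width plays the same role as the filter length discussed above.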

Why do I obtain a skewed spectrum from the FFT? (Matlab)

I try to find the strongest frequency component with Matlab. It works, but if the datapoints and periods are not nicely aligned, I need to zero-pad my data to increase the FFT resolution. So far so good.
The problem is that, when I zero-pad too much, the frequency with the maximal power changes, even if everything is aligned nicely and I would expect a clear result.
This is my MWE:
Tmax = 1024;
resolution = 1024;
period = 512;
X = linspace(0,Tmax,resolution);
Y = sin(2*pi*X/period);
% N_fft = 2^12; % still fine, max_period is 512
N_fft = 2^13; % now max_period is 546.1333
F = fft(Y,N_fft);
power = abs(F(1:N_fft/2)).^2;
dt = Tmax/resolution;
freq = (0:N_fft/2-1)/N_fft/dt;
[~, ind] = max(power);
max_period = 1/freq(ind)
With zero-padding up to 2^12 everything works fine, but when I zero-pad to 2^13, I get a wrong result. It seems like too much zero-padding shifts the spectrum, but I doubt it. I rather expect a bug in my code, but I cannot find it. What am I doing wrong?
EDIT: It seems like the spectrum is skewed towards the low frequencies. Zero-padding just makes this visible:
Why is my spectrum skewed? Shouldn't it be symmetric?
Here is a graphic explanation of what you're doing wrong (which is mostly a resolution problem).
EDIT: this shows the power for each fft data point, mapped to the indices of the 2^14 dataset. That is, the indices for the 2^13 data numbered 1,2,3 map to 1,3,5 on this graph; the indices for 2^12 data numbered 1,2,3 map to 1,5,9; and so on.
You can see that the "true" value should in fact not be 512 -- your indexing is off by 1 or a fraction of 1.
It's not a bug in your code. It has to do with the properties of the DFT (and thus the FFT, which is merely a fast version of the DFT).
When you zero-pad, you add frequency resolution, particularly on the lower end.
Here you use a sine wave as a test, so you are basically convolving a finite-length sine with finite sines and cosines (see https://en.wikipedia.org/wiki/Fast_Fourier_transform for details) that have almost the same or a lower frequency.
If you were doing a "proper" fft, i.e. doing integrals from -inf to +inf, even those low frequency components would give you zero coefficients for the FFT, but since you are doing finite sums, the result of those convolutions is not zero and hence the actual computed fourier transform is inaccurate.
TL;DR: Use a better window function!
The long version:
After searching further, I finally found the explanation. The problem is neither the indexing nor the additional low-frequency components added by the zero-padding. The culprit is the frequency response of the rectangular window combined with the negative-frequency components. I found this out on this website explaining window functions.
I made more plots to explain:
Top: The frequency response without windowing: two delta peaks, one at the positive and one at the negative frequency. I always plotted the positive part, since I didn't expect to need the negative frequency components.
Middle: The frequency response of the rectangular window function. It is relatively broad, but I didn't care, because I thought I'd have only a single peak.
Bottom: The frequency response of the zero-padded signal. In time domain, this is the multiplication of window function and sine-wave. In frequency domain, this amounts to the convolution of the frequency response of the window function with the frequency response of the perfect sine. Since there are two peaks, the relatively broad frequency responses of the window overlap significantly, leading to a skewed spectrum and therefore a shifted peak.
The solution: A way to circumvent this is to use a proper window function, like a Hamming window, whose frequency response has much lower side lobes, leading to less overlap between the two peaks.
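A small numerical sketch of this effect, under stated assumptions: a sine with exactly 2 cycles in a 256-sample record, a periodic von Hann window standing in for "a better window", and a direct evaluation of the spectrum on a fine grid around bin 2 instead of a zero-padded FFT. The helper name peak_bin is hypothetical.

```python
import cmath
import math


def peak_bin(signal, window, lo=1.8, hi=2.2, steps=400):
    """Locate the spectral magnitude peak, in units of DFT bins."""
    n_samples = len(signal)
    xw = [s * w for s, w in zip(signal, window)]
    best_bin, best_mag = lo, -1.0
    for k in range(steps + 1):
        b = lo + (hi - lo) * k / steps   # fractional bin being probed
        acc = sum(v * cmath.exp(-2j * math.pi * b * n / n_samples)
                  for n, v in enumerate(xw))
        if abs(acc) > best_mag:
            best_bin, best_mag = b, abs(acc)
    return best_bin


N = 256
sine = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]   # true peak: bin 2
rect = [1.0] * N
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

rect_peak = peak_bin(sine, rect)
hann_peak = peak_bin(sine, hann)
print(rect_peak, hann_peak)
```

With the rectangular window the located peak sits below bin 2, i.e. at a longer period than the true one, which is the direction of the shift seen in the question. With the tapered window the peak stays closer to bin 2, because the leakage from the negative-frequency peak is weaker.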

What is a spectrogram and how do I set its parameters?

I am trying to plot the spectrogram of my time domain signal given:
N=5000;
phi = (rand(1,N)-0.5)*pi;
a = tan((0.5.*phi));
i = 2.*a./(1-a.^2);
plot(i);
spectrogram(i,100,1,100,1e3);
The problem is I don't understand the parameters and what values should be given. These values that I am using, I referred to MATLAB's online documentation of spectrogram. I am new to MATLAB, and I am just not getting the idea. Any help will be greatly appreciated!
Before we actually go into what that MATLAB command does, you probably want to know what a spectrogram is. That way you'll get more meaning into how each parameter works.
A spectrogram is a visual representation of the Short-Time Fourier Transform. Think of this as taking chunks of an input signal and applying a local Fourier Transform on each chunk. Each chunk has a specified width and you apply a Fourier Transform to this chunk. You should take note that each chunk has an associated frequency distribution. For each chunk that is centred at a specific time point in your time signal, you get a bunch of frequency components. The collection of all of these frequency components at each chunk and plotted all together is what is essentially a spectrogram.
The spectrogram is a 2D heat map where the horizontal axis represents time and the vertical axis represents frequency. The colour at each point encodes the magnitude of a particular frequency component at a particular time: the lower the magnitude, the darker the colour, and the higher the magnitude, the lighter the colour.
Here's one perfect example of a spectrogram:
Source: Wikipedia
Therefore, for each time point, we see a distribution of frequency components. Think of each column as the frequency decomposition of a chunk centred at this time point. For each column, we see a varying spectrum of colours. The darker the colour is, the lower the magnitude component at that frequency is and vice-versa.
So!... now you're armed with that, let's go into how MATLAB works in terms of the function and its parameters. The way you are calling spectrogram conforms to this version of the function:
spectrogram(x,window,noverlap,nfft,fs)
Let's go through each parameter one by one so you can get a greater understanding of what each does:
x - This is the input time-domain signal you wish to find the spectrogram of. It can't get much simpler than that. In your case, the signal you want to find the spectrogram of is defined in the following code:
N=5000;
phi = (rand(1,N)-0.5)*pi;
a = tan((0.5.*phi));
i = 2.*a./(1-a.^2);
Here, i is the signal you want to find the spectrogram of.
window - If you recall, we decompose the signal into chunks, and each chunk has a specified width. window defines the width of each chunk in terms of samples. As this is a discrete-time signal, you know that this signal was sampled with a particular sampling frequency and sampling period. You can determine how large the window is in terms of samples by:
window_samples = window_time/Ts
Ts is the sampling time of your signal. Setting the window size is actually very empirical and requires a lot of experimentation. Basically, the larger the window size, the better the frequency resolution you get, as you're capturing more of the frequencies, but the time localization is poor. Similarly, the smaller the window size, the better the localization you have in time, but you don't get that great a frequency decomposition. I don't have any suggestions here on what the optimal size is... which is why wavelets are preferred when it comes to time-frequency decomposition. For each "chunk", the chunks get decomposed into smaller chunks of a dynamic width so you get a mixture of good time and frequency localization.
noverlap - Overlapping the chunks is another way to ensure good time localization. A proper spectrogram lets consecutive chunks share a certain number of samples, and noverlap defines how many samples are shared between adjacent windows. The default is 50% of the width of each chunk.
nfft - You are essentially taking the FFT of each chunk. nfft tells you how many FFT points are computed per chunk. The default is the larger of 256 and the next power of 2 at or above the window length, i.e. max(256, 2^p) with p = ceil(log2(Nw)), where Nw is the window length. nfft also sets how finely the frequency axis is sampled. A higher number of FFT points gives a denser frequency grid and thus shows finer-grained detail along the frequency axis of the spectrogram when visualised, although the ability to separate nearby peaks is still limited by the window length.
fs - The sampling frequency of your signal. The default is 1 Hz, but you can override this to whatever the sampling frequency your signal is at.
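To make the roles of the parameters concrete, here is a minimal sketch that works out the time and frequency grid implied by a call such as spectrogram(i,100,1,100,1e3) on a 5000-sample signal. It assumes a real-valued input (one-sided spectrum) and that a trailing partial segment is dropped rather than padded. The function name spectrogram_grid is hypothetical and not part of MATLAB.

```python
def spectrogram_grid(n_samples, window_len, noverlap, nfft, fs):
    """Describe the time/frequency grid implied by the spectrogram parameters."""
    hop = window_len - noverlap              # samples between chunk starts
    n_frames = (n_samples - noverlap) // hop # assumption: partial chunk dropped
    n_freqs = nfft // 2 + 1                  # one-sided spectrum of a real signal
    return {
        "hop": hop,
        "n_frames": n_frames,
        "n_freqs": n_freqs,
        "freq_step_hz": fs / nfft,           # spacing of the frequency axis
        "frame_step_s": hop / fs,            # spacing of the time axis
        "window_s": window_len / fs,         # duration of one chunk
    }


grid = spectrogram_grid(n_samples=5000, window_len=100, noverlap=1, nfft=100, fs=1e3)
for key, value in grid.items():
    print(key, value)
```

For these values each chunk lasts 0.1 s, the chunks barely overlap (1 sample), and the frequency axis is sampled every 10 Hz. Raising noverlap gives more columns, while raising window_len and nfft gives a finer frequency axis.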
Therefore, what you should probably take out of this is that I can't really tell you how to set the parameters. It all depends on what signal you have, but hopefully the above explanation will give you a better idea of how to set the parameters.
Good luck!

Methodology of FFT for Matlab spectrogram / short time Fourier transform functions

I'm trying to figure out how MATLAB does the short time Fourier transforms for its spectrogram function (and related functions like specgram, or stft in Octave). What is curious to me is that you can apparently specify the length of the window and the FFT length (number of output frequencies) independently, whereas I would have expected that these two should be equal (since the length of an FFT'd signal is the same as the length of the original signal). To illustrate what I mean, here is the function call:
[S,F,T]=spectrogram(signal,winSize,overlapSize,fftSize,rate);
winSize is the length of subintervals which are to be (individually) FFT'd, and fftSize is the number of frequency components given in the output. When these are not equal, does Matlab do interpolation to produce the required number of frequency bins?
Ultimately the reason I want to know is so that I can determine the proper units and scaling for the frequencies.
Cheers
A windowed segment of a signal can be zero-padded to a longer vector in order to use a longer FFT. The frequency scaling is determined by the length of the FFT (and the signal's sample rate). The window size and window formula determine the effective resolution, in terms of the ability to separate peaks.
Why do this? Some FFT sizes can be computed more efficiently than others (slightly or a lot, depending on the FFT library used). Also, a longer FFT will calculate more points or bins, thus producing a higher density of interpolated points in a potentially smoother spectrum result.
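A short sketch of the zero-padding behaviour described here, using a direct DFT so that it needs no FFT library. The 16-sample segment, the 64-point FFT length and the 1000 Hz rate are illustrative values, not taken from the question.

```python
import cmath
import math


def dft(x, nfft):
    """Direct DFT of x after zero-padding (or truncating) to nfft points."""
    padded = list(x[:nfft]) + [0.0] * max(0, nfft - len(x))
    return [sum(v * cmath.exp(-2j * math.pi * k * n / nfft)
                for n, v in enumerate(padded))
            for k in range(nfft)]


rate = 1000.0                                    # assumed sample rate in Hz
segment = [math.sin(2 * math.pi * 125.0 * n / rate) for n in range(16)]

plain_spec = dft(segment, 16)                    # fftSize == winSize
padded_spec = dft(segment, 64)                   # fftSize == 4 * winSize

bin_hz_plain = rate / 16                         # spacing of the plain bins
bin_hz_padded = rate / 64                        # spacing of the padded bins
```

Every fourth bin of the padded result coincides with a bin of the unpadded one, and the bins in between are interpolated values of the same underlying spectrum. In both cases bin k corresponds to k * rate / fftSize Hz, so the FFT length and the sample rate set the units of the frequency axis.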