Implementation of Lomb-Scargle periodogram - MATLAB

From the official MATLAB documentation, the Lomb-Scargle periodogram is defined as shown here:
http://www.mathworks.com/help/signal/ref/plomb.html#lomb
Suppose we have some random signal, say
x = rand(1,1000);
The average of this signal can easily be computed as
average = mean(x);
and the variance as
>> average_vector = repmat(average,1,1000);
>> diff = x - average_vector;
>> variance = sum(diff.*diff)/(length(x)-1);
How should I continue? I mean, how do I choose the frequencies? Calculating the time offset is not a problem; let us suppose that we have the time vector
t = 0:0.1:99.9;
so that the total length of the time vector is 1000. In general, for the DFT the frequency bins are represented as multiples of 2*pi/N, where N is the length of the signal, but what about this case? Thanks in advance.

As can be seen from the provided link to the MATLAB documentation, the algorithm does not depend on a specific choice of sampling times tk. Note that for equally spaced sampling times (as you have selected), the same link indicates:
The offset depends only on the measurement times and vanishes when the times are equally spaced.
So, as you put it "calculation of time offset is not a problem".
Similar to the DFT, which can be obtained from the Discrete-Time Fourier Transform (DTFT) by selecting a discrete set of frequencies, we can also choose f[n] = n * sampling_rate/N (where sampling_rate = 10 for your choice of tk). If we disregard the value of PLS(f[n]) for n = mN, where m is any integer (since its value is ill-formed, at least in the formula posted in the link), then for real-valued data samples the periodogram at those frequencies reduces to
PLS(f[n]) = abs(Y[n])^2 / (N * variance)
where Y can be expressed in terms of the diff vector you defined as:
Y = fft(diff);
That said, as indicated on Wikipedia, the Lomb–Scargle method is primarily intended for use with unequally spaced data.
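In case it helps to see everything in one place, here is a minimal MATLAB sketch that evaluates the formula from the linked MathWorks page directly at the bins f[n] = n*fs/N suggested above. The variable names and the choice to stop just below fs/2 are mine, so treat it as an illustration rather than a reference implementation:
x  = rand(1,1000);
t  = 0:0.1:99.9;
N  = length(x);
fs = 1/(t(2)-t(1));                % 10 Hz for this t
d  = x - mean(x);
sigma2 = sum(d.^2)/(N-1);          % the variance computed in the question
f  = (1:floor(N/2)-1)*fs/N;        % skip n = 0 and n = N/2 (ill-formed terms)
P  = zeros(size(f));
for n = 1:length(f)
    w   = 2*pi*f(n);
    tau = atan2(sum(sin(2*w*t)), sum(cos(2*w*t)))/(2*w);  % time offset
    c   = cos(w*(t - tau));
    s   = sin(w*(t - tau));
    P(n) = (sum(d.*c)^2/sum(c.^2) + sum(d.*s)^2/sum(s.^2))/(2*sigma2);
end
plot(f, P)
% For this equally spaced t, P matches abs(Y(2:N/2)).^2/(N*sigma2) with Y = fft(d).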

Related

Why is the number of sample frequencies in `scipy.signal.stft()` tied to the hop size?

This question relates to SciPy's Short-time Fourier Transform function for signal processing.
For some reason I don't understand, the size of the output 'array of sample frequencies' is exactly equal to the hop size. From the documentation:
nperseg : int, optional
Length of each segment. Defaults to 256.
noverlap : int, optional
Number of points to overlap between segments. If None, noverlap = nperseg // 2. Defaults to None. When specified, the COLA constraint must be met (see Notes below).
f : ndarray
Array of sample frequencies.
hop size H = nperseg - noverlap
I'm new to signal processing and Fourier transforms, but as far as I understand an STFT is just chopping an audio file into segments ('time frames') on which you perform a Fourier transform. So if I want to do an STFT on 100 time frames, I'd expect the output to be a matrix of size 100 x F, where F is the number of measured frequencies ('measured' probably isn't the right word here but you know what I mean).
This is kinda what SciPy's implementation does, but the size of f here is what bothers me. It's supposed to be an array describing the different frequencies, like [0Hz 500Hz 1000Hz], and it does, but for some reason its size is exactly the same as the hop size. If the hop size is 700, the number of measured frequencies is 700.
The hop size is the number of samples (i.e. time) between each time frame, and is correctly calculated as H = nperseg - noverlap, but what does this have to do with the frequency array?
Edit: Related to this question
An FFT is a square matrix transform from one orthogonal basis to another of the same dimension. This is because N is the exact number of orthogonal (i.e. mutually non-interfering) complex sinusoids that fit in a time-domain vector of length N.
A longer time vector can contain more frequency information (e.g. it's hard to tell 2 frequencies apart using just 3 sample points, but much easier with 3000 samples, etc.).
You can zero-pad your short time vector of length N to use a longer FFT, but that is equivalent to interpolating a smooth curve between the N original frequency points, which makes all the FFT results interdependent.
For many purposes (visualization, etc.) an STFT is overlapped, where the adjacent segments share some overlapped data instead of just being end-to-end. This gives better time locality (e.g. the segments can be spaced closer but still be long enough so that each one can provide the frequency resolution required).
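If it helps to see the bin count and the hop size decoupled explicitly, here is a small sketch using MATLAB's spectrogram (Signal Processing Toolbox) as a stand-in for scipy.signal.stft; the parameter values are arbitrary:
fs = 8000;
x = randn(1, fs);                        % one second of noise
nperseg = 256;
[~, f1, t1] = spectrogram(x, hann(nperseg), 128, nperseg, fs);  % hop = 128
[~, f2, t2] = spectrogram(x, hann(nperseg), 200, nperseg, fs);  % hop = 56
length(f1)              % nperseg/2 + 1 = 129 frequency bins
length(f2)              % still 129: the overlap does not change the bin count
length(t1), length(t2)  % only the number of time frames changes with the hop
So the length of the frequency array is tied to the segment/FFT length, not to the hop size.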

Sampling at exactly Nyquist rate in Matlab

Today I stumbled upon a strange outcome in MATLAB. Let's say I have a sine wave such that
f = 1;
Fs = 2*f;
t = linspace(0,1,Fs);
x = sin(2*pi*f*t);
plot(x)
and the outcome is in the figure.
when I set,
f = 100
outcome is in the figure below,
What is the exact reason for this? By the Nyquist sampling theorem, it should have generated the sine properly. Of course, when I take Fs >> f I get better results and a very good sine shape. My explanation to myself is that MATLAB is having a hard time with floating-point numbers, but I am not sure if this is true at all. Does anyone have any suggestions?
In the first case you only generate 2 samples (the third input of linspace is number of samples), so it's hard to see anything.
In the second case you generate 200 samples from time 0 to 1 (including those two values). So the sampling period is 1/199, and the sampling frequency is 199, which is slightly below the Nyquist rate. So there is aliasing: you see the original signal of frequency 100 plus its alias at frequency 99.
In other words: the following code reproduces your second figure:
t = linspace(0,1,200);
x = .5*sin(2*pi*99*t) -.5*sin(2*pi*100*t);
plot(x)
The .5 and -.5 above stem from the fact that a sine wave can be decomposed as the sum of two spectral deltas at positive and negative frequencies, and the coefficients of those deltas have opposite signs.
The sum of those two sinusoids is equivalent to amplitude modulation, namely a sine of frequency 99.5 modulated by a sine of frequency 1/2. Since time spans from 0 to 1, the modulator signal (whose frequency is 1/2) only completes half a period. That's what you see in your second figure.
To avoid aliasing you need to increase sample rate above the Nyquist rate. Then, to recover the original signal from its samples you can use an ideal low pass filter with cutoff frequency Fs/2. In your case, however, since you are sampling below the Nyquist rate, you would not recover the signal at frequency 100, but rather its alias at frequency 99.
Had you sampled above the Nyquist rate, for example Fs = 201, the original signal could ideally be recovered from the samples.† But that would require an almost ideal low pass filter, with a very sharp transition between passband and stopband. Namely, the alias would now be at frequency 101 and should be rejected, whereas the desired signal would be at frequency 100 and should be passed.
To relax the filter requirements you can sample well above the Nyquist rate. That way the aliases are farther apart from the signal and the filter has an easier job separating signal from aliases.
† That doesn't mean the graph looks like your original signal (see SergV's answer); it only means that after ideal lowpass filtering it will.
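Here is a quick numerical check of that alias: sampled at fs = 199 Hz, a 100 Hz sine produces exactly the same samples as a negated 99 Hz sine.
fs = 199;
n = 0:198;                 % one second's worth of samples at fs
t = n/fs;
x100 = sin(2*pi*100*t);
x99  = -sin(2*pi*99*t);
max(abs(x100 - x99))       % ~1e-13: sample for sample, the two are indistinguishable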
Your problem is not related to the Nyquist theorem and aliasing. It is simply a problem of graphical representation. You can change your code so that the frequency of the sine is below the Nyquist limit, but the graph will look just as strange as before:
t = linspace(0,1,Fs+2);
plot(sin(2*pi*f*t));
Result:
To explain the problem, I modify your code:
Fs = 100;
f = 12;                  % f << Fs
t = 0:1/Fs:0.5;          % step = 1/Fs
t1 = 0:1/(10*Fs):0.5;    % step = 1/(10*Fs), for a precise graphical representation
subplot(2,1,1);
plot(t, sin(2*pi*f*t), '-b', t, sin(2*pi*f*t), '*r');
subplot(2,1,2);
plot(t1, sin(2*pi*f*t1), 'g', t, sin(2*pi*f*t), 'r*');
See result:
Red stars - values of sin(2*pi*f*t) sampled at rate Fs.
Blue line - the lines connecting the red stars. This is the usual data representation of plot(): linear interpolation between data points.
Green curve - sin(2*pi*f*t1), the same sine sampled ten times more finely.
Your eyes and brain can easily see that these graphs represent the sine.
Change the frequency to a higher value:
f = 48; % 2*f < Fs !!!
Look at the blue lines and red stars. Your eyes and brain can no longer tell that these graphs represent the same sine. But your "red stars" are still valid values of the sine; see the bottom graph.
Finally, here are the same graphics for a sine with frequency f = 50 (2*f = Fs):
P.S.
The Nyquist-Shannon sampling theorem states, for your case, that if:
2*f < Fs
and you have an infinite number of samples (the red stars in our plots),
then you can reproduce the value of the function at any time (the green curve in our plots). You must use sinc interpolation to do it, as sketched below.
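For illustration, here is a minimal sketch of that sinc interpolation for the f = 12, Fs = 100 case (sinc is the Signal Processing Toolbox function; with a finite record the reconstruction is only approximate near the edges):
Fs = 100; f = 12;
n  = 0:Fs-1;  ts = n/Fs;              % the samples (the "red stars")
x  = sin(2*pi*f*ts);
t  = 0:1/(10*Fs):(Fs-1)/Fs;           % fine grid for the reconstruction
xr = zeros(size(t));
for k = 1:length(ts)
    xr = xr + x(k)*sinc(Fs*(t - ts(k)));   % sinc(u) = sin(pi*u)/(pi*u)
end
plot(t, xr, 'g', t, sin(2*pi*f*t), 'k--', ts, x, 'r*');
legend('sinc reconstruction', 'true sine', 'samples');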
copied from Matlab Help:
linspace
Generate linearly spaced vectors
Syntax
y = linspace(a,b)
y = linspace(a,b,n)
Description
The linspace function generates linearly spaced vectors. It is similar to the colon operator ":", but gives direct control over the number of points.
y = linspace(a,b) generates a row vector y of 100 points linearly spaced between and including a and b.
y = linspace(a,b,n) generates a row vector y of n points linearly spaced between and including a and b. For n < 2, linspace returns b.
Examples
Create a vector of 100 linearly spaced numbers from 1 to 500:
A = linspace(1,500);
Create a vector of 12 linearly spaced numbers from 1 to 36:
A = linspace(1,36,12);
linspace does not make the sampling interval explicit, so for Nyquist-rate reasoning you can use the more common form:
t = 0:Ts:1;
or
t = 0:1/Fs:1;
and change the Fs values.
The first figure is due to the floating-point approximation of '0' in sin(0) and sin(2*pi). Notice that the vertical range is on the 10^(-16) level.
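You can check this directly at the MATLAB prompt:
t = linspace(0, 1, 2);     % two samples: t = 0 and t = 1
sin(2*pi*1*t)              % ans = [0  -2.4493e-16], hence the 1e-16 scale on the axis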
I wrote the function reconstruct_FFT that can recover critically sampled data even for short observation intervals if the input sequence of samples is periodic. It performs lowpass filtering in the frequency domain.

What is NFFT used in fft() function in matlab?

I have an audio signal sampled at a rate of 10 kHz, and I need to find the Fourier coefficients of my signal. I saw one example on the MathWorks website where they use the following code to do the FFT decomposition of a signal y:
NFFT = 2^nextpow2(L);
Y = fft(y,NFFT)/L;
f = Fs/2*linspace(0,1,NFFT/2+1);
where L is the length of the signal. I don't really understand why the variable NFFT is defined the way shown in the code above. Can't I just choose any value for NFFT? Also, why are we taking Fs/2 in the third line of the code above?
NFFT can be any positive integer, but FFT computations are typically much more efficient when the transform length can be factored into small primes. Quoting from the MATLAB documentation:
The execution time for fft depends on the length of the transform. It is fastest for powers of two. It is almost as fast for lengths that have only small prime factors. It is typically several times slower for lengths that are prime or which have large prime factors.
It is thus common to compute the FFT for the power of 2 which is greater than or equal to the number of samples of the signal y. This is what NFFT = 2^nextpow2(L) does (in the example from the MATLAB documentation, y is constructed to have length L).
When NFFT > L the signal is zero-padded to the NFFT length.
As far as Fs/2 goes, it is simply because the frequency spectrum of a real-valued signal has Hermitian symmetry (which means that the values of the spectrum above Fs/2 can be obtained from the complex conjugates of the values below Fs/2), and as such it is completely specified by the first NFFT/2+1 values (with the index NFFT/2+1 corresponding to Fs/2). So, instead of showing the redundant information above Fs/2, the example chose to illustrate only the spectrum up to Fs/2.
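Putting that together, here is a minimal self-contained version of the MathWorks example; the 440 Hz tone, its amplitude of 0.7 and the 0.1 s duration are arbitrary choices for illustration:
Fs = 10000;                     % 10 kHz sampling, as in the question
t  = 0:1/Fs:0.1-1/Fs;           % 0.1 s of data, L = 1000 samples
y  = 0.7*sin(2*pi*440*t);       % 440 Hz tone of amplitude 0.7
L  = length(y);
NFFT = 2^nextpow2(L);           % next power of two >= L, so the FFT is zero-padded
Y  = fft(y, NFFT)/L;
f  = Fs/2*linspace(0, 1, NFFT/2+1);
plot(f, 2*abs(Y(1:NFFT/2+1)))   % single-sided amplitude spectrum up to Fs/2
xlabel('Frequency (Hz)'); ylabel('|Y(f)|')
The peak appears near 440 Hz with height close to 0.7; the factor of 2 folds the redundant negative-frequency half back in, and the zero-padding only interpolates the spectrum, it does not add information.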
The output of the FFT is complex for real-valued input. That means that for a signal sampled at Fs Hz, the Fourier transform of this signal will have frequency components from -Fs/2 to Fs/2 and is symmetric about zero Hz. (The Nyquist criterion states that if you have a signal with maximum frequency component at f Hz, you need to sample it with at least 2f Hz.)
You may wonder what negative frequency means here. If you are a mathematician you may care about negative frequencies, but if you are an engineer, you may choose to ignore the notion of negative frequency and focus only on frequencies from 0 to Fs/2. (The maximum frequency component of a signal sampled at Fs Hz is Fs/2.)
Using the FFT directly to learn about the frequency components present in your signal is cumbersome. You can use the pwelch function in MATLAB to learn about the frequencies present in your signal and the power at each of them. MATLAB will automatically compute the NFFT required and return the frequencies present in your signal along with the power at each frequency. Use this syntax:
[p,f] = pwelch(x,[],[],[],Fs)
Look at the documentation of pwelch for more information.

Scipy periodogram terminology confusion

I am confused about the terminology used in scipy.signal.periodogram, namely:
scaling : { 'density', 'spectrum' }, optional
Selects between computing the power spectral density ('density')
where Pxx has units of V**2/Hz if x is measured in V and computing
the power spectrum ('spectrum') where Pxx has units of V**2 if x is
measured in V. Defaults to 'density'
(see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.periodogram.html)
1) A few tests show that the result for option 'density' depends on the signal length, the window length and the sampling frequency (it grows when the signal length increases). How come? I would say that it is exactly the density that should not depend on these things. If I take a longer signal I should just get a more accurate estimate, not a different result. Not to mention that the dependence on window length is also very surprising.
The result diverges in the limit of an infinite signal, which could be a feature of energy, but not of power. Shouldn't the periodogram converge to the real theoretical PSD as the length increases? If so, am I supposed to perform another normalisation outside of the signal.periodogram method?
2) On the contrary, I see that the alternative option 'spectrum' gives what I would previously have called the power spectral density; that is, it gives a result independent of the window segment and window length and consistent with theoretical calculation. For instance, for A*sin(2*pi*f*t) a two-sided solution yields two peaks at -f and f, each of height 0.25*A^2.
There is a lot of literature on this subject, but I get the impression that there is also a lot of incompatible terminology, so I will be thankful for any clarification. The straightforward question is how to interpret these options and their units. (I am used to seeing V^2/Hz, labelled "power spectral density".)
Let's take a real array called data, of length N, and with sampling frequency fs. Let's call the time bin dt=1/fs, and T = N * dt. In frequency domain, the frequency bin df = 1/T = fs/N.
The power spectrum PS (scaling='spectrum' in scipy.periodogram) is calculated as follows:
import numpy as np
import scipy.fft as fft
dft = fft.fft(data)
PS = np.abs(dft)**2 / N ** 2
It has the units of V^2. It can be understood as follows. By analogy to the continuous Fourier transform, the energy E of the signal is:
E := np.sum(data**2) * dt = 1/N * np.sum(np.abs(dft)**2) * dt
(by Parseval's theorem). The power P of the signal is the total energy E divided by the duration of the signal T:
P := E/T = 1/N**2 * np.sum(np.abs(dft)**2)
The power P only depends on the Discrete Fourier Transform (DFT) and the number of samples N, not directly on the sampling frequency fs or signal duration T. The power per frequency channel, i.e. the power spectrum PS, is thus given by the formula above:
PS = np.abs(dft)**2 / N ** 2
For the power spectral density PSD (scaling='density' in scipy.periodogram), one needs to divide the PS by the frequency bin of the DFT, df:
PSD := PS/df = PS * N * dt = PS * N / fs
and thus:
PSD = np.abs(dft)**2 / N * dt
This has the units of V^2/Hz = V^2 * s, and now depends on the sampling frequency. That way, integrating the PSD over the frequency range gives the same result as summing the individual values of the PS.
This should explain the relations that you see when changing the window, sampling frequency, duration.
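The relationships above are language-independent; as a sanity check, here is the same bookkeeping written out in MATLAB (the language used elsewhere on this page) for a sine of amplitude A = 3 that sits exactly on a frequency bin:
fs = 1000; N = 2000;            % 2 s of data
t  = (0:N-1)/fs;
A  = 3;
x  = A*sin(2*pi*50*t);
X  = fft(x);
PS  = abs(X).^2 / N^2;          % power spectrum, V^2 (two-sided)
df  = fs/N;
PSD = PS / df;                  % power spectral density, V^2/Hz
sum(PS)                         % total power = A^2/2 = 4.5
sum(PSD)*df                     % same value: integrating the PSD over frequency
max(PS)                         % 0.25*A^2 = 2.25, at +50 Hz (and again at -50 Hz)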
scipy.signal.periodogram uses the scipy.signal.welch function with 0 overlap. Therefore, the scaling is similar to the one provided by the welch function, density or spectrum.
In the case of density scaling, the amplitude will vary with the window length: the longer the window, the higher the frequency resolution, i.e. the smaller the delta_f. Since the estimated density is an average, the smaller the delta_f, the less zero energy is included in the averaging.
As you have mentioned, spectrum scaling is an integration of the energy density over the spectrum to produce the energy. Therefore, integrating over zero values does not affect the final value.
The Fourier transform actually requires finite energy over an infinite duration of the time series (like a decaying signal). So, if you just make your time series longer by "duplicating" it, the energy becomes infinite along with the duration.
My main confusion was about the "spectrum" option for scipy.signal.periodogram, which seems to produce a constant spectrum even when the time series becomes longer.
Normally, 0.5*A^2 = S(f)*delta_f, where S(f) is the power spectral density. S(f)*delta_f, representing power, is constant if A is constant. But when using a longer time series, delta_f (i.e. the frequency increment) is reduced accordingly, based on the FFT procedure. For example, a 100 s time series leads to delta_f = 0.01 Hz, while a 1000 s time series has delta_f = 0.001 Hz. S(f), representing density, changes accordingly.

Help me understand FFT function (Matlab)

1) Besides the negative frequencies, which is the minimum frequency provided by the FFT function? Is it zero?
2) If it is zero how do we plot zero on a logarithmic scale?
3) The result is always symmetrical? Or it just appears to be symmetrical?
4) If I use abs(fft(y)) to compare 2 signals, may I lose some accuracy?
1) Besides the negative frequencies, which is the minimum frequency provided by the FFT function? Is it zero?
fft(y) returns a vector with the 0-th to (N-1)-th samples of the DFT of y, where y(t) should be thought of as defined on 0 ... N-1 (hence, the 'periodic repetition' of y(t) can be thought of as a periodic signal defined over Z).
The first sample of fft(y) corresponds to the frequency 0.
The Fourier transform of real, discrete-time, periodic signals has also discrete domain, and it is periodic and Hermitian (see below). Hence, the transform for negative frequencies is the conjugate of the corresponding samples for positive frequencies.
For example, if you interpret (the periodic repetition of) y as a periodic real signal defined over Z (sampling period == 1), then the domain of fft(y) should be interpreted as N equispaced points 0, 2π/N ... 2π(N-1)/N. The samples of the transform at the negative frequencies -π ... -2π/N are the conjugates of the corresponding samples at frequencies π ... 2π/N, and are equal to the samples at frequencies π ... 2π(N-1)/N.
2) If it is zero how do we plot zero on a logarithmic scale?
If you want to draw some sort of Bode plot you may plot the transform only for positive frequencies, ignoring the samples corresponding to the lowest frequencies (in particular 0).
3) The result is always symmetrical? Or it just appears to be symmetrical?
It has Hermitian symmetry if y is real: Its real part is symmetric, its imaginary part is anti-symmetric. Stated another way, its amplitude is symmetric and its phase anti-symmetric.
4) If I use abs(fft(y)) to compare 2 signals, may I lose some accuracy?
If you mean abs(fft(x - y)), this is OK and you can use it to get an idea of the frequency distribution of the difference (or error, if x is an estimate of y). If you mean abs(fft(x)) - abs(fft(y)) (???) you lose at least phase information.
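A short MATLAB check of both points (DC in the first sample, Hermitian symmetry); the test signal is arbitrary:
N = 8;
y = randn(1, N);                          % real-valued test signal
Y = fft(y);
Y(1) - sum(y)                             % ~0: the first sample is the 0-frequency (DC) term
max(abs(Y(2:end) - conj(Y(end:-1:2))))    % ~1e-15: Y(k) = conj(Y(N-k+2)) for real y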
Well, if you want to understand the Fast Fourier Transform, you want to go back to the basics and understand the DFT itself. But, that's not what you asked, so I'll just suggest you do that in your own time :)
But, in answer to your questions:
Yes, (excepting negatives, as you said) it is zero. The range is 0 to (N-1) for an N-point input.
In MATLAB? I'm not sure I understand your question - plot zero values as you would any other value... Though, as rightly pointed out by duffymo, there is no natural log of zero.
It's essentially similar to a sinc (sine cardinal) function. It won't necessarily be symmetrical, though.
You won't lose any accuracy, you'll just have the magnitude response (but I guess you knew that already).
Consulting "Numerical Recipes in C", Chapter 12 on "Fast Fourier Transform" says:
The frequency ranges from negative fc to positive fc, where fc is the Nyquist critical frequency, which is equal to 1/(2*delta), where delta is the sampling interval. So frequencies can certainly be negative.
You can't plot something that doesn't exist. There is no natural log of zero. You'll either plot frequency as the x-axis or choose a range that doesn't include zero for your semi-log axis.
The presence or lack of symmetry in the frequency range depends on the nature of the function in the time domain. You can have a plot in the frequency domain that is not symmetric about the y-axis.
I don't think that taking the absolute value like that is a good idea. You'll want to read a great deal more about convolution, correlation, and signal processing to compare two signals.
The result of the FFT can be 0; this has already been answered by other people.
To plot a 0 value on a logarithmic scale, the trick is to set it to a very small positive number (I use exp(-15) for that purpose).
Already answered by other people.
If you are only interested in the magnitude, yes, you can do that. This is applicable, say, in many image processing problems.
Half your question:
3) The results of the FFT operation depend on the nature of the signal; hence there's nothing requiring that it be symmetrical, although if it is, you may get some more information about the properties of the signal.
4) That will compare the magnitudes of a pair of signals, but those being equal does not guarantee that the FFTs are identical (don't forget about phase). It may, however, be enough for your purposes, but you should be sure of that.