How many iterations should you make for the simulation to be a 'Monte Carlo simulation' for BER calculations? - matlab

Edited question
How many iterations should you make for the simulation to be an accurate 'Monte Carlo simulation' for Bit error rate calculations?
What is the minimum value? If I want to repeat the simulation five times with an exponentially growing number of iterations, should I start from 1e2, i.e. iterations = [1e2 1e3 1e4 1e5 1e6], or from 1e3, i.e. [1e3 1e4 1e5 1e6 1e7], or something else? What is the common practice?
Additional info:
I used [8e3 1e4 3e4 5e4 8e4 1e5] before, but according to my professor that is not enough because the result is not satisfactory.
Simulations take a very long time on my computer so I cannot keep changing the iterations based on the result. If there is a common practice about this, please let me know.
Thanks #BillBokeey for helping me edit the question.

What your professor proposes strikes me as a qualitative, not a quantitative, way to estimate the convergence of your simulation.
Frankly, I don't know how BER is computed, but I deal a lot with integral calculations by MC.
In such a case you sample x_i over some interval and compute
f_MC = sum_i f_i / N, where f_i = f(x_i). We know that f_MC converges to the true value with a variance of sigma^2/N (i.e. a standard deviation of sigma/sqrt(N)). So in the same simulation we also compute an estimate of sigma, assume that for large enough N it is a good approximation of the true sigma, and plot the resulting simulation error. In practical terms, alongside f_MC we accumulate the second-moment sum and its average, f2_MC = sum_i f_i^2 / N, and at the end take s = sqrt(f2_MC - f_MC^2) / sqrt(N) as the estimated error of the MC simulation (it will be slightly biased, though).
Thus you could plot the BER value and the statistical error of the simulation on the same graph. You could even do better: ask the user for the required statistical error (say, in %, meaning the user enters s/f*100), and continue the simulation in batches until you reach the required precision.
Then you could judge if 10^9 points are enough or not...
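The batch-until-precise idea above can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `mc_estimate`, `rel_tol`, `batch` and `max_samples` are hypothetical, and the integrand (x^2 on [0, 1], true value 1/3) stands in for whatever quantity the simulation averages. For a BER run, `f` would return 1 for a bit error and 0 otherwise.

```python
import math
import random


def mc_estimate(f, sampler, rel_tol, batch=10_000, max_samples=10_000_000, seed=0):
    """Run Monte Carlo in batches until s / |f_MC| <= rel_tol (hypothetical helper)."""
    rng = random.Random(seed)
    n = 0
    sum_f = 0.0   # running sum of f_i
    sum_f2 = 0.0  # running sum of f_i^2 (second moment)
    mean = 0.0
    err = float("inf")
    while n < max_samples:
        for _ in range(batch):
            y = f(sampler(rng))
            sum_f += y
            sum_f2 += y * y
        n += batch
        mean = sum_f / n
        # s = sqrt(f2_MC - f_MC^2) / sqrt(N); clamp tiny negative round-off to zero
        var = max(sum_f2 / n - mean * mean, 0.0)
        err = math.sqrt(var / n)
        if mean != 0.0 and err / abs(mean) <= rel_tol:
            break
    return mean, err, n


# Example: integral of x^2 over [0, 1] with a requested 1 % statistical error.
mean_est, err_est, n_used = mc_estimate(
    lambda x: x * x, lambda rng: rng.random(), rel_tol=0.01
)
print(mean_est, err_est, n_used)
```

Because the stopping rule is checked once per batch, the run can overshoot the minimum number of samples by up to one batch; a smaller `batch` trades that overshoot against more frequent checks.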

Assume that we denote our simulated BER as Pb_hat and require that Pb_hat lies in [(1 - alpha)Pb, (1 + alpha)Pb], where Pb is the true BER and alpha is the percent deviation tolerance (e.g., 0.1). Then from [van Trees 2013, pg. 83] we know that the number of Monte Carlo trials required to obtain Pb_hat with a confidence probability pc is
K = (c / alpha)^2 x (1 - Pb) / Pb,
with c given in Table I.
Table I: confidence interval probabilities from the Gaussian distribution
Example: Suppose we want to simulate a BER of 10^-4 with a percent deviation tolerance of 0.01 and a confidence probability 0.950, then from Table I we know that c = 1.960 and by applying the formula K = (1.96/0.01)^2 x (1-10^-4)/10^-4 = 384121584 Monte Carlo trials. This is a surprisingly large value, though.
As a rule of thumb, K should be on the order of 10/BER [Jeruchim 1984].
[van Trees 2013] H. L. van Trees, K. L. Bell, and Z. Tian, Detection, estimation, and filtering theory, 2nd ed., Hoboken, NJ: Wiley, 2013.
[Jeruchim 1984] M. Jeruchim, "Techniques for Estimating the Bit Error Rate in the Simulation of Digital Communication Systems," in IEEE Journal on Selected Areas in Communications, vol. 2, no. 1, pp. 153-170, January 1984, doi: 10.1109/JSAC.1984.1146031.
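The formula for K above is easy to evaluate for other operating points. The sketch below is an illustration only: `required_trials` is a hypothetical helper, and instead of reading c from Table I it derives c from the standard normal distribution as the two-sided quantile for pc, which is an assumption about how the table was built (it does give c close to 1.960 for pc = 0.950).

```python
from statistics import NormalDist


def required_trials(ber, alpha, pc):
    """K = (c / alpha)^2 * (1 - Pb) / Pb, with c the two-sided Gaussian quantile for pc."""
    c = NormalDist().inv_cdf((1.0 + pc) / 2.0)
    return (c / alpha) ** 2 * (1.0 - ber) / ber


# Worked example from the answer: BER = 1e-4, alpha = 0.01, pc = 0.95.
k_strict = required_trials(1e-4, 0.01, 0.95)

# Rule of thumb for the same BER: K on the order of 10 / BER.
k_rule_of_thumb = 10 / 1e-4

print(f"formula: {k_strict:.3e} trials, rule of thumb: {k_rule_of_thumb:.3e} trials")
```

Inverting the same formula for K = 10/BER gives alpha of roughly c/sqrt(10), about 0.62 at 95 % confidence, so the rule of thumb only pins the BER down to its order of magnitude. That is why it is so much cheaper than the 1 % tolerance in the example.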


Explain the intuition for the tol parameter in scipy differential evolution

I am using the differential evolution optimizer in scipy and I don't understand the intuition behind the tol argument. Specifically, the documentation says:
tol: float, optional
When the mean of the population energies, multiplied by tol, divided
by the standard deviation of the population energies is greater than 1
the solving process terminates:
convergence = mean(pop) * tol / stdev(pop) > 1
What does setting tol represent from a user perspective?
Maybe the formula in the documentation is easier to understand in the following form (see lines 508 and 526 in the code):
std(population_energies) / mean(population_energies) < tol
It means that convergence is reached when the standard deviation of the energies of the individuals in the population, normalized by their average, is smaller than the given tolerance value.
The optimization algorithm is iterative. At every iteration the population is updated in search of a better solution. The tolerance parameter is used to define a stopping condition. The stopping condition is that all the individuals (parameter sets) have approximately the same energy, i.e. the same cost function value. Then, the parameter set giving the lowest energy is returned as the solution.
It also implies that all the individuals are relatively close to each other in the parameter space, so little further improvement can be expected in the following generations.
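To make the criterion concrete, here is a minimal sketch of the relative-spread test in plain Python. It is not SciPy's code: `has_converged` is a hypothetical helper that applies std / mean < tol to a list of energies, with an absolute value around the mean added as a guard for negative cost values (an assumption on my part).

```python
from statistics import mean, pstdev


def has_converged(population_energies, tol):
    """Return True when std(energies) / |mean(energies)| < tol (hypothetical helper)."""
    avg = mean(population_energies)
    spread = pstdev(population_energies)  # population std, like numpy.std with ddof=0
    if avg == 0:
        # The ratio is undefined for a zero mean; treat only zero spread as converged.
        return spread == 0
    return spread / abs(avg) < tol


# Energies within about 1 % of each other: the population has collapsed.
tight = [1.00, 1.01, 0.99, 1.005]
# Energies still widely spread: keep iterating.
loose = [1.0, 2.0, 3.0, 4.0]

print(has_converged(tight, tol=0.01), has_converged(loose, tol=0.01))
```

Read this way, tol = 0.01 means "stop once the cost values across the population differ by less than about 1 % of their average". The ratio becomes ill-conditioned when the average cost is close to zero, which is worth keeping in mind when the optimum of the cost function is 0.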

Function approximation by ANN

So I have something like this,
and something similar for x, where theta_i are angles from specified intervals and l_i are some coefficients. The task is to approximate the inverse of the equation: you set x and y, and the result should be the appropriate thetas. So I randomly generate thetas from the specified intervals and compute x and y. Then I normalize x and y to <-1,1> and the thetas to <0,1>. I used this data as the training set: the inputs of the network are the normalized x and y, and the outputs are the normalized thetas.
I trained the network and tried different configurations, but the absolute error of the network was still around 24.9% after a whole night of training. That is far too much, so I don't know what to do.
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
Error backpropagation was used as the training algorithm. Neurons have a sigmoid activation function, and units are biased. I tried the topologies [2 50 3] and [2 100 50 3]; the training set has length 1000 and training ran for 1000 cycles (in one cycle I go through the whole dataset). The learning rate is 0.2.
Error of approximation was computed as
sum of abs(desired_output - reached_output)/dataset_length.
Used optimizer is stochastic gradient descent.
Loss function,
1/2 (desired-reached)^2
The network was implemented in my own Matlab template for NNs. I know that is a weak point, but I'm confident my template is correct because it successfully solved the XOR problem, approximated differential equations, and approximated a state regulator. I am showing the template anyway, because this information may be useful.
Neuron class
Network class
I used 2500 unique samples within these theta ranges:
theta1<0, 180>, theta2<-130, 130>, theta3<-150, 150>
I also experimented with a larger dataset, but the accuracy did not improve.
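For reference, the normalization described in the question (thetas mapped to <0,1>) is a plain min-max scaling. The sketch below is an assumption about how it might be done, not the poster's Matlab template; `scale`, `normalize_thetas` and `denormalize_thetas` are hypothetical names, and the ranges are the ones stated above.

```python
# Theta ranges in degrees, as stated in the question.
THETA_RANGES = [(0.0, 180.0), (-130.0, 130.0), (-150.0, 150.0)]


def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] to [out_lo, out_hi]."""
    return out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)


def normalize_thetas(thetas):
    """Map each theta from its own range to [0, 1] (network target space)."""
    return [scale(t, lo, hi, 0.0, 1.0) for t, (lo, hi) in zip(thetas, THETA_RANGES)]


def denormalize_thetas(outputs):
    """Map network outputs in [0, 1] back to degrees."""
    return [scale(o, 0.0, 1.0, lo, hi) for o, (lo, hi) in zip(outputs, THETA_RANGES)]


targets = normalize_thetas([90.0, 0.0, 150.0])
print(targets)
print(denormalize_thetas(targets))
```

Mapping the network outputs back to degrees with the inverse transform makes the reported error easier to interpret, because the same normalized error corresponds to a different number of degrees for each joint.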

Harmonic mean when a DC signal is present

I have an output from a noisy signal, saved as a set of cosines.
I have a set of frequencies from 0 to x Hz (x is a large number), and a set, of the same size, of amplitudes.
I want to work out the harmonic mean of the frequencies present, when the weighting of the frequency is the magnitude of the corresponding amplitude.
For example:
If I have a set of frequencies
[ 1 , 2 , 3] and amplitudes [ 10, 100, 1000 ] (such that the cosine with frequency 1 has amplitude 10, etc.). Then, the harmonic mean of the frequencies is 2.8647.
However, I run into problems when I have a zero frequency (a "DC" component) - the harmonic mean is just zero!
The real life problem is a very big set of cosines, starting with a zero frequency, going up to several GHz. Much of the signal is weighted in a portion of the spectrum and I want to compare a simple weighted mean of the spectrum with a harmonic mean.
The way around this (it seems a cheap way) is to ignore the zero frequency - it is only one frequency out of tens of thousands. But is there a correct way to do this?
Below is the equation for the weighted harmonic mean:
H = sum_i(w_i) / sum_i(w_i / x_i)
Applied to your example it's:
x = 1:3;
w = logspace(1,3,3); % [10 100 1000]
sum(w)/sum(w./x); % 2.8220
You can see that if one of the x values is 0, the sum in the denominator would be infinite. If you manually set the weight of this value to 0, you would have a 0/0 scenario in the bottom sum (which evaluates to NaN). Technically speaking - you can't have an x of 0 in the computation of this type of mean without getting a result of 0.
I think it's quite clear that this isn't the right tool to handle a DC signal. Several things come to mind in order to get some meaningful information:
It sounds reasonable to ignore the DC signal altogether in both means.
Perhaps you would be better off ignoring it for the purpose of the harmonic mean and add it afterwards for compatibility with the simple mean.
At the end of the day, you need to decide what is the point you're trying to make with this, and then process the data accordingly.
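The "ignore the DC component" option can be sketched in a few lines. This is only an illustration: `weighted_harmonic_mean` and its `skip_zero` flag are hypothetical names, and using the magnitude of each amplitude as the weight follows the question's description.

```python
def weighted_harmonic_mean(freqs, amplitudes, skip_zero=True):
    """Weighted harmonic mean sum(w) / sum(w / x), with w = |amplitude|.

    With skip_zero=True the zero-frequency (DC) term is dropped before averaging.
    With skip_zero=False a DC term with non-zero weight forces the result to 0.
    """
    pairs = [(f, abs(a)) for f, a in zip(freqs, amplitudes)]
    if skip_zero:
        pairs = [(f, w) for f, w in pairs if f != 0]
    elif any(f == 0 and w > 0 for f, w in pairs):
        return 0.0  # w / 0 makes the denominator infinite, so the mean collapses to 0
    pairs = [(f, w) for f, w in pairs if w > 0]  # zero-weight terms contribute nothing
    if not pairs:
        raise ValueError("no non-zero weights left to average")
    return sum(w for _, w in pairs) / sum(w / f for f, w in pairs)


no_dc = weighted_harmonic_mean([1, 2, 3], [10, 100, 1000])
dc_skipped = weighted_harmonic_mean([0, 1, 2, 3], [5, 10, 100, 1000])
dc_kept = weighted_harmonic_mean([0, 1, 2, 3], [5, 10, 100, 1000], skip_zero=False)
print(no_dc, dc_skipped, dc_kept)
```

If the DC term is dropped here, dropping it from the simple weighted mean as well keeps the two statistics comparable, which matches the first suggestion above.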

Scipy periodogram terminology confusion

I am confused about the terminology used in scipy.signal.periodogram, namely:
scaling : { 'density', 'spectrum' }, optional
Selects between computing the power spectral density ('density')
where Pxx has units of V**2/Hz if x is measured in V and computing
the power spectrum ('spectrum') where Pxx has units of V**2 if x is
measured in V. Defaults to 'density'
1) A few tests show that the result for the 'density' option depends on signal length, window length and sampling frequency (it grows when the signal length increases). How come? I would say that it is exactly the density that should not depend on these things. If I take a longer signal I should just get a more accurate estimate, not a different result. Not to mention that the dependence on window length is also very surprising.
The result diverges in the limit of an infinite signal, which could be a feature of energy, but not of power. Shouldn't the periodogram converge to the real theoretical PSD when the length increases? If so, am I supposed to perform another normalisation outside of the signal.periodogram method?
2) On the contrary, I see that the alternative option 'spectrum' gives what I would previously have called the Power Spectrum Density, that is, it gives a result independent of window segment and window length and consistent with the theoretical calculation. For instance, for A*sin(2*pi*f*t) a two-sided solution yields two peaks at -f and f, each of height 0.25*A^2.
There is a lot of literature on this subject, but I get the impression that there is also a lot of incompatible terminology, so I will be thankful for any clarification. The straightforward question is how to interpret these options and their units. (I am used to seeing V^2/Hz labeled as "Power Spectrum Density".)
Let's take a real array called data, of length N, and with sampling frequency fs. Let's call the time bin dt=1/fs, and T = N * dt. In the frequency domain, the frequency bin is df = 1/T = fs/N.
The power spectrum PS (scaling='spectrum' in scipy.signal.periodogram) is calculated as follows:
import numpy as np
import scipy.fft as fft
dft = fft.fft(data)
PS = np.abs(dft)**2 / N ** 2
It has the units of V^2. It can be understood as follows. By analogy to the continuous Fourier transform, the energy E of the signal is:
E := np.sum(data**2) * dt = 1/N * np.sum(np.abs(dft)**2) * dt
(by Parseval's theorem). The power P of the signal is the total energy E divided by the duration of the signal T:
P := E/T = 1/N**2 * np.sum(np.abs(dft)**2)
The power P only depends on the Discrete Fourier Transform (DFT) and the number of samples N, not directly on the sampling frequency fs or the signal duration T. The power per frequency channel, i.e. the power spectrum PS, is thus given by the formula above:
PS = np.abs(dft)**2 / N ** 2
For the power spectrum density PSD (scaling='density' in scipy.signal.periodogram), one needs to divide the PS by the frequency bin of the DFT, df:
PSD := PS/df = PS * N * dt = PS * N / fs
and thus:
PSD = np.abs(dft)**2 / N * dt
This has the units of V^2/Hz = V^2 * s, and now depends on the sampling frequency. That way, integrating the PSD over the frequency range gives the same result as summing the individual values of the PS.
This should explain the relations that you see when changing the window, sampling frequency, duration.
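The relations above can be checked numerically. The sketch below deliberately avoids NumPy/SciPy and uses a naive O(N^2) DFT so that it runs with the standard library only; the helper names and the test signal (A = 2, f = 10 Hz, fs = 100 Hz) are assumptions chosen for illustration. It computes the two-sided PS and PSD exactly as defined above, for two signal lengths.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (same convention as numpy.fft.fft)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(x):
    """Two-sided PS = |DFT|^2 / N^2, in V^2."""
    n = len(x)
    return [abs(c) ** 2 / n ** 2 for c in dft(x)]


def power_spectral_density(x, fs):
    """Two-sided PSD = PS / df = PS * N / fs, in V^2/Hz."""
    n = len(x)
    return [p * n / fs for p in power_spectrum(x)]


A, f0, fs = 2.0, 10.0, 100.0
results = {}
for n in (100, 200):
    x = [A * math.sin(2 * math.pi * f0 * t / fs) for t in range(n)]
    ps = power_spectrum(x)
    psd = power_spectral_density(x, fs)
    k = round(f0 * n / fs)  # index of the +f0 bin
    results[n] = (ps[k], psd[k], sum(ps))
    print(f"N={n}: PS peak={ps[k]:.4f}, PSD peak={psd[k]:.4f}, total power={sum(ps):.4f}")
```

The PS peak stays at 0.25*A^2 and the total power at 0.5*A^2 regardless of N, while the PSD peak of this pure tone doubles when the record length doubles, because the same power is squeezed into a bin half as wide. Note that scipy.signal.periodogram returns a one-sided spectrum and removes the mean by default, so its numbers only match these two-sided definitions when called with return_onesided=False, detrend=False and a boxcar window.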
scipy.signal.periodogram uses the scipy.signal.welch function with 0 overlap. Therefore, the scaling is the same as the one provided by the welch function, density or spectrum.
In the case of the density scaling, the amplitude will vary with window length: the longer the window, the higher the frequency resolution, i.e. the \Delta_f is smaller. Since the estimated density is an average over the bin, the smaller the \Delta_f, the less zero energy is included in the averaging.
As you have mentioned, the spectrum scaling is an integration of the energy density over the spectrum to produce the energy. Therefore, the integration over zero values does not affect the final value.
The Fourier transform actually requires finite energy over an infinite duration of the time series (like a decay). So, if you just make your time series sample longer by "duplicating" it, the energy will be infinite for an infinite duration.
My main confusion was on the "spectrum" option for scipy.signal.periodogram, which seems to create a constant energy spectrum even when the time series become longer.
Normally, 0.5*A^2=S(f)*delta_f, where S(f) is the power density spectrum. S(f)*delta_f, representing energy is constant if A is constant. But when using a longer duration of time series, delta_f (i.e. incremental frequency) is reduced accordingly, based on FFT procedure. For example, 100s time series will lead to a delta_f=0.01Hz, while 1000s time series will have a delta_f=0.001Hz. S(f) representing density will accordingly change.

How does number of points change a FFT in MATLAB

When taking fft(signal, nfft) of a signal, how does nfft change the outcome and why? Can I have a fixed value for nfft, say 2^18, or do I need to go 2^nextpow2(2*length(signal)-1)?
I am computing the power spectral density (PSD) of two signals by taking the FFT of the autocorrelation, and I want to compare the results. Since the signals are of different lengths, I am worried that if I don't fix nfft, it will make the comparison really hard!
There is no inherent reason to use a power-of-two (it just might make the processing more efficient in some circumstances).
However, to make the FFTs of two different signals "commensurate", you will indeed need to zero-pad one or other (or both) signals to the same lengths before taking their FFTs.
However, I feel obliged to say: If you need to ask this, then you're probably not at a point on the DSP learning curve where you're going to be able to do anything useful with the results. You should get yourself a decent book on DSP theory, e.g. this.
Most modern FFT implementations (including MATLAB's which is based on FFTW) now rarely require padding a signal's time series to a length equal to a power of two. However, nearly all implementations will offer better, and sometimes much much better, performance for FFT's of data vectors w/ a power of 2 length. For MATLAB specifically, padding to a power of 2 or to a length with many low prime factors will give you the best performance (N = 1000 = 2^3 * 5^3 would be excellent, N = 997 would be a terrible choice).
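A quick way to see why 1000 is a friendly FFT length and 997 is not is to look at the prime factorization. The helper below is a hypothetical sketch in Python rather than MATLAB (MATLAB users can get the same list from the built-in `factor`).

```python
def prime_factors(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


for length in (1000, 997, 2 ** 18):
    print(length, prime_factors(length))
```

Lengths whose largest prime factor is small (2, 3, 5, 7) can be split into many small sub-transforms, whereas a prime length such as 997 cannot be split this way at all.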
Zero-padding will not increase the frequency resolution of your PSD; however, it does reduce the bin size in the frequency domain. If you add NZeros to a signal vector of length N, the FFT will output a vector of length N + NZeros, of which ( N + NZeros )/2 + 1 bins are unique for a real signal (the one-sided spectrum, for an even total length). This means that each frequency bin will now have a width of:
Bin width (Hz) = F_s / ( N + NZeros )
Where F_s is the signal sample frequency.
If you find that you need to separate or identify two closely spaced peaks in the frequency domain, you need to increase your sample time. You'll quickly discover that zero-padding buys you nothing to that end, and intuitively that's what we'd expect. How can we expect more information in our power spectrum w/o adding more information (a longer time series) in our input?
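The distinction between bin width and resolution, and the choice of one shared nfft for two signals, can be put into numbers. The sketch below is illustrative only and written in Python rather than MATLAB; `bin_width`, `resolution` and `common_nfft` are hypothetical helpers, and `common_nfft` simply applies the question's 2^nextpow2(2*length - 1) expression to the longer signal.

```python
def bin_width(fs, n, n_zeros=0):
    """Spacing between FFT bins in Hz after zero-padding: fs / (N + NZeros)."""
    return fs / (n + n_zeros)


def resolution(fs, n):
    """Ability to separate two peaks, set by the record length only: fs / N."""
    return fs / n


def common_nfft(*lengths):
    """Smallest power of two >= 2*L - 1 for the longest signal length L."""
    needed = max(2 * length - 1 for length in lengths)
    return 1 << (needed - 1).bit_length()


fs = 1000.0  # Hz, assumed sample rate
n = 1000     # one second of data

print(bin_width(fs, n))                # bin spacing without padding
print(bin_width(fs, n, n_zeros=1000))  # finer grid after padding
print(resolution(fs, n))               # unchanged by padding
print(common_nfft(1000, 600))          # shared nfft for two signals
```

Using the same nfft (and the same sample rate) for both signals puts their spectra on an identical frequency grid, which is what makes them directly comparable. The padding only interpolates each spectrum; it does not sharpen either one.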