Iterative quantile estimation in Matlab

I'm trying to implement an iterative algorithm to estimate quantiles in data generated by a Monte-Carlo simulation. I want to make it iterative because I have many iterations and variables, so storing all data points and using Matlab's quantile function would take up too much of the memory that I actually need for the simulation.
I found some approaches based on the Robbins-Monro process, which updates the estimate q of the p-quantile with each new data point x(t) according to
q(t+1) = q(t) + ct * (p - I(x(t) < q(t))),
where I is the indicator function and ct is a control sequence.
The implementation with a control sequence ct = c / t, where c is a constant, is quite straightforward. In the cited paper, they show that c = 2 * sqrt(2 * pi) gives quite good results, at least for the median. But they also propose an adaptive approach based on an estimate of the histogram. Unfortunately, I haven't figured out how to implement this adaptation yet.
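My rough understanding of such a density-adaptive variant looks like the sketch below: scale the step size by an inverse density estimate at the current quantile estimate. I cannot say this matches the paper's scheme, and the box-kernel bandwidth h, the floor on the density estimate, and the initialization are my own guesses:
p = 0.2;  % target probability
q = 0;    % quantile estimate
f = 0;    % running density estimate at q
h = 1;    % kernel bandwidth (needs tuning)
for t = 1:length(data)
    % box-kernel density estimate at q, updated recursively
    f = f + (1 / t) * ((abs(data(t) - q) <= h) / (2 * h) - f);
    c = 1 / (t * max(f, 1e-3));  % guard against a near-zero density estimate
    q = q + c * (p - (data(t) < q));
end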
I tested the implementation with a constant c for three test samples of 10,000 data points each. The value c = 2 * sqrt(2 * pi) did not work well for me, but c = 100 looks quite good for the test samples. However, this selection is not very robust and failed in the actual Monte-Carlo simulation, giving results wide of the mark.
probabilities = [0.1, 0.4, 0.7];
controlFactor = 100;
% note: this variable shadows Matlab's quantile function
quantile = zeros(size(probabilities));
indicator = zeros(size(probabilities));
for index = 1:length(data)
    control = controlFactor / index;
    % indicator is p where the sample lies at or above the current
    % estimate, and p - 1 where it lies below
    indices = (data(index) >= quantile);
    indicator(indices) = probabilities(indices);
    indices = (data(index) < quantile);
    indicator(indices) = probabilities(indices) - 1;
    quantile = quantile + control * indicator;
end
Is there a more robust solution for iterative quantile estimation, or does anyone have an implementation of an adaptive approach with small memory consumption?

After trying some of the adaptive iterative approaches that I found in the literature without great success (not sure if I did it right), I came up with a solution that gives me good results for my test samples and also for the actual Monte-Carlo simulation.
I buffer a subset of simulation results, compute the sample quantiles of each subset, and average over all subset sample quantiles in the end. This seems to work quite well and without tuning many parameters. The only parameter is the buffer size, which is 100 in my case.
The results converge quite fast, and increasing the sample size does not improve the results dramatically. There seems to be a small but constant bias that is presumably the averaged error of the subset sample quantiles. And that is the downside of my solution: by choosing the buffer size, one fixes the achievable accuracy. Increasing the buffer size reduces this bias. In the end, it seems to be a memory/accuracy tradeoff.
% Generate data
rng('default');
data = sqrt(0.5) * randn(10000, 1) + 5 * rand(10000, 1) + 10;

% Set parameters
probabilities = 0.2;

% Compute reference sample quantiles
quantileEstimation1 = quantile(data, probabilities);

% Estimate quantiles by averaging over a number of subset sample quantiles
subsetSize = 100;
numberOfSubsets = floor(length(data) / subsetSize);
quantileSum = 0;
for index = 1:numberOfSubsets
    quantileSum = quantileSum + quantile(data(((index - 1) * subsetSize + 1):(index * subsetSize)), probabilities);
end
quantileEstimation2 = quantileSum / numberOfSubsets;

% Estimate quantiles with iterative computation
quantileEstimation3 = zeros(size(probabilities));
indicator = zeros(size(probabilities));
controlFactor = 2 * sqrt(2 * pi);
for index = 1:length(data)
    control = controlFactor / index;
    indices = (data(index) >= quantileEstimation3);
    indicator(indices) = probabilities(indices);
    indices = (data(index) < quantileEstimation3);
    indicator(indices) = probabilities(indices) - 1;
    quantileEstimation3 = quantileEstimation3 + control * indicator;
end

fprintf('Reference result: %f\nSubset result: %f\nIterative result: %f\n\n', quantileEstimation1, quantileEstimation2, quantileEstimation3);

Related

Inconsistency when estimating AR model coefficients in MATLAB

I'm trying to estimate the coefficients of an AR[2] model
x(t) = a_1*x(t-1) + a_2*x(t-2) + e(t), e(t) ~ N(0, sigma^2)
in MATLAB. For a_1 = 2*cos(2*pi/T)*exp(-1/tau), a_2 = -exp(-2/tau), the AR[2] model corresponds to a linear damped oscillator with period T and relaxation time tau. I simulated some data for this process with T = 30 and tau = 100, which corresponds to a_1 = 1.9368, a_2 = -0.9802:
T = 30; tau = 100;
a_1 = 2*cos(2*pi/T)*exp(-1/tau); a_2 = -exp(-2/tau);
simuMdl = arima(2,0,0);
simuMdl.Constant = 0;
simuMdl.Variance = 1e-1;
simuMdl.AR{1} = a_1;
simuMdl.AR{2} = a_2;
data = simulate(simuMdl, 600);
data = data(501:end);
plot(data)
I only take the last 100 time points to make sure the system is no longer influenced by the initial conditions. Now, when trying to estimate the parameters, everything works just fine when using the estimate command, which uses maximum likelihood estimation:
ToEstMdl = arima(2,0,0); ToEstMdl.Constant = 0;
EstMdl = estimate(ToEstMdl, data);
EstMdl.AR
%'[1.9319] [-0.9745]'
However, when I use the Yule-Walker equations implemented in aryule, I get a completely different result that does not match the true parameter values at all:
aryule(data, 2)
%'1.0000 -1.4645 0.5255'
Does anyone have an idea why the Yule-Walker equations fall so far short of the MLE approach?
Yule-Walker (YW) is a method-of-moments-based estimator. As such, its estimate gets better with an increasing number of data points. You can check this in your example by using all 600 data points to see the 'best' YW estimate you can get had you used all the data, and the MLE would still be better than it. You can also increase the number of data points to, say, 5000 instead of 600; in that case the best YW estimate (the one that uses all 5000 points) will start to approach the MLE estimate.
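For example (a sketch reusing the simulation setup from the question; note that aryule returns the coefficients of the polynomial A(z) with a(1) = 1, so the AR parameters are the negated second and third entries):
T = 30; tau = 100;
simuMdl = arima(2,0,0);
simuMdl.Constant = 0;
simuMdl.Variance = 1e-1;
simuMdl.AR{1} = 2*cos(2*pi/T)*exp(-1/tau);
simuMdl.AR{2} = -exp(-2/tau);
longData = simulate(simuMdl, 5000);
yw = aryule(longData, 2);  % returns [1, -a_1_hat, -a_2_hat]
fprintf('YW estimates: a_1 = %.4f, a_2 = %.4f\n', -yw(2), -yw(3));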

Shifting the FFT of a signal in Matlab

I'm trying to implement a spectral correlation function to plot with the surf function.
I think I understand the idea of the SCF as described in a paper I read, but I'm having trouble implementing my function in Matlab. I've been following the step-by-step instructions given in the paper.
I'm mostly having trouble shifting my pieces of the data properly. Is there an easy way to achieve the frequency-shift step (step 3 in the paper)?
Here's what I tried in my code:
function [output] = spectral(x, N)
% This function does cyclostationary spectral analysis
% on a data set and returns some features
t = length(x);
samplesPerFrame = floor(t / N);
count = 1;
for alpha = -1:0.01:1
    % Split up the samples into frames
    % Have to leave some samples out if unevenly split
    for i = 1:N
        frange = ((i - 1) * samplesPerFrame + 1):(i * samplesPerFrame);
        xFrame(i, :) = x(frange);
        ts = 1:length(xFrame(i, :));
        shiftLeft = fft(xFrame(i, :) .* exp(-1 * 2 * pi * 1i * (alpha / 2) .* ts));
        shiftRight = fft(xFrame(i, :) .* exp(2 * pi * 1i * (alpha / 2) .* ts));
        S(i,:) = (1 / samplesPerFrame) .* shiftLeft .* conj(shiftRight);
    end
    Savg(count, :) = mean(S, 1);
    Ssmooth(count, :) = smooth(Savg(count, :), 'moving');
    count = count + 1;
end
output = Ssmooth;
end
It looks good actually.
You may also try circshift(fft(xFrame(i, :)), [1, a]) to achieve shiftRight, and circshift(fft(xFrame(i, :)), [1, -a]) to get shiftLeft. Please note that here a is an integer; it is the number of bins by which the FFT elements are moved, and corresponds to a frequency shift of a * Fs / N, where Fs is your sampling rate and N is the FFT length.
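A quick sanity check of this equivalence (a sketch; the test length N and shift a are arbitrary):
N = 64; a = 3;
x = randn(1, N);
ts = 0:N-1;
viaExp = fft(x .* exp(2 * pi * 1i * (a / N) .* ts));  % shift via time-domain multiplication
viaShift = circshift(fft(x), [0, a]);                 % shift via FFT bin rotation
max(abs(viaExp - viaShift))  % should be near machine precision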
The method of spectral correlation estimation you are attempting is something I refer to as the Time-Smoothing Method (TSM). The code you posted cannot provide the correct answer except in some trivial cases such as alpha = 0. The reason is that you need to adjust the cyclic periodogram for each frame by a complex phase factor to compensate for the fact that each data block is a delayed version of the one preceding it.
If you replace the line
S(i,:) = (1 / samplesPerFrame) .* shiftLeft .* conj(shiftRight);
with the two lines
S(i,:) = (1 / samplesPerFrame) .* shiftLeft .* conj(shiftRight);
S(i, :) = S(i, :) * exp(-1i * 2 * pi * alpha * i * samplesPerFrame);
you'll be able to estimate the SCF. I confirmed this by applying your original code and the modified code to a BPSK signal with bit rate (normalized) of 1/10. In this case, one of your alpha values in the loop over alpha will exactly coincide with the true cycle frequency of 1/10. Only the modified code gives the correct SCF for the bit-rate cycle frequency.
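A test signal along those lines could be generated as follows (a sketch; rectangular pulse shaping, unit amplitude, and the absence of noise are my assumptions):
nBits = 1000; samplesPerBit = 10;           % normalized bit rate 1/10
bits = 2 * randi([0, 1], nBits, 1) - 1;     % random +/-1 symbols
bpsk = kron(bits, ones(samplesPerBit, 1));  % rectangular pulses
out = spectral(bpsk.', 32);                 % N = 32 frames (arbitrary choice)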
Please see my blog cyclostationary.wordpress.com for more detail and examples. In particular, I have a post on the TSM at
http://cyclostationary.blog/2015/12/18/csp-estimators-the-time-smoothing-method. (Corrected this link 5/2/17.)

MATLAB code for Harmonic Product Spectrum

Can someone tell me how I can implement the Harmonic Product Spectrum using MATLAB to find the fundamental frequency of a note in the presence of harmonics? I know I'm supposed to downsample my signal a number of times (after performing the FFT, of course) and then multiply the results with the original spectrum.
Say my fft signal is "FFT1"
then the code would roughly be like
hps1 = downsample(FFT1,2);
hps2 = downsample(FFT1,3);
hps = FFT1.*hps1.*hps2;
Is this code correct? I want to know if I've downsampled properly; since each variable has a different length, multiplying them results in a matrix dimension error. I really need some quick help, as it's for a project.
Thanks in advance!
OK, you can't do hps = FFT1.*hps1.*hps2; because each downsampled vector has a different size.
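For example, one simple way around the size mismatch (just a sketch) is to truncate everything to the length of the shortest vector before multiplying:
len = length(hps2);  % downsample(FFT1, 3) gives the shortest of the three
hps = FFT1(1:len) .* hps1(1:len) .* hps2(1:len);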
I made an example for you of how to build a very simple Harmonic Product Spectrum (HPS) using 5 harmonics of decimation (downsample). I only tested it on sinusoidal signals, and I get very close to the fundamental frequency in my tests.
This code only shows how to compute the main steps of the algorithm; it is very likely that you will need to improve it!
Source:
%[x,fs] = wavread('ederwander_IN_250Hz.wav');
CorrectFactor = 0.986;
threshold = 0.2;

% F0 start test
f = 250;
fs = 44100;
signal = 0.9 * sin(2*pi*f/fs*(0:9999));
x = signal';

framed = x(1:4096);
windowed = framed .* hann(length(framed));
FFT = fft(windowed, 4096);
FFT = FFT(1 : size(FFT,1) / 2);
FFT = abs(FFT);

% Downsample the magnitude spectrum by factors 1..5
hps1 = downsample(FFT, 1);
hps2 = downsample(FFT, 2);
hps3 = downsample(FFT, 3);
hps4 = downsample(FFT, 4);
hps5 = downsample(FFT, 5);

% Multiply the spectra bin by bin, up to the shortest length
y = zeros(1, length(hps5));
for i = 1:length(hps5)
    y(i) = hps1(i) * hps2(i) * hps3(i) * hps4(i) * hps5(i);
end

[m, n] = findpeaks(y, 'SORTSTR', 'descend');
Maximum = n(1);

% Try to fix octave error
if (y(n(1)) * 0.5) > y(n(2)) %& ( ( m(2) / m(1) ) > threshold )
    Maximum = n(length(n));
end

F0 = ((Maximum / 4096) * fs) * CorrectFactor

plot(y)
HPS usually generates an octave error, showing the pitch one octave up; I changed the code a bit to account for this, see above :-)

Logistic regression in Matlab, confused about the results

I am testing out logistic regression in Matlab on two datasets created from audio files:
The first set is created via wavread by extracting a vector from each file: the set is an 834-by-48116 matrix. Each training example is a 48116-vector of the wav's frequencies.
The second set is created by extracting the frequencies of 3 formants of the vowels, where each formant (feature) has its own frequency range (for example, the F1 range is 500-1500Hz, F2 is 1500-2000Hz, and so on). Each training example is a 3-vector of the wav's formants.
I am implementing the algorithm like so:
Cost function and gradient:
h = sigmoid(X*theta);                          % hypothesis
J = sum(y'*log(h) + (1-y)'*log(1-h)) * -1/m;   % cross-entropy cost
grad = ((h-y)'*X)/m;
theta_partial = theta;
theta_partial(1) = 0;                          % do not regularize the bias term
J = J + ((lambda/(2*m)) * (theta_partial'*theta_partial));
grad = grad + (lambda/m * theta_partial');
where X is the dataset and y is the output matrix of 8 classes.
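Here sigmoid is the standard logistic function (not shown above; a one-line version for completeness):
sigmoid = @(z) 1 ./ (1 + exp(-z));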
Classifier:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);
for c = 1:num_labels
    [theta] = fmincg(@(t)(lrCostFunction(t, X, (y==c), lambda)), initial_theta, options);
    all_theta(c, :) = theta';
end
where num_labels = 8 and lambda (the regularization parameter) is 0.1.
With the first set and MaxIter = 50, I get ~99.8% classification accuracy.
With the second set and MaxIter = 50, the accuracy is poor: 62.589928%.
I thought about increasing MaxIter to improve the performance; however, even with a ridiculously large number of iterations, the result doesn't go higher than 66.546763%. Changing the regularization value (lambda) doesn't seem to improve the results either.
What could be the problem? I am new to machine learning and I can't seem to figure out what exactly causes this drastic difference. The only reason that obviously stands out to me is that the first set's examples are very long vectors, hence a much larger number of features, while the second set's examples are short 3-vectors. Is this data not enough to classify the second set? If so, what can be done to achieve better classification results for the second set?

Matlab firpm fails for large AFR data arrays

Here is a quick & dirty code for trying to create a high precision equalizer:
bandPoints = 355;
for n = 1:bandPoints
    x = (n / (bandPoints + 2));
    f = (x*x)*(22000-20)+20; % 20...22000
    freqs(n) = f;
    niqfreqs(n) = f/22050.0;
    amps(n) = 0;
end
amps(bandPoints+1) = 0;     % firpm needs an even number of points
niqfreqs(bandPoints+1) = 1; % firpm needs an even number of points
% set some point to have a high amplitude
amps(200) = 1;
fircfs = firpm(101,niqfreqs,amps);
[h,w] = freqz(fircfs,1,512);
plot(w/pi,abs(h));
legend('firpm Design')
but it gives me
Warning:
*** FAILURE TO CONVERGE ***
Probable cause is machine rounding error.
and all FIR coefficients are 0.
If I lower the n parameter from 101 to 91, firpm works without errors, but the frequency response is far from the desired one. Taking into account that I want to calculate FIR coefficients for a hardware DSP FIR module, which supports up to 12288 taps, how can I make Matlab calculate the needed coefficients? Is firpm capable of doing this, or do I need to use another approach (inverse FFT) in both Matlab and later in my C++ application code?
Oh, it seems the Parks-McClellan algorithm really cannot handle this, so I need some other solution:
http://www.eetimes.com/design/embedded/4212775/Designing-very-high-order-FIR-filters-with-zero-stuffing
I guess I'll have to stick with the inverse FFT then.
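For anyone ending up here: a frequency-sampling design with fir2 might be worth a try before hand-rolling the inverse FFT, since fir2 is based on an inverse FFT plus windowing and copes with very high orders. This is only a sketch; the order 4095 and prepending a 0 frequency point (fir2 requires the grid to start at 0 and end at 1) are my adaptations of the grids from the code above:
% fir2 frequency-sampling design (sketch)
f2 = [0, niqfreqs];  % fir2 needs the first point at 0 and the last at 1
a2 = [0, amps];
fircfs = fir2(4095, f2, a2);  % order 4095 -> 4096 taps
[h, w] = freqz(fircfs, 1, 4096);
plot(w/pi, abs(h));
legend('fir2 Design')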