MATLAB's fitgmdist function in 1 dimension

I have previously posted this on the Mathworks Community, but am reposting here for a wider audience...
I have a 1-dimensional histogram to which I want to fit Gaussians:
In the above example I need to find the centres of the 4 dominant peaks; however, the number of peaks may vary from one histogram to the next.
Below is a MWE of my approach:
bins = 2000;
fsc_hist = histogram(FSC_data.FSC_HF,bins);hold on;
%% smooth data to get rid of discretization
fscValues = fsc_hist.Values;
binStep = (fsc_hist.BinLimits(2)-fsc_hist.BinLimits(1))/fsc_hist.NumBins;
binCenters = binStep * [0:fsc_hist.NumBins-1];
smoothValues = smooth(binCenters, fscValues, 0.1, 'rloess');
%% fit GMM
expectedPeaks = 4;
gmm = fitgmdist(smoothValues, expectedPeaks, 'RegularizationValue', 0.1);
Which returns the following GMM result:
Gaussian mixture distribution with 4 components in 1 dimensions
Component 1: Mixing proportion: 0.294734 Mean: 0.2417
Component 2: Mixing proportion: 0.152275 Mean: 41.9369
Component 3: Mixing proportion: 0.344658 Mean: 6.8231
Component 4: Mixing proportion: 0.208333 Mean: 24.6758
Obviously, the calculated mean values of the Gaussians are not correct.
Where is my approach going wrong? I believe that either my first input to the fitgmdist function must somehow be normalised, or that I need to post-process the output. So far, my attempts have failed.

What's happening is that the mixture model is giving you the means of Gaussian distributions fitted to the bin counts, not to the measured values. fitgmdist expects one observation per row, so instead of passing the (smoothed) histogram, you should pass the raw FSC_data.FSC_HF data as the first argument.
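A minimal sketch of that suggestion, using synthetic data as a stand-in for FSC_data.FSC_HF. The two populations and their centres at 0 and 10 are assumptions chosen purely for illustration, and the Statistics and Machine Learning Toolbox is assumed to be available:

```matlab
rng(0);                                   % reproducible synthetic data
% Hypothetical raw measurements: two well-separated populations.
x = [randn(500,1); 10 + randn(500,1)];    % one observation per row
nPeaks = 2;                               % assumed number of peaks
gmm = fitgmdist(x, nPeaks);               % fit to the samples themselves
peakCentres = sort(gmm.mu);               % component means, ascending
disp(peakCentres);
```

Because the model is fitted to the samples rather than to bin counts, the means come out in the units of the data and do not depend on how many bins the histogram uses.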

Related

How to multiply a frequency in histogram by scalar

I would prefer a MATLAB solution for this.
I need to multiply the frequency (count) in each bin of a histogram by a scalar value.
I have tried the approach from a similar question, but it is written for hist and not for the histogram function.
This is my original distribution that needs to be multiplied:
This is what I get using the approach given in the similar question:
Additionally, when I finish this part I will have more histograms that I need to sum up into one histogram. So how would I do that? They might have different ranges.
The documentation clearly explains how to replicate the behavior of hist with histogram.
For example:
A = rand(100, 1);
h = histogram(A);
figure
h_new = histogram('BinCounts', h.Values*2, 'BinEdges', h.BinEdges);
Generates the following histograms:
You can modify the BinCounts like this:
X = normrnd(0,1,1000,1); % some data
h = histogram(X,3); % histogram with 3 bins
h.BinCounts = h.Values.*[3 5 1]; % scale each bin by factor 3, 5 and 1 respectively
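For the follow-up about summing several histograms with different ranges, one possible sketch is to bin every data set against a shared set of edges with histcounts and then add the counts. The data sets a and b, the scale factors and the number of bins are made up for illustration:

```matlab
rng(0);
a = randn(1000,1);                 % hypothetical data set 1
b = 3 + 2*randn(500,1);            % hypothetical data set 2, different range
% Common edges spanning both data sets.
lo = min([a; b]);
hi = max([a; b]);
edges = linspace(lo, hi, 31);      % 30 shared bins
countsA = histcounts(a, edges);    % counts of a in the shared bins
countsB = histcounts(b, edges);    % counts of b in the shared bins
total = 2*countsA + 3*countsB;     % scale each histogram, then sum
figure;
histogram('BinEdges', edges, 'BinCounts', total);
```

Using the same edges for every data set is what makes the counts addable bin by bin; histograms that were binned independently would first have to be re-binned.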

GMModel - how do I use this to predict a label's data?

I've made a GMModel using fitgmdist. The idea is to produce two gaussian distributions on the data and use that to predict their labels. How can I determine if a future data point fits into one of those distributions? Am I misunderstanding the purpose of a GMModel?
clear;
load C:\Users\Daniel\Downloads\data1 data;
% Mixed Gaussian
GMModel = fitgmdist(data(:, 1:4),2)
Produces
GMModel =
Gaussian mixture distribution with 2 components in 4 dimensions
Component 1:
Mixing proportion: 0.509709
Mean: 2.3254 -2.5373 3.9288 0.4863
Component 2:
Mixing proportion: 0.490291
Mean: 2.5161 -2.6390 0.8930 0.4833
Edit:
clear;
load C:\Users\Daniel\Downloads\data1 data;
% Mixed Gaussian
GMModel = fitgmdist(data(:, 1:4),2);
P = posterior(GMModel, data(:, 1:4));
X = round(P)
blah = X(:, 1)
dah = data(:, 5)
Y = max(mean(blah == dah), mean(~blah == dah))
I don't understand why you round the posterior values. Here is what I would do after fitting a mixture model.
P = posterior(GMModel, data(:, 1:4));
[~,Y] = max(P,[],2);
Now Y contains the labels, that is, the index of the Gaussian component each data point most probably belongs to in the maximum a posteriori (MAP) sense. It is important to align the labels before evaluating the classification error, because the components may be numbered differently: Gaussian component 1 in the true labelling might be component 2 in the clustering, and so on. That may be why your accuracy varies from 51% to 95%, in addition to other subtle problems.
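A rough sketch of the alignment step for the two-component case. The data here is synthetic, and the true labels are assumed to be coded as 1 and 2 (the coding of data(:, 5) in the question is not known):

```matlab
rng(1);
% Hypothetical data: two well-separated 2-D clusters with known labels.
X = [randn(200,2); randn(200,2) + 6];
trueLabels = [ones(200,1); 2*ones(200,1)];
GMModel = fitgmdist(X, 2);
P = posterior(GMModel, X);
[~, Y] = max(P, [], 2);                     % MAP component index (1 or 2)
% The component numbering is arbitrary, so score both possible mappings.
accAsIs    = mean(Y == trueLabels);         % component k <-> label k
accSwapped = mean((3 - Y) == trueLabels);   % component 1 <-> label 2
accuracy   = max(accAsIs, accSwapped);
```

With more than two components the same idea applies, but every permutation of the component indices (or an assignment algorithm) would have to be considered.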

MATLAB's newrb for designing radial basis networks does not behave in accordance with the documentation. Why?

I'm trying to approximate various signals using radial basis networks. In particular, I make use of MATLAB's newrb.
My problem is that this function seems to behave incorrectly if I follow the description of newrb. As I understand it, it makes sense to transpose all arguments despite the documentation.
The following example hopefully illustrates my problem.
I create one period of a sine wave with 100 samples. I would like to approximate this sine wave by means of a radial basis network with maximally two hidden neurons. I have one input vector (t) and one target vector (s). Hence, according to the documentation, I should call newrb with two column vectors. However, the approximation is too good. In fact, the mean squared error is 0 which can't be true using only two neurons. Additionally, the visualization with view(net) shows not only one but 100 inputs if I use column vectors.
In the example, the vectors corresponding to the "correct" (according to the documentation) function call are indicated by _doc, the ones corresponding to the "incorrect" call by _not_doc.
Can anybody explain this behavior?
% one period sine signal with
% carrier frequency = 1, sampling frequency = 100
Ts = 1 / 100;
t = 2 * pi * (0:Ts:1-Ts); % size(t) = 1 100
s = sin(t); % size(s) = 1 100
% design radial basis network
MSE_goal = 0.0; % mean squared error goal, default value
spread = 1.0; % spread of radial basis functions, default value
max_neurons = 2; % maximum number of neurons, custom value
DF = 25; % number of neurons to add between displays, default value
net_not_doc = newrb( t , s , MSE_goal, spread, max_neurons, DF ); % row vectors
net_doc = newrb( t', s', MSE_goal, spread, max_neurons, DF ); % column vectors
% simulate network
approx_not_doc = sim( net_not_doc, t );
approx_doc = sim( net_doc, t' );
% plot
figure;
plot( t, s, 'DisplayName', 'Sine' );
hold on;
plot( t, approx_not_doc, 'r:', 'DisplayName', 'Approximation_{not doc}');
hold on;
plot( t, approx_doc', 'g:', 'DisplayName', 'Approximation_{doc}');
grid on;
legend show;
% view neural networks
view(net_not_doc);
view(net_doc);
Because I had the same problem myself, I'll try to give an answer for anyone who stumbles upon this post.
As far as I can tell, the problem is not the transposed vectors. You can use your data as it is, without transposing anything: newrb treats each column of its input as one sample, so the row vectors t and s already describe 100 one-dimensional samples.
The fact that you train your RBF network with the vector t and then simulate it with that same vector is the reason your approximation looks so perfect. You are testing the network on the very values you taught it.
If you really want to test your network, you must choose a different vector for testing. For your example I used this:
% simulate network
t_test = 2 * pi * ((1-Ts)/2:Ts:3-Ts);
approx_not_doc = sim( net_not_doc, t_test );
Now when you plot your results, you can see that the points that have the same values as in your training vector are approximated almost flawlessly. The rest are off target because of the small number of neurons (as you expected).
Plot of t_test with approx_not_doc.
If you add more neurons (in this example I used 100), you can see that the new network can predict, with the same test vector t_test, a part of your function it has not seen. Plot of t_test with approx_not_doc for 100 neurons. Of course, if you try different numbers of neurons and different spreads, your results will vary.
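A small sketch of testing on points the network was not trained on, with training and test error side by side. The test grid (points halfway between the training samples) is an assumption made for illustration, and the toolbox that provides newrb is assumed to be installed:

```matlab
Ts = 1/100;
t = 2*pi*(0:Ts:1-Ts);                  % training inputs, 1-by-100 (one column per sample)
s = sin(t);                            % training targets
net = newrb(t, s, 0.0, 1.0, 2, 25);    % at most 2 hidden neurons
% Hypothetical test grid: points halfway between the training samples.
t_test = t + pi*Ts;
approx_test = sim(net, t_test);
mse_train = mean((sim(net, t) - s).^2);
mse_test  = mean((approx_test - sin(t_test)).^2);
fprintf('train MSE: %g, test MSE: %g\n', mse_train, mse_test);
```

Comparing the two errors for several values of the neuron limit and the spread gives a more honest picture of the fit than a plot over the training points alone.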
Hope this will help anyone with the same problem.

How do I visualize n-dimensional features?

I have two matrices A and B. The size of A is 200*1000 double (here: 1000 represents 1000 different features). Matrix A belongs to group 1, where I use ones(200,1) as the label vector. The size of B is also 200*1000 double (here: 1000 also represents 1000 different features). Matrix B belongs to group 2, where I use -1*ones(200,1) as the label vector.
My question is how do I visualize matrices A and B so that I can clearly distinguish them based on the given groups?
I'm assuming each sample in your matrices A and B is determined by a row in either matrix. If I understand you correctly, you want to draw a series of 1000-dimensional vectors, which is impossible. We can't physically visualize anything beyond three dimensions.
As such, what I suggest you do is perform a dimensionality reduction to reduce your data so that each input is reduced to either 2 or 3 dimensions. Once you reduce your data, you can plot them normally and assign a different marker to each point, depending on what group they belonged to.
If you want to achieve this in MATLAB, use Principal Components Analysis, specifically the pca function, which returns the principal component coefficients and the samples reprojected onto those components (the scores). I'm assuming you have the Statistics Toolbox... if you don't, then sorry, this won't work.
Specifically, given your matrices A and B, you would do this:
[coeffA, scoreA] = pca(A);
[coeffB, scoreB] = pca(B);
numDimensions = 2;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
The second output of pca gives you reprojected values and so you simply have to determine how many dimensions you want by extracting the first N columns, where N is the desired number of dimensions you want.
I chose 2 for now, and we can see what it looks like in 3 dimensions after. Once we have what we need for 2 dimensions, it's just a matter of plotting:
plot(scoreAred(:,1), scoreAred(:,2), 'rx', scoreBred(:,1), scoreBred(:,2), 'bo');
This will produce a plot where the samples from matrix A are drawn as red crosses and the samples from matrix B as blue circles.
Here's a sample run given completely random data:
rng(123); %// Set seed for reproducibility
A = rand(200,1000); B = rand(200,1000); %// Generate random data
%// Code as before
[coeffA, scoreA] = pca(A);
[coeffB, scoreB] = pca(B);
numDimensions = 2;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
%// Plot the data
plot(scoreAred(:,1), scoreAred(:,2), 'rx', scoreBred(:,1), scoreBred(:,2), 'bo');
We get this:
If you want three dimensions, simply change numDimensions = 3, then change the plot code to use plot3:
plot3(scoreAred(:,1), scoreAred(:,2), scoreAred(:,3), 'rx', scoreBred(:,1), scoreBred(:,2), scoreBred(:,3), 'bo');
grid;
With those changes, this is what we get:
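A variation worth considering, offered as an alternative rather than as part of the approach above: running pca separately on A and B projects each group onto its own axes, so the two point clouds are not drawn in the same coordinate system. If the aim is to see whether the groups separate, one option is to stack both matrices and run a single pca. The data below is synthetic (50 features and a constant offset for group B, both assumed for illustration):

```matlab
rng(123);
A = rand(200,50);                         % hypothetical group 1
B = rand(200,50) + 0.5;                   % hypothetical group 2, shifted
X = [A; B];                               % stack both groups
labels = [ones(200,1); -ones(200,1)];     % group labels as in the question
[coeff, score] = pca(X);                  % one shared set of axes
inA = labels == 1;
figure;
plot(score(inA,1),  score(inA,2),  'rx', ...
     score(~inA,1), score(~inA,2), 'bo');
legend('A', 'B');
```

Here the first principal component picks up the offset between the groups, so the red and blue points end up on opposite sides of the plot.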

Find the height and length of waves in noisy data

My goal is to find the maximum values of wave heights and wave lengths.
dwcL01 through dwcL10 are arrays of <3001x2 double> with output from a numerical wave model.
Part of my script:
%% Plotting results from SWASH
% Examination of phase velocity on deep water with different number of layers
% Wave height 3 meters, wave period 8 sec on a depth of 30 meters
clear all; close all; clc;
T=8;
L0=1.56*T^2;
%% Loading results tables.
load dwcL01.tbl; load dwcL02.tbl; load dwcL03.tbl; load dwcL04.tbl;
load dwcL05.tbl; load dwcL06.tbl; load dwcL07.tbl; load dwcL08.tbl;
load dwcL09.tbl; load dwcL10.tbl;
M(:,:,1) = dwcL01; M(:,:,2) = dwcL02; M(:,:,3) = dwcL03; M(:,:,4) = dwcL04;
M(:,:,5) = dwcL05; M(:,:,6) = dwcL06; M(:,:,7) = dwcL07; M(:,:,8) = dwcL08;
M(:,:,9) = dwcL09; M(:,:,10) = dwcL10;
%% Finding position of wave crest using diff and sign.
for i=1:10
Tp(:,1,i) = diff(sign(diff([M(1,2,i);M(:,2,i)]))) < 0;
Wc(:,:,i) = M(Tp,:,i);
L(:,i) = diff(Wc(:,1,i))
end
This works fine for finding the maximum values if the data is smooth. The following image shows a section of my data: I get all the peaks, when I only need the one around x = 40. How do I filter the data so that I only get the "real" wave crests? The solution needs to be general, so that it still works if I change the domain size, wave height or wave period.
If you're basically trying to fit this curve of data to a sine wave, have you considered performing Fourier analysis (FFT in Matlab), then checking the magnitude of that fundamental frequency? The frequency will tell you the wave spacing, and the magnitude the height, and when used over multiple periods will find an average.
See the MATLAB help page for fft for a usage example, but the basic gist is:
y = [...] %vector of wave data points
N=length(y); %Make sure this is an even number
Y = fft(y); %Convert into frequency domain
figure;
plot(y(1:N)); %Plot original wave data
figure;
plot(abs(Y(1:N/2))./N); %Plot only the first half of the spectrum; for real data the second half mirrors it
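To turn the spectrum into numbers, a sketch along these lines may help. It uses a synthetic noisy sine in place of the model output; the grid spacing dx, the wavelength of 100 m and the noise level are assumptions, and the one-sided amplitude is scaled by 2/N so that the peak corresponds to the sine amplitude:

```matlab
rng(2);
dx = 0.5;                              % assumed grid spacing [m]
x = (0:dx:1000-dx)';                   % N = 2000 points, 1000 m domain
Ltrue = 100;                           % assumed wavelength [m]
Htrue = 3;                             % wave height (crest to trough) [m]
y = (Htrue/2)*sin(2*pi*x/Ltrue) + 0.1*randn(size(x));
N = numel(y);                          % even number of samples
Y = fft(y - mean(y));                  % remove the mean so bin 1 (DC) is ~0
amp = 2*abs(Y(1:N/2))/N;               % one-sided amplitude spectrum
[peakAmp, k] = max(amp);               % dominant spatial frequency bin
wavelength = N*dx/(k-1);               % bin k holds k-1 cycles per domain
waveHeight = 2*peakAmp;                % height = 2 * amplitude
fprintf('wavelength: %.1f m, height: %.2f m\n', wavelength, waveHeight);
```

The estimate is only this clean because the domain holds a whole number of wavelengths; otherwise the energy leaks into neighbouring bins and the peak amplitude underestimates the height.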
I have one more solution that might work for you. It involves computing the 2nd-order derivative using a 5-point central difference instead of the 2-point finite differences. When using diff twice you are performing two first-order derivatives consecutively (finite 2-point differences) which are very susceptible to noise/oscillations. The advantage of using a higher-order approximation is that the neighboring points help filter out the small oscillations, and this may work for your case.
Let f = squeeze(M(:,2,i)) be the array of data points and h the uniform spacing between the points:
%Better approximation of the 2nd derivative using neighboring points:
for j=3:length(f)-2
Tp(j,i) = (-f(j-2) + 16*f(j-1) - 30*f(j) + 16*f(j+1) - f(j+2))/(12*h^2);
end
Note that since this 2nd-order derivative requires the 2 neighboring points to the left and right, the loop must start at the 3rd index and end 2 short of the array length.
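As a quick sanity check of the stencil, here is a vectorised sketch applied to a test signal whose second derivative is known. The signal sin(x) and the spacing h = 0.01 are chosen purely for illustration:

```matlab
h = 0.01;                              % assumed uniform spacing
x = (0:h:2*pi)';
f = sin(x);                            % test signal with known f'' = -sin(x)
j = 3:numel(f)-2;                      % interior points only
d2f = nan(size(f));                    % the 2 points at each end stay NaN
d2f(j) = (-f(j-2) + 16*f(j-1) - 30*f(j) + 16*f(j+1) - f(j+2)) / (12*h^2);
maxErr = max(abs(d2f(j) + sin(x(j)))); % deviation from the exact -sin(x)
fprintf('max error: %g\n', maxErr);
```

The indexed form computes the same values as the loop above, just for all interior points at once.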