MatLab BayesNetToolbox parameter learning - matlab

My question is specific to the "learn_params()" function of the BayesNetToolbox in MatLab. In the user manual, "learn_params()" is stated to be suitable for use only if the input data is fully observed. I have tried it with a partially observed dataset where I represented unobserved values as NaN's.
It seems like "learn_params()" can deal with NaNs and the node state combinations that do not occur in the dataset. When I apply dirichlet priors to smoothen the 0 values, I get 'sensible' MLE distributions for all nodes. I have copied the script where I do this.
Can someone clarify whether what I am doing makes sense or if I am missing
something, i.e. the reason why "learn_params()" cannot be used with partially
observed data.
The MatLab Script where I test this is here:
% Incomplete dataset (where NaN's are unobserved)
Age = [1,2,2,NaN,3,3,2,1,NaN,2,1,1,3,NaN,2,2,1,NaN,3,1];
TNMStage = [2,4,2,3,NaN,1,NaN,3,1,4,3,NaN,2,4,3,4,1,NaN,2,4];
Treatment = [2,3,3,NaN,2,NaN,4,4,3,3,NaN,2,NaN,NaN,4,2,NaN,3,NaN,4];
Survival = [1,2,1,2,2,1,1,1,1,2,2,1,2,2,1,2,1,2,2,1];
matrixdata = [Age;TNMStage;Treatment;Survival];
node_sizes =[3,4,4,2];
% Enter the variablesmap
keys = {'Age', 'TNM','Treatment', 'Survival'};
v= 1:1:length(keys);
VariablesMap = containers.Map(keys,v);
% create the dag and the bnet
N = length(node_sizes); % Instead of entering it manually
dag2 = zeros(N,N);
dag2(VariablesMap('Treatment'),VariablesMap('Survival')) = 1;
bnet21 = mk_bnet(dag2, node_sizes);
draw_graph(bnet21.dag);
dirichletweight=1;
% define the CPD priors you want to use
bnet23.CPD{VariablesMap('Age')} = tabular_CPD(bnet23, VariablesMap('Age'), 'prior_type', 'dirichlet','dirichlet_type', 'unif', 'dirichlet_weight', dirichletweight);
bnet23.CPD{VariablesMap('TNM')} = tabular_CPD(bnet23, VariablesMap('TNM'), 'prior_type', 'dirichlet','dirichlet_type', 'unif', 'dirichlet_weight', dirichletweight);
bnet23.CPD{VariablesMap('Treatment')} = tabular_CPD(bnet23, VariablesMap('Treatment'), 'prior_type', 'dirichlet','dirichlet_type', 'unif','dirichlet_weight', dirichletweight);
bnet23.CPD{VariablesMap('Survival')} = tabular_CPD(bnet23, VariablesMap('Survival'), 'prior_type', 'dirichlet','dirichlet_type', 'unif','dirichlet_weight', dirichletweight);
% Find MLEs from incomplete data with Dirichlet prior CPDs
bnet24 = learn_params(bnet23, matrixdata);
% Look at the new CPT values after parameter estimation has been carried out
CPT24 = cell(1,N);
for i=1:N
s=struct(bnet24.CPD{i}); % violate object privacy
CPT24{i}=s.CPT;
end

According to my understanding of the BNT documentation, you need to make a couple of changes:
Missing values should be represented as empty cells instead of NaN values.
The learn_params_em function is the only one that supports missing values.
My previous response was incorrect, as I mis-recalled which of the BNT learning functions had support for missing values.

Related

How to perform fuzzy clustering method on Qualitative Bankruptcy dataset

We are required to build a fuzzy system with MATLAB on Qualitative_Bankruptcy Data Set and we were advised to implement Fuzzy Clustering Method on it.
There are 7 attributes (6+1) on the dataset (250 instances) and each independent attribute has 3 possible values, which are Positive, Average, and Negative. Please refer to the dataset for more.
From our understanding, clustering is about grouping instances that exhibit similar properties by calculating the distances between the parameters. So the data could be like this. Picture below is just a dummy data, not relevant to my project.
The question is, how is it possible to implement a cluster analysis on a dataset like this.
P,P,A,A,A,P,NB
N,N,A,A,A,N,NB
A,A,A,A,A,A,NB
P,P,P,P,P,P,NB
N,N,N,A,N,A,B
N,N,N,P,N,N,B
N,N,N,N,N,P,B
N,N,N,N,N,A,B
Since you asked about fuzzy clustering, you are contradicting yourself.
In fuzzy clustering, every object belongs to every cluster, just to a varying degree (the cluster assignment is "fuzzy").
It's mostly used with numerical data, where you can assume the measurements are not precise either, but come with a fuzzy error, too. So I don't think it makes as much sense on categoricial data.
Now categoricial data tends to cluster really bad beyond counting duplicates. It just has a too coarse resolution. People do all kind of crazy hacks like running k-means on dummy variables, and never seem to question what they actually compute/optimize by doing this. Nor test their result...
Well, let's start from reading your data:
clear();
clc();
close all;
opts = detectImportOptions('Qualitative_Bankruptcy.data.txt');
opts.DataLine = 1;
opts.MissingRule = 'omitrow';
opts.VariableNamesLine = 0;
opts.VariableNames = {'IR' 'MR' 'FF' 'CR' 'CO' 'OR' 'Class'};
opts.VariableTypes = repmat({'categorical'},1,7);
opts = setvaropts(opts,'Categories',{'P' 'A' 'N'});
opts = setvaropts(opts,'Class','Categories',{'B' 'NB'});
data = readtable('Qualitative_Bankruptcy.data.txt',opts);
data = rmmissing(data);
data_len = height(data);
Now, since the kmeans function (reference here) accepts only numeric values, we need to convert a table of categorical values into a matrix:
x = double(table2array(data));
And finally, we apply the function:
[idx,c] = kmeans(x,number_of_clusters);
Now comes the problem. The k-means clustering can be performed using a wide variety of distance measures together with a wide variety of options. You have to play with those parameters in order to obtain the clustering that better approximates your available output.
Since k-means clustering organizes your data into n clusters, this means that your output defines more than 3 clusters because 46 + 71 + 61 = 178... and since your data contains 250 observations, 72 of them are assigned to one or more clusters that are unknown to me (and maybe to you too).
If you want to replicate that output, or to find the clustering that better approximate your output... you have to find, if available, an algorithm that minimize the error... or alternatively you can try to brute-force it, for example:
% ...
x = double(table2array(data));
cl1_targ = 46;
cl2_targ = 71;
cl3_targ = 61;
dist = {'sqeuclidean' 'cityblock' 'cosine' 'correlation'};
res = cell(16,3);
res_off = 1;
for i = 1:numel(dist)
dist_curr = dist{i};
for j = 3:6
idx = kmeans(x,j,'Distance',dist_curr); % start parameter needed
cl1 = sum(idx == 1);
cl2 = sum(idx == 2);
cl3 = sum(idx == 3);
err = abs(cl1 - cl1_targ) + abs(cl2 - cl2_targ) + abs(cl3 - cl3_targ);
res(res_off,:) = {dist_curr j err};
res_off = res_off + 1;
end
end
[min_val,min_idx] = min([res{:,3}]);
best = res(min_idx,1:2);
Don't forget to remember that the kmeans function uses a randomly-chosen starting configuration... so it will end up delivering different solutions for different starting points. Define fixed starting points (means) using the Start parameter, otherwise a different result will be produced every time your run the kmeans function.

Matlab: Change variable resolution and names for viewing regression trees

Using treeMine = fitctree(....) I can generate a decision tree but the tree is very big, and therefore very difficult to convey information, when using view(treeMine,'Mode','Graph')
Therefore my question is if it is possible to change variable names x1-x9 to other names to make it human understandable and if I could force the numbers to be represented by engineering notation meaning 10e3.
Does anybody know how this can be done?
Minimal Example
Minimal example can be to use Matlabs own car example:
load carsmall
idxNaN = isnan(MPG + Weight);
X = Weight(~idxNaN);
Y = MPG(~idxNaN);
n = numel(X);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false; % Validation set logical indices
Mdl = fitrtree(X(idxTrn),Y(idxTrn));
view(Mdl,'Mode','graph')
How do you then specify the value resolution and variable name
About the names: It's a bit a poor example because you use only one predictor (weight), but you can change the name with the 'PredictorNames' name-value pair, e.g.
Mdl = fitrtree(X(idxTrn),Y(idxTrn),'PredictorNames',{'weight'});
If you were to use more predictors you just have to add more elements to the cell array, e.g.
'PredictorNames',{'weight','age','women'}
I don't know about the numbers tough.

MATLAB Piecewise function

I have to construct the following function in MATLAB and am having trouble.
Consider the function s(t) defined for t in [0,4) by
{ sin(pi*t/2) , for t in [0,1)
s(t) = { -(t-2)^3 , for t in [1,3)*
{ sin(pi*t/2) , for t in [3,4)
(i) Generate a column vector s consisting of 512 uniform
samples of this function over the interval [0,4). (This
is best done by concatenating three vectors.)
I know it has to be something of the form.
N = 512;
s = sin(5*t/N).' ;
But I need s to be the piecewise function, can someone provide assistance with this?
If I understand correctly, you're trying to create 3 vectors which calculate the specific function outputs for all t, then take slices of each and concatenate them depending on the actual value of t. This is inefficient as you're initialising 3 times as many vectors as you actually want (memory), and also making 3 times as many calculations (CPU), most of which will just be thrown away. To top it off, it'll be a bit tricky to use concatenate if your t is ever not as you expect (i.e. monotonically increasing). It might be an unlikely situation, but better to be general.
Here are two alternatives, the first is imho the nice Matlab way, the second is the more conventional way (you might be more used to that if you're coming from C++ or something, I was for a long time).
function example()
t = linspace(0,4,513); % generate your time-trajectory
t = t(1:end-1); % exclude final value which is 4
tic
traj1 = myFunc(t);
toc
tic
traj2 = classicStyle(t);
toc
end
function trajectory = myFunc(t)
trajectory = zeros(size(t)); % since you know the size of your output, generate it at the beginning. More efficient than dynamically growing this.
% you could put an assert for t>0 and t<3, otherwise you could end up with 0s wherever t is outside your expected range
% find the indices for each piecewise segment you care about
idx1 = find(t<1);
idx2 = find(t>=1 & t<3);
idx3 = find(t>=3 & t<4);
% now calculate each entry apprioriately
trajectory(idx1) = sin(pi.*t(idx1)./2);
trajectory(idx2) = -(t(idx2)-2).^3;
trajectory(idx3) = sin(pi.*t(idx3)./2);
end
function trajectory = classicStyle(t)
trajectory = zeros(size(t));
% conventional way: loop over each t, and differentiate with if-else
% works, but a lot more code and ugly
for i=1:numel(t)
if t(i)<1
trajectory(i) = sin(pi*t(i)/2);
elseif t(i)>=1 & t(i)<3
trajectory(i) = -(t(i)-2)^3;
elseif t(i)>=3 & t(i)<4
trajectory(i) = sin(pi*t(i)/2);
else
error('t is beyond bounds!')
end
end
end
Note that when I tried it, the 'conventional way' is sometimes faster for the sampling size you're working on, although the first way (myFunc) is definitely faster as you scale up really a lot. In anycase I recommend the first approach, as it is much easier to read.

ARMAX fit Percent

I am using armax model to describe the relationship between two signals. I have used matlab armax function with different model orders.
To evaluate the efficiency of the model I pulled the value from Report.Fit.FitPercent expecting that it will tell how well the model fits the experimental data. As it is fitpercent I would expect it to be between 0-100%. My results range from ~ -257 to 99.99.
I couldn't find on the mathworks or other websites how this value is calculated and how to interpret it. It would be great if you could explain how to understand the fitPercent value.
The code I used is very simple and it generates FitPercent for different model structures (orders).
opt = armaxOptions;
opt.InitialCondition = 'auto';
opt.Focus = 'simulation';
j=1; %number of dataset for analysis
i=1;
nk=0;
for na=1:1:6
for nb=1:1:6
for nc=1:1:6
m_armax = armax(data(:,:,:,j), [na nb nc nk], opt);
fit(i) = m_armax.Report.Fit.FitPercent
struct(:,i) = [na;nb;nc];
i=i+1
end
end
end
In the doc it states that the fit percent value is calculated with the compare function:
http://www.mathworks.de/de/help/ident/ref/compare.html?searchHighlight=fit

A moving average with different functions and varying time-frames

I have a matrix time-series data for 8 variables with about 2500 points (~10 years of mon-fri) and would like to calculate the mean, variance, skewness and kurtosis on a 'moving average' basis.
Lets say frames = [100 252 504 756] - I would like calculate the four functions above on over each of the (time-)frames, on a daily basis - so the return for day 300 in the case with 100 day-frame, would be [mean variance skewness kurtosis] from the period day201-day300 (100 days in total)... and so on.
I know this means I would get an array output, and the the first frame number of days would be NaNs, but I can't figure out the required indexing to get this done...
This is an interesting question because I think the optimal solution is different for the mean than it is for the other sample statistics.
I've provided a simulation example below that you can work through.
First, choose some arbitrary parameters and simulate some data:
%#Set some arbitrary parameters
T = 100; N = 5;
WindowLength = 10;
%#Simulate some data
X = randn(T, N);
For the mean, use filter to obtain a moving average:
MeanMA = filter(ones(1, WindowLength) / WindowLength, 1, X);
MeanMA(1:WindowLength-1, :) = nan;
I had originally thought to solve this problem using conv as follows:
MeanMA = nan(T, N);
for n = 1:N
MeanMA(WindowLength:T, n) = conv(X(:, n), ones(WindowLength, 1), 'valid');
end
MeanMA = (1/WindowLength) * MeanMA;
But as #PhilGoddard pointed out in the comments, the filter approach avoids the need for the loop.
Also note that I've chosen to make the dates in the output matrix correspond to the dates in X so in later work you can use the same subscripts for both. Thus, the first WindowLength-1 observations in MeanMA will be nan.
For the variance, I can't see how to use either filter or conv or even a running sum to make things more efficient, so instead I perform the calculation manually at each iteration:
VarianceMA = nan(T, N);
for t = WindowLength:T
VarianceMA(t, :) = var(X(t-WindowLength+1:t, :));
end
We could speed things up slightly by exploiting the fact that we have already calculated the mean moving average. Simply replace the within loop line in the above with:
VarianceMA(t, :) = (1/(WindowLength-1)) * sum((bsxfun(#minus, X(t-WindowLength+1:t, :), MeanMA(t, :))).^2);
However, I doubt this will make much difference.
If anyone else can see a clever way to use filter or conv to get the moving window variance I'd be very interested to see it.
I leave the case of skewness and kurtosis to the OP, since they are essentially just the same as the variance example, but with the appropriate function.
A final point: if you were converting the above into a general function, you could pass in an anonymous function as one of the arguments, then you would have a moving average routine that works for arbitrary choice of transformations.
Final, final point: For a sequence of window lengths, simply loop over the entire code block for each window length.
I have managed to produce a solution, which only uses basic functions within MATLAB and can also be expanded to include other functions, (for finance: e.g. a moving Sharpe Ratio, or a moving Sortino Ratio). The code below shows this and contains hopefully sufficient commentary.
I am using a time series of Hedge Fund data, with ca. 10 years worth of daily returns (which were checked to be stationary - not shown in the code). Unfortunately I haven't got the corresponding dates in the example so the x-axis in the plots would be 'no. of days'.
% start by importing the data you need - here it is a selection out of an
% excel spreadsheet
returnsHF = xlsread('HFRXIndices_Final.xlsx','EquityHedgeMarketNeutral','D1:D2742');
% two years to be used for the moving average. (250 business days in one year)
window = 500;
% create zero-matrices to fill with the MA values at each point in time.
mean_avg = zeros(length(returnsHF)-window,1);
st_dev = zeros(length(returnsHF)-window,1);
skew = zeros(length(returnsHF)-window,1);
kurt = zeros(length(returnsHF)-window,1);
% Now work through the time-series with each of the functions (one can add
% any other functions required), assinging the values to the zero-matrices
for count = window:length(returnsHF)
% This is the most tricky part of the script, the indexing in this section
% The TwoYearReturn is what is shifted along one period at a time with the
% for-loop.
TwoYearReturn = returnsHF(count-window+1:count);
mean_avg(count-window+1) = mean(TwoYearReturn);
st_dev(count-window+1) = std(TwoYearReturn);
skew(count-window+1) = skewness(TwoYearReturn);
kurt(count-window +1) = kurtosis(TwoYearReturn);
end
% Plot the MAs
subplot(4,1,1), plot(mean_avg)
title('2yr mean')
subplot(4,1,2), plot(st_dev)
title('2yr stdv')
subplot(4,1,3), plot(skew)
title('2yr skewness')
subplot(4,1,4), plot(kurt)
title('2yr kurtosis')