Matlab: Change variable resolution and names for viewing regression trees

Using treeMine = fitctree(...) I can generate a decision tree, but the tree is very big and therefore conveys its information poorly when displayed with view(treeMine,'Mode','Graph').
My question is therefore whether it is possible to change the variable names x1-x9 to other, human-readable names, and whether I can force the numbers to be displayed in engineering notation, i.e. 10e3.
Does anybody know how this can be done?
Minimal Example
A minimal example can be built from Matlab's own carsmall data:
load carsmall
idxNaN = isnan(MPG + Weight);
X = Weight(~idxNaN);
Y = MPG(~idxNaN);
n = numel(X);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true; % Training set logical indices
idxVal = idxTrn == false; % Validation set logical indices
Mdl = fitrtree(X(idxTrn),Y(idxTrn));
view(Mdl,'Mode','graph')
How do you then specify the value resolution and the variable names?

About the names: it's a bit of a poor example because you use only one predictor (weight), but you can change the name with the 'PredictorNames' name-value pair, e.g.
Mdl = fitrtree(X(idxTrn),Y(idxTrn),'PredictorNames',{'weight'});
If you were to use more predictors you just have to add more elements to the cell array, e.g.
'PredictorNames',{'weight','age','women'}
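For example, here is a quick sketch extending the carsmall example above with Horsepower as a second predictor (a hypothetical extension, with the NaN filter redone because Horsepower has its own missing values):
idxNaN2 = isnan(MPG + Weight + Horsepower); % re-filter across all used variables
X2 = [Weight(~idxNaN2), Horsepower(~idxNaN2)];
Y2 = MPG(~idxNaN2);
Mdl2 = fitrtree(X2, Y2, 'PredictorNames', {'weight','horsepower'});
view(Mdl2,'Mode','graph') % split nodes now show 'weight' and 'horsepower' instead of x1 and x2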
I don't know about the numbers, though.

Related

SVM for one class and one feature with Matlab fitcsvm gives overfitted results

I am trying to use SVM in the simplest possible way: classifying a target category (class 1) whose range of values is greater than some non-targets and lower than other non-targets (class 0). However, the model I fit is too complex and, instead of giving me two support vectors, results in a bad separation between the categories. I assume the model has too many dimensions and is therefore heavily overfitted. I tried changing 'KernelScale', but that did not fix it. Do you know how I can force the model to be simpler?
n0 = normrnd(0,1,300,1);
n1 = normrnd(1,1,300,1);
n2 = normrnd(2,1,300,1);
val = [n0;n1;n2];
lbl = [zeros(300,1);ones(300,1);zeros(300,1)];
d = fitcsvm(val, cellstr(num2str(lbl)), 'KernelScale','auto', 'KernelFunction','gaussian'); % labels converted to a 900x1 cellstr
pred = ismember(d.predict(val),'1');
figure;
plot([1:300,601:900],val([1:300,601:900]),'.b')
hold on;
plot(301:600,val(301:600),'.r')
plot(find(pred),val(pred),'og')
legend('class 0','class 1','predicted class 1','location','southeast')

How can I determine the number of trees in a random forest in Matlab?

In Matlab, we train a random forest using the TreeBagger() method, and one of the parameters of this method is the number of trees. I am using a random forest for a classification approach. How can I determine the number of trees of the random forest?
If you've been training this model, you should know the number of trees used in it, because it must be set as an input to TreeBagger().
Anyway, for a learned model like RFmodel, you can use compact(RFmodel) to determine the number of trees.
This is a regression example based on the Matlab documentation:
load imports-85;
Y = X(:,1);
X = X(:,2:end);
isCat = [zeros(15,1);ones(size(X,2)-15,1)]; % Categorical variable flag
rng(1945,'twister')
UnknownNumberofTrees=100;
RFmodel = TreeBagger(UnknownNumberofTrees,X,Y,'Method','R','OOBPred','On',...
'Cat',find(isCat == 1),'MinLeaf',5);
RFmodelObject = compact(RFmodel);
RFmodelObject.NTrees
%ans =
% 100
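As a side note, and as an assumption about the TreeBagger API rather than part of the example above: the compact step is not strictly needed just to count trees, since the full model's Trees property (a cell array holding the grown trees) can be inspected directly:
numel(RFmodel.Trees) % number of trees in the full ensemble
%ans =
% 100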

Change the random number generator in Matlab function

I have a task to complete that requires quasi-random numbers as input, but I notice that the Matlab function I want to use does not have an option to select any of the quasi-random generators I want (e.g. Halton, Sobol, etc.). Matlab has them as stand-alone functions, not as options in the ubiquitous 'randn' and 'rng' functions; what Matlab uses there is the Mersenne Twister, a pseudorandom generator. So, for instance, copularnd uses 'randn'/'rng', which is based on pseudorandom numbers.
Is there a way to incorporate the quasi-random generators into the rand or rng functions embedded in other code (e.g. copularnd)? Any pointers would be much appreciated. Note: 'copularnd' calls 'mvnrnd', which in turn uses 'randn' and then pulls 'rng'...
First you need to initialize the haltonset using the leap, skip, and scramble properties.
You can check the documentation, but the short description is as follows:
Scramble - is used for shuffling the points
Skip - helps to exclude a range of points from the set
Leap - is the size of jump from the current selected point to the next one. The points in between are ignored.
Now you can build a haltonset object:
p = haltonset(2,'Skip',1e2,'Leap',1e1);
p = scramble(p,'RR2');
This makes a 2D Halton point set by skipping the first 100 numbers and then leaping over 10 numbers at a time. The scramble method 'RR2' is applied in the second line. You can see that many points are generated:
p =
Halton point set in 2 dimensions (818836295885536 points)
Properties:
Skip : 100
Leap : 10
ScrambleMethod : RR2
When you have your haltonset object, p, you can access the values by just selecting them:
x = p(1:10,:)
Note: you need to create the object first and then use the generated points. To get different results, you can play with the Leap and Scramble properties of the point set. Another thing you can do is use a uniform random source such as randi to select numbers from the generated points each time; that ensures you access uniformly random parts of the point set on every draw.
For instance, you can generate a random index vector (4 points in this example) and then use it to select points from the Halton set.
>> idx = randi(size(p,1),1,4)
idx =
1.0e+14 *
3.1243 6.2683 6.5114 1.5302
>> p(idx,:)
ans =
0.5723 0.2129
0.8918 0.6338
0.9650 0.1549
0.8020 0.3532
'qrandstream' may be the answer I am looking for, with 'qrand' instead of 'rand'. E.g., from the Matlab doc:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p);
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
    X = qrand(q,sampSize);
    [h,pval] = kstest(X,[X,X]);
    PVALS(test) = pval;
end
I will post my solution once I am done :)
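One way to make that substitution concrete (a sketch of the general idea only, not a drop-in patch for the randn call inside copularnd): draw quasi-random uniforms with qrand and push them through the inverse normal CDF to obtain quasi-random normal deviates:
p = haltonset(2,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p);
u = qrand(q,500); % 500x2 quasi-random uniforms on (0,1)
z = norminv(u);   % quasi-random standard normal deviates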

ARMAX FitPercent

I am using an ARMAX model to describe the relationship between two signals, and I have used the Matlab armax function with different model orders.
To evaluate the efficiency of the model I pulled the value from Report.Fit.FitPercent, expecting it to tell how well the model fits the experimental data. As it is a fit percent, I would expect it to lie between 0 and 100%; my results range from about -257 to 99.99.
I couldn't find on the MathWorks or other websites how this value is calculated and how to interpret it. It would be great if you could explain how to understand the FitPercent value.
The code I used is very simple; it generates the FitPercent for different model structures (orders):
opt = armaxOptions;
opt.InitialCondition = 'auto';
opt.Focus = 'simulation';
j = 1;  % number of the dataset used for analysis
i = 1;
nk = 0;
for na = 1:6
    for nb = 1:6
        for nc = 1:6
            m_armax = armax(data(:,:,:,j), [na nb nc nk], opt);
            fitPercent(i) = m_armax.Report.Fit.FitPercent;
            orders(:,i) = [na; nb; nc];  % model orders tried on run i (avoid the names 'fit' and 'struct', which shadow built-ins)
            i = i + 1;
        end
    end
end
The documentation states that the fit percent value is calculated with the compare function:
http://www.mathworks.de/de/help/ident/ref/compare.html?searchHighlight=fit
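In short, compare reports the normalized root mean squared error (NRMSE) expressed as a percentage, which also explains the negative values: a model that predicts worse than the plain mean of the measured output gets a negative fit. A minimal sketch of the formula, assuming y is the measured output and yhat the simulated model output:
% FitPercent as reported by compare (NRMSE fitness measure)
fitPercent = 100 * (1 - norm(y - yhat) / norm(y - mean(y)));
% 100 = perfect fit, 0 = no better than mean(y), negative = worse than mean(y)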

MatLab BayesNetToolbox parameter learning

My question is specific to the "learn_params()" function of the BayesNetToolbox in Matlab. In the user manual, "learn_params()" is stated to be suitable only if the input data is fully observed. I have tried it with a partially observed dataset, where I represented unobserved values as NaNs.
It seems that "learn_params()" can deal with NaNs and with node state combinations that do not occur in the dataset. When I apply Dirichlet priors to smooth the zero values, I get 'sensible' MLE distributions for all nodes. I have copied the script where I do this.
Can someone clarify whether what I am doing makes sense, or whether I am missing something, i.e. the reason why "learn_params()" cannot be used with partially observed data?
The Matlab script where I test this is here:
% Incomplete dataset (where NaN's are unobserved)
Age = [1,2,2,NaN,3,3,2,1,NaN,2,1,1,3,NaN,2,2,1,NaN,3,1];
TNMStage = [2,4,2,3,NaN,1,NaN,3,1,4,3,NaN,2,4,3,4,1,NaN,2,4];
Treatment = [2,3,3,NaN,2,NaN,4,4,3,3,NaN,2,NaN,NaN,4,2,NaN,3,NaN,4];
Survival = [1,2,1,2,2,1,1,1,1,2,2,1,2,2,1,2,1,2,2,1];
matrixdata = [Age;TNMStage;Treatment;Survival];
node_sizes =[3,4,4,2];
% Enter the variablesmap
keys = {'Age', 'TNM','Treatment', 'Survival'};
v= 1:1:length(keys);
VariablesMap = containers.Map(keys,v);
% create the dag and the bnet
N = length(node_sizes); % Instead of entering it manually
dag2 = zeros(N,N);
dag2(VariablesMap('Treatment'),VariablesMap('Survival')) = 1;
bnet21 = mk_bnet(dag2, node_sizes);
draw_graph(bnet21.dag);
dirichletweight=1;
% define the CPD priors you want to use
bnet21.CPD{VariablesMap('Age')} = tabular_CPD(bnet21, VariablesMap('Age'), 'prior_type', 'dirichlet', 'dirichlet_type', 'unif', 'dirichlet_weight', dirichletweight);
bnet21.CPD{VariablesMap('TNM')} = tabular_CPD(bnet21, VariablesMap('TNM'), 'prior_type', 'dirichlet', 'dirichlet_type', 'unif', 'dirichlet_weight', dirichletweight);
bnet21.CPD{VariablesMap('Treatment')} = tabular_CPD(bnet21, VariablesMap('Treatment'), 'prior_type', 'dirichlet', 'dirichlet_type', 'unif', 'dirichlet_weight', dirichletweight);
bnet21.CPD{VariablesMap('Survival')} = tabular_CPD(bnet21, VariablesMap('Survival'), 'prior_type', 'dirichlet', 'dirichlet_type', 'unif', 'dirichlet_weight', dirichletweight);
% Find MLEs from incomplete data with Dirichlet prior CPDs
bnet24 = learn_params(bnet21, matrixdata);
% Look at the new CPT values after parameter estimation has been carried out
CPT24 = cell(1,N);
for i=1:N
s=struct(bnet24.CPD{i}); % violate object privacy
CPT24{i}=s.CPT;
end
According to my understanding of the BNT documentation, you need to make a couple of changes:
Missing values should be represented as empty cells instead of NaN values.
The learn_params_em function is the only one that supports missing values.
My previous response was incorrect; I had misremembered which of the BNT learning functions support missing values.
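Putting those two changes together, here is a minimal sketch based on the script above (assuming the bnet21 network and matrixdata from the question; note that BNT's learn_params_em takes an inference engine rather than the bnet itself):
data = num2cell(matrixdata);        % nodes x cases cell array
data(isnan(matrixdata)) = {[]};     % unobserved entries become empty cells
engine = jtree_inf_engine(bnet21);  % EM needs an inference engine
maxIter = 10;
[bnetEM, LLtrace] = learn_params_em(engine, data, maxIter);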