Why won't my Decision Tree classifier work? The functions say "not enough input arguments" - MATLAB

I have coded a Decision Tree classifier in MATLAB. To the best of my knowledge everything should work; the logic checks out. When I try to call the fit method, it breaks on one of my functions, telling me I don't have the right input arguments, but I'm sure I do. I've been trying to solve this and similar errors to do with functions and input arguments for a day or two now. I wondered if it had something to do with calling them from within the constructor, but calling them from the main script still doesn't work. Please help!
classdef my_ClassificationTree < handle
properties
X % training examples
Y % training labels
MinParentSize % minimum parent node size
MaxNumSplits % maximum number of splits
Verbose % are we printing out debug as we go?
% MinLeafSize
CutPoint
CutPredictorIndex
Children
numSplits
root
end
methods
% constructor: implementing the fitting phase
function obj = my_ClassificationTree(X, Y, MinParentSize, MaxNumSplits, Verbose)
obj.X = X;
obj.Y = Y;
obj.MinParentSize = MinParentSize;
obj.MaxNumSplits = MaxNumSplits;
obj.Verbose = Verbose;
% obj.Children = zeros(1, 2);
% obj.CutPoint = 0;
% obj.CutPredictorIndex = 0;
% obj.MinLeafSize = MinLeafSize;
obj.numSplits = 0;
obj.root = Node(1, size(obj.X,1));
root = Node(1, size(obj.X,1));
fit(obj,root);
end
function node = Node(sIndex,eIndex)
node.startIndex = sIndex;
node.endIndex = eIndex;
node.leaf = false;
node.Children = 0;
node.size = eIndex - sIndex + 1;
node.CutPoint = 0;
node.CutPredictorIndex = 0;
node.NodeClass = 0;
end
function fit(obj,node)
if node.size < obj.MinParentSize || obj.numSplits >= obj.MaxNumSplits
% Mark the node as a leaf node
node.Leaf = true;
% Calculate the majority class label for the examples at this node
labels = obj.Y(node.startIndex:node.endIndex); %gather all the labels for the data in the nodes range
node.NodeClass = mode(labels); %find the most frequent label and classify the node as such
return;
end
bestCutPoint = findBestCutPoint(node, obj.X, obj.Y);
leftChild = Node(node.startIndex, bestCutPoint.CutIndex - 1);
rightChild = Node(bestSplit.splitIndex, node.endIndex);
obj.numSplits = obj.numSplits + 1;
node.CutPoint = bestSplit.CutPoint;
node.CutPredictorIndex = bestSplit.CutPredictorIndex;
%Attach the child nodes to the parent node
node.Children = [leftChild, rightChild];
% Recursively build the tree for the left and right child nodes
fit(obj, leftChild);
fit(obj, rightChild);
end
function bestCutPoint = findBestCutPoint(node, X, labels)
bestCutPoint.CutPoint = 0;
bestCutPoint.CutPredictorIndex = 0;
bestCutPoint.CutIndex = 0;
bestGDI = Inf; % Initialize the best GDI to a large value
% Loop through all the features
for i = 1:size(X, 2)
% Loop through all the unique values of the feature
values = unique(X(node.startIndex:node.endIndex, i));
for j = 1:length(values)
% Calculate the weighted impurity of the two resulting
% cut
leftLabels = labels(node.startIndex:node.endIndex, 1);
rightLabels = labels(node.startIndex:node.endIndex, 1);
leftLabels = leftLabels(X(node.startIndex:node.endIndex, i) < values(j));
rightLabels = rightLabels(X(node.startIndex:node.endIndex, i) >= values(j));
leftGDI = weightedGDI(leftLabels, labels);
rightGDI = weightedGDI(rightLabels, labels);
% Calculate the weighted impurity of the split
cutGDI = leftGDI + rightGDI;
% Update the best split if the current split has a lower GDI
if cutGDI < bestGDI
bestGDI = cutGDI;
bestCutPoint.CutPoint = values(j);
bestCutPoint.CutPredictorIndex = i;
bestCutPoint.CutIndex = find(X(:, i) == values(j), 1, 'first');
end
end
end
end
% the prediction phase:
function predictions = predict(obj, test_examples)
% get ready to store our predicted class labels:
predictions = categorical;
% Iterate over each example in X
for i = 1:size(test_examples, 1)
% Set the current node to be the root node
currentNode = obj.root;
% While the current node is not a leaf node
while ~currentNode.leaf
% Check the value of the predictor feature specified by the CutPredictorIndex property of the current node
value = test_examples(i, currentNode.CutPredictorIndex);
% If the value is less than the CutPoint of the current node, set the current node to be the left child of the current node
if value < currentNode.CutPoint
currentNode = currentNode.Children(1);
% If the value is greater than or equal to the CutPoint of the current node, set the current node to be the right child of the current node
else
currentNode = currentNode.Children(2);
end
end
% Once the current node is a leaf node, add the NodeClass of the current node to the predictions vector
predictions(i) = currentNode.NodeClass;
end
end
% add any other methods you want on the lines below...
end
end
This is the function that calls my_ClassificationTree:
function m = my_fitctree(train_examples, train_labels, varargin)
% take an extra name-value pair allowing us to turn debug on:
p = inputParser;
addParameter(p, 'Verbose', false);
%addParameter(p, 'MinLeafSize', false);
% take an extra name-value pair allowing us to set the minimum
% parent size (10 by default):
addParameter(p, 'MinParentSize', 10);
% take an extra name-value pair allowing us to set the maximum
% number of splits (number of training examples-1 by default):
addParameter(p, 'MaxNumSplits', size(train_examples,1) - 1);
p.parse(varargin{:});
% use the supplied parameters to create a new my_ClassificationTree
% object:
m = my_ClassificationTree(train_examples, train_labels, ...
p.Results.MinParentSize, p.Results.MaxNumSplits, p.Results.Verbose);
end
This is my call from the main block of code:
mym2_dt = my_fitctree(train_examples, train_labels, 'MinParentSize', 10)
These are the errors I get. I'm expecting it to build and fill a decision tree; however, it breaks on the findBestCutPoint function and I cannot fix it.

The first argument of class methods (except the constructor) should be an instance of the class (i.e. obj). Your definitions of Node and findBestCutPoint should have obj as the first argument.
Moreover, calls to class methods from within other methods should use the syntax obj.theMethod(...), which is not the case in your code.
So, for instance, the call to Node should be:
obj.root = obj.Node(1, size(obj.X,1));
and Node should be defined as follows:
function node = Node(obj,sIndex,eIndex)
The same applies to findBestCutPoint. Note that, in such calls, the reference to the class instance is passed implicitly, so you don't need to include it explicitly in the argument list.
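For illustration, here is a minimal, self-contained sketch of the pattern using a hypothetical DemoTree class (not the full tree from the question): the instance is the first parameter of every non-constructor method, and internal calls go through obj.
% Save as DemoTree.m. Non-constructor methods receive the instance (obj)
% as their first argument; MATLAB passes it implicitly when you write
% obj.methodName(...).
classdef DemoTree < handle
    properties
        X
        root
    end
    methods
        function obj = DemoTree(X)
            obj.X = X;
            obj.root = obj.Node(1, size(X, 1));   % not Node(1, size(X, 1))
        end
        function node = Node(obj, sIndex, eIndex) % obj declared explicitly
            node.startIndex = sIndex;
            node.endIndex   = eIndex;
            node.size       = eIndex - sIndex + 1;
        end
    end
end
Calling t = DemoTree(rand(10, 3)) and then t.root.size returns 10. The same change applies to findBestCutPoint: declare it as findBestCutPoint(obj, node, X, labels) and call it as obj.findBestCutPoint(node, obj.X, obj.Y).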

Related

Custom learner function for Adaboost

I am using Adaboost to fit a classification problem. We can do the following:
ens = fitensemble(X, Y, 'AdaBoostM1', 100, 'Tree')
Now 'Tree' is the learner and we can change this to 'Discriminant' or 'KNN'. Each learner uses a certain Template Object Creation Function. More info here.
Is it possible to create your own function and use it as a learner? And how?
I opened templateTree.m and templateKNN.m to see how MATLAB defines Template Object Creation Functions.
function temp = templateKNN(varargin)
classreg.learning.FitTemplate.catchType(varargin{:});
temp = classreg.learning.FitTemplate.make('KNN','type','classification',varargin{:});
end
and
function temp = templateTree(varargin)
temp = classreg.learning.FitTemplate.make('Tree',varargin{:});
end
It shows that MATLAB has a function in FitTemplate named make; if you open this m-file, you will see:
function temp = make(method,varargin)
% Check the type of the required argument
if ~ischar(method)
error(message('stats:classreg:learning:FitTemplate:make:BadArgs'));
end
% Extract type (classification or regression)
args = {'type'};
defs = { ''};
[usertype,~,modelArgs] = ...
internal.stats.parseArgs(args,defs,varargin{:});
% Check usertype
if ~isempty(usertype)
usertype = gettype(usertype);
end
% Method
namesclass = classreg.learning.classificationModels();
namesreg = classreg.learning.regressionModels();
[tfclass,locclass] = ismember(lower(method),lower(namesclass));
[tfreg,locreg] = ismember(lower(method),lower(namesreg));
if ~tfclass && ~tfreg
error(message('stats:classreg:learning:FitTemplate:make:UnknownMethod', method));
end
if tfclass && tfreg
method = namesclass{locclass}; % can get it from namesreg too
type = usertype;
% If type is not passed for an ensemble method, try to
% figure it out from learner types. This is useful for
% users who want to type
% fitensemble(X,Y,'Subspace',100,'Discriminant')
% instead of
% fitensemble(X,Y,'Subspace',100,'Discriminant','type','classification')
if isempty(type) && ismember(method,classreg.learning.ensembleModels())
[learners,~,~] = internal.stats.parseArgs({'learners'},{},modelArgs{:});
if ischar(learners) || isa(learners,'classreg.learning.FitTemplate')
learners = {learners};
elseif ~iscell(learners)
error(message('stats:classreg:learning:FitTemplate:make:BadLearnerTemplates'));
end
L = numel(learners);
% The user can pass several learner templates, and some
% of these learners may be appropriate for
% classification, some for regression, and some for
% both. The ensemble type cannot be determined
% unambiguously unless if all learners are appropriate
% for one type of learning *only*. For example, in 12a
% t1 = ClassificationDiscriminant.template
% t2 = ClassificationKNN.template
% fitensemble(X,Y,'Subspace',10,{t1 t2})
% is going to work because both discriminant and k-NN
% can be used for classification only. If you want to
% mix discriminant and tree, you have to specify the
% ensemble type explicitly:
% t1 = ClassificationDiscriminant.template
% t2 = ClassificationTree.template
% fitensemble(X,Y,'Bag',10,{t1 t2},'type','classification')
types = zeros(L,1); % -1 for regression and 1 for classification
for l=1:L
meth = learners{l};
if isa(meth,'classreg.learning.FitTemplate')
meth = meth.Method;
end
isc = ismember(lower(meth),lower(namesclass));
isr = ismember(lower(meth),lower(namesreg));
if ~isc && ~isr
error(message('stats:classreg:learning:FitTemplate:make:UnknownMethod', meth));
end
types(l) = isc - isr;
end
if all(types==1)
type = 'classification';
elseif all(types==-1)
type = 'regression';
end
end
elseif tfclass
method = namesclass{locclass};
type = 'classification';
else
method = namesreg{locreg};
type = 'regression';
end
% Make sure the type is consistent
if ~isempty(usertype) && ~strcmp(usertype,type)
error(message('stats:classreg:learning:FitTemplate:make:UserTypeMismatch', method, usertype));
end
% Make template
temp = classreg.learning.FitTemplate(method,modelArgs);
temp = fillIfNeeded(temp,type);
end
You must change this function.
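For context, a template object produced by one of these functions is passed to fitensemble in place of the learner name. A minimal sketch (X and Y are assumed to be your training data):
t = templateTree();                            % default tree learner template
ens = fitensemble(X, Y, 'AdaBoostM1', 100, t); % equivalent to passing 'Tree'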

AABB Intersections with Space Partitions: Sample Code Performance and Reliability

I originally wrote the following MATLAB code to find intersections between a set of Axis-Aligned Bounding Boxes (AABB) and space partitions (here 8 partitions). I believe it is readable by itself; moreover, I have added some comments for even more clarity.
function [A,B] = AABBPart(bbx,it) % bbx: aabb, it: iteration
global F
IT = it+1;
n = size(bbx,1);
F = cell(n,it);
A = Part([min(bbx(:,1:3)),max(bbx(:,4:6))],it,0); % recursive partitioning
B = F; % matlab does not allow
function s = Part(bx,it,J) % output to be global
s = {};
if it < 1; return; end
s = cell(8,1);
p = bx(1:3);
q = bx(4:6);
h = 0.5*(p+q);
prt = [p,h;... % 8 sub-parts (octa)
h(1),p(2:3),q(1),h(2:3);...
p(1),h(2),p(3),h(1),q(2),h(3);...
h(1:2),p(3),q(1:2),h(3);...
p(1:2),h(1),h(1:2),q(3);...
h(1),p(2),h(3),q(1),h(2),q(3);...
p(1),h(2:3),h(1),q(2:3);...
h,q];
for j=1:8 % check for each sub-part
k = 0;
t = zeros(0,1);
for i=1:n
if all(bbx(i,1:3) <= prt(j,4:6)) && ... % intersection test for
all(prt(j,1:3) <= bbx(i,4:6)) % every aabb and sub-parts
k = k+1;
t(k) = i;
end
end
if ~isempty(t)
s{j,1} = [t; Part(prt(j,:),it-1,j)]; % recursive call
for i=1:numel(t) % collecting the results
if isempty(F{t(i),IT-it})
F{t(i),IT-it} = [-J,j];
else
F{t(i),IT-it} = [F{t(i),IT-it}; [-J,j]];
end
end
end
end
end
end
Concerns:
In my tests, it seems that a few intersections are probably missing, say, 10 or so out of 1000 or more. I would be glad if you could help find any problematic parts in the code.
I am also concerned about using global F; I would prefer to get rid of it (see the sketch after the setup code below).
Any other solution that improves speed would also be appreciated.
Note that the code is complete, and you can easily try it with the following setup:
n = 10000; % in the original application, n would be millions
bbx = rand(n,6);
it = 3;
[A,B] = AABBPart(bbx,it);
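On the global F concern: since Part is already nested inside AABBPart, one option might be to drop the global declaration and rely on the fact that nested functions share the enclosing function's workspace. A minimal, self-contained sketch of that pattern (hypothetical names, not the full partitioner):
function [A, B] = partDemo(n)
% F lives in partDemo's workspace; the nested function below reads and
% writes it directly, so no "global F" declaration is needed.
    F = cell(n, 1);
    A = fillParts(n);      % the nested call updates F as a side effect
    B = F;                 % F now holds the collected results
    function s = fillParts(k)
        s = k;
        for i = 1:k
            F{i} = i^2;    % assignment goes to the shared variable F
        end
    end
end
Calling [A, B] = partDemo(4) returns A = 4 and B = {1; 4; 9; 16}, showing that the nested writes are visible in the parent without any global.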

Solving a PDE with MATLAB

sol = pdepe(m,@ParticleDiffusionpde,@ParticleDiffusionic,@ParticleDiffusionbc,x,t);
% Extract the first solution component as u.
u = sol(:,:,:);
function [c,f,s] = ParticleDiffusionpde(x,t,u,DuDx)
global Ds
c = 1/Ds;
f = DuDx;
s = 0;
function u0 = ParticleDiffusionic(x)
global qo
u0 = qo;
function [pl,ql,pr,qr] = ParticleDiffusionbc(xl,ul,xr,ur,t,x)
global Ds K n
global Amo Gc kf rhop
global uavg
global dr R nr
sum = 0;
for i = 1:1:nr-1
r1 = (i-1)*dr; % radius at i
r2 = i * dr; % radius at i+1
r1 = double(r1); % convert to double precision
r2 = double(r2);
sum = sum + (dr / 2 * (r1*ul+ r2*ur));
end;
uavg = 3/R^3 * sum;
ql = 1;
pl = 0;
qr = 1;
pr = -((kf/(Ds.*rhop)).*(Amo - Gc.*uavg - ((double(ur/K)).^2).^(n/2) ));
dq(r,t)/dt = Ds( d2q(r,t)/dr2 + (2/r)*dq(r,t)/dr )
q(r, t=0) = 0
dq(r=0, t)/dr = 0
dq(r=dp/2, t)/dr = (kf/Ds*rhop) [C(t) - Cp(at r = dp/2)]
q = solid phase concentration of trace compound in a particle with radius dp/2
C = bulk liquid concentration of trace compound
Cp = trace compound concentration at particle surface
I want to solve the above PDE with the given initial and boundary conditions. I tried MATLAB's pdepe, but it does not work satisfactorily. Maybe the boundary conditions are creating the problem for me. I also used this isotherm equation for equilibrium: q = K*Cp^(1/n). This is a convection-diffusion equation, but I could not find any write-ups that address solving this type of equation properly.
There are two problems with the current implementation.
Incorrect Source Term
The PDE you are attempting to solve has the form
(1/Ds)*dq(r,t)/dt = d2q(r,t)/dr2 + (2/r)*dq(r,t)/dr
which has the equivalent form
(1/Ds)*dq(r,t)/dt = (1/r)*d/dr( r*dq(r,t)/dr ) + (1/r)*dq(r,t)/dr
where the last term arises due to the factor of 2 in the original PDE.
The last term needs to be incorporated into pdepe via a source term.
Calculation of q average
The current implementation attempts to calculate the average value of q using the left and right values of q passed to the boundary condition function.
This is incorrect.
The average value of q needs to be calculated from a vector of up-to-date values of the quantity.
However, there is a complication: the only function that receives all mesh values is ParticleDiffusionpde, and the mesh values passed to that function are not guaranteed to be from the mesh we provided.
Solution: use events (as described in the pdepe documentation).
This is a hack since the event function is meant to detect zero-crossings, but it has the advantage that the function is given all values of q on the mesh we provide.
So, the working example below (you'll notice I set all of the parameters to 1 since I didn't know better) uses the events function to update a variable qStore that can be accessed by the boundary condition function (see here for an explanation), and the boundary condition function performs a vectorized trapezoidal integration for the average calculation.
Working Example
function [] = ParticleDiffusion()
% Parameters
Ds = 1;
q0 = 0;
K = 1;
n = 1;
Amo = 1;
Gc = 1;
kf = 1;
rhop = 1;
% Space
rMesh = linspace(0,1,10);
rMesh = rMesh(:) ;
dr = rMesh(2) - rMesh(1) ;
% Time
tSpan = linspace(0,1,10);
% Vector to store current u-value
qStore = zeros(size(rMesh));
options.Events = @(m,t,x,y) events(m,t,x,y);
% Solve
[sol,~,~,~,~] = pdepe(1,@ParticleDiffusionpde,@ParticleDiffusionic,@ParticleDiffusionbc,rMesh,tSpan,options);
% Use the events function to update qStore
function [value,isterminal,direction] = events(m,~,~,y)
qStore = y; % Value of q on rMesh
value = m; % Since m is constant, it will never be zero (no event detection)
isterminal = 0; % Continue integration
direction = 0; % Detect all zero crossings (not important)
end
function [c,f,s] = ParticleDiffusionpde(r,~,~,DqDr)
% Define the capacity, flux, and source
c = 1/Ds;
f = DqDr;
s = DqDr./r;
end
function u0 = ParticleDiffusionic(~)
u0 = q0;
end
function [pl,ql,pr,qr] = ParticleDiffusionbc(~,~,R,ur,~)
% Calculate average value of current solution
qL = qStore(1:end-1);
qR = qStore(2: end );
total = sum((qL.*rMesh(1:end-1) + qR.*rMesh(2:end))) * dr/2;
qavg = 3/R^3 * total;
% Left boundary
pl = 0;
ql = 1;
% Right boundary
qr = 1;
pr = -(kf/(Ds.*rhop)).*(Amo - Gc.*qavg - (ur/K).^n);
end
end

MeanShift Clustering on a dataset

I have a numeric dataset and I want to cluster the data with a non-parametric algorithm. Basically, I would like to cluster without specifying the number of clusters as an input. I am using this code, which I found on the MathWorks File Exchange, implementing the Mean Shift algorithm. However, I don't know how to adapt my data to this code, as my dataset has dimensions 516 x 19.
function [clustCent,data2cluster,cluster2dataCell] =MeanShiftCluster(dataPts,bandWidth,plotFlag)
%UNTITLED2 Summary of this function goes here
% Detailed explanation goes here
%perform MeanShift Clustering of data using a flat kernel
%
% ---INPUT---
% dataPts - input data, (numDim x numPts)
% bandWidth - is bandwidth parameter (scalar)
% plotFlag - display output if 2 or 3 D (logical)
% ---OUTPUT---
% clustCent - is locations of cluster centers (numDim x numClust)
% data2cluster - for every data point which cluster it belongs to (numPts)
% cluster2dataCell - for every cluster which points are in it (numClust)
%
% Bryan Feldman 02/24/06
% MeanShift first appears in
% K. Fukunaga and L.D. Hostetler, "The Estimation of the Gradient of a
% Density Function, with Applications in Pattern Recognition"
%*** Check input ****
if nargin < 2
error('no bandwidth specified')
end
if nargin < 3
plotFlag = true;
plotFlag = false;
end
%**** Initialize stuff ***
%[numPts,numDim] = size(dataPts);
[numDim,numPts] = size(dataPts);
numClust = 0;
bandSq = bandWidth^2;
initPtInds = 1:numPts
maxPos = max(dataPts,[],2); %biggest size in each dimension
minPos = min(dataPts,[],2); %smallest size in each dimension
boundBox = maxPos-minPos; %bounding box size
sizeSpace = norm(boundBox); %indicator of size of data space
stopThresh = 1e-3*bandWidth; %when mean has converged
clustCent = []; %center of clust
beenVisitedFlag = zeros(1,numPts,'uint8'); %track if a points been seen already
numInitPts = numPts %number of points to posibaly use as initilization points
clusterVotes = zeros(1,numPts,'uint16'); %used to resolve conflicts on cluster membership
while numInitPts
tempInd = ceil( (numInitPts-1e-6)*rand) %pick a random seed point
stInd = initPtInds(tempInd) %use this point as start of mean
myMean = dataPts(:,stInd); % intilize mean to this points location
myMembers = []; % points that will get added to this cluster
thisClusterVotes = zeros(1,numPts,'uint16'); %used to resolve conflicts on cluster membership
while 1 %loop untill convergence
sqDistToAll = sum((repmat(myMean,1,numPts) - dataPts).^2); %dist squared from mean to all points still active
inInds = find(sqDistToAll < bandSq); %points within bandWidth
thisClusterVotes(inInds) = thisClusterVotes(inInds)+1; %add a vote for all the in points belonging to this cluster
myOldMean = myMean; %save the old mean
myMean = mean(dataPts(:,inInds),2); %compute the new mean
myMembers = [myMembers inInds]; %add any point within bandWidth to the cluster
beenVisitedFlag(myMembers) = 1; %mark that these points have been visited
%*** plot stuff ****
if plotFlag
figure(12345),clf,hold on
if numDim == 2
plot(dataPts(1,:),dataPts(2,:),'.')
plot(dataPts(1,myMembers),dataPts(2,myMembers),'ys')
plot(myMean(1),myMean(2),'go')
plot(myOldMean(1),myOldMean(2),'rd')
pause
end
end
%**** if mean doesnt move much stop this cluster ***
if norm(myMean-myOldMean) < stopThresh
%check for merge posibilities
mergeWith = 0;
for cN = 1:numClust
distToOther = norm(myMean-clustCent(:,cN)); %distance from posible new clust max to old clust max
if distToOther < bandWidth/2 %if its within bandwidth/2 merge new and old
mergeWith = cN;
break;
end
end
if mergeWith > 0 % something to merge
clustCent(:,mergeWith) = 0.5*(myMean+clustCent(:,mergeWith)); %record the max as the mean of the two merged (I know biased twoards new ones)
%clustMembsCell{mergeWith} = unique([clustMembsCell{mergeWith} myMembers]); %record which points inside
clusterVotes(mergeWith,:) = clusterVotes(mergeWith,:) + thisClusterVotes; %add these votes to the merged cluster
else %its a new cluster
numClust = numClust+1 %increment clusters
clustCent(:,numClust) = myMean; %record the mean
%clustMembsCell{numClust} = myMembers; %store my members
clusterVotes(numClust,:) = thisClusterVotes;
end
break;
end
end
initPtInds = find(beenVisitedFlag == 0); %we can initialize with any of the points not yet visited
numInitPts = length(initPtInds); %number of active points in set
end
[val,data2cluster] = max(clusterVotes,[],1); %a point belongs to the cluster with the most votes
%*** If they want the cluster2data cell find it for them
if nargout > 2
cluster2dataCell = cell(numClust,1);
for cN = 1:numClust
myMembers = find(data2cluster == cN);
cluster2dataCell{cN} = myMembers;
end
end
This is the test code I am using to try and get the Mean Shift program to work:
clear
profile on
nPtsPerClust = 250;
nClust = 3;
totalNumPts = nPtsPerClust*nClust;
m(:,1) = [1 1];
m(:,2) = [-1 -1];
m(:,3) = [1 -1];
var = .6;
bandwidth = .75;
clustMed = [];
%clustCent;
x = var*randn(2,nPtsPerClust*nClust);
%*** build the point set
for i = 1:nClust
x(:,1+(i-1)*nPtsPerClust:(i)*nPtsPerClust) = x(:,1+(i-1)*nPtsPerClust:(i)*nPtsPerClust) + repmat(m(:,i),1,nPtsPerClust);
end
tic
[clustCent,point2cluster,clustMembsCell] = MeanShiftCluster(x,bandwidth);
toc
numClust = length(clustMembsCell)
figure(10),clf,hold on
cVec = 'bgrcmykbgrcmykbgrcmykbgrcmyk';%, cVec = [cVec cVec];
for k = 1:min(numClust,length(cVec))
myMembers = clustMembsCell{k};
myClustCen = clustCent(:,k);
plot(x(1,myMembers),x(2,myMembers),[cVec(k) '.'])
plot(myClustCen(1),myClustCen(2),'o','MarkerEdgeColor','k','MarkerFaceColor',cVec(k), 'MarkerSize',10)
end
title(['no shifting, numClust:' int2str(numClust)])
The test script generates random data x. In my case, I want to use the matrix D of size 516 x 19, but I am not sure how to adapt my data to this function. The function is returning results that do not agree with my understanding of the algorithm.
Does anyone know how to do this?
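Assuming D is the 516 x 19 matrix described above, with rows as observations and columns as features, one way to adapt it is to transpose it so it matches the documented numDim x numPts layout. A minimal sketch (the bandwidth value is a placeholder that needs tuning to the scale of your data):
dataPts = D';       % 19 x 516: MeanShiftCluster expects numDim x numPts
bandWidth = 2;      % placeholder; pick it relative to the spread of your features
[clustCent, data2cluster, cluster2dataCell] = MeanShiftCluster(dataPts, bandWidth, false);
numClust = size(clustCent, 2);    % number of clusters found
If the 19 features are on very different scales, normalizing each column of D first (for example with zscore) tends to make a single scalar bandwidth more meaningful.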

Converting MATLAB to Octave: is there a containers.Map equivalent?

I am trying to convert some MATLAB code from the Maia package into something that will work with Octave. I am currently stuck because one of the files has several calls to containers.Map, which is apparently something that has not yet been implemented in Octave. Does anyone have ideas for achieving similar functionality without doing a whole lot of extra work in Octave? Thanks for your time.
function [adj_direct contig_direct overlap names longest_path_direct...
weigth_direct deltafiles deltafiles_ref ReferenceAlignment ...
contig_ref overlap_ref name_hash_ref] = ...
assembly_driver(assemblies,ref_genome,target_chromosome, ...
deltafiles_ref,contig_ref, overlap_ref, ...
name_hash_ref, varargin)
% ASSEMBLY_DRIVER Combines contig sets into one assembled chromosome
%
% INPUT
% assemblies
% ref_chromosome
% Startnode_name
% Endnode_name
% OPTIONAL DEFAULT
% 'z_weigths' [.25 .25 .25 .25]
% 'clipping_thrs' 10
% 'ref_distance' -10
% 'ref_quality' 1E-5
% 'max_chromosome_dist' 100
% 'quit_treshold' 15
% 'tabu_time' 3
% 'minimum_improvement' -inf
% 'ref_node_assemblies' all assemblies (slow)
% 'endextend' true
%
%
% SET DEFAULTS
% General parameters
z_weights = [.25 .25 .25 .25];
clipping_thrs = 10;
mapfilter = '-rq';
alignlen = 75;
ident = 85;
% Reference nod parameters
ref_distance = -10;
ref_quality = 1E-5;
max_chromosome_dist = 100;
% TABU parameters
quit_treshold = 15;
tabu_time = 3;
minimum_improvement = -inf;
ref_node_assemblies = assemblies;
% Extending the assembly outwards from the start and en node
endextend = true;
AllowReverse = true;
% If no start and end node are given, they will be determined from tiling
Startnode_name = '';
Endnode_name = '';
containment_edge = true;
ref_first = true;
% If contigs have already been aligned to the reference, give the
% deltafile
ReferenceAlignment = 'NotYetDoneByMaia';
% Get VARARGIN user input
if length(varargin) > 0
while 1
switch varargin{1}
case 'Startnode_name'
Startnode_name = varargin{2};
case 'Endnode_name'
Endnode_name = varargin{2};
case 'z_weigths'
z_weights = varargin{2};
case 'clipping_thrs'
clipping_thrs = varargin{2};
case 'ref_distance'
ref_distance = varargin{2};
case 'ref_quality'
ref_quality = varargin{2};
case 'max_chromosome_dist'
max_chromosome_dist = varargin{2};
case 'quit_treshold'
quit_treshold = varargin{2};
case 'tabu_time'
tabu_time = varargin{2};
case 'minimum_improvement'
minimum_improvement = varargin{2};
case 'ref_node_assemblies'
ref_node_assemblies = assemblies(varargin{2},:);
case 'extend_ends'
endextend = assemblies(varargin{2},:);
case 'AllowReverse'
AllowReverse = varargin{2};
case 'ReferenceAlignment'
ReferenceAlignment = varargin{2};
case 'containment_edge'
containment_edge = varargin{2};
case 'ref_first'
ref_first = varargin{2};
case 'mapfilter'
mapfilter = varargin{2};
case 'alignlen'
alignlen = varargin{2};
case 'ident'
ident = varargin{2};
otherwise
error(['Input ' varargin{2} ' is not known']);
end
if length(varargin) > 2
varargin = varargin(3:end);
else
break;
end
end
end
% Read input assemblies
assembly_names = assemblies(:,1);
assembly_locs = assemblies(:,2);
assembly_quality = containers.Map(assemblies(:,1),assemblies(:,3));
assembly_quality('reference') = ref_quality;
% Read input assemblies for creation of reference nodes
ref_node_assembly_names = ref_node_assemblies(:,1);
ref_node_assembly_locs = ref_node_assemblies(:,2);
ref_node_assembly_quality = containers.Map(ref_node_assemblies(:,1),ref_node_assemblies(:,3));
ref_node_assembly_quality('reference') = ref_quality;
% If there is only one assembly there is nothing to align
if size(assemblies,1) >= 2
% Align assemblies against each other
assembly_pairs = {};
coordsfiles = [];
deltafiles = [];
for i = 1:length(assembly_locs)-1
for j = i+1:length(assembly_locs)
[coordsfile,deltafile] = align_assemblies({assembly_locs{i},assembly_locs{j}},{assembly_names{i}, assembly_names{j}}, ...
mapfilter, alignlen, ident);
coordsfiles = [coordsfiles; coordsfile];
%deltafiles = [deltafiles deltafile];
deltafiles = [deltafiles; {deltafile}];
assembly_pairs = [assembly_pairs;[assembly_names(i) assembly_names(j)]];
end
end
% fprintf('Loading alignment files.\n');
% load alignments_done;
% Put the nucmer alignments in an adjency matrix
%[adj, names, name_hash, contig, overlap] = get_adj_matrix(coordsfiles, assembly_pairs, assembly_quality, z_weights, 'clipping_thrs', clipping_thrs, 'dove_tail', 'double','edge_weight','z-scores', 'containment_edge', true);
[adj, names, name_hash, contig, overlap] = get_adj_matrix(deltafiles, assembly_pairs, assembly_quality, z_weights, 'clipping_thrs', clipping_thrs, 'dove_tail', 'double','edge_weight','z-scores', 'containment_edge', containment_edge);
% Merge deltafiles
deltafilesnew = deltafiles{1};
if size(deltafiles,1) > 1
for di = 2:size(deltafiles,1)
deltafilesnew = [deltafilesnew deltafiles{di}];
end
end
deltafiles = deltafilesnew;
else
assembly_pairs = {};
coordsfiles = [];
deltafiles = [];
adj = [];
names = {};
name_hash = containers.Map;
contig = struct('name',{},'size',[],'chromosome',[],'number',[], 'assembly', [], 'assembly_quality', []);
overlap = struct('Q',{},'R',[],'S1',[],'E1', [], 'S2', [], 'E2', [], 'LEN1', [], 'LEN2', [], 'IDY', [], 'COVR', [], 'COVQ', [],'LENR',[], 'LENQ',[]);
end
% Ad the pseudo nodes to the graph. If the contigs have already been
% aligned to the reference genome, just select the alignments that
% correspond to the target chromosome
if isequal(ReferenceAlignment,'NotYetDoneByMaia')
% Align all contigs in 'contig_sets_fasta' to the reference chromosome
[contig_ref, overlap_ref, name_hash_ref, deltafiles_ref] = align_contigs_sets(...
ref_genome, ref_node_assembly_locs, ref_node_assembly_names, ...
ref_node_assembly_quality, clipping_thrs, z_weights, ...
ref_distance,max_chromosome_dist);
ReferenceAlignment = 'out2.delta';
end
% Select only the entries in the deltafile for the current target chromosome
[contig_target_ref, overlap_target_ref, name_hash_target_ref, delta_target_ref] = ...
GetVariablesForTargetChromosome(...
contig_ref, overlap_ref, deltafiles_ref);
% Ref clipping should be high in case of tiling
%if isequal(max_chromosome_dist,'tiling')
% clipping_thrs = 10000
%end
% Add reference nodes to the adjency matrix
[adj, names, name_hash, contig, overlap, delta_target_ref, Startnode_name, Endnode_name] = get_reference_nodes( ...
adj, names, name_hash, contig, overlap, target_chromosome, ...
contig_target_ref, overlap_target_ref, name_hash_target_ref, delta_target_ref, ...
max_chromosome_dist, ref_distance, clipping_thrs, ref_first,...
Startnode_name, Endnode_name, AllowReverse);
% Give reference edges some small extra value to distict between
% assemblies to which a reference node leads
% adj = rank_reference_edges(adj,contig,assembly_quality);
% Specify a start and an end node for the assembly
Startnode = name_hash(Startnode_name);
Endnode = name_hash(Endnode_name);
% Find the best scoring path
fprintf('Directing the final graph\n');
% Calculate path on undirected graph to get an idea on how to direct the graph
[longest_path weigth] = longest_path_tabu(adj, Startnode, Endnode, quit_treshold, tabu_time, minimum_improvement);
% Make the graph directed (greedy)
[adj_direct contig_direct] = direct_graph(adj,overlap, contig, names, name_hash,clipping_thrs, Startnode, longest_path, true, ref_first);
% Calcultate final layout-path
fprintf('Find highest scoring path\n');
[longest_path_direct weigth_direct] = longest_path_tabu(adj_direct, Startnode, Endnode, quit_treshold, tabu_time, minimum_improvement);
function [contig_target_ref, overlap_target_ref, name_hash_target_ref, delta_target_ref] = ...
GetVariablesForTargetChromosome(...
contig_ref, overlap_ref, deltafiles_ref)
% Select only the entries in the deltafile for the current target chromosome
delta_target_ref = deltafiles_ref;
for di = size(delta_target_ref,2):-1:1
if ~isequal(delta_target_ref(di).R,target_chromosome)
delta_target_ref(di) = [];
end
end
overlap_target_ref = overlap_ref;
for oi = size(overlap_target_ref,2):-1:1
if ~isequal(overlap_target_ref(oi).R,target_chromosome)
overlap_target_ref(oi) = [];
end
end
contig_target_ref = contig_ref;
for ci = size(contig_target_ref,1):-1:1
if isequal(contig_target_ref(ci).assembly, 'reference') && ~isequal(contig_target_ref(ci).name,target_chromosome)
contig_target_ref(ci) = [];
end
end
name_hash_target_ref = make_hash({contig_target_ref.name}');
end
end
There is no exact equivalent of containers.Map in Octave that I know of...
One option is to use the java package to create a java.util.Hashtable, as in this example:
pkg load java
d = javaObject("java.util.Hashtable");
d.put('a',1)
d.put('b',2)
d.put('c',3)
d.get('b')
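A few other java.util.Hashtable methods map naturally onto containers.Map operations, continuing with the same d as above:
d.containsKey('b')  % analogue of isKey
d.remove('a')       % analogue of remove; returns the removed value
d.size()            % analogue of Count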
If you are willing to do a bit of rewriting, you can use the built-in struct as a rudimentary hash table, with strings (valid variable names) as keys and pretty much anything stored as values.
For example, given the following:
keys = {'Mon','Tue','Wed'}
values = {10, 20, 30}
you could replace this:
map = containers.Map(keys,values);
map('Mon')
by:
s = struct();
for i=1:numel(keys)
s.(keys{i}) = values{i};
end
s.('Mon')
You might need to use genvarname to produce valid keys, or maybe a proper hashing function that produces valid key strings.
Also look into the struct-related functions: getfield, setfield, isfield, fieldnames, rmfield, etc.
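For instance, a small sketch combining those helpers (the raw key is hypothetical; the exact field name produced by genvarname depends on its escaping rules):
rawKey = 'Mon 1';                % not a valid field name as-is
key = genvarname(rawKey);        % sanitized into a valid field name
s = struct();
s.(key) = 10;
if isfield(s, key)               % rough analogue of isKey
    value = s.(key);
end
s = rmfield(s, key);             % rough analogue of remove
allKeys = fieldnames(s);         % rough analogue of keys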