Dynamic balanced data structure in Matlab?

This answer states
I don't think you (or I) can do dynamic data structures 'in' MATLAB.
We have to use MATLAB OO features and MATLAB classes. Since I think
that these facilities are really a MATLAB wrapper around Java I make
the bold claim that those facilities are outside MATLAB. A matter of
semantics, I concede. If you want to do dynamic data structures with
MATLAB, you have to use OO and classes, you can't do it with what I
think of as the core language, which lacks pointers at the user level.
Now consider a bag: numbers are added to it in random order, yet the contents should remain sorted. The number of elements is unknown in advance, so I need a dynamic data structure: the size of the structure must be able to change. The structure must also stay balanced, i.e. I need to keep it ordered.
Which data structure should I use for the dynamic balanced data-structure requirement in Matlab?

Matlab's matrices are inherently dynamic. If you have a vector of ordered numbers and want to insert a new number in its proper place (keeping the vector sorted), you can simply do
ind = find(number <= vector, 1, 'first'); % index of the first element >= number, i.e. where to insert
if isempty(ind), ind = numel(vector)+1; end % no larger element: insert at the end
vector = [vector(1:ind-1), number, vector(ind:end)]; % do the insert, growing the vector
Of course this is not very fast, because each insertion reallocates the array and copies its contents.
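If you insert many numbers, a common workaround is to keep spare capacity and grow it geometrically, so the cost of reallocation is amortized over many insertions. A minimal sketch (the buffer scheme and the names buf and n are illustrative, not part of the original answer):
buf = zeros(1, 16); n = 0;                         % capacity 16, 0 elements in use
for number = randn(1, 1000)                        % a stream of incoming numbers
    if n == numel(buf), buf(2*numel(buf)) = 0; end % buffer full: double its capacity
    ind = find(number <= buf(1:n), 1, 'first');    % insertion point among used elements
    if isempty(ind), ind = n + 1; end              % no larger element: insert at the end
    buf(ind+1:n+1) = buf(ind:n);                   % shift the tail right by one
    buf(ind) = number;
    n = n + 1;
end
vector = buf(1:n);                                 % the sorted contents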

Related

How can I avoid having two instances of a very large matrix at the same time when loading it into a solver?

I am using both Cplex and Gurobi for an LP program whose inequality constraint matrix A can become truly large -- around 5 to 10 GB. When I want to use one of those solvers, I have to create a separate struct with all the problem constraints. This means that I have the matrix A in my workspace and the matrix A in my solver struct at the same time. Even if I clear the workspace copy as fast as possible, there is still a time when both exist and my RAM is overloaded.
I am asking if there is some clever method to deliver the matrix A into the model without both existing at the same time. The only thing I can think of right now is delivering it in small chunks...
MATLAB uses copy-on-write, or lazy copying. This means that, as long as you don't modify one of the copies, all copies of a matrix share the same data:
A = randn(10000);  % ~800 MB of data
B = A;             % no copy yet: B shares A's data
myfunc(B);         % passing B into a function does not copy either

function myfunc(matrix)
C = matrix;        % still no extra memory
C(1) = 0;          % only now, on write, is a real copy of the data made
end
For reference, see for example Loren's blog and Undocumented Matlab.
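Applied to the question: because of copy-on-write, assigning A into the solver struct does not duplicate the 5-10 GB of data. A minimal sketch, with model as an assumed stand-in for your Cplex/Gurobi problem struct:
model.A = A;   % shares A's data: no second 5-10 GB allocation
clear A;       % the struct's reference is now the only one
% Memory would double only if something later writes into model.A.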

Sparse boolean matrix multiplication

Does anybody know an efficient implementation of sparse boolean matrix multiplication? I'm interested in both CPU and GPGPU implementations because I need to multiply matrices of different sizes (from 8x8 up to 10^8 x 10^8). Currently I use the cuSPARSE library, but it supports only numerical matrices (float, double, etc.), which leads to huge overhead (in memory and time) that is critical in my task.
Since a boolean matrix can be viewed as the adjacency matrix of some (bipartite) graph, its product with another matrix can be interpreted as the distance 2 connections between the nodes of two subgraphs linked by a common set of nodes.
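In MATLAB terms, you can emulate the boolean product by multiplying sparse doubles and thresholding the result; a minimal sketch (the size and density are illustrative assumptions, and this does not remove the numeric-type overhead the question complains about):
n = 1000;
A = sprand(n, n, 1e-3) > 0;          % sparse logical adjacency matrix
B = sprand(n, n, 1e-3) > 0;
C = logical(double(A) * double(B));  % C(i,j) is true iff a length-2 path i -> k -> j exists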
To avoid wasting space and to exploit some amount of bit parallelism, you could try using some form of succinct data structure for graph storage and manipulation.
One such family of data structures that could be useful in your case is the K2-tree (or Kn-tree in general), which stores the adjacencies using an approach similar to spatial decompositions such as quadtrees and octrees.
Ultimately, the best algorithm and data structure will heavily depend on the dimension and sparsity patterns of your matrices.

Simple Sequential feature selection in Matlab

I have a 40x3249 noisy dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in Matlab. The Matlab example is complicated and I can't follow it. Even a few examples on Stack Overflow didn't help. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain this in simple terms?
Also, is it a problem that my dataset has a very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting an error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you from using some techniques, such as the discriminant analysis you were attempting, but it's a problem for any method. If you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision-tree model, and even more so if you use a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively, you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
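A hedged sketch of that modification (untested; X is the 40x3249 data and y the 40x1 labels from the question, and the 5-fold cross-validation is an illustrative choice):
critfun = @(Xtr, ytr, Xte, yte) ...              % misclassification count of a tree
    sum(yte ~= str2double(eval( ...
        classregtree(Xtr, ytr, 'method', 'classification'), Xte)));
opts = statset('Display', 'iter');
[selected, history] = sequentialfs(critfun, X, y, 'cv', 5, 'options', opts);
Note that eval on a classification tree returns class labels as strings, hence the str2double to compare against numeric labels.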
This error comes from the use of the classify function in that question, which performs LDA. It occurs when the data is rank deficient (or, in other words, when some features are almost exactly correlated). To overcome this, you should project the data down to a lower-dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use the pca function in Matlab's Statistics Toolbox.
[basis, ~, variances] = pca(X); % principal directions (columns of basis) and the variance along each; X holds row vectors
indices = find(variances > eps(2*max(variances))); % keep components whose variance exceeds machine precision of the largest, with a little extra tolerance (2x)
new_basis = basis(:, indices); % the relevant components, stored in "basis" as column vectors
X_new = X*new_basis; % inner products between the new basis vectors and the original feature vectors, i.e. the projection
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
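For instance, a rough sketch of such a penalised (shrinkage) discriminant for two classes, assuming numeric labels 1 and 2 and a tuning parameter lambda (all names illustrative, untested):
lambda = 0.5;                                         % shrinkage weight, to be tuned
S = cov(X);                                           % sample covariance (rank deficient here)
p = size(S, 1);
S_reg = (1 - lambda)*S + lambda*(trace(S)/p)*eye(p);  % now invertible
mu1 = mean(X(y == 1, :));                             % class means
mu2 = mean(X(y == 2, :));
w = S_reg \ (mu1 - mu2)';                             % discriminant direction
threshold = w' * (mu1 + mu2)' / 2;                    % midpoint decision threshold
predicted = 1 + (X*w < threshold);                    % label 1 or 2 per row of X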
EDIT: P.S. I haven't tested the pca snippet above, because I have rolled my own version of PCA in Matlab.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are nominal (e.g. bank name, account type). Numerical types are, for example, salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. It has the benefit of letting you keep your (probably matrix-based) implementation with this kind of data, but a much simpler way - and one appropriate for most distance-based methods - is to use a modified distance function.
There is an infinite number of such combinations. You need to experiment to see which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; it may also make sense to move this normalization into the distance function itself), plus a distance on the other attributes, scaled appropriately.
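A hedged sketch of such a combined function, usable as a custom distance with knnsearch (the column indices numIdx/catIdx and the weight w are illustrative assumptions; categorical columns are assumed to be integer-coded already):
numIdx = 1:2; catIdx = 3:4; w = 0.5;
mixed = @(x, Y) sqrt(sum(bsxfun(@minus, Y(:,numIdx), x(numIdx)).^2, 2)) ... % Euclidean on numerics
              + w * sum(bsxfun(@ne, Y(:,catIdx), x(catIdx)), 2);            % weighted mismatch count
[idx, d] = knnsearch(X, query, 'K', 5, 'Distance', mixed);  % 5 nearest under the mixed metric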
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric form is by using indicator vectors. See the reference I posted in my previous comment.
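A small sketch of that indicator coding using dummyvar from the Statistics Toolbox (the variable names and the zscore normalization are illustrative):
bank = [1; 2; 1; 3];                         % a categorical column, integer-coded
bankInd = dummyvar(bank);                    % 4x3 matrix of 0/1 indicator columns
X = [zscore(salary), zscore(age), bankInd];  % combine with normalized numeric columns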
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Multidimensional indexing of images

I would like to know if there is a good way for indexing multidimensional objects (i.e. images). More precisely, I have a large collection of images on which I calculate n-dimensional feature vectors. There is a distance metric (i.e. L2-norm) defined over those feature vectors d(u,v). Given a key (an n-dimensional) k, the index should allow fast retrieval of feature vectors that are "close" to k (that is, their distance is small).
MATLAB code reference would be great...
For distance queries, R-trees are often used. I think they can be applied in n dimensions, but I'm not sure whether they work with custom distance or dissimilarity functions. I think it's implemented in this library. It might help to convert your data to n-dimensional coordinates.
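A hedged MATLAB sketch using the Statistics Toolbox (F is the m-by-n feature matrix and k the query key from the question; the choice of 10 neighbours is illustrative). Note that kd-trees degrade as n grows, so for high-dimensional feature vectors an exhaustive or approximate nearest-neighbour search may work better:
ns = createns(F, 'NSMethod', 'kdtree', 'Distance', 'euclidean');  % build the index once
[idx, dist] = knnsearch(ns, k, 'K', 10);  % indices and distances of the 10 closest vectors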