Matlab: multiple imputation for missing data

Is there any package available for multiple imputation? Or any reference I can use to write my own function? Since the percentage of missing data is really high in some columns of the data (approximately 50–70%), I think multiple imputation is a good choice.

If you have the Bioinformatics Toolbox installed, check out knnimpute. It imputes missing data using a nearest-neighbor method.
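A minimal usage sketch (here data is a placeholder for a numeric matrix in which missing entries are coded as NaN; Bioinformatics Toolbox required):

    % Fill each NaN from the nearest-neighbor column (default behavior).
    imputed  = knnimpute(data);
    % Use the 3 nearest neighbors instead of just one.
    imputed3 = knnimpute(data, 3);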

Related

Feature selection for one class classification

I am trying to apply a one-class SVM, but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any feature selection methods that do not need the class label?
If so, and you are aware of an existing implementation, please let me know.
You'd probably get better answers asking this on Cross Validated instead of Stack Overflow, although since you ask for implementations I will answer your question here.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. These methods look for features that convey similar information and then either eliminate some of those features or combine them into fewer features while retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which language you're working in, but I'll presume it's Python. sklearn has implementations of PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python, but I have not used it and don't know how convenient it is. I don't know of an unsupervised implementation of redundancy analysis in Python, but you could consider writing your own. Depending on what you decide to do, it might not be too tricky (especially if you just do it correlation-based).
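Whatever the language, the correlation-based version is only a few lines. Here is a minimal sketch in MATLAB (the 0.95 cutoff and the greedy keep/drop pass are arbitrary choices of mine, not a standard recipe):

    % Greedy correlation-based redundancy elimination (a sketch, not a library).
    % X is an n-by-p numeric feature matrix.
    R = abs(corrcoef(X));               % pairwise absolute correlations
    R(logical(eye(size(R)))) = 0;       % zero the diagonal so j never drops itself
    keep = true(1, size(X, 2));
    for j = 1:size(X, 2)
        if keep(j)                      % drop still-kept near-duplicates of feature j
            keep(R(j, :) > 0.95 & keep) = false;
        end
    end
    Xreduced = X(:, keep);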
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.

Calculating clustering validity of k-means using RapidMiner

Well, I have been studying up on the different algorithms used for clustering, like k-means, k-medoids, etc., and I was trying to run the algorithms and analyze their performance on the leaf dataset here:
http://archive.ics.uci.edu/ml/datasets/Leaf
I was able to cluster the dataset via k-means by first reading the CSV file, filtering out unneeded attributes, and applying k-means to it. The problem I am facing is that I wish to calculate measures such as entropy, precision, recall, and F-measure for the model developed via k-means. Is there an operator available that allows me to do this, so that I can quantitatively compare the different clustering algorithms available in RapidMiner?
P.S. I know about performance operators like Performance (Classification) that allow me to calculate precision and recall for a model, but I don't know of any that allow me to calculate entropy.
Help would be much appreciated.
The short answer is to use R. Here's a link to a book chapter about this very subject. There is a revised version coming soon that works for the most recent version of RapidMiner.

Principal Component Analysis w/ Alternating Least Squares for Missing Data

In MATLAB R2014b there is a new function, pca(), that performs PCA and can handle missing data. The documentation says it uses an "alternating least squares" algorithm to estimate the missing values.
I would like to know if there are any practical references on how to apply PCA with this algorithm without using the function, or a good reference on ALS. The reason is that Octave has no such function that can handle missing data, so I would like to code it myself.
Thanks for all your help. I went through the references and was able to find MATLAB code for the ALS algorithm from two of them. For anybody wondering, the source code can be found at these two links:
1) http://research.ics.aalto.fi/bayes/software/index.shtml
2) https://www.cs.nyu.edu/~roweis/code.html
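If you do want to code it yourself, here is a minimal sketch of the basic alternating-least-squares idea: fit a low-rank model X ≈ W*H' using only the observed entries, alternating between solving for the two factors. This is a simplified illustration rather than the exact algorithm from those references; centering, regularization, and a convergence test are omitted, and the function name, k, and the random initialization are my own assumptions. It runs in Octave as well as MATLAB.

    function [W, H, Xhat] = als_pca_missing(X, k, maxIter)
    % Minimal ALS sketch: fit X ~ W*H' from the observed entries only.
    % X is n-by-p with NaN for missing values; k is the number of components.
    % (Rows and columns need at least k observed entries for the
    % least-squares solves below to be well posed.)
    [n, p] = size(X);
    obs = ~isnan(X);
    W = randn(n, k);                  % random initialization (an assumption)
    H = randn(p, k);
    for it = 1:maxIter
        for i = 1:n                   % scores: least squares H(o,:)*w = X(i,o)'
            o = obs(i, :);
            W(i, :) = (H(o, :) \ X(i, o)')';
        end
        for j = 1:p                   % loadings: least squares W(o,:)*h = X(o,j)
            o = obs(:, j);
            H(j, :) = (W(o, :) \ X(o, j))';
        end
    end
    Xhat = W * H';                    % reconstruction; use it to fill the gaps
    end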

Use a dataset array without Statistics Toolbox

At my workplace I have one license of MATLAB on a virtual machine, which includes the Statistics Toolbox. I like to use that instance of MATLAB to import CSV data into dataset arrays, because of the convenience they provide.
However, I'd like to use the imported data on my local machine, which has its own license for MATLAB but (unfortunately) no Statistics Toolbox.
What is the best way to convert the dataset object to something that can be used with only base MATLAB? dataset2struct? It seems that if I'm just converting it back to a structure, I might as well write a function that imports the data directly into a structure. Or is there any other way to work with a dataset array in a MATLAB instance that lacks the Statistics Toolbox?
In MATLAB R2013b (out this September; a prerelease is available now), there will be something similar to a dataset array in base MATLAB, called a table data container (I haven't tried it yet and can't be sure it will be exactly the same). There will also be a categorical array similar to the one currently in the Statistics Toolbox.
Until then, there's not really a way to use a dataset array without the Statistics Toolbox, and I would suggest either of the two methods you mention (personally I'd go with just using a structure throughout, as I find the convenience of dataset arrays to be overrated; but that's just my experience, yours may differ).
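One workable pattern, sketched below, is to flatten the dataset to a scalar struct on the toolbox machine and move it over as a MAT-file (ds and 'mydata.mat' are placeholders):

    % On the virtual machine (Statistics Toolbox available):
    s = dataset2struct(ds, 'AsScalar', true);  % one field per dataset variable
    save('mydata.mat', '-struct', 's');        % each field becomes a MAT variable

    % On the local machine (base MATLAB only):
    data = load('mydata.mat');                 % plain struct, no toolbox needed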

How can I access a MATLAB (interpolated) spline from another program?

If I were to create interpolated splines from a large amount of data (about 400 charts, 500,000 values each), how could I then access the coordinates of those splines from another program quickly and efficiently?
Initially I intended to run a regression on the data and use the resulting formula in my Delphi program, but that turned out to be a bigger pain than I thought.
I am currently using MATLAB, but I can use other software if need be.
Edit: It is probably relevant that this data represents the empirical cumulative distribution of some other data (which I already have in a database).
The emphasis is on speed of access. I intend to use this data to run simulations on financial data.
MATLAB has a command for converting a spline into a piecewise polynomial. You can then extract the breaks and the coefficients of each polynomial piece with unmkpp, and evaluate them in another program.
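A minimal sketch of that workflow (the sine data is just a stand-in for your empirical CDF):

    x = 0:10;
    y = sin(x);                                  % stand-in data
    pp = spline(x, y);                           % cubic spline in pp-form
    [breaks, coefs, npieces, order, dim] = unmkpp(pp);
    % Export 'breaks' and 'coefs'. To evaluate piece i at a point xq with
    % breaks(i) <= xq < breaks(i+1), use the local coordinate xq - breaks(i):
    xq = 2.5;
    i = min(find(breaks <= xq, 1, 'last'), npieces);
    yq = polyval(coefs(i, :), xq - breaks(i));   % matches ppval(pp, xq)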
If you are also familiar with C, you could use MATLAB Coder or something similar to produce an intermediate library that connects your Delphi program and MATLAB. Interfacing Delphi and C code is, albeit a tad tedious, certainly possible (or it was back in the days of Delphi 7). You could even write the algorithm in MATLAB, convert the code to C using MATLAB Coder, and call the generated C library from Delphi.
Perhaps a bit overkill, but you could store your data in a database (e.g., MySQL) from MATLAB and retrieve it from Delphi.
Finally: is Delphi a real constraint? You could also use MATLAB to do the simulations, as you might have the same tools (or even more) available in MATLAB as in Delphi. Afterwards you can just share the results, which I suppose is less speed-critical.
My initial guess at doing this efficiently would be to create a memory mapped file in MATLAB using memmapfile, stuff a look-up table with your data into that, then open the memory mapped file in your Delphi code and read the data from that file.
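A minimal sketch of that idea (the file name, grid, and two-column table layout are assumptions of mine):

    % In MATLAB: write the look-up table as flat doubles, column-major.
    lut = [xgrid(:), cdfvals(:)];                % n-by-2 table (placeholder names)
    fid = fopen('lut.dat', 'w');
    fwrite(fid, lut, 'double');
    fclose(fid);

    % Map it back without reading the whole file into memory.
    % (Delphi can memory-map the same flat file of doubles with its own API.)
    n = numel(xgrid);
    m = memmapfile('lut.dat', 'Format', {'double', [n 2], 'lut'});
    firstRows = m.Data.lut(1:5, :);              % touches only the pages it reads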
The fastest approach is most likely a look-up table that you save to disk, then load and use in your simulation code (although: why not run the simulation in MATLAB?).
You can evaluate the spline for a finely-grained list of values of x using fnval, and use the closest value of x to look up the CDF.
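For instance (the grid resolution is an arbitrary assumption; fnval needs the Curve Fitting Toolbox, while ppval works in base MATLAB):

    % Pre-evaluate the spline once on a fine grid, then answer queries by
    % nearest-x look-up. pp is a spline in pp-form.
    xq  = linspace(pp.breaks(1), pp.breaks(end), 1e5);
    cdf = fnval(pp, xq);              % or ppval(pp, xq) in base MATLAB
    x0  = 0.37;                       % example query point
    [~, idx] = min(abs(xq - x0));
    cdf0 = cdf(idx);                  % approximate CDF at x0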