Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE) - imputation

I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is that does it matter which type of algorithm i.e. does one perform the best? Is there any empirical evidence?
I am struggling to find any info on the web
Thank you

Yes, (depending on your task) it can matter quite a lot, which algorithm you choose.
You also can be sure, the mice developers wouldn't out effort into providing different algorithms, if there was one algorithm that anyway always performs best. Because, of course like in machine learning the "No free lunch theorem" is also relevant for imputation.
In general you can say, that the default settings of mice are often a good choice.
Look at this example from the miceRanger Vignette to see, how far imputations can differ for different algorithms. (the real distribution is marked in red, the respective multiple imputations in black)
The Predictive Mean Matching (pmm) algorithm e.g. makes sure that only imputed values appear, that were really in the dataset. This is for example useful, where only integer values like 0,1,2,3 appear in the data (and no values in between). Other algorithms won't do this, so while doing their regression they will also provide interpolated values like on the picture to the right ( so they will provide imputations that are e.g. 1.1, 1.3, ...) Both solutions can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.

Related

How to interpret the discriminator's loss and the generator's loss in Generative Adversarial Nets?

I am reading people's implementation of DCGAN, especially this one in tensorflow.
In that implementation, the author draws the losses of the discriminator and of the generator, which is shown below (images come from https://github.com/carpedm20/DCGAN-tensorflow):
Both the losses of the discriminator and of the generator don't seem to follow any pattern. Unlike general neural networks, whose loss decreases along with the increase of training iteration. How to interpret the loss when training GANs?
Unfortunately, like you've said for GANs the losses are very non-intuitive. Mostly it happens down to the fact that generator and discriminator are competing against each other, hence improvement on the one means the higher loss on the other, until this other learns better on the received loss, which screws up its competitor, etc.
Now one thing that should happen often enough (depending on your data and initialisation) is that both discriminator and generator losses are converging to some permanent numbers, like this:
(it's ok for loss to bounce around a bit - it's just the evidence of the model trying to improve itself)
This loss convergence would normally signify that the GAN model found some optimum, where it can't improve more, which also should mean that it has learned well enough. (Also note, that the numbers themselves usually aren't very informative.)
Here are a few side notes, that I hope would be of help:
if loss haven't converged very well, it doesn't necessarily mean that the model hasn't learned anything - check the generated examples, sometimes they come out good enough. Alternatively, can try changing learning rate and other parameters.
if the model converged well, still check the generated examples - sometimes the generator finds one/few examples that discriminator can't distinguish from the genuine data. The trouble is it always gives out these few, not creating anything new, this is called mode collapse. Usually introducing some diversity to your data helps.
as vanilla GANs are rather unstable, I'd suggest to use some version
of the DCGAN models, as they contain some features like convolutional
layers and batch normalisation, that are supposed to help with the
stability of the convergence. (the picture above is a result of the DCGAN rather than vanilla GAN)
This is some common sense but still: like with most neural net structures tweaking the model, i.e. changing its parameters or/and architecture to fit your certain needs/data can improve the model or screw it.

Best Method to Intersect Huge HyperLogLogs in Redis

The problem is simple: I need to find the optimal strategy to implement accurate HyperLogLog unions based on Redis' representation thereof--this includes handling their sparse/dense representations if the data structure is exported for use elsewhere.
Two Strategies
There are two strategies, one of which seems vastly simpler. I've looked at the actual Redis source and I'm having a bit of trouble (not big in C, myself) figuring out whether it's better from a precision and efficiency perspective to use their built-in structures/routines or develop my own. For what it's worth, I'm willing to sacrifice space and to some degree errors (stdev +-2%) in the pursuit of efficiency with extremely large sets.
1. Inclusion Principle
By far the simplest of the two--essentially I would just use the lossless union (PFMERGE) in combination with this principle to calculate an estimate of the overlap. Tests seem to show this running reliably in many cases, although I'm having trouble getting an accurate handle on in-the-wild efficiency and accuracy (some cases can produce errors of 20-40% which is unacceptable in this use case).
Basically:
aCardinality + bCardinality - intersectionCardinality
or, in the case of multiple sets...
aCardinality + (bCardinality x cCardinality) - intersectionCardinality
seems to work in many cases with good accuracy, but I don't know if I trust it. While Redis has many built-in low-cardinality modifiers designed to circumvent known HLL issues, I don't know if the issue of wild inaccuracy (using inclusion/exclusion) is still present with sets of high disparity in size...
2. Jaccard Index Intersection/MinHash
This way seems more interesting, but a part of me feels like it may computationally overlap with some of Redis' existing optimizations (ie, I'm not implementing my own HLL algorithm from scratch).
With this approach I'd use a random sampling of bins with a MinHash algorithm (I don't think an LSH implementation is worth the trouble). This would be a separate structure, but by using minhash to get the Jaccard index of the sets, you can then effectively multiply the union cardinality by that index for a more accurate count.
Problem is, I'm not very well versed in HLL's and while I'd love to dig into the Google paper I need a viable implementation in short order. Chances are I'm overlooking some basic considerations either of Redis' existing optimizations, or else in the algorithm itself that allows for computationally-cheap intersection estimates with pretty lax confidence bounds.
thus, my question:
How do I most effectively get a computationally-cheap intersection estimate of N huge (billions) sets, using redis, if I'm willing to sacrifice space (and to a small degree, accuracy)?
Read this paper some time back. Will probably answer most of your questions. Inclusion Principle inevitably compounds error margins a large number of sets. Min-Hash approach would be the way to go.
http://tech.adroll.com/media/hllminhash.pdf
There is a third strategy to estimate the intersection size of any two sets given as HyperLogLog sketches: Maximum likelihood estimation.
For more details see the paper available at
http://oertl.github.io/hyperloglog-sketch-estimation-paper/.

kmean clustering: variable selection

I'm applying a kmean algorithm for clustering my customer base. I'm struggling conceptually on the selection process of the dimensions (variables) to include in the model. I was wondering if there are methods established to compare among models with different variables. In particular, I was thinking to use the common SSwithin / SSbetween ratio, but I'm not sure if that can be applied to compare models with a different number of dimensions...
Any suggestions>?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward elimination (SFBS). Those are heuristic methods where you eliminate (or add) one feature at the time based on your performance metric, e.g,. mean squared error (MSE). Also, you could use a genetic algorithm for that if you like.
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again, I initially had the question in mind "how do I select the k (a fixed number) best features (where k < d)," e.g., for computational efficiency or visualization purposes. Now, I think what you where asking is more like "What is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I really don't think you can really improve the performance via feature selection unless you have the ground truth labels.
I have to admit that I have more experience with supervised rather than unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I use dimensionality reduction frequently for data compression though.

What's a genetic algorithm that would produce interesting/surprising results and not have a boring/obvious end point?

I find genetic algorithm simulations like this to be incredibly entrancing and I think it'd be fun to make my own. But the problem with most simulations like this is that they're usually just hill climbing to a predictable ideal result that could have been crafted with human guidance pretty easily. An interesting simulation would have countless different solutions that would be significantly different from each other and surprising to the human observing them.
So how would I go about trying to create something like that? Is it even reasonable to expect to achieve what I'm describing? Are there any "standard" simulations (in the sense that the game of life is sort of standardized) that I could draw inspiration from?
Depends on what you mean by interesting. That's a pretty subjective term. I once programmed a graph analyzer for fun. The program would first let you plot any f(x) of your choice and set the bounds. The second step was creating a tree holding the most common binary operators (+-*/) in a random generated function of x. The program would create a pool of such random functions, test how well they fit to the original curve in question, then crossbreed and mutate some of the functions in the pool.
The results were quite cool. A totally weird function would often be a pretty good approximation to the query function. Perhaps not the most useful program, but fun nonetheless.
Well, for starters that genetic algorithm is not doing hill-climbing, otherwise it would get stuck at the first local maxima/minima.
Also, how can you say it doesn't produce surprising results? Look at this vehicle here for example produced around generation 7 for one of the runs I tried. It's a very old model of a bicycle. How can you say that's not a surprising result when it took humans millennia to come up with the same model?
To get interesting emergent behavior (that is unpredictable yet useful) it is probably necessary to give the genetic algorithm an interesting task to learn and not just a simple optimisation problem.
For instance, the Car Builder that you referred to (although quite nice in itself) is just using a fixed road as the fitness function. This makes it easy for the genetic algorithm to find an optimal solution, however if the road would change slightly, that optimal solution may not work anymore because the fitness of a solution may have grown dependent on trivially small details in the landscape and not be robust to changes to it. In real, cars did not evolve on one fixed test road either but on many different roads and terrains. Using an ever changing road as the (dynamic) fitness function, generated by random factors but within certain realistic boundaries for slopes etc. would be a more realistic and useful fitness function.
I think EvoLisa is a GA that produces interesting results. In one sense, the output is predictable, as you are trying to match a known image. On the other hand, the details of the output are pretty cool.

Choose the right classification algorithm. Linear or non-linear? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I find this question a little tricky. Maybe someone knows an approach to answer this question. Imagine that you have a dataset(training data) which you don't know what it is about. Which features of training data would you look at in order to infer classification algorithm to classify this data? Can we say anything whether we should use a non-linear or linear classification algorithm?
By the way, I am using WEKA to analyze the data.
Any suggestions?
Thank you.
This is in fact two questions in one ;-)
Feature selection
Linear or not
add "algorithm selection", and you probably have three most fundamental questions of classifier design.
As an aside note, it's a good thing that you do not have any domain expertise which would have allowed you to guide the selection of features and/or to assert the linearity of the feature space. That's the fun of data mining : to infer such info without a priori expertise. (BTW, and while domain expertise is good to double-check the outcome of the classifier, too much a priori insight may make you miss good mining opportunities). Without any such a priori knowledge you are forced to establish sound methodologies and apply careful scrutiny to the results.
It's hard to provide specific guidance, in part because many details are left out in the question, and also because I'm somewhat BS-ing my way through this ;-). Never the less I hope the following generic advice will be helpful
For each algorithm you try (or more precisely for each set of parameters for a given algorithm), you will need to run many tests. Theory can be very helpful, but there will remain a lot of "trial and error". You'll find Cross-Validation a valuable technique.
In a nutshell, [and depending on the size of the available training data], you randomly split the training data in several parts and train the classifier on one [or several] of these parts, and then evaluate the classifier on its performance on another [or several] parts. For each such run you measure various indicators of performance such as Mis-Classification Error (MCE) and aside from telling you how the classifier performs, these metrics, or rather their variability will provide hints as to the relevance of the features selected and/or their lack of scale or linearity.
Independently of the linearity assumption, it is useful to normalize the values of numeric features. This helps with features which have an odd range etc.
Within each dimension, establish the range within, say, 2.5 standard deviations on either side of the median, and convert the feature values to a percentage on the basis of this range.
Convert nominal attributes to binary ones, creating as many dimensions are there are distinct values of the nominal attribute. (I think many algorithm optimizers will do this for you)
Once you have identified one or a few classifiers with a relatively decent performance (say 33% MCE), perform the same test series, with such a classifier by modifying only one parameter at a time. For example remove some features, and see if the resulting, lower dimensionality classifier improves or degrades.
The loss factor is a very sensitive parameter. Try and stick with one "reasonnable" but possibly suboptimal value for the bulk of the tests, fine tune the loss at the end.
Learn to exploit the "dump" info provided by the SVM optimizers. These results provide very valuable info as to what the optimizer "thinks"
Remember that what worked very well wih a given dataset in a given domain may perform very poorly with data from another domain...
coffee's good, not too much. When all fails, make it Irish ;-)
Wow, so you have some training data and you don't know whether you are looking at features representing words in a document, or genese in a cell and need to tune a classifier. Well, since you don't have any semantic information, you are going to have to do this soley by looking at statistical properties of the data sets.
First, to formulate the problem, this is more than just linear vs non-linear. If you are really looking to classify this data, what you really need to do is to select a kernel function for the classifier which may be linear, or non-linear (gaussian, polynomial, hyperbolic, etc. In addition each kernel function may take one or more parameters that would need to be set. Determining an optimal kernel function and parameter set for a given classification problem is not really a solved problem, there are only useful heuristics and if you google 'selecting a kernel function' or 'choose kernel function', you will be treated to many research papers proposing and testing various approaches. While there are many approaches, one of the most basic and well travelled is to do a gradient descent on the parameters-- basically you try a kernel method and a parameter set , train on half your data points and see how you do. Then you try a different set of parameters and see how you do. You move the parameters in the direction of best improvement in accuracy until you get satisfactory results.
If you don't need to go through all this complexity to find a good kernel function, and simply want an answer to linear or non-linear. then the question mainly comes down to two things: Non linear classifiers will have a higher risk of overfitting (undergeneralizing) since they have more dimensions of freedom. They can suffer from the classifier merely memorizing sets of good data points, rather than coming up with a good generalization. On the other hand a linear classifier has less freedom to fit, and in the case of data that is not linearly seperable, will fail to find a good decision function and suffer from high error rates.
Unfortunately, I don't know a better mathematical solution to answer the question "is this data linearly seperable" other than to just try the classifier itself and see how it performs. For that you are going to need a smarter answer than mine.
Edit: This research paper describes an algorithm which looks like it should be able to determine how close a given data set comes to being linearly seperable.
http://www2.ift.ulaval.ca/~mmarchand/publications/wcnn93aa.pdf