results of two feature selection algo do not match - matlab

I am working on two feature selection algorithms for a real world problem where the sample size is 30 and feature size is 80. The first algorithm is wrapper forward feature selection using SVM classifier, the second is filter feature selection algorithm using Pearson product-moment correlation coefficient and Spearman's rank correlation coefficient. It turns out that the selected features by these two algorithms are not overlap at all. Is it reasonable? Does it mean I made mistakes in my implementation? Thank you.
FYI, I am using Libsvm + matlab.

It can definitely happen as both strategies do not have the same expression power.
Trust the wrapper if you want the best feature subset for prediction, trust the correlation if you want all features that are linked to the output/predicted variable. Those subsets can be quite different, especially if you have many redundant features.
Using top correlated features is a strategy which assumes that the relationships between the features and the output/predicted variable are linear, (or at least monotonous in case of Spearman's rank correlation), and that features are statistically independent one from another, and do not 'interact' with one another. Those assumptions are most often violated in real world problems.
Correlations, or other 'filters' such as mutual information, are better used either to filter out features, to decide which features not to consider, rather than to decide which features to consider. Filters are necessary when the initial feature count is very large (hundreds, thousands) to reduce the workload for a subsequent wrapper algorithm.

Depending on the distribution of the data you can either use spearman or pearson.The latter is used for normal distribution while former for non-normal.Find the distribution and use appropriate one.

Related

Mutate weights and biases in a neural network through genetic algorithm

I have a genetic algorithm evolving a population of neural networks
Until now I make mutation on weights or biases using random.randn (Python) which is a random value from a normal distribution with mean = 0
It works "well" and I managed to achieve my project using it be wouldn't it be better to use a uniform distribution on a given interval ?
My intuition is that it would lead to more variety in my networks
I think, that this question has no simple solution. In case of normal distribution will be numbers around mean have more chances to be "selected" by your number generator, uniform distribution give almost equal chance to all numbers. That is clear but answer to question, will equal chance mean better result, lays according to me only at empirical experiments. So I suggest you to perform experiments with normal and uniform distribution a try to judge based on results.
About variety. I assume that you create some random vector which represents weights. At stage of mutation you perform addition of random number. This number will be more likely from close interval around mean, so in case 0 mutation with high probability will be change of some elements only little. So there will be only little improvements over vector and sometimes something big shows up. In case of uniform distribution will be changes more random, which leads to different individual. Question is, will be these individual better? I don't know, but I offer you another view. I look to genetic algorithms like an analogy to evolution theory. And from this point of view, cumulative little improvements of individual with little probability of some big change is more appropriate. Think about situation, used is uniform distribution, but children has low fitness due to big changes so at phase of creating new generation will be not selected. And you will wait so long for one tiny improvement which make your network works with good results.
Maybe one more thing. Your experiments maybe show that uniform/normal distribution is better. But such result may be true only for your current problem, no at general.

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation, we split the data into (say) 10 folds and train the data from 9 folds and the remaining folds we use for testing. We repeat this process until we test all of the folds, so that every folds are tested exactly once.
When we are training the model from 9 folds, should we not get a different model (may be slightly different from the model that we have created when using the whole dataset)? I know that we take an average of all the "n" performances.
But, what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is same as the model which we created using whole of the dataset before cross-validation. If we are considering the overall model even after cross-validation (and not taking avg of all the models), then what's the point of calculating average performance from n different models (because they are trained from different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think that there is some confusion in some of the answers proposed because of the use of the word "model" in the question asked. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K-different predictors (or decision functions), which you call "model" (this is a bad idea because in machine learning we also do model selection which is choosing between families of predictors and this is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function which is optimized is convex with respect to the predictor, than it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictor is always smaller than the average of the individual risks.
The PROs and CONs of averaging (vs retraining) are as follows
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic which is used when you are retraining and which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor) so that if you relearn from data from the same distribution, you should have in average the same level of performance. But note that this is in average and when retraining from the whole data this could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data point is large with 10 fold-CV you should be fine. But if you are learning from 25 data points with 5 fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once
There are unfortunately very little thorough theory on this but the following two papers especially the second paper consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are differently well suited for your problem. Using CV you obtain many different estimates on how each model type and parametrization would perform on unseen data. From those results you usually choose one well suited model type + parametrization which you will use, then train it again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different partition splits) is to get a stable estimation of the performance - which will enable you to e.g. look at the mean/median performance and its spread (would give you information about how well the model usually performs and how likely it is to be lucky/unlucky and get better/worse results instead).
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. Its one way to use an ensemble of models instead of a single one. But also for those you want to use CV in the end for choosing reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated cv your seemingly best model is likely to be only a mediocre model when you score it on new data...

Best way to validate DBSCAN Clusters

I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters from a fire data set and the results look quite good. The data set is spatial and the clusters are based on latitude, longitude. Basically, the DBSCAN parameters identify hot spot regions where there is a high concentration of fire points (defined by density). These are the fire hot spot regions.
My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?
Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?
ELKI contains a number of evaluation functions for clusterings.
Use the -evaluator parameter to enable them, from the evaluation.clustering.internal package.
Some of them will not automatically run because they have quadratic runtime cost - probably more than your clustering algorithm.
I do not trust these measures. They are designed for particular clustering algorithms; and are mostly useful for deciding the k parameter of k-means; not much more than that. If you blindly go by these measures, you end up with useless results most of the time. Also, these measures do not work with noise, with either of the strategies we tried.
The cheapest are the label-based evaluators. These will automatically run, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index to compare the similarity of two clusterings. All of these indexes are sensitive to noise so they don't work too well with DBSCAN, unless your reference has the same concept of noise as DBSCAN.
If you can afford it, a "subjective" evaluation is always best.
You want to solve a problem, not a number. That is the whole point of "data science", being problem oriented and solving the problem, not obsessed with minimizing some random quality number. If the results don't work in reality, you failed.
There are different methods to validate a DBSCAN clustering output. Generally we can distinguish between internal and external indices, depending if you have labeled data available or not. For DBSCAN there is a great internal validation indice called DBCV.
External Indices:
If you have some labeled data, external indices are great and can demonstrate how well the cluster did vs. the labeled data. One example indice is the RAND indice.https://en.wikipedia.org/wiki/Rand_index
Internal Indices:
If you don't have labeled data, then internal indices can be used to give the clustering result a score. In general the indices calculate the distance of points within the cluster and to other clusters and try to give you a score based on the compactness (how close are the points to each other in a cluster?) and
separability (how much distance is between the clusters?).
For DBSCAN, there is one great internal validation indice called DBCV by Moulavi et al. Paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
Python package: https://github.com/christopherjenness/DBCV

kmean clustering: variable selection

I'm applying a kmean algorithm for clustering my customer base. I'm struggling conceptually on the selection process of the dimensions (variables) to include in the model. I was wondering if there are methods established to compare among models with different variables. In particular, I was thinking to use the common SSwithin / SSbetween ratio, but I'm not sure if that can be applied to compare models with a different number of dimensions...
Any suggestions>?
Thanks a lot.
Classic approaches are sequential selection algorithms like "sequential floating forward selection" (SFFS) or "sequential floating backward elimination (SFBS). Those are heuristic methods where you eliminate (or add) one feature at the time based on your performance metric, e.g,. mean squared error (MSE). Also, you could use a genetic algorithm for that if you like.
Here is an easy-going paper that summarizes the ideas:
Feature Selection from Huge Feature Sets
And a more advanced one that could be useful: Unsupervised Feature Selection for the k-means Clustering Problem
EDIT:
When I think about it again, I initially had the question in mind "how do I select the k (a fixed number) best features (where k < d)," e.g., for computational efficiency or visualization purposes. Now, I think what you where asking is more like "What is the feature subset that performs best overall?" The silhouette index (similarity of points within a cluster) could be useful, but I really don't think you can really improve the performance via feature selection unless you have the ground truth labels.
I have to admit that I have more experience with supervised rather than unsupervised methods. Thus, I typically prefer regularization over feature selection/dimensionality reduction when it comes to tackling the "curse of dimensionality." I use dimensionality reduction frequently for data compression though.

Clustering: a training dataset of variable data dimensions

I have a dataset of n data, where each data is represented by a set of extracted features. Generally, the clustering algorithms need that all input data have the same dimensions (the same number of features), that is, the input data X is a n*d matrix of n data points each of which has d features.
In my case, I've previously extracted some features from my data but the number of extracted features for each data is most likely to be different (I mean, I have a dataset X where data points have not the same number of features).
Is there any way to adapt them, in order to cluster them using some common clustering algorithms requiring data to be of the same dimensions.
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques. For example Sparse SVD (e.g. Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-mean. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN, that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependant. If your features are e.g. word occurrences, Cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as there is no order to the features (so one could compare the first keypoint with the first keypoint etc.) A possible approach here is to derive another - uniform - set of features. Typically, bag of words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.