H2O Random Forest Impurity Measure for Classification

I currently use the DRF implementation of H2O-3 to create a binary classification model. However, I noticed that H2O only supports squared error as the impurity measure, which is usually used to construct regression trees, rather than other measures such as Gini impurity, which I think are better suited to classification tasks.
Within the documentation I was not able to find the reasoning for applying this metric to classification tasks as well. Can anyone explain why this approach makes sense?

Related

Can a linear model give higher prediction accuracy than a random forest, decision tree, or neural network?

I have calculated the following metrics after applying the following algorithms to a dataset from Kaggle:
[image: table of evaluation metrics, including AUC, for each model]
In the above case, the linear model gives the best results.
Are the above results correct, and can a linear model actually give better results than the other three in any case?
Or am I missing something?
According to the AUC criterion, this classification is perfect (1 is the theoretical maximum), which indicates a clear separation in the data. In this case it makes little sense to talk about differences between the methods' results. Another point is that you can play with the methods' parameters (you will likely get slightly different results) and other methods may come out ahead, but the real-world results will be indistinguishable. Sophisticated methods are invented for sophisticated data; this is not such a case.
All models are wrong, but some are useful. - George Box
In terms of classification, a model is effective as long as it can fit the classification boundaries well.
In the binary case, supposing your data is perfectly linearly separable, a linear model will do the job; in fact the "best" job, since no more complicated model will perform better.
If your +'s and -'s are scattered so that they cannot be separated by a line (actually a hyperplane), a linear model can be beaten by a decision tree, simply because decision trees can produce classification boundaries of a more complex shape (axis-aligned boxes).
A random forest may in turn beat a single decision tree, as the random forest's classification boundary is more flexible.
However, as mentioned earlier, the linear model still has its place.
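As a rough, hedged illustration of this progression, here is a short scikit-learn sketch comparing a linear model, a single decision tree, and a random forest on linearly separable data versus interleaved "moons". The datasets and hyperparameters are synthetic and purely illustrative, not a definitive benchmark:

```python
from sklearn.datasets import make_classification, make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

datasets = {
    "separable": make_classification(n_samples=500, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     class_sep=2.0, random_state=0),
    "moons": make_moons(n_samples=500, noise=0.25, random_state=0),
}
models = {
    "linear": LogisticRegression(),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for dname, (X, y) in datasets.items():
    for mname, model in models.items():
        # 5-fold cross-validated accuracy for each model/dataset pair
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"{dname:>9} | {mname:>6}: {acc:.3f}")
```

On the separable set all three models should score similarly; on the moons the tree-based models typically pull ahead, matching the argument above.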

Self Organizing Maps and Learning Vector Quantization

Self-organizing maps are more suited to clustering (dimensionality reduction) than to classification, but SOMs are used in learning vector quantization (LVQ) for fine-tuning. LVQ, however, is a supervised learning method, so to use SOMs in LVQ, the LVQ must be provided with a labelled training data set. Since SOMs only cluster and do not classify, and thus cannot produce labelled data, how can a SOM be used as the input to LVQ?
Does LVQ fine-tune the clusters in the SOM?
Before being used in LVQ, should the SOM be put through another classification algorithm so that it can classify the inputs, so that these labelled inputs may be used in LVQ?
It must be clear that supervised learning differs from unsupervised learning in that, in the former, the target values are known.
Therefore, the output of a supervised model is a prediction.
The output of an unsupervised model, instead, is a label whose meaning we don't know yet. For this reason, after clustering, it is necessary to profile each of those new labels.
Having said so, you could label the dataset using an unsupervised learning technique such as a SOM. Then you should profile each class to make sure you understand its meaning.
At this point, you can pursue two different paths, depending on your final objective:
1. use this new variable as a form of dimensionality reduction
2. use this new dataset, featured with the additional variable representing the class, as labelled data that you then try to predict using LVQ (see the sketch below)
Hope this can be useful!
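As a rough sketch of the second path: label data with an unsupervised step, then fine-tune class prototypes with a hand-rolled LVQ1 update. Note the assumptions: KMeans stands in for the SOM purely to keep the example self-contained (a real SOM, e.g. from the minisom package, would normally be used), and the data, learning rate, and epoch count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Step 1: unsupervised labelling (a SOM in the answer; KMeans as a stand-in).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: LVQ1 fine-tuning. One prototype per class, initialised at the
# class mean, nudged toward samples it agrees with and away from the rest.
prototypes = np.array([X[labels == k].mean(axis=0) for k in range(3)])
proto_labels = np.arange(3)
lr = 0.05
for epoch in range(20):
    for x, y in zip(X, labels):
        j = np.argmin(((prototypes - x) ** 2).sum(axis=1))  # nearest prototype
        step = lr * (x - prototypes[j])
        prototypes[j] += step if proto_labels[j] == y else -step
    lr *= 0.9  # decay the learning rate each epoch

# Classify by nearest prototype; this should largely agree with the labels.
pred = proto_labels[((X[:, None] - prototypes[None]) ** 2).sum(-1).argmin(1)]
print("agreement with cluster labels:", (pred == labels).mean())
```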

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is: which clustering algorithm will be best for forming clusters from this kind of data?
I think hierarchical clustering is a good choice. Have a look here: Clustering Algorithms
The simplest way to do the clustering is with the k-means algorithm. If all of your attributes are numerical, this is the easiest route. Even if they are not, you would only have to find a distance measure for the categorical or nominal attributes, and k-means would still be a good choice. K-means is a partitional clustering algorithm, so I wouldn't use hierarchical clustering in this case. But that also depends on what you want to do: you need to decide whether you want to find clusters within clusters, or whether the clusters should all be entirely separate from one another.
Take care.
1) First, try k-means. If that fulfils your needs, you're done. Play with different numbers of clusters (controlled by the parameter k). There are a number of implementations of k-means, and you can implement your own version if you have good programming skills. A minimal sketch follows after this list.
K-means generally works well if the data has a roughly circular/spherical shape, i.e. there is some Gaussianity in the data (the data comes from a Gaussian distribution).
2) If k-means doesn't meet your expectations, it is time to read and think more. I suggest reading a good survey paper; the most common techniques are implemented in several programming languages and data-mining frameworks, many of which are free to download and use.
3) If applying state-of-the-art clustering techniques is still not enough, it is time to design a new technique. Then you can think it through yourself or team up with a machine-learning expert.
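For step 1, a minimal k-means sketch, assuming the attributes are numeric as in the question (the categorical "type of energy source" would need one-hot encoding or a mixed-type method). The synthetic data and feature scales below are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up columns: avg consumption, avg generation, avg fed-in, tariff.
X = rng.normal(size=(200, 4)) * [5.0, 3.0, 2.0, 0.1] + [20.0, 10.0, 4.0, 0.3]

# Scaling matters: k-means is distance-based, and tariffs and energy
# readings live on very different scales.
X_scaled = StandardScaler().fit_transform(X)

for k in (2, 3, 4, 5):  # "play with different numbers of clusters"
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, round(km.inertia_, 1))  # within-cluster SSE, for an elbow plot
```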
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can exploit the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
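A brief sketch of the Gaussian Mixture suggestion, again on made-up numeric data; using BIC to choose the number of components is one common approach, not the only one:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two made-up "populations" of average measurements.
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)),
               rng.normal(4.0, 1.5, (100, 3))])

# Fit mixtures with 1-5 components and keep the one with the lowest BIC.
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X)
     for k in range(1, 6)),
    key=lambda gm: gm.bic(X),
)
clusters = best.predict(X)       # hard assignments
probs = best.predict_proba(X)    # soft memberships, which k-means can't give
print(best.n_components, np.bincount(clusters))
```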

Support Vector Machine vs K Nearest Neighbours

I have a data set to classify. Using the kNN algorithm I get an accuracy of 90%, whereas with SVM I only get just over 70%. Is SVM not better than kNN? I know this might be a stupid question, but what parameters for SVM would give results close to those of the kNN algorithm? I am using the LIBSVM package in MATLAB R2008.
kNN and SVM represent different approaches to learning. Each approach implies a different model for the underlying data.
SVM assumes there exists a hyperplane separating the data points (quite a restrictive assumption), while kNN attempts to approximate the underlying distribution of the data in a non-parametric fashion (a crude approximation of a Parzen-window estimator).
You'll have to look at the specifics of your scenario to make a better decision as to what algorithm and configuration are best used.
It really depends on the dataset you are using. If you have something like the first row of this image ( http://scikit-learn.org/stable/_images/plot_classifier_comparison_1.png ), kNN will work really well and a linear SVM really badly.
If you want the SVM to perform better, you can use a kernel-based SVM like the one in the picture (it uses an RBF kernel).
If you are using scikit-learn for Python, you can play with the code here to see how to use a kernel SVM: http://scikit-learn.org/stable/modules/svm.html
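For instance, a small hedged sketch on synthetic moons data (similar in spirit to the figure's first row), with illustrative hyperparameters, comparing kNN, a linear SVM, and an RBF-kernel SVM:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Interleaved half-moons: non-linear class boundary.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

for name, clf in [
    ("kNN (k=5)  ", KNeighborsClassifier(n_neighbors=5)),
    ("linear SVM ", SVC(kernel="linear")),
    ("RBF SVM    ", SVC(kernel="rbf", gamma=2.0)),
]:
    # 5-fold cross-validated accuracy for each classifier
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```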
kNN basically says "if you're close to coordinate x, then the classification will be similar to observed outcomes at x." In SVM, a close analog would be using a high-dimensional kernel with a "small" bandwidth parameter, since this will cause SVM to overfit more. That is, SVM will be closer to "if you're close to coordinate x, then the classification will be similar to those observed at x."
I recommend that you start with a Gaussian kernel and check the results for different parameters. From my own experience (which is, of course, focused on certain types of datasets, so your mileage may vary), tuned SVM outperforms tuned kNN.
Questions for you:
1) How are you selecting k in kNN?
2) What parameters have you tried for SVM?
3) Are you measuring accuracy in-sample or out-of-sample?
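To make questions 1 and 2 concrete, here is a hedged scikit-learn sketch (rather than the asker's LIBSVM/MATLAB setup) that grid-searches C and gamma for an RBF ("Gaussian") kernel SVM and k for kNN, scoring out of sample via cross-validation; the grids and data are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Tune the RBF-kernel SVM's C and gamma with 5-fold cross-validation.
svm_search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
    cv=5,
).fit(X, y)

# Tune k for kNN the same way, so the comparison is out-of-sample.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 9, 15]},
    cv=5,
).fit(X, y)

print(f"SVM: {svm_search.best_params_}, CV accuracy {svm_search.best_score_:.3f}")
print(f"kNN: {knn_search.best_params_}, CV accuracy {knn_search.best_score_:.3f}")
```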

Matlab: K-means clustering with predefined populations

I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M is variable in length (N = column-specific attributes of each run, M = run number). I have looked at PCA and k-means for differentiating the two, but I was curious about best practice.
To my knowledge, k-means has no initial 'calibration' step in which the clusters are chosen so that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I could then use the fit from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find an easier way to describe it. In principle, you could run k-means (with K=2) on your data and then evaluate the degree to which your two classes match the clusters found by the algorithm (note: you may want multiple starts).
It sounds to me like you have a supervised learning problem: you have a training data set that has already been partitioned into two classes. In this case k-nearest neighbours (as mentioned by @amas) is probably the approach most like k-means; however, Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
It really depends on the data. But just so you know, k-means can get stuck in local minima, so if you want to use it, try running it from different random starting points. PCA might also be useful; however, as with any other spectral method, you have much less control over the clustering procedure. I recommend clustering the data with k-means from multiple random starting points, seeing how it works, and then predicting labels for new samples with k-NN (I don't know whether that is useful for your case). A sketch of this workflow follows below.
Look into lazy learners and k-NN for the prediction step.
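A sketch of that workflow, on synthetic data and with illustrative parameter choices: k-means with multiple random restarts on the training batch, then k-NN to carry the clustering forward to new samples:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X_train, _ = make_blobs(n_samples=300, centers=2, random_state=0)
X_new, _ = make_blobs(n_samples=50, centers=2, random_state=1)

# K=2 with many random restarts (n_init) to avoid poor local minima.
km = KMeans(n_clusters=2, n_init=25, random_state=0).fit(X_train)

# Lazy learner: reuse the fitted clustering to label future datasets.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, km.labels_)
print(knn.predict(X_new)[:10])
```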