Does Scikit Learn SelectKBest group continous data itself? - feature-selection

I want to use the SelectKBest Scikit Learn algorithm (with chi-square test) to reduce the number of features in my ML setup. Unfortunately my feature-data is continous. So my question is, if scikit groups this data itself or is this my job? On the one hand in most cases the input data is continous and chi-sqaure-test seems to be an often used method. On the other hand there is no possibility to define a number of groups.
Thanks
Prickles

Related

Feature selection for one class classification

I try to apply One Class SVM but my dataset contains too many features and I believe feature selection would improve my metrics. Are there any methods for feature selection that do not need the label of the class?
If yes and you are aware of an existing implementation please let me know
You'd probably get better answers asking this on Cross Validated instead of Stack Exchange, although since you ask for implementations I will answer your question.
Unsupervised methods exist that allow you to eliminate features without looking at the target variable. This is called unsupervised data (dimensionality) reduction. They work by looking for features that convey similar information and then either eliminate some of those features or reduce them to fewer features whilst retaining as much information as possible.
Some examples of data reduction techniques include PCA, redundancy analysis, variable clustering, and random projections, amongst others.
You don't mention which program you're working in but I am going to presume it's Python. sklearn has implementations for PCA and SparseRandomProjection. I know there is a module designed for variable clustering in Python but I have not used it and don't know how convenient it is. I don't know if there's an unsupervised implementation of redundancy analysis in Python but you could consider making your own. Depending on what you decide to do it might not be too tricky (especially if you just do correlation based).
In case you're working in R, finding versions of data reduction using PCA will be no problem. For variable clustering and redundancy analysis, great packages like Hmisc and ClustOfVar exist.
You can also read about other unsupervised data reduction techniques; you might find other methods more suitable.

Can I separately train a classifier (e.g. SVM) with two different types of features and combine the results later?

I am a student and working on my first simple machine learning project. The project is about classifying articles into fake and true. I want to use SVM as classification algorithm and two different types of features:
TF-IDF
Lexical Features like the count of exclamation marks and numbers
I have figured out how to use the lexical features and TF-IDF as a features separately. However, I have not managed to figure out, how to combine them.
Is it possible, to train and test two separate learning algorithms (one with TF-IDF and the other one with lexical features) and later combine the results?
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
One way of combining two models is called model stacking. The idea behind it is, that you take the predictions of both models and feed them into a third model (called meta-model) which is then trained to make predictions given the output of the first two models. There is also another version of model stacking where you aditionally feed the original features into the meta-model.
However, in your case another way to combine both approaches would be to simply feed both the TF-IDF and the lexical features into one model and see how that performs.
For example, can I calculate Accuracy, Precision and Recall for both separately and then take the average?
This would unfortunately not work, because there is no combined model making those predictions for which your calculated metrics would be true.

Is it possible to simultaneously use and train a neural network?

Is it possible to use Tensorflow or some similar library to make a model that you can efficiently train and use at the same time.
An example/use case for this would be a chat bot that you give feedback to. Somewhat like how pets learn (i.e. replicating what they just did for a reward). Or being able to add new entries or new responses they can use.
I think what you are asking is whether a model can be trained continuously without having to retrain it from scratch each time new labelled data comes in.
Answer to that is - Online models
There are models that can be trained continuously on data without worrying about training them from scratch. As per Wikipedia definition
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Some examples of such algorithms are
BernoulliNB
GaussianNB
MiniBatchKMeans
MultinomialNB
PassiveAggressiveClassifier
PassiveAggressiveRegressor
Perceptron
SGDClassifier
SGDRegressor
DNNs

Fusion Classifier in Weka?

I have a dataset with 20 features. 10 for age and 10 for weight. I want to classify the data for both separately then use the results from these 2 classifiers as an input to a third for the final result..
Is this possible with Weka????
Fusion of decisions is possible in WEKA (or with any two models), but not using the approach you describe.
Seeing as your using classifiers, each model will only output a class. You could use the two labels produced as features for a third model, but the lack of diversity in your inputs would most likely prevent the third model from giving you anything interesting.
At the most basic level, you could implement a voting scheme. Give each model a "vote" and then take assume that the correct class is the majority voted class. While this will give a rudimentary form of fusion, if you're familiar with voting theory you know that majority-rules somewhat falls apart when you have more than two classes.
I recommend that you use Combinatorial Fusion to fuse the output of the two classifiers. A good paper regarding the technique is available as a free PDF here. In essence, you use the Classifer::distributionForInstance() method provided by WEKA's classifiers and then use the sum of the distributions (called "scores") to rank the classes, choosing the class with the highest rank. The paper demonstrates that this method is superior to doing just voting alone.

Why we need training and test datasets in research?

I'm newbie in research area of data mining (text clustering) and i have couple question regarding to training and test datasets.
Is that clustering need training and testing datasets?
why we need to separate into training and test datasets?
Sorry for the rookie question hope expert in this group can help me.
As your question is on clustering:
In cluster analysis, there usually is no training or test data split.
Because you do cluster analysis when you do not have labels, so you cannot "train".
Training is a concept from machine learning, and train-test splitting is used to avoid overfitting.
But if you are not learning labels, you cannot overfit.
Properly used cluster analysis is a knowledge discovery method. You want to discover some new structure in your data, not rediscover something that is already labeled.
To train your data you need a sets of relevant data similar but not identical to your testing data. For example, you could split up your data where 0.7 of your data is training and the rest testing. This will allow your algorithm to get a feel for what it should be looking for. The rest of the data 0.3 can be used for testing as it is a distinct set of information (hopefully) which should allow the algorithm to test itself.
Why split it up?
Well if you train your data on data A and then test your algorithm on data A your algorithm will be able to identify all the information correctly because that is what it was trained on.
For example, if when learning addition you were given the sums 3+4, 4+5, 6+9, which you correctly solved it would be redundant to test your knowledge of addition using the same sums.
further information:
http://en.wikipedia.org/wiki/Natural_language_processing
http://www.nltk.org/book
Hope this helps.