Tree-Based dimensionality reduction in DNN algorithms - neural-network

My question is straightforward: is it possible to use tree-based dimensionality reduction, such as the feature importance embedded in a Random Forest, before training the dataset with a DNN algorithm?
In other words, does the use of tree-based feature importance prevent the use of training algorithms other than the tree/Random Forest?

I think you should read the DNN article.
Why? Why do you want to use Random Forest before DNN training?
Yes, you can display the feature importance of a random forest like this:

from pandas import DataFrame
from sklearn.ensemble import RandomForestClassifier

# Fit the forest, then rank the features by their impurity-based importance.
random_forest = RandomForestClassifier(random_state=42).fit(x_train, y_train)
feature_importances = DataFrame(random_forest.feature_importances_,
                                index=x_train.columns,
                                columns=['importance']).sort_values('importance',
                                                                    ascending=False)
print(feature_importances)
But this is only a feature-selection step; the DNN itself is a neural-network method. A DNN is more complex than a random forest: where the random forest is used here just to rank features, the DNN combines feature extraction, feed-forward computation, and back-propagation during training.
If you feed the DNN enough training samples, you will usually get higher accuracy.
Does the use of tree-based feature importance prevent the use of other training algorithms?
No. The number of features and samples that is sufficient varies with the problem. Usually you would not use a random forest to compute feature importance for 1M images.
Also, you would not use a DNN for small datasets.
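To make this concrete, here is a minimal sketch of the workflow the question asks about, reusing the x_train/y_train from the snippet above and assuming a held-out x_test/y_test; the cutoff k and the network size are arbitrary illustrative choices, and scikit-learn's MLPClassifier merely stands in for whatever DNN you actually use.

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# 1) Rank the features with a random forest.
forest = RandomForestClassifier(random_state=42).fit(x_train, y_train)

# 2) Keep only the k most important columns (x_train is assumed to be a DataFrame).
k = 20
top_features = x_train.columns[forest.feature_importances_.argsort()[::-1][:k]]

# 3) Train a (small) neural network on the reduced feature set.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
mlp.fit(x_train[top_features], y_train)
print(mlp.score(x_test[top_features], y_test))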

Related

Feature extraction from a Convolutional Neural Network (CNN) and using these features in another classification algorithm

My question is: can we use a CNN for feature extraction, and then use these extracted features as input to another classification algorithm, such as an SVM?
Thanks
Yes, this has already been done and is well documented in several research papers, like CNN Features off-the-shelf: an Astounding Baseline for Recognition and How transferable are features in deep neural networks?. Both show that CNN features trained on one dataset but reused on a different one usually perform very well or even beat the state of the art.
In general you can take the features from the layer before the last, normalize them and use them with another classifier.
Another related technique is fine-tuning, where, after training a network, the last layer is replaced and retrained while the previous layers' weights are kept fixed.
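A minimal sketch of the features-plus-SVM idea, assuming a pretrained torchvision ResNet-18 as the feature extractor and scikit-learn's SVC as the downstream classifier; the tensors train_images/train_labels and test_images/test_labels are hypothetical placeholders for your own data.

import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC
from sklearn.preprocessing import normalize

# Load a pretrained CNN and drop its final classification layer, so the
# forward pass returns the penultimate-layer features (512-d for ResNet-18).
cnn = models.resnet18(pretrained=True)
cnn.fc = nn.Identity()
cnn.eval()

def extract_features(images):
    # images: float tensor of shape (N, 3, 224, 224), already preprocessed.
    with torch.no_grad():
        feats = cnn(images)
    return normalize(feats.numpy())   # L2-normalize the features, as suggested above

# train_feats = extract_features(train_images)
# svm = SVC(kernel='linear').fit(train_feats, train_labels)
# test_feats = extract_features(test_images)
# print(svm.score(test_feats, test_labels))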

Clustering Algorithm for average energy measurements

I have a data set which consists of data points having attributes like:
average daily consumption of energy
average daily generation of energy
type of energy source
average daily energy fed in to grid
daily energy tariff
I am new to clustering techniques.
So my question is: which clustering algorithm would be best for this kind of data?
I think hierarchical clustering is a good choice. Have a look here: Clustering Algorithms.
The simplest way to do the clustering is the k-means algorithm. If all of your attributes are numerical, this is the easiest approach. Even if they are not, you would only need to find a distance measure for the categorical or nominal attributes, and k-means would still be a good choice. K-means is a partitional clustering algorithm; I wouldn't use hierarchical clustering in this case. But that also depends on what you want to do: you need to decide whether you want to find clusters within clusters, or whether all clusters should be completely separate and not nested inside each other.
Take care.
1) First, try k-means (see the sketch after this list). If that fulfills your needs, that's it. Play with different numbers of clusters (controlled by the parameter k). There are many implementations of k-means, and you can write your own version if you have good programming skills.
K-means generally works well if the data has a roughly circular/spherical shape, i.e. there is some Gaussianity in the data (the data come from a Gaussian distribution).
2) If k-means doesn't fulfill your expectations, it is time to read and think more. I then suggest reading a good survey paper. The most common techniques are implemented in several programming languages and data-mining frameworks, many of which are free to download and use.
3) If applying state-of-the-art clustering techniques is still not enough, it is time to design a new technique. Then you can think it through yourself or team up with a machine learning expert.
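A minimal k-means sketch with scikit-learn; the DataFrame df and its column names are hypothetical stand-ins for the real measurements, and k=3 is an arbitrary choice you would tune.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random placeholder data with columns named after the question's numeric attributes;
# the categorical 'type of energy source' would need encoding or a different distance.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'avg_daily_consumption': rng.normal(30, 10, 200),
    'avg_daily_generation': rng.normal(20, 8, 200),
    'avg_daily_fed_to_grid': rng.normal(5, 2, 200),
    'daily_tariff': rng.normal(0.2, 0.05, 200),
})

# Scale first so that no attribute dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
df['cluster'] = kmeans.labels_
print(df['cluster'].value_counts())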
Since most of your data is continuous, and it is reasonable to assume that energy consumption and generation are normally distributed, I would use statistical methods for clustering.
Such as:
Gaussian Mixture Models
Bayesian Hierarchical Clustering
The advantage of these methods over metric-based clustering algorithms (e.g. k-means) is that we can take advantage of the fact that we are dealing with averages, and we can make assumptions about the distributions from which those averages were calculated.
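Following this suggestion, a short Gaussian Mixture Model sketch, reusing the same hypothetical df and scaled array from the k-means sketch above; n_components=3 is again arbitrary and would normally be chosen by comparing BIC values.

from sklearn.mixture import GaussianMixture

# Fit a mixture of Gaussians on the scaled measurements.
gmm = GaussianMixture(n_components=3, random_state=0).fit(scaled)
df['cluster'] = gmm.predict(scaled)
print('BIC:', gmm.bic(scaled))   # lower BIC is better when comparing n_components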

SVM LibSVM Ignore Feature 1,3,5 when Predicting

This question is about LibSVM, or SVMs in general.
I wonder if it is possible to classify feature vectors of different lengths with the same SVM model.
Let's say we train the SVM with about 1000 Instances of the following Feature Vector:
[feature1 feature2 feature3 feature4 feature5]
Now I want to predict a test vector which also has length 5.
If the probability I receive is too poor, I then want to check a subset of my test vector containing only columns 2-5. So I want to dismiss the first feature.
My question now is: is it possible to tell the SVM to use only features 2-5 for prediction (e.g. with weights), or do I have to train different SVM models, one for 5 features, another for 4 features, and so on?
Thanks in advance...
marcus
You can always remove features from your test points by fiddling with the data file, but I highly recommend not using such an approach. An SVM model is only valid when all features are present. If you are using the linear kernel, simply setting a given feature to 0 will implicitly cause it to be ignored (though you should not do this). When using other kernels, this is very much a no-no.
Using a different set of features for prediction than the set you used for training is not a good approach.
I strongly suggest training a new model for the subset of features you wish to use in prediction.
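A minimal sketch of that suggestion with scikit-learn's SVC standing in for LibSVM: one model trained on all 5 features and a second one trained only on features 2-5, with a fallback to the second model when the first one's probability is low. The data, labels, and the 0.6 confidence threshold are hypothetical placeholders.

import numpy as np
from sklearn.svm import SVC

# Placeholder training data: 1000 instances with 5 features, as in the question.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 1] + X_train[:, 2] > 0).astype(int)

svm_full = SVC(probability=True).fit(X_train, y_train)          # all 5 features
svm_sub = SVC(probability=True).fit(X_train[:, 1:5], y_train)   # features 2-5 only

x_test = rng.normal(size=5)
proba_full = svm_full.predict_proba(x_test.reshape(1, -1))
if proba_full.max() < 0.6:                                       # arbitrary threshold
    proba_sub = svm_sub.predict_proba(x_test[1:5].reshape(1, -1))
    print('fallback prediction:', svm_sub.classes_[proba_sub.argmax()])
else:
    print('prediction:', svm_full.classes_[proba_full.argmax()])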

Why does the classification accuracy decrease as the number of features increases when using SVM

I am using libsvm for image classification. Why does my prediction accuracy decrease when I use more features for classification? Shouldn't it increase? My dataset size is fixed at 1600 images for training and 400 for testing.
Because the additional features may not be at all useful for separating the classes in the feature space. Accuracy is not necessarily tied to number of features.
Including lots of poor features may cause your SVM to learn the noise in the data, damaging the accuracy.
For example, if your extra feature is one whose values for the two classes overlap almost completely (picture a 2D plot where the points of both classes are mixed together), then it will not be a very good feature for separating the (in this case) two classes. If, for example, the SVM trains only on this pattern, it will not be very good at predicting the class of a future point. However, there might be a feature in your dataset whose values form two clearly separated groups, one per class; a feature like this would be very useful in separating the two classes.
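A small, self-contained experiment that illustrates the effect, using scikit-learn's synthetic data instead of the asker's images: an SVM trained on a handful of informative features is compared with one trained on the same features plus 200 pure-noise features.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 10 informative features followed by 200 noise features (shuffle=False keeps
# the informative ones in the first 10 columns).
X, y = make_classification(n_samples=2000, n_features=210, n_informative=10,
                           n_redundant=0, random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

acc_informative = SVC().fit(X_train[:, :10], y_train).score(X_test[:, :10], y_test)
acc_all = SVC().fit(X_train, y_train).score(X_test, y_test)
print('informative only:', acc_informative, 'all features:', acc_all)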

Optimization of Neural Network input data

I'm trying to build an app to detect images which are advertisements on webpages. Once I detect those, I'll not allow them to be displayed on the client side.
Basically I'm using the back-propagation algorithm to train the neural network, using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset the number of attributes is very high. In fact, one of the mentors of the project told me that if I train the neural network with that many attributes, it will take a lot of time to get trained. So is there a way to optimize the input dataset, or do I just have to use that many attributes?
1558 is actually a modest number of features/attributes. The number of instances (3279) is also small. The problem is not on the dataset side, but on the training-algorithm side.
An ANN is slow to train, so I'd suggest you use logistic regression or an SVM. Both of them are very fast to train; SVMs in particular have a lot of fast algorithms.
In this dataset you are actually analyzing text, not images. I think a classifier from the linear family, i.e. logistic regression or an SVM, is better suited to your job.
If you are building this for production and cannot use open-source code, logistic regression is very easy to implement compared to a good ANN or SVM.
If you decide to use logistic regression or an SVM, I can further recommend some articles or source code for you to refer to.
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
You have to analyze your data to find out how to optimize it. Try to understand your input data: which (tuples of) features are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good starting point for this; see the sketch below.) Don't expect the artificial neural network to do that work for you.
Also: remember Duda & Hart's famous "no free lunch" theorem: no classification algorithm works for every problem, and for any classification algorithm X there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart & Stork's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)
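As a concrete starting point for that redundancy analysis, a minimal PCA sketch with scikit-learn; X is assumed to be the loaded 3279 x 1558 feature matrix, and the 95% explained-variance threshold is an arbitrary illustrative choice.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then keep the smallest number of principal components
# that together explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], '->', X_reduced.shape[1], 'features')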
Apply a separate ANN for each category of features, for example:
457 inputs, 1 output for the url terms (ANN1)
495 inputs, 1 output for origurl (ANN2)
...
Then train all of them and use another, main ANN to join their results (see the sketch below).
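A minimal sketch of that ensemble idea using scikit-learn MLPs as the per-category ANNs; the column slices for the two feature groups are hypothetical placeholders, X_train/y_train are assumed to hold the ad-dataset features and labels, and the "main" ANN is simply trained on the sub-models' predicted probabilities (in practice you would use held-out or cross-validated predictions to avoid leakage).

import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical column ranges for the 'url terms' and 'origurl' feature groups.
X_url, X_origurl = X_train[:, :457], X_train[:, 457:952]

ann1 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_url, y_train)
ann2 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_origurl, y_train)

# The main ANN joins the per-category outputs.
stacked = np.column_stack([ann1.predict_proba(X_url)[:, 1],
                           ann2.predict_proba(X_origurl)[:, 1]])
main_ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500).fit(stacked, y_train)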