I'm running a Naive Bayes process in RapidMiner on Fisher's Iris dataset.
My main process is as follows:
Retrieve Iris, Set Role, Validation
The Validation subprocess is as follows:
Training Set: Naive Bayes; Test Set: Apply Model, Performance
When I run the process, there are no results. I've never had such an issue with RapidMiner before, and I can't find anything about it when I Google it or search Stack Overflow.
Related
I would like to compare the ROC curves produced by various classifiers using the WEKA Knowledge Flow platform. I have a training data set and a test data set. I want to build the model using the training data set and then supply the test data set to build the ROC curves. Based on my understanding, I have created a Knowledge Flow environment.
However, I am not sure about my implementation.
I trained a Neural Network with a GA and with Backpropagation. The GA finds suitable weights for the training data but performs poorly on the test data. If I train the NN with BackPropagation, it performs much better on the test data even though the training error isn't much smaller than for the GA trained version. Even when I use the weights obtained by the GA as initial weights for Backpropagation, the NN performs worse on the test data than using only Backpropagation for training. Can anyone tell me where I could have made a mistake?
I suggest you read up on overfitting. In short, the network will be excellent on the training set but poor on the test set, because the NN ends up fitting anomalies and noise in the training data. The task of the NN is to generalize, but the GA only minimizes the error on the training set (to be fair, that is the GA's task).
There are several ways to deal with overfitting. I suggest you use a validation set. The first step is to divide your data into three sets: training, testing, and validation. The method is simple: you train your NN with the GA to minimize the error on the training set, but you also run your NN on the validation set (only run, not train). The error of the network decreases on the training set, and it should also decrease on the validation set. So if the error keeps decreasing on the training set but starts increasing on the validation set, you should stop training (but please don't stop within the first few iterations).
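A minimal sketch of that stopping rule in Python, under some assumptions: `step` and `validation_error` are hypothetical callables standing in for one GA generation (or one backpropagation epoch) and for the error measured on the validation set. The `patience` parameter is one way to avoid stopping on the first small bump.

```python
def train_with_early_stopping(step, validation_error, max_iters=1000, patience=20):
    """Call step() repeatedly and keep the model with the lowest validation error.

    step: performs one training iteration and returns the current model.
    validation_error: maps a model to its error on the validation set
                      (the validation set is only evaluated, never trained on).
    Training stops once the validation error has not improved for
    `patience` consecutive iterations.
    """
    best_model, best_err = None, float("inf")
    since_best = 0
    for _ in range(max_iters):
        model = step()
        err = validation_error(model)
        if err < best_err:
            best_model, best_err, since_best = model, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_model, best_err


# Toy stand-ins (hypothetical): the "model" is just the iteration number,
# and the validation error falls until iteration 30, then rises again.
def make_toy_trainer():
    state = {"iteration": -1}

    def step():
        state["iteration"] += 1
        return state["iteration"]

    return step


def toy_validation_error(model):
    return (model - 30) ** 2


best_model, best_err = train_with_early_stopping(
    make_toy_trainer(), toy_validation_error, patience=5
)
print(best_model, best_err)  # 30 0
```

The test set stays untouched during all of this and is only used once at the end to report the final error.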
Hope it will be helpful.
I have encountered a similar problem: the choice of the initial values of the neural network does not seem to affect the final classification accuracy. I used the feedforwardnet() function in MATLAB to compare two cases. In the first, the network is trained directly and the program assigns random initial weights and biases. In the second, suitable initial weights and biases are found with a GA, assigned to the neural network, and then training starts. However, the latter approach does not improve the classification accuracy of the neural network.
So after I have spent a few days cleaning my data, preprocessing, and experimenting with a few different models (e.g. in RStudio), how do I realistically deploy the solution?
It's straightforward if the model is a simple one, e.g. a decision tree or logistic regression, as the model is obvious and the R predictor model can be deployed onto a commercial R server with HTTP endpoints etc.
My question is: what about complex preprocessing (e.g. PCA transforms, RBF kernels) or random forests of 100 trees? Just as in the validation phase, I presume I would have to deploy R scripts to my deployment server to preprocess the data, apply PCA or the RBF preprocessing, etc.?
Does this mean that for RBF I have to host the entire original training data set alongside my SVM predictor, the RBF transform being a function of the training set, or at least of the support vectors?
And for random forest, I assume I have to upload all 500 or so trees as part of a very big model.
First, export your R solution (data pre-processing steps and the model) into the PMML data format using either the combination of the pmml and pmmlTransformations packages, or the r2pmml package. Second, deploy the PMML solution using the Openscoring REST web service, JPMML-Spark, or whatever other PMML integration fits your deployment needs.
PMML has no problem representing PCA, SVM with RBF kernel, tree ensembles, etc.
A pure vanilla R solution for this problem: most ensemble methods provide a utility to dump/save the learnt model. Learning is a very time-consuming, iterative process that should be done once. After learning, save/dump your R object. In the deployment, have only the scoring code. The scoring code does all the data transformations and then the scoring.
For normal preprocessing you can reuse the R code that was used in training. For complex processing like PCA, again save the final model and just score/run the data over the saved PCA R object. Lastly, after preprocessing, score/run your data on the learnt model and get the final results.
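The same "fit once, save, then only score" idea, sketched in Python rather than R so it stays short and self-contained. A simple standardization step stands in for the preprocessing object here, and the helper names are hypothetical. In R the analogous step would typically be `saveRDS()` on the fitted object and `readRDS()` in the scoring script. The point is that only the fitted parameters travel to the server, not the training data.

```python
import pickle
import statistics


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    cols = list(zip(*rows))
    return {
        "mean": [statistics.mean(c) for c in cols],
        # fall back to 1.0 for constant columns to avoid dividing by zero
        "std": [statistics.pstdev(c) or 1.0 for c in cols],
    }


def apply_scaler(scaler, row):
    """Transform one new row using only the stored parameters."""
    return [(x - m) / s for x, m, s in zip(row, scaler["mean"], scaler["std"])]


# --- training time (done once, offline) ---
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
artifacts = {"scaler": fit_scaler(train_rows)}
blob = pickle.dumps(artifacts)  # in practice this would be written to a file

# --- scoring time (on the deployment server) ---
# Only `blob` is needed here; the training rows are not.
loaded = pickle.loads(blob)
scored = apply_scaler(loaded["scaler"], [2.0, 40.0])
print(scored)
```

A fitted model object would be stored in the same `artifacts` dictionary, and the scoring code would run the transformed row through it.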
I am new to data mining analytics and machine learning. I have been trying to compare predictive analysis and clustering analysis in RapidMiner and Weka for my college assignment.
Just after studying the advantages and disadvantages of both tools and starting the analysis, I ran into some problems. I tried clustering using K-means (SimpleKMeans in Weka) and regression analysis using LinearRegression, and I am not quite satisfied with the results, since they differ significantly between the tools. For all of that I used the same numerical datasets.
I have spent a lot of time trying to figure this out by studying how each algorithm is initialized in each tool, since the interfaces are different and there are some parameters that exist in RapidMiner but not in Weka or vice versa, so I am a bit confused. (Is that the problem?)
Apart from that, what do you think is wrong? Is there some initialization step that I missed? Or is it because the code is different in each tool even though they use the same algorithm?
Thank you for your answer!
Weka often applies built-in normalization, at least in k-means and some other algorithms.
Make sure you have disabled this if you want to make results comparable.
Also understand that k-means is a randomized algorithm. Different results even from the same package are to be expected (and desirable).
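To make a comparison between two tools fair, both the scaling and the random initialization have to be controlled. Below is a small pure-Python sketch of plain Lloyd's k-means with an explicit seed and an optional min-max scaling step. It is not what either Weka or RapidMiner runs internally; the function names and the toy data are made up for illustration, and min-max scaling is an assumption about what "normalization" means here.

```python
import random


def min_max_scale(points):
    """Rescale every column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(x - lo) / s for x, lo, s in zip(p, lows, spans)] for p in points]


def kmeans(points, k, seed, iters=100):
    """Lloyd's algorithm; the initial centers are k points drawn using `seed`."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # assign every point to its nearest center (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


points = [[1.0, 200.0], [1.2, 220.0], [0.9, 210.0],
          [8.0, 900.0], [8.3, 950.0], [7.9, 920.0]]

labels_a, _ = kmeans(points, k=2, seed=1)
labels_b, _ = kmeans(points, k=2, seed=1)   # same seed: identical result
labels_c, _ = kmeans(points, k=2, seed=2)   # other seed: cluster ids may be swapped

scaled = min_max_scale(points)
labels_scaled, _ = kmeans(scaled, k=2, seed=1)
print(labels_a, labels_c, labels_scaled)
```

Note that the cluster numbers themselves are arbitrary, so two runs should be compared by which points end up together, not by the label values.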
Did you use WEKA itself or RapidMiner's WEKA extension? Did you try to compare the results of WEKA with those of the RM WEKA extension?
I'm trying to build an app to detect images on web pages which are advertisements. Once I detect them, I'll prevent them from being displayed on the client side.
Basically I'm using the back-propagation algorithm to train a neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But the number of attributes in that dataset is very high. In fact, one of the mentors of the project told me that if you train the neural network with that many attributes, it will take a lot of time to train. So is there a way to optimize the input dataset? Or do I just have to use that many attributes?
1558 is actually a modest number of features/attributes. The number of instances (3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
ANNs are slow to train; I'd suggest you use logistic regression or an SVM. Both are very fast to train. In particular, there are many fast training algorithms for SVMs.
In this dataset, you are actually analyzing text, not images. I think a classifier from the linear family, i.e. logistic regression or SVM, is better suited to your job.
If this is for production use and you cannot use open source code, logistic regression is very easy to implement compared to a good ANN or SVM.
If you decide to use logistic regression or SVM, I can further recommend some articles or source code for you to refer to.
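To give an idea of how little code logistic regression needs, here is a rough pure-Python sketch trained with batch gradient descent. The tiny 0/1 feature matrix is invented for illustration (it is not taken from the Internet Advertisements dataset), and the learning rate and epoch count are arbitrary choices, not tuned values.

```python
import math


def sigmoid(z):
    # written in two branches so exp() never overflows for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 (ad) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical toy data: three binary "term present" features per image,
# where only the first feature actually determines the label.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]

w, b = train_logistic_regression(X, y)
print([predict(w, b, x) for x in X])
```

For the real dataset you would add regularization and evaluate on held-out data, but the core of the method is the loop above.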
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
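The arithmetic above can be checked with a few lines of Python. The helper name is made up, and the single output neuron and the bias terms are assumptions added for completeness; the paragraph itself only counts the input-to-hidden weights.

```python
def mlp_parameter_count(layer_sizes, include_biases=True):
    """Count the trainable parameters of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out      # weights between two consecutive layers
        if include_biases:
            total += n_out         # one bias per neuron in the receiving layer
    return total


n_inputs, n_hidden, n_outputs = 1558, 10, 1   # n_outputs = 1 is an assumption
n_samples = 3279

input_to_hidden = n_inputs * n_hidden
total = mlp_parameter_count([n_inputs, n_hidden, n_outputs])

print(input_to_hidden)      # 15580
print(total)                # 15601
print(n_samples / total)    # samples available per parameter
```

Either way, there are several times more free parameters than training samples.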
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good starting point for this.) Don't expect the artificial neural network to do that work for you.
Also: remember the famous "no-free-lunch theorem" (as presented in Duda & Hart): No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart & Stork's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)
Apply a separate ANN to each category of features,
for example
457 inputs 1 output for url terms ( ANN1 )
495 inputs 1 output for origurl ( ANN2 )
...
then train all of them
use another main ANN to join results