Error trying to build a classifier with MATLAB + Weka

I'm trying to perform classification with several classifiers using Weka + MATLAB; however, some classifiers are not accepting the parameters I pass with setOptions.
Look at this test code: I don't know why, but the Logistic classifier is built properly while IBk throws an error:
%Load the csv File returning an object with the features.
wekaObj= loadCSV('C:\experimento\selecionados para o experimento\Experimento Final\dados\todos.csv');
%Create an instance of the Logistic classifier - OK
classifier1=javaObject(['weka.classifiers.','functions.Logistic']);
classifier1.setOptions('-R 1.8E-8 -M -1');
classifier1.buildClassifier(wekaObj);
%Create an instance of the K-nearest Neighbour classifier - Error
classifier2=javaObject(['weka.classifiers.','lazy.IBk']);
classifier2.setOptions('-K 10 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""');
classifier2.buildClassifier(wekaObj);
%Create an instance of the random forest classifier - Error
classifier3=javaObject(['weka.classifiers.','trees.RandomForest']);
classifier3.setOptions('-I 1200 -K 0 -S 1 -num-slots 1');
classifier3.buildClassifier(wekaObj);
%Create an instance of the MultiLayer Perceptron classifier - Error
classifier4=javaObject(['weka.classifiers.','functions.MultilayerPerceptron']);
classifier4.setOptions('-L 0.1 -M 0.1 -N 500 -V 0 -S 0 -E 20 -H a');
classifier4.buildClassifier(wekaObj);
This is the error:
Error using weka.classifiers.lazy.IBk/setOptions
Java exception occurred:
java.lang.Exception: Illegal options: -K 10 -W 0 -A
"weka.core.neighboursearch.LinearNNSearch -A "weka.core.EuclideanDistance -R
first-last""
at weka.core.Utils.checkForRemainingOptions(Utils.java:534)
at weka.classifiers.lazy.IBk.setOptions(IBk.java:715)
Has anyone here had this same problem?
Note: sorry for any typos, English is my second language.

I was able to figure out what was wrong: setOptions expects a Java String array, not a single option string, so the string has to be split first with weka.core.Utils.splitOptions. The correct implementation:
%Load the csv File returning an object with the features.
wekaObj= loadCSV('C:\experimento\selecionados para o experimento\Experimento Final\dados\todos.csv');
%Create an instance of the Logistic classifier - OK
classifier1=javaObject(['weka.classifiers.','functions.Logistic']);
classifier1.setOptions('-R 1.8E-8 -M -1');
classifier1.buildClassifier(wekaObj);
%Create an instance of the K-nearest Neighbour classifier - now OK
classifier2=javaObject(['weka.classifiers.','lazy.IBk']);
classifier2.setOptions(weka.core.Utils.splitOptions('-K 10 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""'));
classifier2.buildClassifier(wekaObj);
%Create an instance of the random forest classifier - now OK
classifier3=javaObject(['weka.classifiers.','trees.RandomForest']);
classifier3.setOptions(weka.core.Utils.splitOptions('-I 1200 -K 0 -S 1'));
classifier3.buildClassifier(wekaObj);
%Create an instance of the MultiLayer Perceptron classifier - now OK
classifier4=javaObject(['weka.classifiers.','functions.MultilayerPerceptron']);
classifier4.setOptions(weka.core.Utils.splitOptions('-L 0.1 -M 0.1 -N 500 -V 0 -S 0 -E 20 -H a'));
classifier4.buildClassifier(wekaObj);
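The reason splitOptions is needed is that each classifier's setOptions takes an array of option strings, and splitting naively on whitespace would break the quoted nested options (the LinearNNSearch specification with its escaped inner quotes must stay one token). As a rough, stdlib-only illustration of that kind of quote-aware tokenization (this is a simplified sketch, not Weka's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOptionsSketch {
    // Simplified illustration of quote-aware option splitting:
    // split on whitespace, but keep double-quoted groups together
    // and unescape \" pairs inside them.
    static String[] splitOptions(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length() && s.charAt(i + 1) == '"') {
                cur.append('"');      // unescape \" inside a quoted group
                i++;
            } else if (c == '"') {
                inQuotes = !inQuotes; // quotes delimit a token; drop them
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (cur.length() > 0) { tokens.add(cur.toString()); cur.setLength(0); }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String opts = "-K 10 -W 0 -A \"weka.core.neighboursearch.LinearNNSearch"
                + " -A \\\"weka.core.EuclideanDistance -R first-last\\\"\"";
        for (String t : splitOptions(opts)) {
            System.out.println(t);
        }
    }
}
```

Run on the IBk option string above, this yields six tokens, with the whole quoted LinearNNSearch specification (inner quotes restored) as the single value of the last -A, which is exactly the shape setOptions needs.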

Related

WEKA Command Line Parameters

I am able to run Weka from the CLI using the command below:
java -cp weka.jar weka.classifiers.functions.MultilayerPerceptron -t Dataset.arff
Weka Explorer Target Selection Parameters
How can I set the target parameters, for example "Number of time units for forecast", using the command line?
We are trying to use the command line to improve memory utilization; we have a large dataset with 10000 attributes which causes a Java heap space error every time we run it from the GUI.
Thanks for the response.
Posting the answer to my own question:
java -cp weka.jar weka.Run weka.classifiers.timeseries.WekaForecaster -W "weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 5000 -V 0 -S 0 -E 20 -H 20 " -t <dataset file> -F <FieldList> -L 1 -M 3 -prime 3 -horizon 6
You can always get more help using:
java -cp weka.jar weka.Run -h

How do I implement the Weka MLP classifier in MATLAB

Hi, I'm working with MATLAB 2014a and I need to use the MultilayerPerceptron classification algorithm from Weka 3.6 in MATLAB.
I'm having difficulty finding the right settings for the classifier:
classifier=javaObject('weka.classifiers.functions.MultilayerPerceptron');
classifier.setOptions(weka.core.Utils.splitOptions(' -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a'));
and I'm getting this error:
> Java exception occurred: java.lang.Exception: Illegal options: -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
> at weka.core.Utils.checkForRemainingOptions(Utils.java:534)
> at weka.classifiers.functions.MultilayerPerceptron.setOptions(MultilayerPerceptron.java:2376)

Mahout clustering: How to retrieve the name of a named vector

I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.
I read that you can use the option --namedVector when creating the sparse files, but where does it take the ID from, and how can I retrieve this ID after the clustering is completed?
Right now I am doing the following steps:
I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:
filename: documentID.txt
[TITLE]
[CONTENT]
I create a sparse directory with namedVectors using:
./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector
Then I can cluster the results and create a dump:
./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
The dump looks like this:
:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
Top Terms:
epa => 13.471728324890137
mountaintop => 11.364262580871582
mine => 10.942587852478027
Weight : [props - optional]: Point:
[...]
k-means in Mahout is only a toy.
You can use it for howtos and tutorials, but for real use it is too slow, too limited, too hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)
Benchmark other tools, and you'll be surprised big time.
I found a way. You can use the seqdumper to extract the cluster mapping:
./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt
Then you can use a regex to extract the mapping of vector IDs to cluster IDs.
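That regex step can be sketched as follows. Note that the line format used here is only an assumption based on typical seqdumper output (something like `Key: <clusterId>: Value: ... vec: <vectorName> = [...]`); the exact layout depends on your Mahout version, so adjust the pattern to what your cluster-points.txt actually contains:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterPointParser {
    // Hypothetical line format -- verify against your own dump file.
    // Example line: "Key: 190: Value: wt: 1.0 distance: 2.3 vec: doc42.txt = [1:3.407, ...]"
    static final Pattern LINE = Pattern.compile("Key:\\s*(\\d+):.*vec:\\s*(\\S+)\\s*=");

    // Scans dump lines and maps each named vector (document ID) to its cluster ID.
    static Map<String, String> mapDocsToClusters(Iterable<String> lines) {
        Map<String, String> docToCluster = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                docToCluster.put(m.group(2), m.group(1)); // vector name -> cluster id
            }
        }
        return docToCluster;
    }

    public static void main(String[] args) {
        Map<String, String> m = mapDocsToClusters(Arrays.asList(
                "Key: 190: Value: wt: 1.0 distance: 2.3 vec: doc42.txt = [1:3.407]"));
        System.out.println(m);
    }
}
```

Since the --namedVector names come from the filenames fed to seqdirectory, the keys of the resulting map are your documentID.txt names.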

Kmeans clustering using mahout

I am trying to run the k-means algorithm on my data using Mahout. The -c option that has to be passed when running requires a path to initial clusters. Can anyone tell me how we can have initial clusters even before starting the algorithm?
bin/mahout kmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
-k <optional number of initial clusters to sample from input vectors> \
-dm <DistanceMeasure> \
-x <maximum number of iterations> \
-cd <optional convergence delta. Default is 0.5> \
-ow <overwrite output directory if present>
-cl <run input vector clustering after computing Canopies>
-xm <execution method: sequential or mapreduce>
A) Mahout is slooooow. If your data fits into main memory, use other tools such as ELKI. They outperformed Mahout for me by far. If your data doesn't fit into main memory: are you sure k-means makes any sense on your data anyway? There is no point in doing a computation that doesn't solve your problem. Start with a sample to first check if it works at all, then scale up. Mahout is a last resort choice: if you absolutely need this to be computed on all your data, and everything else failed, then use Mahout.
B) Read all the documentation... the next line in the Mahout k-means documentation says:
Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.
In other words: if you know the initial cluster centers, supply them via -c and do not set -k. Otherwise, an empty -c folder is fine as long as you provide -k, the number of cluster centers to sample.

font_properties error while training tesseract

While training Tesseract, I encountered an error: "Failed to load font_properties from font_properties". I am running the command:
shapeclustering -F font_properties -U unicharset pristina.tr
My font_properties file is something like: pristina 0 1 0 0 0
I am taking help from this blog.
You need to follow the Tesseract filename convention for image and box files: [lang].[fontname].exp[num]
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3