Weka LibSVM one-class classifier always predicts one class - classification

I'm trying to use the LibSVM classifier in Weka to build a one-class SVM classifier.
My training file contains a list of nouns.
My test file contains many words. My aim is to use the classifier to predict which words in the test file are nouns.
My training ARFF file (ip.arff) looks like this:
@relation test1
@attribute name string
@attribute class {yes}
@data
'building',yes
'car',yes
..... and so on
My test file (test.arff) looks like this:
@relation test2
@attribute name string
@attribute class {yes}
@data
'car',?
'window',?
'running',?
..... and so on
Here's what I've done:
Since the datatype is string, I used batch filtering on both input files to generate ipstd.arff and teststd.arff, as described in
http://weka.wikispaces.com/Batch+filtering
Next I load ipstd.arff and train the classifier. (Note: all the words are classified as yes.)
Next I load the test set teststd.arff and re-evaluate the model.
But all the words are classified as nouns ('yes'):
=== Predictions on user test set ===
inst# actual predicted error prediction
1 1:? 1:yes 1
2 1:? 1:yes 1
3 1:? 1:yes 1
and so on
My problem is that all the words in the test file (teststd.arff) are classified as nouns.
Can someone tell me where I'm going wrong?
What should I do to classify the nouns in the test set as 'yes' and the other words as outliers?
Thanks...

Related

What are missing attributes as defined in the hdf5 specification and metadata in group h5md?

I have an HDF5 file (Data File) containing molecular dynamics simulation data. For quick inspection, the h5ls tool is handy. For example:
h5ls -d xaa.h5/particles/lipids/positions/time | less
Now my question is based on a comment I received on the data format: which attributes are missing according to the HDF5 specification and the metadata in the group?
Are you trying to get the value of the Time attribute from a dataset? If so, you need to use h5dump, not h5ls. And, the attributes are attached to each dataset, so you have to include the dataset name on the path. Finally, attribute names are case sensitive; Time != time. Here is the required command for dataset_0000 (repeat for 0001 thru 0074):
h5dump -d /particles/lipids/positions/dataset_0000/Time xaa.h5
You can also get attributes with Python code. Simple example below:
import h5py

with h5py.File('xaa.h5', 'r') as h5f:
    for ds, h5obj in h5f['/particles/lipids/positions'].items():
        print(f'For dataset={ds}; Time={h5obj.attrs["Time"]}')

Tensorflow 0.8 Import and Export output tensors problems

I am using Tensorflow 0.8 with Python 3. I am training a neural network, and the goal is to automatically export/import network states every 50 iterations. The problem is that when I export the output tensors at the first iteration, their names are ['Neg:0', 'Slice:0'], but when I export them at the second iteration, the names have changed to ['import/Neg:0', 'import/Slice:0'], and importing these output tensors then fails:
ValueError: Specified colocation to an op that does not exist during import: import/Variable in import/Variable/read
I wonder if anyone has ideas on this problem. Thanks!!!
That's how tf.import_graph_def works.
If you don't want the prefix, just set the name parameter to the empty string, as shown in the following example.
# import the model into the current graph
with tf.Graph().as_default() as graph:
    const_graph_def = tf.GraphDef()
    with open(TRAINED_MODEL_FILENAME, 'rb') as saved_graph:
        const_graph_def.ParseFromString(saved_graph.read())
    # replace the current graph with the saved graph def (and content)
    # name="" is important because otherwise (with name=None)
    # the graph definitions will be prefixed with "import",
    # e.g. the defined operation FC2/unscaled_logits:0
    # would become import/FC2/unscaled_logits:0
    tf.import_graph_def(const_graph_def, name="")
    [...]

Do Weka's setClassIndex and setAttributeIndices start attribute indexing from different ranges?

I am using WEKA for classification. I am using two functions, setClassIndex and setAttributeIndices. My dataset has two attributes: the class and one more attribute. Here are some instances from my dataset:
@relation sms_test
@attribute spamclass {spam,ham}
@attribute mes string
@data
ham,'Go until jurong point'
ham,'Ok lar...'
spam,'Free entry in 2 a wkly'
Following is part of my code.
trainData.setClassIndex(0);
filter = new StringToWordVector();
filter.setAttributeIndices("2");
This code runs fine. But when I use trainData.setClassIndex(1) or filter.setAttributeIndices("1"), my code stops working. Does the setClassIndex function take an argument starting from 0 while setAttributeIndices takes one starting from 1? How do I identify which Weka functions start counting from 0 and which from 1?
Do setClassIndex function take argument starting from 0
Yes, indices start from 0.
and setAttributeIndices takes argument starting from 1?
Yes, indices start from 1.
Source: http://weka.sourceforge.net/doc.stable
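To make the two conventions concrete, here is a small, self-contained Java sketch (illustrative only, not Weka API code) of how a 1-based range token, like the ones setAttributeIndices accepts, maps onto the 0-based index that setClassIndex expects. Weka's own weka.core.Range class performs this translation internally.

```java
public class IndexConvention {
    // Translate a single 1-based token (as used by setAttributeIndices,
    // e.g. "2", "first", "last") into the 0-based index that
    // setClassIndex and programmatic attribute access use.
    static int toZeroBased(String token, int numAttributes) {
        if (token.equals("first")) return 0;
        if (token.equals("last")) return numAttributes - 1;
        return Integer.parseInt(token) - 1; // "2" -> index 1
    }

    public static void main(String[] args) {
        int numAttributes = 2; // spamclass + mes, as in the dataset above
        // setAttributeIndices("2") targets the same attribute
        // that setClassIndex(1) would
        System.out.println(toZeroBased("2", numAttributes));     // prints 1
        System.out.println(toZeroBased("first", numAttributes)); // prints 0
        System.out.println(toZeroBased("last", numAttributes));  // prints 1
    }
}
```

So in the code above, setClassIndex(0) and setAttributeIndices("1") both refer to the spamclass attribute; mixing the two conventions is what makes the program fail.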

How to incorporate Weka Naive Bayes model into Java Code

I trained a model on a training set using the Naive Bayes classifier in Weka. The resulting model is shown below. My questions are:
a. Is it possible to incorporate the model into my Java code?
b. If so, how can I do that?
c. If not, what should I do?
Thanks.
=== Classifier model (full training set) ===
Naive Bayes (simple)
Class A: P(C) = 0.42074928
Attribute mcv
'All'
1
Attribute alkphos
'All'
1
Attribute sgpt
'All'
1
Attribute sgot
'All'
1
Attribute gammagt
'(-inf-20.5]' '(20.5-inf)'
0.54421769 0.45578231
Attribute drinks
'All'
1
Class B: P(C) = 0.57925072
Attribute mcv
'All'
1
Attribute alkphos
'All'
1
Attribute sgpt
'All'
1
Attribute sgot
'All'
1
Attribute gammagt
'(-inf-20.5]' '(20.5-inf)'
0.30693069 0.69306931
Attribute drinks
'All'
1
Time taken to build model: 0 seconds
It is possible to build your naive Bayes model in Java using Weka. Once built, you can use this model to predict the outcome of test instances, also using Weka.
A good source to begin using Weka in your Java code is here, and a more advanced tool is the Weka API here.
Provided you are able to load your training instances (called "train" below) and testing instances (called "test" below), the naive Bayes model can be built and used as follows:
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;

// build the model on the training set
NaiveBayes model = new NaiveBayes();
model.buildClassifier(train);

// evaluate the model on the test set
Evaluation eval_train = new Evaluation(test);
eval_train.evaluateModel(model, test);
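As a sanity check on what the printed model encodes, the posterior for one instance can be computed by hand from the numbers above. Only gammagt has a non-trivial discretization; every other attribute collapses to a single 'All' bin with probability 1, so it contributes nothing. This is plain arithmetic over the printed probabilities, not Weka API code:

```java
public class NaiveBayesByHand {
    public static void main(String[] args) {
        // class priors from the printed model
        double priorA = 0.42074928, priorB = 0.57925072;
        // P(gammagt in '(-inf-20.5]' | class), first bin of the split
        double likeA = 0.54421769, likeB = 0.30693069;

        // naive Bayes: unnormalized posterior = prior * likelihood
        double scoreA = priorA * likeA;
        double scoreB = priorB * likeB;
        double posteriorA = scoreA / (scoreA + scoreB);

        System.out.printf("P(A | gammagt <= 20.5) = %.3f%n", posteriorA);
    }
}
```

The posterior for class A comes out above 0.5, so an instance with gammagt <= 20.5 is predicted as class A despite A having the smaller prior.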

Weka, SimpleKMeans cannot handle string attributes

I am using Weka in Scala (although the syntax is virtually identical to Java). I am trying to evaluate my data with a SimpleKMeans clusterer, but the clusterer won't accept string data. I don't want to cluster on the string data; I just want to use it to label the points.
Here is the data I am using:
@relation Locations
@attribute ID string
@attribute Latitude numeric
@attribute Longitude numeric
@data
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
As you can see, it's essentially a collection of points on an x and y coordinate plane. The value of any patterns is negligible; this is simply an exercise in working with Weka.
Here is the code that is giving me trouble:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
I get the following error on simpleKMeans.buildClusterer(instance):
[UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!]
How do I get Weka to retain IDs while doing clustering?
Here are a couple of other steps I have taken to troubleshoot this:
I used the Weka Explorer and loaded this data as a CSV:
ID, Latitude, Longitude
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
This does what I want it to do in the Weka Explorer. Weka clusters the points and retains the ID column to identify each point. I would do this in my code, but I'm trying to do this without generating additional files. As you can see from the Weka Java API, Instances interprets a java.io.Reader only as an ARFF.
I have also tried the following code:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
instance.deleteAttributeAt(0)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
This works in my code, and displays results. That proves that Weka is working in general, but since I am deleting the ID attribute, I can't really map the clustered points back on the original values.
I am answering my own question, and in doing so, there are two issues that I would like to address:
Why CSV works with string values
How to get cluster information from the cluster evaluation
As Sentry points out in the comments, the ID does in fact get converted to a nominal attribute when loaded from a CSV.
If the data must be in an ARFF format (like in my example where the Instances object is created from a StringReader), then the StringToNominal filter can be applied:
val instances = new Instances(new StringReader(wekaHeader + wekaData))
val filter = new StringToNominal()
filter.setAttributeRange("first")
filter.setInputFormat(instances)
val filteredInstances = Filter.useFilter(instances, filter)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(filteredInstances)
...
This allows for "string" values to be used in clustering, although it's really just treated as a nominal value. It doesn't impact the clustering (if the ID is unique), but it doesn't contribute to the evaluation as I had hoped, which brings me to the next issue.
I was hoping to be able to get a nice map of cluster and data, like cluster: Int -> Array[(ID, latitude, longitude)] or ID -> cluster: Int. However, the cluster results are not that convenient. In my experience these past few days, there are two approaches that can be used to find the cluster of each point of data.
To get the cluster assignments, simpleKMeans.getAssignments returns an array of integers that is the cluster assignments for each data element. The array of integers is in the same order as the original data items and has to be manually related back to the original data items. This can be easily accomplished in Scala by using the zip method on the original list of data items and then using other methods like groupBy or map to get the collection in your favorite format. Keep in mind that this method alone does not use the ID attribute at all, and the ID attribute could be omitted from the data points entirely.
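For readers working in Java rather than Scala, the zip/groupBy step above amounts to pairing the assignments array with the ID list by position. A minimal sketch follows; the IDs and assignment values are made up for illustration, and in real code the array would come from simpleKMeans.getAssignments():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AssignmentsToIds {
    // Group IDs by cluster: assignments[i] is the cluster of the i-th
    // instance, in the same order as the original data.
    static Map<Integer, List<String>> byCluster(String[] ids, int[] assignments) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (int i = 0; i < ids.length; i++) {
            clusters.computeIfAbsent(assignments[i], k -> new ArrayList<>())
                    .add(ids[i]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        String[] ids = {"CMU", "Stanford", "MIT", "Berkeley"};
        int[] assignments = {0, 1, 0, 1}; // hypothetical clusterer output
        System.out.println(byCluster(ids, assignments));
        // {0=[CMU, MIT], 1=[Stanford, Berkeley]}
    }
}
```

This is the same idea as ids.zip(assignments).groupBy(_._2) in Scala: the ID attribute never needs to be passed to the clusterer at all.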
However, you can also get the cluster centers with simpleKMeans.getClusterCentroids or eval.clusterResultsToString(). I have not used this very much, but it does seem to me that the ID attribute can be recovered from the cluster centers here. As far as I can tell, this is the only situation in which the ID data can be utilized or recovered from the cluster evaluation.
I got the same error from a string value on one of the lines of a CSV file with a couple of million rows. Here is how I figured out which line had the string value.
The exception "Cannot handle string attributes!" doesn't give any clue about the line number. Hence:
I imported the CSV file into the Weka Explorer GUI and created a *.arff file.
Then I manually changed the type from string to numeric at the beginning of the *.arff file, as shown below.
After that I tried to build the cluster using the *.arff file.
I got the exact line number as part of the exception.
I removed that line from the *.arff file and loaded it again. It worked without any issue.
Converted string --> numeric in the *.arff file:
#attribute total numeric
#attribute avgDailyMB numeric
#attribute mccMncCount numeric
#attribute operatorCount numeric
#attribute authSuccessRate numeric
#attribute totalMonthlyRequets numeric
#attribute tokenCount numeric
#attribute osVersionCount numeric
#attribute totalAuthUserIds numeric
#attribute makeCount numeric
#attribute modelCount numeric
#attribute maxDailyRequests numeric
#attribute avgDailyRequests numeric
The error reported the exact line number:
java.io.IOException: number expected, read Token[value.total], line 1750464
at weka.core.converters.ArffLoader$ArffReader.errorMessage(ArffLoader.java:354)
at weka.core.converters.ArffLoader$ArffReader.getInstanceFull(ArffLoader.java:728)
at weka.core.converters.ArffLoader$ArffReader.getInstance(ArffLoader.java:545)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:514)
at weka.core.converters.ArffLoader$ArffReader.readInstance(ArffLoader.java:500)
at weka.core.Instances.<init>(Instances.java:138)
at com.lokendra.dissertation.ModelingUtils.kMeans(ModelingUtils.java:50)
at com.lokendra.dissertation.ModelingUtils.main(ModelingUtils.java:28)