I'm using Weka to develop a classifier for a medical problem. This dataset has a class imbalance situation and I want to know if there is also a problem of class overlapping. Each record has 30 attributes, how can I discover if there is class overlapping using Weka features?
Class overlapping happens when some samples from different classes have very similar characteristics.
Cluster your data set.
If instances belong to the same cluster, then they are very similar.
Then find the error rate using the actual class membership.
If instances belong to the same cluster but their classes are different, then you have found what you are asking about.
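As a minimal sketch, this is roughly how such a classes-to-clusters check could look with the Weka Java API (the file name medical.arff and the choice of SimpleKMeans are just assumptions for illustration):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class OverlapCheck {
    public static void main(String[] args) throws Exception {
        // hypothetical input file; replace with your own ARFF/CSV
        Instances data = DataSource.read("medical.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // the clusterer must not see the class attribute, so remove it for training
        Remove rm = new Remove();
        rm.setAttributeIndices("" + (data.classIndex() + 1));
        rm.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(data.numClasses());  // one cluster per class
        km.buildClusterer(clusterData);

        // classes-to-clusters evaluation: the confusion between clusters and
        // actual classes hints at how strongly the classes overlap
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);  // this copy still contains the class attribute
        System.out.println(eval.clusterResultsToString());
    }
}

The same check is available interactively in the Explorer's Cluster tab via the "Classes to clusters evaluation" option.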
To address the class imbalance problem you can use SMOTE; it is available in Weka as a supervised instance filter. But can you explain what you mean by class overlapping?
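Regarding SMOTE, here is a minimal sketch of applying that filter from Java code (in recent Weka versions SMOTE is a separate package installed via the package manager; the file name and the percentage/neighbour settings below are just illustrative assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE;

public class BalanceWithSmote {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("medical.arff");  // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        SMOTE smote = new SMOTE();
        smote.setNearestNeighbors(5);    // k used when generating synthetic instances
        smote.setPercentage(100.0);      // create 100% more minority-class instances
        smote.setInputFormat(data);

        Instances balanced = Filter.useFilter(data, smote);
        System.out.println("before: " + data.numInstances()
            + ", after: " + balanced.numInstances());
    }
}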
By 'class overlapping' I think you mean that there exist similar instances that belong to different classes. Simply, you can remove them; for exact duplicate lines, you could do the following in awk (blank lines are kept, repeated lines are dropped):
awk '!NF || !seen[$0]++' inputFile > outputFile
I am currently looking at a text classification problem (say N classes), for which labeled training data exists. Now, occasionally, a new class is created and some of the labels in the "old" training data become wrong because they should now have the new class label. So the new class recruits from the old classes.
We can assume that we have some new labeled data for the new class, or even that from an input stream of new data we eventually obtain the correct labels by human verification (the goal, however, is to require as few manual corrections as possible).
How to set up a classifier that may face new "recruiting" classes from time to time? Are you aware of approaches/literature for the specific setting described above?
Perhaps basic strategies may include:
relabeling the training data and re-training,
using incremental classifiers (e.g., KNN) - see the sketch below.
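A rough sketch of what this could look like with Weka, using IBk as the incremental KNN (the file name train.arff, the label "new_class" and the helper that extends the class attribute are all hypothetical; it assumes nominal/numeric attributes):

import java.util.ArrayList;
import java.util.List;

import weka.classifiers.lazy.IBk;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RecruitingClassSketch {

    // Rebuild the header so the class attribute also contains newLabel; the old
    // label indices are preserved because newLabel is appended at the end.
    static Instances extendClassAttribute(Instances old, String newLabel) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < old.classAttribute().numValues(); i++) {
            labels.add(old.classAttribute().value(i));
        }
        labels.add(newLabel);

        ArrayList<Attribute> attrs = new ArrayList<>();
        for (int i = 0; i < old.numAttributes(); i++) {
            attrs.add(i == old.classIndex()
                ? new Attribute(old.classAttribute().name(), labels)
                : (Attribute) old.attribute(i).copy());
        }
        Instances extended = new Instances(old.relationName(), attrs, old.numInstances());
        extended.setClassIndex(old.classIndex());
        for (Instance inst : old) {
            extended.add(new DenseInstance(inst.weight(), inst.toDoubleArray()));
        }
        return extended;
    }

    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");   // hypothetical file
        train.setClassIndex(train.numAttributes() - 1);

        IBk knn = new IBk(5);
        knn.buildClassifier(train);

        // A new class is created: extend the header, apply the human-verified
        // relabellings to the affected old instances, and rebuild. Because IBk
        // is an UpdateableClassifier, newly verified instances can afterwards
        // be folded in one by one with knn.updateClassifier(inst).
        Instances extended = extendClassAttribute(train, "new_class");
        // ... relabel the instances that now belong to "new_class" here ...
        knn.buildClassifier(extended);
    }
}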
I have several different .weight files that were output during training. The reason I did this is that I noticed the model trained better with fewer classes than if I combined all 35 together. Would it be possible to loop through the code and have multiple model.load_weights() calls?
Any help is appreciated!
Thanks
I don't see the code, but I can say that you can try to create multiple instances of the model class, each with its own weights and config, and then run each of them in the way you want.
I am really, really, new to Apache Spark.
I am working on implementing Approximate LOCI (or ALOCI), an anomaly detection algorithm, on a distributed way over Spark. This algorithm is based on storing points in a QuadTree that is used to find a point's number of neighbors.
I know exactly how QuadTrees work. In fact, I have implemented such a structure in Java recently. But I am completely lost as far as it concerns the way that such a structure can work in a distributed way over Spark.
Something similar to what I need can be found in Geospark.
https://github.com/DataSystemsLab/GeoSpark/tree/b2b6f1d7f0015d5c9d663a7b28d5e1bb1043c413/core/src/main/java/org/datasyslab/geospark/spatialPartitioning/quadtree
GeoSpark in many cases uses a PointRDD class that extends a SpatialRDD class, which, as far as I can see, uses the QuadTree found in the link above to partition the spatial objects. That is what I understood, at least in theory.
In practice, I still cannot figure this out. Let's say for example that I have millions of records in a csv and I want to read and load them in a QuadTree.
I could read the csv into an RDD, but then what? How does this RDD logically connect to the QuadTree I am trying to build?
Of course, I don't expect a working solution here. I just need the logic here to fill the gap in my mind. How do I implement a distributed QuadTree and how do I use it?
Ok, sadly there are no answers to this, but here I am two weeks later with a working solution. Not 100% sure if it is the right approach here, though.
I created a class named Element and turned each line of my csv into an RDD[Element]. I then created a serializable class named QuadNode which has a List[Element] and an Array[String] of size 4. When elements are added to a node, they are added to the node's List. If the list gets more than X elements (20 in my case), the node breaks into 4 children and the elements are sent to the children. Finally, I created a class QuadTree which has an RDD[QuadNode] among its other properties. Every time a node breaks into children, these child nodes are added to the tree's RDD.
In a non-distributed implementation, each node would have 4 pointers, one for each child. Since we are in a distributed environment, this approach could not work. So I gave each node a unique id. The root node has id "0". The root's children have ids "00", "01", "02" and "03". Node "00"'s children have ids "000", "001", "002" and "003". This way, if we want to find all the descendants of a node, we filter our tree's RDD[QuadNode] by checking whether the nodes' ids start with our node's id. Reversing this logic helps us find a node's parent node (see the sketch below).
This is how I implemented my QuadTree, at least for now. If someone knows a better way of implementing this I would love to hear his/her opinion.
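To make the id scheme concrete, here is a minimal, hypothetical sketch in Java with the Spark Java API (the class names and the toy node list are made up for illustration, not taken from my actual code):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class QuadTreeIdSketch {

    // A node is addressed purely by its id: child q of node "00" gets id "00" + q.
    public static class QuadNode implements Serializable {
        public final String id;
        public QuadNode(String id) { this.id = id; }
        public String childId(int quadrant) { return id + quadrant; }  // quadrant in 0..3
        public String parentId() {
            return id.length() <= 1 ? null : id.substring(0, id.length() - 1);
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("quadtree-id-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Pretend these nodes were produced by repeatedly splitting full nodes:
            // root "0", its children "00".."03", and the children of "00".
            List<QuadNode> nodes = new ArrayList<>();
            for (String id : Arrays.asList("0", "00", "01", "02", "03",
                                           "000", "001", "002", "003")) {
                nodes.add(new QuadNode(id));
            }
            JavaRDD<QuadNode> tree = sc.parallelize(nodes);

            // Descendants of node "00": every node whose id starts with "00", excluding itself.
            long descendants = tree
                .filter(n -> n.id.startsWith("00") && !n.id.equals("00"))
                .count();
            System.out.println("descendants of \"00\": " + descendants);  // 4
        }
    }
}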
I am working on a project in which I have about 18 classes with about 4,000 total instances. I have 7 attributes, 1 being string data and the rest nominal. I am currently using StringToWordVector on the string attribute with Platt's SMO classifier, achieving good results. We are about to implement this, but I would like to try other classifiers in case there may be one I could get better results from. Any suggestions?
Also, should I be using MultiClassClassifier with so many classes? If so, what settings should I try within that?
Any advice is appreciated!
It has been well established in our division that an AdaBoosted J48 decision tree yields the best results.
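If it helps, a minimal sketch of trying that combination with the Weka Java API (the file name is a placeholder, and the StringToWordVector defaults are left untouched):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BoostedJ48Text {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("project.arff");  // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        // AdaBoost with J48 as the base learner
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new J48());

        // Apply StringToWordVector to the string attribute inside the classifier,
        // so cross-validation rebuilds the word dictionary on each training fold.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(boost);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());  // 18x18 confusion matrix
    }
}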
Is there any method to use in biojava to predict secondary structure from a given sequence?
Or, if not, does anyone know how I can implement it? Any source code? Any executable to recommend?
I think it is not available in BioJava, but what you can do is run a BLAST search using the BioJava libraries (http://biojava.org/wiki/BioJava:CookBook3:NCBIQBlastService) or the RCSB web page, and from your BLAST search you can find a protein with a 3D structure. Afterwards you can use the suggestDomains(Structure s) method of the LocalProteinDomainParser class in the org.biojava.bio.structure.domain package. Domains might give you some ideas about secondary structures.
I don't think there is secondary structure prediction from sequence in BioJava. However, there is secondary structure assignment given the structure, based on the DSSP algorithm; see https://github.com/biojava/biojava-tutorial/blob/master/structure/secstruc.md
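As a rough sketch of that assignment path, based on the tutorial linked above (the class names are from recent BioJava versions and may differ in older releases; the PDB id 4HHB is just an example):

import java.util.List;

import org.biojava.nbio.structure.Structure;
import org.biojava.nbio.structure.StructureIO;
import org.biojava.nbio.structure.secstruc.SecStrucCalc;
import org.biojava.nbio.structure.secstruc.SecStrucInfo;
import org.biojava.nbio.structure.secstruc.SecStrucTools;

public class SecStrucAssignment {
    public static void main(String[] args) throws Exception {
        // Fetch a structure by PDB id (4HHB is just an example)
        Structure s = StructureIO.getStructure("4HHB");

        // DSSP-style secondary structure assignment from the 3D coordinates
        SecStrucCalc calc = new SecStrucCalc();
        calc.calculate(s, true);  // true: store the assignment in the structure's groups

        // Print the assigned secondary structure element for each residue
        List<SecStrucInfo> assignments = SecStrucTools.getSecStrucInfo(s);
        for (SecStrucInfo info : assignments) {
            System.out.println(info.getGroup().getResidueNumber() + " " + info.getType());
        }
    }
}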