ELKI clustering FDBSCAN algorithm

Could you please show me an example of an input file for FDBSCAN in ELKI? I got an error like this:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: UncertainObject,field
Available types: DBID DoubleVector,dim=2
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.clustering.uncertain.FDBSCANNeighborPredicate.instantiate(FDBSCANNeighborPredicate.java:131)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:122)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:79)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]

FDBSCAN requires data of the type UncertainObject, i.e. objects with uncertainty information.
If you simply load a CSV file, the data will be certain, and you cannot use uncertain clustering.
There are several ways of modeling uncertainty. These are implemented as filters in the typeconversions package:
UncertainSplitFilter can split a vector of length k*N into k possible instances, each of length N with uniform weight (see the sketch after this list).
WeightedUncertainSplitFilter is similar, but every instance can also have an associated weight.
UncertainifyFilter can simulate uncertainty by e.g. assuming a Gaussian or Uniform distribution around the original vector.
UniformUncertainifier (the U-Model, see Javadoc of UniformContinuousUncertainObject)
SimpleGaussianUncertainifier (see Javadoc of SimpleGaussianContinuousUncertainObject)
UnweightedDiscreteUncertainifier (BID Model, see Javadoc of WeightedDiscreteUncertainObject)
WeightedDiscreteUncertainifier (as above)
or add your own uncertainty information by extending the API!
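To make the split-filter layout concrete, here is a minimal sketch (plain Python, not ELKI code) of the row format that UncertainSplitFilter expects; the dimensionality and sample count are assumptions chosen for illustration only:

    # One CSV row for an uncertain 2-d object (N=2) sampled k=3 times: 3*2 = 6 numbers per row.
    row = [1.0, 2.0, 1.1, 2.1, 0.9, 1.8]
    k, n = 3, 2
    samples = [row[i * n:(i + 1) * n] for i in range(k)]
    print(samples)  # [[1.0, 2.0], [1.1, 2.1], [0.9, 1.8]] -- each sample gets uniform weight 1/k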

Is it possible to use graph-on-parent with [clone]d abstractions in some way?

I know you can open an abstraction with the vis message, but I haven't found a way to present my abstractions in the patch containing the clone object. Perhaps dynamic patching is the only way to achieve this? I have searched the pd forum, mailing list and Facebook group without success.
Currently (as of Pd 0.48-1) there is no way to make [clone] display the GOP of its contents.
As a workaround, you can encapsulate the [clone] object in an abstraction that provides a GUI displaying information about the selected cloned instance.
For example, let's say you have an object called [HarmonicSeries] that, given a fundamental frequency, uses a [clone] object to create 8 instances of [Harmonic], each one containing an [osc~] at the desired frequency. And you want to display the frequency of each harmonic. Instead of using GOP on [Harmonic], you use GOP on [HarmonicSeries] and provide an interface to select the desired harmonic and read its information.
The [Harmonic] abstraction expects two parameters:
The fundamental frequency
The index of the harmonic
It multiplies the two to get the harmonic's frequency (e.g. harmonic 3 of a 110 Hz fundamental gives 330 Hz) and stores it in a [float]. When it receives a bang, it outputs that frequency through its left outlet.
Let's [clone] it and wrap it into the [HarmonicSeries] abstraction.
When the user clicks the [hradio] to select the desired harmonic, it sends a bang to the corresponding [Harmonic] instance, which in turn sends its stored frequency to its outlet. The harmonic's index and frequency are then displayed in number boxes.
Here's an example of it working (in the [HarmonicSeries-help] object)
This is a simple example, but the principle is the same in more complex cases: you encapsulate the [clone] into an abstraction that provides an interface for reading data from the cloned instances.
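For readers who prefer code to patches, here is a rough Python model of the same data flow (this is only an illustration of the idea, not Pd code; the names are made up):

    # Each cloned [Harmonic] stores fundamental * index; the wrapper exposes one
    # selected instance at a time, like the [hradio] interface described above.
    class Harmonic:
        def __init__(self, fundamental, index):
            self.index = index
            self.freq = fundamental * index      # what the [float] inside [Harmonic] holds

    class HarmonicSeries:
        def __init__(self, fundamental, n=8):
            self.harmonics = [Harmonic(fundamental, i) for i in range(1, n + 1)]

        def select(self, index):                 # what clicking the [hradio] triggers
            h = self.harmonics[index - 1]
            return h.index, h.freq               # what the number boxes display

    print(HarmonicSeries(110).select(3))         # -> (3, 330)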

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify the pages/blocks as either English (class 1), Hindi (class 2), or Mixed using LIBSVM in MATLAB. The problem is that the training data I have consists only of samples corresponding to Hindi and English pages/blocks, with no mixed pages.
The test data I want to classify may also contain mixed pages/blocks, and in that case I want them to be classified as "Mixed". I am planning to do this using confidence scores or probability values: if the probability of class 1 is greater than a threshold (say 0.8) and the probability of class 2 is less than a threshold (say 0.05), the block is classified as class 1, and vice versa for class 2. If neither of those two conditions is satisfied, I want to classify it as "Mixed".
The third return value from "libsvmpredict" is prob_values, and I was planning to use these prob_values to decide whether the test data is Hindi, English or Mixed. But in a few places I read that "libsvmpredict" does not produce the actual probability values.
Is there any way to classify the test data into 3 classes (Hindi, English, Mixed) using training data that contains only 2 classes in SVM?
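For reference, the decision rule described in the question amounts to something like the following sketch (plain Python with hypothetical names; the two probabilities would come from the classifier's output):

    def classify_block(prob_english, prob_hindi, hi=0.8, lo=0.05):
        if prob_english > hi and prob_hindi < lo:
            return "English"   # class 1
        if prob_hindi > hi and prob_english < lo:
            return "Hindi"     # class 2
        return "Mixed"

    print(classify_block(0.92, 0.03))  # -> English
    print(classify_block(0.55, 0.45))  # -> Mixed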
This is not the modus operandi for SVMs.
There is no way an SVM can predict a class it has never seen, i.e. without knowing how to separate that class from all the other classes.
The function svmpredict() in LIBSVM does return probability estimates, and the greater this value is, the more confident you can be about your prediction. But you cannot rely on such values to predict a third class when you trained on just two classes: svmpredict() will return only as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based), but it will most likely fail or perform poorly. Think about it: you have to set up two thresholds and combine them with a logical AND, so the chance of correctly classifying non-Mixed documents will drastically decrease.
My suggestion is: instead of wasting time tuning thresholds, with a high chance of poor performance, join some of these texts together or create new files containing some Hindi and some English lines, in order to add proper Mixed documents to your training data, and then run a standard 3-class SVM.
To create such files you can also use MATLAB, which has pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on.
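A minimal sketch of that suggestion in Python (the answer mentions MATLAB, but the idea is the same; the file names here are hypothetical):

    import itertools

    # Interleave lines from a Hindi block and an English block to build one "Mixed" training document.
    with open("hindi_block.txt", encoding="utf-8") as hi, \
         open("english_block.txt", encoding="utf-8") as en, \
         open("mixed_block.txt", "w", encoding="utf-8") as out:
        for hindi_line, english_line in itertools.zip_longest(hi, en, fillvalue=""):
            out.write(hindi_line)
            out.write(english_line)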

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use access-log data from a hospital, describing users accessing patient information, and try to detect abnormal access behaviour. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip code) and a patient (e.g. age, sex, zip code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data and let each category of a variable be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
Using a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more, and whether the proximity of data points still indicates similar behaviour. Does expanding each category into a dummy column with true/false values help? What would the distance be then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and looking at just two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and use the quartiles and the inter-quartile range to define outliers. For categorical variables, for instance employee role and the types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem of using quartiles with data that is not normally distributed? And what would be the remedy for this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
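For what it's worth, the quartile-based slicing described in the question could be sketched like this (plain Python/pandas; the file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("access_log.csv")  # assumed columns: employee_role, patient_age, ...

    def iqr_outliers(group):
        # Flag ages outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] within one employee role.
        q1, q3 = group["patient_age"].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (group["patient_age"] < q1 - 1.5 * iqr) | (group["patient_age"] > q3 + 1.5 * iqr)
        return group[mask]

    flagged = df.groupby("employee_role", group_keys=False).apply(iqr_outliers)
    print(flagged.head())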
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
You can use the k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a python implementation.
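A minimal sketch of both suggestions in Python (the kmodes package is assumed for k-prototypes; the toy data and column roles are made up for illustration):

    import numpy as np
    from kmodes.kprototypes import KPrototypes  # assumed installed: pip install kmodes

    # Toy mixed records: [age, role_id, dept_id]; columns 1 and 2 are categorical codes.
    X = np.array([[25, 1, 4],
                  [31, 1, 4],
                  [58, 2, 7],
                  [62, 2, 7]], dtype=object)

    # Hand-rolled Gower distance: range-normalized difference for numeric attributes,
    # simple 0/1 mismatch for categorical ones, averaged over all attributes.
    def gower_distance(a, b, ranges, categorical):
        total = 0.0
        for i, (x, y) in enumerate(zip(a, b)):
            if i in categorical:
                total += 0.0 if x == y else 1.0
            else:
                total += abs(x - y) / ranges[i]
        return total / len(a)

    age_range = {0: 62 - 25}
    print(gower_distance(X[0], X[2], age_range, categorical={1, 2}))

    # k-prototypes clustering on the same toy data.
    kp = KPrototypes(n_clusters=2, init="Cao", random_state=0)
    labels = kp.fit_predict(X, categorical=[1, 2])
    print(labels)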

Running k-medoids algorithm in ELKI

I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an ARFF file (using the ARFFParser in ELKI).
The dataset has 7 dimensions; however, the clustering results I obtain show clustering only at the level of one dimension, and only for 3 attributes, ignoring the rest.
Could anyone help with how I can obtain a clustering visualization for all dimensions?
ELKI is mostly used with numerical data.
Currently, ELKI does not have a "mixed" data type, unfortunately.
The ARFF parser will split your data set into multiple relations:
a 1-dimensional numerical relation containing age
a LabelList relation storing sex and region
a 1-dimensional numerical relation containing salary
a LabelList relation storing married
a 1-dimensional numerical relation storing children
a LabelList relation storing car
Apparently it has messed up the relation labels, though. But other than that, this approach works perfectly well with ARFF data sets that consist of numerical data plus a class label, for example - the use case this parser was written for. It is well-defined and consistent behaviour, though not what you expected it to do.
The algorithm then ran on the first relation it could work with, i.e. age only.
So here is what you need to do:
Implement an efficient data type for storing mixed type data.
Modify the ARFF parser to produce a single relation of mixed type data.
Implement a distance function for this type, because the lack of a mixed type data representation means we do not have a distance to go with it either.
Choose this new distance function in k-Medoids.
Share the code, so others do not have to do this again. ;-)
Alternatively, you could write a script to encode your data as a numerical data set (sketched below); then it will work fine. But in my opinion, the results of one-hot encoding etc. are usually not very convincing.
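If you do go the encoding route, a minimal sketch in Python might look like this (pandas assumed; the column names are taken from the relations listed above, and the file names are hypothetical):

    import pandas as pd

    # One-hot encode the categorical attributes and write a purely numerical CSV
    # that ELKI can load as plain number vectors.
    df = pd.read_csv("mixed_data.csv")  # columns: age, sex, region, salary, married, children, car
    encoded = pd.get_dummies(df, columns=["sex", "region", "married", "car"], dtype=float)
    encoded.to_csv("numeric_data.csv", index=False, header=False)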

RapidMiner: Ability to classify based off user set support threshold?

I have built a small text analysis model that classifies small text files as either Good, Bad, or Neutral. I was using a support vector machine as my classifier. However, I was wondering whether, instead of classifying into all three, I could classify into either Good or Bad, and if the support for a text file is below 0.7 (or some user-specified threshold), classify that text file as Neutral. I know this isn't considered the best way of doing this; I am just trying to see what would happen if I took a different approach.
The operator Drop Uncertain Predictions might be what you want.
After you have applied your model to some test data, the resulting example set will have a prediction and two new attributes called confidence(Good) and confidence(Bad). These confidences are between 0 and 1 and for the two class case they will sum to 1 for each example within the example set. The highest confidence dictates the value of the prediction.
The Drop Uncertain Predictions operator requires a min confidence parameter and will set the prediction to missing if the maximum confidence it finds is below this value (you can also have different confidences for different class values for more advanced investigations).
You could then use the Replace Missing Values operator to change all missing predictions to be a text value of your choice.
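As a conceptual sketch of those two steps outside RapidMiner (plain Python/pandas with made-up data, just to illustrate the logic of the operators):

    import pandas as pd

    # Hypothetical scored data: the two columns mimic confidence(Good) and confidence(Bad).
    scored = pd.DataFrame({
        "prediction":      ["Good", "Bad", "Good"],
        "confidence_Good": [0.91, 0.35, 0.55],
        "confidence_Bad":  [0.09, 0.65, 0.45],
    })
    threshold = 0.7
    best = scored[["confidence_Good", "confidence_Bad"]].max(axis=1)
    scored.loc[best < threshold, "prediction"] = None              # "Drop Uncertain Predictions"
    scored["prediction"] = scored["prediction"].fillna("Neutral")  # "Replace Missing Values"
    print(scored)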