I am using DBSCAN to cluster my data. After clustering, I want to get the attribute values of each cluster's core point. Is there a way to do this?
import java.awt.BorderLayout;
import javax.swing.JFrame;
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.DBSCAN;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.gui.explorer.ClustererPanel;
import weka.gui.visualize.PlotData2D;
import weka.gui.visualize.VisualizePanel;

private static void ClusteringDemo(String filename) throws Exception {
    ClusterEvaluation eval;
    Instances data;
    DBSCAN cl;

    data = DataSource.read(filename);

    // manual call
    cl = new DBSCAN();
    cl.setMinPoints(6);
    cl.setEpsilon(0.05);
    cl.buildClusterer(data);

    eval = new ClusterEvaluation();
    eval.setClusterer(cl);
    eval.evaluateClusterer(new Instances(data));
    System.out.println(eval.clusterResultsToString());

    // setup visualization
    PlotData2D predData = ClustererPanel.setUpVisualizableInstances(data, eval);
    VisualizePanel vp = new VisualizePanel();
    vp.addPlot(predData);

    // display data
    JFrame jf = new JFrame("Weka Clusterer Visualize: " + vp.getName());
    jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
    jf.setSize(500, 400);
    jf.getContentPane().setLayout(new BorderLayout());
    jf.getContentPane().add(vp, BorderLayout.CENTER);
    jf.setVisible(true);
}
cl is the DBSCAN clusterer, and I have only implemented the visualization so far. Could anyone show me how to get the core point values?
There is no such thing as "the core point value".
DBSCAN does not use cluster centers the way k-means does.
Consider this DBSCAN image (Wikipedia). Where would "the core point value" of the red cluster be?
Clusters can be arbitrarily shaped, and then there is no "center". In fact, the average of all points may lie outside the cluster.
A cluster has at least one core point, but there may be many more; all of its points could be core points at the same time. Thus, the information about which points are core points is not very important. If I recall correctly, ELKI has an option to expose this information, but by default it is discarded immediately.
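That said, if you really want the core points in Weka, you can recompute them yourself from the same parameters you passed to DBSCAN: a point is a core point if its epsilon-neighborhood contains at least minPoints instances. A minimal sketch follows; this helper is not part of Weka's API, and Weka's internal distance handling (e.g. attribute normalization) may differ slightly from this plain Euclidean count, so treat it as an approximation:

import weka.core.EuclideanDistance;
import weka.core.Instances;

// Hypothetical helper: prints every instance whose epsilon-neighborhood
// contains at least minPoints instances (the DBSCAN core-point condition).
static void printCorePoints(Instances data, double epsilon, int minPoints) {
    EuclideanDistance dist = new EuclideanDistance(data);
    for (int i = 0; i < data.numInstances(); i++) {
        int neighbors = 0;
        for (int j = 0; j < data.numInstances(); j++) {
            if (dist.distance(data.instance(i), data.instance(j)) <= epsilon) {
                neighbors++; // the neighborhood includes the point itself
            }
        }
        if (neighbors >= minPoints) {
            // these are the attribute values you asked for
            System.out.println("Core point: " + data.instance(i));
        }
    }
}

With the configuration above, you would call printCorePoints(data, 0.05, 6).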
I have calculated embeddings with the help of doc2vec, and I have also calculated the distances between sentences in vector form, so I now have the pairwise distances between sentences. How can I cluster them without specifying the number of clusters? I have tried k-means and agglomerative algorithms, but they are not giving me good results. Can anybody suggest the best method to determine the optimal number of clusters?
Try this. If it doesn't do what you want, I have a few other code samples to share; which option works best can change based on the dataset that you feed into the algorithm.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ") #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
I am working on an inter-class and intra-class classification problem with one CNN, such that there are first two classes, Cat and Dog, then within Cat a classification into three different breeds of cats, and within Dog five different breeds of dogs.
I haven't tried coding it yet; I am just working out whether the design is feasible.
My question is: what would be a feasible design for this kind of problem?
For training, I am thinking of first designing a CNN-1 network that differentiates cat from dog and gathers the image data of all the training images. After the separation of cats and dogs, CNN-2 and CNN-3 would further train these images for each breed of dog and cat. I am just not sure how the testing would work in this situation.
I have approached a similar problem previously in Python. Hopefully this is helpful and you can come up with an alternative implementation in Matlab if that is what you are using.
After all was said and done, I landed on a single model for all predictions. For your purpose you could have one binary output for dog vs. cat, another multi-class output for the dog breeds, and another multi-class output for the cat breeds.
Using Tensorflow, I created a mask for the irrelevant classes. For example, if the image was of a cat, then all of the dog breeds are irrelevant and they should not impact model training for that example. This required a customized TF Dataset (that converted 0's to -1 for the mask) and a customized loss function that returned 0 error when the mask was present for that example.
Finally, the training process: specific to your question, you will have to create custom accuracy functions that handle the mask values however you want them to, but otherwise this part of the process should be standard. It is best practice to spread the classes evenly across the training data, but they can all be trained together.
If you google "Multi-Task Training" you can find additional resources for this problem.
Here are some code snips if you are interested:
For the customized TF Dataset that masks irrelevant labels...
import tensorflow as tf
from multiprocessing import cpu_count

# Replace 0's with -1 for the mask when there aren't any labels
def produce_mask(features):
    for filt, tensor in features.items():
        if "target" in filt:
            condition = tf.equal(tf.math.reduce_sum(tensor), 0)
            features[filt] = tf.where(condition, tf.ones_like(tensor) * -1, tensor)
    return features

def create_dataset(filepath, batch_size=10):
    ...
    # **** This is where the mask was applied to the dataset
    dataset = dataset.map(produce_mask, num_parallel_calls=cpu_count())
    ...
    return parsed_features
Custom loss function. I was using binary cross-entropy because my problem was multi-label. You will likely want to adapt this to categorical cross-entropy.
from tensorflow.keras import backend

# Custom loss function: masked entries contribute zero error
def masked_binary_crossentropy(y_true, y_pred):
    mask = backend.cast(backend.not_equal(y_true, -1), backend.floatx())
    return backend.binary_crossentropy(y_true * mask, y_pred * mask)
Then, the custom accuracy metrics. I was using top-k accuracy; you may need to modify it for your purposes, but this will give you the general idea. Compared to the loss function, instead of converting everything to 0, which would over-inflate the accuracy, this function filters masked values out entirely. That works because the outputs are measured individually, so each output (binary, cat breed, dog breed) gets its own accuracy measure, filtered down to only the relevant examples.
backend is the Keras backend.
import tensorflow as tf
from tensorflow.keras.metrics import top_k_categorical_accuracy

def top_5_acc(y_true, y_pred, k=5):
    # keep only the rows where at least one label is not masked
    mask = backend.cast(backend.not_equal(y_true, -1), tf.bool)
    mask = tf.math.reduce_any(mask, axis=1)
    masked_true = tf.boolean_mask(y_true, mask)
    masked_pred = tf.boolean_mask(y_pred, mask)
    return top_k_categorical_accuracy(masked_true, masked_pred, k)
Edit
No, in the scenario I described above there is only one model, and it is trained on all of the data together. There are 3 outputs from the single model. The mask is a major part of this, as it allows the network to adjust only the weights that are relevant to the example: if the image is a cat, the dog-breed prediction contributes nothing to the loss.
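For reference, a rough sketch of what such a single model with three heads could look like in Keras. The backbone, layer sizes, and head names here are illustrative, not from my original code; the breed heads reuse the masked_binary_crossentropy defined above:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative three-headed model: one shared backbone, three outputs.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.GlobalAveragePooling2D()(x)

animal = layers.Dense(1, activation="sigmoid", name="animal")(x)        # cat vs. dog
cat_breed = layers.Dense(3, activation="softmax", name="cat_breed")(x)  # 3 cat breeds
dog_breed = layers.Dense(5, activation="softmax", name="dog_breed")(x)  # 5 dog breeds

model = tf.keras.Model(inputs, [animal, cat_breed, dog_breed])
model.compile(
    optimizer="adam",
    loss={
        "animal": "binary_crossentropy",
        # masked losses so irrelevant breeds contribute zero error
        "cat_breed": masked_binary_crossentropy,
        "dog_breed": masked_binary_crossentropy,
    },
)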
My goal is to have an autoencoding network where I can train the identity function and then do forward passes yielding a reconstruction of the input.
For this, I'm trying to use VariationalAutoencoder, e.g. something like:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(77147718)
        .trainingWorkspaceMode(WorkspaceMode.NONE)
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(1.0)
        .optimizationAlgo(OptimizationAlgorithm.CONJUGATE_GRADIENT)
        .list()
        .layer(0, new VariationalAutoencoder.Builder()
                .activation(Activation.LEAKYRELU)
                .nIn(100).nOut(15)
                .encoderLayerSizes(120, 60, 30)
                .decoderLayerSizes(30, 60, 120)
                .pzxActivationFunction(Activation.IDENTITY)
                .reconstructionDistribution(new BernoulliReconstructionDistribution(Activation.SIGMOID.getActivationFunction()))
                .build())
        .pretrain(true).backprop(false)
        .build();
However, VariationalAutoencoder seems to be designed for training (and providing) mappings from an input to an encoded version, i.e. from a vector of size 100 to a vector of size 15 in the example configuration above.
However, I'm not particularly interested in the encoded version; I would like to train a mapping of a 100-vector to itself. Then I'd like to run other 100-vectors through it and get back their reconstructed versions.
But even looking at the API of VariationalAutoencoder (or AutoEncoder, too), I can't figure out how to do this. Or are those layers not designed for this kind of "end-to-end usage", so that I would have to manually construct an autoencoding network?
You can see how to use the VAE layer to extract averaged reconstructions in the DL4J variational autoencoder example.
There are two methods for getting the reconstruction from a variational layer. The standard one is generateAtMeanGivenZ, which will draw samples from the layer and give you the average. If you want raw samples, you can use generateRandomGivenZ. See the javadoc page for all the other methods.
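Putting that together for your end-to-end use case, a minimal sketch (exact method signatures vary between DL4J versions; net is assumed to be the MultiLayerNetwork built from your configuration, and features a batch of 100-vectors):

import org.deeplearning4j.nn.layers.variational.VariationalAutoencoder;
import org.nd4j.linalg.api.ndarray.INDArray;

// Cast the layer to its implementation class to reach the generate* methods
VariationalAutoencoder vae = (VariationalAutoencoder) net.getLayer(0);

// Encode: activate the layer to get the latent representation p(z|x)
INDArray latent = vae.activate(features, false);

// Decode: the averaged reconstruction of the input
INDArray reconstruction = vae.generateAtMeanGivenZ(latent);

// Or, for raw (stochastic) samples instead of the average:
INDArray sampled = vae.generateRandomGivenZ(latent);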
I am running k-means in Mahout, and as output I get the folders clusters-x, clusters-x-final and clusteredPoints.
If I understood correctly, clusters-x holds the centroid locations in each iteration, clusters-x-final holds the final centroid locations, and clusteredPoints should hold the clustered points with a cluster id and a weight that represents the probability of belonging to that cluster (depending on the distance between the point and its centroid). On the other hand, clusters-x and clusters-x-final contain the cluster centroids, the number of elements, the centroid's feature values, and the radius of the cluster (the distance between the centroid and its farthest point).
How do I examine these outputs?
I used the cluster dumper successfully on clusters-x and clusters-x-final from the terminal, but when I used it on clusteredPoints, I got an empty file. What seems to be the problem?
And how can I get these values from code? I mean the centroid values and the points belonging to the clusters.
For clusteredPoints I used IntWritable as the key and WeightedPropertyVectorWritable as the value in a while loop, but it skips the loop as if there were no elements in clusteredPoints.
This is even stranger because the file I get with the cluster dumper is empty too.
What could be the problem?
Any help would be greatly appreciated!
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them goes, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors, which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types to use, but it also gives you access to a Vector wrapper class called NamedVector, which allows you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
        new SequentialAccessSparseVector(vectorDimensions),
        vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();

// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), writable); // append the wrapper, not the raw vector
You can now use this file as input to one of the clustering algorithms.
After you have run one of the clustering algorithms with your points file, it will have generated yet another points file, in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated with each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();

while (reader.next(clusterId, vector))
{
    NamedVector nVec = (NamedVector) vector.getVector();
    // you now have access to the original name using nVec.getName()
}
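For the centroid values, the other half of your question, a rough sketch along the same lines. The exact writable types vary between Mahout versions (in recent versions the values in clusters-x-final are ClusterWritable), and the path below is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterable;

Path clustersFinal = new Path("output/clusters-10-final/part-r-00000"); // placeholder path
for (ClusterWritable value :
        new SequenceFileValueIterable<ClusterWritable>(clustersFinal, new Configuration())) {
    Cluster cluster = value.getValue();
    // centroid feature values, cluster id and radius, as in the clusters-x-final files
    System.out.println("Cluster " + cluster.getId()
            + " centroid: " + cluster.getCenter()
            + " radius: " + cluster.getRadius());
}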
Check the parameter named "clusterClassificationThreshold"; it should be 0.
You can check this mailing-list thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700#windwardsolutions.com%3E
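If you are running k-means from code rather than the command line, this is roughly where the threshold goes. A hedged sketch: the exact KMeansDriver.run signature differs between Mahout versions, and the paths here are placeholders; the relevant arguments are runClustering = true and clusterClassificationThreshold = 0.0, so that every point is emitted to clusteredPoints:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;

Configuration conf = new Configuration();
KMeansDriver.run(conf,
        new Path("points"),           // input vectors
        new Path("initial-clusters"), // seed clusters
        new Path("output"),
        0.001,                        // convergenceDelta
        10,                           // maxIterations
        true,                         // runClustering: write clusteredPoints
        0.0,                          // clusterClassificationThreshold: keep all points
        false);                       // runSequential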
I have seen that there is a new implementation of k-means in Mahout called StreamingKMeans, which achieves k-means clustering without chained Mapper-Reducer cycles:
https://github.com/dfilimon/mahout/tree/epigrams
I cannot find any articles about its usage anywhere. Could anyone point out useful links, with some code examples, on how to use it?
StreamingKMeans is a new feature in Mahout 0.8.
For more details on its algorithms, see:
"Streaming k-means approximation" by N. Ailon, R. Jaiswal, C. Monteleoni:
http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
"Fast and Accurate k-means for Large Datasets" by M. Shindler, A. Wong, A. Meyerson:
http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
As you mentioned, there is no article about its usage. As with the other clustering algorithms, there is a driver to which you can pass some configuration parameters as a string array, and it will cluster your data:
String[] args1 = new String[] {
        "-i", "/home/name/workspace/XXXXX-vectors/tfidf-vectors",
        "-o", "/home/name/workspace/XXXXX-vectors/tfidf-vectors/SKM-Main-result/",
        "--estimatedNumMapClusters", "200",
        "--searchSize", "2",
        "-k", "12",
        "--numBallKMeansRuns", "3",
        "--distanceMeasure", "org.apache.mahout.common.distance.CosineDistanceMeasure"
};
StreamingKMeansDriver.main(args1);
To get a description of the important parameters, just make a deliberate mistake such as passing "-iiii" as the first parameter; it will show you the parameters, their descriptions, and their default values.
But if you don't want to use it this way, just read StreamingKMeansMapper, StreamingKMeansReducer, and StreamingKMeansThread; the code of these 3 classes helps you understand the usage of the algorithm and customize it for your needs.
The mapper uses StreamingKMeans to produce estimated clusters of the input data. To get the k final clusters, the reducer takes the intermediate points (the centroids generated in the previous step) and, using ball k-means, clusters these intermediate points into k clusters.
Here are the steps for running Streaming k-means:
Generate sparse vectors via seq2sparse.
mahout streamingkmeans -i "" -o "" \
    --tempDir "" -ow \
    -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
    -k -km
-k = the number of clusters
-km = k * log(n), where k = the number of clusters and n = the number of data points to cluster; round this to the nearest integer
You have the option of using FastProjectionSearch, ProjectionSearch, or LocalitySensitiveHashSearch for the -sc parameter.