Using K-means with RDF files

I have a set of RDF triples and I want to group them into clusters using K-means, but I don't know how to represent words as numerical values.
For example, I have 3 triples like:
<'name','isa','johan'>
<'name','isa','fred'>
<'sport','isa','football'>
How can I compute the distance between these triples?
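One common way to get a numeric representation (an assumption on my part, not from the question) is to one-hot encode the three positions of each triple, after which Euclidean distance and K-means apply directly. A minimal sketch with scikit-learn:

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

# Each triple is a (subject, predicate, object) row of strings.
triples = [
    ("name", "isa", "johan"),
    ("name", "isa", "fred"),
    ("sport", "isa", "football"),
]

# One-hot encode each position independently; triples sharing a
# component end up closer in the resulting vector space.
encoder = OneHotEncoder()
X = encoder.fit_transform(triples)  # sparse matrix, one row per triple

# K-means on the encoded vectors (k=2 is just for illustration).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X.toarray())
print(kmeans.labels_)
```

With this encoding, the first two triples differ only in the object position, so they are closer to each other than to the third.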

Related

Multi-label clustering

I have a question regarding a task that I am trying to solve. The data are characterisation data, meaning that I have a label (PASS/FAIL) for every single datapoint.
So my data matrix has n rows and m columns, and the target variables are again a matrix of n rows and m columns composed of binary values (0s and 1s).
My task is to apply clustering and partition all these datapoints into two clusters, one for PASS datapoints and the other for FAIL datapoints. I wasn't able to find an algorithm that can solve this type of 'multi-label' problem with clustering.
I tried algorithms like k-means, but when tuning the number of clusters to initialise with, I get k=6, which doesn't really make sense. Outliers have already been dropped from the data, and it has been normalised as well.
My data matrix has a large number of features (e.g. >3000), so I tried to apply dimensionality-reduction methods like PCA to at least drop the features that are less relevant than the rest. But I am not sure whether this is applicable in my case, where I have a binary matrix as the target variables.
Is there a specific algorithm that can solve this type of problem, and if so, what pre-processing should I do before applying it?
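For what it's worth, a minimal sketch of the pipeline described above (PCA, then k-means with the cluster count scored by silhouette); the data shapes and parameter values here are illustrative assumptions, not taken from the question:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative stand-in: 500 normalised datapoints with >3000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3000))

# Reduce dimensionality before clustering; 50 components is an assumption.
X_reduced = PCA(n_components=50).fit_transform(X)

# Score candidate cluster counts; the question expects k=2 (PASS/FAIL).
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    print(k, silhouette_score(X_reduced, labels))
```

Note that PCA here only looks at the feature matrix, not the binary target matrix, so the targets don't block this step; they can still be used afterwards to check how well the clusters line up with PASS/FAIL.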

Clustering in Weka

I have some data collected using an online survey, so there are no classes/labels in the data with which to evaluate clustering results. I am doing the clustering in order to group participants for another task.
The data have 10 attributes (such as Age and Gender) and 111 examples, or data points.
It's my first time performing clustering, and it's been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I tried to cluster the data using all attributes, all the clustering algorithms in Weka (Cobweb, EM, etc.), and different numbers of clusters (1-10). When I visualise the clusters, they don't make any sense and the data are widely spread across the x and y axes.
I applied PCA and selected different attribute combinations according to the ranks obtained from PCA. The best clustering result was obtained using k-means with a combination of only 2 attributes; the number of clusters selected was 3 and the seed was 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster the data correct? If not, please give me advice.
Is this considered a good clustering result?
How can I optimise or enhance my clusters?
What is meant by the seed in Weka clustering?
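On the seed question specifically: the seed initialises the random number generator used for the random parts of the algorithm (in k-means, choosing the starting centroids), so different seeds can give different clusterings. A small illustration using scikit-learn's analogous random_state parameter (sklearn is an assumption here, used only to demonstrate the idea):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(111, 2))  # 111 points, like the survey data

# The "seed" only changes where the initial centroids are placed;
# unstable results across seeds suggest the clusters are not well separated.
for seed in (1, 7, 42):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)
```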

Interpreting the result of StreamingKMeans in Mahout 0.8

What I want to achieve is simply to find out which input points are included in a given cluster.
I have a personal dataset containing some documents that were grouped into 12 clusters manually.
I know how to interpret k-means results in Mahout 0.7 using the NamedVector class and one of the dumpers (like ClusterDumper): after clustering with the k-means driver, a directory named clusteredPoints is created that contains the clustering result, and using ClusterDumper you can see the created clusters and the points in each one. The link below gives a good solution for this:
How to read Mahout clustering output
But, as I mentioned in the title, I want the same capability for interpreting StreamingKMeans results, which is a new feature in Mahout 0.8.
This feature uses a Centroid class for holding the data points and each cluster's seed. The output of the StreamingKMeans algorithm is only a sequence file containing the centroid vectors plus the key and weight of each cluster. The output holds no information about the input data points, so I cannot know how they are distributed between clusters, and therefore I cannot get a sense of the clustering accuracy.
So, how can I get this information from the clustering output? Is it not implemented, or did I just fail to find and use a ready-made solution? How can I analyse the result of StreamingKMeans?
Thanks.
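One generic workaround (not Mahout-specific, and an assumption rather than part of the original thread) is to read the centroid vectors back out of the sequence file and assign every input point to its nearest centroid yourself; a sketch in plain NumPy:

```python
import numpy as np

def assign_to_centroids(points, centroids):
    """Return, for each point, the index of the nearest centroid."""
    # Pairwise squared Euclidean distances: shape (n_points, n_centroids).
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Illustrative stand-ins for the vectors read back from the sequence file.
points = np.random.rand(100, 20)     # the original input vectors
centroids = np.random.rand(12, 20)   # the 12 StreamingKMeans centroids

labels = assign_to_centroids(points, centroids)
for c in range(len(centroids)):
    print(c, np.where(labels == c)[0])  # which input points fall in cluster c
```

With the manual 12-cluster grouping as a reference, these recovered assignments can then be compared against it to judge the clustering accuracy.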

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each point is represented by a set of extracted features. Generally, clustering algorithms require all input data to have the same dimensions (the same number of features); that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely different for each data point (I mean, the data points in my dataset X do not all have the same number of features).
Is there any way to adapt them, in order to cluster them using common clustering algorithms that require data of the same dimensions?
Thanks
It sounds like the problem you have is a 'sparse' dataset. There are generally two options (the first is sketched just after this list):
Reduce the dimensionality of the input dataset using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA, then apply traditional clustering to the dense, lower-dimensional output.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note that you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
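A minimal sketch of the first option, assuming the data has already been packed into a SciPy sparse matrix (scikit-learn's TruncatedSVD accepts sparse input, and with algorithm="arpack" it uses a Lanczos-style solver):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Illustrative sparse input: 1000 points, 5000 features, ~1% non-zero.
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)

# Sparse SVD down to a dense, low-dimensional representation.
X_dense = TruncatedSVD(n_components=50, algorithm="arpack").fit_transform(X)

# Traditional clustering on the reduced data.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_dense)
```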
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms, such as DBSCAN, that do not care about the dimensionality of your data at all. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering and is highly domain-dependent. If your features are, for example, word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you have, say, a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features to each other efficiently, because the features have no order (so one cannot simply compare the first keypoint of one image with the first keypoint of another). A possible approach here is to derive another, uniform set of features. Typically, bag-of-words features are used in such a situation; for images, these are also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
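A minimal illustration of the "all you need is a distance function" point, assuming you can compute a pairwise distance for your variable-length feature sets (the toy distance here is a hypothetical stand-in for whatever makes sense in your domain):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Variable-length feature sets: plain Python lists of numbers.
objects = [[1.0, 2.0], [1.1, 2.1, 0.5], [9.0], [9.2, 0.1]]

def set_distance(a, b):
    """Hypothetical distance between two variable-length feature sets:
    match each value in one set to its closest value in the other."""
    a, b = np.asarray(a), np.asarray(b)
    d_ab = np.mean([np.min(np.abs(b - x)) for x in a])
    d_ba = np.mean([np.min(np.abs(a - x)) for x in b])
    return (d_ab + d_ba) / 2

# DBSCAN accepts a full precomputed distance matrix.
n = len(objects)
D = np.array([[set_distance(objects[i], objects[j]) for j in range(n)]
              for i in range(n)])
labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # -1 marks points DBSCAN treats as noise
```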
I see two options here:
Restrict yourself to those features for which all your data points have a value.
See if you can generate sensible default values for missing features (a sketch follows below).
However, if possible, you should probably resample all your data points so that they all have values for all features.
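A quick sketch of the second option, assuming the features are stored as dicts keyed by feature name (a hypothetical layout; the per-feature mean is one reasonable default):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical records where some features are missing.
records = [
    {"height": 1.8, "weight": 80.0},
    {"height": 1.6},                  # weight and age missing
    {"weight": 65.0, "age": 30.0},    # height missing
]

# Fixed feature order, with each feature's mean as its default value.
features = ["height", "weight", "age"]
defaults = {f: np.mean([r[f] for r in records if f in r]) for f in features}

# Build a dense n*d matrix and cluster it as usual.
X = np.array([[r.get(f, defaults[f]) for f in features] for r in records])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```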

MATLAB: Self-Organizing Map (SOM) clustering

I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore, the input data is a 1057x10 matrix, where 1057 is the number of images and 10 is the number of body-part angles relative to the torso.
Similarly, the test set is an 821x10 matrix.
I want all the rows in the input data to be clustered into 88 clusters.
Then I will use these clusters to find which cluster each row of the test data falls into.
In previous work, I used K-means clustering, which is very straightforward: we just ask K-means to cluster the data into 88 clusters, then implement another method that calculates the distance between each row of the test data and the center of each cluster and picks the smallest value; that gives the cluster of the corresponding test-data row.
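For reference, a compact sketch of that baseline (scikit-learn standing in for MATLAB's kmeans here; its predict method performs exactly the nearest-center assignment described above):

```python
import numpy as np
from sklearn.cluster import KMeans

train = np.random.rand(1057, 10)  # illustrative stand-in for the angle data
test = np.random.rand(821, 10)

kmeans = KMeans(n_clusters=88, n_init=10, random_state=0).fit(train)

# Nearest-center assignment for each test row.
test_labels = kmeans.predict(test)
```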
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOMs are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by computing which cluster it belongs to.
Do you have a better solution?
A Self-Organizing Map (SOM) is a clustering method considered an unsupervised variation of the Artificial Neural Network (ANN). It uses competitive learning techniques to train the network (nodes compete among themselves to show the strongest activation for a given input).
You can think of a SOM as a grid of interconnected nodes (square, hexagonal, ...), where each node is an N-dimensional vector of weights (of the same dimensionality as the data points we want to cluster).
The idea is simple: given an input vector, we find the node closest to it, then update its weights and the weights of the neighboring nodes so that they move towards the input vector (hence the name self-organizing). This process is repeated for all input data.
The clusters are implicitly defined by how the nodes organize themselves into groups with similar weights, and they can easily be seen visually.
SOMs are in a way similar to the K-means algorithm, but differ in that we don't impose a fixed number of clusters; instead, we specify the number and shape of the grid nodes that we want to adapt to our data.
Basically, once you have a trained SOM and want to classify a new test input vector, you simply assign it to the nearest node on the grid (the Best Matching Unit, BMU, with distance as the similarity measure), and give as the prediction the [majority] class of the vectors belonging to that BMU node.
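To make the training loop concrete, here is a minimal from-scratch SOM sketch in Python (NumPy used in place of the MATLAB toolboxes listed below; the grid size, learning rate, neighborhood width, and random data are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1057, 10))          # stand-in for the 1057x10 angle features

rows, cols, dim = 10, 10, 10           # a 10x10 grid of 10-dim weight vectors
weights = rng.random((rows, cols, dim))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                            indexing="ij"), axis=-1)

lr, sigma, epochs = 0.5, 2.0, 20
for epoch in range(epochs):
    decay = np.exp(-epoch / epochs)    # shrink learning rate and neighborhood
    for x in rng.permutation(data):
        # Best Matching Unit: the node whose weights are closest to x.
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # Gaussian neighborhood around the BMU on the grid.
        g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2)
                   / (2 * (sigma * decay) ** 2))
        # Pull the BMU and its neighbors towards the input vector.
        weights += lr * decay * g[..., None] * (x - weights)

# Classify a test vector by its BMU.
test = rng.random(10)
bmu = np.unravel_index(np.linalg.norm(weights - test, axis=2).argmin(),
                       (rows, cols))
print("BMU:", bmu)
```

To get the labelling behaviour asked about above, you would record which training vectors map to each BMU and assign each node the majority label of its vectors.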
For MATLAB, you can find a number of toolboxes that implement SOMs:
The Neural Network Toolbox from MathWorks can be used for clustering with SOMs (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox.