Learn weights of features - cluster-analysis

I want to use WEKA to learn the weights of the features that I am using to create clusters of documents. From each document I extract some features, but each feature has a different importance in the clustering method.
I have a training data set in which each document pair is represented by one similarity value per feature (the distance/similarity to the other document for that feature) and a class label of 1 if the two documents belong to the same cluster and 0 otherwise.
How can I use WEKA to learn the weights with cross-validation?
Thank you,
Evi

First, it is not possible to add weights in the ARFF file format; the XRFF file format must be used instead. Weights can then be attached either to individual instances or to attributes.
Check out the following links for examples.
http://weka.wikispaces.com/XRFF#Additional%20features-Attribute%20weights
http://weka.wikispaces.com/Add+weights+to+dataset
http://weka.8497.n7.nabble.com/can-I-weight-an-attribute-in-the-arff-file-td22889.html
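If you are not tied to WEKA, note that the setup described in the question (one row per document pair, one similarity value per feature, and a 0/1 label for "same cluster") is a standard supervised learning problem, so the per-feature weights can also be learned with any linear classifier. Below is a minimal sketch using scikit-learn instead of WEKA; the arrays X and y are random placeholders for your own pair-similarity data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data: one row per document pair, one column per feature,
    # each cell holding the similarity of the two documents for that feature.
    rng = np.random.default_rng(0)
    X = rng.random((200, 5))            # 200 pairs, 5 features
    y = rng.integers(0, 2, size=200)    # 1 = same cluster, 0 = different cluster

    # A linear model learns one weight per feature; cross-validation estimates
    # how well those weights separate "same" from "different" pairs.
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10)
    print("10-fold CV accuracy: %.3f" % scores.mean())

    # Fit on the full training set to read off the learned per-feature weights.
    clf.fit(X, y)
    print("feature weights:", clf.coef_.ravel())

The learned coefficients can then serve as the feature weights in your clustering distance function.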

Related

How to cluster a data set in WEKA

This is my homework question:
Use the OnlineRetail.arff from the Canvas. Pick one of the clustering algorithms to segment customers into different groups using Weka. Explain why you choose the method and visualize your result.
I feel like I have tried everything and I am getting nowhere. How do you determine which clustering algorithm to use? When I try to run them in WEKA, most of them are greyed out and give me errors. Do I have to manipulate the data in order to cluster it, and if so, how?
These are the attributes. They are a mix of string and numeric values. I keep getting errors that k-means and other clustering techniques cannot take strings. How do I combat this?
[screenshot of the dataset's attribute list]

choose the proper clustering method for Latent Semantic Analysis

I want to cluster some text documents to find the documents with the same concept. I've computed the semantic similarity using Latent Semantic Analysis (LSA), but I am not sure which clustering method I should choose for my purpose.
Thank you
You can use hierarchical clustering. There is an R package called Rclusterpp which is very efficient for hierarchical clustering of large data (it does the computation in parallel). You can then cut the dendrogram at different numbers of clusters within the plausible range and check the cluster profiles using cross-tabulation.
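If you prefer to stay in Python, the same idea (hierarchical clustering of the LSA-reduced document vectors, then cutting the dendrogram at several candidate cluster counts) can be sketched with SciPy. The TF-IDF/SVD step below is only an assumed stand-in for whatever LSA pipeline you already have, and the documents are toy examples.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from scipy.cluster.hierarchy import linkage, fcluster

    docs = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors sold shares amid market fears",
    ]

    # LSA: TF-IDF followed by a truncated SVD (placeholder for your own pipeline).
    tfidf = TfidfVectorizer().fit_transform(docs)
    lsa_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)

    # Hierarchical (agglomerative) clustering with Ward linkage.
    tree = linkage(lsa_vectors, method="ward")

    # Cut the dendrogram at several candidate numbers of clusters and inspect.
    for k in (2, 3):
        labels = fcluster(tree, t=k, criterion="maxclust")
        print(k, labels)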

Clustering and classification

I need to perform clustering and classification on data that is present in a CSV file. The data is simple text containing vendor names.
Is there some free library available for this task?
Thanks,
Ashish
I don't quite understand what you mean, since clustering and classification are two different tasks, but you can do both with these libraries:
Python: scikit-learn
Java: Weka
First, convert your dataset from CSV to ARFF using the following link.
http://www.cs.ccsu.edu/~markov/MDLclustering/MDLmanual.pdf
After doing this, please clarify what you expect from the data, as every algorithm in Weka shows somewhat different results.
Once the data is converted, you can simply apply k-means or any other algorithm.
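As a concrete starting point on the scikit-learn side, vendor-name strings can be turned into numeric features (for example character n-gram TF-IDF) and then clustered. This is only a sketch; the example names and the commented-out CSV-reading step are assumptions, not part of the original question.

    import csv
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical input: a CSV file with one vendor name per row.
    vendors = ["Acme Corp", "ACME Corporation", "Globex Ltd", "Globex Limited"]
    # with open("vendors.csv") as f:
    #     vendors = [row[0] for row in csv.reader(f)]

    # Character n-grams cope well with small spelling variations in names.
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(vendors)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    for name, label in zip(vendors, km.labels_):
        print(label, name)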

How to use LibSVM for multiple descriptors for image classification - Matlab

I need to classify pairs of images and indicate whether they are the same or not. I use several descriptors, such as SIFT, LBP, and more.
I want now to use LIBSVM to do the training and test.
How can I use svmtrain?
Should I save only the distance between the two descriptors and then just have 1 1:SIFTDelta 2:LBPDelta?
is this the correct way or is there any better approach?
thanks
I'm not sure this is the right forum for this question, as it deals more with "high level" notions of learning, rather than the specific implementation of it in Matlab.
Having said that, it seems like you are trying to combine multiple cues for learning, which is not a trivial task.
I can propose two methods for you:
Direct method: just concatenate all your descriptors into a single, very long one and do the learning in this high-dimensional space.
Two-stage method: do the learning in two stages (consequently, you'll have to partition your training data into two):
At the first stage, learn K classifiers, each using a different descriptor (assuming you wish to use K different descriptors).
Then, at the second stage (using the remainder of your training data), you classify each example using the K classifiers you have: this gives you a new K-dimensional feature vector for each sample (you can use the classification result, or the distance from the separating hyperplane, to populate the k-th entry of the new descriptor). Now you can train a second classifier on the new K-dimensional vectors. This second classifier gives you the final output of your multi-descriptor system.
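To make the two-stage idea concrete, here is a rough sketch using scikit-learn rather than Matlab/LIBSVM; the per-descriptor arrays and the 50/50 split are placeholders, not part of the original question.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    # Placeholder data: two descriptors (e.g. SIFT-based and LBP-based) for
    # 300 image pairs, with label 1 = same, 0 = different.
    rng = np.random.default_rng(0)
    X_sift = rng.random((300, 64))
    X_lbp = rng.random((300, 32))
    y = rng.integers(0, 2, size=300)

    # Partition the training data: stage 1 trains the per-descriptor classifiers,
    # stage 2 trains the combiner on their outputs.
    idx1, idx2 = train_test_split(np.arange(300), test_size=0.5, random_state=0)

    # Stage 1: one classifier per descriptor.
    clf_sift = SVC().fit(X_sift[idx1], y[idx1])
    clf_lbp = SVC().fit(X_lbp[idx1], y[idx1])

    # Stage 2: build a K-dimensional meta-feature from the stage-1 outputs (here
    # the signed distance to each separating hyperplane) and train a combiner on it.
    meta = np.column_stack([
        clf_sift.decision_function(X_sift[idx2]),
        clf_lbp.decision_function(X_lbp[idx2]),
    ])
    combiner = SVC().fit(meta, y[idx2])
    print("stage-2 training accuracy:", combiner.score(meta, y[idx2]))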
-Enjoy!

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each data point is represented by a set of extracted features. Generally, clustering algorithms require that all input data have the same dimensionality (the same number of features), that is, the input X is an n*d matrix of n data points, each of which has d features.
In my case, I've already extracted some features from my data, but the number of extracted features most likely differs from one data point to another (I mean, I have a dataset X whose data points do not all have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require data of the same dimensionality?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA, then apply traditional clustering to the dense lower-dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of the paper describing it if you look hard enough online (try scholar.google.com).
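A minimal sketch of the first option (truncated/sparse SVD on the sparse representation, then ordinary clustering of the dense output), assuming scikit-learn as the toolchain and random placeholder data:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    # Placeholder sparse data: 100 objects, 1000 possible features, ~2% filled.
    X_sparse = sparse_random(100, 1000, density=0.02, random_state=0).tocsr()

    # Dimensionality reduction on the sparse matrix (ARPACK uses a Lanczos-type
    # solver), followed by ordinary k-means on the dense low-dimensional output.
    X_dense = TruncatedSVD(n_components=10, algorithm="arpack").fit_transform(X_sparse)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_dense)
    print(np.bincount(labels))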
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note: the exact number of component classifications you choose doesn't really matter, as long as it's high enough, because the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms, such as DBSCAN, that in fact do not care about the representation of your data at all. All such an algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as there is no order to the features (so one cannot simply compare the first keypoint of one image with the first keypoint of another, etc.).
A possible approach here is to derive another, uniform set of features. Typically, bag-of-words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
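A compact sketch of that bag-of-words / visual-words idea (cluster the sub-features into a vocabulary, represent each object as a histogram of "words", then compare the histograms with cosine distance); the descriptor arrays here are random placeholders:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_distances

    rng = np.random.default_rng(0)

    # Placeholder: each object has a *different* number of 128-dim descriptors.
    objects = [rng.random((int(rng.integers(5, 20)), 128)) for _ in range(10)]

    # Step 1: cluster all sub-features together to build a small vocabulary.
    vocab_size = 16
    all_descriptors = np.vstack(objects)
    vocab = KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(all_descriptors)

    # Step 2: represent each object as a histogram of visual-word counts.
    def to_histogram(descriptors):
        words = vocab.predict(descriptors)
        return np.bincount(words, minlength=vocab_size)

    histograms = np.array([to_histogram(obj) for obj in objects])

    # Step 3: the uniform-length histograms allow an ordinary distance function.
    D = cosine_distances(histograms)
    print(D.shape)  # 10x10 pairwise distance matrix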
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.
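Both options can be tried quickly with pandas, assuming each data point is stored as a dict of feature-name/value pairs (an assumption about the data layout, not something stated in the question):

    import pandas as pd

    # Placeholder: three data points with partially overlapping feature sets.
    points = [
        {"a": 1.0, "b": 2.0, "c": 3.0},
        {"a": 0.5, "b": 1.5},
        {"a": 2.0, "c": 1.0},
    ]
    df = pd.DataFrame(points)

    # Option 1: keep only the features present for every data point.
    common = df.dropna(axis=1)
    print(common)

    # Option 2: fill missing features with a sensible default (here the column mean).
    filled = df.fillna(df.mean())
    print(filled)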