Incorrectly clustered instances in Weka - cluster-analysis

I use the Weka tool for data mining. When I load my data set and cluster it with the SimpleKMeans algorithm, it displays the following statement:
Incorrectly clustered instances : 857.0 69.7883 %
Is it OK to proceed with that percentage? If not, please let me know how to reduce it.

If you have labels, then use them, and do not use clustering at all.
Clustering is meant for data where you do not have labels.
How do you plan to proceed?
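For context, the "Incorrectly clustered instances" figure is reported when Weka compares the clusters against an existing class attribute (classes-to-clusters evaluation). Below is a minimal sketch of how such a number can be computed, assuming a simple majority-vote mapping from clusters to classes. The function name and the toy data are made up for illustration, and Weka's own cluster-to-class mapping may differ from this simplification.

```python
from collections import Counter, defaultdict

def incorrectly_clustered(cluster_ids, class_labels):
    """Count instances whose class differs from their cluster's majority class."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    # Every instance outside its cluster's majority class counts as an error.
    errors = sum(sum(counts.values()) - max(counts.values())
                 for counts in by_cluster.values())
    return errors, 100.0 * errors / len(class_labels)

# Toy data (made up): six instances, two clusters, two classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
errors, percent = incorrectly_clustered(clusters, labels)
print(f"Incorrectly clustered instances: {errors} ({percent:.4f} %)")
```

A high value only says that the clusters do not line up with the labels, which is why a classifier is usually the better tool when labels are available.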

Related

Using clustering classification as regression feature?

I am attempting to use KMeans clustering to create a feature for an XGBOOST regression. The problem is, I am not sure if there is data leakage. It is data with a date, so right now I am clustering on the first 70% of data sorted by date, and using the same as my training set.
Included in the clustering is my target variable. Using the cluster as a feature provides a huge boost to test scores, so I worry that this is causing data leakage. However, the clusters used for test scores are unseen data in the test set.
Is this valid, or is it causing data leakage? Thank you

How to cluster a data set in WEKA

This is my homework question:
Use the OnlineRetail.arff from the Canvas. Pick one of the clustering algorithms to segment customers into different groups using Weka. Explain why you choose the method and visualize your result.
I feel like I have tried everything and I am getting nowhere. How do you determine which clustering algorithm to use? When I try to run them in WEKA, most of them are greyed out and give me errors. Do I have to manipulate the data in order to cluster it, and if so, how?
These are the attributes. They are a mix of string and numeric values. I keep getting errors that k-means and other clustering techniques cannot take strings. How do I combat this?
[Image: attributes]

Choose the proper clustering method for Latent Semantic Analysis

I want to cluster some text documents to find the documents that share the same concept. I have computed the semantic similarity using Latent Semantic Analysis (LSA), but I am unsure which clustering method I should choose for my purpose.
Thank you
You can use hierarchical clustering. There is an R package called Rclusterpp which is very efficient for hierarchical clustering of large data (it does the computation in parallel). You can then cut the dendrogram at different numbers of clusters within the plausible range and check the cluster profiles using a cross-tab.
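To illustrate the "cut at different numbers of clusters, then cross-tab" idea, here is a rough pure-Python sketch. It is not the R package's algorithm: it uses a naive agglomerative procedure with average linkage (an assumption), and the 2-D vectors and topic tags are made-up stand-ins for LSA document vectors and whatever profile variable you have. Stopping the merging at k clusters gives the same grouping as cutting the dendrogram at k.

```python
from collections import Counter
from math import dist

def average_linkage(a, b, points):
    """Mean pairwise distance between two clusters of point indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerative(points, k):
    """Merge the two closest clusters until only k remain (naive, toy sizes only)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest average-linkage distance.
        _, a, b = min((average_linkage(clusters[a], clusters[b], points), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Made-up document vectors and hypothetical topic tags for profiling.
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
topics = ["sports", "sports", "sports", "politics", "politics", "science"]

results = {k: agglomerative(points, k) for k in (2, 3)}
for k, labels in results.items():
    crosstab = Counter(zip(labels, topics))  # (cluster, topic) -> count
    print(k, dict(crosstab))
```

Comparing the cross-tabs for several values of k shows at which cut the clusters start to mix profiles that you would rather keep apart.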

Clustering and classification

I need to perform clustering and classification on data, which is present in a csv file. The data is in form of simple text containing the vendor names.
Is there some free library available for this task?
Thanks,
Ashish
I don't understand what you mean by "clustering a classification" since those two are different from each other, but you can do clustering and classification with these libraries:
Python: scikit-learn
Java: Weka
First, convert your dataset from CSV to ARFF using the following link.
http://www.cs.ccsu.edu/~markov/MDLclustering/MDLmanual.pdf
After doing this, please let me know what you expect from the data, as every algorithm in Weka gives somewhat different results.
Once you have converted the data, you can simply apply k-means or any other algorithm.
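If you would rather script the CSV-to-ARFF conversion yourself, something along these lines might work. This is only a sketch under assumptions: the function name, the relation name and the sample rows are made up, the CSV is assumed to have a header row, and every column is declared as a string attribute, which fits plain text such as vendor names.

```python
import csv
import io

def csv_to_arff(csv_text, relation="vendors"):
    """Convert CSV text with a header row to ARFF, declaring every column as string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def quote(value):
        # ARFF string values are quoted; escape backslashes and single quotes.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {quote(name)} string" for name in header]
    lines += ["", "@data"]
    lines += [",".join(quote(v) for v in row) for row in data]
    return "\n".join(lines) + "\n"

# Made-up sample input with a single text column.
sample = "vendor_name\nAcme Corp\nO'Neil Supplies\n"
arff = csv_to_arff(sample)
print(arff)
```

Keep in mind that string attributes usually still need to be turned into numeric features (for example with Weka's StringToWordVector filter) before k-means will accept them.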

Data is not well clustered with any clustering approach

When I cluster my data (with any clustering approach) and compute the quality metrics (I tried several metrics, silhouette, Dunn, etc), I get very poor scores.
What I'm interested in is whether my data is clusterable or not. Are there any methods to assess that, or a method that tells me whether the data contains any useful information?
Thanks,
Hamid
Maybe it just doesn't have clusters?
Or they do not fit the model evaluated by Silhouette, Dunn, etc. These metrics can be quite misleading, in particular when you also have noise in your data set. Don't blindly trust such metrics.
The best way of seeing if your data can be clustered is visualization. If you can't visualize it in a way you see clusters, how can you expect an algorithm to return meaningful clusters?