Clustering and classification - classification

I need to perform clustering and classification on data stored in a CSV file. The data is plain text containing vendor names.
Is there a free library available for this task?
Thanks,
Ashish

I don't understand what you mean by "clustering and classification" as a single task, since the two are different from each other, but you can do both with these libraries:
Python: scikit-learn
Java: Weka
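
As a rough illustration of the scikit-learn route, here is a minimal sketch that groups similar vendor names. The sample names, the function name cluster_vendor_names, the choice of character n-gram tf-idf features, and the number of clusters are all assumptions made for the example; they are not from the question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_vendor_names(names, n_clusters=2, seed=0):
    """Return one cluster id per vendor name (hypothetical helper)."""
    # Character n-grams tolerate small spelling and punctuation differences.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3), lowercase=True)
    features = vectorizer.fit_transform(names)
    # n_clusters is a guess here; in practice it has to be tuned for the data.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features).tolist()


# Made-up vendor names standing in for one column of the CSV file.
names = ["Acme Corp", "ACME Corp.", "acme corp",
         "Globex Inc", "GLOBEX Inc.", "globex inc"]
labels = cluster_vendor_names(names, n_clusters=2)
for name, label in zip(names, labels):
    print(label, name)
```

Clustering like this needs no labels. If you already know the category of some vendors, that is a classification problem and a classifier trained on those labels is the better fit.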

First, convert your dataset from CSV to ARFF; the following link explains how.
http://www.cs.ccsu.edu/~markov/MDLclustering/MDLmanual.pdf
After doing this, please let me know what you expect from the data, since every algorithm in Weka produces somewhat different results.
Once the data is converted, you can simply apply k-means or any other algorithm.
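
If you would rather script the conversion, the following is a minimal sketch of a CSV to ARFF writer in Python. It assumes every column is free text and declares it as a string attribute; the function name csv_to_arff, the relation name, and the sample rows are made up for illustration.

```python
import csv
import io


def quote(value):
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def csv_to_arff(csv_text, relation="vendors"):
    """Convert CSV text with a header row to ARFF text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    # Assumption: every column holds free text, so each is a string attribute.
    for name in header:
        lines.append("@attribute " + quote(name) + " string")
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join(quote(value) for value in row))
    return "\n".join(lines) + "\n"


arff = csv_to_arff("vendor\nAcme Corp\nO'Neil Ltd\n")
print(arff)
```

Note that string attributes usually have to be turned into numeric ones before k-means will accept them, for example with Weka's StringToWordVector filter.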

Related

How to cluster a data set in WEKA

This is my homework question:
Use the OnlineRetail.arff from the Canvas. Pick one of the clustering algorithms to segment customers into different groups using Weka. Explain why you choose the method and visualize your result.
I feel like I have tried everything and I am getting nowhere. How do you determine which clustering algorithm to use? When I try to run them in WEKA, most of them are greyed out and give me errors. Do I have to manipulate the data in order to cluster it, and if so, how?
These are the attributes. They are a mix of string and numeric values. I keep getting errors that k-means and other clustering techniques cannot take strings. How do I combat this?
[Image: attributes]

Word to vector where should I start?

I'm trying to implement a neural network model on labeled data that I have. The data contains several columns (both categorical and numeric features).
A few columns in this data contain a short description written by users, which I also want to analyze, but I don't know how to start.
The data looks something like this:
problem ID   status   description                        labels
1            closed   short description of the problem   CRM
2            open     short description of the problem   ERP
3            closed   short description of the problem   CRM
Using status (which I will convert into dummy variables) and description (this is where I need your help), I want to train the model to predict the labels.
Any idea how I should start? How can I convert the description column into useful data?
Thanks!
This is basically a classification problem over mixed features. Encode the categorical variables into a trainable form. For the text, clean it first; if it contains many numbers, convert them to their word form. Then turn it into vectors using tf-idf or any other vectorization approach. Also normalize your numerical features, then train a simple SVM classifier on the result. If that does not give good accuracy, move to a CNN- or LSTM-based neural network; you can also try a CNN with embeddings for better results.
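
The first stage of that advice (dummy variables plus tf-idf feeding a simple SVM) could be sketched with scikit-learn and pandas as below. The column names follow the table in the question; the description texts, the function name build_label_classifier, and the choice of LinearSVC are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC


def build_label_classifier():
    """Encode status as dummies, description as tf-idf, then fit a linear SVM."""
    features = ColumnTransformer([
        # A list selector passes a 2-D frame, which OneHotEncoder expects.
        ("status", OneHotEncoder(handle_unknown="ignore"), ["status"]),
        # A plain string selector passes a 1-D column, which TfidfVectorizer expects.
        ("description", TfidfVectorizer(lowercase=True, stop_words="english"), "description"),
    ])
    return Pipeline([("features", features), ("svm", LinearSVC())])


# Made-up rows in the shape of the table from the question.
train = pd.DataFrame({
    "status": ["closed", "open", "open", "closed"],
    "description": [
        "customer contact record missing in sales pipeline",
        "invoice posting fails in inventory ledger",
        "sales lead not synced to customer account",
        "purchase order stuck in inventory ledger",
    ],
    "labels": ["CRM", "ERP", "CRM", "ERP"],
})

model = build_label_classifier()
model.fit(train[["status", "description"]], train["labels"])

new_rows = pd.DataFrame({
    "status": ["open"],
    "description": ["customer account shows wrong sales contact"],
})
predicted = model.predict(new_rows)
print(predicted)
```

Keeping the encoders inside one pipeline means the same transformations are applied at training and prediction time, and numeric columns can be added later as another ColumnTransformer entry with a scaler.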

Can we use autoencoders for text data?

I am doing a project based on health care. I am going to train my autoencoders on symptoms and diseases, i.e. my input is in textual form. Will that work? (I am using RStudio.) Can anyone please help me with this?
You have to convert the text to vectors of numbers. Traditional approaches such as bag of words and tf-idf will help, but more recent neural word embeddings such as Word2Vec, RNN language models, etc. are the best techniques for obtaining a numeric representation of text.
Use any neural word embedding technique to convert the text (word level with word2vec, document level with doc2vec) into vectors.
These vectors have some dimension, and to compress the representation to an even smaller dimension you can use an autoencoder.
Feel free to ask for any other information you need.
Try using Python for these tasks, as it has the latest packages.
You can use an autoencoder on textual data, as explained here.
Autoencoders have usually worked better on image data, but recent approaches have changed the autoencoder so that it also works well on text data.
Have a look at this.
The code is also available on GitHub.
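
To make the "vectorize, then compress with an autoencoder" idea concrete, here is a small sketch in Python. It swaps in simpler parts than the answers recommend: tf-idf vectors instead of word2vec or doc2vec, and scikit-learn's MLPRegressor trained to reproduce its own input instead of a dedicated deep learning library. The symptom texts, function names, and code size are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPRegressor


def train_text_autoencoder(texts, code_size=2, seed=0):
    """Fit tf-idf, then a one-hidden-layer network that reconstructs its input."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts).toarray()
    # Input and target are the same matrix; the narrow hidden layer is the code.
    autoencoder = MLPRegressor(hidden_layer_sizes=(code_size,), activation="relu",
                               max_iter=2000, random_state=seed)
    autoencoder.fit(X, X)
    return vectorizer, autoencoder


def encode(vectorizer, autoencoder, texts):
    """Return the hidden-layer (compressed) representation of each text."""
    X = vectorizer.transform(texts).toarray()
    hidden = X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0]
    return np.maximum(hidden, 0.0)  # same ReLU as the fitted hidden layer


# Made-up symptom descriptions standing in for the health care data.
texts = [
    "fever cough sore throat",
    "fever cough headache",
    "chest pain shortness of breath",
    "chest pain dizziness",
]
vectorizer, autoencoder = train_text_autoencoder(texts, code_size=2)
codes = encode(vectorizer, autoencoder, texts)
print(codes.shape)
```

With real data you would want far more text than this, and probably a framework that supports deeper encoders, but the shape of the workflow stays the same.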

Incorrectly clustered instances in Weka

I use the Weka tool for data mining. When I feed in the data set and cluster it using the SimpleKMeans algorithm, it displays the following statement.
Incorrectly clustered instances : 857.0 69.7883 %
Is it OK to proceed with that percentage? If not, please let me know how to reduce it.
If you have labels, then use them, and do not use clustering at all.
Clustering is meant for data where you do not have labels.
How do you plan to proceed?
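
For context, that statistic is reported when Weka compares the cluster assignments against a class attribute, which is why it only exists if you have labels. The sketch below shows the general idea in Python using a simplified majority mapping (each cluster is labelled with its most common class). Weka's own cluster-to-class mapping may differ, and the function name and sample data are made up.

```python
from collections import Counter, defaultdict


def incorrectly_clustered(cluster_ids, class_labels):
    """Count instances whose class differs from their cluster's majority class."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, class_labels):
        by_cluster[cluster_id][label] += 1
    # Simplification: every cluster is mapped to its most frequent class.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    wrong = len(cluster_ids) - correct
    return wrong, 100.0 * wrong / len(cluster_ids)


# Made-up assignments: two clusters, two classes.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "a"]
wrong, percent = incorrectly_clustered(cluster_ids, class_labels)
print(wrong, round(percent, 4))
```

A high percentage therefore means the clusters do not line up with the known classes, which supports the point above: with labels available, train a classifier on them rather than tuning the clustering.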

Learn weights of features

I want to use WEKA to learn the weights of the features that I am using to create clusters of documents. From each document I extract some features, but each feature has a different importance in the clustering method.
I have a training data set where each document is "represented" by its per-feature distance (similarity) from another document, with class 1 if the two belong to the same cluster and 0 otherwise.
How can I use WEKA to learn the weights with cross-validation?
Thank you,
Evi
First, it is not possible to add weights in the ARFF file format; the XRFF file format must be used instead. Weights can then be added to each individual instance or to an attribute.
Check out the following links for examples.
http://weka.wikispaces.com/XRFF#Additional%20features-Attribute%20weights
http://weka.wikispaces.com/Add+weights+to+dataset
http://weka.8497.n7.nabble.com/can-I-weight-an-attribute-in-the-arff-file-td22889.html