Word to vector where should I start? - neural-network

I'm trying to implement a neural network model on labeled data that I have. The data contains several columns (both categorical and numeric features).
A few columns in this data contain a short description written by users, which I also want to analyze, but I don't know how to start.
The data looks something like this:
problem ID   status   description                        labels
1            closed   short description of the problem   CRM
2            open     short description of the problem   ERP
3            closed   short description of the problem   CRM
Using status (which I will convert into dummy variables) and description (this is where I need you guys), I want to train the model to predict the labels.
Any idea how I should start? How can I convert the description columns into useful data?
Thanks!

This is basically a classification problem over mixed features. Encode the categorical variables into a trainable form (for example, dummy variables). For the text, clean it first; if it contains many numbers, convert them to their word form. Then vectorize it using tf-idf or another vectorization approach. Also normalize your numerical features, and then train a simple SVM classifier on the combined features. If that doesn't give good accuracy, move on to a CNN- or LSTM-based neural network; you can also try a CNN with embeddings for better results.
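A minimal sketch of that baseline in Python with scikit-learn, assuming the data is loaded into a pandas DataFrame. The rows below are made up for illustration and only mirror the column names from the question. There is no numeric feature in this toy table, so the normalization step is omitted; with real numeric columns a scaler could be added as a third entry in the ColumnTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Made-up rows in the shape of the question's table (not real data).
df = pd.DataFrame({
    "status": ["closed", "open", "closed", "open", "open", "closed"],
    "description": [
        "customer contact record missing",
        "invoice posting fails in ledger",
        "cannot merge duplicate customer accounts",
        "purchase order stuck in ledger approval",
        "customer email campaign not sent",
        "inventory ledger totals do not match",
    ],
    "labels": ["CRM", "ERP", "CRM", "ERP", "CRM", "ERP"],
})

# One-hot encode the categorical column, tf-idf vectorize the free text.
features = ColumnTransformer([
    ("status", OneHotEncoder(handle_unknown="ignore"), ["status"]),
    ("text", TfidfVectorizer(lowercase=True, stop_words="english"), "description"),
])

# A linear SVM on top of the combined feature matrix.
model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(df[["status", "description"]], df["labels"])

new_rows = pd.DataFrame({
    "status": ["open"],
    "description": ["ledger invoice totals wrong"],
})
prediction = model.predict(new_rows)
print(prediction)
```

Keeping the encoders inside the pipeline means new rows are transformed with the vocabulary and categories learned from the training data.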

Related

Input values of an ANN constructed with keras framework (using theano)

I want to construct a neural network which will be trained on data I create. My question is: what form should this data have? In other words, does Keras allow neural networks that take strings/characters as input? If not, and it can only accept numbers, in what range should the input/output be?
The only condition for your input data, i.e. the features, is that it should be numerical. There isn't really any constraint on the range, but it's always a good idea to apply feature scaling or normalization so that features on very different scales don't dominate training. Neural networks and other machine learning methods cannot accept strings (characters, words) directly, so you first need to convert strings to numbers. There are many ways to do that; the most common techniques include bag of words, tf-idf features and word embeddings.
The following tutorials (using scikit-learn) might be a good starting point:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
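As a rough illustration of turning strings into numbers, here is a small bag-of-words sketch with scikit-learn, followed by min-max scaling of a numeric feature. The sentences and the numeric values are invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler

# Invented example sentences.
texts = ["the cat sat", "the dog sat", "the dog barked"]

# Bag of words: one column per vocabulary word, one row per text.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts).toarray()

# Vocabulary words ordered by their column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(bow)

# Scaling a numeric feature into the range [0, 1].
raw = np.array([[10.0], [200.0], [3000.0]])
scaled = MinMaxScaler().fit_transform(raw)
print(scaled.ravel())
```

The resulting arrays are plain numeric matrices, which is the form a Keras model (or any other estimator) expects as input.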

can we use autoencoders for text data

I am doing a project based on health care. I am going to train my autoencoders with symptoms and diseases, i.e. my input is in textual form. Will that work? (I am using RStudio.) Please, can anyone help me with this?
You have to convert the text to vectors/numbers. Traditional approaches like bag of words and tf-idf will help with this, but newer neural word embeddings such as Word2Vec or RNN language models generally give better numeric representations of text.
Use any neural word embedding technique to convert the text (word level [word2vec], document level [doc2vec]) into numbers/vectors.
These vectors come with some dimension, and to compress this representation to an even smaller dimension you can use an autoencoder.
Feel free to ask for any other information required.
Try using Python for these tasks, as it has the latest packages.
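A small sketch of the "vectorize, then compress" idea, under two loud assumptions: tf-idf vectors stand in for the word2vec/doc2vec embeddings mentioned above, and scikit-learn's MLPRegressor trained to reproduce its own input stands in for a dedicated autoencoder framework. The symptom strings are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPRegressor

# Invented symptom descriptions (not real medical data).
docs = [
    "fever cough sore throat",
    "fever headache body ache",
    "chest pain shortness of breath",
    "cough shortness of breath wheezing",
    "headache nausea sensitivity to light",
    "nausea vomiting stomach pain",
]

# Step 1: text -> numeric vectors (tf-idf here, as a stand-in for embeddings).
X = TfidfVectorizer().fit_transform(docs).toarray()

# Step 2: a single-hidden-layer network trained to reconstruct its input.
# The 3-unit hidden layer is the compressed representation (bottleneck).
autoencoder = MLPRegressor(
    hidden_layer_sizes=(3,), activation="relu", max_iter=2000, random_state=0
)
autoencoder.fit(X, X)

# Step 3: the encoder half is the first layer followed by ReLU.
codes = np.maximum(0, X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])
print(X.shape, "->", codes.shape)
```

With real data the bottleneck size and the vectorization method would need tuning; six documents are far too few to learn anything meaningful.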
You can use an autoencoder on textual data, as explained here.
Autoencoders have usually worked better on image data, but recent approaches have adapted them so that they also work well on text data.
Have a look at this.
The code is also available on GitHub.

Training SVM classifier in MATLAB with numeric+text data

I want to train an SVM classifier in MATLAB for threat detection. The training data is in an Excel file and contains both numeric and text fields/columns. When I import this data into MATLAB, it is in either table or cell format. How do I convert it to matrix format?
P.S.: The xlsread function does not import text data.
There are four types of attributes in data: numerical, discrete, nominal and ordinal. Here you can read more about them. First, run a statistical analysis of each feature in your dataset to get the basic statistics, such as mean, median, max, min and variable type, and, for nominal or ordinal variables, the distinct values. That gives you a pretty good idea of what you are dealing with. Then, according to the variable type, you can decide which vectorization to use. If it is a numerical variable, you can apply feature scaling or divide it into classes (bins). If it is an ordinal variable, you can assign numbers that follow its logical order. If it is a nominal variable, you can assign a distinct numeric code to each value. Here, you are just checking how much impact each feature has on the final prediction.
My advice: use the Weka GUI too, to visualize the data. Then you can preprocess the data column by column.
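The per-type handling described above could look roughly like this in Python with pandas (used here instead of MATLAB or Weka). The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical columns, one per attribute type.
df = pd.DataFrame({
    "packet_size": [120, 980, 455, 60],            # numerical
    "severity": ["low", "high", "medium", "low"],  # ordinal
    "protocol": ["tcp", "udp", "tcp", "icmp"],     # nominal
})

# Basic statistics for every column (count, unique values, mean, min, max, ...).
summary = df.describe(include="all")
print(summary)

# Ordinal: map to integers that respect the logical order.
severity_order = ["low", "medium", "high"]
df["severity_code"] = pd.Categorical(
    df["severity"], categories=severity_order, ordered=True
).codes

# Nominal: give each distinct value its own integer code.
df["protocol_code"] = df["protocol"].astype("category").cat.codes

# Numerical: min-max scaling into [0, 1].
size = df["packet_size"]
df["packet_size_scaled"] = (size - size.min()) / (size.max() - size.min())
print(df)
```

Integer codes for a nominal column imply an ordering that does not exist, so for an SVM it is often safer to use dummy (one-hot) columns for such variables.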
You need to transform your text fields into numeric ones using dummy variables or another technique, or drop them entirely if they are actually IDs (e.g. patient name for medical data, record number, respondent UUID for a survey, etc.).
This would actually be easier in R or Python+Pandas, but in MATLAB you will need to perform the encoding yourself, working from the cell array towards a matrix. Or you can try this toolbox.
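For comparison, a sketch of the Python+Pandas route: drop the ID column, expand the text column into dummy variables, and convert the result into a plain numeric matrix. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; in practice these would come from pd.read_excel(...).
df = pd.DataFrame({
    "record_id": ["a1", "b2", "c3"],   # identifier, carries no signal
    "protocol": ["tcp", "udp", "tcp"], # text field to encode
    "packet_size": [120, 980, 455],    # already numeric
    "threat": [0, 1, 0],               # target label
})

# Drop the ID and the target, then expand the text column into dummies.
features = df.drop(columns=["record_id", "threat"])
encoded = pd.get_dummies(features, columns=["protocol"], dtype=float)

# Plain numeric matrix and label vector, ready for an SVM.
X = encoded.to_numpy(dtype=float)
y = df["threat"].to_numpy()
print(list(encoded.columns))
print(X)
```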

neural network data for training and testing

I have a question regarding training and testing data for my ANN.
Should the testing data go through a feature extraction process before it can be classified?
I am new to this field. Is what I am doing right?
I separate the dataset into 80% train and 20% test. I extract the features from both sets. I feed the train data into the network for training, but not the test data. Then I go to classification. Is this correct? I ask because my SV said the test data should not go through the feature extraction process. I am wondering how the ANN can recognize the input if no specific features are extracted. Apologies for my bad English.
If anyone has a link or journal that I can refer to, please provide it.
Thanks a lot.
Both the training and the test data need to be in the same format - thus your training data and test data should go through the same pre-processing steps, otherwise the network is tested on inputs unlike those it learned from.
You are doing it right (as far as I understand your question).
Example: If you were to show me 10 images of faces (training data) on paper and then present me 2 people (test data) by their name only (different feature representation) - I wouldn't be able to classify what I didn't learn. You can't train the network with images and then test it with audio or any representation other than the one you used for training. I can't link any papers for that as it's just common sense.
You can modify the training set, e.g. by adding noise. But whatever you do, the representation format has to be the same.
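A sketch of this in Python with scikit-learn, assuming standardization plays the role of the feature extraction step and using synthetic data. The preprocessing is fitted on the training split only and then applied unchanged to both splits; noise is added to the training set only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 4 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * [1.0, 10.0, 100.0, 1000.0]
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

# 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the preprocessing on the training data, apply it to BOTH sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Optional augmentation: noise goes on the training set only.
X_train_noisy = X_train_scaled + rng.normal(scale=0.05, size=X_train_scaled.shape)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train_noisy, y_train)
accuracy = clf.score(X_test_scaled, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training split alone also avoids leaking test-set statistics into training.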

How to train SVM in matlab for character recognition?

I'm a final-year student working on my major project. My project is basically to extract text from a natural scene, recognize it, and then display it in a notepad etc.
I have already extracted the text from the images and have also obtained 85 features for each extracted character.
However, for the recognition part, I have no clue as to how to train or use an SVM (support vector machine) in MATLAB so I can get a match.
Please help me out, as this is turning out to be painstakingly difficult.
If you're happy with using an existing SVM implementation, then you should either use the bioinformatics toolbox svmtrain, or download the MATLAB version of libsvm. If you want to implement an SVM yourself, then you should understand SVM theory, and you can use quadprog to solve the appropriate optimisation problem.
With your data, you will need an N-by-85 feature matrix, where N is the number of characters, and an N-by-1 array of 'true labels' which you provide manually. Depending on which tool you use to train an SVM, the parameters to svmtrain are slightly different - check the documentation.
If you want to evaluate your SVM to show that it works, you may need to organise your data so that you can estimate the generalization error of the classifier - see cross-validation.
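The same data layout and evaluation can be sketched in Python with scikit-learn (a stand-in here, not the MATLAB svmtrain or libsvm interface). The feature matrix is random placeholder data with the N-by-85 shape described above, and the three classes are an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: N characters, 85 features each, 3 character classes.
rng = np.random.default_rng(0)
n_chars, n_features, n_classes = 150, 85, 3
y = np.repeat(np.arange(n_classes), n_chars // n_classes)  # N 'true labels'
class_centers = rng.normal(size=(n_classes, n_features))
X = class_centers[y] + rng.normal(scale=1.0, size=(n_chars, n_features))  # N-by-85

# Scale the features, then train an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation estimates the generalization accuracy.
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())
```

Each fold trains on four fifths of the characters and scores on the remaining fifth, so every sample is used for testing exactly once.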