Training SVM classifier in MATLAB with numeric+text data - matlab

I want to train an SVM classifier in MATLAB for threat detection. The training data is in an Excel file and contains both numeric and text fields (columns). When I import this data into MATLAB, it ends up as either a table or a cell array. How do I convert it to matrix format?
P.S.: Using the xlsread function does not import the text data.

There are four types of attributes in data: numerical, discrete, nominal and ordinal. Here you can read more about them. First, run a statistical analysis of each feature in your dataset to learn the basic statistics, such as mean, median, max, min and variable type, and, for nominal or ordinal variables, the distinct values. You then have a pretty good idea of what you are dealing with. Then, according to the variable type, you can decide which vectorization to use. If it is a numerical variable, you can bin it into classes and apply feature scaling. If it is an ordinal variable, you can assign codes that follow its logical order. If it is a nominal variable, you can give each distinct value its own numeric code. Here, you are just checking how much each feature contributes to the final prediction.
My advice: use the Weka GUI to visualize the data. Then you can preprocess the data column by column.
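To make the per-type handling above concrete, here is a minimal sketch in plain Python (used only for illustration; the same steps carry over to MATLAB). The rows, column names and the severity ordering are made up for the example and are not from the asker's data.

```python
from statistics import mean, median

# Hypothetical toy rows: (packet_size, severity, protocol).
rows = [
    (120.0, "low", "tcp"),
    (560.0, "high", "udp"),
    (300.0, "medium", "tcp"),
    (980.0, "high", "icmp"),
]

# Basic statistics for the numerical column.
sizes = [r[0] for r in rows]
summary = {"mean": mean(sizes), "median": median(sizes),
           "min": min(sizes), "max": max(sizes)}

# Numerical feature: min-max scaling to [0, 1].
lo, hi = min(sizes), max(sizes)
scaled = [(s - lo) / (hi - lo) for s in sizes]

# Ordinal feature: integer codes that preserve the logical order (assumed order).
severity_order = {"low": 0, "medium": 1, "high": 2}
severity_codes = [severity_order[r[1]] for r in rows]

# Nominal feature: each distinct value gets its own code; the order means nothing.
protocols = sorted({r[2] for r in rows})
protocol_codes = [protocols.index(r[2]) for r in rows]
```

Note that integer codes for a nominal column imply an ordering that is not really there, which is why dummy variables are often preferred for SVMs.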

You need to transform your text fields into numeric ones using dummy variables or another technique, or drop them entirely if they are actually IDs (e.g. patient name for medical data, record number, respondent UUID for a survey, etc.).
This would actually be easier in R or Python+Pandas, but in MATLAB you will need to perform the encoding yourself, working from the cell array towards a matrix. Or you can try this toolbox.
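As a rough illustration of the dummy-variable idea, here is a small sketch in plain Python (no Pandas needed). The records and the helper name one_hot_columns are hypothetical; the point is only the shape of the result: the ID column is dropped and each text value becomes a 0/1 column.

```python
def one_hot_columns(values):
    """Return (categories, rows); each row is a 0/1 indicator vector."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

# Hypothetical records: (record_id, duration, protocol).
# record_id is just an identifier, so it is left out of the matrix.
records = [
    ("r001", 0.5, "tcp"),
    ("r002", 2.0, "udp"),
    ("r003", 1.2, "tcp"),
]

categories, dummies = one_hot_columns([r[2] for r in records])

# Numeric matrix: one row per record, the numeric column followed by the dummy columns.
X = [[r[1]] + d for r, d in zip(records, dummies)]
```

The resulting X is a purely numeric matrix with one row per record, which is the form an SVM trainer expects.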

Related

Word to vector where should I start?

I'm trying to implement a neural networks model on labeled data that I have. The data contains several columns (categorical and numeric features as well).
A few columns in this data contain a short description written by users, which I also want to analyze, but I don't know how to start.
The data looks something like this:
problem ID   status   description                        labels
1            closed   short description of the problem   CRM
2            open     short description of the problem   ERP
3            closed   short description of the problem   CRM
Using status (which I will convert into dummy variables) and description (this is where I need you guys), I want to train the model to predict the labels.
Any idea how I should start? How can I convert the description column into useful data?
Thanks!
You basically want to classify based on the features. Encode the categorical variables into some trainable form. For the text, first perform cleaning; if it contains many numbers, convert them into their word form. Then turn the text into vectors using tf-idf or any other vectorization approach. Also normalize your numerical features, and then train a simple SVM classifier on the result. If that does not give good accuracy, move on to a CNN- or LSTM-based neural network; you can also try CNN+embeddings for better results.
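To show what the tf-idf step produces, here is a minimal plain-Python sketch. The example descriptions, the tokenizer and the smoothed idf formula are my own assumptions (libraries differ in the exact weighting), so treat it as an illustration of the idea rather than a reference implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Basic cleaning: lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed idf, so terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([counts[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, vectors

# Hypothetical descriptions standing in for the "description" column.
docs = [
    "Login page crashes on submit",
    "Invoice totals are wrong",
    "Login fails after password reset",
]
vocab, vectors = tfidf(docs)
```

Each description becomes one fixed-length numeric vector, which can be concatenated with the status dummy variables before training the classifier.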

How to use multiple labels as targets in Neural Net Pattern Recognition Toolbox?

I am trying to use the Neural Net Pattern Recognition toolbox in MATLAB to recognize different classes in my dataset. I have a 21392 x 4 table, where columns 1-3 are the predictors I would like to use and the 4th column holds the labels, with 14 different categories (strings like Angry, Sad, Happy, Neutral etc.). It seems that the Neural Net Pattern Recognition toolbox, unlike the MATLAB Classification Learner toolbox, doesn't allow me to import the table and automatically extract the predictors and responses from it. Moreover, I am unable to specify the inputs and targets to the neural network manually, as they don't show up in the options.
I looked into the examples like the Iris Dataset, Wine Dataset, Cancer Dataset etc., but all of them only have 2-3 output classes to be identified (encoded in binary like 000, 010, 011 etc.), and their labels are not strings, unlike mine (Angry, Sad, Happy, Neutral etc., 14 different classes in total). I would like to know how I can use my table as input to the neural network pattern recognition toolbox, or otherwise, any way in which I can extract the data from my table and use it in the toolbox. I am new to using the toolbox, so any help in this regard would be highly appreciated. Thanks!
The first step in using the Neural Net Pattern Recognition Toolbox is to convert the table to a numeric array, as neural networks work only with numeric arrays, not with other datatypes directly. Assuming the table is named my_table, it can be converted to an array using
my_table_array = table2array(my_table);
From my_table_array, the inputs (predictors, columns 1-3) and the outputs/targets (labels, column 4) can be extracted. Note that the inputs and outputs need to be transposed, because the toolbox expects the data in column format (each column is one datapoint and each row is one feature). This is easily done using:
inputs = inputs'; %(now of dimensions 3x21392)
labels = labels'; %(now of dimensions 1x21392)
The string type labels (categorical) can be converted to numeric values using a one-hot encoding technique with categorical, followed by ind2vec:
my_table_vector = ind2vec(double(categorical(labels)));
Now, the my_table_vector (final targets) and inputs (final input predictors) can easily be fed to the neural network and used for classification/prediction of the target labels.
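For readers who want to see what the one-hot target layout looks like, here is a language-neutral sketch in plain Python. The label list is a made-up sample, and the alphabetical class order is an assumption of this sketch; it only illustrates the shape (one row per class, one column per sample), not the toolbox functions themselves.

```python
# Hypothetical sample of the label column.
labels = ["Angry", "Sad", "Happy", "Sad", "Neutral"]

# Assign each class an index (alphabetical order is assumed here).
classes = sorted(set(labels))
index = {c: i for i, c in enumerate(classes)}

# targets[i][j] is 1 when sample j belongs to class i:
# one row per class, one column per sample.
targets = [[1 if index[lab] == i else 0 for lab in labels]
           for i in range(len(classes))]
```

With 14 classes and 21392 samples, the same layout would give a 14 x 21392 target matrix with exactly one 1 in every column.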

Auto-encoder based unsupervised clustering

I am trying to cluster a dataset using an encoder, and since I am new to this field I can't tell how to do it. My main issue is how to define the loss function, since the dataset is unlabeled and, up to now, what I have seen in the bibliography defines the loss function as the distance between the desired output and the predicted output. My question is: since I don't have a desired output, how should I implement this?
You can use an autoencoder to pre-train your convolutional layers, as described in my question here about using a convolutional autoencoder for images.
As you can see from the code, the model is trained with the Adam optimizer (Adam is an optimizer rather than a loss function), with accuracy and the Dice coefficient as metrics. I think you can use accuracy only, since the Dice coefficient is image-specific.
I'm not sure how it will work for you, because you haven't explained how you will transform your bibliography lists into vectors. Perhaps you will create a list of bibliography IDs sorted by the cosine distance between them.
For example, for each reference in your dataset you could use a vector of cosine distances to each item in the bibliography list above, and use it as input to the autoencoder.
Once the encoder is trained, you can remove the decoder part of your model and use the encoder output as input to an unsupervised clustering algorithm, for example k-means. You can find details about them here.
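On the original question of which loss to use without labels: an autoencoder is trained to reproduce its own input, so the input itself plays the role of the desired output. A common choice (an assumption on my part, not taken from the linked code) is the mean squared reconstruction error. The symbols below are illustrative: f is the encoder, g the decoder, x_i the i-th sample and z_i its code.

```latex
\mathcal{L}(\theta, \phi)
  = \frac{1}{N} \sum_{i=1}^{N}
    \left\| x_i - g_{\phi}\!\left( f_{\theta}(x_i) \right) \right\|^{2},
\qquad
z_i = f_{\theta}(x_i)
```

After training, the codes z_i are what you would pass to k-means.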

Pattern recognition techniques that allow input as sequences of different lengths

I am trying to classify water end-use events, expressed as time-series sequences, into appropriate categories (e.g. toilet, tap, shower, etc.). My first attempt using an HMM shows quite promising results, with an average accuracy of 80%. I just wonder if there are any other techniques that, like HMM, accept time-series sequences of different lengths as training input, rather than a feature vector extracted from each sequence. I have tried Conditional Random Fields (CRF) and SVM; however, as far as I know, these two techniques require the input to be a pre-computed feature vector, and all input vectors must have the same length for training. I am not sure if I am right or wrong on this point. Any help would be appreciated.
Thanks, Will

Using MNIST DATABASE for digits recognition.

I am trying to use the MNIST database to recognize handwritten digits. What I have so far is a binary matrix that represents the digit; the algorithm is written in MATLAB. I would love some help getting started with using the MNIST database to recognize the digit from the binary image.
Thanks.
If you are using MATLAB and already have the binary images, you now need to:
1) Extract features from the images (you have many choices). For example, you can start by using the raw pixels: convert each image matrix into a row vector.
(Use part of the data for training and the rest for testing.)
Create a feature matrix from all these row vectors. Each row will be an "instance" in your feature matrix.
2) Now you can select and try different classifiers. Try, for example, an SVM (Support Vector Machine). The most basic way is to use the svmtrain and svmclassify functions; note that newer MATLAB releases replace them with fitcsvm and predict, and that fitcecoc handles problems with more than two classes, such as the ten digits. The usage is simple and well explained in MATLAB's help.
3) Test different partitions of the data.
4) Experiment with other features and classifiers.
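The steps above can be sketched end to end in a few lines. This is plain Python for illustration only, with made-up 3x3 "digits" instead of real 28x28 MNIST images, and a 1-nearest-neighbour rule standing in for the SVM so the example stays self-contained; swap in your classifier of choice.

```python
def flatten(image):
    """Turn a 2-D binary image (list of rows) into one feature row."""
    return [pixel for row in image for pixel in row]

def hamming(a, b):
    """Number of pixels in which two binary feature rows differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(train_X, train_y, x):
    """Label of the training instance closest to x (stand-in classifier)."""
    best = min(range(len(train_X)), key=lambda i: hamming(train_X[i], x))
    return train_y[best]

# Hypothetical tiny images and their noisy variants.
one = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
zero = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
noisy_one = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
noisy_zero = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]

# Step 1: feature matrix, one row ("instance") per image.
X = [flatten(img) for img in (one, zero, noisy_one, noisy_zero)]
y = [1, 0, 1, 0]

# Use part of the data for training and the rest for testing.
train_X, train_y = X[:2], y[:2]
test_X, test_y = X[2:], y[2:]

# Step 2: classify the held-out instances and measure accuracy.
predictions = [predict_1nn(train_X, train_y, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

Changing which instances land in the training and test parts corresponds to step 3, and replacing flatten with another feature extractor corresponds to step 4.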