Machine Learning: How to handle discrete and continuous data together - matlab

I'm posting to ask whether there are any methodologies or ideas for handling discrete and continuous data together in a classification problem.
In my situation, I have a number of independent "batches" with discrete, process-related data, so each batch has a single set of values. I also have a second dataset for the same batches that varies with time; this time there are many time observations for every batch. The data sets look like below:
Data Set 1
Batch 1 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Batch 2 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Batch 3 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Batch 4 DiscreteInfo(1) DiscreteInfo(2) ....... DiscreteInfo(n)
Data Set 2
Batch 1 t(1) TimeData
Batch 1 t(2) TimeData
Batch 1 t(3) TimeData
Batch 1 t(4) TimeData
.
.
.
.
Batch n t(1) TimeData
Batch n t(2) TimeData
Batch n t(3) TimeData
I am trying to classify whether all this data belongs to a 'Good' batch, a 'Bad' batch, or a 'so-so' batch. This is determined by one specific discrete parameter (not used in the data sets).
I'm very new to machine learning; any input or ideas would be appreciated. I'm using the matlab classification learner to try to tackle this problem.

There are a few things you need to consider when dealing with a classification problem.
Training data. We need training data for classification, i.e. values for all the above-mentioned attributes along with the class label, whether it is 'Good', 'Bad' or 'so-so'.
Using this we can train a model, and then, given new data for the same attributes, we can predict which class it belongs to.
As far as discrete versus continuous is concerned, there is no fundamental difference in the way we handle them. In fact, for this case we can generate new attributes that are functions of all the time observations for a given batch (for example, per-batch summary statistics) and then perform the classification on one row per batch. If you provide an instance of the dataset, the question can be answered more precisely.
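To make the "one engineered row per batch" idea concrete, here is a small sketch in Python/pandas; the file and column names are placeholders for your own, and the same aggregation could equally be done in MATLAB before handing a table to the Classification Learner.

import pandas as pd

discrete = pd.read_csv("dataset1.csv")      # one row per batch: Batch, DiscreteInfo_1 ... DiscreteInfo_n (placeholder file)
timeseries = pd.read_csv("dataset2.csv")    # many rows per batch: Batch, t, TimeData (placeholder file)

# Collapse the per-batch time series into summary features (mean, spread, extremes).
summary = (timeseries.groupby("Batch")["TimeData"]
           .agg(["mean", "std", "min", "max"])
           .reset_index())
summary.columns = ["Batch", "time_mean", "time_std", "time_min", "time_max"]

# One row per batch, mixing the discrete attributes with the engineered time features.
per_batch = discrete.merge(summary, on="Batch")
# Add the Good/Bad/so-so label as a final column and this table is ready for any classifier.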

Related

keras fit_generator parameter steps_per_epoch

I want to use the Keras model.fit_generator method. For that I wrote my own generator, and for this method I need to define the parameter "steps_per_epoch". I want to use every training sample exactly once per epoch.
Now my problem is that I generate the features inside the generator: I read wav files and create the FFT, so before I start the training I don't know how many batches/samples I have. I could calculate the FFT for every file before calling fit_generator, but every time I change my dataset (>20 GB) I would need to recalculate the FFT for every file and save the count for steps_per_epoch. Is there a better way to make fit_generator use every sample exactly once without calculating steps_per_epoch beforehand? Or can my own generator tell fit_generator when to start a new epoch?
Here is the code for my generator
def my_generator(filename_list, batch_size):
    while True:
        for fname in filename_list:
            data, sr = librosa.load(fname)
            fft_result = librosa.core.stft(data)
            batches = features.create_batches(fft_result, batch_size)
            for i in range(len(batches)):
                yield (batches[i], label)

model.fit_generator(my_generator(filename_list=filename_list, batch_size=batch_size), steps_per_epoch=100, epochs=10)
For each file in the list you have to calculate an FFT that yields 'n' batches, where 'n' is different for each file. If this is the case, then:
The naive method is to loop through the batch generator once to count the actual number of batches. This process needs to be done only once, and you can save that number for future use as well.
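A rough sketch of that one-time count, reusing the generator's own preprocessing (librosa and the features.create_batches helper are taken from the question's snippet, not verified here):

import librosa

def count_batches(filename_list, batch_size):
    # Walk the dataset once and count how many batches the generator will yield per epoch.
    total = 0
    for fname in filename_list:
        data, sr = librosa.load(fname)
        fft_result = librosa.core.stft(data)
        total += len(features.create_batches(fft_result, batch_size))  # 'features' is the question's own helper module
    return total

steps = count_batches(filename_list, batch_size)
model.fit_generator(my_generator(filename_list, batch_size), steps_per_epoch=steps, epochs=10)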
The second method is to assign an arbitrary number to steps_per_epoch. That number should be greater than or equal to the number of files in the list multiplied by the number of batches each FFT can generate (an estimate is fine). This way, if you shuffle the data after the outer "for" loop completes, then after some number of epochs, statistically speaking, all of the training data will have been seen by the model. By using early stopping you can obtain a properly converged model, where "epochs" should be set to a very large value, 1000 for example.
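A minimal sketch of that second option (shuffling plus early stopping). librosa, features, label and model come from the question's snippet; val_generator, the patience value and the step counts are placeholders, not values from the question:

import random
from keras.callbacks import EarlyStopping

def my_shuffling_generator(filename_list, batch_size):
    while True:
        random.shuffle(filename_list)   # reshuffle the file order at the start of every pass
        for fname in filename_list:
            data, sr = librosa.load(fname)
            fft_result = librosa.core.stft(data)
            for batch in features.create_batches(fft_result, batch_size):
                yield (batch, label)

early_stop = EarlyStopping(monitor='val_loss', patience=5)
model.fit_generator(my_shuffling_generator(filename_list, batch_size),
                    steps_per_epoch=500,      # arbitrary; should be >= the true number of batches
                    epochs=1000,              # very large; early stopping ends training instead
                    validation_data=val_generator, validation_steps=50,
                    callbacks=[early_stop])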

Some questions about the train_test_split() function

I am currently trying to use Python's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: I wonder what the difference is between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size). Also, does the 2nd one (without setting random_state) give me a different test set and training set each time I re-run my program? And does the 1st one give me the same test set and training set every time I re-run my program? What is the difference between setting random_state=42 vs random_state=43, for example?
In Python, scikit-learn's train_test_split will split your input data into two sets: i) train and ii) test. It has an argument random_state which controls how the data is shuffled before the split.
If the argument is not given, every call shuffles the data with a fresh random seed, so you get a different split each time you re-run your program. If you pass a fixed random_state, you get exactly the same split for the same dataset on every run.
Assume you want to randomly split the data so that you can measure the performance of your regression on the same data with different splits; you can use random_state to achieve this. Each random_state value gives a different pseudo-random split of your initial data. To keep track of performance and reproduce it later on the same data, reuse the random_state value you used before.
This is useful for cross-validation techniques in machine learning.
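For illustration, a small runnable sketch of that difference using scikit-learn (the toy arrays are just placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.arange(10)                  # toy target

# Without random_state: a different train/test split every time this line runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# With a fixed random_state: exactly the same split on every run.
# 42 vs 43 simply select two different (but individually reproducible) shuffles.
X_tr42, X_te42, y_tr42, y_te42 = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr43, X_te43, y_tr43, y_te43 = train_test_split(X, y, test_size=0.2, random_state=43)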

How to pass dataset in Classification Learner App of Matlab

My question is about passing variables (training dataset, labels and test variable) as predictors and responses. What I do is load all 3 into the MATLAB workspace and start a session. But every time I get the error (described in the attached image), i.e. "No responses selected, select response variable". My dataset is as follows:
faces [size: 5000 x 10000 (5000 samples, 10000 features)]
TrainingLabels [size: 5000 x 1]
TestVariable [size: 1 x 10000]
Now what should be the predictors and responses in my case, and how can I use them correctly in order to make the Classification Learner app work?
Any kind of help regarding this matter will be highly appreciated. Thank you.
Step 1): Prepare the data!! If you have N samples of training data and M samples of test data, combine them to make M+N samples. The rows here represent each sample and the columns the different types of features detected from a sample.
Step 2): Add an extra column at the FIRST or (preferably) LAST position of the data: this column should hold the desired labels for the data. So now you will have: total no. of columns = no. of features + 1. While importing the data into the Classification Learner App, it is advised to import the data as a TABLE.
Step 3): Now, set up the data to be used by the Classification Learner App!! By default, all columns will be selected as predictors. The app will prompt you to select the response. The response is the extra column you added (the label column), so change that column so it is marked as the response.
Step 4): Before starting the session, you need to choose the cross-validation strategy. [A k-fold validation divides the total M+N samples into k parts and begins by taking the first part for testing and the remaining k-1 parts for training. Then it takes the second part for testing and the remaining k-1 parts for training, and so on. Finally, the average of all the accuracies obtained is taken as the final accuracy.]
Step 5): Start the session, select the classifier you want, and hit the Train button!!!
Choose one of the columns as the response; that is, change it from predictor to response in the "Import as" dropdown.

How to get the dataset size of a Caffe net in python?

I am looking at the Python example for LeNet and see that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. However, can this value not be hard-coded at all? How can I get the number of samples of the dataset that a network points to, in Python?
You can use the lmdb library to access the LMDB database directly:
import lmdb
db = lmdb.open('/path/to/lmdb_folder')        # needs the lmdb Python package
num_examples = int(db.stat()['entries'])      # number of entries = number of samples
Should do the trick for you.
It seems that you have mixed up iterations and the number of samples in one question. In the provided example we can see only the number of iterations, i.e. how many times the training phase will be repeated. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network input).
Some more detailed explanation:
EDIT: Caffe will load a total of (batch size x iterations) samples for training or testing, but there is no relation between the number of loaded samples and the actual database size: it will start reading from the beginning again after reaching the last record of the database - in other words, the database in Caffe acts like a circular buffer.
The mentioned example points to this configuration. We can see that it expects lmdb input and sets the batch size to 64 (some more info about batches and BLOBs) for the training phase and 100 for the testing phase. Really, no assumption is made about the input dataset size, i.e. the number of samples in the dataset: the batch size is only the processing chunk size, and iterations is how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't point to any number of samples in the database - only to the dataset name and format and the desired number of samples per batch. As far as I know, there is currently no way to determine the database size through Caffe itself.
Thus, if you want to run over the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify the corresponding values for batch size and iterations. For example, the MNIST test set has 10,000 images, so with a test batch size of 100 you need 100 test iterations to cover it (100 x 100 = 10,000).
You have some options for this:
Look at the ./examples/mnist/create_mnist.sh console output - it prints the number of samples while converting from the initial format (I believe that you followed this tutorial);
follow @Shai's advice (read the lmdb file directly).
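Putting the two together, a rough sketch that derives the iteration count from the LMDB itself (the path and batch size are just examples matching the LeNet tutorial):

import math
import lmdb

db = lmdb.open('examples/mnist/mnist_test_lmdb', readonly=True)
num_examples = int(db.stat()['entries'])          # 10000 for the standard MNIST test set

test_batch_size = 100                             # must match batch_size in the test data layer
test_iter = int(math.ceil(num_examples / float(test_batch_size)))
print(num_examples, test_iter)                    # use this value for test_iter in the solver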

How to use KNN to classify data in MATLAB?

I'm having problems understanding how K-NN classification works in MATLAB.
Here's the problem: I have a large dataset (65 features for over 1500 subjects) and the respective class labels (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio for dividing the data into the 3 subgroups (1/3 of the size of the dataset each)?
I've looked into the ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide the data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get its classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats=60;ninds=1000;
trainRatio=0.8;valRatio=.1;testRatio=.1;
kmax=100; %for instance...
data=randi(100,nfeats,ninds);
class=randi(2,1,ninds);
[trainInd,valInd,testInd] = dividerand(1000,trainRatio,valRatio,testRatio);
train=data(:,trainInd);
test=data(:,testInd);
val=data(:,valInd);
train_class=class(:,trainInd);
test_class=class(:,testInd);
val_class=class(:,valInd);
precisionmax=0;
koptimal=0;
for know=1:kmax
    %is it the same thing to use knnclassify or fitcknn+predict??
    predicted_class = knnclassify(val', train', train_class', know);
    mdl = fitcknn(train', train_class', 'NumNeighbors', know);
    label = predict(mdl, val');
    consistency = sum(label==val_class')/length(val_class);
    if consistency>precisionmax
        precisionmax = consistency;
        koptimal = know;
    end
end
mdl_final = fitcknn(train', train_class', 'NumNeighbors', koptimal); % use the selected k, not the loop variable
label_final = predict(mdl_final, test');                             % predict with the final model, not mdl
consistency_final = sum(label_final==test_class')/length(test_class);
Thank you very much for all your help
For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
The amount of training data is most important. The more the better. Thus, make it as big as possible and definitely bigger than the test or validation data.
Test and validation data have a similar function, so it is convenient to assign them the same amount of data. But it is important to have enough data to be able to recognize over-adaptation, so both sets should be picked fully at random from the underlying dataset.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you were to completely stop your efforts to improve accuracy at this point, you would indeed not need three partitions of your data. But typically you feel that you can improve the success of your model by e.g. changing the number of weights or hidden layers or ..., and now a big loop starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is that you increasingly adapt your model to the validation data, so the results get better not because you are so intelligently improving your topology but because you are unconsciously learning the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on really unseen data: the test set. This is done only once, and it is also useful for revealing over-adaptation. You are not allowed to start a second, even bigger loop now, in order to prevent any adaptation to the test set!