How to fix the issue "ValueError: Found input variables with inconsistent numbers of samples" in Python - classification

I have two files, namely data [3806, 2] and target [4039, 2]. My aim is to split these files into training and testing datasets. I have already tried:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2, random_state=0)
However, it gives the error:
ValueError: Found input variables with inconsistent numbers of samples: [3806, 4039]
What is the best way to resolve this issue so that I can run the classification algorithms?

There is some confusion about what train_test_split does: its goal is to split the data into a training set and a test set. Usually the input data is made of two arrays: X for the features and y for the labels, and the function randomly selects a given proportion of rows as the test set, leaving the rest as the training set. Example:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
The rows used for the test set are the same in X and y, so that the output arrays still contain the correct label for each instance: if, for instance, rows 3, 5, 6 and 8 are selected, they are selected in both arrays, and the features in row 3 still correspond to the label in row 3. This is why the two input arrays must have the same length.
In your code you pass two arrays of different lengths, hence the error.
Either you want to use one of the arrays as the training set and the other as the test set, in which case you don't need to call train_test_split at all, since your data is already split.
Or you want to randomly select a test set from the union of these two datasets, in which case you should concatenate the rows of the two arrays before calling train_test_split, as in the sketch below.
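For the second case, a minimal sketch (assuming, just for illustration, that the first column of each array holds the feature and the second column holds the label):
import numpy as np
from sklearn.model_selection import train_test_split

combined = np.vstack([data, target])   # stack the rows: shape (3806 + 4039, 2)
X = combined[:, :1]                    # assumed feature column
y = combined[:, 1]                     # assumed label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)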

"train and test data sets are not equal in size:"
I presume you have already generated an X (features) set and a y (labels) set that contain the same number of samples. If not, check that step again!
For example:
features = Sample_Set.drop('X', axis=1)
labels = Sample_Set['X']
Now, you'll generate your X_train, X_test, y_train, y_test
*** Don't forget that train_test_split can only split the dataset into two parts.
If you used test_size=0.33, you would keep 67% of the data for training and leave only 33% for the validation and test sets combined, which is awkward to split further.
Why not take test_size=0.4, so that you have 60% for training and the remaining 40% for validation and testing, which can easily be divided in half?
This way we will have 60% of the data as the "train set" and 40% as the "test set":
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
Then change test_size to 50%, so that half of that 40% goes to our "test set" and the other half goes to our "validation set":
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
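If it helps, a quick sanity check of the resulting proportions (assuming the two calls above have been run):
print(len(X_train) / len(features))   # ~0.6  training set
print(len(X_val) / len(features))     # ~0.2  validation set
print(len(X_test) / len(features))    # ~0.2  test set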

Related

Caffe classification labels in HDF5

I am finetuning a network. In a specific case I want to use it for regression, which works. In another case, I want to use it for classification.
For both cases I have an HDF5 file with a label. With regression, this is just a 1-by-1 numpy array that contains a float. I thought I could use the same label for classification, after changing my EuclideanLoss layer to SoftmaxLoss. However, I then get a negative loss, like so:
Iteration 19200, loss = -118232
Train net output #0: loss = 39.3188 (* 1 = 39.3188 loss)
Can you explain if, and if so what, goes wrong? I do see that the training loss is about 40 (which is still terrible), but does the network still train? The negative loss just keeps getting more negative.
UPDATE
After reading Shai's comment and answer, I have made the following changes:
- I made the num_output of my last fully connected layer 6, as I have 6 labels (used to be 1).
- I now create a one-hot vector and pass that as a label into my HDF5 dataset as follows
f['label'] = numpy.array([1, 0, 0, 0, 0, 0])
Trying to run my network now returns
Check failed: hdf_blobs_[i]->shape(0) == num (6 vs. 1)
After some research online, I reshaped the vector to a 1x6 vector. This led to the following error:
Check failed: outer_num_ * inner_num_ == bottom[1]->count() (40 vs. 240)
Number of labels must match number of predictions; e.g., if softmax axis == 1
and prediction shape is (N, C, H, W), label count (number of labels)
must be N*H*W, with integer values in {0, 1, ..., C-1}.
My idea is to add 1 label per data sample (image), and in my train.prototxt I create batches. Shouldn't this create the correct batch size?
Since you moved from regression to classification, you need to output not a scalar to compare with "label" but rather a probability vector of length num-labels to compare with the discrete class "label". You need to change the num_output parameter of the layer before "SoftmaxWithLoss" from 1 to num-labels.
I believe you are currently accessing uninitialized memory, and I would expect caffe to crash sooner or later in this case.
Update:
You made two changes: num_output 1-->6, and you also changed your input label from a scalar to a vector.
The first change was the only one you needed for using "SoftmaxWithLossLayer".
Do not change label from a scalar to a "hot-vector".
Why?
Because "SoftmaxWithLoss" basically looks at the 6-vector prediction you output, interpret the ground-truth label as index and looks at -log(p[label]): the closer p[label] is to 1 (i.e., you predicted high probability for the expected class) the lower the loss. Making a prediction p[label] close to zero (i.e., you incorrectly predicted low probability for the expected class) then the loss grows fast.
Using a "hot-vector" as ground-truth input label, may give rise to multi-category classification (does not seems like the task you are trying to solve here). You may find this SO thread relevant to that particular case.

cross validation function crossvalind

I have a question, please, concerning cross-validation. As I understand it, cross-validation is used to find the best parameters.
But I did not understand the role of the function crossvalind ("Generate cross-validation indices"); it just takes a data set without a model, as in this example:
load fisheriris
[g gn] = grp2idx(species);
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/3);
The crossvalind() function splits your data into two groups: the training set and the cross-validation set.
By your example:
[trainIdx testIdx] = crossvalind('HoldOut', size(species,1), 1/3); means: split the data in species, with 2/3 in the training set and 1/3 in the cross-validation set.
Supposing that your data is like:
species = [datarow1; datarow2; datarow3; datarow4; datarow5; datarow6], then
trainIdx would be something like [1;1;0;1;1;0] and testIdx would be [0;0;1;0;0;1], meaning that of the 6 total elements in our set, the crossvalind function assigned 4 to the train set and 2 to the cross-validation set. Of course this is a random assignment, meaning that the zero and one indices will vary every time you call the function, but the proportion between them will stay fixed, and trainIdx + testIdx will always be ones(size(species,1),1).
crossvalind('LeaveMout', size(species,1), 2) would be exactly the same as crossvalind('HoldOut', size(species,1), 1/3) in this particular case. In the 'HoldOut' format you provide a parameter P which takes values from 0 to 1 (like the 1/3 in the example above), while with the 'LeaveMout' option you provide an integer M, like 2 samples out of the 6 total, or 2000 samples out of the 10000 total samples in your dataset.
In the case of 'Resubstitution': crossvalind('Resubstitution', size(species,1), [1/3,2/3]) would again be the same, but here you also have the option of, say, [1/3,3/4], meaning that some samples can be in both the train and cross-validation sets, or even [1,1], which means that all the samples are used in both sets (trainIdx = testIdx = [1;1;1;1;1;1] in the above example).
I strongly suggest typing help crossvalind and taking a look at the help file, which is always a lot more detailed and helpful than I could ever be.
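If it helps to see the idea outside MATLAB, here is a rough numpy analogue of the 'HoldOut' behaviour (not crossvalind itself, just two complementary random masks):
import numpy as np

n = 6                       # number of samples, as in the species example
test_frac = 1/3             # fraction held out, like the 1/3 above

rng = np.random.default_rng()
test_idx = np.zeros(n, dtype=bool)
test_idx[rng.choice(n, size=round(n * test_frac), replace=False)] = True
train_idx = ~test_idx       # complementary masks: train_idx + test_idx is all ones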

How to use the Embedding Layer for Recurrent Neural Network (RNN) in Keras

I'm rather new to Neural Networks and the Keras library, and I'm wondering how I can use the Embedding Layer as described here to reshape my input data from a 2D tensor to a 3D tensor for an RNN.
Say my time series data looks as follows (with increasing time):
X_train = [
[1.0,2.0,3.0,4.0],
[2.0,5.0,6.0,7.0],
[3.0,8.0,9.0,10.0],
[4.0,11.0,12.0,13.0],
...
] # with a length of 1000
Now, say I would want to give the RNN the last 2 feature vectors in order to predict the feature vector for time t+1.
Currently (without the Embedding Layer), I am creating the required 3D tensor with shape (nb_samples, timesteps, input_dim) myself (as in this example here).
Related to my example, the final 3D Tensor would then look as follows:
X_train_2 = [
[[1.0,2.0,3.0,4.0],
[2.0,5.0,6.0,7.0]],
[[2.0,5.0,6.0,7.0],
[3.0,8.0,9.0,10.0]],
[[3.0,8.0,9.0,10.0],
[4.0,11.0,12.0,13.0]],
etc...
]
and Y_train:
Y_train = [
[3.0,8.0,9.0,10.0],
[4.0,11.0,12.0,13.0],
etc...
]
My model looks as follows (adapted to the simplified example above):
num_of_vectors = 2
vect_dimension = 4
model = Sequential()
model.add(SimpleRNN(hidden_neurons, return_sequences=False, input_shape=(num_of_vectors, vect_dimension)))
model.add(Dense(vect_dimension))
model.add(Activation("linear"))
model.compile(loss="mean_squared_error", optimizer="rmsprop")
model.fit(X_train, Y_train, batch_size=50, nb_epoch=10, validation_split=0.15)
And finally, my question would be: how can I avoid doing that 2D-to-3D tensor reshaping myself and use the Embedding layer instead? I guess after model = Sequential() I would have to add something like:
model.add(Embedding(?????))
Probably the answer is rather simple, I'm simply confused by the documentation of the embedding layer.
You can use it as follows:
Note:
I generated some X and y as zeros just to give you an idea of the input structure.
If you have a multi-class y_train, you will need to binarize it.
You might need to add padding if you have data of various lengths.
If I understood correctly about predicting at time t+1, you might want to look at sequence-to-sequence learning.
Try something like:
import numpy as np
# imports for the old Keras API used here (nb_epoch, class_mode)
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN

hidden_neurons = 4
nb_classes = 3
embedding_size = 10
X = np.zeros((128, hidden_neurons), dtype=np.float32)  # placeholder input; Embedding expects integer indices in [0, hidden_neurons)
y = np.zeros((128, nb_classes), dtype=np.int8)
model = Sequential()
model.add(Embedding(hidden_neurons, embedding_size))
model.add(SimpleRNN(hidden_neurons, return_sequences=False))
model.add(Dense(nb_classes))
model.add(Activation("softmax"))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', class_mode="categorical")
model.fit(X, y, batch_size=1, nb_epoch=1)
From what I know so far, the Embedding layer seems to be more or less for dimensionality reduction, like word embedding. So in this sense it does not seem applicable as a general reshaping tool.
Basically, if you have a mapping of words to integers like {car: 1, mouse: 2 ... zebra: 9999}, your input text would be a vector of words represented by their integer ids, like [1, 2, 9999 ...], which would mean [car, mouse, zebra ...]. But it seems to be efficient to map words to vectors of real numbers with the length of the vocabulary, so if your text has 1000 unique words, you would map each word to a vector of real numbers of length 1000. I am not sure, but I think it mostly represents a weight of how similar the meaning of a word is to all other words, but I am not sure that is right or whether there are other ways to embed words.
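For reference, a minimal sketch of what the Embedding layer does with integer word ids; the vocabulary size and embedding dimension here are made up for illustration:
import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding

vocab_size = 1000        # e.g. 1000 unique words
embedding_dim = 64       # length of the dense vector each word id is mapped to

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.compile(optimizer='rmsprop', loss='mse')

word_ids = np.array([[1, 2, 999, 4, 5]])   # one "sentence" of 5 integer word ids
vectors = model.predict(word_ids)
print(vectors.shape)                        # (1, 5, 64): each id became a 64-dimensional vector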

Unreasonable [positive] log-likelihood values from matlab "fitgmdist" function

I want to fit a data set with a Gaussian mixture model; the data set contains about 120k samples and each sample has about 130 dimensions. When I use MATLAB to do it, I run this script (with cluster number 1000):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log-likelihood is bigger than 0! I think that's unreasonable, and I don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you think the input to the log (i.e. the a posteriori function) should be bounded by [0,1]. Well, the a posteriori function is a pdf, which means that its value can be very large for a very dense dataset.
PDFs must be positive (which is why we can use the log on them) and must integrate to 1. But they are not bounded by [0,1].
You can verify this by reducing the density in the above code
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.
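As an aside (in Python/SciPy rather than MATLAB), the same effect is easy to reproduce: a very narrow Gaussian has density values far above 1, so their logs are positive:
import numpy as np
from scipy.stats import norm

tight = norm(loc=0.0, scale=1e-4)       # a very concentrated Gaussian
print(tight.pdf(0.0))                   # ~3989.4, far above 1
print(np.log(tight.pdf(0.0)))           # ~8.29, a positive log-likelihood contribution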

Matlab - neural networks - How to use different datasets for training, validation and testing?

Best
I've a question about Neural networks in Matlab.
First of all, I have a small NN: 2 inputs, 1 hidden layer with 10 neurons, and one output. This works fine. But the question I have is: can I determine my training data, validation data and test data myself?
I know that if I use e.g. net = feedforwardnet(10); I can divide my overall dataset into e.g. 70/100, 15/100 and 15/100. But I don't want to do this, because in this case I want to train my NN with 1000 data points, validate it with another set of data points, and use another independent data set of 1000 data points to test it. In other words, I want to control these 3 independent data sets.
Thus, can someone help me?
Kind regards
Edit: I don't want to use a data set with 3000 data points and just set the divideParam ratios to 1/3, 1/3 & 1/3.
Best myself
When you use a feedforwardnet then you can define your divide parameters
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
You know that your data will be divided into 3 pieces.
But you (I) don't know which data goes into which piece.
But when you (and thus I) train the network via the following command:
[net,tr]=train(net,x,t);
then tr will contain all the necessary info, for example:
tr.trainInd 1x1000 double,
tr.valInd 1x1000 double,
tr.testInd 1x1000 double,
Thus, e.g. tr.trainInd will contain all the indices of our data set that were used for training. Also, in tr we can see that tr.divideFcn is set to dividerand, which means the indices are picked at random. So it would be logical that there is also a possibility to have those indices not picked at random. Combining both things, it should be possible to use a separate test set (net.divideParam.testRatio = 0) and two different train and validation sets (net.divideParam.trainRatio = 1/2 and net.divideParam.valRatio = 1/2), provided you can set the divide function to something chronological or sequential. Last but not least, if this is possible, then there is nothing more to do than put the training and validation sets into one data set, etc...
Kind regards to myself
By default it will use random indices for train, validation, and test. This can be set manually with the following, though since it's the default it's usually not needed:
net.divideFcn = 'dividerand'
and then you use the commands you noted above:
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
To do what you want and set the indices of each set explicitly, you can do the following:
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:1000;
net.divideParam.valInd = 1001:2000;
net.divideParam.testInd = 2001:3000;