I am trying to set up a non-linear regression problem in Keras. I have two datasets, say X1 and X2, whose Y values have a similar mean and standard deviation.
The following procedure was undertaken:
Combine the datasets X1 and X2, shuffle, and train on 30% of the data. Keras reported a training score of 3.20 RMSE and a test score of 3.22 RMSE.
Use the weights from above and test against 100% of the X1 data. Keras reported a test score of 23.97 RMSE.
Use the same weights and test against 100% of the X2 data. Keras reported a test score of 6.49 RMSE.
It is not clear to me why there is such a big difference in the test scores between X1 and X2. Is there any way I can improve the result?
For giggles, I repeated the same procedure as above but trained on the whole of the X1 and X2 datasets instead of taking 30%.
Combine X1 and X2 and train on the whole dataset. Keras returned a training score of 1.81 RMSE.
Use the weights from above and test against 100% of the X1 data. Keras reported a score of 22.80 RMSE.
Testing on X2 gave a score of 7.50 RMSE.
Again, X1 seems to perform poorly compared with X2.
The problem turned out to be with scaling the data appropriately. After the data was rescaled, the model started to work.
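In case it helps others: a common way to do this is to standardize each input feature to zero mean and unit variance, using statistics computed on the training set only. A minimal sketch of that idea in MATLAB (Xtrain and Xtest are illustrative names; Keras users would do the equivalent in their preprocessing pipeline):

% Standardize each feature using training-set statistics only,
% then apply the same transform to the test set.
mu = mean(Xtrain, 1);        % per-feature mean of the training data
sigma = std(Xtrain, 0, 1);   % per-feature standard deviation
sigma(sigma == 0) = 1;       % guard against constant features

XtrainScaled = (Xtrain - mu) ./ sigma;
XtestScaled  = (Xtest  - mu) ./ sigma;  % reuse the training statistics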
I am training a network for detecting sentence similarity (paraphrase detection) using a joint loss from an LSTM layer and a CNN layer. The final cost is simply the sum of the individual (likelihood) losses from these two layers.
Probability of two sentences being similar: sigmoid(vec1' * W * vec2 + b), where vec1 and vec2 are vector representations of the two sentences, and W and b are weights and biases to be learnt during training.
Final loss = Loss from LSTM layer + Loss from CNN layer.
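For concreteness, here is a toy numeric sketch of that scoring function (MATLAB; all dimensions and initializations are illustrative, not the actual trained values):

% Bilinear similarity score: sigmoid(vec1' * W * vec2 + b)
d = 200;                    % hidden-state dimension from the question
vec1 = randn(d, 1);         % toy sentence-1 representation
vec2 = randn(d, 1);         % toy sentence-2 representation
W = 0.01 * randn(d, d);     % learned bilinear weight matrix (toy init)
b = 0;                      % learned bias (toy init)

score = vec1' * W * vec2 + b;        % scalar score
p_similar = 1 / (1 + exp(-score));   % sigmoid squashes to (0, 1)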
When I train the system on sample data of 32 random sentences, my model converges well. However, when using the complete data, the loss becomes stagnant and all sigmoid values go very close to 0.
My network parameters:
Gradient descent optimizer with learning rate 0.01 or 0.001.
Hidden state dimension 200.
Word embedding dimension 300.
Gradient clipping by norm to 5 (sketched below).
One convolution layer with 200 kernels, followed by one max-pooling layer.
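For reference, gradient clipping by norm, as listed above, amounts to the following (a minimal MATLAB sketch; g is a stand-in for the stacked gradient vector):

% Clip a gradient vector g to a maximum L2 norm of 5.
g = randn(1000, 1);              % stand-in for the stacked gradients
maxNorm = 5;
gnorm = norm(g);
if gnorm > maxNorm
    g = g * (maxNorm / gnorm);   % rescale so that norm(g) == maxNorm
end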
Can anyone give me any hint on what might be the issue with training on the complete data, even though it works on the small dataset? Is there a problem of vanishing gradients?
So I have something like this,
y = l3*[sin(theta1)*cos(theta2)*cos(theta3) + cos(theta1)*sin(theta2)*cos(theta3) - sin(theta1)*sin(theta2)*sin(theta3) + cos(theta1)*cos(theta2)*sin(theta3)] + l2*[sin(theta1)*cos(theta2) + cos(theta1)*sin(theta2)] + l1*sin(theta1) + l0;
and something similar for x, where the thetai are angles from specified intervals and the li are coefficients. The task is to approximate the inverse of the equation: given x and y, the result should be the appropriate thetas. So I randomly generate thetas from the specified intervals and compute x and y. Then I normalize x and y to <-1, 1> and the thetas to <0, 1>. I use this data as the training set: the inputs of the network are the normalized x and y, and the outputs are the normalized thetas.
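For what it's worth, the bracketed sums above are exactly the angle-addition expansions of sin(theta1+theta2+theta3) and sin(theta1+theta2), so the equation reduces to y = l3*sin(theta1+theta2+theta3) + l2*sin(theta1+theta2) + l1*sin(theta1) + l0. A minimal MATLAB sketch of the data-generation step under that reading (the x equation is assumed to be the cosine analogue, and the link lengths li are illustrative):

% Generate a normalized training set for the inverse problem.
N = 2500;                               % number of samples
l0 = 0; l1 = 1; l2 = 1; l3 = 1;         % illustrative link lengths

% Draw joint angles uniformly from their ranges (degrees -> radians)
t1 = deg2rad(   0 + 180 * rand(N, 1));  % theta1 in <0, 180>
t2 = deg2rad(-130 + 260 * rand(N, 1));  % theta2 in <-130, 130>
t3 = deg2rad(-150 + 300 * rand(N, 1));  % theta3 in <-150, 150>

% Forward equations; y uses the compact form of the expression above,
% and x is assumed here to be the cosine analogue
x = l3*cos(t1 + t2 + t3) + l2*cos(t1 + t2) + l1*cos(t1) + l0;
y = l3*sin(t1 + t2 + t3) + l2*sin(t1 + t2) + l1*sin(t1) + l0;

% Normalize inputs to <-1, 1> and targets to <0, 1>
normpm1 = @(v) 2 * (v - min(v)) ./ (max(v) - min(v)) - 1;
norm01  = @(v)     (v - min(v)) ./ (max(v) - min(v));
inputs  = [normpm1(x), normpm1(y)];               % N x 2
targets = [norm01(t1), norm01(t2), norm01(t3)];   % N x 3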
I trained the network and tried different configurations, but the absolute error of the network was still around 24.9% after a whole night of training. That's a lot, so I don't know what to do.
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
Error backpropagation was used as the training algorithm. Neurons have a sigmoid activation function, and units are biased. I tried the topologies [2 50 3] and [2 100 50 3]; the training set has length 1000, and training ran for 1000 cycles (in one cycle I go through the whole dataset). The learning rate has value 0.2.
The approximation error was computed as
sum(abs(desired_output - reached_output)) / dataset_length.
The optimizer used is stochastic gradient descent.
Loss function:
(1/2) * (desired - reached)^2
The network was implemented with my own Matlab template for NNs. I know that is a weak point, but I'm sure my template is right because it has successfully solved the XOR problem, approximated differential equations, and approximated a state regulator. I show this template because the information may be useful.
Neuron class
Network class
EDIT:
I used 2500 unique data points within the theta ranges:
theta1<0, 180>, theta2<-130, 130>, theta3<-150, 150>
I also experimented with a larger dataset, but the accuracy doesn't improve.
I have samples from two Gaussian distributions, each containing 10,000 samples. I would like to train a feed-forward neural network with these samples, but I don't know how many samples I have to take in order to get an optimal decision boundary.
Here is the code, but I don't know exactly what the solution should be, and the outputs are weird.
x1 = -49:1:50;
x2 = -49:1:50;
[X1, X2] = meshgrid(x1, x2);
% mean1, var1, mean2, var2 are defined elsewhere; note that mvnpdf
% expects a covariance matrix as its third argument
Gaussian1 = mvnpdf([X1(:) X2(:)], mean1, var1); % for class A
Gaussian2 = mvnpdf([X1(:) X2(:)], mean2, var2); % for class B
net = feedforwardnet(10);
G1 = reshape(Gaussian1, 10000, 1);
G2 = reshape(Gaussian2, 10000, 1);
input = [G1, G2];  % 10000 x 2: one column of pdf values per class
output = [0, 1];   % one label per column of input
net = train(net, input, output);
When I ran the code, it gave me weird results.
If the code is not correct, can someone please suggest how I can get a decision boundary for these two distributions?
I'm pretty sure that the input must be the Gaussian distribution values (and not the x coordinates). The NN has to learn the relationship between the phenomena you are interested in (the Gaussian distributions) and the output labels, not between the space that contains the phenomena and the labels.

Moreover, if you choose the x coordinates as input, the NN will try to learn some relationship between those and the output labels, but the x values are potentially constant: the input data might even be identical for both classes, because you can have very different Gaussian distributions over the same range of x coordinates just by varying the mean and the variance. The NN would then end up confused, because the same input data would have multiple output labels (and you don't want that to happen!).
I hope I was helpful.
P.S.: for completeness, I should tell you that a NN doesn't fit the data very well if you have a small training set. Also, don't forget to validate your model using cross-validation (a good rule of thumb is to use 20% of your data for the validation set and another 20% for the test set, leaving the remaining 60% to train your model), as sketched below.
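A minimal sketch of such a 60/20/20 split (MATLAB; data and all variable names are illustrative):

% Hypothetical 60/20/20 split into training/validation/test sets.
N = size(data, 1);                 % data: N x d matrix (illustrative)
idx = randperm(N);                 % shuffle sample indices
nTrain = round(0.6 * N);
nVal   = round(0.2 * N);

trainData = data(idx(1:nTrain), :);
valData   = data(idx(nTrain+1 : nTrain+nVal), :);
testData  = data(idx(nTrain+nVal+1 : end), :);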
I have read in the documentation of crossval that mcr = crossval('mcr',X,y,'Predfun',predfun) in Matlab calculates the misclassification rate. But if it is applied with 10-fold cross-validation, we will have 10 different misclassification values, since we do 10 tests and each test produces a result, yet the value mcr is a single scalar. So does it take the average of the misclassification rates, or the minimum, etc.?
The average misclassification rate (across all folds and all Monte Carlo repartitions) is used. The following line of crossval demonstrates the calculation of the average loss:
loss = sum(loss)/ (mcreps * sum(cvp.TestSize));
where loss is initially a vector of losses for each cross-validation fold and each repartition, mcreps is the number of repartitions and sum(cvp.TestSize) is the total size of the cross-validation test sets.
This is used for both the MSE (mean-squared error) and MCR loss functions.
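As a toy check of that averaging (all numbers made up):

% One repartition, 10 folds of 10 test samples each, 100 samples total.
foldLoss  = [1 2 0 1 3 1 0 2 1 1];  % misclassified count per fold
mcreps    = 1;                      % one Monte Carlo repartition
testSizes = 10 * ones(1, 10);       % test-set size of each fold

mcr = sum(foldLoss) / (mcreps * sum(testSizes));
% mcr = 12 / 100 = 0.12, the overall misclassification rate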
I have a data set of 13 attributes, where some are categorical and some are continuous (and can be converted to categorical). I need to use logistic regression to create a model that predicts the response for a row and to find the prediction's accuracy, sensitivity, and specificity.
Can/Should I use cross validation to divide my data set and get the results?
Is there any code sample on how to go about doing this? (I'm new to all of this)
Should I be using mnrfit/mnrval or glmfit/glmval? What's the difference and how do I choose?
Thanks!
If you want to determine how well the model can predict unseen data you can use cross validation. In Matlab, you can use glmfit to fit the logistic regression model and glmval to test it.
Here is a sample of Matlab code that illustrates how to do it, where X is the feature matrix and Labels is the class label for each case; num_shuffles is the number of repetitions of the cross-validation, and num_folds is the number of folds:
for j = 1:num_shuffles
    indices = crossvalind('Kfold', Labels, num_folds);
    for i = 1:num_folds
        test = (indices == i);
        train = ~test;
        % Fit logistic regression on the training folds
        [b, dev, stats] = glmfit(X(train,:), Labels(train), 'binomial', 'link', 'logit');
        % Store fitted probabilities for the held-out fold; a cell
        % array is used because fold sizes can differ
        Fit{j,i} = glmval(b, X(test,:), 'logit')';
    end
end
Fit then contains the fitted logistic regression estimates for each test fold. Thresholding these probabilities yields a predicted class for each test case, and performance measures are calculated by comparing the predicted class labels against the actual class labels, as sketched below. Averaging the performance measures across all folds and repetitions gives an estimate of the model's performance on unseen data.
originally answered by BGreene on #Stats.SE.
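To make that last step concrete, here is a minimal sketch of thresholding and computing the three requested measures for one test fold (variable names follow the code above; the 0.5 threshold is one common choice):

% Threshold fitted probabilities and compute performance measures
% for one test fold.
probs     = Fit{j,i}(:);            % fitted probabilities for the fold
actual    = Labels(test);           % true 0/1 labels for the fold
predicted = probs > 0.5;            % thresholded class predictions

TP = sum(predicted == 1 & actual == 1);
TN = sum(predicted == 0 & actual == 0);
FP = sum(predicted == 1 & actual == 0);
FN = sum(predicted == 0 & actual == 1);

accuracy    = (TP + TN) / (TP + TN + FP + FN);
sensitivity = TP / (TP + FN);       % true positive rate
specificity = TN / (TN + FP);       % true negative rate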