I am facing a peculiar problem and was wondering if there is an explanation. I am running a linear regression and testing different optimization methods, and two of them give a strange outcome when compared to each other. I build a data set that satisfies y=2x+5 and add random noise to it.
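import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# noisy samples of y=2x+5, converted to float tensors for PyTorch
xtrain = torch.from_numpy(np.arange(0, 50, 1).reshape(50, 1)).float()
ytrain = 2*xtrain + 5 + torch.from_numpy(np.random.normal(0, 2, (50, 1))).float()
model1 = nn.Linear(1, 1)
opt1 = torch.optim.SGD(model1.parameters(), lr=1e-5, momentum=0.8)
opt2 = torch.optim.Rprop(model1.parameters(), lr=1e-5)
F_loss = F.mse_loss
train_d = TensorDataset(xtrain, ytrain)
train = DataLoader(train_d, 50, shuffle=True)
loss = F_loss(model1(xtrain), ytrain)

def fit(nepoch, model1, F_loss, opt):
    for epoch in range(nepoch):
        for i, j in train:
            predict = model1(i)
            loss = F_loss(predict, j)
            loss.backward()    # compute gradients
            opt.step()         # update weight and bias
            opt.zero_grad()    # reset gradients for the next batch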
When I compare the results of the following commands:
fit(500000, model1, F_loss, opt1)
fit(500000, model1, F_loss, opt2)
In the last epoch for opt1:loss=2.86,weight=2.02,bias=4.46
In the last epoch for opt2:loss=3.47,weight=2.02,bias=4.68
These results do not make sense to me. Shouldn't opt2 have a smaller loss than opt1, since the weight and bias it finds are closer to the real values (2 and 5, respectively)? Am I doing something wrong?
This has to do with the fact that you are drawing the training samples themselves from a random distribution.
By doing so, you randomized the ground truth to some extent. Sure, your values are distributed around 2x+5, but nothing guarantees that 2x+5 is also the best fit to the particular sample you drew.
It can thus happen that some samples deviate quite significantly from the original function, and, since you use a mean squared error, those deviations are weighted heavily.
In expectation (i.e., for the number of samples going towards infinity), you will likely get closer and closer to the expected parameters.
A way to verify this would be to plot your training samples together with the line given by the fitted parameters, as well as the (ideal) underlying function.
Also note that Linear Regression does have a direct solution - something that is very uncommon in Machine Learning - meaning you can directly calculate an optimal solution, e.g., with sklearn's function
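To make the point above concrete, here is a minimal sketch that computes the least-squares solution directly with NumPy instead of sklearn. The seed and variable names are my own choices for illustration, not taken from the question. It compares the loss at the fitted parameters with the loss at the generating parameters (2, 5).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Same setup as in the question: y = 2x + 5 plus Gaussian noise (std 2).
x = np.arange(0, 50, 1, dtype=float)
y = 2 * x + 5 + rng.normal(0, 2, size=50)

# Design matrix with a column of ones so the bias is fitted as well.
A = np.column_stack([x, np.ones_like(x)])
(w_fit, b_fit), *_ = np.linalg.lstsq(A, y, rcond=None)

def mse(w, b):
    """Mean squared error of the line w*x + b on the noisy sample."""
    return float(np.mean((w * x + b - y) ** 2))

mse_fit = mse(w_fit, b_fit)   # loss at the least-squares optimum
mse_true = mse(2.0, 5.0)      # loss at the generating parameters
print(w_fit, b_fit, mse_fit, mse_true)
```

On any such noisy sample the fitted line has a loss no larger than the line 2x+5, so parameters closer to (2, 5) do not have to give a smaller loss.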
I am running a logistic regression using the MATLAB function fitclinear with the following parameters:
rng('default')
[Mdl,FitInfo] = fitclinear(X',y', 'Lambda','auto',...
'Learner','logistic',...
'ObservationsIn','columns',...
'Regularization','ridge',...
'Solver','sgd',...
'Verbose',1,...
'BatchSize',100,...
'LearnRate',0.1,...
'OptimizeLearnRate',true,...
'PassLimit',100,...
'ClassNames',[-1,1]);
Because I am working with recent and long historical data, I came to realize that training this logistic regression with the exact same X and y, after setting the random generator to default to reproduce results, can give 2 different results, i.e. 2 different sets of betas and a different bias.
Could anyone tell me what the reason behind this could be? Where could the randomness come from?
The system starts at a random starting point. From there, given the size of your system, there are many local minima that could still be good. The idea is that with larger systems we don't really care about reaching the same global minimum; we care about getting decent results. Therefore, we can start at any random point and accept that our system is unlikely to end up with the best result, but rather in some location that gives us good results. This means that, given a large system, it is unlikely that any two training runs will be the same.
https://stats.stackexchange.com/questions/203288/understanding-almost-all-local-minimum-have-very-similar-function-value-to-the
I want to force a classifier to not come up with the same results all the time (unsupervised, so I have no targets):
max_indices = tf.argmax(result, 1)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=max_indices, logits=result, name="cross_entropy_per_example")
cross_entropy_mean = tf.reduce_mean(cross_entropy, name="cross_entropy")
Where:
result are the logits returned from inference
max_indices are thus the predicted classes across the whole batch (size=batchsize)
cross_entropy as implemented here measures how strongly the predicted result is in fact predicted (as if measuring simply the confidence)
I then optimize to minimize that loss. Basically I want the net to predict a class as strongly as possible.
Obviously this converges to some random class and will then classify everything in that one class.
So what I want is to add a penalty that prevents all predictions in a batch from being the same. I checked the math and came up with the Shannon Diversity as a good measure, but I cannot implement this in tensorflow. Any idea how to do this, either with the diversity measure stated or any substitute?
Thx
A good rule of thumb is to have the loss function reflect what you actually want to optimize. If you want to increase the diversity, it would make sense to have your loss function actually measure diversity.
While I'm sure there's a more correct way to do it, here's one heuristic that can get you closer to the Shannon Diversity you mention:
Let's make a hypothesis that the output of the softmax is actually close to one for the predicted class and is close to zero for all other classes.
Then the proportion of each class is the sum of outputs of the softmax over the batch divided by the batch size.
Then the loss function that approximates the Shannon Diversity would be something along the lines of:
sm = tf.nn.softmax(result)
proportions = tf.reduce_mean(sm, 0) # approximated proportion of each class
addends = proportions * tf.log(proportions) # multiplied by the log of itself
loss = tf.reduce_sum(addends) # add them up together to get the loss
When I think more about it, this might break: instead of diversifying the classes, the net could simply make very uncertain predictions (effectively breaking the original assumption that the softmax is a good approximation of the one-hot encoding of the predicted class). To get around that, I would add together the loss I described above and the original loss from your question. The loss I described will optimize the approximated Shannon Diversity, while your original loss will prevent the softmax from becoming more and more uncertain.
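As a rough check of the heuristic above, here is a minimal NumPy sketch of the same quantity (the sum of p*log(p) over the batch-averaged softmax). The function names, the eps constant and the toy logits are assumptions made for this illustration; they do not come from the question.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def neg_shannon_diversity(logits, eps=1e-12):
    """Sum of p*log(p) over the batch-averaged class proportions.

    This is minus the Shannon entropy, so minimizing it pushes the
    batch towards evenly used classes. eps only guards against log(0).
    """
    proportions = softmax(logits).mean(axis=0)
    return float(np.sum(proportions * np.log(proportions + eps)))

# Hypothetical batch of 6 samples and 3 classes, everything predicted as class 0.
collapsed = np.tile(np.array([[20.0, 0.0, 0.0]]), (6, 1))
# Hypothetical batch with two confident predictions per class.
balanced = 20.0 * np.eye(3)[[0, 1, 2, 0, 1, 2]]

loss_collapsed = neg_shannon_diversity(collapsed)
loss_balanced = neg_shannon_diversity(balanced)
print(loss_collapsed, loss_balanced)
```

The collapsed batch gives a loss close to 0, while the balanced batch gives a loss close to -log(3), so minimizing this term rewards batches whose predictions are spread over the classes.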
I just wanted to test how well a neural network can approximate the multiplication function (regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer (I have tested 5 / 30 / 100 neurons per hidden layer), no normalization, and default parameters:
Learning rate - 0.005, Number of learning iterations - 200, The initial learning weights - 0.1, The momentum - 0 [description].
I got extremely bad accuracy, close to 0.
At the same time, boosted decision forest regression shows a very good approximation.
What am I doing wrong? This task should be very easy for NN.
The large gradient of the multiplication function probably forces the net almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Divide by a constant. We just divide everything before the learning and multiply back afterwards.
2) Use log-normalization. It turns multiplication into addition:
m = x*y => ln(m) = ln(x) + ln(y).
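The log-normalization idea can be sketched in a few lines. The example below is only an illustration under my own assumptions: strictly positive inputs (the logarithm is undefined otherwise), made-up data ranges, and a plain least-squares fit standing in for the network.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical strictly positive inputs and their product as target.
x = rng.uniform(1.0, 1000.0, size=500)
y = rng.uniform(1.0, 1000.0, size=500)
m = x * y

# After taking logs, the target is an exact linear function of the features.
features = np.column_stack([np.log(x), np.log(y)])
weights, *_ = np.linalg.lstsq(features, np.log(m), rcond=None)

# Undo the normalization to get predictions of the product itself.
pred = np.exp(features @ weights)
print(weights)
```

The fitted weights come out as (1, 1) up to rounding, which is just ln(m) = ln(x) + ln(y); even a single linear unit can represent the problem once it is expressed in log space.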
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
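The learning-rate symptoms from the last point can be reproduced on a toy problem. This is a minimal sketch with plain gradient descent on f(w) = w^2; the function, the step counts and the three learning rates are assumptions picked for illustration and have nothing to do with Azure's defaults.

```python
def run_gd(lr, steps=50, w0=1.0):
    """Gradient descent on f(w) = w**2; returns the loss before each step."""
    w = w0
    history = []
    for _ in range(steps):
        history.append(w * w)
        w -= lr * 2.0 * w  # the gradient of w**2 is 2*w
    return history

too_high = run_gd(1.1)     # each step overshoots the minimum: error grows
reasonable = run_gd(0.1)   # error shrinks steadily
too_low = run_gd(1e-4)     # error barely moves within 50 steps

print(too_high[-1], reasonable[-1], too_low[-1])
```

Plotting these three histories gives the shapes to look for in a real training curve: a blow-up, a steady decrease, and a nearly flat line.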
This is how I do multiplication with a neural network:
import numpy as np
from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])
model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
model.predict(np.array([[0.8, 0.5]]))
It works.
"Two approaches: divide by constant, or make log normalization"
I tried both approaches. Certainly, log normalization works since, as you rightly point out, it forces an implementation of addition. Dividing by a constant, or similarly normalizing across any range, does not seem to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely up to a certain range of numbers. This is the gist link
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0
I'm using Matlab ( github code repository ). The details of the network are:
Hidden units: 100 ( variable )
Epochs : 500
Batch size: 100
The weights are being updated using the backpropagation algorithm.
I've been able to recognize 0,1,2,3,4,5,6,8, which I have drawn in Photoshop.
However, 7 and 9 are not recognized, even though on the test set I get only 749/10000 wrong, i.e. it correctly classifies 9251/10000.
Any idea what might be wrong? It is learning, and based on the test set results it is learning correctly.
I don't see anything downright incorrect in your code, but there is a lot that can be improved:
You use this to set the initial weights:
hiddenWeights = rand(hiddenUnits,inputVectorSize);
outputWeights = rand(outputVectorSize,hiddenUnits);
hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
outputWeights = outputWeights./size(outputWeights, 2);
This will make your weights very small I think. Not only that, but you will have no negative values, so you'll throw away half of the sigmoid's range of values. I suggest you try:
weights = 2*rand(x, y) - 1
Which will generate random numbers in [-1, 1]. You can then try dividing this interval to get smaller weights (try dividing by the sqrt of the size).
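For reference, here is the suggested initialization as a small NumPy sketch (Python rather than Matlab, purely for illustration). The layer sizes are assumptions: 784 inputs for 28x28 MNIST images, 100 hidden units as stated in the question, and 10 output classes.

```python
import numpy as np

def init_weights(n_out, n_in, rng):
    """Uniform weights in [-1, 1), scaled down by sqrt of the input size."""
    w = 2.0 * rng.random((n_out, n_in)) - 1.0
    return w / np.sqrt(n_in)

rng = np.random.default_rng(0)  # arbitrary seed
hidden_weights = init_weights(100, 784, rng)   # assumed: 784 inputs, 100 hidden units
output_weights = init_weights(10, 100, rng)    # assumed: 10 output classes

print(hidden_weights.min(), hidden_weights.max())
```

Unlike rand(...) divided by the layer size, this gives both negative and positive weights, and their magnitude shrinks only with the square root of the number of inputs.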
You use this as the output delta:
outputDelta = dactivation(outputActualInput).*(outputVector - targetVector) % (tk-yk)*f'(yin)
Multiplying by the derivative is done if you use the square loss function. For log loss (which is usually the one used in classification), you should have just outputVector - targetVector. It might not make that big of a difference, but you might want to try.
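The claim about the two output deltas can be checked numerically. The sketch below assumes a single sigmoid output unit and compares a finite-difference gradient with the closed-form deltas; the function names and the test point are made up for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(z, t):
    """Cross-entropy of a sigmoid output with pre-activation z and target t."""
    y = sigmoid(z)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

def square_loss(z, t):
    """Half squared error of a sigmoid output with pre-activation z."""
    return 0.5 * (sigmoid(z) - t) ** 2

def numeric_grad(f, z, t, h=1e-6):
    """Central finite difference of f with respect to z."""
    return (f(z + h, t) - f(z - h, t)) / (2 * h)

z, t = 0.3, 1.0          # arbitrary pre-activation and target
y = sigmoid(z)

delta_log = numeric_grad(log_loss, z, t)     # expected: y - t
delta_sq = numeric_grad(square_loss, z, t)   # expected: (y - t) * y * (1 - y)
print(delta_log, y - t)
print(delta_sq, (y - t) * y * (1 - y))
```

With log loss the derivative of the sigmoid cancels, which is why the output delta reduces to outputVector - targetVector.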
You say in the comments that the network doesn't detect your own sevens and nines. This can suggest overfitting on the MNIST data. To address this, you'll need to add some form of regularization to your network: either weight decay or dropout.
You should try different learning rates as well, if you haven't already.
You don't seem to have any bias neurons. Each layer, except the output layer, should have a neuron that only returns the value 1 to the next layer. You can implement this by adding another feature to your input data that is always 1.
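Adding the constant feature is a one-liner. Here is a minimal NumPy sketch of the idea (again Python for illustration only; the helper name and the dummy data shape are assumptions).

```python
import numpy as np

def add_bias_feature(X):
    """Append a column of ones so the next layer's weights include a bias."""
    ones = np.ones((X.shape[0], 1), dtype=X.dtype)
    return np.hstack([X, ones])

X = np.zeros((5, 784))        # hypothetical batch of 5 flattened 28x28 images
X_aug = add_bias_feature(X)
print(X_aug.shape)
```

The weight matrix of the following layer then needs one extra column, and that column plays the role of the bias.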
MNIST is a big data set for which better algorithms are still being researched. Your network is very basic and small, with no regularization, no bias neurons and no improvements to classic gradient descent. It's not surprising that it's not working too well: you'll likely need a more complex network for better results.
Nothing to do with neural nets or your code, but this picture of KNN-nearest digits shows that some MNIST digits are simply hard to recognize:
Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). The reason I ask is that in other questions regarding one class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
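The percentile check mentioned above could look like the following sketch. It is written in Python with NumPy for illustration (the question uses Matlab), and the data, the seed and the helper name are assumptions.

```python
import numpy as np

def percentile_inliers(y, lower=0.5, upper=99.5):
    """Boolean mask that is False for the lowest and highest 0.5% of y."""
    lo, hi = np.percentile(y, [lower, upper])
    return (y >= lo) & (y <= hi)

rng = np.random.default_rng(0)        # arbitrary seed
y = rng.normal(0.0, 0.3, size=1000)   # hypothetical target vector

mask = percentile_inliers(y)
y_clean = y[mask]
print(mask.sum(), len(y))
```

Like the one-class SVM on Y alone, this rejects roughly 1% of the observations by construction, but it does so transparently and without a kernel or a gamma to tune.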
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): "It is proved that nu is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors." In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace "can" by "should".
So when you say that "the resulting model does yield performance somewhat in line with what I would expect": of course it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as an outlier.