In TensorFlow, I have a classifier network and unbalanced training classes. For various reasons I cannot use resampling to compensate for the unbalanced data, so I am forced to compensate for the imbalance by other means, specifically by multiplying the logits by weights based on the number of examples in each class. I know this is not the preferred approach, but resampling is not an option. My training loss op is tf.nn.softmax_cross_entropy_with_logits (I might also try tf.nn.sparse_softmax_cross_entropy_with_logits). The TensorFlow docs include the following in the description of these ops:
WARNING: This op expects unscaled logits, since it performs a softmax
on logits internally for efficiency. Do not call this op with the
output of softmax, as it will produce incorrect results.
My question: Is the warning above referring only to scaling done by softmax, or does it mean any logit scaling of any type is forbidden? If the latter, then is my class-rebalancing logit scaling causing erroneous results?
Thanks,
Ron
The warning just informs you that tf.nn.softmax_cross_entropy_with_logits will apply a softmax to the input logits before computing the cross-entropy. The point of the warning is really to avoid applying softmax twice, since that would make the cross-entropy results very different.
Here is a comment in the relevant source code, about the function that implements tf.nn.softmax_cross_entropy_with_logits:
// NOTE(touts): This duplicates some of the computations in softmax_op
// because we need the intermediate (logits -max(logits)) values to
// avoid a log(exp()) in the computation of the loss.
As the warning states, this implementation exists to improve performance, with the caveat that you should not feed it the output of your own softmax layer (which is somewhat convenient in practice).
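To make the caveat concrete, here is a minimal sketch (the logits and labels values are made up for illustration):
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])   # raw scores from the network (made-up values)
labels = tf.constant([[1.0, 0.0, 0.0]])    # one-hot target (made-up)

# Correct: pass the raw logits; the op applies softmax internally.
loss_ok = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# Incorrect: softmax ends up being applied twice, which gives a different, wrong loss.
loss_wrong = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=tf.nn.softmax(logits))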
If the forced softmax hinders your computation, perhaps another API could help: tf.nn.sigmoid_cross_entropy_with_logits or maybe tf.nn.weighted_cross_entropy_with_logits.
The implementation does not seem to indicate, though, that any scaling will impact the result. I would guess a linear scaling should be fine, as long as it preserves the relative distribution of the original logits. But whatever is applied to the input logits, tf.nn.softmax_cross_entropy_with_logits will apply a softmax before computing the cross-entropy.
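If the goal is just to counteract class imbalance, a commonly used alternative to scaling the logits themselves is to weight the per-example loss by a class weight. A minimal sketch, assuming made-up class weights, labels and logits:
import tensorflow as tf

class_weights = tf.constant([0.3, 1.0, 2.5])   # e.g. inverse class frequencies (made-up values)
logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 1.5, 0.3]])        # raw network outputs (made-up)
labels = tf.constant([0, 2])                   # integer class ids (made-up)

per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
example_weights = tf.gather(class_weights, labels)   # look up each example's class weight
loss = tf.reduce_mean(per_example_loss * example_weights)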
I am trying to use the following CNN architecture for semantic pixel classification. The code I am using is here.
However, from my understanding, this type of semantic segmentation network should typically have a softmax output layer to produce the classification result.
I could not find softmax used anywhere within the script. Here is the paper I am reading on this segmentation architecture. From Figure 2, I am seeing softmax being used. Hence I would like to find out why this is missing in the script. Any insight is welcome.
You are using quite complex code for the training/inference, but if you dig a little you'll see that the loss functions are implemented here and your model is actually trained using the cross_entropy loss. Looking at the doc:
This criterion combines log_softmax and nll_loss in a single function.
For numerical stability it is better to "absorb" the softmax into the loss function and not to explicitly compute it by the model.
It is quite common practice to have the model output "raw" predictions (aka "logits") and then let the loss (aka criterion) apply the softmax internally.
If you really need the probabilities you can add a softmax on top when deploying your model.
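As a toy sketch of that practice (not the repository's actual code; the model and shapes are made up):
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 5, kernel_size=1)   # toy per-pixel classifier with 5 classes
images = torch.randn(2, 3, 8, 8)
targets = torch.randint(0, 5, (2, 8, 8))

logits = model(images)                          # raw, unnormalized scores
loss = F.cross_entropy(logits, targets)         # log_softmax + nll_loss applied internally

probs = F.softmax(logits, dim=1)                # only needed at deployment/inference time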
I'm currently trying to use an autoencoder network for dimensionality reduction.
(i.e. using the bottleneck activation as the compressed feature)
I noticed that a lot of the studies that use autoencoders for this task use a linear bottleneck layer.
Intuitively, I think this makes sense, since a non-linear activation function may reduce the bottleneck feature's ability to represent the principal information contained within the original feature.
(e.g., ReLU ignores the negative values and sigmoid suppresses values too high or too low)
However, is this correct? And is using a linear bottleneck layer for an autoencoder necessary?
If it's possible to use a non-linear bottleneck layer, what activation function would be the best choice?
Thanks.
No, you are not limited to linear activation functions. An example of that is this work, where they use the hidden state of the GRU layers as an embedding for the input. The hidden state is obtained by using non-linear tanh and sigmoid functions in its computation.
Also, there is nothing wrong with 'ignoring' the negative values. The sparsity may, in fact, be beneficial, and it can enhance the representation. The noise that can be created by other functions, such as the identity or sigmoid, may introduce false dependencies where there are none. By using ReLU we can represent the lack of dependency properly (as a zero), as opposed to the near-zero value that e.g. a sigmoid would likely produce.
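For concreteness, here is a minimal Keras-style sketch where the bottleneck activation is just a parameter you can swap; the layer sizes are arbitrary assumptions:
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(input_dim=128, bottleneck_dim=8, bottleneck_activation=None):
    # bottleneck_activation=None gives a linear bottleneck; try 'tanh' or 'relu' to compare
    inputs = layers.Input(shape=(input_dim,))
    h = layers.Dense(64, activation='relu')(inputs)
    code = layers.Dense(bottleneck_dim, activation=bottleneck_activation, name='bottleneck')(h)
    h = layers.Dense(64, activation='relu')(code)
    outputs = layers.Dense(input_dim)(h)
    return models.Model(inputs, outputs)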
I want to force a classifier to not come up with the same results all the time (unsupervised, so I have no targets):
max_indices = tf.argmax(result, 1)
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=max_indices, logits=result, name="cross_entropy_per_example")
cross_entropy_mean = tf.reduce_mean(cross_entropy, name="cross_entropy")
Where:
result are the logits returned from inference
max_indices are thus the predicted classes across the batch (size = batch size)
cross_entropy as implemented here measures how strongly the predicted class is in fact predicted (essentially a measure of confidence)
I then optimize to minimize that loss. Basically I want the net to predict a class as strongly as possible.
Obviously this converges to some random class and will then classify everything in that one class.
So what I want is to add a penalty to prevent all predictions in a batch to be the same. I checked the math and came up with the Shannon Diversity as a good measure, but I cannot implement this in tensorflow. Any idea how to do this, either with the diversity measure stated or any substitute?
Thx
A good rule of thumb is to have a loss function that reflects what you actually want to optimize. If you want to increase diversity, it would make sense to have your loss function actually measure diversity.
While I'm sure there's a more correct way to do it, here's one heuristic that can get you closer to the Shannon Diversity you mention:
Let's hypothesize that the output of the softmax is close to one for the predicted class and close to zero for all other classes.
Then the proportion of each class is the sum of outputs of the softmax over the batch divided by the batch size.
Then the loss function that approximates the Shannon Diversity would be something along the lines of:
sm = tf.nn.softmax(result)
proportions = tf.reduce_mean(sm, 0) # approximated proportion of each class over the batch
addends = proportions * tf.log(proportions) # each proportion multiplied by its own log
loss = tf.reduce_sum(addends) # summing gives the negative Shannon entropy; minimizing it maximizes diversity
When I think more about it, this might potentially break: instead of diversifying the classes, the net could simply make very uncertain predictions (effectively breaking the original assumption that the softmax output is a good approximation of the one-hot encoding of the predicted class). To get around that, I would add together the loss I described above and your original loss from the question. The loss I described will optimize the approximated Shannon Diversity, while your original loss will prevent the softmax from becoming more and more uncertain.
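Putting the two pieces together might look like the sketch below, reusing result and cross_entropy_mean from your snippet; alpha is an assumed trade-off factor you would have to tune:
sm = tf.nn.softmax(result)
proportions = tf.reduce_mean(sm, 0)                                 # approximate class proportions over the batch
diversity_loss = tf.reduce_sum(proportions * tf.log(proportions))   # negative Shannon entropy

alpha = 1.0                                                         # weight of the diversity term (assumption)
total_loss = cross_entropy_mean + alpha * diversity_loss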
I am currently trying to use a neural network to make regression predictions.
However, I don't know what the best way to handle this is, as I have read that there are two different ways to do regression predictions with a NN.
1) Some websites/articles suggest adding a final layer which is linear.
http://deeplearning4j.org/linear-regression.html
My final layers would then, I think, look like:
layer1 = tanh(layer0*weight1 + bias1)
layer2 = identity(layer1*weight2+bias2)
I also noticed that when I use this solution, I usually get predictions that collapse to the mean over the batch. This is the case whether I use tanh or sigmoid as the penultimate layer.
2) Some other websites/articles suggest scaling the output to a [-1,1] or [0,1] range and using tanh or sigmoid as the final layer.
Are these two solutions acceptable? Which one should one prefer?
Thanks,
Paul
I would prefer the second case, in which we normalize the targets, use a sigmoid as the output activation, and then scale the normalized outputs back to their actual values. The reason is that, in the first case, to output large values (and the actual target values are large in most cases), the weights mapping from the penultimate layer to the output layer would have to be large. For faster convergence, the learning rate would then have to be made larger, but this may also cause the learning of the earlier layers to diverge. Hence it is advisable to work with normalized target values, so that the weights stay small and learn quickly.
In short, the first method learns slowly, or may diverge if a larger learning rate is used; the second method is comparatively safer to use and learns quickly.
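A rough sketch of the second approach, with made-up target values standing in for your own data and model output:
import numpy as np

y_train = np.array([120.0, 340.0, 75.0, 410.0])   # made-up raw regression targets

# scale the targets to [0, 1] to match a sigmoid output layer
y_min, y_max = y_train.min(), y_train.max()
y_scaled = (y_train - y_min) / (y_max - y_min)

# ... train the network with a sigmoid final layer against y_scaled ...

# at prediction time, map the network's sigmoid output back to the original range
sigmoid_output = 0.25                              # placeholder for a network prediction
y_pred = sigmoid_output * (y_max - y_min) + y_min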
What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. the true probabilities sum to 1) in FANN, since it doesn't seem to have an option for softmax output?
My understanding is that using sigmoid outputs, as if doing 'labeling', would not give me correct results for a classification problem.
FANN only supports tanh and linear error functions. This means, as you say, that the probabilities output by the neural network will not sum to 1. There is no easy solution to implementing a softmax output, as this will mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although it is not the mathematically elegant solution you are looking for, I would try playing around with some cruder approaches before tackling the implementation of a softmax cost function, as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are actually only interested in what the most likely classification is, you could just take the output with the highest score.
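For example, post-processing FANN's raw scores in Python could look like the following sketch (the scores are made up):
import numpy as np

outputs = np.array([0.9, -0.2, 0.4])   # made-up raw scores from a FANN network (tanh range)

# Option 1: shift (tanh outputs can be negative) and renormalise so the scores sum to 1
shifted = outputs - outputs.min()
probs = shifted / shifted.sum()

# Option 2: if you only care about the most likely class, just take the argmax
predicted_class = int(np.argmax(outputs))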
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.