Why does this semantic segmentation network have no softmax classification layer in Pytorch? - neural-network

I am trying to use the following CNN architecture for semantic pixel classification. The code I am using is here
However, from my understanding this type of semantic segmentation network typically should have a softmax output layer for producing the classification result.
I could not find softmax used anywhere within the script. Here is the paper I am reading on this segmentation architecture. From Figure 2, I am seeing softmax being used. Hence I would like to find out why this is missing in the script. Any insight is welcome.

You are using quite complex code to do the training/inference. But if you dig a little you'll see that the loss functions are implemented here, and your model is actually trained using the cross_entropy loss. Looking at the doc:
This criterion combines log_softmax and nll_loss in a single function.
For numerical stability it is better to "absorb" the softmax into the loss function and not to explicitly compute it by the model.
It is quite common practice to have the model output "raw" predictions (aka "logits") and let the loss (aka criterion) apply the softmax internally.
If you really need the probabilities you can add a softmax on top when deploying your model.
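For concreteness, here is a minimal PyTorch sketch of that pattern; the tiny one-layer "model" is a made-up stand-in, not the network from the linked repo:

```python
# The model emits raw logits; softmax lives inside the loss during
# training and is applied explicitly only when you need probabilities.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 5, kernel_size=1)          # toy 5-class "segmentation head"
criterion = nn.CrossEntropyLoss()               # combines log_softmax + nll_loss

x = torch.randn(2, 3, 8, 8)                     # batch of 2 RGB 8x8 images
target = torch.randint(0, 5, (2, 8, 8))         # per-pixel class labels

logits = model(x)                               # (2, 5, 8, 8) raw scores, no softmax
loss = criterion(logits, target)                # softmax handled internally
loss.backward()

# Only at deployment, when you actually need probabilities:
probs = torch.softmax(logits.detach(), dim=1)   # per-pixel class distributions
assert torch.allclose(probs.sum(dim=1), torch.ones(2, 8, 8), atol=1e-5)
```

Note that `nn.CrossEntropyLoss` accepts the (N, C, H, W) logits with (N, H, W) integer targets directly, which is exactly the per-pixel segmentation case.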

Related

Can I train Word2vec using a Stacked Autoencoder with non-linearities?

Every time I read about Word2vec, the embedding is obtained with a very simple Autoencoder: just one hidden layer, linear activation for the initial layer, and softmax for the output layer.
My question is: why can't I train some Word2vec model using a stacked Autoencoder, with several hidden layers with fancier activation functions? (The softmax at the output would be kept, of course.)
I never found any explanation about this, therefore any hint is welcome.
Word vectors are nothing but the hidden states of a neural network trying to get good at some task.
To answer your question: of course you can.
And if you are going to do it, why not use fancier networks/encoders as well, like a BiLSTM or a Transformer?
This is what the people who created things like ELMo and BERT did (though their networks were a lot fancier).

Tensorflow: scaled logits with cross entropy

In Tensorflow, I have a classifier network and unbalanced training classes. For various reasons I cannot use resampling to compensate for the unbalanced data. Therefore I am forced to compensate for the imbalance by other means, specifically multiplying the logits by weights based on the number of examples in each class. I know this is not the preferred approach, but resampling is not an option. My training loss op is tf.nn.softmax_cross_entropy_with_logits (I might also try tf.nn.sparse_softmax_cross_entropy_with_logits). The Tensorflow docs include the following in the description of these ops:
WARNING: This op expects unscaled logits, since it performs a softmax
on logits internally for efficiency. Do not call this op with the
output of softmax, as it will produce incorrect results.
My question: Is the warning above referring only to scaling done by softmax, or does it mean any logit scaling of any type is forbidden? If the latter, then is my class-rebalancing logit scaling causing erroneous results?
Thanks,
Ron
The warning just informs you that tf.nn.softmax_cross_entropy_with_logits applies a softmax to the input logits before computing the cross-entropy. Its real purpose is to stop you from applying softmax twice, which would make the cross-entropy results very different.
Here is a comment in the relevant source code, about the function that implements tf.nn.softmax_cross_entropy_with_logits:
// NOTE(touts): This duplicates some of the computations in softmax_op
// because we need the intermediate (logits -max(logits)) values to
// avoid a log(exp()) in the computation of the loss.
As the warning states, this implementation exists to improve performance, with the caveat that you must not feed it the output of your own softmax layer (which is convenient in practice anyway).
If the forced softmax hinders your computation, perhaps another API could help: tf.nn.sigmoid_cross_entropy_with_logits or maybe tf.nn.weighted_cross_entropy_with_logits.
The implementation does not refuse scaled inputs, though; it simply treats whatever you pass in as the logits. Note that softmax is invariant to adding the same constant to every logit, but multiplying the logits by per-class weights does change the resulting distribution, and hence the cross-entropy. Whatever you apply to the input logits, tf.nn.softmax_cross_entropy_with_logits will apply a softmax before computing the cross-entropy.
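A quick numerical check of that distinction, in plain NumPy standing in for the softmax the TF op computes internally:

```python
# Softmax is invariant to an additive shift of the logits, but a
# multiplicative per-class scaling changes the output distribution,
# and therefore changes the cross-entropy the op computes.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])

shifted = softmax(logits + 5.0)    # additive shift: identical distribution
assert np.allclose(softmax(logits), shifted)

weights = np.array([3.0, 1.0, 1.0])  # hypothetical class-rebalancing weights
scaled = softmax(logits * weights)   # multiplicative scaling: distribution moves
assert not np.allclose(softmax(logits), scaled)
```

So the class-rebalancing scaling described in the question really does alter the loss, whether or not that is the intended effect.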

How to do disjoint classification without softmax output?

What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. true probabilities sum to 1) in FANN since it doesn't seems to have an option for softmax output?
My understanding is that using sigmoid outputs, as if doing (multi-)labeling, wouldn't give me correct results for a classification problem.
FANN only supports tanh and linear error functions. This means, as you say, that the probabilities output by the neural network will not sum to 1. There is no easy solution to implementing a softmax output, as this will mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although not the mathematically elegant solution you are looking for, I would try playing around with some cruder approaches before tackling the implementation of a softmax cost function, as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are actually only interested in what the most likely classification is, you could just take the output with the highest score.
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.
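Both crude workarounds are a few lines each. A Python sketch (FANN itself is C; this would operate on the raw output array FANN returns, and the sample outputs are made up):

```python
def renormalize(outputs):
    """Shift tanh outputs from [-1, 1] into [0, 1], then rescale to sum to 1."""
    shifted = [(o + 1.0) / 2.0 for o in outputs]
    total = sum(shifted)
    return [s / total for s in shifted]

def predict_class(outputs):
    """If only the most likely class matters, just take the arg-max."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

raw = [0.9, -0.2, 0.1]           # hypothetical tanh outputs, one per class
print(renormalize(raw))          # pseudo-probabilities summing to 1
print(predict_class(raw))        # → 0
```

Keep in mind the renormalized values are not calibrated probabilities, just scores forced onto the simplex; for a true probabilistic output you would still need the softmax cost function.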

Neural networks: classification using Encog

I'm trying to get started using neural networks for a classification problem. I chose to use the Encog 3.x library as I'm working on the JVM (in Scala). Please let me know if this problem is better handled by another library.
I've been using resilient backpropagation. I have 1 hidden layer, and e.g. 3 output neurons, one for each of the 3 target categories. So ideal outputs are either 1/0/0, 0/1/0 or 0/0/1. Now, the problem is that the training tries to minimize the error, e.g. turn 0.6/0.2/0.2 into 0.8/0.1/0.1 if the ideal output is 1/0/0. But since I'm picking the highest value as the predicted category, this doesn't matter for me, and I'd want the training to spend more effort in actually reducing the number of wrong predictions.
So I learnt that I should use a softmax function as the output (although it is unclear to me if this becomes a 4th layer or I should just replace the activation function of the 3rd layer with softmax), and then have the training reduce the cross entropy. Now I think that this cross entropy needs to be calculated either over the entire network or over the entire output layer, but the ErrorFunction that one can customize calculates the error on a neuron-by-neuron basis (reads array of ideal inputs and actual inputs, writes array of error values). So how does one actually do cross entropy minimization using Encog (or which other JVM-based library should I choose)?
I'm also working with Encog, but in Java, though I don't think that makes a real difference. I have a similar problem, and as far as I know you have to write your own function that minimizes cross entropy.
And as I understand it, softmax should just replace your 3rd layer.
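Writing that yourself is less daunting than it sounds, because the gradient of softmax combined with cross-entropy with respect to the raw outputs is simply (softmax(z) - target). A NumPy sketch of the math (illustrative only, not the Encog API), with a numerical-gradient check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, target):
    return -np.sum(target * np.log(probs))

z = np.array([0.6, 0.2, 0.2])   # raw outputs of the final layer
t = np.array([1.0, 0.0, 0.0])   # ideal output 1/0/0
p = softmax(z)

analytic_grad = p - t           # the entire output-layer delta for backprop

# verify against a central-difference numerical gradient
eps = 1e-6
num_grad = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    num_grad[i] = (cross_entropy(softmax(zp), t)
                   - cross_entropy(softmax(zm), t)) / (2 * eps)

assert np.allclose(analytic_grad, num_grad, atol=1e-5)
```

So in a custom error function you would compute softmax over the whole output layer once, then emit `softmax(z) - target` per neuron, which fits an array-in/array-out interface like Encog's.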

Optimization of Neural Network input data

I'm trying to build an app to detect images which are advertisements on webpages. Once I detect those, I'll block them from being displayed on the client side.
Basically I'm using Back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But the number of attributes in that dataset is very high. In fact, one of the mentors on the project told me that if I train the neural network with that many attributes, it will take a long time to train. So is there a way to optimize the input dataset, or do I just have to use all of those attributes?
1558 is actually a modest number of features/attributes. The number of instances (3279) is also small. The problem is not on the dataset side, but on the training-algorithm side.
ANNs are slow to train, so I'd suggest you use logistic regression or an SVM; both are very fast to train, and SVMs in particular have a lot of fast algorithms available.
In this dataset you are actually analyzing text, not images. I think a linear-family classifier, i.e. logistic regression or an SVM, is better suited to your job.
If you are deploying to production and cannot use open-source code, logistic regression is also very easy to implement compared to a good ANN or SVM.
If you decide to use logistic regression or an SVM, I can further recommend some articles or source code for you to refer to.
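To back up the "easy to implement" claim, here is a bare-bones logistic regression trained by gradient descent in plain NumPy, on made-up toy data (not the ads dataset):

```python
import numpy as np

# synthetic, nearly linearly separable toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

# gradient descent on the mean log-loss
w = np.zeros(5)
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
    w -= lr * X.T @ (p - y) / len(y)     # gradient of mean cross-entropy

pred = (1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(float)
accuracy = (pred == y).mean()
print(accuracy)   # should be close to 1.0 on this easy toy set
```

The whole trainable core is the two lines inside the loop, which is why it is a reasonable fallback when you cannot ship open-source library code.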
If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (Use standard statistical methods for this.) Are some features redundant? (Principal component analysis is a good starting point for this.) Don't expect the artificial neural network to do that work for you.
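The redundancy check via PCA can be done in a few lines of NumPy (via the SVD of the centered data); the toy data here stands in for the 1558-column ad dataset:

```python
import numpy as np

# 20 observed features that are secretly driven by only 4 latent factors
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 4))            # 4 "real" underlying factors
X = base @ rng.normal(size=(4, 20))         # 20 redundant observed features
X += 0.01 * rng.normal(size=X.shape)        # small measurement noise

Xc = X - X.mean(axis=0)                     # center the data
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()       # variance explained per component

# nearly all variance sits in the first 4 components
print(np.cumsum(explained)[:5])
```

If the cumulative explained variance saturates after k components, you can project onto those k directions and train the network on a far smaller input, with correspondingly fewer weights to estimate.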
Also: remember the famous "no free lunch" theorem: no classification algorithm works for every problem, and for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding which algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda, Hart & Stork's book on pattern classification is a great starting point to learn about this, if you haven't read it yet.)
Apply a separate ANN for each category of features, for example:
457 inputs, 1 output for URL terms (ANN1)
495 inputs, 1 output for origurl (ANN2)
...
Then train all of them, and use another main ANN to join the results.