Why does PyTorch have two kinds of non-linear activations?

Non-linear activations (weighted sum, nonlinearity):
https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity
Non-linear activations (other): https://pytorch.org/docs/stable/nn.html#non-linear-activations-other

The primary difference is that the functions listed under Non-linear activations (weighted sum, nonlinearity) perform only thresholding and do not normalize the output (i.e. the resultant tensor need not sum to 1, either as a whole or along some specified axes/dimensions).
Example non-linearities:
nn.ReLU
nn.Sigmoid
nn.SELU
nn.Tanh
Whereas the non-linearities listed under Non-linear activations (other) perform both thresholding and normalization (i.e. the resultant tensor sums to 1, either for the whole tensor if no axis/dimension is specified, or along the specified axes/dimensions).
Example non-linearities (note the normalization term in the denominator of each):
nn.Softmax
nn.Softmin
nn.LogSoftmax
The exception is nn.LogSoftmax(), for which the resultant tensor doesn't sum up to 1, since the log is applied over the softmax output.
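As a quick sanity check of this distinction, here is a minimal PyTorch sketch (the input values are arbitrary) showing that nn.ReLU does not normalize its output while nn.Softmax does, and that nn.LogSoftmax is just the log of the softmax:

import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0]])        # arbitrary example logits

relu_out = nn.ReLU()(x)                    # element-wise thresholding only
softmax_out = nn.Softmax(dim=1)(x)         # normalized along dim=1
log_softmax_out = nn.LogSoftmax(dim=1)(x)  # log of the softmax output

print(relu_out.sum(dim=1))                 # tensor([6.])  -> not normalized
print(softmax_out.sum(dim=1))              # tensor([1.])  -> sums to 1
print(log_softmax_out.sum(dim=1))          # negative, does not sum to 1
print(log_softmax_out.exp().sum(dim=1))    # tensor([1.]) after undoing the log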

Related

Is a fully connected layer equivalent to Flatten + Dense in Tensorflow?

A fully-connected layer, also known as a dense layer, is a layer whose neurons are connected to every neuron in the preceding layer (see Wikipedia).
In the MATLAB Deep Learning Toolbox, when defining a fullyConnectedLayer(n), the output will always be (borrowing the terminology from Tensorflow) a "tensor" of shape 1×1×n.
However, defining a dense layer in Keras via tf.keras.layers.Dense(n) will not necessarily produce such a collapsed output; the output shape depends on the input, as explained in the Keras documentation:
For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).
Am I correct in assuming that what MATLAB does in fullyConnectedLayer(n) is equivalent to cascading a Flatten() layer and a Dense(n) layer in Tensorflow? By equivalent I mean that exactly the same operation is performed.
It would appear that this is the case based on the number of weights that MATLAB requires for a fullyConnectedLayer. The weights in fact are n×M where M is the dimension of the input (see MATLAB Documentation: "At training time, Weights is an OutputSize-by-InputSize matrix"). In fact snooping around the internals of this MATLAB function, it seems to me that the InputSize is precisely the size of the input if it were "flattened", i.e. M = a*b*c if the input tensor has shape (a,b,c) (and of course I experimentally verified this by multiplying).
The layer I'm trying to build is towards the final stages of a categorical classifier, so I need the final output of the Keras model to be of shape (None, n), where n is the number of labels in the training data.
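As a rough sanity check of this reasoning, here is a short Keras sketch; the input shape (4, 4, 3) and n = 10 labels are made-up values for illustration. It shows that Flatten() followed by Dense(n) produces an output of shape (None, n) with a kernel of shape (a*b*c, n), which matches MATLAB's OutputSize-by-InputSize weight matrix up to a transpose:

import tensorflow as tf

n = 10                                    # hypothetical number of labels
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4, 4, 3)),      # hypothetical (a, b, c) input
    tf.keras.layers.Flatten(),            # -> shape (None, 4*4*3) = (None, 48)
    tf.keras.layers.Dense(n),             # kernel of shape (48, n)
])
print(model.output_shape)                 # (None, 10)
print(model.layers[-1].kernel.shape)      # (48, 10), i.e. M x n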

Why is the softmax function necessary? Why not simple normalization?

I am not familiar with deep learning so this might be a beginner question.
In my understanding, the softmax function in multi-layer perceptrons is in charge of normalization and of assigning a probability to each class.
If so, why don't we use simple normalization?
Let's say we get a vector x = (10, 3, 2, 1).
Applying softmax, the output will be y = (0.9986, 0.0009, 0.0003, 0.0001).
Applying simple normalization (dividing each element by the sum, 16),
the output will be y = (0.625, 0.1875, 0.125, 0.0625).
It seems like simple normalization could also distribute the probabilities.
So, what is the advantage of using softmax function on the output layer?
Simple normalization does not always produce probabilities: it doesn't work when some values are negative, and what if the sum of the values is zero?
But using the exponential of the logits changes that: the exponential is in theory never zero, and it maps the full range of the logits into positive values that normalize into probabilities. So softmax is preferred because it actually works.
This depends on the training loss function. Many models are trained with a log loss algorithm, so that the values you see in that vector estimate the log of each probability. Thus, SoftMax is merely converting back to linear values and normalizing.
The empirical reason is simple: SoftMax is used where it produces better results.
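A small NumPy sketch of the points above, reusing the example vector from the question plus an arbitrary vector with negative entries:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

x = np.array([10.0, 3.0, 2.0, 1.0])
print(softmax(x))                    # ~[0.9986, 0.0009, 0.0003, 0.0001]
print(x / x.sum())                   # [0.625, 0.1875, 0.125, 0.0625]

y = np.array([1.0, -2.0, 0.5])       # logits with negative values
print(y / y.sum())                   # [-2., 4., -1.]  -- not a probability distribution
print(softmax(y))                    # non-negative and sums to 1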

What does the Kernel Scale parameter mean in linear and polynomial kernels in SVM?

I have a theoretical question. I understand the concept of kernel scale with the Gaussian kernel, but when I run 'OptimizeHyperparameters' in fitcsvm in MATLAB, it gives me values other than one, and I would like to understand what that means.
What does a high value of the kernel scale mean in a linear-kernel SVM? And in a polynomial one?
Please pay attention to these paragraphs from the MATLAB help:
You cannot use any cross-validation name-value pair argument along with the 'OptimizeHyperparameters' name-value pair argument. You can modify the cross-validation for 'OptimizeHyperparameters' only by using the 'HyperparameterOptimizationOptions' name-value pair argument.
OptimizeHyperparameters values override any values you set using other name-value pair arguments. For example, setting OptimizeHyperparameters to 'auto' causes the 'auto' values to apply.
MATLAB divides all elements of the predictor matrix X by the value of KernelScale. Then, the software applies the appropriate kernel norm to compute the Gram matrix. A high value of the kernel scale therefore means all elements of the predictor matrix are divided by a large value (i.e. shrunk) before the kernel is applied.
When optimizing, fitcsvm searches among positive values of KernelScale, by default log-scaled in the range [1e-3,1e3].
If you specify KernelScale together with your own kernel function, for example 'KernelFunction','kernel', then the software throws an error; you must apply the scaling inside kernel itself.
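To make the "divide the predictors, then apply the kernel" description concrete, here is a small NumPy sketch of how I read it for a polynomial kernel; the kernel form (1 + x'y)^q and the example values are illustrative, not a re-implementation of fitcsvm:

import numpy as np

def poly_gram(X, kernel_scale=1.0, degree=3):
    # Divide every element of the predictor matrix by the kernel scale,
    # then compute the polynomial-kernel Gram matrix on the scaled data.
    Xs = X / kernel_scale
    return (Xs @ Xs.T + 1.0) ** degree

X = np.random.randn(5, 3)
K_small = poly_gram(X, kernel_scale=0.1)   # small scale -> predictors inflated, large kernel values
K_large = poly_gram(X, kernel_scale=10.0)  # large scale -> predictors shrunk, kernel values near 1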

Can any function be decomposed as sum of Gaussians?

In Fourier series, any function can be decomposed as a weighted sum of sines and cosines.
In neural networks, any function can be decomposed as a weighted sum of logistic functions (a one-layer neural network).
In wavelet transforms, any function can be decomposed as a weighted sum of Haar functions.
Is there also such property for decomposition into mixture of Gaussians? If so, is there a proof?
If the sum is allowed to be infinite, then the answer is yes. Please refer to Yves Meyer's book Wavelets and Operators, section 6.6, lemma 10.
There's a theorem, the Stone-Weierstrass theorem, which gives conditions for when a family of functions can approximate any continuous function. You need:
an algebra of functions (closed under addition, subtraction, and multiplication),
the constant functions,
and the functions must separate points (for any two distinct points you can find a function that assigns them different values).
You can approximate a constant function with increasingly wide gaussians. You can time-shift gaussians to separate points. So if you form an algebra out of gaussians, you can approximate any continuous function with them.
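As a numerical illustration of this answer, here is a small NumPy sketch that fits a weighted sum of shifted Gaussians to a continuous function by least squares; the target function, the 15 centers and the common width sigma = 0.08 are arbitrary choices:

import numpy as np

# Approximate a target function on [0, 1] with a weighted sum of shifted Gaussians.
x = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * x) + 0.5 * x          # an arbitrary continuous function

centers = np.linspace(0.0, 1.0, 15)               # hypothetical choice of centers
sigma = 0.08                                      # hypothetical common width
basis = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
approx = basis @ weights
print(np.max(np.abs(approx - target)))            # small residual on this grid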
Yes. Decomposing any function into a sum of Gaussians of any kind is possible, since it can be decomposed into a sum of Dirac delta functions :) (and the Dirac delta is the limit of a Gaussian as the variance approaches zero).
Some more interesting questions would be:
Can any function be decomposed into a sum of non-zero-variance Gaussians with a given, constant variance, defined around varying centers?
Can any function be decomposed into a sum of non-zero-variance Gaussians, all centered at 0 but with varying variances?
The Mathematics Stack Exchange might be a better place to answer these questions though.

Creating a 1D Second derivative of gaussian Window

In MATLAB I need to generate a second derivative of a Gaussian window to apply to a vector representing the height of a curve. I need the second derivative in order to determine the locations of the inflection points and maxima along the curve. The vector representing the curve may be quite noisy, hence the use of the Gaussian window.
What is the best way to generate this window?
Is it best to use the gausswin function to generate the gaussian window then take the second derivative of that?
Or to generate the window manually using the equation for the second derivative of the gaussian?
Or even is it best to apply the gaussian window to the data, then take the second derivative of it all? (I know these last two are mathematically the same, however with the discrete data points I do not know which will be more accurate)
The maximum length of the height vector is going to be around 100-200 elements.
Thanks
Chris
I would create a linear filter composed of the weights generated by the second derivative of a Gaussian function and convolve this with your vector.
The weights of a second derivative of a Gaussian (with time shift tau, scale sigma and normalising factor C) are given by:
G''(t) = C * ((t - tau)^2 / sigma^4 - 1 / sigma^2) * exp(-(t - tau)^2 / (2 * sigma^2))
Where:
Tau is the time shift for the filter. If you are generating weights for a discrete filter of length T with an odd number of samples, set tau to zero and allow t to vary from [-T/2,T/2]
sigma - varies the scale of your operator. Setting sigma = T/6 means the filter spans roughly +/- 3 sigma; if you are concerned about a long filter length, the coverage can be reduced so that sigma = T/4 (roughly +/- 2 sigma), at the cost of more truncation.
C is the normalising factor. This can be derived algebraically but in practice I always do this numerically after calculating the filter weights. For unity gain when smoothing periodic signals, I will set C = 1 / sum(G'').
In terms of your comment on the equivalence of smoothing first and taking the derivative later, I would say it is more involved than that: which derivative operator would you use in the second step? A simple central difference would not yield the same results.
You can get an equivalent (but approximate) response to a second derivative of a Gaussian by filtering the data with two Gaussians of different scales and then taking the point-wise differences between the two resulting vectors. See Difference of Gaussians for that approach.
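For reference, here is a minimal NumPy sketch of the filter described in this answer, assuming tau = 0, sigma = T/6 and a simple zero-mean normalisation in place of the unity-gain constant C discussed above:

import numpy as np

def gauss_second_deriv_window(T=21, sigma=None):
    # Weights of a second-derivative-of-Gaussian filter of odd length T.
    if sigma is None:
        sigma = T / 6.0                       # filter spans roughly +/- 3 sigma
    t = np.arange(T) - T // 2                 # tau = 0, t runs over [-T/2, T/2]
    g2 = (t**2 / sigma**4 - 1.0 / sigma**2) * np.exp(-t**2 / (2.0 * sigma**2))
    return g2 - g2.mean()                     # zero mean, so a constant input maps to zero

heights = np.cumsum(np.random.randn(150))     # stand-in for the noisy height vector
w = gauss_second_deriv_window(T=21)
second_deriv = np.convolve(heights, w, mode='same')
# Inflection points of the smoothed curve correspond to zero crossings of second_deriv.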