Are PyTorch activation functions best stored as fields? - neural-network

An example of a simple neural network in PyTorch can be found at https://visualstudiomagazine.com/articles/2020/10/14/pytorch-define-network.aspx
class Net(T.nn.Module):
def __init__(self):
super(Net, self).__init__()
self.hid1 = T.nn.Linear(4, 8) # 4-(8-8)-1
self.hid2 = T.nn.Linear(8, 8)
self.oupt = T.nn.Linear(8, 1)
T.nn.init.xavier_uniform_(self.hid1.weight)
T.nn.init.zeros_(self.hid1.bias)
T.nn.init.xavier_uniform_(self.hid2.weight)
T.nn.init.zeros_(self.hid2.bias)
T.nn.init.xavier_uniform_(self.oupt.weight)
T.nn.init.zeros_(self.oupt.bias)
def forward(self, x):
z = T.tanh(self.hid1(x))
z = T.tanh(self.hid2(z))
z = T.sigmoid(self.oupt(z))
return z
A distinctive feature of the above is that the layers are stored as fields within the Net object (as they need to be, in the sense that they contain the weights, which need to be remembered across training epochs), but the activation functors such as tanh are re-created on every call to forward. The author says:
The most common structure for a binary classification network is to define the network layers and their associated weights and biases in the __init__() method, and the input-output computations in the forward() method.
Fair enough. On the other hand, perhaps it would be marginally faster to store the functors rather than re-create them on every call to forward. On the third hand, it's unlikely to make any measurable difference, which means it might end up being a matter of code style.
Is the above, indeed the most common way to do it? Does either way have any technical advantage, or is it just a matter of style?

On "storing" functors
The snippet is not "re-creating" anything -- calling torch.tanh(x) is literally just calling the function tanh exported by the torch package with arguments x.
Other ways of doing it
I think the snippet is a fair example for small neural blocks that are use-and-forget or are just not meant to be parameterizable.
Depending on your intentions, there are of course alternatives, but you'd have to weigh yourself whether the added complexity offers any value.
activation functions as strings
allow a selection of an activation function from a fixed set
class Model(torch.nn.Module):
def __init__(..., activation_function: Literal['tanh'] | Literal['relu']):
...
if activation_function == 'tanh':
self.activation_function = torch.tanh
elif activation_function == 'relu':
self.activation_function = torch.relu
else:
raise ValueError(f'activation function {activation_function} not allowed, use tanh or relu.'}
def forward(...) -> Tensor:
output = ...
return self.activation_function(output)
activation functions as callables
use arbitrary modules or functions as activations
class Model(torch.nn.Module):
def __init__(..., activation_function: torch.nn.Module | Callable[[Tensor], Tensor]):
self.activation_function = activation_function
def forward(...) -> Tensor:
output = ...
return self.activation_function(output)
which would for instance work like
def cube(x: Tensor) -> Tensor: return x**3
cubic_model = Model(..., activation_function=cube)
The key difference between the above examples and your snippet is the fact that the latter are transparent and adjustable wrt. to the activation used; you can inspect the activation function (i.e. model.activation_function), and change it (before or after initialization), whereas in the case of the original snippet it is invisible and baked into the model's functionality (to replicate the model with a different function, you'd need to define it from scratch).
Overall, I think the best way to go is to create small, locally tunable blocks that are as parametric as you need them to be, and wrap them into bigger blocks that make generalizations over the contained parameters. i.e. if your big model consists of 5 linear layers, you could make a single, activation-parametric wrapper for 1 layer (including dropouts, layer norms, whatever), and then another wrapper for a flow of N layers, which asks once for which activation function to initialize its children with. In other words, generalize and parameterize when you anticipate this to save you from extra effort and copy-pasting code in the future, but don't overdo it or you'll end up far away from your original specifications and needs.
ps: I don't know whether calling activation functions functors is justifiable.

Related

Use of priors on hyper-parameters with SVGP model

I wanted to use priors on hyper-parameters as in (https://gpflow.readthedocs.io/en/develop/notebooks/advanced/mcmc.html) but with an SVGP model.
Following the steps of example 1, I got an error when I run de run_chain_fn :
TypeError: maximum_log_likelihood_objective() missing 1 required positional argument: 'data'
Contrary to GPR or SGPMC, the data are not an attribut of the model, they are included as external parameter.
To avoid that problem I modified slightly SVGP class to include data as parameter (I don't care with mini-batching for now)
class SVGP_with_data(gpflow.models.SVGP):
"""This model is a tiny variation of classical SVGP. It just includes the data as an optionnal
parameter of the model, since they are necessary of MCMC sampling"""
def __init__(self,data,**kwargs):
super().__init__(**kwargs)
self.data = data
def maximum_log_likelihood_objective(self,_=None):
return self.elbo(self.data) #here we don't care about mini-batching
It seems to work well.
I couldn't find code example of SVGP with priors on hyper-parameters. Is their a more standard way to deel with this ?
Thanks !
SVGP is a GPflow model for a variational approximation. Using MCMC on the q(u) distribution parameterised by q_mu and q_sqrt doesn't make sense (if you want to do MCMC on q(u) in a sparse approximation, use SGPMC).
You can still put (hyper)priors on the hyperparameters in the SVGP model; gradient-based optimisation will then lead to the maximum a-posteriori (MAP) point estimate (as opposed to pure maximum likelihood).

Why does huggingface bert pooler hack make mixed precission training stable?

Huggigface BERT implementation has a hack to remove the pooler from optimizer.
https://github.com/huggingface/transformers/blob/b832d5bb8a6dfc5965015b828e577677eace601e/examples/run_squad.py#L927
# hack to remove pooler, which is not used
# thus it produce None grad that break apex
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
We are trying to run pretrining on huggingface bert models. The code always diverges later during the training if this pooler hack is not applied. I also see the pooler layer being used during classification.
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
The pooler layer is a FFN with tanh activation
class BertPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
My question is why this pooler hack solves numeric instability?
Problem seen with pooler
There are quite a few resources out there that probably tackle this issue better than me, see for example here, or here.
Specifically, the problem is that you are dealing with vanishing (or exploding) gradients, specifically when using loss functions that flatten in either direction for very small/large inputs, which is the case for both sigmoid and tanh (the only difference here is the range in which their output lies, which is [0, 1] and [-1, 1], respectively.
Additionally, if you have a low-precision decimal, as is the case with APEX, then the gradient vanishing behavior is much more likely to appear already for relatively moderate outputs, as the precision limits the numbers which it is able to differentiate from zero. One way to deal with this is to have functions that have strictly non-zero and easily computable derivatives, such as Leaky ReLU, or simply avoid the activation function altogether (which I'm assuming is what huggingface is doing here).
Note that the problem of exploding gradients is usually not as tragic, as we can apply gradient clipping (limiting it to a fixed maximum size), but nonetheless the principle is the same. For zeroed gradients, on the other hand, there is no such easy fix, since it causes your neurons to "die" (no active learning is happening with zero backflow), which is why I'm assuming that you see the diverging behavior.

Q-Learning equation in Deep Q Network

I'm new to reinforcement learning at all, so I may be wrong.
My questions are:
Is the Q-Learning equation ( Q(s, a) = r + y * max(Q(s', a')) ) used in DQN only for computing a loss function?
Is the equation recurrent? Assume I use DQN for, say, playing Atari Breakout, the number of possible states is very large (assuming the state is single game's frame), so it's not efficient to create a matrix of all the Q-Values. The equation should update the Q-Value of given [state, action] pair, so what will it do in case of DQN? Will it call itself recursively? If it will, the quation can't be calculated, because the recurrention won't ever stop.
I've already tried to find what I want and I've seen many tutorials, but almost everyone doesn't show the background, just implements it using Python library like Keras.
Thanks in advance and I apologise if something sounds dumb, I just don't get that.
Is the Q-Learning equation ( Q(s, a) = r + y * max(Q(s', a')) ) used in DQN only for computing a loss function?
Yes, generally that equation is only used to define our losses. More specifically, it is rearranged a bit; that equation is what we expect to hold, but it generally does not yet precisely hold during training. We subtract the right-hand side from the left-hand side to compute a (temporal-difference) error, and that error is used in the loss function.
Is the equation recurrent? Assume I use DQN for, say, playing Atari Breakout, the number of possible states is very large (assuming the state is single game's frame), so it's not efficient to create a matrix of all the Q-Values. The equation should update the Q-Value of given [state, action] pair, so what will it do in case of DQN? Will it call itself recursively? If it will, the quation can't be calculated, because the recurrention won't ever stop.
Indeed the space of state-action pairs is much too large to enumerate them all in a matrix/table. In other words, we can't use Tabular RL. This is precisely why we use a Neural Network in DQN though. You can view Q(s, a) as a function. In the tabular case, Q(s, a) is simply a function that uses s and a to index into a table/matrix of values.
In the case of DQN and other Deep RL approaches, we use a Neural Network to approximate such a "function". We use s (and potentially a, though not really in the case of DQN) to create features based on that state (and action). In the case of DQN and Atari games, we simply take a stack of raw images/pixels as features. These are then used as inputs for the Neural Network. At the other end of the NN, DQN provides Q-values as outputs. In the case of DQN, multiple outputs are provided; one for every action a. So, in conclusion, when you read Q(s, a) you should think "the output corresponding to a when we plug the features/images/pixels of s as inputs into our network".
Further question from comments:
I think I still don't get the idea... Let's say we did one iteration through the network with state S and we got following output [A = 0.8, B = 0.1, C = 0.1] (where A, B and C are possible actions). We also got a reward R = 1 and set the y (a.k.a. gamma) to 0.95 . Now, how can we put these variables into the loss function formula https://imgur.com/a/2wTj7Yn? I don't understand what's the prediction if the DQN outputs which action to take? Also, what's the target Q? Could you post the formula with placed variables, please?
First a small correction: DQN does not output which action to take. Given inputs (a state s), it provides one output value per action a, which can be interpreted as an estimate of the Q(s, a) value for the input state s and the action a corresponding to that particular output. These values are typically used afterwards to determine which action to take (for example by selecting the action corresponding to the maximum Q value), so in some sense the action can be derived from the outputs of DQN, but DQN does not directly provide actions to take as outputs.
Anyway, let's consider the example situation. The loss function from the image is:
loss = (r + gamma max_a' Q-hat(s', a') - Q(s, a))^2
Note that there's a small mistake in the image, it has the old state s in the Q-hat instead of the new state s'. s' in there is correct.
In this formula:
r is the observed reward
gamma is (typically) a constant value
Q(s, a) is one of the output values from our Neural Network that we get when we provide it with s as input. Specifically, it is the output value corresponding to the action a that we have executed. So, in your example, if we chose to execute action A in state s, we have Q(s, A) = 0.8.
s' is the state we happen to end up in after having executed action a in state s.
Q-hat(s', a') (which we compute once for every possible subsequent action a') is, again, one of the output values from our Neural Network. This time, it's a value we get when we provide s' as input (instead of s), and again it will be the output value corresponding to action a'.
The Q-hat instead of Q there is because, in DQN, we typically actually use two different Neural Networks. Q-values are computed using the same Neural Network that we also modify by training. Q-hat-values are computed using a different "Target Network". This Target Network is typically a "slower-moving" version of the first network. It is constructed by occasionally (e.g. once every 10K steps) copying the other Network, and leaving its weights frozen in between those copy operations.
Firstly, the Q function is used both in the loss function and for the policy. Actual output of your Q function and the 'ideal' one is used to calculate a loss. Taking the highest value of the output of the Q function for all possible actions in a state is your policy.
Secondly, no, it's not recurrent. The equation is actually slightly different to what you have posted (perhaps a mathematician can correct me on this). It is actually Q(s, a) := r + y * max(Q(s', a')). Note the colon before the equals sign. This is called the assignment operator and means that we update the left side of the equation so that it is equal to the right side once (not recurrently). You can think of it as being the same as the assignment operator in most programming languages (x = x + 1 doesn't cause any problems).
The Q values will propagate through the network as you keep performing updates anyway, but it can take a while.

Keras: make specific weights in a dense layer untrainable [duplicate]

I am using keras and tensorflow 1.4.
I want to explicitly specify which neurons are connected between two layers. Therefor I have a matrix A with ones in it, whenever neuron i in the first Layer is connected to neuron j in the second Layer and zeros elsewhere.
My first attempt was to create a custom layer with a kernel, that has the same size as A with non-trainable zeros in it, where A has zeros in it and trainable weights, where A has ones in it. Then, the desired output would be a simple dot-product. Unfortunately I did not manage to figure out, how to implement a kernel that is partly trainable and partly non-trainable.
Any suggestions?
(Building a functional model with a lot of neurons that are connected by hand could be a work around, but somehow 'ugly' solution)
The simplest way I can think of, if you have this matrix correctly shaped, is to derive the Dense layer and simply add the matrix in the code multiplying the original weights:
class CustomConnected(Dense):
def __init__(self,units,connections,**kwargs):
#this is matrix A
self.connections = connections
#initalize the original Dense with all the usual arguments
super(CustomConnected,self).__init__(units,**kwargs)
def call(self,inputs):
#change the kernel before calling the original call:
self.kernel = self.kernel * self.connections
#call the original calculations:
super(CustomConnected,self).call(inputs)
Using:
model.add(CustomConnected(units,matrixA))
model.add(CustomConnected(hidden_dim2, matrixB,activation='tanh')) #can use all the other named parameters...
Notice that all the neurons/units have yet a bias added at the end. The argument use_bias=False will still work if you don't want biases. You can also do exactly the same thing using a vector B, for instance, and mask the original biases with self.biases = self.biases * vectorB
Hint for testing: use different input and output dimensions, so you can be sure that your matrix A has the correct shape.
I just realized that my code is potentially buggy, because I'm changing a property that is used by the original Dense layer. If weird behaviors or messages appear, you can try another call method:
def call(self, inputs):
output = K.dot(inputs, self.kernel * self.connections)
if self.use_bias:
output = K.bias_add(output, self.bias)
if self.activation is not None:
output = self.activation(output)
return output
Where K comes from import keras.backend as K.
You may also go further and set a custom get_weights() method if you want to see the weights masked with your matrix. (This would not be necessary in the first approach above)

How can I optimize machine learning hyperparameters to be reused in multiple models?

I have a number of datasets, to each of which I want to fit a Gaussian process regression model. The default hyperparameters selected by fitrgp seem subjectively to produce less-than-ideal models. Enabling hyperparameter optimisation tends to result in a meaningful improvement but occasionally produces extreme overfitted values and is a computationally hungry process which prohibits an optimization for every model anyway.
Since fitrgp simply wraps bayesopt for its hyperparameter optimization, is it possible to call bayesopt directly to minimize some aggregate of the loss for multiple models (say, the mean) rather than the loss for one model at a time?
For example, if each dataset is contained in a cell array of tables tbls, I want to find a single value for sigma which can be imposed in calls to fitrgp for each table:
gprMdls = cellfun(#(tbl) {fitrgp(tbl,'ResponseVarName', 'Sigma',sigma)}, tbls);
Where numel(tbls) == 1 the process would be equivalent to:
gprMdl = fitrgp(tbls{1},'ResponseVarName', 'OptimizeHyperparameters','auto');
sigma = gprMdl.Sigma;
but this implementation doesn't naturally extend to a result where a single Sigma value is optimized for multiple models.
I managed this in the end by directly intervening in the built-in optimization routines.
By placing a breakpoint at the start of bayesopt (via edit bayesopt) and calling fitrgp with a single input dataset, I was able to determine from the Function Call Stack that the objective function used by bayesopt is constructed with a call to classreg.learning.paramoptim.createObjFcn. I also captured and stored the remaining input arguments to bayesopt to ensure my function call would be exactly analagous to one constructed by fitrgp.
Placing a breakpoint at the start of classreg.learning.paramoptim.createObjFcn and making a fresh call to fitrgp I was able to capture and store the input arguments to this function, so I could then create objective functions for different tables of predictors.
For my cell array of tables tbls, and all other variables kept as named in the captured createObjFcn scope:
objFcns = cell(size(tbls));
for ii = 1:numel(tbls)
objFcn{ii} = classreg.learning.paramoptim.createObjFcn( ...
BOInfo, FitFunctionArgs, tbls{ii}, Response, ...
ValidationMethod, ValidationVal, Repartition, Verbose);
end
An overall objective function can then be constructed by taking the mean of the objective functions for each dataset:
objFcn = #(varargin) mean(cellfun(#(f) f(varargin{:}),objFcns));
I was then able to call bayesopt with this objFcn along with the remaining arguments captured from the original call. This produced a set of hyperparameters as required and they seem to perform well for all datasets.