Removing (Pruning) neurons from TensorFlow.JS layer - neural-network

I'm new to TensorFlow and neural nets, and I have only ever used the JavaScript version of TensorFlow. I'm basically experimenting with and studying all of this.
Reading the (Python) TensorFlow docs, I saw that pruning can be done with tf.contrib.model_pruning but, as far as I can tell, there is nothing similar for TensorFlow.js. So I'd like to experiment a bit and implement at least a very simple / basic pruning method.
This "very simple / basic pruning method" could be something like removing from the hidden layers those neurons whose weights are very close to 0. I would then train the model a bit more and see whether I can recover the lost accuracy.
I know I can access the weights with something like this:
const weights = model.layers.map(layer => {
  return layer.getWeights()[0].dataSync();
});
What I would like to know is whether it is actually possible to find and remove the units associated with those weights (and whether I can do this during training).
Thanks!
Edu

It is possible to set the weights on the model. Just as you retrieve a layer's weights with getWeights, you can use setWeights to change them.
model.fit(x, y, {
  epochs: 1000,
  callbacks: {
    onEpochEnd: () => {
      // check your weights
      model.layers[0].getWeights()
      // set your weights
      model.layers[0].setWeights([tensors])
    }
  }
})
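The pruning rule described in the question can be sketched independently of any framework. The following is a minimal Python illustration, not TensorFlow.js code: it assumes a dense layer whose kernel is laid out as [inputs][units], so each unit owns one column, and it drops every unit whose incoming-weight norm falls below a threshold. The function name prune_units and the threshold value are hypothetical. Removing a unit also means removing its bias entry and the matching row of the next layer's kernel. As far as I know, layer shapes are fixed once a model is built, so in practice the kept weights would probably have to be copied into a new, smaller model with setWeights between training runs.

```python
import math


def prune_units(kernel, bias, next_kernel, threshold):
    """Drop units whose incoming weights are all close to zero.

    kernel:      [n_inputs][n_units] weights feeding the layer
    bias:        [n_units] biases of the layer
    next_kernel: [n_units][n_outputs] weights of the following layer
    Returns the three pruned structures plus the indices of the kept units.
    """
    n_units = len(bias)
    # L2 norm of each unit's incoming weights (one kernel column per unit).
    norms = [math.sqrt(sum(row[j] ** 2 for row in kernel)) for j in range(n_units)]
    keep = [j for j in range(n_units) if norms[j] >= threshold]
    new_kernel = [[row[j] for j in keep] for row in kernel]
    new_bias = [bias[j] for j in keep]
    # The next layer loses the rows that read from the removed units.
    new_next_kernel = [next_kernel[j] for j in keep]
    return new_kernel, new_bias, new_next_kernel, keep


# Made-up weights: the middle unit is nearly dead.
kernel = [[0.5, 0.001, -0.3],
          [-0.2, 0.002, 0.8]]
bias = [0.1, 0.0, -0.1]
next_kernel = [[1.0], [2.0], [3.0]]

pruned_kernel, pruned_bias, pruned_next, kept = prune_units(
    kernel, bias, next_kernel, threshold=0.01
)
print(kept)
```

Judging a unit by the norm of its whole column, rather than by a single weight, avoids removing a unit that has one small weight but several large ones.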

Related

Compute softmax using breeze

I am constructing a deep neural network from scratch and I want to implement the softmax function (http://neuralnetworksanddeeplearning.com/chap3.html#softmax).
I am using breeze for that, but it is not working as expected.
The documentation is also poor, with very few examples, so it is difficult for me to understand how I should use it.
Here is an example:
I have an output array that contains 10 elements.
I also have my label array.
Z contains 10 rows with the weighted values.
My label array also contains 10 rows, and one of them is set to 1 to specify which row is the expected result.
lab(0) = 1
lab(1 to 9) = 0
My code:
def ComputeZ(ActivationFunction: String, z: Array[Double], label: Array[Double]): Array[Double] = {
  ActivationFunction match {
    case "SoftMax" =>
      val t = softmax(z, label)
      t
  }
}
I was expecting a probability distribution summing to 1 over the 10 rows, but it actually returns the same values as Z.
I don't know what I am doing wrong.
Thanks for your help.
Your question seems a little confusing to me. Creating a softmax from scratch has nothing to do with the label or the real output value. A softmax function is used to turn the output of a neural network into a valid probability distribution in multiclass classification problems. Since you have a one-hot vector as the label, it seems that you want to implement a cross-entropy criterion, or some other error function that measures the divergence between the prediction distribution and the label distribution. That needs both the predicted probability distribution (your softmax applied to the output layer) and the one-hot label vector.
I looked at the code of the softmax function in breeze, but I don't see a layer implementation there and it doesn't do what I was expecting. Keep in mind that you need both a forward and a backward function.
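To make the distinction concrete, here is a minimal sketch in plain Python (not breeze) of the two pieces described above: a softmax that takes only the raw outputs, and a cross-entropy that compares the resulting distribution with the one-hot label. The example values of z are made up. It may also be worth checking the breeze documentation, since as far as I can tell its softmax computes the log-sum-exp of its argument rather than a normalized probability vector.

```python
import math


def softmax(z):
    """Turn raw scores into a probability distribution."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Loss between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)


# Ten made-up weighted outputs; the label marks row 0 as the expected one.
z = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.5, 1.5, 0.3, -2.0]
label = [1.0] + [0.0] * 9

probs = softmax(z)
loss = cross_entropy(probs, label)
print(sum(probs), loss)
```

The label appears only in the loss, never in the softmax itself.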

Why does huggingface bert pooler hack make mixed precision training stable?

The Huggingface BERT implementation has a hack to remove the pooler from the optimizer.
https://github.com/huggingface/transformers/blob/b832d5bb8a6dfc5965015b828e577677eace601e/examples/run_squad.py#L927
# hack to remove pooler, which is not used
# thus it produce None grad that break apex
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
We are trying to run pretraining on huggingface bert models. The code always diverges later during training if this pooler hack is not applied. I also see the pooler layer being used during classification.
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
The pooler layer is a FFN with tanh activation
class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
My question is: why does this pooler hack fix the numeric instability?
Problem seen with pooler
There are quite a few resources out there that probably tackle this issue better than me, see for example here, or here.
Specifically, the problem is that you are dealing with vanishing (or exploding) gradients, particularly when using activation functions that flatten out in either direction for very small or very large inputs. This is the case for both sigmoid and tanh; the only difference here is the range of their outputs, which is (0, 1) and (-1, 1), respectively.
Additionally, if you use a low-precision floating-point format, as is the case with mixed-precision training in APEX, then vanishing gradients are much more likely to appear even for relatively moderate outputs, since the precision limits which numbers can be distinguished from zero. One way to deal with this is to use functions with strictly non-zero, cheaply computable derivatives, such as Leaky ReLU, or simply to avoid the activation function altogether (which I'm assuming is what huggingface is doing here).
Note that the problem of exploding gradients is usually not as tragic, as we can apply gradient clipping (limiting it to a fixed maximum size), but nonetheless the principle is the same. For zeroed gradients, on the other hand, there is no such easy fix, since it causes your neurons to "die" (no active learning is happening with zero backflow), which is why I'm assuming that you see the diverging behavior.
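The precision argument above can be illustrated with a small sketch that uses only the Python standard library. It rounds values to IEEE half precision through the struct module's "e" format and compares the tanh derivative, 1 - tanh(x)^2, in full and in half precision. The helper names to_half, tanh_grad and tanh_grad_half are hypothetical, and x = 6 is an arbitrary moderately large input. This only demonstrates the rounding effect; it does not show that this is what actually happens inside the pooler during training.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def tanh_grad(x):
    """Derivative of tanh in double precision."""
    return 1.0 - math.tanh(x) ** 2


def tanh_grad_half(x):
    """Same derivative, with the activation stored in half precision."""
    y = to_half(math.tanh(x))  # the stored activation is rounded to fp16
    return to_half(1.0 - y * y)


x = 6.0
full = tanh_grad(x)       # small but non-zero
half = tanh_grad_half(x)  # tanh(6) rounds to exactly 1.0 in fp16
print(full, half)
```

In half precision the activation rounds to exactly 1.0, so the locally computed derivative becomes exactly zero even though the true value is small but positive.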

Pytorch - how to undersample using weightedrandomsampler

I have an unbalanced dataset and would like to undersample the class that is overrepresented. How do I go about it? I would like to use WeightedRandomSampler, but I am also open to other suggestions.
So far I am assuming that my code will have to be structured roughly like the following, but I don't know exactly how to do it.
trainset = datasets.ImageFolder(path_train, transform=transform)
...
sampler = data.WeightedRandomSampler(weights=..., num_samples=..., replacement=...)
...
trainloader = data.DataLoader(trainset, batch_size=batchsize, sampler=sampler)
I hope someone can help. Thanks a lot
From my understanding, the 'weights' argument of PyTorch's WeightedRandomSampler is somewhat similar to the 'p' argument of numpy.random.choice, which is the probability that a sample will be randomly selected. PyTorch uses weights rather than probabilities to sample training examples, and the docs state that the weights don't have to sum to 1, which is why I say it's not exactly like numpy's random choice. The larger the weight, the more likely that sample is to be drawn.
When you have replacement=True, training examples can be drawn more than once, so copies of the same example can end up being used to train your model: oversampling. Conversely, if an example's weight is low COMPARED TO THE OTHER TRAINING SAMPLE WEIGHTS, that example has a lower chance of being selected: undersampling.
I have no clue how the num_samples argument works when using it with the train loader but I can warn you to NOT put your batch size there. Today, I tried putting the batch size and it gave horrible results. My co-worker put the number of classes*100 and his results were much better. All I know is that you should not put the batch size there. I also tried putting the size of all my training data for num_samples and it had better results but took forever to train. Either way, play around with it and see what works best for you. I would guess that the safe bet is to use the number of training examples for the num_samples argument.
Here's the example I saw somebody else use, and I use it as well for binary classification. It seems to work just fine. You take the inverse of the number of training examples in each class and assign that value as the weight of every training example of that class.
A quick example using your trainset object
import numpy as np
from torch.utils import data
from torch.utils.data import WeightedRandomSampler

labels = np.array(trainset.samples)[:, 1]  # column 1 of the (path, class index) pairs holds the labels
labels = labels.astype(int)  # the mixed array holds strings, so convert to int
majority_weight = 1 / num_of_majority_class_training_examples
minority_weight = 1 / num_of_minority_class_training_examples
# Assumes the majority class is label 0 and the minority class is label 1; swap the two entries otherwise.
sample_weights = np.array([majority_weight, minority_weight])
weights = sample_weights[labels]  # per-example weight, looked up by using each label (0 or 1) as an index
# num_samples was left blank in the original; len(weights) follows the "number of training examples" guess above.
sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)
trainloader = data.DataLoader(trainset, batch_size=batchsize, sampler=sampler)
Since the pytorch doc says that the weights don't have to sum to 1, I think you can also just use the ratio between the imbalanced classes. For example, if you had 100 training examples of the majority class and 50 training examples of the minority class, it would be a 2:1 ratio. To counterbalance this, I think you can just use a weight of 1.0 for each majority class training example and a weight of 2.0 for each minority class training example, because you want each minority example to be 2 times more likely to be selected, which would balance your classes during random selection.
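The arithmetic behind both weighting schemes can be checked without PyTorch. The sketch below uses Python's random.choices as a stand-in for weighted sampling with replacement (it is not WeightedRandomSampler itself) on the 100 vs 50 example from the paragraph above. The helper class_probability is hypothetical, and the number of draws and the seed are arbitrary.

```python
import random
from collections import Counter

# 100 majority examples (label 0) and 50 minority examples (label 1).
labels = [0] * 100 + [1] * 50
counts = Counter(labels)

# Scheme 1: inverse class frequency. Scheme 2: the plain 1.0 / 2.0 ratio.
inv_freq_weights = [1.0 / counts[y] for y in labels]
ratio_weights = [1.0 if y == 0 else 2.0 for y in labels]


def class_probability(weights, labels, cls):
    """Probability that a single weighted draw lands in class `cls`."""
    total = sum(weights)
    return sum(w for w, y in zip(weights, labels) if y == cls) / total


p_inv = class_probability(inv_freq_weights, labels, 1)
p_ratio = class_probability(ratio_weights, labels, 1)

# Empirical check: draw indices with replacement using the weights.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=inv_freq_weights, k=10000)
minority_fraction = sum(labels[i] for i in drawn) / len(drawn)
print(p_inv, p_ratio, minority_fraction)
```

Both schemes give each class the same total weight, which is why only the relative size of the weights matters.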
I hope this helped a little bit. Sorry for the sloppy writing, I was in a huge rush and saw that nobody answered. I struggled through this myself without being able to find any help for it either. If it doesn't make sense just say so and I'll re-edit it and make it more clear when I get free time.
Based on torchdata (disclaimer: I'm the author) one can create a custom undersampler.
First, _Equalizer base class which:
creates multiple RandomSubsetSamplers (one for each class)
based on function (torch.max or torch.min) will behave as oversampler or undersampler
Code:
import builtins

import torch
from torch.utils.data import Sampler
from torchdata.samplers import RandomSubsetSampler  # assumed import path within the torchdata package


class _Equalizer(Sampler):
    def __init__(self, labels: torch.Tensor, function):
        if len(labels.shape) > 1:
            raise ValueError(
                "labels can only have a single dimension (N, ), got shape: {}".format(
                    labels.shape
                )
            )
        # One tensor of sample indices per class
        tensors = [
            torch.nonzero(labels == i, as_tuple=False).flatten()
            for i in torch.unique(labels)
        ]
        # "min" gives an undersampler, "max" an oversampler
        self.samples_per_label = getattr(builtins, function)(map(len, tensors))
        self.samplers = [
            iter(
                RandomSubsetSampler(
                    tensor,
                    replacement=len(tensor) < self.samples_per_label,
                    num_samples=self.samples_per_label
                    if len(tensor) < self.samples_per_label
                    else None,
                )
            )
            for tensor in tensors
        ]

    @property
    def num_samples(self):
        return self.samples_per_label * len(self.samplers)

    def __iter__(self):
        for _ in range(self.samples_per_label):
            for index in torch.randperm(len(self.samplers)).tolist():
                yield next(self.samplers[index])

    def __len__(self):
        return self.num_samples
Now, we can create undersampler (added oversampler as it is really short right now):
class RandomUnderSampler(_Equalizer):
    def __init__(self, labels: torch.tensor):
        super().__init__(labels, "min")


class RandomOverSampler(_Equalizer):
    def __init__(self, labels):
        super().__init__(labels, "max")
Just pass in your labels to the __init__ (has to be 1D but can have multiple or binary classes) and you can up/under sample your data.
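For readers who only want the idea rather than the torchdata classes, the undersampling step can be sketched with the standard library alone. This is a simplified stand-in, not the author's implementation: it groups indices by label, finds the size of the smallest class, and draws that many indices from every class without replacement. The function name undersample_indices and the seed parameter are hypothetical.

```python
import random
from collections import defaultdict


def undersample_indices(labels, seed=None):
    """Return shuffled indices with every class cut down to the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Every class is reduced to the size of the rarest class.
    per_class = min(len(indices) for indices in by_class.values())
    picked = []
    for indices in by_class.values():
        picked.extend(rng.sample(indices, per_class))  # without replacement
    rng.shuffle(picked)
    return picked


labels = [0] * 8 + [1] * 3 + [2] * 5
picked = undersample_indices(labels, seed=0)
picked_labels = [labels[i] for i in picked]
print(sorted(picked_labels))
```

The resulting index list could then be used to select a subset of the dataset. Unlike the sampler classes above, it is drawn once up front rather than redrawn on each pass.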

Reading reconstructed vector from autoencoder in DL4J

My goal is to have an autoencoding network where I can train the identity function and then do forward passes yielding a reconstruction of the input.
For this, I'm trying to use VariationalAutoencoder, e.g. something like:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(77147718)
    .trainingWorkspaceMode(WorkspaceMode.NONE)
    .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
    .gradientNormalizationThreshold(1.0)
    .optimizationAlgo(OptimizationAlgorithm.CONJUGATE_GRADIENT)
    .list()
    .layer(0, new VariationalAutoencoder.Builder()
        .activation(Activation.LEAKYRELU)
        .nIn(100).nOut(15)
        .encoderLayerSizes(120, 60, 30)
        .decoderLayerSizes(30, 60, 120)
        .pzxActivationFunction(Activation.IDENTITY)
        .reconstructionDistribution(new BernoulliReconstructionDistribution(Activation.SIGMOID.getActivationFunction()))
        .build())
    .pretrain(true).backprop(false)
    .build();
However, VariationalAutoencoder seems to be designed for training (and providing) mappings from an input to an encoded version, i.e. a vector of size 100 to a vector of size 15 in above example configuration.
However, I'm not particularly interested in the encoded version, but would like to train a mapping of a 100-vector to itself. Then, I'd like to run other 100-vectors through it and get back their reconstructed versions.
But even when looking at the API of the VariationalAutoencoder (or AutoEncoder too), I can't figure out how to do this. Or are those layers not designed for this kind of "end-to-end usage", so that I would have to manually construct an autoencoding network?
You can see how to use the VAE layer to extract averaged reconstructions in the variational example.
There are two methods for getting the reconstruction from a variational layer. The standard one is generateAtMeanGivenZ, which will draw samples from the layer and give you the average. If you want raw samples, you can use generateRandomGivenZ. See the javadoc page for all the other methods.

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised on [0] is that I don't need to specify the number of components; which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> data = numpy.random.normal(loc = 0.0, scale = 1.0, size = 1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1,1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
[ 0.06259168],
[ 0.00390097],
[ 0.02934676],
[-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you will notice that DPGMM always returns a model with n_components components. The trick is to remove the redundant components, which can be done with the predict function.
Unfortunately, this important piece is hidden in a comment in the code example.
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
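One way to read that comment, sketched here in plain Python under the assumption that the hard assignments come from something like d.predict(X): the components actually in use are simply the distinct labels that appear in the prediction. The helper name active_components and the example assignments are made up for illustration.

```python
def active_components(assignments):
    """Return the sorted indices of components that received at least one sample."""
    return sorted(set(assignments))


# Hypothetical output of predict for seven samples and a 5-component upper bound.
assignments = [2, 2, 0, 2, 0, 2, 2]
used = active_components(assignments)
print(used, len(used))
```

Any component index missing from the result was never assigned a sample and can be treated as redundant.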
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture. With this model, the suggested prior number of components must be given, but once trained, the model assigns weightings to each component, which essentially indicate their relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate to the number of components for your given problem.
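A rough way to turn those weightings into a component count is to keep only the components whose weight exceeds some cut-off. The sketch below is plain Python and assumes the weights are read from a fitted model (for scikit-learn's BayesianGaussianMixture that would be its weights_ attribute). The function name relevant_components, the 0.01 threshold and the example weights are all made up for illustration, not real results.

```python
def relevant_components(weights, threshold=0.01):
    """Return indices of mixture components whose weight is at least `threshold`."""
    return [k for k, w in enumerate(weights) if w >= threshold]


# Made-up mixture weights for a 5-component upper bound.
weights = [0.62, 0.001, 0.37, 0.004, 0.005]
kept = relevant_components(weights)
print(kept, len(kept))
```

The threshold is a judgment call, so it is worth trying a few values and checking that the count stays stable.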