In PyTorch, torch.nn.functional.embedding_bag seems to be the main function responsible for doing the actual work of the embedding lookup. PyTorch's documentation mentions that embedding_bag does its job "without instantiating the intermediate embeddings". What exactly does that mean? Does it mean, for example, that when the mode is "sum" it does an in-place summation? Or does it just mean that no additional Tensors are produced when calling embedding_bag, but, from the system's point of view, all the intermediate row vectors are still fetched into the processor to be used in calculating the final Tensor?
In the simplest case, torch.nn.functional.embedding_bag is conceptually a two step process. The first step is to create an embedding and the second step is to reduce (sum/mean/max, according to the "mode" argument) the embedding output across dimension 0. So you can get the same result that embedding_bag gives by calling torch.nn.functional.embedding, followed by torch.sum/mean/max. In the following example, embedding_bag_res and embedding_mean_res are equal.
>>> weight = torch.randn(3, 4)
>>> weight
tensor([[ 0.3987,  1.6173,  0.4912,  1.5001],
        [ 0.2418,  1.5810, -1.3191,  0.0081],
        [ 0.0931,  0.4102,  0.3003,  0.2288]])
>>> indices = torch.tensor([2, 1])
>>> embedding_res = torch.nn.functional.embedding(indices, weight)
>>> embedding_res
tensor([[ 0.0931,  0.4102,  0.3003,  0.2288],
        [ 0.2418,  1.5810, -1.3191,  0.0081]])
>>> embedding_mean_res = embedding_res.mean(dim=0, keepdim=True)
>>> embedding_mean_res
tensor([[ 0.1674, 0.9956, -0.5094, 0.1185]])
>>> embedding_bag_res = torch.nn.functional.embedding_bag(indices, weight, torch.tensor([0]), mode='mean')
>>> embedding_bag_res
tensor([[ 0.1674, 0.9956, -0.5094, 0.1185]])
However, the conceptual two step process does not reflect how it's actually implemented. Since embedding_bag does not need to return the intermediate result, it doesn't actually generate a Tensor object for the embedding. It just goes straight to computing the reduction, pulling in the appropriate data from the weight argument according to the indices in the input argument. Avoiding the creation of the embedding Tensor allows for better performance.
So the answer to your question (if I understand it correctly),
"it just means that no additional Tensors are produced when calling embedding_bag, but, from the system's point of view, all the intermediate row vectors are still fetched into the processor to be used in calculating the final Tensor?"
is yes.
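To make the "no intermediate tensor" point concrete, here is a small sketch (plain PyTorch, hypothetical shapes) that contrasts the two-step version with a hand-rolled reduction that only ever touches one row of weight at a time. embedding_bag's fused kernel is far more optimized than this Python loop, but in spirit it reduces the rows directly like the second version, instead of materializing the (len(indices), embedding_dim) intermediate:

import torch

weight = torch.randn(3, 4)
indices = torch.tensor([2, 1])

# Two-step version: materializes an intermediate (len(indices), 4) tensor.
intermediate = torch.nn.functional.embedding(indices, weight)
two_step = intermediate.mean(dim=0, keepdim=True)

# Reduction without materializing the intermediate: accumulate rows directly.
acc = torch.zeros(1, weight.shape[1])
for idx in indices:
    acc += weight[idx]          # each row is read, accumulated, and forgotten
fused = acc / len(indices)

print(torch.allclose(two_step, fused))  # True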
I am trying to reduce the dimensionality of a training set using PCA.
I have come across two approaches.
[V, U, eigen] = pca(train_x);
eigen_sum = 0;
for lamda = 1:length(eigen)
    eigen_sum = eigen_sum + eigen(lamda, 1);
    if (eigen_sum / sum(eigen) >= 0.90)
        break;
    end
end
train_x = train_x * V(:, 1:lamda);
Here, I simply use the eigenvalue matrix to reconstruct the training set with a lower number of features, determined by the principal components that explain 90% of the original set's variance.
The alternate method that I found is almost exactly the same, save the last line, which changes to:
train_x=U(:,1:lamda);
In other words, we take the training set as the principal component representation of the original training set up to some feature lamda.
Both of these methods seem to yield similar results (out-of-sample test error), but there is a difference, however minuscule it may be.
My question is, which one is the right method?
The answer depends on your data, and what you want to do.
Using your variable names: generally speaking, it is easy to expect that the outputs of pca satisfy
U = train_x * V
But this is only true if your data is normalized, specifically if you have already removed the mean from each component. If not, what one can expect is
U = train_x * V - mean(train_x * V)
And in that regard, whether you want to remove or keep the mean of your data before processing it depends on your application.
It's also worth noting that even if you remove the mean before processing, there might still be some small difference, but it will be on the order of floating-point precision error:
((train_x * V) - U) ./ U ~ 1.0e-15
and this error can be safely ignored.
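A small NumPy sketch of that relationship (synthetic data, PCA done by hand via the covariance eigendecomposition rather than MATLAB's pca, so the variable names only mirror yours):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) + 3.0         # data with a nonzero mean

# PCA by hand: center, take the eigenvectors of the covariance matrix as V,
# and the projected (centered) data as the score matrix U.
Xc = X - X.mean(axis=0)
eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]           # sort by descending eigenvalue
V = V[:, order]
U = Xc @ V

# Projecting the *uncentered* data does not reproduce U ...
print(np.allclose(X @ V, U))                          # False
# ... but it does once you subtract the projected mean, as claimed above.
print(np.allclose(X @ V - (X @ V).mean(axis=0), U))   # True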
I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised in [0] is that I don't need to specify the number of components, which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> data = numpy.random.normal(loc = 0.0, scale = 1.0, size = 1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1,1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
      n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
      tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
       [ 0.06259168],
       [ 0.00390097],
       [ 0.02934676],
       [-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you'll notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant components afterwards, which can be done with the predict function.
Unfortunately this important piece is hidden in a comment in the code example:
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
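In code, the pruning that the comment hints at might look roughly like this (a sketch reusing d and data from your snippet; for data drawn from a single Gaussian, used will typically contain only one or two indices):

import numpy as np

# Assign every point to a component, then keep only the components
# that actually claim at least one point.
labels = d.predict(data.reshape(-1, 1))
used = np.unique(labels)
print(used)             # indices of the components the DP actually used
print(d.means_[used])   # means of just those components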
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture. With this model, the suggested prior number of components must be given, but once trained, the model assigns weightings to each component, which essentially indicate their relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate to the number of components for your given problem.
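For example, a rough sketch with sklearn's BayesianGaussianMixture on data like that in the question (the exact weights will vary from run to run):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Data drawn from a single Gaussian, as in the question.
data = np.random.normal(loc=0.0, scale=1.0, size=1000).reshape(-1, 1)

# Upper bound of 5 components; the Dirichlet-process prior should drive the
# weights of unused components towards zero.
bgmm = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
)
bgmm.fit(data)

print(bgmm.weights_)                  # typically one weight near 1.0, the rest tiny
print((bgmm.weights_ > 0.01).sum())   # a rough count of "effective" components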
I'm trying to use SparseTensor to represent weight variables in a fully-connected layer.
However, it seems that TensorFlow 0.8 doesn't allow a SparseTensor to be used as a tf.Variable.
Is there any way to go around this?
I've tried
import tensorflow as tf
a = tf.constant(1)
b = tf.SparseTensor([[0,0]],[1],[1,1])
print a.__class__ # shows <class 'tensorflow.python.framework.ops.Tensor'>
print b.__class__ # shows <class 'tensorflow.python.framework.ops.SparseTensor'>
tf.Variable(a) # Variable is declared correctly
tf.Variable(b) # Fail
By the way, my ultimate goal in using SparseTensor is to permanently mask some of the connections that would exist in the dense form, so that these pruned connections are ignored while calculating and applying gradients.
In my current MLP implementation, SparseTensor and its sparse matmul ops successfully produce inference outputs. However, the weights declared using SparseTensor aren't trained as the training steps go on.
As a workaround to your problem, you can provide a tf.Variable for the values of a sparse tensor (directly until TensorFlow v0.8; wrapped in tf.identity from v0.9, see below). The sparsity structure has to be pre-defined in that case; the weights, however, remain trainable.
weights = tf.Variable(<initial-value>)
sparse_var = tf.SparseTensor(<indices>, weights, <shape>) # v0.8
sparse_var = tf.SparseTensor(<indices>, tf.identity(weights), <shape>) # v0.9
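Put together, a rough sketch of this workaround might look like the following (old TF 0.x graph API; the 4x4 shape and index pattern are made up for illustration):

import tensorflow as tf

indices = [[0, 1], [2, 0], [3, 3]]            # fixed, non-trainable sparsity pattern
values = tf.Variable(tf.random_normal([3]))   # only these 3 entries are trainable

sparse_w = tf.SparseTensor(indices, values, [4, 4])                 # v0.8
sparse_w = tf.SparseTensor(indices, tf.identity(values), [4, 4])    # v0.9

# Use sparse_w in your sparse matmul as before; since its values come from a
# tf.Variable, gradients now flow back into `values` during training.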
TensorFlow doesn't currently support sparse tensor variables. However, it does support sparse lookups (tf.embedding_lookup) and sparse gradient updates (tf.sparse_add) of dense variables. I suspect these two will suffice for your use case.
TensorFlow doesn't support training on sparse tensors yet. You can initialize a sparse tensor as you wish, then convert it into a dense tensor and create a variable from it, like this:
# You need to correctly initialize the sparse tensor with indices, values and a shape
b = tf.SparseTensor(indices, values, shape)
b_dense = tf.sparse_tensor_to_dense(b)
b_variable = tf.Variable(b_dense)
Now you have the sparse tensor initialized as a variable. You then need to take care of the gradient update (in other words, make sure the pruned entries of the variable stay 0, since the backpropagation algorithm would otherwise compute a non-vanishing gradient for them when used naively).
To do this, TensorFlow optimizers provide the method tf.train.Optimizer.compute_gradients(loss, [list_of_variables]). It computes all the gradients in the graph needed to minimize the loss function, but doesn't apply them yet. The method returns a list of tuples of the form (gradient, variable). You can modify these gradients freely, and in your case it makes sense to mask the gradients that are not needed to 0 (i.e. by creating another sparse tensor with default value 0.0 and value 1.0 wherever a weight in your network is present).
After having modified them, you call the optimizer method tf.train.Optimizer.apply_gradients(grads_and_vars) to actually apply the gradients. Example code would look like this:
# Create optimizer instance
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
# Get the gradients for your weights
grads_and_vars = optimizer.compute_gradients(loss, [b_variable])
# Modify the gradients at will
# In your case it would look similar to this
modified_grads_and_vars = [(tf.multiply(gv[0], mask_tensor), gv[1]) for gv in grads_and_vars]
# Apply modified gradients to your model
optimizer.apply_gradients(modified_grads_and_vars)
This makes sure the pruned entries in your weight matrix stay 0 and no unwanted connections are created. You will need to take care of the gradients for all the other variables separately.
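For example, the mask could be built once as a constant dense tensor of the same shape as the weight matrix (a hypothetical 4x4 pattern here, built with NumPy for brevity rather than as a sparse tensor):

import numpy as np
import tensorflow as tf

# Hypothetical 4x4 mask: 1.0 where a connection is kept, 0.0 where it is pruned.
mask_np = np.zeros((4, 4), dtype=np.float32)
mask_np[[0, 2, 3], [1, 0, 3]] = 1.0     # same positions as the sparse weights
mask_tensor = tf.constant(mask_np)

# tf.multiply(grad, mask_tensor) then zeroes the gradient of every pruned entry,
# so those weights never move away from 0.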
The above code works with a minor correction, like this:
def optimize(loss, mask_tensor):
    optimizer = tf.train.AdamOptimizer(0.001)
    grads_and_vars = optimizer.compute_gradients(loss)
    modified_grads_and_vars = [
        (tf.multiply(gv[0], mask_tensor[gv[1]]), gv[1]) for gv in grads_and_vars
    ]
    return optimizer.apply_gradients(modified_grads_and_vars)
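Note that this version indexes mask_tensor by variable (mask_tensor[gv[1]]), so here mask_tensor is presumably a dict mapping every trainable variable to a mask of matching shape, for example (hypothetical names):

mask_tensor = {
    b_variable: tf.constant(mask_np),       # 1.0 where a connection exists
    # every other trainable variable needs an entry too, e.g. an all-ones mask:
    # other_variable: tf.ones_like(other_variable),
}
train_op = optimize(loss, mask_tensor)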
In theano, given a batch cost cost with shape (batch_size,), it is easy to compute the gradient of the mean cost, as in T.grad(T.mean(cost,axis=0),p) with p being a parameter used in the computation of cost. This is done efficiently by backpropagating the gradient through the computational graph. What I would now like to do is to compute the mean of the squared gradients over the batch. This can be done using the following piece of code:
import theano
import theano.tensor as T

g_square = T.mean(theano.scan(lambda i: T.grad(cost[i], p)**2,
                              sequences=T.arange(cost.shape[0]))[0], axis=0)
Where for convenience p is assumed to be a single theano tensor and not a list of tensors.
The computation could be performed efficiently by simply backpropagating the gradient until the last step, and squaring the components of the last operation (which should be a sum over the batch index). I might be wrong on this one, but the computation should be as easy, and nearly as fast as a simple backpropagation. However, theano seems unable to optimize the computation, and it keeps using a loop, making computations extremely slow.
Would anyone know of a solution to make the computation efficient, either by forcing optimizations, expressing the computation in a different way, or even going through the backpropagation process?
Thanks in advance.
Your function g_square happens to have complexity O(batch_size**2) instead of the expected O(batch_size), which makes it appear incredibly slow for larger batch sizes.
The reason is that in every iteration of the scan, the forward and backward passes are computed over the whole batch, even though only cost[i] for a single data point is needed.
I assume the input to the cost computation graph, x, is a tensor with the first dimension of size batch_size. Theano has no means to automatically slice this tensor along this dimension. Therefore computation is always done over the whole batch.
Unfortunately I see no better solution than slicing your input and doing the loop outside Theano:
# x: the symbolic input to the cost graph (cost = network(x, p))
# x_batch: the actual (numpy) input data batch
batch_size = x_batch.shape[0]
g_square_fun = theano.function([x], T.grad(cost[0], p)**2)

g_square_value = 0
for i in range(batch_size):
    g_square_value += g_square_fun(x_batch[i:i+1])
Perhaps when future versions of Theano come with better built-in capabilities for computing Jacobians, there will be more elegant solutions.
After digging deeper into the Theano docs I found a solution that works inside the computation graph. The key idea is to clone the graph of your network inside the scan function, thereby explicitly slicing the input tensor. I tried the following code and empirically it shows O(batch_size) behaviour as expected:
# x: input data batch
# assuming cost = network(x,p)
from theano.gof.graph import clone_get_equiv
def g_square(cost, p):
    g = T.zeros_like(p)

    def scan_fn(i, g, cost, p):
        # clone the graph computing cost, but slice its input
        cloned = clone_get_equiv([], [cost],
                                 copy_inputs_and_orphans=False,
                                 memo={x: x[i:i+1]})
        cost_slice = cloned[cost].reshape([])
        return g + T.grad(cost_slice, p)**2

    result, updates = theano.reduce(scan_fn,
                                    outputs_info=g,
                                    sequences=[T.arange(cost.size)],
                                    non_sequences=[cost.flatten(), p])
    return result
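A hypothetical way to use this, assuming x is the symbolic input, p is a shared parameter, cost = network(x, p) as above, and the updates returned by theano.reduce are empty:

sum_sq_grad = g_square(cost, p)              # sum of squared gradients over the batch
mean_sq_grad = sum_sq_grad / cost.shape[0]   # the batch mean asked for in the question
f = theano.function([x], mean_sq_grad)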
Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters, as well as the mixing weights for each document, are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood, and this is what makes the variational approximation/lower bound necessary (though even without the approximation the log-likelihoods would not be the same, since the models are simply different).
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason that I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info on this see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported by DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0) then the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
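Using the numbers from your question, the conversion is just (plain Python, only to make the arithmetic concrete):

import math

log_perplexity = 0.36729132682898674             # localModel.logPerplexity above
perplexity = math.exp(log_perplexity)            # ~ 1.44, the "plain" perplexity
bits_per_token = log_perplexity / math.log(2.0)  # ~ 0.53 bits per token
print(perplexity, bits_per_token)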