CPU memory allocation failed in Dynet - neural-network

I am not sure why I am running out of memory. Take the parser of Goldberg, all I do is to change this line:
scores, exprs = self.__evaluate(conll_sentence, True)
and add a for loop around it to repeat it K times:
for k in xrange(K):
scores, exprs = self.__evaluate(conll_sentence, True)
# do something
Then in the getExpr, i do the following:
samples_out = np.random.normal(0,0.001, (1, self.hidden_units))
samples_FOH = np.random.normal(0,0.001,(self.hidden_units, self.ldims * 2))
samples_FOM = np.random.normal(0,0.001,(self.hidden_units, self.ldims * 2))
samples_Bias = np.random.normal(0,0.001, (self.hidden_units))
XoutLayer = self.outLayer.expr()+inputTensor(samples_out)
XhidLayerFOH = self.hidLayerFOH.expr()+inputTensor(samples_FOH)
XhidLayerFOM = self.hidLayerFOM.expr()+inputTensor(samples_FOM)
XhidBias = self.hidBias.expr()+inputTensor(samples_Bias)
if sentence[i].headfov is None:
sentence[i].headfov = XhidLayerFOH * concatenate([sentence[i].lstms[0], sentence[i].lstms[1]])
if sentence[j].modfov is None:
sentence[j].modfov = XhidLayerFOM * concatenate([sentence[j].lstms[0], sentence[j].lstms[1]])
output = XoutLayer * self.activation(sentence[i].headfov + sentence[j].modfov + XhidBias)
return output
Essentially what is happening in the above block is to first generate normally distributed noise, then add it to the trained values. But it seems somewhere along the way, all the generated values stay in the memory and it just runs out of memory. Any one knows why?

Dynet expressions stay in memory until the next call to renew_cg().
So the fix would be to call it after each iteration of your loop, provided that you have retrieved all information you needed from the computation graph.
Side note: when you do a simple addition, such as:
XoutLayer = self.outLayer.expr()+inputTensor(samples_out)
no addition is actually performed. You just create a new expression and specify how to evaluate it from other expressions. The actual computation is performed when there is a call to .forward() (or .value(), etc) on XoutLayer or on an expression whose computation depends on XoutLayer.
So, dynet needs to allocate memory for all expressions in the current computation graph.

Related

gtsummary::tbl_regression() - Obtain Random Effects from GLMM Zero-Inflated Model

When trying to create a table with the conditional random effects in r using the gtsummary function tbl_regression from a glmmTMB mixed effects negative-binomial zero-inflated model, I get duplicate random effects rows.
Example (using Mollie Brooks' Zero-Inflated GLMMs on Salamanders Dataset):
data(Salamanders)
head(Salamanders)
library(glmmTMB)
zinbm2 = glmmTMB(count~spp + mined +(1|site), zi=~spp + mined + (1|site), Salamanders, family=nbinom2)
zinbm2_table_cond <- tbl_regression(
zinbm2,
tidy_fun = function(...) broom.mixed::tidy(..., component = "cond"),
exponentiate = TRUE,
estimate_fun = purrr::partial(style_ratio, digits = 3),
pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_cond
Output:
Random Effects Output (cond)
When extracting the random effects from de zero-inflated part of the model I get the same problem.
Example:
zinbm2_table_zi <- tbl_regression(
zinbm2,
tidy_fun = function(...) broom.mixed::tidy(..., component = "zi"),
exponentiate = TRUE,
estimate_fun = purrr::partial(style_ratio, digits = 3),
pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_zi
Output:
Random Effects Output (zi)
The problem persists if I specify the effects argument in broom.mixed.
tidy_fun = function(...) broom.mixed::tidy(..., effects = "ran_pars", component = "cond"),
Looking at confidence intervals in both outputs it seems that somehow it is extracting random effects from both parts of the model and changing the estimate of the zero-inflated random effects (in 1st image; opposite in the 2nd image) to match the conditional part estimate while keeping the CI.
I am not knowledgeable enough to understand why this is happening. Since both rows have the same label I am having difficulty removing the wrong one.
Any tips on how to avoid this problem or a workaround to remove the undesired rows?
If you need more info, let me know.
Thank you in advance.
PS: Output images were changed to link due to insufficient reputation.

assigning a destination frame to a prediction in h2o.ai

I like h2o.ai for machine learning using R.
https://cran.r-project.org/web/packages/h2o/h2o.pdf
I like random forests, but I'm making a few thousand predictions in a loop.
It is spamming up my memory with things like this:
I can't afford to keep them all in memory. I'm making my very nice computer work very hard. That means it doesn't have the capacity to hold all the balls in the air at once.
If I could assign a destination frame name to the prediction then each new one would overwrite the old ones.
How do I assign a destination frame name when I am performing "h2o.predict" on an object?
Things that I have tried that did not work:
h2o.predict(object = rf.hex, newdata = test.hex, predictions_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, destination_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, model_id = "predict.hex")
There is no way that I am aware of.
But as an alternative, inside your loop, you could call h2o.rm() on the return value from h2o.predict(). It is worth calling h2o.gc() as well. Something like:
for(data in alldata){
# ... prepare newdata
p = h2o.predict(model, newdata)
# ... do something with p here
h2o.rm(p)
h2o.rm(newdata) # If also not needed any more
h2o.gc()
}
Aside: you said "I'm making a few thousand predictions in a loop". Assuming they were all against the same model, remember you can batch them up, and give all thousand predictions in a single newdata dataframe. One call to h2o.predict() with 1000 entries is much more efficient than making 1000 h2o.predict() calls, for one newdata entry at a time.

Pytorch minibatching keeps model from training

I am trying to classify sequences by a binary feature. I have a dataset of sequence/label pairs and am using a simple one-layer LSTM to classify each sequence. Before I implemented minibatching, I was getting reasonable accuracy on a test set (80%), and the training loss would go from 0.6 to 0.3 (averaged).
I implemented minibatching, using parts of this tutorial: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
However, now my model won’t do better than 70-72% (70% of the data has one label) with batch size set to 1 and all other parameters exactly the same. Additionally, the loss starts out at 0.0106 and quickly gets really really small, with no significant change in results. I feel like the results between no batching and batching with size 1 should be the same, so I probably have a bug, but for the life of me I can’t find it. My code is below.
Training code (one epoch):
for i in t:
model.zero_grad()
# prep inputs
last = i+self.params['batch_size']
last = last if last < len(train_data) else len(train_data)
batch_in, lengths, batch_targets = self.batch2TrainData(train_data[shuffled][i:last], word_to_ix, label_to_ix)
iters += 1
# forward pass.
tag_scores = model(batch_in, lengths)
# compute loss, then do backward pass, then update gradients
loss = loss_function(tag_scores, batch_targets)
loss.backward()
# Clip gradients: gradients are modified in place
nn.utils.clip_grad_norm_(model.parameters(), 50.0)
optimizer.step()
Functions:
def prep_sequence(self, seq, to_ix):
idxs = [to_ix[w] for w in seq]
return torch.tensor(idxs, dtype=torch.long)
# transposes batch_in
def zeroPadding(self, l, fillvalue=0):
return list(itertools.zip_longest(*l, fillvalue=fillvalue))
# Returns padded input sequence tensor and lengths
def inputVar(self, batch_in, word_to_ix):
idx_batch = [self.prep_sequence(seq, word_to_ix) for seq in batch_in]
lengths = torch.tensor([len(idxs) for idxs in idx_batch])
padList = self.zeroPadding(idx_batch)
padVar = torch.LongTensor(padList)
return padVar, lengths
# Returns all items for a given batch of pairs
def batch2TrainData(self, batch, word_to_ix, label_to_ix):
# sort by dec length
batch = batch[np.argsort([len(x['turn']) for x in batch])[::-1]]
input_batch, output_batch = [], []
for pair in batch:
input_batch.append(pair['turn'])
output_batch.append(pair['label'])
inp, lengths = self.inputVar(input_batch, word_to_ix)
output = self.prep_sequence(output_batch, label_to_ix)
return inp, lengths, output
Model:
class LSTMClassifier(nn.Module):
def __init__(self, params, vocab_size, tagset_size, weights_matrix=None):
super(LSTMClassifier, self).__init__()
self.hidden_dim = params['hidden_dim']
if weights_matrix is not None:
self.word_embeddings = nn.Embedding.from_pretrained(weights_matrix)
else:
self.word_embeddings = nn.Embedding(vocab_size, params['embedding_dim'])
self.lstm = nn.LSTM(params['embedding_dim'], self.hidden_dim, bidirectional=False)
# The linear layer that maps from hidden state space to tag space
self.hidden2tag = nn.Linear(self.hidden_dim, tagset_size)
def forward(self, batch_in, lengths):
embeds = self.word_embeddings(batch_in)
packed = nn.utils.rnn.pack_padded_sequence(embeds, lengths)
lstm_out, _ = self.lstm(packed)
outputs, _ = nn.utils.rnn.pad_packed_sequence(lstm_out)
tag_space = self.hidden2tag(outputs)
tag_scores = F.log_softmax(tag_space, dim=0)
return tag_scores[-1]
For anyone else with a similar issue, I got it to work. I removed the log_softmax calculation, so this:
tag_space = self.hidden2tag(outputs)
tag_scores = F.log_softmax(tag_space, dim=0)
return tag_scores[-1]
becomes this:
tag_space = self.hidden2tag(outputs)
return tag_space[-1]
I also changed NLLLoss to CrossEntropyLoss, (not shown above), and initialized CrossEntropyLoss with no parameters (aka no ignore_index).
I am not certain why these changes were necessary (the docs even say that NLLLoss should be run after a log_softmax layer), but they got my model working and brought my loss back to a reasonable range (~0.5).

Write from compute shader to persistently mapped SSBO fails

I'm trying to write to a SSBO with a compute shader and read the data back on the cpu.
The compute shader is just a 1x1x1 toy example that writes 24 floats:
#version 450 core
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout (std430, binding = 0) buffer particles {
float Particle[];
};
void main() {
for (int i = 0; i < 24; ++i) {
Particle[i] = i + 1;
}
}
This is how I run the shader and read the data:
val bufferFlags = GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT
val bufferSize = 24 * 4
val bufferId = glCreateBuffers()
glNamedBufferStorage(bufferId, bufferSize, bufferFlags)
val mappedBuffer = glMapNamedBufferRange(bufferId, 0, bufferSize, bufferFlags)
mappedBuffer.rewind()
val mappedFloatBuffer = mappedBuffer.asFloatBuffer()
mappedFloatBuffer.rewind()
val ssboIndex = glGetProgramResourceIndex(progId, GL_SHADER_STORAGE_BLOCK, "particles")
val props = Array(GL_BUFFER_BINDING)
val params = Array(-1)
glGetProgramResourceiv(progId, GL_SHADER_STORAGE_BLOCK, ssboIndex, props, null, params)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, params(0), bufferId)
glUseProgram(progId)
val sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
glDispatchCompute(1, 1, 1)
glClientWaitSync(sync, 0, 1000000000) match {
case GL_TIMEOUT_EXPIRED =>
println("Timeout expired")
case GL_WAIT_FAILED =>
println("Wait failed. " + glGetError())
case _ =>
println("Result:")
while(mappedFloatBuffer.hasRemaining) {
println(mappedFloatBuffer.get())
}
}
I expect it to print the numbers 1 to 24 but instead it prints 24 zeros. Using the cpu I can read and write (if the GL_MAP_WRITE_BIT is set) to the buffer just fine. The same happens if I don't use DSA (glBindBuffer / glBufferStorage / glMapBufferRange instead). However, if the buffer isn't mapped while the shader runs and I only map it just before printing the contents, everything works correctly. Isn't this exactly what persistently mapped buffers are for? So I can keep it mapped while the gpu is using it?
I checked for any errors, with glGetError as well as with the newer debug output but I don't get any.
Here (pastebin) is a fully working example. You need LWJGL to run it.
There are a number of problems in your code.
First, you put the fence sync before the command you want to sync with. Syncing with a fence syncs with all commands executed before the fence, not after. If you want to sync with the compute shader execution, then you have to insert the fence after the dispatch call, not before.
Second, synchronization is not enough. Writes to an SSBO are incoherent, so you must follow the rules of incoherent memory accesses in order to make them visible to you. In this case, you need to insert an appropriate memory barrier between the compute operation and when you try to read from the buffer with glMemoryBarrier. Since you're reading the data via mapping, the proper barrier to use is GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT.
Your code appears to work when you use non-persistent mapping, but this is mere appearance. It's still undefined behavior due to the improper incoherent memory access (ie: the lack of a memory barrier). It just so happens that the UB does what you want... in that case.

Different result returned using Scala Collection par in a series of runs

I have tasks that I want to execute concurrently and each task takes substantial amount of memory so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap{tuple =>
tuple.par.map{x => {
println(s"Running $x")
val s = (1 to 100000).toList // intentionally to make the JVM allocate a sizeable chunk of memory
s.sum.toLong
}}
}
val result = runme()
println(result.size + " => " + result.sum)
The result I expected from the output was 120 => 84609924480 but the output was rather random. The returned collection size differed from execution to execution. Most of the time there was missing count even though all the futures were executed looking at the console. I thought flatMap waits the parallel executions in map to complete before returning the complete. What should I do to always get the right result using par? Thanks
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It's fixed from 2.11.6, so if you use that (or higher) Scala version, you should not see the strange behavior.
And about the overflow, I still think that your expected value is wrong. You can check that the sum is overflowing because the list is of integers (which are 32 bit) while the total sum exceeds the integer limits. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList // your original code
val yourValue = s.sum.toLong // your original code
val correctValue = 1l * n * (n + 1) / 2 // use math formula
var bruteForceValue = 0l // in case you don't trust math :) It's Long because of 0l
for (i ← 1 to n) bruteForceValue += i // iterate through range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks #kaktusito.
It worked after I changed the grouped list to Vector or Seq i.e. (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...