Caffe out of memory, where is it used? - neural-network

I'm trying to train a network in Caffe, a slightly modified SegNet-basic model.
I understand that the Check failed: error == cudaSuccess (2 vs. 0) out of memory error I am getting is due to me running out of GPU memory. However, what puzzles me is this:
My "old" training attempts worked fine. The network initialized and ran, with the following:
batch size 4
Memory required for data: 1800929300 (this calculates in the batch size, so it is 4x sample size here)
Total number of parameters: 1418176
the network is made out of 4x(convolution, ReLU, pooling) followed by 4x(upsample, deconvolution); with 64 filters with kernel size 7x7 per layer.
What surprises me is that my "new" network runs out of memory even though I lowered the batch size, and I don't understand what is reserving the additional memory:
batch size 1
Memory required for data: 1175184180 ( = sample size)
Total number of parameters: 1618944
The input size is doubled along each dimension (the expected output size does not change), so the increase in the number of parameters comes from one additional set of (convolution, ReLU, pooling) at the beginning of the network.
The number of parameters was counted by this script, which sums the layer-wise parameter counts, each obtained by multiplying together the dimensions of that layer's parameters.
Assuming that each parameter needs 4 bytes of memory, data_memory + num_param*4 still gives a higher memory requirement for my old setup (memory_old = 1806602004 bytes = 1.68GB) than for the new one (memory_new = 1181659956 bytes = 1.10GB).
I've accepted that the additional memory is probably needed somewhere, and that I'll have to re-think my new setup and downsample my input if I can't find a GPU with more memory, however I am really trying to understand where the additional memory is needed and why my new setup is running out of memory.
EDIT: Per request, here are the layer dimensions for each of the networks coupled with the size of the data that passes through it:
"Old" network:
Top shape: 4 4 384 512 (3145728)
('conv1', (64, 4, 7, 7)) --> 4 64 384 512 (50331648)
('conv1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv2', (64, 64, 7, 7)) --> 4 64 192 256 (12582912)
('conv2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv3', (64, 64, 7, 7)) --> 4 64 96 128 (3145728)
('conv3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv4', (64, 64, 7, 7)) --> 4 64 48 64 (786432)
('conv4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode4', (64, 64, 7, 7)) --> 4 64 48 64 (786432)
('conv_decode4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode3', (64, 64, 7, 7)) --> 4 64 96 128 (3145728)
('conv_decode3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv_decode2', (64, 64, 7, 7)) --> 4 64 192 256 (12582912)
('conv_decode2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv_decode1', (64, 64, 7, 7)) --> 4 64 384 512 (50331648)
('conv_decode1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv_classifier', (3, 64, 1, 1))
For the "New" network, the top few layers differ and the rest is exactly the same except that the batch size is 1 instead of 4:
Top shape: 1 4 769 1025 (3152900)
('conv0', (64, 4, 7, 7)) --> 1 4 769 1025 (3152900)
('conv0_bn', (1, 64, 1, 1)) --> 1 64 769 1025 (50446400)
('conv1', (64, 4, 7, 7)) --> 1 64 384 512 (12582912)
('conv1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv2', (64, 64, 7, 7)) --> 1 64 192 256 (3145728)
('conv2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv3', (64, 64, 7, 7)) --> 1 64 96 128 (786432)
('conv3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv4', (64, 64, 7, 7)) --> 1 64 48 64 (196608)
('conv4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode4', (64, 64, 7, 7)) --> 1 64 48 64 (196608)
('conv_decode4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode3', (64, 64, 7, 7)) --> 1 64 96 128 (786432)
('conv_decode3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv_decode2', (64, 64, 7, 7)) --> 1 64 192 256 (3145728)
('conv_decode2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv_decode1', (64, 64, 7, 7)) --> 1 64 384 512 (12582912)
('conv_decode1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv_classifier', (3, 64, 1, 1))
This skips the pooling and upsampling layers. Here is the train.prototxt for the "new" network. The old network does not have the layers conv0, conv0_bn and pool0, while the other layers are the same. The "old" network also has batch_size set to 4 instead of 1.
EDIT2: Per request, even more info:
All the input data has the same dimensions. It's a stack of 4 channels, each of the size 769x1025, so always 4x769x1025 input.
The caffe training log is here: as you can see, I get out of memory just after network initialization. Not a single iteration runs.
My GPU has 8GB of memory, while I've just found out (trying it on a different machine) that this new network requires 9.5GB of GPU memory.
Just to reiterate, I am trying to understand why my "old" setup fits into 8GB of memory and the "new" one doesn't, as well as why the amount of memory needed for the additional data is ~8 times larger than the memory needed to hold the input. However, now that I have confirmed that the "new" setup takes only 9.5GB, it might not be as much bigger than the "old" one as I suspected (unfortunately the GPU is currently being used by somebody else, so I can't check exactly how much memory the old setup needed).

Bear in mind that caffe actually allocates room for two copies of the net: the "train phase" net and the "test phase" net. So if the data takes 1.1GB you need to double this space.
Moreover, you need to allocate space for the parameters. Each parameter needs to store its gradient. In addition, the solver keeps track of the "momentum" for each parameter (sometimes even the 2nd moment, e.g., in the ADAM solver). Therefore, increasing the number of parameters even by a small amount can result in a significant addition to the memory footprint of the training system.
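The accounting described in this answer can be written down as a rough back-of-the-envelope estimate. This is only a sketch: the function name, the 4-bytes-per-value assumption, and the choice to count the parameters once (rather than once per net) are assumptions made for illustration, not something Caffe reports.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision values


def estimate_training_bytes(data_bytes, num_params, nets=2, solver_history_slots=1):
    """Rough lower bound that follows only the accounting described above."""
    # One copy of the data blobs per net (train phase + test phase).
    data_total = nets * data_bytes
    # Per parameter: the value itself, its gradient, and the solver history
    # (1 slot for momentum, 2 for a solver that also tracks a 2nd moment).
    param_total = num_params * BYTES_PER_FLOAT * (2 + solver_history_slots)
    return data_total + param_total


# Numbers taken from the question ("Memory required for data", parameter count).
old_estimate = estimate_training_bytes(1800929300, 1418176)
new_estimate = estimate_training_bytes(1175184180, 1618944)
print(old_estimate, new_estimate)
```

With the numbers from the question, both totals stay well below the 9.5GB observed for the new network, so this accounting is at best a lower bound and other allocations that the sketch does not model must contribute as well.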


Decoding a JPEG Huffman Table

I am looking for a way to retrieve the minCode, maxCode and valPtr from an arbitrary Huffman table.
For instance, the following is a Huffman DC table generated by JpegSnoop:
Destination ID = 0
Class = 0 (DC / Lossless Table)
Codes of length 01 bits (000 total):
Codes of length 02 bits (001 total): 00
Codes of length 03 bits (005 total): 01 02 03 04 05
Codes of length 04 bits (001 total): 06
Codes of length 05 bits (001 total): 07
Codes of length 06 bits (001 total): 08
Codes of length 07 bits (001 total): 09
Codes of length 08 bits (001 total): 0A
Codes of length 09 bits (001 total): 0B
Codes of length 10 bits (000 total):
Codes of length 11 bits (000 total):
Codes of length 12 bits (000 total):
Codes of length 13 bits (000 total):
Codes of length 14 bits (000 total):
Codes of length 15 bits (000 total):
Codes of length 16 bits (000 total):
Total number of codes: 012
And the following are its Mincode, MaxCode and valPtr respectively:
{ 0, 0, 2, 14, 30, 62, 126, 254, 510, 0, 0, 0, 0, 0, 0, 0 },//YDC
{ -1, 0, 6, 14, 30, 62, 126, 254, 510, -1, -1, -1, -1, -1, -1, -1 },//YDC
{ 0, 0, 1, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0 },//YDC
Now I'm really confused about how these values were derived.
I checked the itu-t81 file, but it was not very clear.
To generate the code bits, you start with all zero bits. Within each code length, increment the code like an integer for each symbol. When stepping up a code length, increment and then add a zero bit to the end.
So for your example code, we have each length, followed by the corresponding codes in binary:
2: 00
3: 010, 011, 100, 101, 110
4: 1110
5: 11110
6: 111110
7: 1111110
8: 11111110
9: 111111110
Converting those to the corresponding integer ranges for each bit length, we have:
2: 0..0
3: 2..6
4: 14..14
5: 30..30
6: 62..62
7: 126..126
8: 254..254
9: 510..510
You can see exactly those ranges in your MinCode and MaxCode vectors.
You also have a list of symbols that correspond to the codes. In this example, that list is simply:
00 01 02 03 04 05 06 07 08 09 0A 0B
(The particular values of the symbols are not relevant to the valPtr vector. Those could be anything.)
The codes are assigned to the symbols from shortest to longest, and within each length, in integer order. The valPtr vector is simply the index of the first symbol in that vector that corresponds to each bit length. To generate the vector, start at zero, and add the number of symbols of each code length to get the starting index for the next code length.
1: 0, 0 symbols
2: 0 + 0 = 0, 1 2-bit symbol
3: 0 + 1 = 1, 5 3-bit symbols
4: 1 + 5 = 6, 1 4-bit symbol
5: 6 + 1 = 7, 1 5-bit symbol
6: 7 + 1 = 8, 1 6-bit symbol
7: 8 + 1 = 9, 1 7-bit symbol
8: 9 + 1 = 10, 1 8-bit symbol
9: 10 + 1 = 11
The example valPtr vector consists of the numbers after the equals signs above.
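The procedure described above can be sketched in a few lines of Python. The function name and the placeholder values used for empty code lengths (0, -1, 0) are assumptions chosen to match the vectors shown in the question; other decoders may use different placeholders.

```python
def build_decode_tables(counts):
    """counts[i] is the number of codes that are i + 1 bits long."""
    mincode, maxcode, valptr = [], [], []
    code = 0    # next unassigned code at the current bit length
    index = 0   # index of the next symbol in the symbol list
    for n in counts:
        if n == 0:
            # Empty length: placeholder values as in the question's vectors.
            mincode.append(0)
            maxcode.append(-1)
            valptr.append(0)
        else:
            valptr.append(index)       # first symbol of this length
            mincode.append(code)       # smallest code of this length
            code += n
            maxcode.append(code - 1)   # largest code of this length
            index += n
        code <<= 1  # stepping up a length appends a zero bit
    return mincode, maxcode, valptr


# Code counts per length (1..16 bits) from the JpegSnoop dump above.
counts = [0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
mincode, maxcode, valptr = build_decode_tables(counts)
print(mincode)
print(maxcode)
print(valptr)
```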
Thanks, I have written code that decodes the tables and returns the desired values. The code may be found on my GitHub here.

PNG decompressed IDAT chunk. How to read?

I have read the PNG specification too many times and am still confused about how I should interpret the IDAT chunk. I have decompressed it using zlib and got all of the bytes that my IDAT chunk contains.
I made an example image using Krita. It's a 3x2 PNG image with a different color in every pixel.
See the 3 by 2 PNG image here
According to the section on filters in the PNG specification, when the first byte of the IDAT data is 1, the filter method that has been applied is
Filtered(byte) = Original(byte) - Original(previous_byte)
With that formula in mind I decompressed my IDAT chunk (which was 29 bytes in length to store only 6 pixels). The first byte (which is byte number 0) contains the value 1. That is where the formula comes from.
Byte# Value
0 1
1 224
2 215
3 200
4 227
5 241
6 48
7 2
8 36
9 225
10 1
11 253
12 255
13 195
14 245
15 182
16 244
17 232
18 245
19 57
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
The first pixel is supposed to be RGB(224, 215, 200) which I reconstructed with a RGB to color converter. This seems pretty much the same color as the original pixel in the image. Here are my thoughts about all the color pixels.
Pixel 1: RGB(224, 215, 200) [read from byte 1, byte2 and byte3]
Pixel 2: RGB(195, 200, 248) [because byte 4:227 byte5:241 byte6:48]
Pixel 3: RGB(197, 236, 217) [because byte 7:2 byte8:36 byte9:225]
Pixel 4: RGB(198, 233, 217) [because byte10:1 byte11:253 byte12:255]
Pixel 5: RGB(137, 222, 142) [because byte13:195 byte14:245 byte15:182]
Pixel 6: RGB(107, 198, 131) [because byte16:244 byte17:232 byte18:245]
I have used the formula to get all the values from the pixels.
Reconstructing pixel 1, 2 and 3 looks pretty much the same, but pixel 4, 5 and 6 are not what I have expected. I think I am not reading the IDAT chunk the correct way. That could explain why there are 29 bytes for only 6 pixels RGB. I expected 19 bytes because 3 times 6 is 18 and 1 byte for the filtering method.
The IHDR says that the bit depth is 8 and the color type is 2. According to the table in the specification, each pixel is an R, G and B triple. Could someone point me in the right direction to read the IDAT chunk and explain its length?
Your decompressed result length of 29 is not correct, which may have led to your confusion.
Your image is 3x2 RGB pixels. That would be 3*3 * 2 = 18 bytes of data, plus 1 extra byte per row; a total of 20 bytes. Somehow you got an extra 9 dummy bytes, not part of the compressed data.
(I reconstructed your tiny image from the larger one and happily got the exact same numbers, else the explanation would necessarily be purely theoretical. For ease, I determined the offset of the zipped data with a hex viewer.)
>>> import zlib
>>> with open('3x2b.png', 'rb') as f:
...     result = f.seek(0x6a)  # offset of the zlib data, found with a hex viewer
...     data = f.read()
...
>>> d = zlib.decompress(data)
>>> print([x for x in d])
[1, 224, 215, 200, 227, 241, 48, 2, 36, 225, 1, 253, 255, 195, 245, 182, 244, 232, 245, 57]
This 'unpacks' to the following two rows, with 3 RGB pixel values each:
filter RGB RGB RGB
1 (224,215,200) (227,241,48) (2,36,225)
1 (253,255,195) (245,182,244) (232,245,57)
All these values may be relative to an earlier result: the last complete row read before it, or the pixel to its left. For the first row, you must assume a row of all zeroes; the value "left" of the first pixel must be assumed to be 0 as well.
You see the two bytes marked 'filter'? That is where you went wrong. Each row has a filter byte of its own. You used the filter byte itself for the calculation of the second row.
Adding (the inverse of the "Sub" filter indicated by filter type 1, with every sum taken modulo 256) yields
; start of row 0, filter is 1 and 'initial pixel' is (0,0,0)
(224,215,200) (224+227,215+241,200+48)
=(195,200,248)
(195+2,200+36,248+225)
=(197,236,217)
; restart for row 1, filter is 1 again and start value (0,0,0):
(253,255,195) (253+245,255+182,195+244)
=(242,181,183)
(242+232,181+245,183+57)
=(218,170,240)
... exactly the colors I started out with.
This is Filter 1 ("Sub") and so uses the values to its left; for Filter 2 ("Up"), you need to use the corresponding byte in the previously decoded row, and for Average and Paeth, you need both.
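As a minimal sketch of the row-by-row reconstruction, the following Python function undoes filter types 0 ("None"), 1 ("Sub") and 2 ("Up") for 8-bit RGB data. The function name and its parameters are made up for this illustration, and Average and Paeth are deliberately left out.

```python
def unfilter(data, width, bytes_per_pixel=3):
    """Undo PNG filter types 0, 1 and 2 on decompressed IDAT bytes."""
    stride = width * bytes_per_pixel
    rows = []
    prev = bytearray(stride)  # the row "above" the first row is all zeroes
    for start in range(0, len(data), stride + 1):
        filter_type = data[start]  # every row starts with its own filter byte
        row = bytearray(data[start + 1:start + 1 + stride])
        for i in range(stride):
            # Same channel of the already reconstructed pixel to the left.
            left = row[i - bytes_per_pixel] if i >= bytes_per_pixel else 0
            if filter_type == 0:
                predictor = 0
            elif filter_type == 1:
                predictor = left
            elif filter_type == 2:
                predictor = prev[i]
            else:
                raise NotImplementedError("Average and Paeth are not sketched here")
            row[i] = (row[i] + predictor) & 0xFF  # sums wrap modulo 256
        rows.append(row)
        prev = row
    return rows


raw = bytes([1, 224, 215, 200, 227, 241, 48, 2, 36, 225,
             1, 253, 255, 195, 245, 182, 244, 232, 245, 57])
pixels = [[tuple(r[i:i + 3]) for i in range(0, len(r), 3)]
          for r in unfilter(raw, width=3)]
print(pixels)
```

Run on the 20 bytes from the answer, this reproduces the six pixel values worked out by hand above.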

Embedding an array into another

I have two arrays. The first one is a consecutive sequential one, like:
seq1 =
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
...continues
The second one is like:
seq2 =
2 250
3 260
5 267
6 270
8 280
10 290
13 300
18 310
20 320
21 330
...continues
I need to embed seq2 into seq1 in such a way that I end up with the sequence:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
... continues
I could do this with loops but the arrays are really big so I don't want to use two for loops, it is taking too long. How can I do this in a vectorised manner?
I think this does what you want:
[~, jj, vv] = find(sum(bsxfun(@le, seq2(:,1), seq1(:,1).'), 1));
seq3 = seq1;
seq3(jj,2) = seq2(vv,2);
How it works
The required index is obtained by computing how many values in the first column of seq2 are less than or equal to each value in the first column of seq1 (code sum(bsxfun(@le, ...), 1)). This index is used to select the appropriate entries from the second column of seq2 and write them into the result. But before that, the entries where this index is 0 need to be discarded. This is done using the three-output version of find (code [~, jj, vv] = find(...)).
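The counting idea can be illustrated in Python with the standard library's bisect module. This is only a sketch of the indexing logic: it assumes the keys of seq2 are sorted in ascending order, the function name is made up, and it is a plain loop rather than a vectorised operation (in NumPy, np.searchsorted with side="right" computes the same counts for a whole array at once).

```python
from bisect import bisect_right


def embed(seq1, seq2):
    """For each key in seq1, take the value of the last seq2 key <= that key."""
    keys2 = [key for key, _ in seq2]  # assumed sorted ascending
    result = []
    for key, value in seq1:
        # Number of seq2 keys that are <= key.
        count = bisect_right(keys2, key)
        if count > 0:
            value = seq2[count - 1][1]
        result.append((key, value))
    return result


seq1 = [(i, 0) for i in range(1, 12)]
seq2 = [(2, 250), (3, 260), (5, 267), (6, 270), (8, 280),
        (10, 290), (13, 300), (18, 310), (20, 320), (21, 330)]
seq3 = embed(seq1, seq2)
print(seq3)
```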
If your second column of data is always increasing, you can solve this easily with accumarray and cummax:
seq = [seq1; seq2];
seq3 = cummax(accumarray(seq(:, 1), seq(:, 2), [], @max));
seq3 = [(1:numel(seq3)).' seq3];
And here's what you get for your sample inputs:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
12 290
13 300
14 300
15 300
16 300
17 300
18 310
19 310
20 320
21 330
How it works...
After concatenating seq1 and seq2, accumarray collects all the values in the second column that have the same value in the first column (i.e. [0 250] for the value 2), then gets the maximum value of each set. The function cummax is then used to fill any zero values with the previous non-zero value. Finally, an index column is added to the new sequence.
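The same two steps (maximum per key, then a running maximum) can be sketched in Python with itertools.accumulate. The function name is hypothetical, and the sketch assumes positive integer keys and values that never decrease as the key grows, as in the sample data.

```python
from itertools import accumulate


def embed_cummax(seq1, seq2):
    """Maximum value per key, followed by a running maximum over the keys."""
    pairs = seq1 + seq2
    largest_key = max(key for key, _ in pairs)
    best = [0] * (largest_key + 1)  # slot 0 is unused; keys start at 1
    for key, value in pairs:
        best[key] = max(best[key], value)
    # The running maximum fills the zero entries with the last larger value.
    filled = list(accumulate(best[1:], max))
    return [(i + 1, value) for i, value in enumerate(filled)]


seq1 = [(i, 0) for i in range(1, 22)]
seq2 = [(2, 250), (3, 260), (5, 267), (6, 270), (8, 280),
        (10, 290), (13, 300), (18, 310), (20, 320), (21, 330)]
seq3 = embed_cummax(seq1, seq2)
print(seq3)
```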

Select numbers from array which are much greater than the rest

Say there is an array of n elements, and out of n elements there be some numbers which are much bigger than the rest.
So, I might have:
16, 1, 1, 0, 5, 0, 32, 6, 54, 1, 2, 5, 3
In this case, I'd be interested in 32, 16 and 54.
Or, I might have:
32, 105, 26, 5, 1, 82, 906, 58, 22, 88, 967, 1024, 1055
In this case, I'd be interested in 1024, 906, 967 and 1055.
I'm trying to write a function to extract the numbers of interest. The problem is that I can't define a threshold to determine what's "much greater", and I can't just tell it to get the x biggest numbers because both of these will vary depending on what the function is called against.
I'm a little stuck. Does anyone have any ideas how to attack this?
Just taking all the numbers larger than the mean doesn't always cut it, for example when you have only one number that is much larger and many more numbers which are close to each other. The one large number won't shift the mean very much, which results in taking too many numbers:
data = [ones(1,10) 2*ones(1,10) 10];
data(data>mean(data))
ans =
2 2 2 2 2 2 2 2 2 2 10
If you look at the differences between numbers, this problem is solved:
data = [16, 1, 1, 0, 5, 0, 32, 6, 54, 1, 2, 5, 3];
sorted_data = sort(data);
dd = diff(sorted_data);
mean_dd = mean(dd);
% dd(ii) is the gap between sorted_data(ii) and sorted_data(ii+1),
% so the large numbers start just after the first big gap
ii = find(dd > 2*mean_dd, 1, 'first');
large_numbers = sorted_data(ii+1:end)
large_numbers =
16 32 54
The threshold factor (2 in this case) lets you play with the meaning of "how much greater" a number has to be.
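For comparison, here is a Python sketch of the gap-based idea that keeps the values lying after the first large gap in the sorted data. The function name and the default factor of 2 are assumptions for illustration; an empty list is returned when no gap is large enough.

```python
def much_greater(values, factor=2.0):
    """Return the sorted values that lie after the first unusually large gap."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return []
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    for i, gap in enumerate(gaps):
        if gap > factor * mean_gap:
            # gaps[i] separates ordered[i] from ordered[i + 1].
            return ordered[i + 1:]
    return []


first = much_greater([16, 1, 1, 0, 5, 0, 32, 6, 54, 1, 2, 5, 3])
second = much_greater([32, 105, 26, 5, 1, 82, 906, 58, 22, 88, 967, 1024, 1055])
print(first, second)
```

On the two example arrays from the question this selects 16, 32, 54 and 906, 967, 1024, 1055 respectively.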
If it were me I'd use a little more statistical insight, that would give the most flexibility for the code in the future.
x = [1 2 3 2 2 1 4 6 15 83 2 4 22 81 0 8 7 7 7 3 1 2 3]
EpicNumbers = x( x>(mean(x) + std(x)) )
Then you can increase or decrease the number of standard deviations to broaden or tighten your threshold.
LessEpicNumbers = x( x>(mean(x) + 2*std(x)) )
MoreEpicNumbers = x( x>(mean(x) + 0.5*std(x)) )
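The standard-deviation threshold translates directly to Python's statistics module. As a sketch (the function name and the parameter k are made up here), using the sample standard deviation, which is also what MATLAB's std computes by default:

```python
from statistics import mean, stdev


def above_mean_plus_std(values, k=1.0):
    """Keep the values that exceed the mean by more than k standard deviations."""
    threshold = mean(values) + k * stdev(values)
    return [v for v in values if v > threshold]


first = above_mean_plus_std([16, 1, 1, 0, 5, 0, 32, 6, 54, 1, 2, 5, 3])
second = above_mean_plus_std([32, 105, 26, 5, 1, 82, 906, 58, 22, 88, 967, 1024, 1055])
print(first, second)
```

Note that with k = 1 the value 16 from the first example array falls below the threshold, so this criterion and the gap-based one do not always agree.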
A simple solution would be to use find and a threshold based on the mean value (or multiples thereof):
a = [16, 1, 1, 0, 5, 0, 32, 6, 54, 1, 2, 5, 3]
find(a>mean(a))

Padding in MD5 Hash Algorithm

I need to understand the MD5 hash algorithm. I was reading a document and it states
"The message is "padded" (extended) so that its length (in bits) is
congruent to 448, modulo 512. That is, the message is extended so
that it is just 64 bits shy of being a multiple of 512 bits long.
Padding is always performed, even if the length of the message is
already congruent to 448, modulo 512."
I need to understand what this means in simple terms, especially the 448 modulo 512. The word MODULO is the issue. Please I will appreciate simple examples to this. Funny though, this is the first step to MD5 hash! :)
Thanks
Modulo, or mod, is an operation that gives you the remainder when one number is divided by another.
For example:
5 modulo 3:
5/3 = 1, with 2 remainder. So 5 mod 3 is 2.
10 modulo 16 = 10, because 16 does not fit into 10 even once.
15 modulo 5 = 0, because 5 goes into 15 exactly 3 times. 15 is a multiple of 5.
Back in school you would have learnt this as "Remainder" or "Left Over"; modulo is just a fancy way to say that.
What this is saying here is that when you use MD5, one of the first things that happens is that you pad your message so that it has the right length. In MD5's case, the padded message must be n bits long, where n = (512*z)+448 and z is a non-negative integer.
As an example, if the file was 1400 bits long, then you would need to pad in an extra 72 bits before you could run the rest of the MD5 algorithm, because 1472 modulo 512 = 448. If the file was already 1472 bits long, it would still be padded, because padding is always performed: a full 512 bits would be added, giving 1984 bits, and 1984 modulo 512 = 448 as well.
Modulus is the remainder of division. For example:
512 mod 448 = 64
448 mod 512 = 448
Another way to compute 512 mod 448 is to divide them: 512/448 = 1.142..
Then you take the integer part of the result (the digits before the dot), multiply it by 448, and subtract that from 512:
512 - 448*1 == 64 That's your modulus result.
What you need to know is that 448 is 64 bits short of a multiple of 512.
But what if the length mod 512 is between 448 and 512?
Normally the number of padding bits is 448 minus x, where x is the result of the modulus.
447 mod 512 = 447; 448 - 447 = 1; (all good, 1 bit to pad)
449 mod 512 = 449; 448 - 449 = -1 ???
So the solution to this problem is to take the next higher multiple of 512, still shortened by 64:
512*2 - 64 = 960
449 mod 512 = 449; 960 - 449 = 511;
This happens because afterwards we need to append the 64-bit length of the original message, and the full length has to be a multiple of 512.
960 - 449 = 511;
511 + 449 + 64 = 1024;
1024 is a multiple of 512;
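The arithmetic above can be condensed into a small Python function. This is a sketch with a made-up function name; it only computes how many padding bits are needed (between 1 and 512), not the padding bits themselves.

```python
def md5_padding_bits(message_bits):
    """Number of padding bits so that the padded length is 448 mod 512."""
    pad = (448 - message_bits) % 512
    # Padding is always performed, so a length that is already
    # congruent to 448 gets a full 512 bits instead of 0.
    return pad if pad != 0 else 512


for n in (447, 448, 449, 1400, 1472):
    pad = md5_padding_bits(n)
    # After appending the 64-bit length, the total is a multiple of 512.
    print(n, pad, (n + pad + 64) % 512)
```

Python's % operator returns a non-negative result for a positive modulus, so the negative case (448 - 449 = -1) needs no special handling here.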