Theano CPU Multicore lower utilisation on a larger dataset - neural-network

I have a Multilayer Perceptron code using Theano/Lasagne that when I run on a small dataset uses multiple cores correctly.
But, when I run it over a much larger dataset (with the same code) and I watch CPU utilisation in htop, I don't see it parallelize. It creates the number of processes as defined in .theanorc file which looks like:
[global]
OMP_NUM_THREADS=15
openmp=True
floatX = float32
[blas]
ldflags=-L/usr/lib/ -lblas -lgfortran
but most of the time (~ 90%) there is only one of the created processes is working (though some short amount of times utilisation goes up).
I imagine there is an operation that is not utilising multi-core while other operations are using it because when I run the code on a small dataset I see that most of the time all cores are utilised. All the operations are matrix multiplication (both sparse and dense) so I don't
know why some operations don't use multi-core.
This is the section of the code that is being run:
for n in xrange(self.n_epochs):
x_batch, y_batch = self.X, Y_train
l_train, acc_train = self.f_train(x_batch, y_batch, self.train_indices)
l_val, acc_val = self.f_val(self.X, Y_dev, self.dev_indices)
val_pred = self.f_predict(self.X, self.dev_indices)
if acc_val > best_val_acc:
best_val_loss = l_val
best_val_acc = acc_val
best_params = lasagne.layers.get_all_param_values(self.l_out)
n_validation_down = 0
else:
#early stopping
n_validation_down += 1
logging.info('epoch ' + str(n) + ' ,train_loss ' + str(l_train) + ' ,acc ' + str(acc_train) + ' ,val_loss ' + str(l_val) + ' ,acc ' + str(acc_val) + ',best_val_acc ' + str(best_val_acc))
if n_validation_down > self.early_stopping_max_down:
logging.info('validation results went down. early stopping ...')
break
My numpy/blas information:
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blis_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/usr/local/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
Note that the input matrices are sparse and I'm doing some extra sparse multiplications within custom layers using S.dot.

Related

ORTOOLS - CPSAT - Objective to minimize a value by intervals

I my model in ORTools CPSAT, I am computing a variable called salary_var (among others). I need to minimize an objective. Let’s call it « taxes ».
to compute the taxes, the formula is not linear but organised this way:
if salary_var below 10084, taxes corresponds to 0%
between 10085 and 25710, taxes corresponds to 11%
between 25711 and 73516, taxes corresponds to 30%
and 41% for above
For example, if salary_var is 30000 then, taxes are:
(25710-10085) * 0.11 + (30000-25711) * 0.3 = 1718 + 1286 = 3005
My question: how can I efficiently code my « taxes » objective?
Thanks for your help
Seb
This task looks rather strange, there is not much context and some parts of the task might touch some not-so-nice areas of finite-domain based solvers (large domains or scaling / divisions during solving).
Therefore: consider this as an idea / template!
Code
from ortools.sat.python import cp_model
# Data
INPUT = 30000
INPUT_UB = 1000000
TAX_A = 11
TAX_B = 30
TAX_C = 41
# Helpers
# new variable which is constrained to be equal to: given input-var MINUS constant
# can get negative / wrap-around
def aux_var_offset(model, var, offset):
aux_var = model.NewIntVar(-INPUT_UB, INPUT_UB, "")
model.Add(aux_var == var - offset)
return aux_var
# new variable which is equal to the given input-var IFF >= 0; else 0
def aux_var_nonnegative(model, var):
aux_var = model.NewIntVar(0, INPUT_UB, "")
model.AddMaxEquality(aux_var, [var, model.NewConstant(0)])
return aux_var
# Model
model = cp_model.CpModel()
# vars
salary_var = model.NewIntVar(0, INPUT_UB, "salary")
tax_component_a = model.NewIntVar(0, INPUT_UB, "tax_11")
tax_component_b = model.NewIntVar(0, INPUT_UB, "tax_30")
tax_component_c = model.NewIntVar(0, INPUT_UB, "tax_41")
# constraints
model.AddMinEquality(tax_component_a, [
aux_var_nonnegative(model, aux_var_offset(model, salary_var, 10085)),
model.NewConstant(25710 - 10085)])
model.AddMinEquality(tax_component_b, [
aux_var_nonnegative(model, aux_var_offset(model, salary_var, 25711)),
model.NewConstant(73516 - 25711)])
model.Add(tax_component_c == aux_var_nonnegative(model,
aux_var_offset(model, salary_var, 73516)))
tax_full_scaled = tax_component_a * TAX_A + tax_component_b * TAX_B + tax_component_c * TAX_C
# Demo
model.Add(salary_var == INPUT)
solver = cp_model.CpSolver()
status = solver.Solve(model)
print(list(map(lambda x: solver.Value(x), [tax_component_a, tax_component_b, tax_component_c, tax_full_scaled])))
Output
[15625, 4289, 0, 300545]
Remarks
As implemented:
uses scaled solving
produces scaled solution (300545)
no fiddling with non-integral / ratio / rounding stuff BUT large domains
Alternative:
Maybe something around AddDivisionEquality
Edit in regards to Laurents comments
In some scenarios, solving the scaled problem but being able to reason about the real unscaled values easier might make sense.
If i interpret the comment correctly, the following would be a demo (which i was not aware of and it's cool!):
Updated Demo Code (partial)
# Demo -> Attempt of demonstrating the objective-scaling suggestion
model.Add(salary_var >= 30000)
model.Add(salary_var <= 40000)
model.Minimize(salary_var)
model.Proto().objective.scaling_factor = 0.001 # DEFINE INVERSE SCALING
solver = cp_model.CpSolver()
solver.parameters.log_search_progress = True # SCALED BACK OBJECTIVE PROGRESS
status = solver.Solve(model)
print(list(map(lambda x: solver.Value(x), [tax_component_a, tax_component_b, tax_component_c, tax_full_scaled])))
print(solver.ObjectiveValue()) # SCALED BACK OBJECTIVE
Output (excerpt)
...
...
#1 0.00s best:30 next:[30,29.999] fixed_bools:0/1
#Done 0.00s
CpSolverResponse summary:
status: OPTIMAL
objective: 30
best_bound: 30
booleans: 1
conflicts: 0
branches: 1
propagations: 0
integer_propagations: 2
restarts: 1
lp_iterations: 0
walltime: 0.0039022
usertime: 0.0039023
deterministic_time: 8e-08
primal_integral: 1.91832e-07
[15625, 4289, 0, 300545]
30.0

Clarification between Epoch and iteration

This answer points to the difference between an Epoch and an iteration while training a neural network. However, when I look at the source code for the solver API in the Stanford CS231n course (and I'm assuming this is the case for most libraries out there as well), during each iteration, batch_size number of examples are randomly selected with replacement. Thus, there is no guarantee that all examples would been seen during each epoch is there?
Does an epoch then mean that all examples would be seen in expectation? Or am I understanding this wrong?
Relevant Source Code:
def _step(self):
"""
Make a single gradient update. This is called by train() and should not
be called manually.
"""
# Make a minibatch of training data
num_train = self.X_train.shape[0]
batch_mask = np.random.choice(num_train, self.batch_size)
X_batch = self.X_train[batch_mask]
y_batch = self.y_train[batch_mask]
# Compute loss and gradient
loss, grads = self.model.loss(X_batch, y_batch)
self.loss_history.append(loss)
# Perform a parameter update
for p, w in self.model.params.iteritems():
dw = grads[p]
config = self.optim_configs[p]
next_w, next_config = self.update_rule(w, dw, config)
self.model.params[p] = next_w
self.optim_configs[p] = next_config
def train(self):
"""
Run optimization to train the model.
"""
num_train = self.X_train.shape[0]
iterations_per_epoch = max(num_train / self.batch_size, 1)
num_iterations = self.num_epochs * iterations_per_epoch
for t in xrange(num_iterations):
self._step()
# Maybe print training loss
if self.verbose and t % self.print_every == 0:
print '(Iteration %d / %d) loss: %f' % (
t + 1, num_iterations, self.loss_history[-1])
# At the end of every epoch, increment the epoch counter and decay the
# learning rate.
epoch_end = (t + 1) % iterations_per_epoch == 0
if epoch_end:
self.epoch += 1
for k in self.optim_configs:
self.optim_configs[k]['learning_rate'] *= self.lr_decay
# Check train and val accuracy on the first iteration, the last
# iteration, and at the end of each epoch.
first_it = (t == 0)
last_it = (t == num_iterations + 1)
if first_it or last_it or epoch_end:
train_acc = self.check_accuracy(self.X_train, self.y_train,
num_samples=1000)
val_acc = self.check_accuracy(self.X_val, self.y_val)
self.train_acc_history.append(train_acc)
self.val_acc_history.append(val_acc)
if self.verbose:
print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
self.epoch, self.num_epochs, train_acc, val_acc)
# Keep track of the best model
if val_acc > self.best_val_acc:
self.best_val_acc = val_acc
self.best_params = {}
for k, v in self.model.params.iteritems():
self.best_params[k] = v.copy()
# At the end of training swap the best params into the model
self.model.params = self.best_params
Thanks.
I believe, as you say, that in the Stanford course they are effectively using "epoch" with the less strict meaning of "expected number of times each example is seen during training". However, in my experience, most implementations consider an epoch as running through every example in the training set once, and I'd say they only chose the sampling with replacement for simplicity. If you have a good amount of data, chances are that you will not see a difference, but still, it is more correct to sample without replacement until there are no more examples.
You can check, for example, how Keras does the training in its source code; it's a bit complicated, but the important point is that make_batches is called to split the (possibly shuffled) examples into batches, which matches your initial idea of "epoch".

Julia Method Error converting Complex{Float64}

I'm novice to Julia and I have the following code with this error:
MethodError(convert,(Complex{Float64},[-1.0 - 1.0im])).
I would like to know the source of the error and how to optimize this piece of code for speed.
This is my code:
function OfdmSym()
N = 64
n = 1000
symbol = convert(Array{Complex{Float64},2},ones(n,64)) # I need Array{Complex{Float64},2}
data = convert(Array{Complex{Float64},2},ones(1,48)) # I need Array{Complex{Float64},2}
const unused = convert(Array{Complex{Float64},2},zeros(1,12))
const pilot = convert(Array{Complex{Float64},2},ones(1,4))
const s = convert(Array{Complex{Float64},2},[-1-im -1+im 1-im 1+im])# QPSK Complex Data
for i=1:n # generate 1000 symbols
for j = 1:48 # generate 48 complex data symbols whose basis is s
r = rand(1:4,1) # 1, 2, 3, or 4
data[j] = s[r]
end
symbol[i,:]=[data[1,1:10] pilot[1] data[1,11:20] pilot[2] data[1,21:30] pilot[3] data[1,31:40] pilot[4] data[1,41:48] unused]
end
end
As it's the first day programming in Julia, I tried very hard to reveal the source of the error without success. I also tried to optimize and initialize arrays as I could but when I time the code I realize that it is far from optimal. I appreciate your help.
Try this much simpler code
function OfdmSym()
N = 64
n = 1000
symbol = ones(Complex{Float64}, n, 64)
data = ones(Complex{Float64}, 1, 48)
unused = zeros(Complex{Float64}, 1, 12)
pilot = ones(Complex{Float64}, 1, 4)
s = [-1-im -1+im 1-im 1+im]
for i=1:n # generate 1000 symbols
for j = 1:48 # generate 48 complex data symbols whose basis is s
r = rand(1:4) # 1, 2, 3, or 4
data[j] = s[r]
end
symbol[i,:]=[data[1,1:10] pilot[1] data[1,11:20] pilot[2] data[1,21:30] pilot[3] data[1,31:40] pilot[4] data[1,41:48] unused]
end
end
OfdmSym()
I wouldn't worry too much about optimizing things until you have got it working correctly. The way you have it set up now seems like it'd be kinda inefficient due to all the slicing of arrays - it'd be better to try to build symbol directly.

MATLAB Execution Time Increasing

Here is my code. The intent is I have a Wireshark capture saved to a particularly formatted text file. The MATLAB code is supposed to go through the Packets, dissect them for different protocols, and then make tables based on those protocols. I currently have this programmed for ETHERNET/IP/UDP/MODBUS. In this case, it creates a column in MBTable each time it encounters a new register value, and each time it comes across a change to that register value, it updates the value in that line of the table. The first column of MBTable is time, the registers start with the second column.
MBTable is preallocated to over 100,000 Rows (nol is very large), 10 columns before this code is executed. The actual data from a file I'm pulling into the table gets to about 10,000 rows and 4 columns and the code execution is so slow I have to stop it. The tic/toc value is calculated every 1000 rows and continues to increase exponentially with every iteration. It is a large loop, but I can't see where anything is growing in such a way that it would cause it to run slower with each iteration.
All variables get initialized up top (left out to lessen amount of code.
The variables eth, eth.ip, eth.ip.udp, and eth.ip.udp.modbus are all of type struct as is eth.header and eth.ip.header. WSID is a file ID from a .txt file opened earlier.
MBTable = zeros(nol,10);
tval = tic;
while not(feof(WSID))
packline = packline + 1;
fl = fl + 1;
%Get the next line from the file
MBLine = fgetl(WSID);
%Make sure line is not blank or short
if length(MBLine) >= 3
%Split the line into 1. Line no, 2. Data, 3. ASCII
%MBAll = strsplit(MBLine,' ');
%First line of new packet, if headers included
if strcmp(MBLine(1:3),'No.')
newpack = true;
newtime = false;
newdata = false;
stoppack = false;
packline = 1;
end
%If packet has headers, 2nd line contains timestamp
if newpack
Ordered = false;
if packline == 2;
newtime = true;
%MBstrs = strsplit(MBAll{2},' ');
packno = int32(str2double(MBLine(1:8)));
t = str2double(MBLine(9:20));
et = t - lastt;
if lastt > 0 && et > 0
L = L + 1;
MBTable(L,1) = t;
end
%newpack = false;
end
if packline > 3
dataline = int16(str2double(MBLine(1:4)));
packdata = strcat(packdata,MBLine(7:53));
end
end
else
%if t >= st
if packline > 3
stoppack = true;
newpack = false;
end
if stoppack
invalid = false;
%eth = struct;
eth.pack = packdata(~isspace(packdata));
eth.length = length(eth.pack);
%Dissect the packet data
eth.stbyte = 1;
eth.ebyte = eth.length;
eth.header.stbyte = 1;
eth.header.ebyte = 28;
%Ethernet Packet Data
eth.header.pack = eth.pack(eth.stbyte:eth.stbyte+27);
eth.header.dest = eth.header.pack(eth.header.stbyte:eth.header.stbyte + 11);
eth.header.src = eth.header.pack(eth.header.stbyte + 12:eth.header.stbyte + 23);
eth.typecode = eth.header.pack(eth.header.stbyte + 24:eth.header.ebyte);
if strcmp(eth.typecode,'0800')
eth.type = 'IP';
%eth.ip = struct;
%IP Packet Data
eth.ip.stbyte = eth.header.ebyte + 1;
eth.ip.ver = eth.pack(eth.ip.stbyte);
%IP Header length
eth.ip.header.length = 4*int8(str2double(eth.pack(eth.ip.stbyte+1)));
eth.ip.header.ebyte = eth.ip.stbyte + eth.ip.header.length - 1;
%Differentiated Services Field
eth.ip.DSF = eth.pack(eth.ip.stbyte + 2:eth.ip.stbyte + 3);
%Total IP Packet Length
eth.ip.length = hex2dec(eth.pack(eth.ip.stbyte+4:eth.ip.stbyte+7));
eth.ip.ebyte = eth.ip.stbyte + max(eth.ip.length,46) - 1;
eth.ip.pack = eth.pack(eth.ip.stbyte:eth.ip.ebyte);
eth.ip.ID = eth.pack(eth.ip.stbyte+8:eth.ip.stbyte+11);
eth.ip.flags = eth.pack(eth.ip.stbyte+12:eth.ip.stbyte+13);
eth.ip.fragoff = eth.pack(eth.ip.stbyte+14:eth.ip.stbyte+15);
%Time to Live
eth.ip.ttl = hex2dec(eth.pack(eth.ip.stbyte+16:eth.ip.stbyte+17));
eth.ip.typecode = eth.pack(eth.ip.stbyte+18:eth.ip.stbyte+19);
eth.ip.checksum = eth.pack(eth.ip.stbyte+20:eth.ip.stbyte+23);
%eth.ip.src = eth.pack(eth.ip.stbyte+24:eth.ip.stbyte+31);
eth.ip.src = ...
[num2str(hex2dec(eth.pack(eth.ip.stbyte+24:eth.ip.stbyte+25))),'.', ...
num2str(hex2dec(eth.pack(eth.ip.stbyte+26:eth.ip.stbyte+27))),'.', ...
num2str(hex2dec(eth.pack(eth.ip.stbyte+28:eth.ip.stbyte+29))),'.', ...
num2str(hex2dec(eth.pack(eth.ip.stbyte+30:eth.ip.stbyte+31)))];
eth.ip.dest = ...
[num2str(hex2dec(eth.pack(eth.ip.stbyte+32:eth.ip.stbyte+33))),'.', ...
num2str(hex2dec(eth.pack(eth.ip.stbyte+34:eth.ip.stbyte+35))),'.', ...
num2str(hex2dec(eth.pack(eth.ip.stbyte+36:eth.ip.stbyte+37))),'.', ...
num2str(hex2dec(eth.pack(eth.ip.stbyte+38:eth.ip.stbyte+39)))];
if strcmp(eth.ip.typecode,'11')
eth.ip.type = 'UDP';
eth.ip.udp.stbyte = eth.ip.stbyte + 40;
eth.ip.udp.src = hex2dec(eth.pack(eth.ip.udp.stbyte:eth.ip.udp.stbyte + 3));
eth.ip.udp.dest = hex2dec(eth.pack(eth.ip.udp.stbyte+4:eth.ip.udp.stbyte+7));
eth.ip.udp.length = hex2dec(eth.pack(eth.ip.udp.stbyte+8:eth.ip.udp.stbyte+11));
eth.ip.udp.checksum = eth.pack(eth.ip.udp.stbyte+12:eth.ip.udp.stbyte+15);
eth.ip.udp.protoID = eth.pack(eth.ip.udp.stbyte+20:eth.ip.udp.stbyte+23);
if strcmp(eth.ip.udp.protoID,'0000')
eth.ip.udp.proto = 'MODBUS';
%eth.ip.udp.modbus = struct;
eth.ip.udp.modbus.stbyte = eth.ip.udp.stbyte+16;
eth.ip.udp.modbus.transID = eth.pack(eth.ip.udp.modbus.stbyte:eth.ip.udp.modbus.stbyte+3);
eth.ip.udp.modbus.protoID = eth.ip.udp.protoID;
eth.ip.udp.modbus.length = int16(str2double(eth.pack(eth.ip.udp.modbus.stbyte + 8:eth.ip.udp.modbus.stbyte + 11)));
eth.ip.udp.modbus.UID = eth.pack(eth.ip.udp.modbus.stbyte + 12:eth.ip.udp.modbus.stbyte + 13);
eth.ip.udp.modbus.func = hex2dec(eth.pack(eth.ip.udp.modbus.stbyte + 14:eth.ip.udp.modbus.stbyte+15));
eth.ip.udp.modbus.register = eth.pack(eth.ip.udp.modbus.stbyte + 16: eth.ip.udp.modbus.stbyte+19);
%Number of words to a register, or the number of registers
eth.ip.udp.modbus.words = hex2dec(eth.pack(eth.ip.udp.modbus.stbyte+20:eth.ip.udp.modbus.stbyte+23));
eth.ip.udp.modbus.bytes = hex2dec(eth.pack(eth.ip.udp.modbus.stbyte+24:eth.ip.udp.modbus.stbyte+25));
eth.ip.udp.modbus.data = eth.pack(eth.ip.udp.modbus.stbyte + 26:eth.ip.udp.modbus.stbyte + 26 + 2*eth.ip.udp.modbus.bytes - 1);
%If func 16 or 23, loop through data/registers and add to table
if eth.ip.udp.modbus.func == 16 || eth.ip.udp.modbus.func == 23
stp = eth.ip.udp.modbus.bytes*2/eth.ip.udp.modbus.words;
for n = 1:stp:eth.ip.udp.modbus.bytes*2;
%Check for existence of register as a key?
if ~isKey(MBMap,eth.ip.udp.modbus.register)
MBCol = MBCol + 1;
MBMap(eth.ip.udp.modbus.register) = MBCol;
end
MBTable(L,MBCol) = hex2dec(eth.ip.udp.modbus.data(n:n+stp-1));
eth.ip.udp.modbus.register = dec2hex(hex2dec(eth.ip.udp.modbus.register)+1);
end
lastt = t;
end
%If func 4, make sure it is the response, then put
%data into table for register column
elseif false
%need code to handle serial to UDP conversion box
else
invalid = true;
end
else
invalid = true;
end
else
invalid = true;
end
if ~invalid
end
end
%end
end
%Display Progress
if int64(fl/1000)*1000 == fl
for x = 1:length(mess);
fprintf('\b');
end
%fprintf('Lines parsed: %i',fl);
mess = sprintf('Lines parsed: %i / %i',fl,nol);
fprintf('%s',mess);
%Check execution time - getting slower:
%%{
ext = toc(tval);
mess = sprintf('\nExecution Time: %f\n',ext);
fprintf('%s',mess);
%%}
end
end
ext = toc - exst;
Update: I updated my code above to remove the overloaded operators (disp and lt were replaced with mess and lastt)
Was asked to use the profiler, so I limited to 2000 lines in the table (added && L >=2000 to the while loop) to limit the execution time, and here are the top results from the profiler:
SGAS_Wireshark_Parser_v0p7_fulleth 1 57.110 s 9.714 s
Strcat 9187 29.271 s 13.598 s
Blanks 9187 15.673 s 15.673 s
Uigetfile 1 12.226 s 0.009 s
uitools\private\uigetputfile_helper 1 12.212 s 0.031 s
FileChooser.FileChooser>FileChooser.show 1 12.085 s 0.006s
...er>FileChooser.showPeerAndBlockMATLAB 1 12.056 s 0.001s
...nChooser>FileOpenChooser.doShowDialog 1 12.049 s 12.049 s
hex2dec 44924 2.944 s 2.702 s
num2str 16336 1.139 s 0.550 s
str2double 17356 1.025 s 1.025 s
int2str 16336 0.589 s 0.589 s
fgetl 17356 0.488 s 0.488 s
dec2hex 6126 0.304 s 0.304 s
fliplr 44924 0.242 s 0.242 s
It appears to be strcat calls that are doing it. I only explicitly call strcat on one line. Are some of the other string manipulations I'm doing calling strcat indirectly?
Each loop should be calling strcat the same number of times though, so I still don't understand why it takes longer and longer the more it runs...
also, hex2dec is called a lot, but is not really affecting the time.
But anyway, are there any other methods I can use the combine the strings?
Here is the issue:
The string (an char array in MATLAB) packdata was being resized and reallocated over and over again. That's what was slowing down this code. I did the following steps:
I eliminated the redundant variable packdata and now only use eth.pack.
I preallocated eth.pack and a couple "helper variables" of known lengths by running blanks ONCE for each before the loop ever starts
eth.pack = blanks(604);
thisline = blanks(47);
smline = blanks(32);
(Note: 604 is the maximum possible size of packdata based on headers + MODBUS protocol)
Then I created a pointer variable to point to the location of the last char written to packdata.
pptr = 1;
...
dataline = int16(str2double(MBLine(1:4)));
thisline = MBLine(7:53); %Always 47 characters
smline = [thisline(~isspace(thisline)),blanks(32-sum(~isspace(thisline)))]; %Always 32 Characters
eth.pack(pptr:pptr+31) = smline;
pptr = pptr + 32;
The above was inside the 'if packline > 3' block in place of the 'packdata =' statement, then at the end of the 'if stoppack' block was the reset statement:
pptr = 1; %Reset Pointer
FYI, not surprisingly this brought out other flaws in my code which I've mostly fixed but still need to finish. Not a big issue now as this loop executes lightning fast with these changes. Thanks to Yvon for helping point me in the right direction.
I kept thinking my huge table, MBTable was the issue... but it had nothing to do with it.

Excessively large overhead in MATLAB .mat file

I am parsing a large text file full of data and then saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for more information on reading in the files, and here for the data). To do so, I read in one line at a time, parse the line, and then append it to the file. The problem is that the file itself is >3 orders of magnitude larger than the data contained therein!
Here is a stripped down version of my code:
database = which('01_hit12.par');
[directory,filename,~] = fileparts(database);
matObj = matfile(fullfile(directory,[filename '.mat']),'Writable',true);
fidr = fopen(database);
hitranTemp = fgetl(fidr);
k = 1;
while ischar(hitranTemp)
if abs(hitranTemp(1)) == 32;
hitranTemp(1) = '0';
end
hitran = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
matObj.moleculeNumber(1,k) = uint8(hitran{1});
matObj.isotopeologueNumber(1,k) = uint8(hitran{2});
matObj.vacuumWavenumber(1,k) = hitran{3};
matObj.lineIntensity(1,k) = hitran{4};
matObj.airWidth(1,k) = single(hitran{6});
matObj.selfWidth(1,k) = single(hitran{7});
matObj.lowStateE(1,k) = single(hitran{8});
matObj.tempDependWidth(1,k) = single(hitran{9});
matObj.pressureShift(1,k) = single(hitran{10});
if rem(k,1e4) == 0;
display(sprintf('line %u (%2.2f)',k,100*k/K));
end
hitranTemp = fgetl(fidr);
k = k + 1;
end
fclose(fidr);
I stopped the code after 13,813 of the 224,515 lines had been parsed because it had been taking a very long time and the file size was getting huge, but the last printout indicated that I had only just cleared 10k lines. I cleared the memory, and then ran:
S = whos('-file','01_hit12.mat');
fileBytes = sum([S.bytes]);
T = dir(which('01_hit12.mat'));
diskBytes = T.bytes;
disp([fileBytes diskBytes diskBytes/fileBytes])
and get the output:
524894 896189009 1707.37141022759
What is taking up the extra 895,664,115 bytes? I know the help page says there should be a little extra overhead, but I feel that nearly a Gb of descriptive header is a bit excessive!
New information:
I tried pre-allocating the file, thinking that perhaps MATLAB was doing the same thing it does when a matrix is embiggened in a loop and reallocating a chunk of disk space for the entire matrix on each write, and that isn't it. Filling the file with zeros of the appropriate data types results in a file that my short check script returns:
8531570 71467 0.00837677004349727
This makes more sense to me. Matlab is saving the file sparsely, so the disk file size is much smaller than the size of the full matrix in memory. Once it starts replacing values with real data, however, I get the same behavior as before and the file size starts skyrocketing beyond all reasonable bounds.
New new information:
Tried this on a subset of the data, 100 lines long. To stream to disk, the data has to be in v7.3 format, so I ran the subset through my script, loaded it into memory, and then resaved as v7.0 format. Here are the results:
v7.3: 3800 8752 2.30
v7.0: 3800 2561 0.67
No wonder the v7.3 format isn't the default. Does anyone know a way around this? Is this a bug or a feature?
This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.
Start off by pre-allocating:
fid = fopen('01_hit12.par', 'r');
data = fread(fid, inf, 'uint8');
nlines = nnz(data == 10) + 1;
fclose(fid);
matObj.moleculeNumber = zeros(1,nlines,'uint8');
matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
matObj.vacuumWavenumber = zeros(1,nlines,'double');
matObj.lineIntensity = zeros(1,nlines,'double');
matObj.airWidth = zeros(1,nlines,'single');
matObj.selfWidth = zeros(1,nlines,'single');
matObj.lowStateE = zeros(1,nlines,'single');
matObj.tempDependWidth = zeros(1,nlines,'single');
matObj.pressureShift = zeros(1,nlines,'single');
Then to write in chunks of 10000, I modified your code as follows:
... % your code plus pre-alloc first
bs = 10000;
while ischar(hitranTemp)
if abs(hitranTemp(1)) == 32;
hitranTemp(1) = '0';
end
for ii = 1:bs,
hitran{ii} = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2 u%1c%7f%7f','delimiter','','whitespace','');
hitranTemp = fgetl(fidr);
if hitranTemp==-1, bs=ii; break; end
end
% this part really ugly, sorry! trying to keep it compact...
matObj.moleculeNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(#(c)c{1},hitran),1:bs));
matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(#(c)c{2},hitran),1:bs));
matObj.vacuumWavenumber(1,k:k+bs-1) = builtin('_paren',cellfun(#(c)c{3},hitran),1:bs);
matObj.lineIntensity(1,k:k+bs-1) = builtin('_paren',cellfun(#(c)c{4},hitran),1:bs);
matObj.airWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(#(c)c{5},hitran),1:bs));
matObj.selfWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(#(c)c{6},hitran),1:bs));
matObj.lowStateE(1,k:k+bs-1) = single(builtin('_paren',cellfun(#(c)c{7},hitran),1:bs));
matObj.tempDependWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(#(c)c{8},hitran),1:bs));
matObj.pressureShift(1,k:k+bs-1) = single(builtin('_paren',cellfun(#(c)c{9},hitran),1:bs));
k = k + bs;
fprintf('.');
end
fclose(fidr);
The final size on disk is 21,393,408 bytes. The usage breaks down as,
>> S = whos('-file','01_hit12.mat');
>> fileBytes = sum([S.bytes]);
>> T = dir(which('01_hit12.mat'));
>> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
>> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
8531608 whos
21389582 disk
2.507099
Still fairly inefficient, but not out of control.