I am writing a solution that manages data from an eye tracker. I currently hold the data in a N x 5 matrix, with the following columns:
X Position, Y Position, timestamp, Velocity, Acceleration
Each row represents a single sample from the eye tracker (which runs at 1000Hz).
At present, I access the data in the form of a matrix - e.g. if I want to access the velocity for sample #600, I use 'dataStream(600,4)'.
This is fine, but I'd prefer my code to be more readable. The '4' could be confusing; something like dataStream.velocity(600) would be ideal. I understand that this would be a simple use of STRUCT. However, there are situations in which I need to copy an entire sample (i.e. all columns from one row of my matrix). As I understand it, this is not easily achieved with a STRUCT, because the arrays stored under each STRUCT field are not intrinsically linked. I would have to (I think) copy each element separately; for example, to copy sample #100 I believe I would need to copy dataStream.xPos(100), dataStream.yPos(100), dataStream.timestamp(100), and so on, one field at a time.
Is there something I'm missing with regards to management of STRUCTs, or would I be better off saving the hassle and sticking with the matrix approach?
If it is just for increased readability, I would not use structs, but rather take a quite simple approach and define index variables for the different columns of your data matrix. See for instance:
xPosition = 1;
yPosition = 2;
timestamp = 3;
Velocity = 4;
Acceleration = 5;
With these variables you can write quite meaningful queries; for instance, instead of dataStream(600,1) you would write:
dataStream(600, xPosition)
Note that you could also define index variables for more complex queries, for instance
position = [1 2];
wholeSample = 1:5;
to query multiple columns at once.
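For instance, with these index variables, selecting both position columns or copying an entire sample (the situation described in the question) becomes a single readable indexing operation:
positionOfSample600 = dataStream(600, position);    % X and Y position of sample #600
copyOfSample100 = dataStream(100, wholeSample);     % all five columns of sample #100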
You can copy a struct easily:
s = another_struct; % plain assignment copies the struct
In terms of performance, a struct will be slower than a matrix. Use readable constants to replace your numerical indices, as suggested by @H.Muster.
I am developing a certain feature for a high-order finite element simulation algorithm in Matlab and I am wondering what is a good way of implementing a certain task. I believe I am facing a somewhat common problem, but after doing some digging, I'm not really finding a good solution.
Basically, I have a long list of ID's (corresponding to certain nodes on my mesh), where each ID is associated with a small data set. Then, when I am running my solver, I need to access the data associated with these nodes and update it (multiple times).
So, for example, let's say that this is my list of these specific nodes:
nodelist = [3 27 38] %(so these are my node ID's)
Then for each node I have the following data set associated with it:
a (scalar)
b (5x5 double matrix)
c (10x1 double vector)
(a total of 36 double values associated with each node ID)
In reality, I will of course have a much, much longer list of node ID's and a somewhat larger data set associated with each node (but still only double scalars, matrices and vectors (no characters, strings etc)).
Approach 1
So one approach I cooked up is just to store everything in a 2D double matrix, and then do some relatively complex indexing to access my data when needed. For the example above, the size of my 2D matrix would be
size(2Dmat) = [length(nodelist), 36]
Say I wanted to access b(3,3) for node ID 27, I would access 2Dmat(2,14).
In principle, this works, but the code is just not very clean and readable because of this complex indexing (not to mention, when I change something in the way the data set is set up, I need to re-adjust the whole indexing code).
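To illustrate the kind of index bookkeeping this approach needs (a sketch only; nodeData stands in for the 2D matrix, since a name starting with a digit is not a valid MATLAB identifier, and the layout assumed is a in column 1, b in columns 2-26 stored column-major, c in columns 27-36):
row = find(nodelist == 27);        % row of node ID 27
col = 1 + sub2ind([5 5], 3, 3);    % b(3,3) -> 13th element of b, offset by the column holding a
bValue = nodeData(row, col);       % the same element as the (2,14) access above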
Approach 2
Another approach would be to use some sort of struct for each node in the node list:
a = 4.4;
b = rand(5,5);
c = rand(10,1);
s = struct('a',a,'b',b,'c',c)
And then I can access the data via, e.g., s.b(3,3) etc. But I just don't know how to associate a struct with the node ID?
Approach 3
The last thing I could think of would be to set up some sort of SQL database, but this seems like an overkill. And besides, I need my code to be as fast as possible, since I need to access these fields in the datasets associated with these chosen nodes many, many times and I imagine doing some queries into a database will slow things down.
Note that ultimately I will convert the code from Matlab to C/C++, so I would prefer to implement something that doesn't rely too heavily on Matlab-specific features.
So, any thoughts on how to implement this functionality in a clean way? I hope my question makes sense and thanks in advance!
Approach 2 is the cleanest, and readily translates to C++. For each node you have a struct s, then:
data(nodeID) = s;
is what is called a struct array. You index as
data(id).b(3,3) = 0.0;
This assumes that the IDs are contiguous, or at least that there are no huge gaps in their values. But this can always be ensured; it is easy to renumber node IDs if necessary.
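If the IDs do have gaps, a small sketch of one way to renumber them (idMap is just an illustrative name):
nodelist = [3 27 38];               % raw node IDs from the mesh
idMap = zeros(max(nodelist), 1);    % lookup table: raw ID -> position in the struct array
idMap(nodelist) = 1:numel(nodelist);
data(idMap(27)).b(3,3) = 0.0;       % node 27 lives at data(2)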
In C++, you’d have a vector of structs:
struct Bla {
    double a;        // scalar
    double b[5][5];  // 5x5 matrix, as in the data set above
    double c[10];    // 10x1 vector
};
std::vector<Bla> data(N);
Or in C:
struct Bla* data = malloc(sizeof(struct Bla) * N);
(and don’t forget free(data) when you’re done with it).
Then, in either C or C++, you access an element this way:
data[id].b[2][2] = 0.0;
The translation is obvious, except that indexing starts at 0 in C++ and at 1 in MATLAB.
Note that this method has a larger memory overhead than Approach 1 in MATLAB, but not in C or C++.
Approach 3 is a bad idea, it will just slow down your code without any benefits.
I think the cleanest solution, given a non-contiguous set of node IDs, would be approach 2 making use of a map container where your node ID is the key (i.e. index) into the map. This can be implemented in MATLAB using a containers.Map object, and in C++ using the std::map container. For example, here's how you can create and add values to a node map in MATLAB:
>> nodeMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
>> nodelist = [3 27 38];
>> nodeMap(nodelist(1)) = struct('a', 4.4, 'b', rand(5, 5), 'c', rand(10, 1));
>> nodeMap(3)
ans =
struct with fields:
a: 4.400000000000000
b: [5×5 double]
c: [10×1 double]
>> nodeMap(3).b(3,3)
ans =
0.646313010111265
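One thing to keep in mind when updating the stored data (which the solver needs to do repeatedly): depending on your MATLAB version, assigning directly into a field of a value held in the map may not be supported, so a safe pattern is read-modify-write, sketched here:
>> tmp = nodeMap(3);       % copy the node's data out of the map
>> tmp.b(3,3) = 0.0;       % update the copy
>> nodeMap(3) = tmp;       % write it back under the same key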
In C++, you would need to define a structure or class (e.g. Node) for the data type to be stored in the map. Here's an example (... denotes arguments passed to the Node constructor):
#include <map>
#include <tuple>    // std::forward_as_tuple
#include <utility>  // std::piecewise_construct
class Node {...}; // Define Node class
typedef std::map<int, Node> NodeMap; // Using int for key type
int main()
{
    NodeMap map1;
    map1[3] = Node(...); // Initialize and assign Node object
    map1.emplace(std::piecewise_construct,
                 std::forward_as_tuple(27),
                 std::forward_as_tuple(...)); // Create Node object in-place
    return 0;
}
I have a dataset of n nifti (.nii) images. Ideally, I'd like to be able to get the value of the same voxel/element from each image, and apply a function to the n data points. I'd like to do this for each voxel/element across the whole image, so that I can reconvert the result back into .nii format.
I've used the Tools for NIfTI and ANALYZE image toolbox to load my images:
data(1)=load_nii('C:\file1.nii');
data(2)=load_nii('C:\file2.nii');
...
data(n)=load_nii('C:\filen.nii');
This gives me a struct array in which each element contains one loaded nifti. Each of these has a subfield 'img' corresponding to the image data I want to work on. The problem comes from trying to select a given xyz within each img field of data(1) to data(n). As I discovered, it isn't possible to select in this way:
data(:).img(x,y,z)
or
data(1:n).img(x,y,z)
because MATLAB doesn't support it: the index in the first set of parentheses has to be scalar for the field access to work. The solution from googling around seems to be a loop that creates a temporary variable:
for z = 1:nz
    for x = 1:nx
        for y = 1:ny
            for i = 1:n
                points(i) = data(i).img(x,y,z);
            end
            [p1(x,y,z,:),~,p2(x,y,z)] = fit_data(a,points,b);
        end
    end
end
which works, but takes too long (several days) for a single set of images given the size of nx, ny, nz (several hundred each).
I've been looking for a way to speed up the code, which I believe depends on removing those loops through vectorisation: pre-selecting the img fields (via getfield?) and concatenating them, then applying something like arrayfun/cellfun/structfun. But I'm frankly a bit lost on how to do it. I can only think of ways to pre-select that themselves require loops, which seems to defeat the purpose of the exercise (though a solution with fewer loops, or at least fewer nested loops, might do it), or that run into the same problem that calls like data(:).img(x,y,z) don't work. Googling around again throws up ways to select and concatenate fields within a struct, or a given field across multiple structs, but I can't find anything for my problem: selecting an element from a non-scalar sub-field in a sub-struct of a struct object (with a minimum of loops). Finally, I need the output to be in the form of a matrix that the toolbox above can turn back into a nifti.
Any and all suggestions, clues, hints and help greatly appreciated!
You can concatenate images as a 4D array and use linear indexes to speed up calculations:
img = cat(4,data.img);
p1 = zeros(nx,ny,nz,n);
p2 = zeros(nx,ny,nz);
sz = ny*nx*nz;
for k = 1 : sz
    points = img(k:sz:end);
    [p1(k:sz:end),~,p2(k)] = fit_data(a,points,b);
end
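Since p2 (and each nx-by-ny-by-nz slice of p1) is an ordinary matrix, it can be written back out with the same toolbox; a minimal sketch, assuming the toolbox's make_nii/save_nii functions and an output filename of your choice:
nii_out = make_nii(p2);              % wrap the result matrix in a nifti structure
save_nii(nii_out, 'p2_result.nii');  % write it to disk like the input images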
If you must incrementally append data to arrays, it seems that using individual vectors of basic data types is orders of magnitude faster than an array of structs (with one vector element per record). Even trying to collect the individual vectors into a struct seems to double the time. The tests are:
N=5e4;
fprintf('\nstruct array (array of structs):\n')
clear x y;
y=struct( 'a',[], 'b',[], 'c',[], 'd',[] );
tic
for iIns = 1 : N
    x.a=rand; x.b=rand; x.c=rand; x.d=rand;
    y(end+1)=x;
end % for iIns
toc
fprintf('\nSeparate arrays of scalars:\n')
clear a b c d;
a=[]; b=[]; c=[]; d=[];
tic
for iIns = 1 : N
    a(end+1) = rand;
    b(end+1) = rand;
    c(end+1) = rand;
    d(end+1) = rand;
end % for iIns
toc
fprintf('\nA struct with arrays of scalars for fields:\n')
clear a b c d x y
x.a=[]; x.b=[]; x.c=[]; x.d=[];
tic
for iIns = 1:N
    x.a(end+1)=rand;
    x.b(end+1)=rand;
    x.c(end+1)=rand;
    x.d(end+1)=rand;
end % for iIns
toc
The results:
struct array (array of structs):
Elapsed time is 24.127274 seconds.
Separate arrays of scalars:
Elapsed time is 0.048190 seconds.
A struct with arrays of scalars for fields:
Elapsed time is 0.084624 seconds.
Even though collecting individual vectors of basic data types into a struct (the third scenario above) imposes such a penalty, it may be preferable to simply using individual vectors (the second scenario above) because the variables are more organized: your variable namespace isn't filled up with lots of separate variables that are in fact conceptually grouped.
That's quite a significant penalty, however, to pay for such organization. I don't suppose there is a way to avoid this?
There are two ways to avoid this performance penalty: (1) pre-allocate, and (2) rethink your stance on "organizing" variables. I suggest both. Oh, and if you can, don't use arrays of structs where each field only uses scalars - if your application suddenly has to handle a couple of orders of magnitude more data, the memory overhead will force you to rewrite everything.
Pre-allocation
You often know how many elements your array will end up having. Thus, initialize your arrays as s = struct('a',NaN(1,N),'b',NaN(1,N)); If you don't know ahead of time how many entries there will be, but you can estimate an upper limit, initialize with the upper limit, and either remove the excess elements afterwards, or use functions (e.g. nanmean) that do not care if the array has a few extra NaNs at the end. If you truly know nothing about the final size (except that N will be large enough to matter), pre-allocate with a nice number (e.g. N=1337), and extend the array in chunks. MathWorks have sped up dynamic growing of numeric arrays in a recent release, but as you demonstrate in your answer, the optimization has not been applied to structs yet. Don't count on MathWorks' optimization team to fix your code.
Nice variables
Why worry about your variable space? As long as you use explicitVariableNames, your code remains readable and you will have an easy time picking out the right variable. But ok, let's say you want to clean up: the first way to keep the number of active variables low is to use clear or keep at strategic points in your code to make sure you only keep around what's needed. The second (assuming you want to optimize for performance) is to put contextually linked vectors into the same array: objectDimensions = [lengthOfObject, widthOfObject, heightOfObject]. This keeps everything as numeric arrays (which are fastest), and allows easy vectorization such as objectVolume = prod(objectDimensions,2);.
/aside: I should disclose that I used to use structures frequently for assembling results (so that I could return a lot of information in a single variable and have the field names be part of the documentation). I have since switched to object-oriented programming (usually handle objects), which not only collects related variables, but also the associated functionality, and which facilitates code re-use. I do take a performance hit, but the time it saves me coding more than makes up for it. Note that I do pre-allocate if at all possible (and if it's not just growing an array three times).
Example
Assume you have a function getDimensions that reads dimensions (length, height, width) of objects. However, sometimes, the object is 2D, sometimes it is 3D. Thus, you want to fill the following variables: twoD.length, twoD.width, threeD.length, threeD.width, threeD.height, ideally as arrays of structs, so that each element of a struct corresponds to an object. You do not know ahead of time how many objects there are, all you can do is poll the function thereAreMoreObjects, which returns true or false, until there are no more objects.
Here's how you can do this with reasonable efficiency and growing arrays by chunks:
%// preassign the temporary variable, and some others
chunkSize = 1000;
numObjects = 0;
idAndDimensions = zeros(chunkSize,4);
while thereAreMoreObjects()
    objectId = getCurrentObjectId();
    %// hi==-1 if it's flat
    [len,wid,hi] = getObjectDimensions(objectId);
    %// allocate more, if needed
    numObjects = numObjects + 1;
    if numObjects > size(idAndDimensions,1)
        %// grow array
        idAndDimensions(end+chunkSize,1) = 0;
    end
    idAndDimensions(numObjects,:) = [objectId, len, wid, hi];
end
%// throw away excess
idAndDimensions = idAndDimensions(1:numObjects,:);
%// split into 2D and 3D objects
isTwoD = idAndDimensions(:,end) == -1;
%// assign twoD struct
twoD = struct('id',     num2cell(idAndDimensions(isTwoD,1)), ...
              'length', num2cell(idAndDimensions(isTwoD,2)), ...
              'width',  num2cell(idAndDimensions(isTwoD,3)));
%// assign threeD struct
threeD = struct('id',     num2cell(idAndDimensions(~isTwoD,1)), ...
                'length', num2cell(idAndDimensions(~isTwoD,2)), ...
                'width',  num2cell(idAndDimensions(~isTwoD,3)), ...
                'height', num2cell(idAndDimensions(~isTwoD,4)));
%// clean up - we need only the two structs
%// I use keep from the File Exchange instead of clearvars
clearvars -except twoD threeD
I am using structures in Matlab to organize my results in an intuitive way. My analysis is quite complex and hierarchical, so this works well---logically. For example:
resultObj.multivariate.individual.distributed.raw.alpha10(1).classification(1). Each level of the structure has several fields. Each alpha field is a structured array, indexed for each dataset, and classification is also a structured array, one for each cross validation run on the data.
To simplify, consider the classification field:
>> classification
ans =
1x8 struct array with fields:
bestLambda
bestBetas
scores
statObj
fitObj
In which statObj has fields (for example):
dprime: 6.5811
hit: 20
miss: 0
falseAlarms: 0
correctRejections: 30
Of course, the fields have different values for each subject and cross validation run. Given this structure, is there a good way to find the mean of dprime over cross validation runs (i.e. the elements of classification) without needing to construct a for loop to extract, store, and finally compute on?
I was hoping that reshape(struct2array(classification.statObj),5,8) would work, so I could construct a matrix with stats as rows and cross-validation runs as columns, but this won't work. I put these items in their own structure specifically because the fields of classification hold elements of various types (matrices, structures, integers).
I am not opposed to restructuring my output entirely, but I'd like it to be done in such a way that the organization is fairly self-commenting, and I could say return to this structure a year from now and remember what and where everything is.
I came up with the following, although I'm not sure if it is what you are looking for:
%# create a structure hierarchy similar to yours
%# (I ignore everything before alpha10, and only create a part of it)
alpha10 = struct();
for a=1:5
    alpha10(a).classification = struct();
    for c=1:8
        alpha10(a).classification(c).statObj = struct('dprime',rand());
    end
end
%# matrix of 'dprime' for each alpha across each cross-validation run
st = [alpha10.classification];
st = [st.statObj];
dp = reshape([st.dprime], 8, 5)' %# result is 5-by-8 matrix
Next you can compute the mean across the second dimension of this matrix dp.
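For example, the mean of dprime over the cross-validation runs is then simply:
dpMean = mean(dp, 2);   %# 5-by-1 vector: mean dprime per dataset, averaged over the 8 runs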
For anyone who happens across this post, and is wrestling with something similar, it is worth asking yourself if such a nested structure-of-structures is really your best option. It may be easier to flatten the hierarchy and include descriptive fields as labels. For instance
resultObj.multivariate.individual.distributed.raw.alpha10(1).classification(1)
might instead be
resultObj(1).
AnalysisType = 'multivariate'
GroupSolution = false
SignalType = 'distributed'
Processing = 'raw'
alpha = 10
crossvalidation = 1
dprime = 6.5811
bestLambda = []
bestBetas = []
scores = []
fitObj = []
That's not valid Matlab syntax there, but it gets the point across. Rather than building a hierarchy out of nested structures, create a 1xN structure with labels and data. It is a more general solution that is easier to query and work with.
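As a rough sketch of how such a flat struct array can be queried (field names taken from the illustration above, which is not literal MATLAB syntax):
% mean dprime over the cross-validation runs of the multivariate, raw, alpha-10 analyses
sel = strcmp({resultObj.AnalysisType}, 'multivariate') & ...
      strcmp({resultObj.Processing}, 'raw') & ...
      [resultObj.alpha] == 10;
meanDprime = mean([resultObj(sel).dprime]);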
I wrote a small quicksort implementation in MATLAB to sort some custom data. Because I am sorting a cell array, need the indexes of the sort order, and do not want to restructure the cell array itself, I need my own implementation (maybe there is one available that works, but I did not find it).
My current implementation works by partitioning into a left and a right array and then passing these arrays to the recursive call. Because I do not know the sizes of left and right, I just grow them inside a loop, which I know is horribly slow in MATLAB.
I know you can do an in place quicksort, but I was warned about never modifying the content of variables passed into a function, because call by reference is not implemented the way one would expect in matlab (or so I was told). Is this correct? Would an in-place quicksort work as expected in matlab or is there something I need to take care of? What other hints would you have for implementing this kind of thing?
Implementing a sort on complex data in user M-code is probably going to be a loss in terms of performance due to the overhead of M-level operations compared to Matlab's builtins. Try to reframe the operation in terms of Matlab's existing vectorized functions.
Based on your comment, it sounds like you're sorting on a single-value key that's inside the structs in the cells. You can probably get a good speedup by extracting the sort key to a primitive numeric array and calling the builtin sort on that.
%// An example cell array of structs that I think looks like your input
c = num2cell(struct('foo',{'a','b','c','d'}, 'bar',{6 1 3 2}))
%// Let's say the "bar" field is what you want to sort on.
key = cellfun(@(s)s.bar, c) %// Extract the sort key using cellfun
[sortedKey,ix] = sort(key) %// Sort on just the key using fast numeric sort() builtin
sortedC = c(ix); %// ix is a reordering index in to c; apply the sort using a single indexing operation
reordering = cellfun(@(s)s.foo, sortedC) %// for human readability of results
If you're sorting on multiple field values, extract all the m key values from the n cells to an n-by-m array, with columns in descending order of precedence, and use sortrows on it.
%// Multi-key sort
keyCols = {'bar','baz'};
key = NaN(numel(c), numel(keyCols));
for i = 1:numel(keyCols)
    keyCol = keyCols{i};
    key(:,i) = cellfun(@(s)s.(keyCol), c);
end
[sortedKey,ix] = sortrows(key);
sortedC = c(ix);
reordering = cellfun(@(s)s.foo, sortedC)
One of the keys to performance in Matlab is to get your data in primitive arrays, and use vectorized operations on those primitive arrays. Matlab code that looks like C++ STL code with algorithms and references to comparison functions and the like will often be slow; even if your code is good in O(n) complexity terms, the fixed cost of user-level M-code operations, especially on non-primitives, can be a killer.
Also, if your structs are homogeneous (that is, they all have the same set of fields), you can store them directly in a struct array instead of a cell array of structs, and it will be more compact. If you can do more extensive redesign, rearranging your data structures to be "planar-organized" - where you have a struct of arrays, reading across the ith element of all the fields as a record, instead of an array of structs of scalar fields - could be a good efficiency win. Either of these reorganizations would make constructing the sort key array cheaper.
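For instance, here's a minimal sketch of the planar-organized version of the same example data, where the sort key is already a primitive array and every field is reordered with the index returned by sort:
s.foo = {'a','b','c','d'};   %// planar organization: one array per field
s.bar = [6 1 3 2];
[~, ix] = sort(s.bar);       %// sort directly on the primitive key array
s.foo = s.foo(ix);           %// apply the same reordering to every field
s.bar = s.bar(ix);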
In this post, I only explain MATLAB function-calling convention, and am not discussing the quick-sort algorithm implementation.
When calling functions, MATLAB passes built-in data types by-value, and any changes made to such arguments are not visible outside the function.
function y = myFunc(x)
x = x .* 2; %# pass-by-value, changes only visible inside function
y = x;
end
This could be inefficient for large data, especially if the data is not modified inside the function. Therefore MATLAB internally implements a copy-on-write mechanism: for example, when a vector is copied, only some meta-data is copied, while the data itself is shared between the two copies of the vector. Only when one of them is modified is the data actually duplicated.
function y = myFunc(x)
%# x was never changed, thus it is effectively passed by reference and no copy is made
y = x .* 2;
end
Note that for cell arrays and structures, only the cells/fields that are modified are copied (this is because cells/fields are internally stored separately), which makes copying more efficient for such data structures. For more information, read this blog post.
In addition, versions R2007 and upward (I think) detect in-place operations on data and optimize such cases.
function x = myFunc(x)
x = x.*2;
end
Obviously, when calling such a function, the variable on the LHS must be the same as the argument on the RHS (x = myFunc(x);). Also, in order to take advantage of this optimization, in-place functions must be called from inside another function.
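A minimal sketch of that calling pattern (caller is just a placeholder name):
function caller()
    x = rand(5000);
    x = myFunc(x);   %# the LHS matches the input argument and the call is made from
                     %# inside a function, so the in-place optimization can kick in
end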
In MEX-functions, although it is possible to change input variables without making copies, it is not officially supported and might yield unexpected results...
For user-defined types (OOP), MATLAB introduced the concept of value object vs. handle object supporting reference semantics.