Hash tables in MATLAB

Does MATLAB have any support for hash tables?
Some background
I am working on a problem in Matlab that requires a scale-space representation of an image. To do this I create a 2-D Gaussian filter with variance sigma*s^k for k in some range, and then I use each one in turn to filter the image. Now I want some sort of mapping from k to the filtered image.
If k were always an integer, I'd simply create a 3D array such that:
arr[k] = <image filtered with k-th Gaussian>
However, k is not necessarily an integer, so I can't do this. What I thought of doing was keeping an array of ks such that:
arr[find(array_of_ks == k)] = <image filtered with k-th Gaussian>
Which seems pretty good at first thought, except I will be doing this lookup potentially a few thousand times with about 20 or 30 values of k, and I fear that this will hurt performance.
I wonder if I wouldn't be better served doing this with a hash table of some sort so that I would have a lookup time that is O(1) instead of O(n).
Now, I know that I shouldn't optimize prematurely, and I may not have this problem at all, but remember, this is just the background, and there may be cases where this is really the best solution, regardless of whether it is the best solution for my problem.

Consider using MATLAB's map class: containers.Map. Here is a brief overview:
Creation:
>> keys = {'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', ...
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Annual'};
>> values = {327.2, 368.2, 197.6, 178.4, 100.0, 69.9, ...
32.3, 37.3, 19.0, 37.0, 73.2, 110.9, 1551.0};
>> rainfallMap = containers.Map(keys, values)
rainfallMap =
containers.Map handle
Package: containers
Properties:
Count: 13
KeyType: 'char'
ValueType: 'double'
Methods, Events, Superclasses
Lookup:
x = rainfallMap('Jan');
Assign:
rainfallMap('Jan') = 0;
Add:
rainfallMap('Total') = 999;
Remove:
rainfallMap.remove('Total')
Inspect:
values = rainfallMap.values;
keys = rainfallMap.keys;
sz = rainfallMap.size;
Check key:
if rainfallMap.isKey('Today')
...
end
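Applied to the question's non-integer k values, a minimal sketch (double keys require R2010a or later, as noted in the next answer; the data here is a placeholder):
kValues = [0.5 1.2 2.7];   % hypothetical non-integer scales
scaleSpace = containers.Map('KeyType','double','ValueType','any');
for k = kValues
    scaleSpace(k) = rand(4);   % stand-in for the image filtered at scale k
end
img = scaleSpace(1.2);   % O(1) lookup by k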

Matlab R2008b (7.7)’s new containers.Map class is a scaled-down Matlab version of the java.util.Map interface. It has the added benefit of seamless integration with all Matlab types (Java Maps cannot handle Matlab structs, for example), as well as the ability, since Matlab 7.10 (R2010a), to specify key and value data types.
Serious Matlab implementations requiring key-value maps/dictionaries should still use Java’s Map classes (java.util.EnumMap, HashMap, TreeMap, LinkedHashMap or Hashtable) to gain access to their larger functionality, if not performance. Matlab versions earlier than R2008b have no real alternative in any case and must use the Java classes.
A potential limitation of using Java Collections is their inability to contain non-primitive Matlab types such as structs. To overcome this, either down-convert the types (e.g., using struct2cell or programmatically), or create a separate Java object that will hold your information and store this object in the Java Collection.
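For example, a minimal sketch of the down-conversion route (the struct fields and key prefix here are made up for illustration):
s = struct('sigma', 1.5, 'k', 3);   % a struct a Java Map cannot hold directly
jmap = java.util.HashMap;
fn = fieldnames(s);
for idx = 1:numel(fn)
    jmap.put(['entry1.' fn{idx}], s.(fn{idx}));   % one Java entry per field
end
jmap.get('entry1.sigma')   % returns 1.5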
You may also be interested to examine a pure-Matlab object-oriented (class-based) Hashtable implementation, which is available on the File Exchange.

You could use Java for it.
In Matlab:
dict = java.util.Hashtable;
dict.put('a', 1);
dict.put('b', 2);
dict.put('c', 3);
dict.get('b')
But you would have to do some profiling to see if it gives you a speed gain, I guess...
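For instance, a crude tic/toc comparison along these lines (timings vary by machine and release, and the num2str key conversion is deliberately counted against the hash table):
dict = java.util.Hashtable;
ks = 0.5:0.1:3.4;   % ~30 hypothetical k values
for idx = 1:numel(ks)
    dict.put(num2str(ks(idx)), idx);
end
tic; for trial = 1:10000, v = dict.get(num2str(ks(17))); end; toc   % hash lookup
tic; for trial = 1:10000, v = find(ks == ks(17)); end; toc          % linear scan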

Matlab does not have support for hash tables. EDIT Until R2010a, that is; see @Amro's answer.
To speed up your look-ups, you can drop the find and use logical indexing:
arr{array_of_ks==k} = <image filtered with k-th Gaussian>
or
arr(:,:,array_of_ks==k) = <image filtered with k-th Gaussian>
However, in all my experience with Matlab, I've never had a lookup be a bottleneck.
To speed up your specific problem, I suggest either using incremental filtering:
arr{i} = GaussFilter(arr{i-1},sigma*s^(array_of_ks(i)) - sigma*s^(array_of_ks(i-1)))
assuming array_of_ks is sorted in ascending order, and GaussFilter calculates the filter mask size based on the variance (and uses two 1-D filters, of course); or you can filter in Fourier space, which is especially useful for large images and if the variances are spaced evenly (which they most likely aren't, unfortunately).
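A sketch of the incremental scheme under the question's names (sigma, s, array_of_ks, image I), with fspecial/imfilter from the Image Processing Toolbox standing in for the GaussFilter helper - note fspecial takes a standard deviation, hence the sqrt of each variance:
vk = sigma * s.^array_of_ks;   % target variances, ascending
h1 = fspecial('gaussian', 2*ceil(3*sqrt(vk(1)))+1, sqrt(vk(1)));
arr{1} = imfilter(I, h1, 'replicate');
for idx = 2:numel(vk)
    dv = vk(idx) - vk(idx-1);   % variances of cascaded Gaussians add
    h = fspecial('gaussian', 2*ceil(3*sqrt(dv))+1, sqrt(dv));
    arr{idx} = imfilter(arr{idx-1}, h, 'replicate');
end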

It's a little kludgy, but I'm surprised nobody has suggested using structs. You can access any struct field by variable name as struct.(var), where var can be any variable and will resolve appropriately.
dict.a = 1;
dict.b = 2;
var = 'a';
display( dict.(var) ); % prints 1
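One caveat: field names must be valid identifiers, so a non-integer key like the question's k = 1.5 needs sanitizing first, e.g.:
k = 1.5;
fld = strrep(sprintf('k%g', k), '.', '_');   % gives 'k1_5'
dict.(fld) = rand(4);   % stand-in for the image filtered at scale k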

You can also take advantage of the newer table type. You can store different types of data and compute statistics on them really easily.
See http://www.mathworks.com/help/matlab/tables.html for more info.
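A tiny sketch with placeholder data:
k = [0.5; 1.0; 1.5];   % hypothetical scales
img = {rand(4); rand(4); rand(4)};   % stand-ins for the filtered images
T = table(k, img);
T.img{T.k == 1.5}   % look up the image stored for k = 1.5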

Related

Opposite indices when using find-function

I searched a lot on Google but didn't find an answer that helped me without reducing my performance.
I have two matrices A and B of the same size with different values. Then I want to filter:
indices=find(A<5 & B>3)
A(indices)=
B(indices)=
Now I want to apply a function on the indices -> indices_2=find(A>=5 | B<=3) without using the find function on the whole matrices A and B again. Logical operations alone are not possible in this case because I need the indices and not 0s and 1s.
Something like:
A(~indices)=
B(~indices)=
instead of:
indices_2=find(A>=5 | B<=3)
A(indices_2)=
B(indices_2)=
And after that I want to split these sets once again... just filtering.
I used indices_2=setdiff(indices, size(A)) but it hurt my computation performance. Is there any other method to split the matrices into subsets without using find twice?
Hope you understand my problem and it fits the regulations.
I don't understand why you can't just use find again, nor why you can't use logical indexing in this case, but I suppose if you are going to restrict yourself like this then you could accomplish it using setdiff:
indices_2 = setdiff(1:numel(A), indices)
However, if you are worried about performance, you should stick to logical indexing:
indices = A<5 & B>3
A(indices)=...
B(indices)=...
A(~indices)=...
B(~indices)=...
I think you may be looking for something like this:
%Split your data in two and keep track of which numbers you have
ranks = 1:numel(A);
indices = A<5 & B>3;   % use a logical mask (not find) so that ~indices works below
% Update the numbers list to contain the set of numbers you are interested in
ranks_2 = ranks(~indices)
% Operate on the set you are interested in and find the relevant ranks
indices_2 = A(~indices)>=5 | B(~indices)<=3
ranks_2 = ranks_2(indices_2)

Matlab: Query complicated structures

I am using structures in Matlab to organize my results in an intuitive way. My analysis is quite complex and hierarchical, so this works well---logically. For example:
resultObj.multivariate.individual.distributed.raw.alpha10(1).classification(1). Each level of the structure has several fields. Each alpha field is a structured array, indexed for each dataset, and classification is also a structured array, one for each cross validation run on the data.
To simplify, consider the classification field:
>> classification
ans =
1x8 struct array with fields:
bestLambda
bestBetas
scores
statObj
fitObj
In which statObj has fields (for example):
dprime: 6.5811
hit: 20
miss: 0
falseAlarms: 0
correctRejections: 30
Of course, the fields have different values for each subject and cross validation run. Given this structure, is there a good way to find the mean of dprime over cross validation runs (i.e. the elements of classification) without needing to construct a for loop to extract, store, and finally compute on?
I was hoping that reshape(struct2array(classification.statObj),5,8) would work, so I could construct a matrix with stats as rows and cross-validation runs as columns, but this won't work. I put these items in their own structure specifically because the fields of classification hold elements of various types (matrices, structures, integers).
I am not opposed to restructuring my output entirely, but I'd like it to be done in such a way that the organization is fairly self-commenting, and I could say return to this structure a year from now and remember what and where everything is.
I came up with the following, although I'm not sure if it is what you are looking for:
%# create a structure hierarchy similar to yours
%# (I ignore everything before alpha10, and only create a part of it)
alpha10 = struct();
for a=1:5
    alpha10(a).classification = struct();
    for c=1:8
        alpha10(a).classification(c).statObj = struct('dprime',rand());
    end
end
%# matrix of 'dprime' for each alpha across each cross-validation run
st = [alpha10.classification];
st = [st.statObj];
dp = reshape([st.dprime], 8, 5)' %# result is 5-by-8 matrix
Next you can compute the mean across the second dimension of this matrix dp.
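For example:
avgDprime = mean(dp, 2);   %# 5-by-1: one mean per alpha, across the 8 cross-validation runs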
For anyone who happens across this post, and is wrestling with something similar, it is worth asking yourself if such a nested structure-of-structures is really your best option. It may be easier to flatten the hierarchy and include descriptive fields as labels. For instance
resultObj.multivariate.individual.distributed.raw.alpha10(1).classification(1)
might instead be
resultObj(1).
AnalysisType = 'multivariate'
GroupSolution = false
SignalType = 'distributed'
Processing = 'raw'
alpha = 10
crossvalidation = 1
dprime = 6.5811
bestLambda = []
bestBetas = []
scores = []
fitObj = []
That's not valid Matlab syntax there, but it gets the point across. Rather than building a hierarchy out of nested structures, create a 1xN structure with labels and data. It is a more general solution that is easier to query and work with.
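For example, a query against the flat layout becomes a single logical-indexing expression (field names as sketched above, data hypothetical):
sel = strcmp({resultObj.AnalysisType}, 'multivariate') & ...
      strcmp({resultObj.Processing}, 'raw') & [resultObj.alpha] == 10;
meanDprime = mean([resultObj(sel).dprime]);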

Caching Matlab function results to file

I'm writing a simulation in Matlab.
I will eventually run this simulation hundreds of times.
In each simulation run, there are millions of simulation cycles.
In each of these cycles, I calculate a very complex function, which takes ~0.5 sec to finish.
The function input is a long bit array (>1000 bits), i.e. an array of 0s and 1s.
I hold the bit arrays in a matrix of 0s and 1s, and for each one of them I only run the function once - I save the result in a separate array (res) and check whether the bit array is already in the matrix before running the function:
for i=1:1000000000
    %pick a bit array somehow
    [~,indx] = ismember(bit_array,bit_matrix,'rows');
    if indx == 0
        indx = length(res) + 1;
        bit_matrix(indx,:) = bit_array;
        res(indx) = complex_function(bit_array);
    end
    result = res(indx);
    %do something with result
end
I have two questions, really:
Is there a more efficient way to find the index of a row in a matrix than ismember?
Since I run the simulation many times, and there is a big overlap in the bit-arrays I'm getting, I want to cache the matrix between runs so that I don't recalculate the function over the same bit-arrays over and over again. How do I do that?
The answer to both questions is to use a map. There are a few steps to do this.
First you will need a function to turn your bit_array into either a number or a string. For example, turn [0 1 1 0 1 0] into '011010'. (Matlab only supports scalar or string keys, which is why this step is required.)
Define a map object:
cachedRunMap = containers.Map; %See edit below for more on this
To check if a particular case has been run, use isKey.
cachedRunMap.isKey('011010');
To add the results of a run, use the assignment syntax
cachedRunMap('011010') = [0 1 1 0 1]; %Or whatever your result is.
To retrieve cached results, use the getting syntax
tmpResult = cachedRunMap.values({'011010'});
This should efficiently store and retrieve values until you run out of system memory.
Putting this together, now your code would look like this:
%Hacky magic function to convert an array into a string of '0' and '1'
strFromBits = @(x) char((x(:)'~=0)+48);
%Initialize the map
cachedRunMap = containers.Map;
%Loop, computing and storing results as needed
for i=1:1000000000
    %pick a bit array somehow
    strKey = strFromBits(bit_array);
    if cachedRunMap.isKey(strKey)
        result = cachedRunMap(strKey);
    else
        result = complex_function(bit_array);
        cachedRunMap(strKey) = result;
    end
    %do something with result
end
If you want a key which is not a string, that needs to be declared at step 2. Some examples are:
cachedRunMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'double');
Setting a KeyType of 'char' sets the map to use strings as keys. All other types must be scalars.
Regarding issues as you scale this up (per your recent comments)
Saving data between sessions: There should be no issues saving this map to a *.mat file, up to the limits of your system's memory
Purging old data: I am not aware of a straightforward way to add LRU features to this map. If you can find a Java implementation you can use it within Matlab pretty easily. Otherwise it would take some thought to determine the most efficient method of keeping track of the last time a key was used.
Sharing data between concurrent sessions: As you indicated, this probably requires a database to perform efficiently. The DB table would be two columns (3 if you want to implement LRU features): the key, the value (and the last-used time if desired). If your "result" is not a type which easily fits into SQL (e.g. a non-uniform size array, or complex structure) then you will need to put additional thought into how to store it. You will also need a method to access the database (e.g. the Database Toolbox, or various tools on the Mathworks file exchange). Finally you will need to actually set up a database on a server (e.g. MySQL if you are cheap, like me, or whatever you have the most experience with, or can find the most help with.) This is not actually that hard, but it takes a bit of time and effort the first time through.
Another approach to consider (much less efficient, but not requiring a database) would be to break up the data store into a large (e.g. 1000's or millions) number of maps. Save each into a separate *.mat file, with a filename based on the keys contained in that map (e.g. the first N characters of your string key), and then load/save these files between sessions as needed. This will be pretty slow ... depending on your usage it may be faster to recalculate from the source function each time ... but it's the best way I can think of without setting up the DB (clearly a better answer).
For a large list, a hand-coded binary search can beat ismember, if maintaining it in sorted order isn't too expensive - and if that's really your bottleneck. Use the profiler to see how much the ismember is really costing you. If there aren't too many distinct values, you could also store them in a containers.Map by packing the bit_matrix into a char array and using it as the key.
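A sketch of the binary-search idea, assuming the rows of bit_matrix are kept sorted lexicographically (as sortrows would order them); maintaining that order on insertion is the cost mentioned above:
function indx = FindRow(bit_matrix, bit_array)
    %Binary search over sorted rows; returns 0 if the row is absent
    lo = 1; hi = size(bit_matrix,1); indx = 0;
    while lo <= hi
        mid = floor((lo+hi)/2);
        if isequal(bit_matrix(mid,:), bit_array)
            indx = mid; return
        end
        [~, order] = sortrows([bit_matrix(mid,:); bit_array]);   %compare two rows
        if order(1) == 1   %row mid sorts before bit_array
            lo = mid + 1;
        else
            hi = mid - 1;
        end
    end
end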
If it's small enough to fit in memory, you could store it in a MAT file using save and load. They can store any basic Matlab datatype. Have the simulation save the accumulated res and bit_matrix at the end of its run, and re-load them the next time it's called.
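A minimal sketch of that idea (the file name and nBits are placeholders):
if exist('simCache.mat', 'file')
    load('simCache.mat', 'res', 'bit_matrix');   % warm start from the previous run
else
    res = [];
    bit_matrix = zeros(0, nBits);   % cold start
end
% ... run the simulation loop from the question ...
save('simCache.mat', 'res', 'bit_matrix');   % persist for the next run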
I think that you should use containers.Map() for the purpose of speedup.
The general idea is to hold a map that contains all hash values. If your bit arrays have uniform distribution under the hash function, most of the time you won't need the call to ismember.
Since the key type cannot be an array in Matlab, you can calculate some hash function on your array of bits.
For example:
function s = GetHash(bitArray)
s = mod( sum(bitArray), intmax('uint32'));
end
This is a lousy hash function, but enough to understand the principle.
Then the code would look like:
map = containers.Map('KeyType','uint32','ValueType','any');
for i=1:1000000000
    %pick a bit array somehow
    s = GetHash(bit_array);
    if map.isKey(s)   %Hash seen before - do the slow check.
        [~,indx] = ismember(bit_array,bit_matrix,'rows');
    else
        map(s) = 1;
        indx = 0;     %New hash, so this bit array is definitely new.
    end
    if indx == 0
        indx = length(res) + 1;
        bit_matrix(indx,:) = bit_array;
        res(indx) = complex_function(bit_array);
    end
    result = res(indx);
    %do something with result
end

In-Place Quicksort in matlab

I wrote a small quicksort implementation in Matlab to sort some custom data. Because I am sorting a cell array, need the indices of the sort order, and do not want to restructure the cell array itself, I need my own implementation (maybe there is one available that works, but I did not find it).
My current implementation works by partitioning into a left and right array and then passing these arrays to the recursive call. Because I do not know the sizes of left and right in advance, I just grow them inside a loop, which I know is horribly slow in Matlab.
I know you can do an in-place quicksort, but I was warned about never modifying the content of variables passed into a function, because call by reference is not implemented the way one would expect in Matlab (or so I was told). Is this correct? Would an in-place quicksort work as expected in Matlab, or is there something I need to take care of? What other hints would you have for implementing this kind of thing?
Implementing a sort on complex data in user M-code is probably going to be a loss in terms of performance due to the overhead of M-level operations compared to Matlab's builtins. Try to reframe the operation in terms of Matlab's existing vectorized functions.
Based on your comment, it sounds like you're sorting on a single-value key that's inside the structs in the cells. You can probably get a good speedup by extracting the sort key to a primitive numeric array and calling the builtin sort on that.
%// An example cell array of structs that I think looks like your input
c = num2cell(struct('foo',{'a','b','c','d'}, 'bar',{6 1 3 2}))
%// Let's say the "bar" field is what you want to sort on.
key = cellfun(@(s)s.bar, c) %// Extract the sort key using cellfun
[sortedKey,ix] = sort(key) %// Sort on just the key using fast numeric sort() builtin
sortedC = c(ix); %// ix is a reordering index in to c; apply the sort using a single indexing operation
reordering = cellfun(@(s)s.foo, sortedC) %// for human readability of results
If you're sorting on multiple field values, extract all the m key values from the n cells to an n-by-m array, with columns in descending order of precedence, and use sortrows on it.
%// Multi-key sort
keyCols = {'bar','baz'};
key = NaN(numel(c), numel(keyCols));
for i = 1:numel(keyCols)
    keyCol = keyCols{i};
    key(:,i) = cellfun(@(s)s.(keyCol), c);
end
[sortedKey,ix] = sortrows(key);
sortedC = c(ix);
reordering = cellfun(@(s)s.foo, sortedC)
One of the keys to performance in Matlab is to get your data in primitive arrays, and use vectorized operations on those primitive arrays. Matlab code that looks like C++ STL code with algorithms and references to comparison functions and the like will often be slow; even if your code is good in O(n) complexity terms, the fixed cost of user-level M-code operations, especially on non-primitives, can be a killer.
Also, if your structs are homogeneous (that is, they all have the same set of fields), you can store them directly in a struct array instead of a cell array of structs, and it will be more compact. If you can do more extensive redesign, rearranging your data structures to be "planar-organized" - where you have a struct of arrays, reading across the i-th element of all the fields as a record, instead of an array of structs of scalar fields - could be a good efficiency win. Either of these reorganizations would make constructing the sort key array cheaper.
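A sketch of the planar layout, reusing the toy foo/bar data from above:
planar.foo = {'a','b','c','d'};   % one array per field, not one struct per record
planar.bar = [6 1 3 2];
[~, ix] = sort(planar.bar);       % the sort key is already a primitive array
planar.foo = planar.foo(ix);      % apply the same ordering to every field
planar.bar = planar.bar(ix);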
In this post I only explain the MATLAB function-calling convention; I am not discussing the quicksort algorithm implementation.
When calling functions, MATLAB passes built-in data types by-value, and any changes made to such arguments are not visible outside the function.
function y = myFunc(x)
x = x .* 2; %# pass-by-value, changes only visible inside function
y = x;
end
This could be inefficient for large data especially if they are not modified inside the functions. Therefore MATLAB internally implements a copy-on-write mechanism: for example when a vector is copied, only some meta-data is copied, while the data itself is shared between the two copies of the vector. And it is only when one of them is modified, that the data is actually duplicated.
function y = myFunc(x)
%# x was never changed, thus passed-by-reference avoiding making a copy
y = x .* 2;
end
Note that for cell-arrays and structures, only the cells/fields modified are passed-by-value (this is because cells/fields are internally stored separately), which makes copying more efficient for such data structures. For more information, read this blog post.
In addition, versions R2007 and upward (I think) detect in-place operations on data and optimize such cases.
function x = myFunc(x)
x = x.*2;
end
Obviously when calling such a function, the LHS must be the same as the RHS (x = myFunc(x);). Also, in order to take advantage of this optimization, in-place functions must be called from inside another function.
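A sketch of the calling pattern that permits the optimization, with myFunc as defined above:
function caller()
    x = rand(5000);
    x = myFunc(x);   % same variable on the LHS and as the argument: no copy made
end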
In MEX-functions, although it is possible to change input variables without making copies, it is not officially supported and might yield unexpected results...
For user-defined types (OOP), MATLAB introduced the concept of value object vs. handle object supporting reference semantics.

4 dimensional matrix

I need to use a 4-dimensional matrix as an accumulator for voting on 4 parameters. Every parameter varies in the range 1~300. For that, I define Acc = zeros(300,300,300,300) in MATLAB, and somewhere, for example, I use:
Acc(4,10,120,78)=Acc(4,10,120,78)+1
However, MATLAB raises an error because of memory limitations:
??? Error using ==> zeros
Out of memory. Type HELP MEMORY for your options.
Below you can see part of my code:
I = imread('image.bmp'); %I is logical 300x300 image.
Acc = zeros(100,100,100,100);
for i = 1:300
    for j = 1:300
        if I(i,j)==1
            for x0 = 3:3:300
                for y0 = 3:3:300
                    for a = 3:3:300
                        b = abs(j-y0)/sqrt(1-((i-x0)^2) / (a^2));
                        b1 = floor(b/3);
                        if b1==0
                            b1 = 1;
                        end
                        a1 = ceil(a/3);
                        Acc(x0/3,y0/3,a1,b1) = Acc(x0/3,y0/3,a1,b1)+1;
                    end
                end
            end
        end
    end
end
As @Rasman mentioned, you probably want to use a sparse representation of the matrix Acc.
Unfortunately, the sparse function is geared toward 2D matrices, not arbitrary n-D.
But that's ok, because we can take advantage of sub2ind and linear indexing to go back and forth to 4D.
Dims = [300, 300, 300, 300]; % it will be a 300 by 300 by 300 by 300 matrix
Acc = sparse([], [], [], prod(Dims), 1, ExpectedNumElts);
Here ExpectedNumElts should be some number like 30 or 9000 or however many non-zero elements you expect for the matrix Acc to have. We notionally think of Acc as a matrix, but actually it will be a vector. But that's okay, we can use sub2ind to convert 4D coordinates into linear indices into the vector:
ind = sub2ind(Dims, 4, 10, 120, 78);
Acc(ind) = Acc(ind) + 1;
You may also find the functions find, nnz, spy, and spfun helpful.
edit: see lambdageek for the exact same answer with a bit more elegance.
The other answers are helping to guide you to use a sparse matrix instead of your current dense solution. This is made a little more difficult since current Matlab doesn't support N-dimensional sparse arrays. One way to implement this is to
replace
zeros(100,100,100,100)
with
sparse(100*100*100*100,1)
this will store all your counts in a sparse array; as long as most remain zero, you will be OK for memory.
then to access this data, instead of:
Acc(h,i,j,k)=Acc(h,i,j,k)+1
use:
index = h + 100*(i-1) + 100^2*(j-1) + 100^3*(k-1)
Acc(index,1) = Acc(index,1)+1
(note the -1 offsets, since Matlab indexing is 1-based; sub2ind, as in the answer above, does this computation for you)
See Avoiding 'Out of Memory' Errors
Your statement would require far more than 4 GB of RAM: zeros(300,300,300,300) allocates 300^4 doubles at 8 bytes each, around 65 GB.
Solutions to 'Out of Memory' problems fall into two main categories:
Maximizing the memory available to MATLAB (i.e., removing or increasing limits) on your system via operating system selection and system configuration. These usually have the greatest overall applicability but are potentially the most disruptive (e.g. using a different operating system). These techniques are covered in the first two sections of this document.
Minimizing the memory used by MATLAB by making your code more memory efficient. These are all algorithm and application specific and therefore are less broadly applicable. These techniques are covered in later sections of this document.
In your case the latter seems to be the solution - try reducing the amount of memory used/required.