Matlab: Randomly select from "slowly varying" index set

I would like to find or implement a Matlab data structure that allows me to efficiently do the following three things:
Retrieve an element uniformly at random.
Add a new element.
Delete an element. (If it helps, this element was just "retrieved" out of the structure, so I can use both its location and its value to delete it).
Since I don't need duplicates, this structure is mathematically equivalent to a set. Also, my elements are always integers in the range 1 to 2500; it is not unusual for the set to be this entire range.
What is such a data structure? I've thought of using something like containers.Map or java.util.HashSet, but I don't know how to satisfy the first requirement in this case, because I don't know how to efficiently retrieve the nth key of such a structure. An ordinary array can achieve the first requirement of course, but it is a bad choice for the second and third requirements because of inefficient resizing.
For some context for why I'm looking to do this, in some current code I spent about 1/4 of the runtime doing:
find(x>0,Inf)
and then randomly retrieving an element from this vector. Yet this vector changes very little, and in a very predictable manner, in each iteration of my program. So I would prefer to carry around a data structure and update it as I go rather than recomputing it every time.
If you're familiar with Haskell, one way to implement the operations I'm looking to support would be
randomSelect set = fmap (\n -> elemAt n set) $ randomRIO (0,size set-1)
along with insert and delete, from Data.Set. But I have other reasons not to use Haskell in this project, and I don't know how to implement the backend of Data.Set myself.

Frequently, the best way to decrease time complexity is to increase space complexity. Given that your sets are going to be rather small, we can probably afford to use a little extra space.
To contain the set itself, you can use a preallocated array:
maxSize = 2500;
theSet = zeros(1, maxSize); % set elements
setCount = 0; % number of set elements
You can then have an auxiliary array to check for set membership:
isMember = zeros(1, maxSize);
To insert a new element newval into the set, add it to the end of theSet and increment the count (assuming there's room):
if ~isMember(newval)
assert(setCount < maxSize, 'Too many elements in set.');
setCount = setCount + 1; % Matlab has no ++ operator
theSet(setCount) = newval;
isMember(newval) = 1;
else
% tried to add duplicate element... do something here
end
To delete an element by index delidx, swap the element to be deleted and the last element and decrement the count:
assert(delidx <= setCount, 'Tried to remove element beyond end of set.');
isMember(theSet(delidx)) = 0;
theSet(delidx) = theSet(setCount); % move the last element into the gap
setCount = setCount - 1; % Matlab has no -- operator
Getting a random element of the set is then simple, just:
randidx = randi(setCount);
randelem = theSet(randidx);
All operations are O(1) and the only real disadvantage is that we have to carry along two arrays of size maxSize. Because of that you probably don't want to put these operations in functions as you'd end up creating new arrays on every function call. You'd be better off putting them inline or, better yet, wrapping them in a nice class.
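As a rough sketch of the "nice class" idea, the three operations could be wrapped in a handle class along these lines. The class name RandomIndexSet and its method names are made up for illustration. As a small variation on the answer above, the membership array stores each element's position instead of a 0/1 flag, so that deletion by value is also O(1). It assumes elements are integers in 1..maxSize and that sample is never called on an empty set.

```matlab
% RandomIndexSet.m  (hypothetical class name; save under this file name)
classdef RandomIndexSet < handle
    properties (SetAccess = private)
        items      % preallocated storage; only items(1:count) is valid
        position   % position(v) is the index of v in items, or 0 if absent
        count = 0  % current number of elements
    end
    methods
        function obj = RandomIndexSet(maxSize)
            obj.items = zeros(1, maxSize);
            obj.position = zeros(1, maxSize);
        end
        function insert(obj, v)
            if obj.position(v) == 0          % silently ignore duplicates
                obj.count = obj.count + 1;
                obj.items(obj.count) = v;
                obj.position(v) = obj.count;
            end
        end
        function remove(obj, v)
            idx = obj.position(v);
            if idx > 0
                last = obj.items(obj.count);
                obj.items(idx) = last;       % move the last element into the gap
                obj.position(last) = idx;
                obj.position(v) = 0;
                obj.count = obj.count - 1;
            end
        end
        function v = sample(obj)
            % uniform over current members; errors if the set is empty
            v = obj.items(randi(obj.count));
        end
    end
end
```

Because it is a handle class, the arrays are modified in place rather than copied on each call. Usage would look like s = RandomIndexSet(2500); s.insert(42); v = s.sample(); s.remove(v);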

Related

Vectorize matlab code to map nearest values in two arrays

I have two lists of timestamps and I'm trying to create a map between them that uses the imu_ts as the true time and tries to find the nearest vicon_ts value to it. The output is a 3xd matrix where the first row is the imu_ts index, the third row is the unix time at that index, and the second row is the index of the closest vicon_ts value above the timestamp in the same column.
Here's my code so far and it works, but it's really slow. I'm not sure how to vectorize it.
function tmap = sync_times(imu_ts, vicon_ts)
tstart = max(vicon_ts(1), imu_ts(1));
tstop = min(vicon_ts(end), imu_ts(end));
%trim imu data to
tmap(1,:) = find(imu_ts >= tstart & imu_ts <= tstop);
tmap(3,:) = imu_ts(tmap(1,:));%Use imu_ts as ground truth
%Find nearest indices in vicon data and map
vic_t = 1;
for i = 1:size(tmap,2)
%
while(vicon_ts(vic_t) < tmap(3,i))
vic_t = vic_t + 1;
end
tmap(2,i) = vic_t;
end
The timestamps are already sorted in ascending order, so this is essentially an O(n) operation but because it's looped it runs slowly. Any vectorized ways to do the same thing?
Edit
It appears to be running faster than I expected or first measured, so this is no longer a critical issue. But I would be interested to see if there are any good solutions to this problem.
Have a look at knnsearch in MATLAB. Use cityblock distance and also put an additional constraint that the data point in vicon_ts should be less than its neighbour in imu_ts. If it is not then take the next index. This is required because cityblock takes absolute distance. Another option (and preferred) is to write your custom distance function.
I believe that your current method is sound, and I would not try and vectorize any further. Vectorization can actually be harmful when you are trying to optimize some inner loops, especially when you know more about the context of your data (e.g. it is sorted) than the Mathworks engineers can know.
Things that I typically look for when I need to optimize some piece of code like this are:
All arrays are pre-allocated (this is the biggest driver of performance)
Fast inner loops use simple code (Matlab does pretty effective JIT on basic commands, but must interpret others.)
Take advantage of any special data features that you have, e.g. use sort appropriate algorithms and early exit conditions from some loops.
You're already doing all this. I recommend no change.
A good start might be to get rid of the while, try something like:
for i = 1:size(tmap,2)
tmap(2,i) = find(vicon_ts >= tmap(3,i), 1); % first vicon sample at or after this imu time
end
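For a fully vectorized variant, one possibility is to interpolate the vicon indices at the imu times with interp1 and the 'next' method. This is only a sketch: the timestamps below are made up, and it assumes vicon_ts is strictly increasing (no repeated timestamps) and that your Matlab release supports the 'next' option.

```matlab
% Example data (made up for illustration); both vectors sorted ascending.
imu_ts   = [0.5 1.0 1.7 2.2 3.0 3.9];
vicon_ts = [1.0 1.5 2.0 2.5 3.0 3.5];

tstart = max(vicon_ts(1), imu_ts(1));
tstop  = min(vicon_ts(end), imu_ts(end));

% Trim the imu data to the overlapping time window.
inRange = imu_ts >= tstart & imu_ts <= tstop;
tmap = zeros(3, nnz(inRange));
tmap(1,:) = find(inRange);
tmap(3,:) = imu_ts(tmap(1,:));

% Index of the first vicon sample at or after each imu time.
tmap(2,:) = interp1(vicon_ts, 1:numel(vicon_ts), tmap(3,:), 'next');
```

Whether this beats the original loop depends on the data size, so it is worth timing both.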

Caching Matlab function results to file

I'm writing a simulation in Matlab.
I will eventually run this simulation hundreds of times.
In each simulation run, there are millions of simulation cycles.
In each of these cycles, I calculate a very complex function, which takes ~0.5 sec to finish.
The function input is a long bit array (>1000 bits) - which is an array of 0 and 1.
I hold the bit arrays in a matrix of 0 and 1, and for each one of them I only run the function once - as I save the result in a different array (res) and check if the bit array is in the matrix before running the functions:
for i=1:1000000000
%pick a bit array somehow
[~,indx] = ismember(bit_array,bit_matrix,'rows');
if indx == 0
indx = numel(res) + 1;
bit_matrix(indx,:) = bit_array;
res(indx) = complex_function(bit_array);
end
result = res(indx)
%do something with result
end
I have two questions, really:
Is there a more efficient way to find the index of a row in a matrix than 'ismember'?
Since I run the simulation many times, and there is a big overlap in the bit-arrays I'm getting, I want to cache the matrix between runs so that I don't recalculate the function over the same bit-arrays over and over again. How do I do that?
The answer to both questions is to use a map. There are a few steps to do this.
First you will need a function to turn your bit_array into either a number or a string. For example, turn [0 1 1 0 1 0] into '011010'. (Matlab only supports scalar or string keys, which is why this step is required.)
Define a map object
cachedRunMap = containers.Map; %See edit below for more on this
To check if a particular case has been run, use isKey.
cachedRunMap.isKey('011010');
To add the results of a run use the appending syntax
cachedRunMap('011010') = [0 1 1 0 1]; %Or whatever your result is.
To retrieve cached results, use the getting syntax
tmpResult = cachedRunMap.values({'011010'});
This should efficiently store and retrieve values until you run out of system memory.
Putting this together, now your code would look like this:
%Hacky magic function to convert an array into a string of '0' and '1'
strFromBits = @(x) char((x(:)'~=0)+48); % 48 is the character code of '0'
%Initialize the map
cachedRunMap = containers.Map;
%Loop, computing and storing results as needed
for i=1:1000000000
%pick a bit array somehow
strKey = strFromBits(bit_array);
if cachedRunMap.isKey(strKey)
result = cachedRunMap(strKey);
else
result = complex_function(bit_array);
cachedRunMap(strKey) = result;
end
%do something with result
end
If you want a key which is not a string, that needs to be declared at step 2. Some examples are:
cachedRunMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'double');
Setting a KeyType of 'char' sets the map to use strings as keys. All other types must be scalars.
Regarding issues as you scale this up (per your recent comments)
Saving data between sessions: There should be no issues saving this map to a *.mat file, up to the limits of your system's memory
Purging old data: I am not aware of a straightforward way to add LRU features to this map. If you can find a Java implementation you can use it within Matlab pretty easily. Otherwise it would take some thought to determine the most efficient method of keeping track of the last time a key was used.
Sharing data between concurrent sessions: As you indicated, this probably requires a database to perform efficiently. The DB table would be two columns (3 if you want to implement LRU features), the key, value, (and last used time if desired). If your "result" is not a type which easily fits into SQL (e.g. a non-uniform size array, or complex structure) then you will need to put additional thought into how to store it. You will also need a method to access the database (e.g. the database toolbox, or various tools on the Mathworks file exchange). Finally you will need to actually set up a database on a server (e.g. MySQL if you are cheap, like me, or whatever you have the most experience with, or can find the most help with.) This is not actually that hard, but it takes a bit of time and effort the first time through.
Another approach to consider (much less efficient, but not requiring a database) would be to break up the data store into a large (e.g. 1000's or millions) number of maps. Save each into a separate *.mat file, with a filename based on the keys contained in that map (e.g. the first N characters of your string key), and then load/save these files between sessions as needed. This will be pretty slow ... depending on your usage it may be faster to recalculate from the source function each time ... but it's the best way I can think of without setting up the DB (clearly a better answer).
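As a small illustration of the "saving data between sessions" point, the map can be loaded at the start of a run and saved at the end. This is a sketch only: the file name is hypothetical, and complex_function here is a trivial stand-in for the real, expensive function.

```matlab
% Hypothetical cache location; pick a path that suits your project.
cacheFile = fullfile(tempdir, 'cachedRunMap_demo.mat');

% Load the cache from a previous run if it exists, else start empty.
if exist(cacheFile, 'file')
    loaded = load(cacheFile, 'cachedRunMap');
    cachedRunMap = loaded.cachedRunMap;
else
    cachedRunMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
end

strFromBits = @(x) char((x(:)' ~= 0) + '0');   % [0 1 1] -> '011'
complex_function = @(bits) sum(bits);          % stand-in for the slow function

bit_array = [0 1 1 0 1 0];
strKey = strFromBits(bit_array);
if ~isKey(cachedRunMap, strKey)
    cachedRunMap(strKey) = complex_function(bit_array);
end
result = cachedRunMap(strKey);

% Persist the (possibly updated) cache for the next run.
save(cacheFile, 'cachedRunMap');
```

Saving once at the end of a run (or every so many iterations) keeps the file I/O out of the inner loop.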
For a large list, a hand-coded binary search can beat ismember, provided that keeping the list in sorted order isn't too expensive. Before optimizing, use the profiler to confirm that ismember really is your bottleneck. If there aren't too many distinct values, you could also store them in a containers.Map by packing each row of bit_matrix into a char array and using it as the key.
If it's small enough to fit in memory, you could store it in a MAT file using save and load. They can store any basic Matlab datatype. Have the simulation save the accumulated res and bit_matrix at the end of its run, and re-load them the next time it's called.
I think that you should use containers.Map() for the purpose of speedup.
The general idea is to hold a map that contains all hash values. If your bit arrays have uniform distribution under the hash function, most of the time you won't need the call to ismember.
Since the key type cannot be an array in Matlab, you can calculate some hash function on your array of bits.
For example:
function s = GetHash(bitArray)
s = mod( sum(bitArray), intmax('uint32'));
end
This is a lousy hash function, but enough to understand the principle.
Then the code would look like:
map = containers.Map('KeyType','uint32','ValueType','any');
for i=1:1000000000
%pick a bit array somehow
s = GetHash(bit_array);
if isKey(map, s) % Hash seen before: do the slow check.
[~,indx] = ismember(bit_array,bit_matrix,'rows');
else
map(s) = 1;
indx = 0; % Hash never seen, so this bit array is definitely new.
end
if indx == 0
indx = numel(res) + 1;
bit_matrix(indx,:) = bit_array;
res(indx) = complex_function(bit_array);
end
result = res(indx)
%do something with result
end

Concatenate equivalent in MATLAB for a single value

I am trying to use MATLAB in order to generate a variable whose elements are either 0 or 1. I want to define this variable using some kind of concatenation (equivalent of Java string append) so that I can add as many 0's and 1's according to some upper limit.
I can only think of using a for loop to append values to an existing variable. Something like
variable=1;
for i=1:N
if ( i%2==0)
variable = variable.append('0')
else
variable = variable.append('1')
i=i+1;
end
Is there a better way to do this?
In MATLAB, you can almost always avoid a loop by treating arrays in a vectorized way.
The result of pseudo-code you provided can be obtained in a single line as:
variable = mod((1:N),2);
The above line generates a row vector [1,2,...,N] (with the code (1:N), use (1:N)' if you need a column vector) and the mod function (as most MATLAB functions) is applied to each element when it receives an array.
That's not valid Matlab code:
The % indicates the start of a comment, hence introducing a syntax error.
There is no append method (at least not for arrays).
There's no need to increment the index in a for loop.
Aside from that, it's a bad idea to have Matlab "grow" variables, as memory needs to be reallocated each time, slowing it down considerably. The correct approach is:
variable=zeros(N,1);
for i=1:N
variable(i)=mod(i,2);
end
If you really do want to grow variables (sometimes it is inevitable) you can use this:
variable=[variable;1];
Use ; for appending rows, use , for appending columns (does the same as vertcat and horzcat). Use cat if you have more than 2 dimensions in your array.
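To make the appending syntax concrete, here is a short sketch (N = 6 is an arbitrary choice) that builds the alternating vector in one line and then appends a single value as a column and as a row:

```matlab
N = 6;
variable = mod(1:N, 2);          % row vector [1 0 1 0 1 0]

rowAppended = [variable, 1];     % append as a new column (same as horzcat)
colAppended = [variable'; 1];    % append as a new row of a column vector (same as vertcat)

sameAsHorzcat = isequal(rowAppended, horzcat(variable, 1));
sameAsVertcat = isequal(colAppended, vertcat(variable', 1));
```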

In-Place Quicksort in matlab

I wrote a small quicksort implementation in matlab to sort some custom data. Because I am sorting a cell-array and I need the indexes of the sort-order and do not want to restructure the cell-array itself I need my own implementation (maybe there is one available that works, but I did not find it).
My current implementation works by partitioning into a left and right array and then passing these arrays to the recursive call. Because I do not know the size of left and right I just grow them inside a loop which I know is horribly slow in matlab.
I know you can do an in place quicksort, but I was warned about never modifying the content of variables passed into a function, because call by reference is not implemented the way one would expect in matlab (or so I was told). Is this correct? Would an in-place quicksort work as expected in matlab or is there something I need to take care of? What other hints would you have for implementing this kind of thing?
Implementing a sort on complex data in user M-code is probably going to be a loss in terms of performance due to the overhead of M-level operations compared to Matlab's builtins. Try to reframe the operation in terms of Matlab's existing vectorized functions.
Based on your comment, it sounds like you're sorting on a single-value key that's inside the structs in the cells. You can probably get a good speedup by extracting the sort key to a primitive numeric array and calling the builtin sort on that.
%// An example cell array of structs that I think looks like your input
c = num2cell(struct('foo',{'a','b','c','d'}, 'bar',{6 1 3 2}))
%// Let's say the "bar" field is what you want to sort on.
key = cellfun(@(s)s.bar, c) %// Extract the sort key using cellfun
[sortedKey,ix] = sort(key) %// Sort on just the key using fast numeric sort() builtin
sortedC = c(ix); %// ix is a reordering index in to c; apply the sort using a single indexing operation
reordering = cellfun(@(s)s.foo, sortedC) %// for human readability of results
If you're sorting on multiple field values, extract all the m key values from the n cells to an n-by-m array, with columns in descending order of precedence, and use sortrows on it.
%// Multi-key sort
keyCols = {'bar','baz'};
key = NaN(numel(c), numel(keyCols));
for i = 1:numel(keyCols)
keyCol = keyCols{i};
key(:,i) = cellfun(@(s)s.(keyCol), c);
end
[sortedKey,ix] = sortrows(key);
sortedC = c(ix);
reordering = cellfun(@(s)s.foo, sortedC)
One of the keys to performance in Matlab is to get your data in primitive arrays, and use vectorized operations on those primitive arrays. Matlab code that looks like C++ STL code with algorithms and references to comparison functions and the like will often be slow; even if your code is good in O(n) complexity terms, the fixed cost of user-level M-code operations, especially on non-primitives, can be a killer.
Also, if your structs are homogeneous (that is, they all have the same set of fields), you can store them directly in a struct array instead of a cell array of structs, and it will be more compact. If you can do more extensive redesign, rearranging your data structures to be "planar-organized" - where you have a struct of arrays, reading across the ith element of all the fields as a record, instead of an array of structs of scalar fields - could be a good efficiency win. Either of these reorganizations would make constructing the sort key array cheaper.
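The two reorganizations might look roughly like this, reusing the made-up foo/bar fields from the example above. This is a sketch and not the asker's actual data layout.

```matlab
% Struct array instead of a cell array of structs.
s = struct('foo', {'a','b','c','d'}, 'bar', {6 1 3 2});   % 1x4 struct array
[~, ix] = sort([s.bar]);      % [s.bar] gathers the keys into a numeric vector
sortedS = s(ix);              % reorder the records with one indexing operation

% "Planar" organization: one struct whose fields are whole arrays.
p.foo = 'abcd';
p.bar = [6 1 3 2];
[p.bar, ixp] = sort(p.bar);   % sort the key field directly
p.foo = p.foo(ixp);           % apply the same permutation to every other field
```

In both cases the key extraction no longer needs a cellfun call with an anonymous function.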
In this post, I only explain MATLAB function-calling convention, and am not discussing the quick-sort algorithm implementation.
When calling functions, MATLAB passes built-in data types by-value, and any changes made to such arguments are not visible outside the function.
function y = myFunc(x)
x = x .* 2; %# pass-by-value, changes only visible inside function
y = x;
end
This could be inefficient for large data especially if they are not modified inside the functions. Therefore MATLAB internally implements a copy-on-write mechanism: for example when a vector is copied, only some meta-data is copied, while the data itself is shared between the two copies of the vector. And it is only when one of them is modified, that the data is actually duplicated.
function y = myFunc(x)
%# x was never changed, thus passed-by-reference avoiding making a copy
y = x .* 2;
end
Note that for cell-arrays and structures, only the cells/fields modified are passed-by-value (this is because cells/fields are internally stored separately), which makes copying more efficient for such data structures. For more information, read this blog post.
In addition, versions R2007 and upward (I think) detect in-place operations on data and optimize such cases.
function x = myFunc(x)
x = x.*2;
end
Obviously when calling such a function, the LHS must be the same as the RHS (x = myFunc(x);). Also in order to take advantage of this optimization, in-place functions must be called from inside another function.
In MEX-functions, although it is possible to change input variables without making copies, it is not officially supported and might yield unexpected results...
For user-defined types (OOP), MATLAB introduced the concept of value object vs. handle object supporting reference semantics.

For loops in Matlab

I run through a for loop, each time extracting certain elements of an array, say element1, element2, etc. How do I then pool all of the elements I've extracted together so that I have a list of them?
John covered the basics of for loops, so...
Note that matlab code is often more efficient if you vectorize it instead of using loops (this is less true than it used to be). For example, if in your loop you're just grabbing the first value in every row of a matrix, instead of looping you can do:
yourValues = theMatrix(:,1)
Where the solo : operator indicates "every possible value for this index". If you're just starting out in matlab it is definitely worthwhile to read up on matrix indexing in matlab (among other topics).
Build the list as you go:
for i = 1:whatever
% pick out theValue
yourList(i) = theValue
end
I'm assuming that you pick out one element per loop iteration. If not, just maintain a counter and use that instead of i.
Also, I'm not assuming you're pulling out your elements from the same position in your array each time through the loop. If you're doing that, then look into Donnie's suggestion.
In MATLAB, you can always perform a loop operation. But the recommended "MATLAB" way is to avoid looping:
Suppose you want to get the subset of array items
destArray = [];
for k=1:numel(sourceArray)
if isGoodMatch(sourceArray(k))
destArray = [destArray, sourceArray(k)]; % This will create a warning about resizing
end
end
You perform the same task without looping:
matches = arrayfun(@(a) isGoodMatch(a), sourceArray); % returns a vector of bools
destArray = sourceArray(matches);
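If a loop is unavoidable, the "maintain a counter" idea mentioned earlier can be combined with preallocation so the list is never resized inside the loop. A minimal sketch follows; the data and the isGoodMatch predicate are invented for illustration.

```matlab
sourceArray = [3 8 1 9 4 7];
isGoodMatch = @(a) a > 4;                 % hypothetical predicate

destArray = zeros(size(sourceArray));     % preallocate for the worst case
n = 0;                                    % number of elements collected so far
for k = 1:numel(sourceArray)
    if isGoodMatch(sourceArray(k))
        n = n + 1;
        destArray(n) = sourceArray(k);
    end
end
destArray = destArray(1:n);               % trim the unused tail

% When the predicate itself is vectorizable, logical indexing does the same:
vectorized = sourceArray(sourceArray > 4);
```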