Fast associative arrays or maps in Matlab

I need to build a fast one-to-one mapping between two large arrays of integers in Matlab. The mapping should take as input an element from a pre-defined array, e.g.:
in_range = [-200 2 56 45 ... ];
and map it, by its index in the previous array, to the corresponding element from another pre-defined array, e.g.:
out_range = [-10000 0 97 600 ... ];
For example, in the case above, my_map(-200) should output -10000, and my_map(45) should output 600.
I need a solution that
Can map very large arrays (~100K elements) relatively efficiently.
Scales well with the bounds of in_range and out_range (i.e. their min and max values)
So far, I have solved this problem using Matlab's external interface to Java with Java's HashMaps, but I was wondering if there was a Matlab-native alternative.
Thanks!

The latest versions of Matlab have a native hash type, containers.Map (introduced in R2008b). I'm using 2007b, where it isn't available, so I use structs whenever I need a hash: just convert the integers to valid field names with genvarname.
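For reference, on R2008b and later, containers.Map provides this mapping directly; a minimal sketch using the toy data from the question:

```matlab
in_range  = [-200 2 56 45];
out_range = [-10000 0 97 600];
my_map = containers.Map(in_range, out_range);   % keys and values paired by index
my_map(-200)   % -10000
my_map(45)     % 600
```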

Related

Minizinc, counting occurrences in array of pairs

I'm new to constraint programming and toying around with some basic operations. I want to count the number of occurrences of an arbitrary element x in an array of pairs.
For instance, the following array has 2 eights, and 1 of every other element.
sampleArray = [{8,13}, {21,34}, {8,55}]
I wonder how I am to extract this information, possibly using built-in functions.
I'm not sure I understand exactly what you want to do here. Do you want to count only the first element in the pair?
Note that the example you show is an array of sets, not a 2-dimensional matrix. Extracting and counting the first(?) element in each pair is probably easier if you have a two-dimensional matrix (constructed with array2d).
In general there are at least two global constraints that you can use for this: "count" and perhaps also "global_cardinality". See http://www.minizinc.org/2.0/doc-lib/doc-globals-counting.html
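For instance, after flattening the pairs into a plain 1-D array (done by hand here; array2d works too), count can tally the eights. A sketch, with illustrative identifiers:

```minizinc
include "globals.mzn";

array[1..6] of int: xs = [8, 13, 21, 34, 8, 55];  % the three pairs, flattened
var 0..6: n8;
constraint count(xs, 8, n8);
solve satisfy;
```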

Average across matrices within a structures

I have a structure P with 20 matrices. Each matrix is 53x63x46 double. The names of the matrices are fairly random, for instance S154, S324, S412, etc. Is there any way I can do an average across these matrices without having to type out like this?
M=(P.S154 + P.S324 + P.S412 + ...)/20
Also, does it make sense to use a structure for a computation like this? According to this post, perhaps it should be converted to a cell array.
struct2cell(P)
is a cell array each of whose elements is one of your structure fields (the field names are discarded). Then
cell2mat(struct2cell(P))
is the result of concatenating these matrices along the first axis. You might reasonably ask why it does that rather than, say, making a new axis and giving you a 4-dimensional array, but expecting sensible answers to such questions is asking for frustration. Anyway, unless I'm getting the dimensions muddled,
reshape(cell2mat(struct2cell(P)),[53 20 63 46])
will then give you roughly the 4-dimensional array you're after, with the "new" axis being (of course!) number 2. So now
mean(reshape(cell2mat(struct2cell(P)),[53 20 63 46]),2)
will compute the mean along that axis. The result will have shape [53 1 63 46], so now you will need to fix up the axes again:
reshape(mean(reshape(cell2mat(struct2cell(P)),[53 20 63 46]),2),[53 63 46])
Since you are using a structure, each matrix is stored under a field name.
Therefore, you need to:
1 - use the function fieldnames to extract all the matrix names inside your structure - http://www.mathworks.com/help/matlab/ref/fieldnames.html
2 - then access each matrix via dynamic field names:
names = fieldnames(P);
matrix1 = P.(names{1});
Using a for loop you can then make your calculations pretty fast!
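A sketch of that loop, assuming every field of P holds a 53x63x46 matrix (field names and sizes taken from the question):

```matlab
names = fieldnames(P);            % e.g. {'S154'; 'S324'; 'S412'; ...}
M = zeros(53, 63, 46);
for k = 1:numel(names)
    M = M + P.(names{k});         % dynamic field access
end
M = M / numel(names);             % element-wise average across all matrices
```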

When to use a cell, matrix, or table in Matlab

I am fairly new to Matlab and I am trying to figure out when it is best to use cells, tables, or matrices to store sets of data and then work with the data.
What I want is to store data that has multiple lines that include strings and numbers and then want to work with the numbers.
For example, a line would look like:
'string 1', time, number1, number2
I know a matrix works best if all elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running Matlab 2012, so maybe that is part of the problem. Any help is appreciated. Thanks!
Use a matrix when :
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables - I don't think they are offered by the language itself; they might be a UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices, cell arrays, and structs as elements (because they're heterogeneous containers). In your case, you might take 2 approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; the first field would be a cell array of strings for the first column, the second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store your dates, and the rest of the fields are 1D matrices of doubles.
Concerning tables: I always used matrices or cell arrays until I had to do database-related things such as joining datasets by a unique key; the only way I found to do this was by using tables. It takes a while to get used to them, and it's a bit annoying that some functions that work on cell arrays don't work on tables, and vice versa. MATLAB could have done a better job explaining when to use one or the other, because it's not super clear from the documentation.
The situation that you describe seems to be as follows:
You have several columns. Entire columns consist of 1 datatype each, and all columns have an equal number of rows.
This seems to match exactly with the recommended situation for using a table:
T = table(var1,...,varN) creates a table from the input variables,
var1,...,varN . Variables can be of different sizes and data types,
but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure them out you can always switch to using one cell array for the first column and a matrix for all the others (in your example).
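If you do want to try tables (they require R2013b or later, so not the asker's 2012 release), here is a minimal sketch with made-up data matching the question's row layout:

```matlab
% Illustrative data: string, time, two numbers per row
name = {'string 1'; 'string 2'; 'string 3'};
t    = [0.5; 1.5; 2.5];           % times stored as plain numbers here
num1 = [1; 2; 3];
num2 = [10; 20; 30];
T = table(name, t, num1, num2);   % columns may differ in type, same row count
mean(T.num1)                      % numeric columns can be used directly
```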

How can I convert double values into integers for indices to create a sparse matrix in MATLAB?

I am using MATLAB to load a text file that I want to make a sparse matrix out of. The columns in the text file refer to the row indices and are of double type. I need them to be integers to be able to use them as indices for rows and columns. I tried using uint8, int32 and int64 to convert them to integers to build a sparse matrix like so, but I get:
??? Undefined function or method 'sparse' for input
arguments of type 'int64'.
Error in ==> make_network at 5
graph =sparse(int64(listedges(:,1)),int64(listedges(:,2)),ones(size(listedges,1),1));
How can I convert the text file entries loaded as double so as to be used by the sparse function?
There is no need for any conversion, keep the indices double:
r = round(listedges);
graph = sparse(r(:, 1), r(:, 2), ones(size(listedges, 1), 1));
There are two reasons why one might want to convert to int:
The first, because you have data type restrictions.
The second, your inputs may contain fractions and are un-fit to be used as integers.
If you want to convert because of the first reason - then there's no need to: Matlab works with double type by default and often treats doubles as ints (for example, when used as indices).
However, if you want to convert to integers because of the second reason (the numbers may be fractional), then you should use round(), ceil() or floor() - whatever suits your purpose best.
There is another very good reason (and really the primary one) why one may want to convert indices of any structure (array, matrix, etc.) to int.
If you ever program in any language other than Matlab, you would be familiar with wanting to save memory space, especially with large structures. Being able to address elements in such structures with indices other than double is key.
One major issue with Matlab is the inability to more finely control the size of multidimensional structures in this way. There are sparse matrix solutions, but those are not adequate for many cases. Cell arrays will preserve the data types upon access, however the storage for every element in the cell array is extremely wasteful in terms of storage (113 bytes for a single uint8 encapsulated in a cell).

optimal way of storing multidimensional array/tensor

I am trying to create a tensor (can be conceived as a multidimensional array) package in scala. So far I was storing the data in a 1D Vector and doing index arithmetic.
But slicing and subarrays are not so easy to get. One needs to do a lot of arithmetic to convert multidimensional indices to 1D indices.
Is there any optimal way of storing a multidimensional array? If not, i.e. 1D array is the best solution, how one can optimally slice arrays (some concrete code would really help me)?
The key to answering this question is: when is pointer indirection faster than arithmetic? The answer is pretty much never. In-order traversals can be about equally fast for 2D, and things get worse from there:
2D random access
Array of Arrays - 600 M / second
Multiplication - 1.1 G / second
3D in-order
Array of Array of Arrays - 2.4G / second
Multiplication - 2.8 G / second
(etc.)
So you're better off just doing the math.
Now the question is how to do slicing. Initially, if you have dimensions of n1, n2, n3, ... and indices of i1, i2, i3, ..., you compute the offset into the array by
i = i1 + n1*(i2 + n2*(i3 + ... ))
where typically i1 is chosen to be the last (innermost) dimension (but in general it should be the dimension most often in the innermost loop). That is, if it were an array of arrays of (...), you would index into it as a(...)(i3)(i2)(i1).
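For a fixed number of dimensions, that formula is a one-liner (a sketch; the name index3 is illustrative):

```scala
// Linear index into flat storage for a 3-D tensor with
// dimensions n1 x n2 x n3, where i1 is the innermost index.
def index3(i1: Int, i2: Int, i3: Int, n1: Int, n2: Int): Int =
  i1 + n1 * (i2 + n2 * i3)
```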
Now suppose you want to slice this. First, you might give an offset o1, o2, o3 to every index:
i = (i1 + o1) + n1*((i2 + o2) + n2*((i3 + o3) + ...))
and then you will have a shorter range on each (let's call these m1, m2, m3, ...).
Finally, if you eliminate a dimension entirely - let's say, for example, that m2 == 1, meaning that i2 == 0 - you just simplify the formula:
i = (i1 + o1 + n1*o2) + (n1*n2)*((i3 + o3) + ...)
I will leave it as an exercise to the reader to figure out how to do this in general, but note that we can store the new constants o1 + n1*o2 and n1*n2, so we don't need to keep redoing that math on every slice access.
Finally, if you are allowing arbitrary dimensions, you just put that math into a while loop. This does, admittedly, slow it down a little bit, but you're still at least as well off as if you'd used a pointer dereference (in almost every case).
From my own general experience: If you have to write a multidimensional (rectangular) array class yourself, do not aim to store the data as Array[Array[Double]] but use a one-dimensional storage and add helper methods for converting the multidimensional access tuples to a simple index and vice versa.
When using lists of lists, you need to do much too much bookkeeping to ensure that all lists are the same size, and you need to be careful when assigning a sublist to another sublist (because this makes the assigned-to sublist identical to the first, and you wonder why changing the item at (0,5) also changes (3,5)).
Of course, if you expect a certain dimension to be sliced much more often than another and you want to have reference semantics for that dimension as well, a list of lists will be the better solution, as you may pass around those inner lists as a slice to the consumer without making any copy. But if you don’t expect that, it is a better solution to add a proxy class for the slices which maps to the multidimensional array (which in turn maps to the one-dimensional storage array).
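A minimal sketch of that design - flat storage plus helper methods - where all names are illustrative, not from any library:

```scala
// 2-D "tensor" backed by a single flat Array[Double].
class Tensor2D(val rows: Int, val cols: Int) {
  private val data = new Array[Double](rows * cols)
  private def idx(i: Int, j: Int): Int = j + cols * i   // j is the innermost index
  def apply(i: Int, j: Int): Double = data(idx(i, j))
  def update(i: Int, j: Int, v: Double): Unit = data(idx(i, j)) = v
  def row(i: Int): Array[Double] =
    data.slice(cols * i, cols * (i + 1))                // a copy, not a view
}
```

A proxy class over idx (holding offsets and strides, as described above) would turn row into a zero-copy slice.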
Just an idea: how about a map with Int-tuples as keys?
Example:
val twoDimMatrix = Map((1,1) -> -1, (1,2) -> 5, (2,1) -> 7.7, (2,2) -> 9)
and then you could
scala> twoDimMatrix.filterKeys{_._2 == 1}.values
res1: Iterable[AnyVal] = MapLike(-1, 7.7)
or
twoDimMatrix.filterKeys{tuple => { val (dim1, dim2) = tuple; dim1 == dim2}} //diagonal
this way the index arithmetics would be done by the map. I don't know how practical and fast this is though.
If the number of dimensions is known at design time, you can use a collection of collections ... (n times) of collections. If you must be able to build a vector for any number of dimensions, then there's nothing convenient in the Scala API to do it (as far as I know).
You can simply store the information in a multidimensional array (e.g. `Array[Array[Double]]`).
If the tensors are small and can fit in cache, you can have a performance improvement with 1D arrays because of data memory locality. It should also be faster to copy the whole tensor.
As for slicing arithmetic: it depends on what kind of slicing you require. I suppose you already have a function for extracting an element based on indices, so write a basic slicing loop based on index iteration, manually insert the expression for extracting an element, and then try to simplify the whole loop. That is often simpler than writing a correct expression from scratch.