I am trying to create a tensor (which can be thought of as a multidimensional array) package in Scala. So far I have been storing the data in a 1D Vector and doing index arithmetic.
But slicing and subarrays are not so easy to get. One needs to do a lot of arithmetic to convert multidimensional indices to 1D indices.
Is there an optimal way of storing a multidimensional array? If not, i.e. if a 1D array is the best solution, how can one optimally slice arrays (some concrete code would really help me)?
The key to answering this question is: when is pointer indirection faster than arithmetic? The answer is pretty much never. In-order traversals can be about equally fast for 2D, and things get worse from there:
2D random access
  Array of Arrays: 600 M / second
  Multiplication: 1.1 G / second
3D in-order
  Array of Array of Arrays: 2.4 G / second
  Multiplication: 2.8 G / second
(etc.)
So you're better off just doing the math.
Now the question is how to do slicing. Initially, if you have dimensions of n1, n2, n3, ... and indices of i1, i2, i3, ..., you compute the offset into the array by
i = i1 + n1*(i2 + n2*(i3 + ... ))
where typically i1 is chosen to be the last (innermost) dimension (but in general it should be the dimension most often in the innermost loop). That is, if it were an array of arrays of (...), you would index into it as a(...)(i3)(i2)(i1).
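A minimal Scala sketch of the offset formula above, written as a loop so it works for any number of dimensions. The names Indexing, dims and idx are illustrative and not from any library; dims(0) plays the role of n1, the fastest-varying dimension.

```scala
object Indexing {
  // Computes i = i1 + n1*(i2 + n2*(i3 + ...)) by Horner's rule,
  // working from the outermost index inward.
  def offset(dims: Array[Int], idx: Array[Int]): Int = {
    require(dims.length == idx.length, "one index per dimension")
    var i = 0
    var d = dims.length - 1
    while (d >= 0) {
      require(idx(d) >= 0 && idx(d) < dims(d), s"index $d out of range")
      i = i * dims(d) + idx(d)
      d -= 1
    }
    i
  }
}
```

For dimensions (4, 3, 2) and indices (1, 2, 1) this gives 1 + 4*(2 + 3*1) = 21.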
Now suppose you want to slice this. First, you might give an offset o1, o2, o3 to every index:
i = (i1 + o1) + n1*((i2 + o2) + n2*((i3 + o3) + ...))
and then you will have a shorter range on each (let's call these m1, m2, m3, ...).
Finally, if you eliminate a dimension entirely--let's say, for example, that m2 == 1, meaning that i2 == 0, you just simplify the formula:
i = (i1 + o1 + n1*o2) + (n1*n2)*((i3 + o3) + ...)
I will leave it as an exercise to the reader to figure out how to do this in general, but note that we can store the new constants o1 + n1*o2 and n1*n2 so we don't need to keep doing that math on the slice.
Finally, if you are allowing arbitrary dimensions, you just put that math into a while loop. This does, admittedly, slow it down a little bit, but you're still at least as well off as if you'd used a pointer dereference (in almost every case).
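One possible way to package the stored constants, as a sketch rather than a definitive design: a view that keeps a base offset and one stride per dimension over the shared flat array. TensorView, slice and select are hypothetical names introduced here. Slicing shifts the offset and shrinks a range; eliminating a dimension folds its contribution into the offset and drops its stride.

```scala
// Hypothetical view over a flat array: `offset` and `strides` hold the
// precomputed constants, so slicing never copies the data.
final class TensorView(val data: Array[Double], val offset: Int,
                       val dims: Vector[Int], val strides: Vector[Int]) {
  // Flat index of a multidimensional index, computed in a while loop.
  def flatIndex(idx: Int*): Int = {
    require(idx.length == dims.length, "one index per dimension")
    var i = offset
    var d = 0
    while (d < dims.length) {
      require(idx(d) >= 0 && idx(d) < dims(d), s"index $d out of range")
      i += idx(d) * strides(d)
      d += 1
    }
    i
  }

  def apply(idx: Int*): Double = data(flatIndex(idx: _*))

  // Restrict dimension d to [from, until): shift the offset, shrink the range.
  def slice(d: Int, from: Int, until: Int): TensorView = {
    require(0 <= from && from <= until && until <= dims(d))
    new TensorView(data, offset + from * strides(d),
      dims.updated(d, until - from), strides)
  }

  // Fix dimension d at index k and drop that dimension from the view.
  def select(d: Int, k: Int): TensorView = {
    require(0 <= k && k < dims(d))
    new TensorView(data, offset + k * strides(d),
      dims.patch(d, Nil, 1), strides.patch(d, Nil, 1))
  }
}

object TensorView {
  // Full view with dims(0) varying fastest: strides are 1, n1, n1*n2, ...
  def apply(data: Array[Double], dims: Vector[Int]): TensorView = {
    require(data.length == dims.product, "data size must match dimensions")
    new TensorView(data, 0, dims, dims.scanLeft(1)(_ * _).init)
  }
}
```

With a 4 x 3 x 2 tensor t, t.slice(0, 1, 3).select(1, 2) is a 2 x 2 view whose element (0, 1) is the original element (1, 2, 1), and no data is copied.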
From my own general experience: If you have to write a multidimensional (rectangular) array class yourself, do not aim to store the data as Array[Array[Double]] but use a one-dimensional storage and add helper methods for converting the multidimensional access tuples to a simple index and vice versa.
When using lists of lists, you need to do far too much bookkeeping to ensure that all lists are the same size, and you need to be careful when assigning one sublist to another (because this makes the assigned-to sublist identical to the first, and you wonder why changing the item at (0,5) also changes (3,5)).
Of course, if you expect a certain dimension to be sliced much more often than another and you want to have reference semantics for that dimension as well, a list of lists will be the better solution, as you may pass around those inner lists as a slice to the consumer without making any copy. But if you don’t expect that, it is a better solution to add a proxy class for the slices which maps to the multidimensional array (which in turn maps to the one-dimensional storage array).
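The aliasing pitfall mentioned above can be reproduced in a few lines of Scala. This is only an illustration with made-up sizes (4 rows of 6 elements), using nested arrays in place of lists of lists; AliasingDemo is a hypothetical name.

```scala
object AliasingDemo {
  // Returns the value seen at (3, 5) after writing only to (0, 5).
  def run(): Double = {
    val grid = Array.fill(4)(new Array[Double](6)) // 4 independent rows
    grid(3) = grid(0) // row 3 now refers to the same array as row 0
    grid(0)(5) = 1.0  // write through row 0 ...
    grid(3)(5)        // ... and the change is visible through row 3
  }
}
```

With one-dimensional storage this cannot happen by accident, because rows are never separate objects that can be shared.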
Just an idea: how about a map with Int-tuples as keys?
Example:
val twoDimMatrix = Map((1,1) -> -1, (1,2) -> 5, (2,1) -> 7.7, (2,2) -> 9)
and then you could
scala> twoDimMatrix.filterKeys{_._2 == 1}.values
res1: Iterable[AnyVal] = MapLike(-1, 7.7)
or
twoDimMatrix.filterKeys{tuple => { val (dim1, dim2) = tuple; dim1 == dim2}} //diagonal
This way the index arithmetic would be done by the map. I don't know how practical and fast this is, though.
As long as the number of dimensions is known at design time, you can use a collection of collections ... (n times) of collections. If you must be able to build a vector for any number of dimensions, then there's nothing convenient in the Scala API to do it (as far as I know).
You can simply store the data in a multidimensional array (e.g. Array[Array[Double]]).
If the tensors are small and can fit in cache, you can have a performance improvement with 1D arrays because of data memory locality. It should also be faster to copy the whole tensor.
As for the slicing arithmetic, it depends on what kind of slicing you require. I suppose you already have a function for extracting an element based on its indices. So write a basic slicing loop that iterates over the indices, manually inline the expression for extracting an element, and then try to simplify the whole loop. This is often simpler than writing a correct expression from scratch.
I was messing around with matrix operations (it was the first programming I ever did nearly 20 years ago) and wanted to recreate what I had done all that time ago but with more modern practices.
Anyway...
One of the constraints with matrix operations is that the size of the matrices involved matters.
For addition, the two matrices must be the same size, i.e. M(i, j) + N(i, j). And multiplication only works if the number of columns of the left matrix is the same as the number of rows of the right matrix, etc...
I was looking for ways that I could apply these constraints at compile time but I'm not sure if that's possible.
I know I could create different subtypes for each size of matrix (Matrix1x1, Matrix1x2, Matrix2x3, ...) but there are "quite a lot" of those so that's a non-starter.
I could also use a precondition on the function which checks the input matrices for the correct sizes before doing anything (a bit like index out of bounds check on an array).
But I was wondering if there was a way of applying the size of the matrix to the type of the matrix at all. I don't think I've heard of this before but wanted to check before throwing out the idea entirely.
Something akin to when I create the matrix it applies the fact that it knows at that point what size the matrix has.
The function definition might then look something like...
func add(m: Matrix<i, j>, n: Matrix<i, j>) -> Matrix<i, j>
and
func multiply(m: Matrix<i, j>, n: Matrix<j, k>) -> Matrix<i, k> // or something
Where i and j are not generic type constraints but size constraints. That isn't valid syntax, but it gives the general idea of what I'm thinking.
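For comparison, here is one way the idea can be approximated in Scala using phantom types. This is a sketch under stated assumptions, not a claim about what the language in the question supports: the marker types Two and Three, the Size type class and the Matrix class are all invented for illustration, and one marker has to be declared per size (though not per pair of sizes).

```scala
object TypedMatrix {
  // Marker types standing in for sizes; only two are declared here.
  sealed trait Dim
  sealed trait Two extends Dim
  sealed trait Three extends Dim

  // Links each marker type to its runtime size.
  final case class Size[D <: Dim](n: Int)
  object Size {
    implicit val two: Size[Two] = Size[Two](2)
    implicit val three: Size[Three] = Size[Three](3)
  }

  // Row-major matrix whose row and column counts are part of its type.
  final class Matrix[R <: Dim, C <: Dim] private (val data: Vector[Double])(
      implicit rows: Size[R], cols: Size[C]) {
    def apply(i: Int, j: Int): Double = data(i * cols.n + j)

    // Only compiles when both operands have the same R and C.
    def +(that: Matrix[R, C]): Matrix[R, C] =
      new Matrix[R, C](data.zip(that.data).map { case (a, b) => a + b })

    // Only compiles when this matrix's columns match that matrix's rows.
    def *[K <: Dim](that: Matrix[C, K])(implicit k: Size[K]): Matrix[R, K] = {
      val out = for (i <- 0 until rows.n; j <- 0 until k.n)
        yield (0 until cols.n).map(x => this(i, x) * that(x, j)).sum
      new Matrix[R, K](out.toVector)
    }
  }

  object Matrix {
    // The element count is still checked at run time, once, on construction.
    def apply[R <: Dim, C <: Dim](values: Double*)(
        implicit r: Size[R], c: Size[C]): Matrix[R, C] = {
      require(values.length == r.n * c.n, "wrong number of elements")
      new Matrix[R, C](values.toVector)
    }
  }
}
```

With a: Matrix[Two, Three] and b: Matrix[Three, Two], the expression a * b type-checks and yields a Matrix[Two, Two], while a + b is rejected by the compiler.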
Suppose I have an RDD containing (Int, Int) tuples.
I wish to turn it into a Vector where the first Int in each tuple is the index and the second is the value.
Any idea how I can do that?
I've updated my question with my solution, to clarify:
My RDD is already reduced by key, and the number of keys is known.
I want a vector in order to update a single accumulator instead of multiple accumulators.
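Therefore, my final solution was:
// foreach is an action, so the body runs; collect with a partial function
// is a lazy transformation and would never update the accumulator
reducedStream.foreachRDD(rdd => rdd.foreach { case (x: Int, y: Int) =>
  val v = Array(0, 0, 0, 0)
  v(x) = y
  accumulator += new Vector(v)
})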
Using Vector from accumulator example in documentation.
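rdd.collectAsMap.foldLeft(Vector.fill(numKeys)(0)){case (acc, (k,v)) => acc.updated(k, v)}
Turn the RDD into a Map. Then iterate over that, building a Vector as we go. Here numKeys stands for the known number of keys; the Vector has to start out at its full size, because updated throws an IndexOutOfBoundsException for an index that does not exist yet.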
You could just use collect(), but if there are many repetitions of tuples with the same key, the result might not fit in memory.
One key thing: do you really need Vector? Map could be much more suitable.
If you really need a local Vector, you first need to call .collect() and then transform the result locally into a Vector. Of course, you need enough memory for this. The real problem is finding a Vector type that can be built efficiently from (index, value) pairs. As far as I know, Spark MLlib has org.apache.spark.mllib.linalg.Vectors, which can create a Vector from arrays of indices and values, and even from tuples. Under the hood it uses breeze.linalg. So that is probably the best place to start.
If you just need ordering, you can use .sortByKey(), since you already have an RDD[(K,V)]. This gives you an ordered stream, which does not strictly follow your intention but may suit you even better. You can then drop elements with the same key using .reduceByKey(), keeping only the resulting elements.
Finally, if you really need a large vector, call .sortByKey() and then produce the actual vector with .flatMap(), maintaining a counter that drops extra elements for the same index and inserts the needed number of 'default' elements for missing indices.
Hope this is clear enough.
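The local step (after the pairs have been brought to the driver) might look like the following sketch. It deliberately uses a plain Seq in place of the collected RDD and makes no Spark calls; DenseFromPairs, toDense, size and default are hypothetical names introduced for illustration.

```scala
object DenseFromPairs {
  // Builds a dense Vector of the given size from (index, value) pairs.
  // Missing indices get `default`; for a repeated index the last pair wins.
  def toDense(pairs: Seq[(Int, Int)], size: Int, default: Int = 0): Vector[Int] = {
    val out = Array.fill(size)(default)
    pairs.foreach { case (i, v) =>
      require(i >= 0 && i < size, s"index $i out of range")
      out(i) = v
    }
    out.toVector
  }
}
```

For example, DenseFromPairs.toDense(Seq((2, 7), (0, 5)), 4) gives Vector(5, 0, 7, 0).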
I have 2 cell arrays as below:
A = {'S' 'M' 'N' 'E'};
B = {'E' 'M' 'Q' 'S'};
In this case, the number of different elements is 3.
In a numeric array, I can use length(find(A ~= B)); to count the number of different elements in one step.
Is there something similar for cell array of characters?
EDIT: I think I've misunderstood your question, and you probably meant finding different elements in corresponding positions in the arrays. I have kept my old answer as well.
Counting different elements at the same position
yuk's approach with strcmp is correct. However, it works only if the two arrays are of the same size. The generalized solution would be:
N = min(numel(A), numel(B));
sum(~strcmp(A(1:N), B(1:N))) + numel(A) + numel(B) - 2 * N
If the arrays are of different length, the "extra" elements in the larger array will be counted as different here.
Counting different elements in any position
The most general approach would be using ismember, which does not care about lengths of strings or their position in the array. To count the total number of elements in A and B that are different, just do:
sum(~ismember(A, B)) + sum(~ismember(B, A))
The same result can also be obtained with setdiff (instead of ismember), as long as neither array contains repeated elements, since setdiff returns unique values:
numel(setdiff(A, B)) + numel(setdiff(B, A))
Both ways are valid for any two arrays, not necessarily of equal size.
Try
cell2mat(A)==cell2mat(B)
to start with, the rest should be straightforward. This simple approach will fail if the cell arrays don't have the same dimensions.
If your cell array is a cell array of strings you can use STRCMP:
sum(~strcmp(A,B))
Of course make sure A and B have the same length.
By the way for numeric array it's more efficient to use sum(A~=B). In general find is slow.
You can also try unique([A B]) if A and B are as given in your example.
If A and B do not have the same dimensions, you can try this:
unique([reshape(cell2mat(A),1,[]), reshape(cell2mat(B),1,[])])
Suppose that f is a function of one parameter whose output is an n-dimensional (m1 × m2… × mn) array, and that B is a vector of length k whose elements are all valid arguments for f.
I am looking for a convenient, and more importantly, "shape-agnostic", MATLAB expression (or recipe) for producing the (n+1)-dimensional (m1 × m2 ×…× mn × k) array obtained by "stacking" the k n-dimensional arrays f(b), where the parameter b ranges over B.
To do this in numpy, I would use an expression like this one:
C = concatenate([f(b)[..., None] for b in B], -1)
In case it's of any use, I'll unpack this numpy expression below (see APPENDIX), but the feature of it that I want to emphasize now is that it is entirely agnostic about the shapes/sizes of f(b) and B. For the types of applications I have in mind, the ability to write such "shape-agnostic" code is of utmost importance. (I stress this point because much MATLAB code I come across for doing this sort of manipulation is decidedly not "shape-agnostic", and I don't know how to make it so.)
APPENDIX
In general, if A is a numpy array, then the expression A[..., None] can be thought as "reshaping" A so that it gets one extra, trivial, dimension. Thus, if f(b) is an n-dimensional (m1 × m2… × mn) array, then, f(b)[..., None] is the corresponding (n+1)-dimensional (m1 × m2 ×…× mn × 1) array. (The reason for adding this trivial dimension will become clear below.)
With this clarification out of the way, the meaning of the first argument to concatenate, namely:
[f(b)[..., None] for b in B]
is not too hard to decipher. It is a standard Python "list comprehension", and it evaluates to the sequence of the k (n+1)-dimensional (m1 × m2 ×…× mn × 1) arrays f(b)[..., None], as the parameter b ranges over the vector B.
The second argument to concatenate is the "axis" along which the concatenation is to be performed, expressed as the index of the corresponding dimension of the arrays to be concatenated. In this context, the index -1 plays the same role as the end keyword does in MATLAB. Therefore, the expression
concatenate([f(b)[..., None] for b in B], -1)
says "concatenate the arrays f(b)[..., None] along their last dimension". It is in order to provide this "last dimension" to concatenate over that it becomes necessary to reshape the f(b) arrays (with, e.g., f(b)[..., None]).
One way of doing that is:
% input:
f=@(x) x*ones(2,2)
b=1:3;
%%%%
X=arrayfun(f,b,'UniformOutput',0);
X=cat(ndims(X{1})+1,X{:});
Maybe there are more elegant solutions?
Shape agnosticity is an important difference between the philosophies underlying NumPy and Matlab; it's a lot harder to accomplish in Matlab than it is in NumPy. And in my view, shape agnosticity is a bad thing, too -- the shape of matrices has mathematical meaning. If some function or class were to completely ignore the shape of the inputs, or change them in a way that is not in accordance with mathematical notations, then that function destroys part of the language's functionality and intent.
In programmer terms, it's an actually useful feature designed to prevent shape-related bugs. Granted, it's often a "programmatic inconvenience", but that's no reason to adjust the language. It's really all in the mindset.
Now, having said that, I doubt an elegant solution for your problem exists in Matlab :) My suggestion would be to stuff all of the requirements into the function, so that you don't have to do any post-processing:
f = @(x) bsxfun(@times, permute(x(:), [2:numel(x) 1]), ones(2,2, numel(x)) )
Now obviously this is not quite right, since f(1) doesn't work and f(1:2) does something other than f(1:4), so obviously some tinkering has to be done. But as the ugliness of this oneliner already suggests, a dedicated function might be a better idea. The one suggested by Oli is pretty decent, provided you lock it up in a function of its own:
function y = f(b)
g = @(x)x*ones(2,2); %# or whatever else you want
y = arrayfun(g,b, 'uni',false);
y = cat(ndims(y{1})+1,y{:});
end
so that f(b) for any b produces the right output.
I'm trying to implement a Count-Min Sketch algorithm in Scala, and so I need to generate k pairwise independent hash functions.
This is lower-level than anything I've ever programmed before, and I don't know much about hash functions except from Algorithms classes, so my question is: how do I generate these k pairwise independent hash functions?
Am I supposed to use a hash function like MD5 or MurmurHash? Do I just generate k hash functions of the form f(x) = ax + b (mod p), where p is a prime and a and b are random integers? (i.e., the universal hashing family everyone learns in algorithms 101)
I'm looking more for simplicity than raw speed (e.g., I'll take something 5x slower if it's simpler to implement).
Scala already has MurmurHash implemented (it's scala.util.MurmurHash). It's very fast and very good at distributing values. A cryptographic hash is overkill--you'll just take tens or hundreds of times longer than you need to. Just pick k different seeds to start with and, since it's nearly cryptographic in quality, you'll get k largely independent hash codes. (In 2.10, you should probably switch to using scala.util.hashing.MurmurHash3; the usage is rather different but you can still do the same thing with mixing.)
If you only need nearby values to be mapped to randomly distant values, this will work. If you also want to avoid repeated collisions (i.e. if A and B collide using hash 1, they should probably not also collide using hash 2), then you'll need to go at least one step further and hash not the whole object but subcomponents of it, so there's an opportunity for the hashes to start out different.
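A minimal sketch of the seeded approach, assuming string keys and scala.util.hashing.MurmurHash3.stringHash with one seed per row. The CountMinSketch class and its parameters are hypothetical, and nothing here proves pairwise independence; it only shows how k seeds give k hash functions.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical minimal Count-Min Sketch: `depth` rows, each with its own seed.
final class CountMinSketch(depth: Int, width: Int, seeds: Array[Int]) {
  require(seeds.length == depth, "one seed per row")
  private val table = Array.ofDim[Long](depth, width)

  // Column for `item` in row `row`, using that row's seed.
  private def column(row: Int, item: String): Int = {
    val h = MurmurHash3.stringHash(item, seeds(row))
    ((h % width) + width) % width // keep the result non-negative
  }

  def add(item: String, count: Long = 1L): Unit = {
    var r = 0
    while (r < depth) {
      table(r)(column(r, item)) += count
      r += 1
    }
  }

  // The estimate is the minimum over rows; with non-negative counts
  // it can overcount but never undercount.
  def estimate(item: String): Long =
    (0 until depth).map(r => table(r)(column(r, item))).min
}
```

The seeds could come from any source, for example a scala.util.Random instance; fixed seeds make the sketch reproducible.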
Probably the simplest approach is to take some cryptographic hash function and "seed" it with different sequences of bytes. For most practical purposes, the results should be independent, as this is one of the key properties a cryptographic hash function should have (if you replace any part of a message, the hash should be completely different).
I'd do something like:
// for each 0 <= i < k generate a sequence of random numbers
val randomSeeds: Array[Array[Byte]] = ... ; // initialize by random sequences
def hash(i: Int, value: Array[Byte]): Array[Byte] = {
val dg = java.security.MessageDigest.getInstance("SHA-1");
// "seed" the digest by a random value based on the index
dg.update(randomSeeds(i));
return dg.digest(value);
// if you need integer hash values, just take 4 bytes
// of the result and convert them to an int
}
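The last step mentioned in the comments above, taking 4 bytes of the digest as an integer, could look like this. It is a sketch assuming the seed is passed in directly and that the first 4 bytes are read in big-endian order; SeededSha1, hashInt and bucket are illustrative names.

```scala
import java.nio.ByteBuffer
import java.security.MessageDigest

object SeededSha1 {
  // Hash `value` with SHA-1 after feeding in `seed`, then keep the first
  // 4 bytes of the 20-byte digest as a big-endian Int.
  def hashInt(seed: Array[Byte], value: Array[Byte]): Int = {
    val dg = MessageDigest.getInstance("SHA-1")
    dg.update(seed)
    ByteBuffer.wrap(dg.digest(value)).getInt
  }

  // Map the hash to a bucket in [0, width).
  def bucket(seed: Array[Byte], value: Array[Byte], width: Int): Int =
    java.lang.Math.floorMod(hashInt(seed, value), width)
}
```

The resulting Int can be negative, which is why bucket uses floorMod rather than the % operator.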
Edit:
I don't know the precise requirements of the Count-Min Sketch; maybe a simple hash function would suffice, but it doesn't seem to be the simplest solution.
I suggested a cryptographic hash function, because there you have quite strong guarantees that the resulting hash functions will be very different, and it's easy to implement, just use the standard libraries.
On the other hand, if you have two hash functions of the form f1(x) = ax + b (mod p) and f2(x) = cx + d (mod p), then you can compute one using another (without knowing x) using a simple linear formula f2(x) = c / a * (f1(x) - b) + d (mod p), which suggests that they aren't very independent. So you could run into unexpected problems here.
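The linear relation above can be checked numerically. The sketch below assumes p = 2^31 - 1 and coefficients in [0, p) with a nonzero, so every product fits in a Long; LinearHash and f2FromF1 are names made up for this illustration, and division by a is done with BigInt.modInverse.

```scala
object LinearHash {
  val p: Long = 2147483647L // the Mersenne prime 2^31 - 1

  // f(x) = a*x + b (mod p); a, b and x are assumed to lie in [0, p).
  def hash(a: Long, b: Long)(x: Long): Long = (a * x + b) % p

  // Recovers f2(x) from f1(x) alone: f2 = (c / a) * (f1 - b) + d (mod p),
  // where dividing by a means multiplying by its modular inverse.
  def f2FromF1(a: Long, b: Long, c: Long, d: Long)(f1: Long): Long = {
    val aInv = BigInt(a).modInverse(BigInt(p)).toLong
    val ratio = (c * aInv) % p
    (ratio * ((f1 - b + p) % p) + d) % p
  }
}
```

For any x, LinearHash.f2FromF1(a, b, c, d)(LinearHash.hash(a, b)(x)) equals LinearHash.hash(c, d)(x), which is the dependence between the two functions described above.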