`accumarray` makes anomalous calls to its function argument - matlab

Short version:
The function passed as the fourth argument to accumarray sometimes gets called with arguments that are not consistent with the specifications encoded in the first argument to accumarray.
As a result, functions used as arguments to accumarray must test for what are, in effect, anomalous conditions.
The question is: how can a one-expression anonymous function test for such anomalous conditions? And more generally: how can one write anonymous functions that are robust to accumarray's undocumented behavior?
Full version:
The code below is a drastically distilled version of a problem that ate up most of my workday today.
First some definitions:
idxs = [1:3 1:3 1:3]';
vals0 = [1 4 6 3 5 7 6 Inf 2]';
vals1 = [1 Inf 6 3 5 7 6 4 2]';
anon = @(x) max(x(~isinf(x)));
Note vals1 is obtained from vals0 by swapping elements 2 and 8. The "anonymous" function anon computes the maximum among the non-infinite elements of its input.
Given these definitions, the two calls below
accumarray(idxs, vals0, [], anon)
accumarray(idxs, vals1, [], anon)
which differ only in their second argument (vals0 vs vals1), should produce identical results, since the difference between vals0 and vals1 affects only the ordering of the values in the argument to one of the calls to anon, and the result of this function is insensitive to the ordering of elements in its argument.
As it turns out, the first of these two expressions evaluates normally and produces the right result¹:
>> accumarray(idxs, vals0, [], anon)
ans =
6
5
7
The second one, however, fails with:
>> accumarray(idxs, vals1, [], anon)
Error using accumarray
The function '@(x)max(x(~isinf(x)))' returned a non-scalar value.
To troubleshoot this problem, all I could come up with² was to write a separate function (in its own file, of course, "the MATLAB way")
function out = kluge(x)
global ncalls;
ncalls = ncalls + 1;
y = ~isinf(x);
if any(y)
out = max(x(y));
else
{ncalls x}
out = NaN;
end
end
...and ran the following:
>> global ncalls;
>> ncalls = int8(0); accumarray(idxs, vals0, [], @kluge)
ans =
6
5
7
>> ncalls = int8(0); accumarray(idxs, vals1, [], @kluge)
ans =
[2] [Inf]
ans =
6
5
7
As one can see from the output of the last call to accumarray above, the argument to the second call to the kluge callback was the array [Inf]. This tells me beyond any doubt that accumarray is not behaving as documented³ (since idxs specifies no arrays of length 1 to be passed to accumarray's function argument).
In fact, from this and other tests I determined that, contrary to what I expected, the function passed to accumarray is called more than max(idxs) (= 3) times; in the expressions involving kluge above it's called 5 times.
The problem here is that if one cannot rely on how accumarray's function argument will actually be called, then the only way to make this function argument robust is to include in it a lot of extra code to perform the necessary checks. This almost certainly will require that the function have multiple statements, which rules out anonymous functions. (E.g. the function kluge above is more robust than anon, but I don't know how to fit it into an anonymous function.) Not being able to use anonymous functions with accumarray greatly reduces its utility.
So my question is:
how to specify anonymous functions that can be robust arguments to accumarray?
1 I have removed blank lines from MATLAB's typical over-padding in all the MATLAB output shown in this post.
2 I welcome comments with any other troubleshooting suggestions you may have; troubleshooting this problem was a lot harder than it should be.
3 In particular, see items number 1 through 5 right after the line "The function processes the input as follows:".

Short answer
The fourth input argument of accumarray, anon in this case, must return a scalar for any input.
Long answer (and discussion about index sorting)
Consider the output when the indexes are sorted:
>> [idxsSorted,sortInds] = sort(idxs)
>> accumarray(idxsSorted, vals0(sortInds), [], anon)
ans =
6
5
7
>> accumarray(idxsSorted, vals1(sortInds), [], anon)
ans =
6
5
7
Now, all the documentation has to say about this is the following:
If the subscripts in subs are not sorted, fun should not depend on the order of the values in its input data.
How does this relate to the trouble with anon? It is a clue: sorting forces anon to be called with the complete set of values for a given idx rather than with a subset/subarray, as Luis Mendo suggested.
Consider how accumarray would work for a non-sorted list of indexes and values:
>> [idxs vals0 vals1]
ans =
1 1 1
2 4 Inf
3 6 6
1 3 3
2 5 5
3 7 7
1 6 6
2 Inf 4
3 2 2
For both vals0 and vals1, the Inf belongs to the set where idxs equals 2. Since idxs is not sorted, accumarray does not at first process all values for idxs=2 in one shot. The actual algorithm (implementation) is opaque, but it seems to start by assuming that idxs is sorted, processing each single-valued block of the first argument. This is verifiable by putting a breakpoint in fun, the function referenced by the fourth input argument. When it encounters a 1 in idxs for the second time, it seems to start over, but with subsequent calls to fun containing all the values for a given index. Presumably accumarray calls some implementation of unique to fully segment idxs (incidentally, order is not preserved). As kjo suggests, this is the point where accumarray actually processes the inputs as described in the documentation, following steps 1-5 here ("Find out how many unique indices there are..."). As a result, it crashes for vals1, when anon(Inf) is called, but not for vals0, which instead calls anon(4) on the first try.
However, even if it followed those steps exactly on the first go, it would not necessarily be robust if a complete subarray of values contained just Infs (consider that anon([Inf Inf Inf]) returns an empty matrix too). It is a requirement, although an understated one, that fun must return a scalar. What is not clear from the documentation is that it must return a scalar, for any inputs, not just what is expected based on the high-level description of the algorithm.
Workaround:
anon = @(x) max([x(~isinf(x));-Inf]);
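A quick check of this workaround, as a sketch only: it reuses idxs and vals1 from the question and assumes the callback receives column vectors, which is what the vertical concatenation with -Inf relies on. Note one behavioral difference from kluge: a group containing nothing but Inf yields -Inf here rather than NaN.

```matlab
idxs  = [1:3 1:3 1:3]';
vals1 = [1 Inf 6 3 5 7 6 4 2]';

% Appending -Inf guarantees that max always sees at least one element,
% so the result is a scalar even when every input value is infinite.
anon = @(x) max([x(~isinf(x)); -Inf]);

oneInf = anon(Inf);               % the call that used to return an empty matrix
allInf = anon([Inf; Inf; Inf]);   % a whole group of Infs
result = accumarray(idxs, vals1, [], anon);
```

With this version the intermediate call on a lone Inf no longer produces an empty matrix, so accumarray can run to completion on vals1.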

The documentation does not say that anon is called only with the whole set¹ of vals corresponding to each value of idx as its input. As seen in your example, it does get called with subsets thereof.
So the way to make anon robust seems to be: make sure it gives a scalar output when its input is any subset of vals (or maybe just any subset of each set with same-idx value). In your case, anon(inf) does not return a scalar.
1 It's actually an array, of course, but I think it's easier to describe this in terms of sets (and subsets).


Output of unique function in Matlab

I am using the unique function in Matlab and I am confused about the output of such a function.
Consider the following simple code
rng default
T=randn(232,50); %232*50
equalorder=randsample(232,80802,true); %80802*1
T_extended=T(equalorder,:); %80802*50
By construction, I expect T_extended to contain 232 unique rows. In fact,
S=size(unique(T_extended,'rows'),1); %232
Now, consider the specific T and equalorder arrays that are produced by some code of mine (T and equalorder are uploaded here:
https://filebin.net/603zn7mt2efzq91c
Unfortunately my code is too long to be reproduced, and I think that the issue may be numerical). Let's apply the code above to these arrays:
clear
load matrices %T, equalorder
T_extended=T(equalorder,:);
However, if I do
S=size(unique(T_extended,'rows'),1);
I get S=4694 and not S=232. Why?
The code or data necessary to reproduce the problem should be included in the question itself, as external links may stop working in the future. In this case, however, it was easy to identify the pattern that causes the problem (see below), so the question together with this answer should be self-contained.
In your linked example, T contains NaN at entry (216,37):
>> T(216,37)
ans =
NaN
(and this is the only such entry):
>> nnz(isnan(T))
ans =
1
By design, NaN values are not equal to each other. So when computing unique(T_extended, 'rows'), all rows of T_extended that correspond to the original 216th row of T are counted as being different. This is what causes the count of unique rows to increase. If you don't consider the 37th column (which is the only one that contains NaN) you get the expected result:
>> S=size(unique(T_extended(:,[1:36 38:end]),'rows'),1)
S =
232
Let's count how many times a NaN entry appears in T_extended:
>> nnz(isnan(T_extended))
ans =
4465
(Of course, this happens because):
>> sum(equalorder==216)
ans =
4465
This means that the count of unique rows is increased by 4465 - 1 when each repetition of the row containing NaN is counted as a different row. And 4465 - 1 + 232 is 4696, which is the result you get.
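The same effect can be reproduced on a tiny matrix. The sketch below uses made-up data (T, order) and, as one possible workaround, replaces NaN with a sentinel value before calling unique; the sentinel realmax is an assumption and only works if that value never occurs in the real data.

```matlab
T = [1 2; 3 NaN; 5 6];      % row 2 holds the only NaN
order = [1 2 2 2 3 1]';     % row 2 is drawn three times
T_ext = T(order, :);

% Each copy of the NaN row counts as distinct: 3 - 1 + 3 = 5 rows.
nWithNaN = size(unique(T_ext, 'rows'), 1);

% Replace NaN with a sentinel (assumed absent from the data) so that
% the repeated rows compare equal again.
T_filled = T_ext;
T_filled(isnan(T_filled)) = realmax;
nFilled = size(unique(T_filled, 'rows'), 1);
```

Here nWithNaN follows the same "repetitions minus one plus original rows" arithmetic as above, while nFilled recovers the number of rows in T.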

Indexing dictionary in depth, two cases

When indexing dictionary in depth I've found different results in the same (as I think) constructions:
q)d:`a`b!(1 2 3;4 5 6)
q)d[`a`b;0]
1 4
q)d[`a`b]0
1 2 3
Why is this happening? How does q understand and distinguish these two cases? Before this I was confident that, for example, calling a dyadic function as f[a;b] and as f[a]b was the same. And now I am not sure even about this.
To index at depth you either need semicolons separating your arguments, or use the dot. Your second example,
d[`a`b] 0
is taking the two lists from the dictionary values and then indexing to return the first.
While
d[`a`b;0]
or
d .(`a`b;0)
is taking the two lists and then indexing at depth, taking the first element of each, due to the semicolon/dot.
When you call a dyadic function it is expecting two parameters. Passing one inside the square brackets creates a projection, which is basically using an implicit semicolon, so
f[a]b
is the same as
f[a;]b
which is the same as
f[a;b]
The result of
f[a]
is a projection which is expecting another argument, so
f[a] b
evaluates f[a], then passes argument b to this function, with usual function application via juxtaposition
Your dictionary indexing example does not create a projection, and hence the indexing is not expecting any more arguments, so the first indexing
d[`a`b]
is evaluated immediately to give a result, and then the second index is applied to this result.
It would work the same for a monadic function
q){5+til x}[5] 2
7
Like the top level dictionary index, the application is carried out and then the result is indexed, as only one argument was expected, with no projection involved
EDIT - Adam beat me to it!
I don't think you can consider a function invocation f[a;b] or f[a]b as equivalent to indexing. f[a]b for a function is a projection but you can't project indexing in the same way. A function has a fixed valence, aka fixed number of inputs, but indexing can be done at any depth.
If you take your dictionary and fabricate it to have more depth, you can see that you can keep indexing deeper and deeper:
q)d:{`a`b!2#enlist value x}/[1;d]
q)d[`a`b;1;1]
5 5
q)d:{`a`b!2#enlist value x}/[2;d]
q)d[`a`b;1;1;1;1]
5 5
q)d:{`a`b!2#enlist value x}/[2;d]
q)d[`a`b;1;1;1;1;1;1]
5 5
Yet you can still index just at the top level d[`a`b]. So the interpreter has to decide if it's indexing at the top level, aka d @ `a`b, or indexing at depth, d . (`a`b;0).
To avoid confusion it indexes at top level if you supply one level of indexing and indexes at depth if you supply more than one level of indexing. Thus no projections (at least not in the same manner).
And as mentioned above, functions don't have this ambiguity because they have a fixed number of parameters and so they can be projected.
What's happening here is that d[`a`b] has the same depth/valence as d. So when you apply d[`a`b]0 the zero is not indexing at depth. You get expected results if you don't index multiple values of your dictionary:
q)d[`a`b;0]~d[`a`b][0]
0b
q)d[`a;0]~d[`a][0]
1b
q)d[`b;0]~d[`b][0]
1b
This is more clear if you instead consider a 2x3 matrix which has identical behavior to your original example
q)M:(1 2 3;4 5 6)
q)M[0 1;0]
1 4
q)M[0 1][0]
1 2 3
Indexing any one row results in a simple vector
q)type M[0]
7h
q)type M[1]
7h
But indexing more than one row results in a matrix:
q)type M[0 1]
0h
In fact, indexing both rows results in the same exact matrix
q)M~M[0 1]
1b
So we should expect
q)M[0]~M[0 1][0]
1b
as we see above.
None of this should have an impact on calling dyadic functions, since supplying one parameter explicitly results in a function projection and therefore the valence is always reduced.
q)type {2+x*y}
100h
q)type {2+x*y}[10]
104h

Function which no longer exists

I am interested in using the function here:
http://uk.mathworks.com/help/nnet/ref/removerows.html
However, when I try to use it in Matlab it says: "Undefined function or variable 'removerows'"
I typed: exist removerows and it returned a value of 0, suggesting that it's been removed. Has this function just been renamed? Or is it part of a toolbox I may not have? The documentation does not say.
Much appreciated
According to the link that you posted, this function is part of the Neural Network Toolbox. So my guess is that you don't have this toolbox installed.
You can remove rows in a matrix by assigning an empty array to them.
This way you don't have to use functions belonging to toolboxes that require extra licences.
Example
A = [1 2; 3 4; 5 6]
A =
1 2
3 4
5 6
A(2,:) = [] %remove row 2
A =
1 2
5 6
Similarly you can provide an index array with the rows to be deleted in case you want to remove several ones.
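For instance, deleting several rows at once might look like the sketch below; the matrices and the threshold are made up for illustration, and the second form uses a logical mask instead of explicit row numbers.

```matlab
A = [1 2; 3 4; 5 6; 7 8];
rowsToDelete = [1 3];
A(rowsToDelete, :) = [];      % A is now [3 4; 7 8]

B = [1 2; 3 4; 5 6; 7 8];
B(B(:, 1) > 4, :) = [];       % drop rows whose first entry exceeds 4: [1 2; 3 4]
```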

Binary operation with singleton expansion -- scalar output

There is a function called bsxfun: http://www.mathworks.com/help/techdoc/ref/bsxfun.html however it works in element-by-element mode. I want a similar function which works in vector-by-vector mode (and with scalar output).
As an illustration, I will try to use bsxfun in this way. As the inner function I will use (this is just an example) the dot product of two vectors.
function f = foo(a,b), f=a'*b; printf("called\n");, end
The above dummy function foo expects 2 vectors and returns a scalar. Each time it is called we will see a message.
bsxfun(@foo,[2;3],[1 5;4 3])
The result is:
called
called
ans =
14 19
0 0
So, two calls (nice); however, instead of a vector (a pair of 2 scalars) we got a matrix. One could say it suffices to take just the first row in such a case, because the matrix is created in advance by bsxfun and the rest will always be zeros.
But that is not always the case: sometimes I get some real values, not only zeros, and I am afraid some side effects are involved (the above dot product is the simplest example that came to mind).
Question
So, is there a function similar to bsxfun, but which gets vectors and expects a scalar, per each operation of those vectors?
I don't think there is a built in function, but using arrayfun or cellfun you might be able to do something. Generally arrayfun is also element-wise, but if you first split your larger array into a cell then you can do it:
foo = @(a,b) b*a
y = [2;3];
X = [1 5; 4 3];
% split X into cell array of rows
% apply foo to each row
cellfun(@(x) foo(y,x), num2cell(X,2))
ans =
17
17
I am not sure it would give any speed advantage (I would imagine an explicit loop would be quicker) but sometimes it can be easier to read.
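To match the question's column-by-column dot product directly, one option is to iterate over column indices. This is only a sketch: foo is redefined here as an anonymous function without the printing, and both variants simply call it once per column of X.

```matlab
foo = @(a, b) a' * b;     % dot product of two column vectors, as in the question
y = [2; 3];
X = [1 5; 4 3];

% One call per column of X, each returning a scalar.
viaArrayfun = arrayfun(@(j) foo(y, X(:, j)), 1:size(X, 2));

% Equivalent explicit loop.
viaLoop = zeros(1, size(X, 2));
for j = 1:size(X, 2)
    viaLoop(j) = foo(y, X(:, j));
end
```

Both give the row vector 14 19, i.e. the first row of the bsxfun output shown in the question, without the padded zeros.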

Why does crossvalind fail?

I am using the crossvalind function on a very small data set... However, I observe that it gives me incorrect results. Is this supposed to happen?
I have Matlab R2012a and here is my output
crossvalind('KFold',1:1:11,5)
ans =
2
5
1
3
2
1
5
3
5
1
5
Notice the absence of set 4. Is this a bug? I expected at least 2 elements per set, but it gives me 0 in one. This happens a lot; that is, the values are not uniformly distributed across the sets.
The help for crossvalind says that the form you are using is: crossvalind(METHOD, GROUP, ...). In this case, GROUP is, e.g., the class labels of your data. So 1:11 as the second argument is confusing here, because it suggests no two examples have the same label. I think this is sufficiently unusual that you shouldn't be surprised if the function does something strange.
I tried doing:
numel(unique(crossvalind('KFold', rand(11, 1) > 0.5, 5)))
and it reliably gave 5 as a result, which is what I would expect; my example would correspond to a two-class problem (I would guess that, as a general rule, you'd want something like numel(unique(group)) <= numel(group) / folds) - my hypothesis would be that it tries to have one example of each class in the Kth fold, and at least 2 examples in every other, with a difference between fold sizes of no more than 1 - but I haven't looked in the code to verify this.
It is possible that you mean to do:
crossvalind('KFold', 11, 5);
which would compute 5 folds for 11 data points - this doesn't attempt to do anything clever with labels, so you would be sure that there will be K folds.
However, in your problem, if you really have very few data points, then it is probably better to do leave-one-out cross validation, which you could do with:
crossvalind('LeaveMOut', 11, 1);
although a better method would be:
for leave_out=1:11
fold_number = (1:11) ~= leave_out;
% Your code here: entries where fold_number is 0 mark the left-out example;
% entries where fold_number is 1 belong to the main fold.
end
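If the toolbox function is not needed at all, fold labels can also be assigned by hand. The sketch below is not crossvalind's algorithm; it is a simple round-robin assignment over a random permutation that ignores class labels, with n and k chosen to match the question.

```matlab
n = 11;   % number of data points
k = 5;    % number of folds

% Deal the shuffled positions into folds round-robin, so that every fold
% is used and fold sizes differ by at most one.
folds = mod(randperm(n) - 1, k) + 1;
folds = folds(:);

foldSizes = accumarray(folds, 1);   % fold 1 gets 3 points, the others 2
```

Because the permutation only changes which data point lands in which fold, every one of the 5 folds is guaranteed to be non-empty.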