Adapting the mode function to favor central values (Matlab)

The mode function in MATLAB returns the value that occurs most frequently in a dataset. But "when there are multiple values occurring equally frequently, mode returns the smallest of those values."
This is not very useful for my purposes; I would rather have it return a median or arithmetic mean in the absence of a modal value (as they are at least somewhat in the middle of the distribution). Otherwise the results of using mode are far too much on the low side of the scale (I have a lot of unique values in my distribution).
Is there an elegant way to make mode favor more central values in a dataset (in the absence of a true modal value)?
By the way: I know I could use [M,F] = mode(X, ...) to manually check for the most frequent value (and calculate a median or mean when necessary). But that seems like a bit of an awkward solution, since I would be almost entirely rewriting everything that mode is supposed to be doing. I'm hoping that there's a more elegant solution.

Looks like you want the third output argument from mode. E.g.:
x = [1 1 1 2 2 2 3 3 3 4 4 4 5 6 7 8];
[m,f,c] = mode(x);
valueYouWant = median(c{1});
Or, since median takes the average of the two middle values when there are an even number of entries (which may give a value that never actually occurs in the data), in cases where an even number of values share the maximum number of occurrences you could instead do something like this:
valueYouWant = c{1}(ceil(length(c{1})/2))
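If you want to package this behavior up, a small wrapper along the following lines could serve as a drop-in replacement for mode (just a sketch; centralMode is a made-up name, and the fallbacks follow the preferences described in the question):
function m = centralMode(x)
% Most frequent value of x; when several values are tied for the highest
% frequency, pick the central element of the tied set, and when every
% value occurs only once fall back to the median.
[~, f, c] = mode(x);
tied = c{1};                        % sorted values sharing the max frequency
if f == 1
    m = median(x);                  % no repeated value at all
else
    m = tied(ceil(numel(tied)/2));  % central member of the tied set
end
end
For example, centralMode([1 1 1 2 2 2 3 3 3 4 4 4 5 6 7 8]) returns 2, matching the snippet above.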

Related

Why do I get different results in different versions of MATLAB (2016 vs 2021)?

Why do I get different results when running the same code in different versions of MATLAB (2016 vs 2021) for sum(b.*x1), where b is single and x1 is double? How can I avoid such differences between MATLAB versions?
MATLAB v.2021:
sum(b.*x1)
ans =
single
-0.0013286
MATLAB 2016
sum(b.*x1)
ans =
single
-0.0013283
In R2017b, they changed the behavior of sum for single-precision floats, and in R2020b they made the same changes for other data types too.
The change speeds up the computation and improves accuracy by reducing the rounding errors. Simply put, previously the algorithm would just run through the array in sequence, adding up the values. The new behavior computes the sum over smaller portions of the array, and then adds up those results. This is more precise because the running total can become a very large number, and adding smaller numbers to it causes more rounding in those smaller numbers. The speed improvement comes from loop unrolling: the loop now steps over, say, 8 values at a time, and in the loop body, 8 running totals are computed (they don't specify the number they use; the 8 here is just an example).
Thus, your newer result is a better approximation to the sum of your array than the old one.
For more details (a better explanation of the new algorithm and the reason for the change), see this blog post.
Regarding how to avoid the difference: you could implement your own sum function, and use that instead of the builtin one. I would suggest writing it as a MEX-file for efficiency. However, do make sure you match the newer behavior of the builtin sum, as that is the better approximation.
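For illustration only, here is a rough pure-MATLAB sketch of the blocked-summation idea (not a MEX-file, not MathWorks' actual algorithm; blockSum and the block count of 8 are just taken from the example above):
function s = blockSum(x)
% Accumulate several independent partial sums by striding through x, then
% combine them. Each running total stays smaller than a single grand total
% would, so adding small elements to it loses less precision.
x = x(:);
nBlocks = 8;                         % arbitrary choice for this sketch
partial = zeros(nBlocks, 1, 'like', x);
for i = 1:numel(x)
    k = mod(i - 1, nBlocks) + 1;     % which partial sum gets this element
    partial(k) = partial(k) + x(i);
end
s = sum(partial);                    % combine the partial sums
end
This is much slower than the builtin and will not reproduce its results bit for bit unless you match its exact blocking, but it shows why the partial-sum approach rounds less.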
Here is an example of the problem. Let's create an array with N+1 elements, where the first one has a value of N and the rest have a value of 1.
N = 1e8;
a = ones(N+1,1,'single');
a(1) = N;
The sum over this array is expected to be 2*N. If we set N large enough w.r.t. the data type, I see this in R2017a (before the change):
>> sum(a)
ans =
single
150331648
And I see this in R2018b (after the change for single-precision sum):
>> sum(a)
ans =
single
199998976
Both implementations make rounding errors here, but one is obviously much, much closer to the expected result (2e8, or 200000000).
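To put numbers on "much, much closer": the expected sum is 2e8, so the relative errors of the two results above are roughly
(2e8 - 150331648) / 2e8    % about 0.248, i.e. ~25% off    (R2017a)
(2e8 - 199998976) / 2e8    % about 5.1e-6, i.e. ~0.0005% off (R2018b)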

Predictive coding without using loops in Matlab?

I have a set of numbers and I want to use predictive coding to get smaller values for this set of data, since each value should not differ too much from the last. I was starting very simply, with the expectation that each value would be the same as the last, and then just storing the error.
For some simple data:
1 2 -3 1
The values I should get are
1 1 -5 4
The way I was doing this compression was one line, but to decompress I need the last value, so I put this in a loop. Is there a way to do this, and possibly more complex (looking at more than just the last value) predictive coding, without needing to use MATLAB loops?
As suggested by #Divakar, constructing the new values can be done by
B = [A(1) diff(A)];
Obtaining the original from this result, the inverse of the above procedure, is done through
A = cumsum(B);
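Not part of the original answer, but if you later want a higher-order predictor that still avoids explicit loops, filter can express both directions; the second-order predictor below (linear extrapolation from the previous two samples) is just one illustrative choice:
A = [1 2 -3 1];

% First-order prediction, same as above, written with filter:
B     = filter([1 -1], 1, A);    % equals [A(1) diff(A)]  ->  1  1 -5  4
A_rec = filter(1, [1 -1], B);    % equals cumsum(B), reconstructing A

% Second-order predictor: predict x(n) from 2*x(n-1) - x(n-2):
E  = filter([1 -2 1], 1, A);     % residuals (assumes zeros before the signal)
A2 = filter(1, [1 -2 1], E);     % exact reconstruction of A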

Generate matrix of random number with constraints in matlab

I want to generate a matrix of random numbers (normrnd with mean == 0) that satisfy the following constraints using MATLAB (or any other language)
The sum of the absolute values in the matrix must equal X
The largest abs(single number) must equal Y
The difference between the number and its 8 neighbors (3 if in corner, 5 if on edge) must be less than Z
It would be relatively easy to satisfy one of the constraints, but I can't think of an algorithm that satisfies all of them...
Any ideas?
I am not sure whether to edit my post or to reply here, so I am editing... #MZimmerman6, you have a point. Though these constraints won't produce a unique solution, how would I obtain multiple solutions without using rand?
A very simple 3 x 3 example, where 5 is the max element value, 30 is the sum, and 2 is the maximum difference:
5 4 3
4 4 2
3 2 3
Rody, that actually may help...I need to think more :)
Luis ...Hmmm...why not? I can add up the absolute value of a normally distributed sample...right?
Here is an algorithm to get the 'random' numbers that you need.
1. Generate a valid number (for example in the middle)
2. Determine the feasible range for one of the numbers next to it
3. If there is no feasible range, go back to step 1; otherwise generate a number and continue
Depending on your constraints it may take a while, of course. You could add another step that checks whether changing the existing numbers would help before going back to step 1.
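A rough sketch of that cell-by-cell idea, for the neighbour constraint only (all names and the n = 5, Z = 2 values are made up for illustration; hitting the sum-of-abs and max-abs targets exactly would still need a rescaling or retry step on top of this):
n = 5;                                    % matrix size (example)
Z = 2;                                    % neighbour-difference bound
maxTries = 1000;                          % cap on rejection sampling
M = nan(n);
M(1,1) = normrnd(0, 1);                   % step 1: seed with a valid number
for idx = 2:numel(M)
    [r, c] = ind2sub([n n], idx);
    nbr = M(max(r-1,1):min(r+1,n), max(c-1,1):min(c+1,n));
    nbr = nbr(~isnan(nbr));               % already-filled neighbours
    lo  = max(nbr) - Z;                   % step 2: feasible interval (lo, hi)
    hi  = min(nbr) + Z;
    v = normrnd(0, 1);
    tries = 1;
    while (v <= lo || v >= hi) && tries < maxTries
        v = normrnd(0, 1);                % step 3: redraw until feasible
        tries = tries + 1;
    end
    if v <= lo || v >= hi
        error('No feasible draw found; restart (step 1) or adjust earlier values.');
    end
    M(r, c) = v;
end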

Why does crossvalind fail?

I am using the crossvalind function on a very small dataset... However, I observe that it gives me incorrect results. Is this supposed to happen?
I have Matlab R2012a and here is my output
crossvalind('KFold',1:1:11,5)
ans =
2
5
1
3
2
1
5
3
5
1
5
Notice the absence of set 4. Is this a bug? I expected at least 2 elements per set, but it gives me 0 in one... and it happens a lot, i.e. the values are not uniformly distributed across the sets.
The help for crossvalind says that the form you are using is: crossvalind(METHOD, GROUP, ...). In this case, GROUP is, e.g., the class labels of your data. So 1:11 as the second argument is confusing here, because it suggests no two examples have the same label. I think this is sufficiently unusual that you shouldn't be surprised if the function does something strange.
I tried doing:
numel(unique(crossvalind('KFold', rand(11, 1) > 0.5, 5)))
and it reliably gave 5 as a result, which is what I would expect; my example corresponds to a two-class problem. (As a general rule, I would guess you'd want something like numel(unique(group)) <= numel(group) / folds.) My hypothesis is that it tries to have one example of each class in the Kth fold, and at least 2 examples in every other, with a difference between fold sizes of no more than 1, but I haven't looked at the code to verify this.
It is possible that you mean to do:
crossvalind('KFold', 11, 5);
which would compute 5 folds for 11 data points - this doesn't attempt to do anything clever with labels, so you would be sure that there will be K folds.
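A quick check, in the same spirit as the earlier snippet, suggests all 5 fold labels do show up with this form:
numel(unique(crossvalind('KFold', 11, 5)))   % gives 5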
However, in your problem, if you really have very few data points, then it is probably better to do leave-one-out cross validation, which you could do with:
crossvalind('LeaveMOut', 11, 1);
although a better method would be:
for leave_out = 1:11
    fold_number = (1:11) ~= leave_out;
    % <code here; where fold_number is 0 (false), that is the left-out example;
    %  fold_number = 1 (true) means that the example is in the main fold.>
end
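For completeness, a minimal sketch of that manual loop with made-up data and a trivial stand-in model (everything here is hypothetical; it only shows where fold_number gets used):
data   = rand(11, 3);                     % 11 observations, 3 features (made up)
labels = rand(11, 1) > 0.5;               % made-up binary labels
pred   = false(11, 1);
for leave_out = 1:11
    fold_number = (1:11)' ~= leave_out;   % true = example is in the main fold
    % fit whatever model you like on data(fold_number,:), labels(fold_number);
    % here a trivial majority-class predictor stands in for it
    pred(leave_out) = mean(labels(fold_number)) >= 0.5;
end
loo_accuracy = mean(pred == labels);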

Problem using the find function in MATLAB

I have two arrays of data that I'm trying to amalgamate. One contains actual latencies from an experiment in the first column (e.g. 0.345, 0.455... never more than 3 decimal places), along with other data from that experiment. The other contains what is effectively a 'look up' list of latencies ranging from 0.001 to 0.500 in 0.001 increments, along with other pieces of data. Both data sets are X-by-Y doubles.
What I'm trying to do is something like...
for i = 1:length(actual_latency)
row = find(predicted_data(:,1) == actual_latency(i))
full_set(i,1:4) = [actual_latency(i) other_info(i) predicted_info(row,2) ...
predicted_info(row,3)];
end
...in order to find the relevant row in predicted_data where the look-up latency corresponds to the actual latency. I then use this to create an amalgamated data set, full_set.
I figured this would be really simple, but the find function keeps failing by throwing up an empty matrix when looking for an actual latency that I know is in predicted_data(:,1) (as I've double-checked during debugging).
Moreover, if I replace find with a for loop to do the same job, I get a similar error. It doesn't appear to be systematic - using different participant data sets throws it up in different places.
Furthermore, during debugging mode, if I use find to try and find a hard-coded value of actual_latency, it doesn't always work. Sometimes yes, sometimes no.
I'm really scratching my head over this, so if anyone has any ideas about what might be going on, I'd be really grateful.
You are likely running into a problem with floating point comparisons when you do the following:
predicted_data(:,1) == actual_latency(i)
Even though your numbers appear to only have three decimal places of precision, they may still differ by very small amounts that are not being displayed, thus giving you an empty matrix since FIND can't get an exact match.
One feature of floating-point numbers is that certain values can't be exactly represented, because they cannot be written as a finite sum of powers of 2 (a finite binary fraction). This is the case for the numbers 0.1 and 0.001. If you repeatedly add or multiply one of these numbers you can see some unexpected behavior. Amro pointed out one example in his comment: 0.3 is not exactly equal to 3*0.1. This can also be illustrated by creating your look-up list of latencies in two different ways. You can use the normal colon syntax:
vec1 = 0.001:0.001:0.5;
Or you can use LINSPACE:
vec2 = linspace(0.001,0.5,500);
You'd think these two vectors would be equal to one another, but think again!:
>> isequal(vec1,vec2)
ans =
0 %# FALSE!
This is because the two methods create the vectors by performing successive additions or multiplications of 0.001 in different ways, giving ever so slightly different values for some entries in the vector. You can take a look at this technical solution for more details.
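The differences involved are tiny, which is exactly why exact == comparisons are so fragile:
max(abs(vec1 - vec2))   % a few ulps at most, but not exactly zero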
When comparing floating point numbers, you should therefore do your comparisons using some tolerance. For example, this finds the indices of entries in the look-up list that are within 0.0001 of your actual latency:
tolerance = 0.0001;
for i = 1:length(actual_latency)
row = find(abs(predicted_data(:,1) - actual_latency(i)) < tolerance);
...
The topic of floating point comparison is also covered in this related question.
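Putting it together with the loop from the question (a sketch; other_info and the column layout are whatever they are in your original code, and since the question mixes predicted_data and predicted_info, one consistent name is used here):
tolerance = 0.0001;
full_set = zeros(length(actual_latency), 4);
for i = 1:length(actual_latency)
    % take the first row within tolerance, in case more than one matches
    row = find(abs(predicted_data(:,1) - actual_latency(i)) < tolerance, 1);
    full_set(i,1:4) = [actual_latency(i) other_info(i) ...
                       predicted_data(row,2) predicted_data(row,3)];
end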
You may try to do the following:
row = find(abs(predicted_data(:,1) - actual_latency(i)) < eps);
eps is the floating-point relative accuracy (the spacing from 1.0 to the next larger double). Depending on how much rounding error has accumulated in your values, it may be a stricter tolerance than you need.
Have you tried using a tolerance rather than == ?