How to count matches in several matrices? - matlab

I am carrying out a dichotomous (hit/no-hit) study and need to count how many times a condition is met.
The study is based on two kinds of matrices, some containing forecasts and others containing analyzed data.
Whenever the condition is satisfied in both the forecast and the analysis matrix, 1 is added to a counter. This process is repeated for every point of a grid.
Are there any functions in MATLAB that help me with counting or any script that supports this procedure?
Thanks guys!
EDIT:
The case concerns registered and forecasted precipitation. When both exceed a threshold, I consider it a hit. Europe is divided into several grid points, and I have to count how many times the forecast is correct at each point. I also have 50 forecasts for each year, so the result (hit/no hit) must be accumulated over them.
I've tried the count and sum functions, but they reduce the spatial dimensions of the matrices.

It's difficult to tell exactly what you are trying to do but the following may help.
forecasted = [ 40 10 50 0 15];
registered = [ 0 15 30 0 10];
mismatch = abs( forecasted - registered );
maxDelta = 10;
forecastCorrect = mismatch <= maxDelta
totalCorrectForecasts = sum(forecastCorrect)
Results:
forecastCorrect =
0 1 0 1 1
totalCorrectForecasts =
3
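Given the EDIT about thresholds, here is a minimal sketch of cumulative hit counting on the grid; the threshold value, the variable names forecast and analysis, and stacking the 50 cases along a third dimension are all assumptions:
threshold = 1; % assumed precipitation threshold
nCases = 50; % number of forecasts per year, from the question
hits = zeros(size(forecast,1),size(forecast,2)); % one counter per grid point
for k = 1:nCases
hitNow = forecast(:,:,k) > threshold & analysis(:,:,k) > threshold; % logical grid: both exceed the threshold
hits = hits + hitNow; % accumulate hits without reducing the spatial dimensions
end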

Related

How to identify recurring patterns in time-series data in Matlab

I am calculating ENSO indices using Matlab and one condition is that I have to find anomalous sea surface temperatures. The condition is that an El NiƱo event is characterised by sea surface temperatures that are 0.5 degrees above the normalised "0-value" for 5 months. I have gotten as far as to make my monthly time series data logical (i.e. "1" is a monthly data value above 0.5 and "0" is a monthly data value below 0.5), but I wanted to know if there was a command in Matlab that allows me to identify when this value repeats 5 times or more.
As an example code:
Monthly_data=[0 0 1 1 1 1 1 0 0 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0]
I would ideally need a command that finds when a minimum of five "1"s occur after each other. Does this exist?
If more info is needed please let me know, I am new to matlab so I am not yet sure of the structure and syntax that is valued for asking questions on here.
Thank you!
Not sure this is what you need, but perhaps it gives you some direction.
> x = diff(Monthly_data);
> find(x==-1)-find(x==1)
ans =
5 2 1 7
These are the lengths of the sequences of 1s. You may need to pad the front and end of the array with a 0 so that sequences touching either end still have both a rising and a falling edge.
To find the start index of sequences longer than 5:
> s=find(x==1);
> s(find(x==-1)-s>5)
ans = 18
or
> s(find(x==-1)-s>=5)
ans =
2 18
Note that because of the lag introduced by diff, the actual start index is one more than these values; equivalently, treat them as zero-based positions.
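As a sketch of the padding idea mentioned above (so that runs touching either end of the series are also closed, and the indices come out 1-based), using the threshold of 5 from the question:
x = diff([0 Monthly_data 0]); % pad with zeros so every run has both a rising and a falling edge
runStart = find(x==1); % 1-based start index of each run of 1s
runLen = find(x==-1) - runStart; % length of each run
runStart(runLen >= 5) % start indices of runs of at least five 1s; here 3 and 19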

Sort Matlab data into groups

I have a column of numerical data (imported from excel) and I would like to sort each of the column entries into 4 different groups based on custom size ranges, then calculate how many column entries are in each group, as a fraction of the total number of entries in the column.
For example, if my column were 1,3,13,11,5,9, I would want to calculate how many entries fall into the group 1-3, how many into the group 4-7, and so on, and then express the count in each group as a fraction of the total number of column entries (i.e. 6 in this example).
Does anyone know how to do this best?
Thanks
Hannah :)
Sorry, I misread your question; here is the updated code:
ranges = [1 3
4 7
8 11
12 13];
groups = size(ranges,1);
a = [ 1,3,13,11,5,9];
counter = zeros(groups,1);
for i=1:groups
counter(i) = sum(a>=ranges(i,1) & a<=ranges(i,2));
end
relative_counter = counter / numel(a);
Old answer:
I do not understand how you get your group bounds (in your question the first group spans 3 values and the second spans 4?).
Have a look at the following code (be careful and test how it should behave at group borders):
groups =4;
a = [ 1,3,13,11,5,9];
range = max(a)-min(a);
rangePerGroup = range/groups;
a_noOffset = a-min(a);
counter = zeros(groups,1);
for i=1:groups
counter(i) = sum(a_noOffset>=rangePerGroup*(i-1) & a_noOffset<=rangePerGroup*i);
end
relative_counter = counter / numel(a);
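For integer data with fixed ranges like these, the loop can also be replaced by a single histc call; the edge vector below is chosen to match the ranges above:
a = [1,3,13,11,5,9];
edges = [1 4 8 12 14]; % bin k counts values with edges(k) <= a < edges(k+1)
counter = histc(a,edges);
counter = counter(1:end-1); % drop the extra bin histc appends for values equal to the last edge
relative_counter = counter / numel(a);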

Using bin counts as weights for random number selection

I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then I populate a vector with all the possible values covered by the bins, from which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (e.g. the count for bin 1-5 is 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whilst the count for bin 16-20 is 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply take the bin counts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach fails to approximate the data. My understanding was that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and values are often drawn from bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
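Applied to the variables already defined in the question (sizes, w, binsize and nbins), the same idea would look like this; the sample size is an assumption:
numberOfSamples = 10000; %# assumed sampling depth
R = randsample(sizes,numberOfSamples,true,w); %# weighted sampling with replacement
binnedR = histc(R,1:binsize:binsize*nbins); %# re-bin R to compare against bincounts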

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one value removed at a time, removing the value whose removal gives the largest reduction in the standard deviation, and repeating that over the elements of the array, which is terribly inefficient.
for j = 1:7 % strip out 7 values, one per pass
for i = 1:numel(H)
G = double(H); % fresh working copy (already-stripped values are NaN)
G(i) = NaN; % additionally leave out value i
T(i) = nanstd(G); % std with value i removed
end
best = find(T==min(T),1); % value whose removal reduces the std most
H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This approach bins your data and looks for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb = find(abs(H-bm)<bl/2) % elements within half a bin width of the most populated bin
median(H(ifb)) % median of those elements
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H',nbin);
[mv iv]=max(histc(km,[1:nbin])); % iv is the label of the most populated cluster
H(km==iv)
median(H(km==iv))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).
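Regarding the run-to-run variation, rather than averaging over your own repeated calls you can let kmeans restart internally and keep the best of several random initialisations via its 'Replicates' parameter (the value 5 below is just an illustration):
km = kmeans(H',nbin,'Replicates',5); % keep the best of 5 random starts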

Genetic algorithm: Minimum Number of Generations?

I have a Matlab script (actually a function, funModel), which I'm trying to solve with 7 integer variables via a genetic algorithm:
nvars = 7; %number of variables
Aineq = [1 1 1 1 1 1 1]; Aeq = [];
bineq = [VesMaxCrew]; beq = [];
LowBound = [1 1 1 1 1 4 0];
UpBound = [1 1 VesMaxCrew 1 VesMaxCrew VesMaxCrew VesMaxCrew];
Nonlcon = [];
IntCon = [1:7]; % all 7 variables to be treated as integers
Options = gaoptimset('Display','iter',... %display every iteration
'Generations',70,... %maximum number of generations is 70
'TolFun',1,... %tolerance for optimisation is 1
'TolCon',1,...
'PlotFcns',@gaplotbestf);
OptimisedValue = ga(@funModel,nvars,Aineq,bineq,Aeq,beq,LowBound,UpBound,Nonlcon,IntCon,Options);
The genetic algorithm works fine and finds a good solution, easily within 70 generations (as can be seen with the plot function @gaplotbestf). With the current input, the optimal solution is reached by every individual after 25 to 30 generations. The algorithm, however, continues to run until 51 generations have been made, which seems like at least 20 generations too many.
Even if I change the input parameters of funModel, the genetic algorithm still runs for at least 51 generations, as if some constraint or setting forced it to run a minimum of 51 generations. (As can be seen, a maximum number of generations has been set.)
Why doesn't the algorithm stop between 25 or 30 generations? (or just after 30 generations)
And more importantly, does anyone know how to alter this?
(I haven't been able to find anything about a setting (gaoptimset) of minimum generations in the Matlab documentation. Neither have I been able to find somebody with the same problem/question.)
"Stall generations" option has default value of 50. This is actually the point where it stops in your case. This can be considered as a minimum number of generations. For more details please check here.