Manual Wilcoxon Rank-Sum Test

My statistics professor wants us to perform a manual Wilcoxon Rank-Sum Test using MATLAB. Unfortunately, I have no experience with MATLAB whatsoever, and I have been learning as I go along. In short, we are given a list of 24 paired observations:
33 53 54 84 69 34 60 34 50 56 64 50 76 47 58 63 55 66 58 43 28 80 45 55
66 62 54 58 60 74 54 68 64 60 53 59 61 49 63 55 61 64 54 59 64 46 70 82
I've gotten to the point where I have a matrix with the absolute differences in the first column, the sign of the difference (indicated by a 1 for positive and -1 for negative) in the second column and the rank of the difference (1 through 24) in the third column.
I am struggling to find a quick and efficient way to "break the ties" between differences of equal size and allocate the average rank to each of them. I expect that some loops and logical statements may be required, but I am having a hard time with them as I have no prior experience.
Any suggestions on how to do this would be much appreciated.

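One way to average the ranks of entries with matching differences is as follows:
irankavg = zeros(length(dp),1);
[dpu,~,iclass] = unique(dp);
for ii = 1:length(dpu)
    mask = (iclass == ii); % entries that share the ii-th unique difference
    irankavg(mask) = mean(irank(mask));
end
where dp is a column array that contains the differences and irank is the column array of their ranks. iclass maps each entry of dp to its position in dpu, so iclass == ii selects every entry in the ii-th group of ties.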
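For readers who want to check the tie handling outside MATLAB, here is a minimal Python sketch of the same idea: sort the values, find each run of equal values, and give every member of the run the mean of the ranks it spans. The function name average_ranks and the sample numbers are made up for illustration; they are not from the question.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied entries their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `stop` over the run of sorted positions sharing one value.
        stop = start
        while stop + 1 < len(order) and values[order[stop + 1]] == values[order[start]]:
            stop += 1
        # Sorted positions start..stop (0-based) hold ranks start+1..stop+1.
        mean_rank = (start + 1 + stop + 1) / 2
        for k in range(start, stop + 1):
            ranks[order[k]] = mean_rank
        start = stop + 1
    return ranks


abs_diffs = [6, 2, 6, 9, 2, 2]   # made-up absolute differences
print(average_ranks(abs_diffs))  # [4.5, 2.0, 4.5, 6.0, 2.0, 2.0]
```

The three 2s occupy ranks 1 to 3 and each get 2.0; the two 6s occupy ranks 4 and 5 and each get 4.5.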

Related

Saving the outcomes of a model run to use as a starting point for future runs

OBJECTIVE: My plan is to run my model for ~60 ticks, and use the outcome of that run (i.e. the changes to the patches) as the starting point for all future runs. The idea behind this is that the first 60 ticks simulate a landscape from the past up until today (without any policy interventions). Then, from today on, I want to explore a range of policy scenarios, all starting with the same base conditions.
QUESTION: Do you know if there is a smart way to take stock of / save the outcomes of a run so that I can use them as a starting point for future runs, or do I need to assess the conditions after 60 ticks manually and then build an alternative setup-button that replicates those conditions?

I agree with Charles that export-world and import-world should work.
An alternative (see the code below) would be to use a fixed random seed in your alternative setup for the first 60 ticks, then change to a run-specific random seed, which would also work on a web-based run. (I suspect export-world doesn't work over the web.)
Here's an example of changing the random seed mid-flight. Be sure to save the random seed before you define the new random seed or everything will always be the same!
Load this code and hit setup and go buttons multiple times and you can confirm it's working.
globals [ variable-random-seed fixed-ticks ]

to setup
  clear-all
  set variable-random-seed random 999999999 ;; nine nines works
  random-seed 123456789 ;; any fixed number to use for low ticks
  set fixed-ticks 10 ;; or 60 in your case
  print " ----------- fixed ------------- ===== -------- varies by run ----------- "
  reset-ticks
end

to go
  if ticks > 20 [ print "\n" stop ]
  write random 100
  if ticks = fixed-ticks [ write "=====" random-seed variable-random-seed ]
  tick
end
Sample output of three runs
----------- fixed ------------- ===== -------- varies by run
66 68 42 59 14 1 34 20 3 15 86 "=====" 1 80 87 54 85 51 37 53 94 69
----------- fixed ------------- ===== -------- varies by run
66 68 42 59 14 1 34 20 3 15 86 "=====" 94 72 60 26 18 90 65 50 65 18
----------- fixed ------------- ===== -------- varies by run
66 68 42 59 14 1 34 20 3 15 86 "=====" 23 93 75 68 17 44 17 30 99 94
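
The same pattern (a shared seed for the common prefix, then a per-run seed) is not specific to NetLogo. As a rough illustration only, here is a Python sketch using the standard random module; the function name run and its parameters are invented for this example, and the numbers it draws will not match the NetLogo output above.

```python
import random


def run(variable_seed, fixed_seed=123456789, fixed_ticks=10, last_tick=20):
    """Draw one number per tick; switch seeds after tick `fixed_ticks`."""
    rng = random.Random(fixed_seed)      # shared seed for the common prefix
    draws = []
    for tick in range(last_tick + 1):
        draws.append(rng.randrange(100))
        if tick == fixed_ticks:
            rng.seed(variable_seed)      # from here on, runs diverge
    return draws


a = run(variable_seed=1)
b = run(variable_seed=2)
print(a[:11] == b[:11])   # True: ticks 0..10 use the fixed seed
print(a[11:], b[11:])     # the tails come from different seeds
```

As in the NetLogo version, the per-run seed has to be chosen before (or independently of) the fixed seed, otherwise every run would pick the same "variable" seed.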

Stopping criteria for fminsearch in Matlab

I am using fminsearch to fit parameters for a system of DEs to observed data. I am not expecting to get a great fit.
fminsearch pretty quickly finds what appears to be an acceptable min for the objective function, but then does not stop. It's running for a really long time, and I cannot figure out why.
I am using the options
options = optimset('Display','iter','TolFun',1e-4,'TolX',1e-4,'MaxFunEvals',1000);
which I understood to mean that once the value of the objective function drops below 1e-4, that would be considered sufficient. Alternatively, once the parameters can no longer be changed, whatever is best so far would be returned.
The output is
Iteration Func-count min f(x) Procedure
0 1 8.13911e+10
1 8 7.2565e+10 initial simplex
2 9 7.2565e+10 reflect
3 10 7.2565e+10 reflect
4 11 7.2565e+10 reflect
5 12 7.2565e+10 reflect
6 13 7.2565e+10 reflect
7 15 6.85149e+10 expand
8 16 6.85149e+10 reflect
9 17 6.85149e+10 reflect
10 19 6.20681e+10 expand
11 20 6.20681e+10 reflect
12 22 5.55199e+10 expand
13 23 5.55199e+10 reflect
14 25 4.86494e+10 expand
15 26 4.86494e+10 reflect
16 27 4.86494e+10 reflect
17 29 3.65616e+10 expand
18 30 3.65616e+10 reflect
19 31 3.65616e+10 reflect
20 33 2.82946e+10 expand
21 34 2.82946e+10 reflect
22 36 2.02985e+10 expand
23 37 2.02985e+10 reflect
24 39 1.20011e+10 expand
25 40 1.20011e+10 reflect
26 41 1.20011e+10 reflect
27 43 5.61651e+09 expand
28 44 5.61651e+09 reflect
29 45 5.61651e+09 reflect
30 47 2.1041e+09 expand
31 48 2.1041e+09 reflect
32 49 2.1041e+09 reflect
33 51 5.15751e+08 expand
34 52 5.15751e+08 reflect
35 53 5.15751e+08 reflect
36 55 7.99868e-05 expand
37 56 7.99868e-05 reflect
38 58 7.99835e-05 reflect
39 59 7.99835e-05 reflect
I have previously let this run for a lot longer and it's stuck with the same min f(x) for at least the next 30 print outs.
How do I set the options correctly so that when it finds a solution within an acceptable value for the objective function it stops?

fminsearch requires that both TolX AND TolFun be satisfied before terminating ("Unlike other solvers, fminsearch stops when it satisfies both TolFun and TolX." See: https://www.mathworks.com/help/matlab/ref/fminsearch.html). You should check what the "x" value (your solution) is doing. I suspect that it is still changing by more than your tolerance at each step, i.e. the value of x changes by more than TolX between iterations even though f(x) no longer changes by more than TolFun. Note that TolFun is a tolerance on the change in f(x), not a target value for f(x) itself.
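
To make the "both tolerances" rule concrete, here is a small Python sketch of a convergence check of this kind for a simplex method. It is not MATLAB's source code: the function name simplex_converged, the convention that the best vertex comes first, and the use of the largest coordinate difference as the size measure are all assumptions made for illustration.

```python
def simplex_converged(vertices, fvals, tol_x, tol_fun):
    """Both-tolerances test for a simplex sorted so that index 0 is the best vertex."""
    best_x, best_f = vertices[0], fvals[0]
    # Largest gap in objective value between the best vertex and the others.
    f_spread = max(abs(f - best_f) for f in fvals[1:])
    # Largest coordinate-wise distance from the best vertex to any other vertex.
    x_spread = max(
        abs(xi - bi) for v in vertices[1:] for xi, bi in zip(v, best_x)
    )
    return f_spread <= tol_fun and x_spread <= tol_x


# Objective values agree to 1e-9, but the vertices are still 0.5 apart in x.
flat_valley = [(1.0, 2.0), (1.5, 2.0), (1.0, 2.5)]
print(simplex_converged(flat_valley, [8.0e-5, 8.0e-5 + 1e-9, 8.0e-5], 1e-4, 1e-4))  # False

# Once the simplex has shrunk as well, both conditions hold.
tight = [(1.0, 2.0), (1.00001, 2.0), (1.0, 2.00001)]
print(simplex_converged(tight, [8.0e-5, 8.0e-5 + 1e-9, 8.0e-5], 1e-4, 1e-4))  # True
```

The first case mirrors the output in the question: min f(x) has stopped changing, but on a flat stretch of the objective the simplex can stay large in x, so the search keeps going until MaxFunEvals is reached.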

CPU and Memory Friendly Solution to Merge Large Matrix

For the following typical case:
n = 1000000;
r = randi(n,n,2);
(assume there are 0.05% common numbers between all rows; n could be even tens of millions)
I am looking for a CPU- and memory-efficient solution to merge rows based on any common items (here, integers). A list of sample codes in Python is available here and a quick attempt to translate one into MATLAB can be found here.
In my attempts they take ages (minutes to hours), so I would like to find a faster solution.
For the above example, the typical output should look like (cell):
{
[1 90 34 67 ... 9]
[35 89]
[45000 23 828 130 8999 45326 ... 11]
...
}
Note also that, I have tried to compile as mex but failed due to no-support for cell in Matlab-Coder.
Edit: A tiny demonstration example
%---------------------------------------
clc
n = 100;
r = randi(n,n,2); % random integers in [1,n], size(n,2)
%---------------------------------------
>> r
r =
82 17 % (1) 82 17
91 13 % (2) 91 13
13 32 % (3) 91 13 32 merged with (2), common 13
82 53 % (4) 82 17 53 merged with (1), common 82
64 17 % (5) 82 17 53 64 merged with (4), common 17
...
94 45
13 31 % (77) 91 13 32 31 merged with (3), common 13
57 51
47 52
2 13 % (80) 91 13 32 31 2 merged with (77), common 13
34 80
%---------------------------------------
c = merge(r); % cpu and memory friendly solution is searched for.
%---------------------------------------
c =
[82 17 53 64]
[91 13 32 31 2]
...

You need an index.
In Python, use a dict. In MATLAB - I'd not use MATLAB, because open-source is the future, and MATLAB is dying out.
But Python is quite slow. You can likely get a 10x speedup by using e.g. Cython to translate and optimize the code in C. Avoid using Python data types such as a list of int, because they are very memory intensive. numpy has memory-efficient arrays of integers.
If you get a new pair (a,b) you can use this dictionary to find existing items to merge. Then update the dict after the merge.
Actually, for integers you should use an array instead of a dict.
The trickiest part is handling the case when both a and b exist but belong to different large groups. There are some neat optimizations possible here, if that isn't fast enough yet.
It's not clustering, but connected components.
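
One standard way to compute connected components with such an index is a union-find (disjoint-set) structure. The sketch below is my own illustration of that technique in plain Python, not code from the linked samples; merge_rows and find are hypothetical names, and it uses a dict for readability even though, as noted above, an integer array would be the more memory-friendly choice for n in the millions.

```python
def merge_rows(rows):
    """Group all integers that are linked, directly or transitively, by a shared row."""
    parent = {}

    def find(x):
        # Walk to the root, halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in rows:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a      # union the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


# The commented rows from the tiny demonstration in the question.
rows = [(82, 17), (91, 13), (13, 32), (82, 53), (64, 17), (13, 31), (2, 13)]
print(merge_rows(rows))
# [[82, 17, 53, 64], [91, 13, 32, 31, 2]]
```

Merging two large groups costs only one pointer update here, which is why this handles the "both a and b already exist" case cheaply. Union by size or rank would be a natural further optimization.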

Why do you wrap around in 16 bit checksum (hex used)?

I have the question:
Compute the 16-bit checksum for the data block E3 4F 23 96 44 27 99 F3. Then perform the verification calculation.
I can perform the addition and I get the overflow like:
  E3 4F
  23 96
  44 27
  99 F3
-------
1 E4 FF (overflow)
The solution then takes the overflow and adds it causing E4 FF to become E5 00. Can someone explain to me why this occurs?
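
Assuming the exercise uses the ones' complement style of checksum (as in the Internet checksum), the wrap-around is part of the addition rule itself: a carry out of the top bit is added back into the lowest bit instead of being thrown away, so no part of the sum is lost and the receiver's check comes out as all ones. The Python sketch below reproduces the arithmetic; the function name ones_complement_sum16 is made up for this example.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into bit 0."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total


data = [0xE34F, 0x2396, 0x4427, 0x99F3]
s = ones_complement_sum16(data)
checksum = ~s & 0xFFFF

print(f"{sum(data):X}")    # 1E4FF  (plain sum, with the overflow bit)
print(f"{s:04X}")          # E500   (carry wrapped around)
print(f"{checksum:04X}")   # 1AFF   (complement of the sum)
# Verification: summing data plus checksum gives FFFF, whose complement is 0.
print(f"{ones_complement_sum16(data + [checksum]):04X}")   # FFFF
```

If the overflow were simply dropped, the sum would be E4FF and the verification step would no longer give FFFF for an unmodified block under this addition rule.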

Alternative to dec2hex in MATLAB?

I am calling dec2hex up to 100 times in MATLAB, and this slows my code down. For a single point I call dec2hex 100 times, which takes a minute or more. I have to do the same for 5000 points, so because of dec2hex the run takes hours. How can I do the decimal-to-hexadecimal conversion efficiently? Is there an alternative that can be used instead of dec2hex?
As example:
%%Data[1..256]: can be any data from
for i=1:1:256
Table=dec2hex(Data);
%%Some permutation applied on Data
end;
Here I am using dec2hex more than 100 times for one point. And I have to use it for 5000 points.
Data =
Columns 1 through 16
105 232 98 250 234 216 98 199 172 226 250 215 188 11 52 174
Columns 17 through 32
111 181 71 254 133 171 94 91 194 136 249 168 177 202 109 187
Columns 33 through 48
232 249 191 60 230 67 183 122 164 163 91 24 145 124 200 142
This is the kind of data my code will use.

Function calls are (still) expensive in MATLAB. This is one of the reasons why vectorization and pseudo-vectorization are strongly recommended: processing an entire array of N values in one function call is far better than calling the processing function once for each of the N elements, because it saves the overhead of the N-1 extra calls.
So, what can you do? Here are some non-mutually-exclusive choices:
Profile your code first. Just because something looks like the main culprit for slow execution doesn't mean it is. Type profview in your command window, choose the script that you want to run, and see where the hotspots of your code are. Optimize those hotspots rather than your initial guesses.
Try faster functions. sprintf is usually fast and flexible:
Table = sprintf('%04X\n', Data);
(and — if you dive into the function code with edit dec2hex — you'll see that in some cases dec2hex actually calls sprintf).
Reduce the number of function calls. Suppose you have to build the table for the 100 datasets of different lengths, that are stored in a cell array:
DataSet = cell(1,100);
for k = 1:100
DataSet{k} = fix(1000*rand(k,1));
end;
The idea is to assemble all the numbers in a single array that you convert at once:
Table = dec2hex(vertcat(DataSet{:}));
Mind you, this is done at the expense of using supplemental memory for assembling the partial inputs in a single one — it's not always convenient to do that.
All the variants above. Okay, this point is not actually a point. :-)
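
For anyone unsure what the '%04X' format produces, here is a small Python illustration using the same C-style specifier: four uppercase hex digits, zero-padded. It also contrasts one formatting call per value with a single call for the whole batch. This is only an analogy: MATLAB's sprintf reuses the format across the whole array by itself, whereas Python's % operator does not, so the sketch repeats the format string explicitly.

```python
data = [105, 232, 98, 250, 234, 216]   # first few values from the question

# One formatting call per value (analogous to calling a converter in a loop).
per_value = ["%04X" % v for v in data]

# One call for the whole batch: repeat the format and apply it once.
batched = (("%04X\n" * len(data)) % tuple(data)).split()

print(per_value)              # ['0069', '00E8', '0062', '00FA', '00EA', '00D8']
print(batched == per_value)   # True
```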