What is the correct way to implement this loop that will average values with a changing counter? - matlab

I have looked thoroughly on the internet for an answer to this question, but it seems to be too specific for an answer anywhere else. This is my last stop.
To preface, this is not a homework problem, but it is adapted from an online Coursera course, whose quiz has already passed. I got the correct answer, but it was mostly luck. Also, it is a more of a general programming question than anything related to the course, so I know for a fact that it is within my right to ask it on a public forum.
The last thing is that I'm trying to do this in MATLAB; however, if you have an answer in C++ or Python or any other high-level language, that would be wonderful, as I could easily adapt those solutions to MATLAB syntax.
Here it is:
I have two vectors, T and M, each with 600,000 elements/entries/integers.
T is entered as milliseconds from 1 to 600,000 in ascending order, and each element in M represents 'on' or 'off' (entered as 1 or 0 respectively) for each corresponding millisecond entry in T. So there are random 1's and 0's in M that correspond to a particular millisecond from 1 to 600,000 in T.
Starting with the 150th millisecond of T, and in increments of 150 milliseconds from there on (inclusive), I need the average millisecond value of each group of 150, but ONLY of those milliseconds whose entries are 1 in M ('on'). For example, I need to look at the first 150 milliseconds in T, see which ones have a value of 1 in M, and then average them. Then I need to do it again with entries 151 to 300 in T, then 301 to 450, etc. These new averages should also be stored in a new vector. The problem is, the number of corresponding 1's in M isn't going to be the same for every group of 150 milliseconds in T. (And yes, we are trying to average the actual values of the milliseconds, so the values we are averaging and the order of the entries in T will be the same.)
My attempt:
It turns out there are only 53,583 random 1's in M (out of the 600,000 entries, the rest are 0). I used a 'find' operation to extract those entries from M that are a 1 into a new vector K that has the corresponding millisecond value from T. So K looks like a bunch of random numbers in ascending order, which is just a list of all the milliseconds in T that are 'on' (assigned a 1 in M).
So K looks something like [2 5 11 27 39 40 79 ...... 599,698 599,727 etc.] (all of the millisecond values who are a 1 in M).
So I have the vector K, which holds all of the values that I need to average in groups of 150. The problem is that I need to go in groups of 150 based on the vector T (1 to 600,000), which means there won't always be the same number of 1's (or values in K) in every group of 150 milliseconds in T. That in turn means the number I need to divide by to get the average of each group is going to change for each group of 150. I know I need to use a loop to compute the average millisecond value for every 150 entries, but how do I get the dividing number (the number of entries in each group of 150 that are assigned a 1, or 'on') to change on each iteration of the loop? Is there a way to bind T and M together so that they only use the requisite values from K whenever there is a 1 in M, and then just use a simple counter to average?
It's not a complicated problem, but it is very hard to explain. Sorry about that! I hope I explained as clearly as I could. Any help would be appreciated, although I'm sure you'll have questions first.
Thank you very much!

I think this should work OK. It averages, within each block of 150 milliseconds, only the millisecond values that are 'on' in M:
sz = length(T);
n = sz / 150;                     % number of 150-ms blocks
aver = zeros(n,1);                % your result vector
t = 1;
for i = 1:150:sz
    idx = i:i+149;                % current block of 150 milliseconds
    on = M(idx) == 1;             % entries that are 'on' in this block
    aver(t) = mean(T(idx(on)));   % average only the 'on' millisecond values
    t = t + 1;
end
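If you prefer to avoid the loop, here is a minimal vectorized sketch (assuming T and M are column vectors and numel(T) is an exact multiple of 150):
Tm  = reshape(T(:).*M(:), 150, []);    % millisecond values, zeroed where M is 0; one block per column
cnt = sum(reshape(M(:), 150, []), 1);  % number of 'on' entries in each block of 150
aver = (sum(Tm, 1) ./ cnt)';           % per-block average of the 'on' milliseconds
As in the loop above, a block with no 'on' entries gives NaN.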
-Rob

Related

Matlab Randi lab Vectors

I'm doing a lab using MATLAB and have hit a bit of a snag. The prompt is:
a. Generate a vector to manipulate in the following exercises by using a random
number generator to create "pull-ups" counts for 50 people. The counts should be
from 1 to 10. Use this vector of counts for the next two problems.
b. How many people did more than 5 pull-ups? Do your results make sense for a
uniformly distributed random number generator?
c. Generate another vector of "pull-ups" counts for 50 athletes, but this time use the
range from 11 to 20. Append this new vector to the previous vector (now you have
100 "pull-ups" counts).
d. Find the average number of "pull-ups" for the 100 total people. Do your results
make sense?
e. Use the 100 person vector in c and create a new vector that contains only the
counts from the odd-numbered indices (not the odd value counts, instead the
counts for every other person starting with person 1).
f. Use the 100 person vector in c and make a new vector of the "even-valued
counts".
Now, I can do parts a. and b. with no problem, but I don't have any idea how to do part c. I've been trying to do this:
x=randi(20,11,50)
Now I know that I get 110 values that range from 1 to 20 doing what I put above. But I'm trying to get 50 values from 11 to 20 and add those values to the vector in part a so that I have 100 values, with 50 ranging from 1-10 and the other 50 ranging from 11-20. Any idea what I'm doing wrong?
You need to provide an array as the first input to randi to specify the lower and upper limits of the random integers. If you specify just a scalar, then values between 1 and your provided value will be returned. The second and third inputs are the size of the output, so we want the output to be 50 x 1:
x = randi([11 20], 50, 1)
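Appending this to the vector from part a (assuming that vector is also 50 x 1; here it is hypothetically named xA) is just concatenation:
allCounts = [xA; x];   % 100 counts: 50 in the range 1-10 followed by 50 in the range 11-20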

Changing numbers for given indices between matrices

I'm struggling with one of my MATLAB assignments. I want to create 10 different models. Each of them is based on the same original array m_est of dimensions 1x100. Then, with a for loop, I am choosing 5 random values from the original model and want to add the same random value to each of them. The cycle repeats 10 times, choosing different values each time and adding a different random number. Here is a part of my code:
steps=10;
for s=1:steps
    for i=1:1:5
        rl(s,i)=m_est(randi(numel(m_est)));
        rl_nr(s,i)=find(rl(s,i)==m_est);
        a=-1;
        b=1;
        r(s)=(b-a)*rand(1,1)+a;
    end
    pert_layers(s,:)=rl(s,:)+r(s);
    M=repmat(m_est',s,1);
end
for k=steps
    for m=1:1:5
        M_pert=M;
        M_pert(1:k,rl_nr(k,1:m))=pert_layers(1:k,1:m);
    end
end
In matrix M I am storing 10 initial models and want to replace the random numbers with indices from rl_nr matrix into those stored in pert_layers matrix. However, the last loop responsible for assigning values from pert_layers to rl_nr indices does not work properly.
Does anyone know how to solve this?
Best regards
Your code uses a lot of loops and in this particular circumstance, it's quite inefficient. It's better if you actually vectorize your code. As such, let me go through your problem description one point at a time and let's code up each part (if applicable):
I want to create 10 different models. Each of them is based on the same original array of dimensions 1x100 m_est.
I'm interpreting this as you having an array m_est of 100 elements, and with this array, you wish to create 10 different "models", where each model is 5 elements sampled from m_est. rl will store these values from m_est while rl_nr will store the indices / locations of where these values originated from. Also, for each model, you wish to add a random value to every element that is part of this model.
Then with for loop I am choosing 5 random values from the original model and want to add the same random value to each of them.
Instead of doing this with a for loop, generate all of your random indices in one go. Since you have 10 steps, and we wish to sample 5 points per step, you have 10*5 = 50 points in total. As such, why don't you use randperm instead? randperm is exactly what you're looking for, and we can use this to generate unique random indices so that we can ultimately use this to sample from m_est. randperm generates a vector from 1 to N but returns a random permutation of these elements. This way, you only get numbers enumerated from 1 to N exactly once and we will ensure no repeats. As such, simply use randperm to generate 50 elements, then reshape this array into a matrix of size 10 x 5, where the number of rows tells you the number of steps you want, while the number of columns is the total number of points per model. Therefore, do something like this:
num_steps = 10;
num_points_model = 5;
ind = randperm(numel(m_est));
ind = ind(1:num_steps*num_points_model);
rl_nr = reshape(ind, num_steps, num_points_model);
rl = m_est(rl_nr);
The first two lines are pretty straightforward. We are just declaring the total number of steps you want to take, as well as the total number of points per model. Next, what we will do is generate a random permutation of length 100, where elements are enumerated from 1 to 100, but they are in random order. You'll notice that this random vector uses each value within the range of 1 to 100 exactly once. Because you only want to get 50 points in total, simply subset this vector so that we only get the first 50 random indices generated from randperm. These random indices get stored in ind.
Next, we simply reshape ind into a 10 x 5 matrix to get rl_nr. rl_nr contains the indices that will be used to select entries from m_est, and it is of size 10 x 5. Finally, rl will be a matrix of the same size as rl_nr, but it will contain the actual random values sampled from m_est. These random values correspond to the indices generated in rl_nr.
Now, the final step would be to add the same random number to each model. You can certainly use repmat to replicate a random column vector of 10 elements long, and duplicate them 5 times so that we have 5 columns then add this matrix together with rl.... so something like:
a = -1;
b = 1;
r = (b-a)*rand(num_steps, 1) + a;
r = repmat(r, 1, num_points_model);
M_pert = rl + r;
Now M_pert is the final result you want, where we take each model that is stored in rl and add the same random value to each corresponding model in the matrix. However, if I can suggest something more efficient, I would suggest you use bsxfun instead, which does this replication under the hood. Essentially, the above code would be replaced with:
a = -1;
b = 1;
r = (b-a)*rand(num_steps, 1) + a;
M_pert = bsxfun(@plus, rl, r);
Much easier to read, and less code. M_pert will contain your models in each row, with the same random value added to each particular model.
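As a side note, in MATLAB R2016b and later, implicit expansion makes the bsxfun call unnecessary; the same result can be written directly as:
M_pert = rl + r;   % the 10 x 1 vector r expands automatically across the 5 columns of rl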
The cycle repeats 10 times, choosing different values each time and adding a different random number.
Already done in the above steps.
I hope you didn't find it an imposition to completely rewrite your code so that it's more vectorized, but I think this was a great opportunity to show you some of the more advanced functions that MATLAB has to offer, as well as more efficient ways to generate your random values, rather than looping and generating the values one at a time.
Hopefully this will get you started. Good luck!

Random numbers that add to 1 with a minimum increment: Matlab

Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1; however, I want the added constraint that the minimum increment (or, if you like, the number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However, though this works for small values of n & small increments, if for example n=1,000 & the increment is 0.1%, then over a large number of trials the first and last numbers have a mean which is consistently below 0.1%.
I am sure there is a logical explanation/solution to this but I have been tearing my hair out trying to find it & wondered if anybody would be so kind as to point me in the right direction. To put the problem into context, I am trying to create random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far, just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1% so 0%, 0.1%, 0.2% etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse, the mean for the 1st & last weights is well below 0.1%.....
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
    stockmatrix=1:stocks;
    base_weight=1/stocks;
    r=randi(stocks,stocks,simulations);
    x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which is important considering I want to run a total of 10,000,000 simulations; 10,000 simulations on 1,000 stocks take just over 2 seconds with a single core, & I am running the whole code on an 8-core machine using the Parallel Toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to get a portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue is that although it works for 1,000 stocks & an increment of 0.1%, and I guess it works for 100 stocks & an increment of 1%, as the number of stocks decreases each pick becomes a very large percentage (in the extreme, with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited)
However my question has arisen because I would like to make my approach more flexible & have the possibility of, say, 3 stocks & an increment of 1%, which my current code will not handle correctly, hence how I stumbled across the original question on Stack Overflow.
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights & numbers of stock in the index, but it appears my programming & probability theory ability are a limiting factor.......
One problem I can see is that your formula allows for numbers to be zero - when the rounding operation results in two consecutive numbers being the same after sorting. Not sure if you consider that a problem - but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it, since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want it to be: if you have uniformly distributed numbers from 0 to 1000 and you round them, the numbers that round to 0 were in the interval [0, 0.5); the ones that round to 1 came from [0.5, 1.5) - twice as big. The last number (rounding to 1000) is again from a smaller interval: [999.5, 1000]. Thus you will not get the first and last number as often as you think. If instead of round you use floor I think you will get the answer you expect.
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
% returns an array of n random integers
% whose sum is limit
% calls itself recursively; slow but accurate
if n>1
    v = binomialRandom(limit, 1 / n);
    r = [v randomInt(n-1, limit - v)];
else
    r = limit;
end

function b = binomialRandom(N, p)
b = sum(rand(1,N)<p); % slow but direct
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
    portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
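If you do have the Statistics Toolbox, the binomial draw can be delegated to the built-in generator, which should be noticeably faster; a one-line sketch replacing the helper above:
v = binornd(limit, 1 / n);   % same distribution as binomialRandom(limit, 1/n)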
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100, and standard deviation of sqrt(p*(1-p)*n). In this case, p=0.1 so we expect s = 9.4868. The values I actually got were again quite close.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency. I find that clarity trumps speed when you develop something new. But for instance you could pre-compute the cumulative binomial distributions for p=1./(1:10), then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute; unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
Eventually I have solved this problem!
I found a paper by 2 academics at Johns Hopkins University, "Sampling Uniformly From The Unit Simplex"
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorithms don't work, in a way very similar to woodchips' answer to the Random numbers that add to 100 question. They then go on to show that the method suggested by David Schwartz can also be slightly biased and propose a modified algorithm which appears to work.
If you want x numbers that sum to y:
1. Sample uniformly x-1 random numbers from the range 1 to x+y-1, without replacement
2. Sort them
3. Add a zero at the beginning & x+y at the end
4. Difference them & subtract 1 from each value
5. If you want to scale them as I do, then divide by y
It took me a while to realise why this works when the original approach didn't, and it comes down to the probability of getting a zero weight (as highlighted by Floris in his answer). To get a zero weight in the original version, for all but the 1st or last weights your random numbers had to have 2 values the same; but for the 1st & last ones, a random number of zero or the maximum number would also result in a zero weight, which is more likely.
In the revised algorithm, zero & the maximum number are not in the set of random choices, & a zero weight occurs only if you select two consecutive numbers, which is equally likely for every position.
I coded it up in MATLAB as follows:
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
    scaling=1/min_increment;
    sample=NaN(num_simulations,num_stocks-1);
    for i=1:num_simulations
        allcomb=randperm(scaling+num_stocks-1);
        sample(i,:)=allcomb(1:num_stocks-1);
    end
    temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];
    weights = (diff(temp,[],2)-1)/scaling;
end
Obviously the loop is a bit clunky, and as I'm using the 2009 version, the randperm function only allows you to generate permutations of the whole set; despite this I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop, which is fast enough.
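As a side note, in newer MATLAB releases randperm accepts a second argument, randperm(n,k), which returns k unique values drawn from 1:n; the loop body above could then be sketched as:
sample(i,:) = randperm(scaling+num_stocks-1, num_stocks-1);   % draw num_stocks-1 unique values directly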
The mean weights are now correct, & as a quick test I replicated woodchips' example of generating 3 numbers that sum to 1 with the minimum increment being 0.01%, & it also looks right.
Thank you all for your help and I hope this solution is useful to somebody else in the future
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with a constraint of a minimum increment of 5. I would first find 11 numbers that sum to 45, with no lower bound on the samples (other than zero). I could use a tool from the file exchange for this. Simplest is to sample 10 numbers in the interval [0,45], sort them along with the endpoints 0 and 45, then find the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of numbers that sums to 45. But the vector Y sums to 100, with a minimum value of 5.
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.
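For example, a minimal vectorized sketch for the 11-numbers-summing-to-100 case above (nSim is a hypothetical number of sets to generate):
nSim = 10000;
X = diff([zeros(nSim,1), sort(rand(nSim,10),2), ones(nSim,1)], [], 2) * 45;
Y = X + 5;   % each row of Y sums to 100 and every entry is at least 5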

How to generate all possible n-bit strings?

Given a positive integer n, I want to generate all possible n-bit combinations in MATLAB.
For ex : If n=3, then answer should be
000
001
010
011
100
101
110
111
How do I do it?
I want to actually store them in a matrix. I tried
for n=1:2^4
    r(n)=dec2bin(n,5);
end;
but that gave the error "In an assignment A(:) = B, the number of elements in A and B must be the same."
Just loop over all integers in [0,2^n), and print the number as binary. If you always want to have n digits (e.g. insert leading zeros), this would look like:
for ii = 0:2^n-1
    fprintf('%s\n', dec2bin(ii, n));   % dec2bin(x, n) pads to n digits with leading zeros
end
Edit: there are a number of ways to put the results in a matrix. The easiest is to use
x = dec2bin(0:2^n-1);
which will produce a 2^n-by-n matrix of type char. Each row is one of the bit strings.
If you really want to store each bit string separately, you can use a cell array:
x = cell(1, 2^n);
for ii = 0:2^n-1
    x{ii+1} = dec2bin(ii, n);   % cell indices start at 1
end
However, if you're looking for efficient processing, you should remember that integers are already stored in memory in binary! So, the vector:
x = 0 : 2^n-1;
Contains the binary patterns in the most memory-efficient and CPU-efficient way possible. The only trade-off is that you will not be able to represent patterns with more than 32 or 64 bits using this compact representation.
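For example, individual bits can be read straight from this compact representation with bitget, without any conversion to characters (a small sketch; allBits reproduces the dec2bin layout):
x = (0:2^n-1)';               % all patterns as plain integers
allBits = zeros(2^n, n);
for b = 1:n
    allBits(:, b) = bitget(x, n - b + 1);   % column 1 holds the most significant bit
end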
This is a one-line answer to the question which gives you a double array of all 2^n bit combinations:
bitCombs = dec2bin(0:2^n-1) - '0'
There are many ways to do this enumeration. If you are looking to implement it with an array counter: set an array of counters going from 0 to 1 for each of the three positions (2^0, 2^1, 2^2). Let the starting number be 000 (stored in an array). Increment the counter at the 1st place (2^0); the number becomes 001. Reset the counter at position 2^0, increase the counter at 2^1, and loop on until you have completed all the counters.
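A minimal sketch of that counter idea (n = 3 is assumed here; change n to generalize):
n = 3;
bits = zeros(1, n);              % start at 000
combos = zeros(2^n, n);
for k = 1:2^n
    combos(k, :) = bits;         % record the current pattern
    j = n;                       % increment from the least significant position
    while j >= 1 && bits(j) == 1
        bits(j) = 0;             % reset and carry
        j = j - 1;
    end
    if j >= 1
        bits(j) = 1;
    end
end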

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicates the number of years, so one row contains the production values corresponding to the 20 years. The other 99 rows of matrices 2, 3 and 4 are just different realizations (or simulation runs). So basically the other 99 rows of matrices 2, 3 and 4 are repeat cases (but not with exactly the same values, because of random numbers).
Consider Matrix_1 as the reference truth (or base case). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best with, or most closely imitates, Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CIs from the three methods includes the mean (= 3.454992884900722e+008). So do we still consider the p-value to choose the best result?
If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do t-test comparing your vectors 2, 3 and 4 against normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT: I misinterpreted your question. See the answer of Yuk and the following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. The standard error of the mean is the standard deviation of your results divided by the square root of the number of observations, and the confidence interval is obtained by multiplying that standard error by approximately 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. This can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. Looking at your p-values, these chances can be said to be non-existent.
So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
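For reference, the standard error and the confidence interval described above can be computed by hand; a small sketch using k2_mean from the earlier answer:
se = std(k2_mean) / sqrt(numel(k2_mean));   % standard error of the mean
ci = mean(k2_mean) + [-1, 1] * 1.96 * se;   % approximate 95% confidence interval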
Your question and your method aren't really clear:
Is the distribution equal in all columns? This is important, as two distributions can have the same mean but differ significantly.
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution with sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work - if the distributions are alike!
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a two-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
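As a rough sketch of that two-sample test (this assumes the Statistics Toolbox and the matrix names from the question; each realization of Matrix_2 is compared against the reference row):
p = zeros(size(Matrix_2, 1), 1);
for k = 1:size(Matrix_2, 1)
    [~, p(k)] = kstest2(Matrix_1, Matrix_2(k, :));   % p-value for realization k vs. the reference
end
median(p)   % one crude summary of how compatible method 2 is with the reference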
On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.