Matlab - Error bars for (large) noisy data sets - matlab

I have ten large linear arrays (n elements) such as
A = [ A1 A2....An ]
B = [ B1 B2....Bn ]
....
J = [ J1 J2....Jn ]
I can make an arithmentic mean of these arrays by adding them and dividing by ten and this reduces the noise substantially and shows the trend I am looking for. (note, that often I have more or less than ten data sets, but this is representative. Also, n varies, but is generally 10,000s of data points)
What I would like to do is plot this average with error bars that represent the noise in the original ten arrays. The arrays are large, so maybe error bars at sensible increments (say ten error bars across the entire range where the deviation from the average is greatest).
The image shows 10 noisy data sets plotted as grey lines with the mean as a black line.
thanks

I have come to a rather laborious solution to this problem by writing a (what seems) lengthy piece of code.
The code takes all the input arrays and creates two new arrays. One which is the maximum y value for each x and one which is the miniumum. This is done with the max and min functions in matlab.
The minimum is substracted from the maximum to create an array of the magnitudes of the "error" at each value of x.
Then every nth value of the error array is plotted as an error bar on top of the arithmetic mean value of all the original input arrays.
It's a fix to the problem - and the screenshot shows the result - but I was wondering if there is a more elegant "built in" solution that does this in one shot.

Related

Why number of peaks of my signal stay same when I increase n in n-point moving average filter when data is big?

I am using MATLAB to find the number of peaks of a signal.
I'm trying to plot the number of peaks of a signal filtered with N-point moving average filter, N goes from 2 to 30.(I also consider the number of peaks when no filter has applied at the beginning of the resulting array) My data array(imported from csv and has double values between 0 and 1) has around 50k points. When I give part of the data i.e 100, 500 or 1000 points, using array slicing, # of peaks decrease as expected. However, when I give the whole data or even 2000 points, the number of peaks stays same at 127.
I changed the number of data given to the filter to find out why this happens. I changed the commented lines like showed in the comment and tried. When less than 1000 data points given plot was fine.
Here is the signal
https://www.dropbox.com/s/e1bkcjn5ta5q610/exampleSignal.csv?dl=0
Please import it from 4th element to end, it has some strange data at the beginning, I have not taken them, VarName1 is the imported column vector's name
numberOfPeaks = zeros(30,1,'int8');
pks = findpeaks(VarName1); % VarName1(1:1000,:) (when no filter applied)
numberOfPeaks(1) = size(pks,1);
for i=2:30
h = 1/i*ones(1,i,'double');
y = filter(h,1,VarName1); % VarName1(1:1000,:)
numberOfPeaks(i) = size(findpeaks(y),1);
end
plot(1:30,numberOfPeaks);
I expect a plot like this when whole the data is given:
but I get:
I realised that the problem is int8 I use. It can only take up to 127 and this caused my big results to be as 127.
Turning it into double solves the problem.

MATLAB spending an incredible amount of time writing a relatively small matrix

I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to run what should be achievable in at most some seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on input, but in all usage is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there're better ways to do this, but being that I just bashed the code together to quickly format some data I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. New code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. Its a little disconcerting, and it would be good to be able to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D looks like a D x 2 array. However, your call of s = max(D,1) would actually generate another D x 2 array. By consulting the documentation for max, this is what happens when you call max in the way you used:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value of 1, so what you're actually getting is just a copy of D in the end. Using this as input into zeros has rather undefined behaviour. What will actually happen is that for each row of s, it will allocate a temporary zeros matrix of that size and toss the temporary result. Only the dimensions of the last row of s is what is recorded. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Therefore, each parameter to zeros must be scalar, yet your call to produce s would produce a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and finding the overall maximum. If you do this, your code should run faster.
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that doing zeros(n,n) is in fact slow and there are several neat tricks to initializing an array of zeros. One way is to accomplish this by empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment, that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of data. You can replace that for loop with a more efficient accumarray call. This also avoids allocating an array of zeros and accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case will take all pairs of row and column coordinates, stored in D(i,1) and D(i,2) + 1 for i = 1, 2, ..., size(D,1) and place all that match the same row and column coordinates into a separate 2D bin, we then add up all of the occurrences and the output at this 2D bin gives you the total tally of how many values at this 2D bin which corresponds to the row and column coordinate of interest mapped to this location.

Random numbers that add to 1 with a minimum increment: Matlab

Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1, however I want an added constraint that the minimum increment (or if you like number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However though this works for small values of n & small values of increment, if for example n=1,000 & the increment is 0.1% then over a large number of trials the first and last numbers have a mean which is consistently below 0.1%.
I am sure there is a logical explanation/solution to this but I have been tearing my hair out to try & find it & wondered anybody would be so kind as to point me in the right direction. To put the problem into context create random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far, just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1% so 0%, 0.1%, 0.2% etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse, the mean for the 1st & last weights is well below 0.1%.....
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
stockmatrix=1:stocks;
base_weight=1/stocks;
r=randi(stocks,stocks,simulations);
x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which was important considering I want to run a total of 10,000,000 simulations, 10,000 simulations on 1,000 stocks takes just over 2 seconds with a single core & I am running the whole code on an 8 core machine using the parallel toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to geta portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue issue is that although it works for 1,000 stocks & an increment of 0.1% and I guess it works for 100 stocks & an increment of 1%, as the number of stocks decreases then each pick becomes a very large percentage (in the extreme with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited)
However my question has arrisen because I would like to make my approach more flexible & have the possibility of say 3 stocks & an increment of 1% which my current code will not handle correctly, hence how I stumbled accross the original question on stackoverflow
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights & numbers of stock in the index, but it appears my programming & probability theory ability are a limiting factor.......
One problem I can see is that your formula allows for numbers to be zero - when the rounding operation results in two consecutive numbers to be the same after sorting. Not sure if you consider that a problem - but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want them to be: If you have uniformly distributed numbers from 0 to 1000, and you round them, the numbers that round to 0 were in the interval [0 0.5>; the ones that round to 1 came from [0.5 1.5> - twice as big. The last number (rounding to 1000) is again from a smaller interval: [999.5 1000]. Thus you will not get the first and last number as often as you think. If instead of round you use floor I think you will get the answer you expect.
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
% returns an array of n random integers
% whose sum is limit
% calls itself recursively; slow but accurate
if n>1
v = binomialRandom(limit, 1 / n);
r = [v randomInt(n-1, limit - v)];
else
r = limit;
end
function b = binomialRandom(N, p)
b = sum(rand(1,N)<p); % slow but direct
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100, and standard deviation of sqrt(p*(1-p)*n). In this case, p=0.1 so we expect s = 9.4868. The values I actually got were again quite close.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency. I find that clarity trumps speed when you develop something new. But for instance you could pre-compute the cumulative binomial distributions for p=1./(1:10), then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute; unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
Eventually I have solved this problem!
I found a paper by 2 academics at John Hopkins University "Sampling Uniformly From The Unit Simplex"
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorthms don't work, in a way very similar to woodchips answer to the Random numbers that add to 100 question. They then go on to show that the method suggested by David Schwartz can also be slightly biased and propose a modified algorithm which appear to work.
If you want x numbers that sum to y
Sample uniformly x-1 random numbers from the range 1 to x+y-1 without replacement
Sort them
Add a zero at the beginning & x+y at the end
difference them & subtract 1 from each value
If you want to scale them as I do, then divide by y
It took me a while to realise why this works when the original approach didn't and it come down to the probability of getting a zero weight (as highlighted by Floris in his answer). To get a zero weight in the original version for all but the 1st or last weights your random numbers had to have 2 values the same but for the 1st & last ones then a random number of zero or the maximum number would result in a zero weight which is more likely.
In the revised algorithm, zero & the maximum number are not in the set of random choices & a zero weight occurs only if you select two consecutive numbers which is equally likely for every position.
I coded it up in Matlab as follows
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
scaling=1/min_increment;
sample=NaN(num_simulations,num_stocks-1);
for i=1:num_simulations
allcomb=randperm(scaling+num_stocks-1);
sample(i,:)=allcomb(1:num_stocks-1);
end
temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];
weights = (diff(temp,[],2)-1)/scaling;
end
Obviously the loop is a bit clunky and as I'm using the 2009 version the randperm function only allows you to generate permutations of the whole set, however despite this I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop which is fast enough.
The mean weights are now correct & as a quick test I replicated woodchips generating 3 numbers that sum to 1 with the minimum increment being 0.01% & it also look right
Thank you all for your help and I hope this solution is useful to somebody else in the future
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with a constraint of a minimum increment of 5. I would first find 11 numbers that sum to 45, with no lower bound on the samples (other than zero.) I could use a tool from the file exchange for this. Simplest is to simply sample 10 numbers in the interval [0,45]. Sort them, then find the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of numbers that sums to 45. But the vector Y sums to 100, with a minimum value of 5.
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.

Removing extreme values in a vector in Matlab?

So say, I have a = [2 7 4 9 2 4 999]
And I'd like to remove 999 from the matrix (which is an obvious outlier).
Is there a general way to remove values like this? I have a set of vectors and not all of them have extreme values like that. prctile(a,99.5) is going to output the largest number in the vector no matter how extreme (or non-extreme) it is.
There are several way to do that, but first you must define what is "extreme'? Is it above some threshold? above some number of standard deviations?
Or, if you know you have exactly n of these extreme events and that their values are larger than the rest, you can use sort and the delete the last n elements. etc...
For example a(a>threshold)=[] will take care of a threshold like definition, while a(a>mean(a)+n*std(a))=[] will take care of discarding values that are n standard deviation above the mean of a.
A completely different approach is to use the median of a, if the vector is as short as you mention, you want to look on a median value and then you can either threshold anything above some factor of that value a(a>n*median(a))=[] .
Last, a way to assess an approach to treat these spikes would be to take a histogram of the data, and work from there...
I can think of two:
Sort your matrix and remove n-elements from top and bottom.
Compute the mean and the standard deviation and discard all values that fall outside:
mean +/- (n * standard deviation)
In both cases n must be chosen by the user.
Filter your signal.
%choose the value
N = 10;
filtered = filter(ones(1,N)/N, 1, signal);
Find the noise
noise = signal - filtered;
Remove noisy elements
THRESH = 50;
signal = signal(abs(noise) < THRESH);
It is better than mean+-n*stddev approach because it looks for local changes so it won't fail on a slowly changing signal like [1 2 3 ... 998 998].

Arbitrary distribution -> Uniform distribution (Probability Integral Transform?)

I have 500,000 values for a variable derived from financial markets. Specifically, this variable represents distance from the mean (in standard deviations). This variable has a arbitrary distribution. I need a formula that will allow me to select a range around any value of this variable such that an equal (or close to it) amount of data points fall within that range.
This will allow me to then analyze all of the data points within a specific range and to treat them as "similar situations to the input."
From what I understand, this means that I need to convert it from arbitrary distribution to uniform distribution. I have read (but barely understood) that what I am looking for is called "probability integral transform."
Can anyone assist me with some code (Matlab preferred, but it doesn't really matter) to help me accomplish this?
Here's something I put together quickly. It's not polished and not perfect, but it does what you want to do.
clear
randList=[randn(1e4,1);2*randn(1e4,1)+5];
[xCdf,xList]=ksdensity(randList,'npoints',5e3,'function','cdf');
xRange=getInterval(5,xList,xCdf,0.1);
and the function getInterval is
function out=getInterval(yPoint,xList,xCdf,areaFraction)
yCdf=interp1(xList,xCdf,yPoint);
yCdfRange=[-areaFraction/2, areaFraction/2]+yCdf;
out=interp1(xCdf,xList,yCdfRange);
Explanation:
The CDF of the random distribution is shown below by the line in blue. You provide a point (here 5 in the input to getInterval) about which you want a range that gives you 10% of the area (input 0.1 to getInterval). The chosen point is marked by the red cross and the
interval is marked by the lines in green. You can get the corresponding points from the original list that lie within this interval as
newList=randList(randList>=xRange(1) & randList<=xRange(2));
You'll find that on an average, the number of points in this example is ~2000, which is 10% of numel(randList)
numel(newList)
ans =
2045
NOTE:
Please note that this was done quickly and I haven't made any checks to see if the chosen point is outside the range or if yCdfRange falls outside [0 1], in which case interp1 will return a NaN. This is fairly straightforward to implement, and I'll leave that to you.
Also, ksdensity is very CPU intensive. I wouldn't recommend increasing npoints to more than 1e4. I assume you're only working with a fixed list (i.e., you have a list of 5e5 points that you've obtained somehow and now you're just running tests/analyzing it). In that case, you can run ksdensity once and save the result.
I do not speak Matlab, but you need to find quantiles in your data. This is Mathematica code which would do this:
In[88]:= data = RandomVariate[SkewNormalDistribution[0, 1, 2], 10^4];
Compute quantile points:
In[91]:= q10 = Quantile[data, Range[0, 10]/10];
Now form pairs of consecutive quantiles:
In[92]:= intervals = Partition[q10, 2, 1];
In[93]:= intervals
Out[93]= {{-1.397, -0.136989}, {-0.136989, 0.123689}, {0.123689,
0.312232}, {0.312232, 0.478551}, {0.478551, 0.652482}, {0.652482,
0.829642}, {0.829642, 1.02801}, {1.02801, 1.27609}, {1.27609,
1.6237}, {1.6237, 4.04219}}
Verify that the splitting points separate data nearly evenly:
In[94]:= Table[Count[data, x_ /; i[[1]] <= x < i[[2]]], {i, intervals}]
Out[94]= {999, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000}