Index when mean is constant - matlab

I am relatively new to matlab. I found the consecutive mean of a set of 1E6 random numbers that has mean and standard deviation. Initially the calculated mean fluctuate and then converges to a certain value.
I will like to know the index (i.e 100th position) at which the mean converges. I have no idea how to do that.
I tried using the logical operator but i have to go through 1e6 data points. Even with that i still can't find the index.
Y_c= sigma_c * randn(n_r, 1) + mu_c; %Random number creation
Y_f=sigma_f * randn(n_r, 1) + mu_f;%Random number creation
P_u=gamma*(B*B)/2.*N_gamma+q*B.*N_q + Y_c*B.*N_c; %Calculation of Ultimate load
prog_mu=cumsum(P_u)./cumsum(ones(size(P_u))); %Progressive Cumulative Mean of system response
logical(diff(prog_mu==0)); %Find index

I suspect the issue is that the mean will never truly be constant, but will rather fluctuate around the "true mean". As such, you'll most likely never encounter a situation where the two consecutive values of the cumulative mean are identical. What you should do is determine some threshold value, below which you consider fluctuations in the mean to be approximately equal to zero, and compare the difference of the cumulative mean to that value. For instance:
epsilon = 0.01;
const_ind = find(abs(diff(prog_mu))<epsilon,1,'first');
where epsilon will be the threshold value you choose. The find command will return the index at which the variation in the cumulative mean first drops below this threshold value.
EDIT: As was pointed out, this method may potentially fail if the first few random numbers are generated such that the difference between them is less than the epsilon value, but have not yet converged. I would like to suggest a different approach, then.
We calculate the cumulative means, as before, like so:
prog_mu=cumsum(P_u)./cumsum(ones(size(P_u)));
We also calculate the difference in these cumulative means, as before:
df_prog_mu = diff(prog_mu);
Now, to ensure that conversion has been achieved, we find the first index where the cumulative mean is below the threshold value epsilon and all subsequent means are also below the threshold value. To phrase this another way, we want to find the index after the last position in the array where the cumulative mean is above the threshold:
conv_index = find(~df_prog_mu,1,'last')+1;
In doing so, we guarantee that the value at the index, and all subsequent values, have converged below your predetermined threshold value.

I wouldn't imagine that the mean would suddenly become constant at a single index. Wouldn't it asymptotically approach a constant value? I would reccommend a for loop to calculate the mean (it sounds like maybe you've already done this part?) like this:
avg = [];
for k=1:length(x)
avg(k) = mean(x(1:k));
end
Then plot the consecutive mean:
plot(avg)
hold on % this will allow us to plot more data on the same figure later
If you're trying to find the point at which the consecutive mean comes within a certain range of the true mean, try this:
Tavg = 5; % or whatever your true mean is
err = 0.01; % the range you want the consecutive mean to reach before we say that it "became constant"
inRange = avg>(Tavg-err) & avg<(Tavg+err); % gives you a binary logical array telling you which values fell within the range
q = 1000; % set this as high as you can while still getting a value for consIndex
constIndex = [];
for k=1:length(inRange)
if(inRange(k) == sum(inRange(k:k+q))/(q-1);)
constIndex = k;
end
end
The below answer takes a similar approach but makes an unsafe assumption that the first value to fall within the range is the value where the function starts to converge. Any value could randomly fall within that range. We need to make sure that the following values also fall within that range. In the above code, you can edit "q" and "err" to optimize your result. I would recommend double checking it by plotting.
plot(avg(constIndex), '*')

Related

'Find' function working incorrectly, have tried floating point accuracy resolution

I have vertically concatenated files from my directory into a matrix that is about 60000 x 15 in size (verified).
d=dir('*.log');
n=length(d);
data=[];
for k=1:n
data{k}=importdata(d(k).name);
end
total=[];
for k=1:n
total=[total;data{n}];
end
I am using a the following 32-iteration loop and the 'Find" function to locate row numbers where the final column is an integer corresponding to the integer iteration of the loop:
for i=1:32
v=[];
vn=[];
[v,vn]=find(abs(fix(i)-fix(total))<eps);
g=length(v)
end
I have tried to account for the floating point accuracy by using 'fix' on values of 'i' and values from matrix 'total', in addition to taking their absolute difference and checking it to be less than a tolerance of 'eps' (floating-point relative accuracy function), up to a tolerance of .99.
The 'Find' function is not working correctly. It is only working for certain integers (although it should be locating all of them (1-32)), and for the integers it does find the values are incomplete.
What is the problem here? If 'Find' is inadequate for this purpose, what is a suitable alternative?
You are getting a lot of zeros because you are looking not just at the 15th column of data but the entire data matrix so you are going to have a lot of non-integers.
Also, you're using fix on both numbers and since floating point errors can cause the number to be slightly above and below the desired integer, this will cause the ones that are below to round down an integer lower than what you'd expect. You should use round to round to the nearest integer instead.
Rather than using find to do this, I would use simple boolean logic to determine the value of the last column
for k = 1:32
% Compare column 15 to the current index
matches = abs(total(:,end) - k) < eps;
% Do stuff with these matches
g = sum(matches); % Count the matches
end
Depending on what you want to actually do with the data, you may be able to use the last column as an input to accumarray to perform an operation on each group.
As a side note, you can replace the first chunk of code with
d = dir('*.log');
data = cellfun(#importdata, {d.name}, 'UniformOutput', false);
total = cat(1, data{:});

How to calculate the "rest value" of a plot?

Didn't know how to paraphrase the question well.
Function for example:
Data:https://www.dropbox.com/s/wr61qyhhf6ujvny/data.mat?dl=0
In this case how do I calculate that the rest point of this function is ~1? I have access to the vector that makes the plot.
I guess the mean is an approximation but in some cases it can be pretty bad.
Under the assumption that the "rest" point is the steady-state value in your data and the fact that the steady-state value happens the majority of the times in your data, you can simply bin all of the points and use each unique value as a separate bin. The bin with the highest count should correspond to the steady-state value.
You can do this by a combination of histc and unique. Assuming your data is stored in y, do this:
%// Find all unique values in your data
bins = unique(y);
%// Find the total number of occurrences per unique value
counts = histc(y, bins);
%// Figure out which bin has the largest count
[~,max_bin] = max(counts);
%// Figure out the corresponding y value
ss_value = bins(max_bin);
ss_value contains the steady-state value of your data, corresponding to the most occurring output point with the assumptions I laid out above.
A minor caveat with the above approach is that this is not friendly to floating point data whose unique values are generated by floating point values whose decimal values beyond the first few significant digits are different.
Here's an example of your data from point 2300 to 2320:
>> format long g;
>> y(2300:2320)
ans =
0.99995724232555
0.999957488454868
0.999957733165346
0.999957976465197
0.999958218362579
0.999958458865564
0.999958697982251
0.999958935720613
0.999959172088623
0.999959407094224
0.999959640745246
0.999959873049548
0.999960104014889
0.999960333649014
0.999960561959611
0.999960788954326
0.99996101464076
0.999961239026462
0.999961462118947
0.999961683925704
0.999961904454139
Therefore, what I'd recommend is to perhaps round so that the first 5 or so significant digits are maintained.
You can do this to your dataset before you continue:
num_digits = 5;
y_round = round(y*(10^num_digits))/(10^num_digits);
This will first multiply by 10^n where n is the number of digits you desire so that the decimal point is shifted over by n positions. We round this result, then divide by 10^n to bring it back to the scale that it was before. If you do this, for those points that were 0.9999... where there are n decimal places, these will get rounded to 1, and it may help in the above calculations.
However, more recent versions of MATLAB have this functionality already built-in to round, and you can just do this:
num_digits = 5;
y_round = round(y,num_digits);
Minor Note
More recent versions of MATLAB discourage the use of histc and recommend you use histcounts instead. Same function definition and expected inputs and outputs... so just replace histc with histcounts if your MATLAB version can handle it.
Using the above logic, you could also use the median too. If the majority of data is fluctuating around 1, then the median would have a high probability that the steady-state value is chosen... so try this too:
ss_value = median(y_round);

Finding the best threshold level without a for loop in MATLAB

Let A and B be two matrices of the same size. For a matrix M, let ht(M,t) threshold all the entries of M by t. That is All entries whose absolute value is less than t are set to 0. Suppose I want to find the optimal threshold t such that norm(ht(A,t)-B,'fro')^2 is minimized.
The only way that I can see to do this is deficient: do a for loop over the unique values of A and threshold A and setting C=ht(A,t)-B, compute sum(sum(C.*C)).
This is just too slow when A is large. I have considered sorting the elements of A and finding some efficient way to set a few entries to zero at a time, but I'm not sure this can all be done without a for loop.
Is there a way to do it?
Here's a very simple example (so simple a for loop works easily in this case):
B =
0.101508820368332 0
0 0.301996943246957
Set
A=B+.1*ones(2)
A =
0.201508820368332 0.1
0.1 0.401996943246957
Simple inspection shows that if we zero out the off-diagonal entries of A we minimize the difference between A and B. There are 3 possible threshold values, given by unique(A)=[.1,.2015,.402]. Given a potential threshold value t, we can hard threshold A by:
function [A_thresholded] = ht(A,t)
%
A_thresholded = A .* (abs(A)>t);
The form of the data in a matrix is irrelevant. You can convert them to vectors and simply compute the square-norm. In fact, you can sort the contents of A in increasing order (and permute B to preserve pairing). When you increase the threshold to include one more value in A, the norm only changes by that one increment. Therefore, you can find your solution in O(n log n). Hope this helps.

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criteria is that each partition must be in between 60 to 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem into a linear programming problem. You would choose an objective function that would maximize the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
(source: codecogs.com)
.
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
(source: codecogs.com)
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are matrices and vectors that define your inequality constraints. Because you want equality, you'd set this to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set that you are bounding with, so this would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each value of the parameters are bounded between these two ranges.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Due to the fact that linprog can generate floating point solutions, what you can do is if you want to ensure that for a given value of n, you could loop over your results, take the floor of the result and subtract with the final result, and sum over the result. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(#(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem. This problem is NP-complete. Basically, what this means is that it is not known whether there is a polynomial-time algorithm that can be used to solve this kind of problem and that the only way to get a valid solution is to brute-force each possible solution and check if it works with the specified problem. Usually, brute-forcing solutions requires exponential time, which is very intractable for large problems. Another interesting fact is that modern cryptography algorithms use NP-Complete intractability as part of their ciphertext and encrypting. Basically, they're banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key will take more than the current age of the universe. Check out this cool Wikipedia post for more details in intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
(source: quickmeme.com)
The NP problem is amongst one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

What is this code doing? Machine Learning

I'm just learning matlab and I have a snippet of code which I don't understand the syntax of. The x is an n x 1 vector.
Code is below
p = (min(x):(max(x)/300):max(x))';
The p vector is used a few lines later to plot the function
plot(p,pp*model,'r');
It generates an arithmetic progression.
An arithmetic progression is a sequence of numbers where the next number is equal to the previous number plus a constant. In an arithmetic progression, this constant must stay the same value.
In your code,
min(x) is the initial value of the sequence
max(x) / 300 is the increment amount
max(x) is the stopping criteria. When the result of incrementation exceeds this stopping criteria, no more items are generated for the sequence.
I cannot comment on this particular choice of initial value and increment amount, without seeing the surrounding code where it was used.
However, from a naive perspective, MATLAB has a linspace command which does something similar, but not exactly the same.
Certainly looks to me like an odd thing to be doing. Basically, it's creating a vector of values p that range from the smallest to the largest values of x, which is fine, but it's using steps between successive values of max(x)/300.
If min(x)=300 and max(x)=300.5 then this would only give 1 point for p.
On the other hand, if min(x)=-1000 and max(x)=0.3 then p would have thousands of elements.
In fact, it's even worse. If max(x) is negative, then you would get an error as p would start from min(x), some negative number below max(x), and then each element would be smaller than the last.
I think p must be used to create pp or model somehow as well so that the plot works, and without knowing how I can't suggest how to fix this, but I can't think of a good reason why it would be done like this. using linspace(min(x),max(x),300) or setting the step to (max(x)-min(x))/299 would make more sense to me.
This code examines an array named x, and finds its minimum value min(x) and its maximum value max(x). It takes the maximum value and divides it by the constant 300.
It doesn't explicitly name any variable, setting it equal to max(x)/300, but for the sake of explanation, I'm naming it "incr", short for increment.
And, it creates a vector named p. p looks something like this:
p = [min(x), min(x) + incr, min(x) + 2*incr, ..., min(x) + 299*incr, max(x)];