How to evaluate emprical cdf at given points in Matlab? - matlab

Suppose I have a sequence of scalar points subject to a unknown distribution.
From the sequence of points, we can get the empirical cdf.
I was wondering if there is some way in Matlab to evaluate this empirical cdf at any point? For example, evaluate it at the same sequence of points that are used to build the empirical cdf?
I have looked up the function ecdf at http://www.mathworks.com/help/stats/ecdf.html. Its usage is [f,x] = ecdf(y), where the empirical cdf from data yis evaluated atx, butx` doesn't seem to be specifiable.
Thanks and regards!

Assuming you have the output of the function, two vectors f and x and you want to find the emperical cdf at point x_of_interest, this is what you can do:
max(f(x<=x_of_interest))
Or maybe you want to use minand >=, but I think the above formula is correct.

It seems like x are unique points in y with their CDF.

I'm not sure that's what you meant, but I also needed to convert a vector of data values to the corresponding vector of empirical CDFs, with the same ordering.
Actually, instead of the usual definition cdf(x) = Prob(X <= x) I prefer the more symmetric definition cdf(x) = Prob(X < x) + 1/2 * Prob(X == x) , which is more suitable to cases with ties. Now the computation is a one-liner, but with the help of the tiedrank() function from the statistical toolbox:
cdf = (tiedrank(data) - 1/2) / length(data) ;
For example,
data = [3 2 4 2 1] ;
yields
cdf = [0.7 0.4 0.9 0.4 0.1] ;

A good approach might be to use interpolation to find the closest "x" for each point you want to evaluate, and the relative "f" value. You can find how to use interp1 for this purpose here:
determining-the-value-of-ecdf-at-a-point-using-matlab

Related

Plot log(n over k)

I've never used Matlab before and I really don't know how to fix the code. I need to plot log(1000 over k) with k going from 1 to 1000.
y = #(x) log(nchoosek(1000,x));
fplot(y,[1 1000]);
Error:
Warning: Function behaves unexpectedly on array inputs. To improve performance, properly
vectorize your function to return an output with the same size and shape as the input
arguments.
In matlab.graphics.function.FunctionLine>getFunction
In matlab.graphics.function.FunctionLine/updateFunction
In matlab.graphics.function.FunctionLine/set.Function_I
In matlab.graphics.function.FunctionLine/set.Function
In matlab.graphics.function.FunctionLine
In fplot>singleFplot (line 241)
In fplot>#(f)singleFplot(cax,{f},limits,extraOpts,args) (line 196)
In fplot>vectorizeFplot (line 196)
In fplot (line 166)
In P1 (line 5)
There are several problems with the code:
nchoosek does not vectorize on the second input, that is, it does not accept an array as input. fplot works faster for vectorized functions. Otherwise it can be used, but it issues a warning.
The result of nchoosek is close to overflowing for such large values of the first input. For example, nchoosek(1000,500) gives 2.702882409454366e+299, and issues a warning.
nchoosek expects integer inputs. fplot uses in general non-integer values within the specified limits, and so nchoosek issues an error.
You can solve these three issues exploiting the relationship between the factorial and the gamma function and the fact that Matlab has gammaln, which directly computes the logarithm of the gamma function:
n = 1000;
y = #(x) gammaln(n+1)-gammaln(x+1)-gammaln(n-x+1);
fplot(y,[1 1000]);
Note that you get a plot with y values for all x in the specified range, but actually the binomial coefficient is only defined for non-negative integers.
OK, since you've gotten spoilers for your homework exercise anyway now, I'll post an answer that I think is easier to understand.
The multiplicative formula for the binomial coefficient says that
n over k = producti=1 to k( (n+1-i)/i )
(sorry, no way to write proper formulas on SO, see the Wikipedia link if that was not clear).
To compute the logarithm of a product, we can compute the sum of the logarithms:
log(product(xi)) = sum(log(xi))
Thus, we can compute the values of (n+1-i)/i for all i, take the logarithm, and then sum up the first k values to get the result for a given k.
This code accomplishes that using cumsum, the cumulative sum. Its output at array element k is the sum over all input array elements from 1 to k.
n = 1000;
i = 1:1000;
f = (n+1-i)./i;
f = cumsum(log(f));
plot(i,f)
Note also ./, the element-wise division. / performs a matrix division in MATLAB, and is not what you need here.
syms function type reproduces exactly what you want
syms x
y = log(nchoosek(1000,x));
fplot(y,[1 1000]);
This solution uses arrayfun to deal with the fact that nchoosek(n,k) requires k to be a scalar. This approach requires no toolboxes.
Also, this uses plot instead of fplot since this clever answer already addresses how to do with fplot.
% MATLAB R2017a
n = 1000;
fh=#(k) log(nchoosek(n,k));
K = 1:1000;
V = arrayfun(fh,K); % calls fh on each element of K and return all results in vector V
plot(K,V)
Note that for some values of k greater than or equal to 500, you will receive the warning
Warning: Result may not be exact. Coefficient is greater than 9.007199e+15 and is only accurate to 15 digits
because nchoosek(1000,500) = 2.7029e+299. As pointed out by #Luis Mendo, this is due to realmax = 1.7977e+308 which is the largest real floating-point supported. See here for more info.

find the line which best fits to the data

I'm trying to find the line which best fits to the data. I use the following code below but now I want to have the data placed into an array sorted so it has the data which is closest to the line first how can I do this? Also is polyfit the correct function to use for this?
x=[1,2,2.5,4,5];
y=[1,-1,-.9,-2,1.5];
n=1;
p = polyfit(x,y,n)
f = polyval(p,x);
plot(x,y,'o',x,f,'-')
PS: I'm using Octave 4.0 which is similar to Matlab
You can first compute the error between the real value y and the predicted value f
err = abs(y-f);
Then sort the error vector
[val, idx] = sort(err);
And use the sorted indexes to have your y values sorted
y2 = y(idx);
Now y2 has the same values as y but the ones closer to the fitting value first.
Do the same for x to compute x2 so you have a correspondence between x2 and y2
x2 = x(idx);
Sembei Norimaki did a good job of explaining your primary question, so I will look at your secondary question = is polyfit the right function?
The best fit line is defined as the line that has a mean error of zero.
If it must be a "line" we could use polyfit, which will fit a polynomial. Of course, a "line" can be defined as first degree polynomial, but first degree polynomials have some properties that make it easy to deal with. The first order polynomial (or linear) equation you are looking for should come in this form:
y = mx + b
where y is your dependent variable and X is your independent variable. So the challenge is this: find the m and b such that the modeled y is as close to the actual y as possible. As it turns out, the error associated with a linear fit is convex, meaning it has one minimum value. In order to calculate this minimum value, it is simplest to combine the bias and the x vectors as follows:
Xcombined = [x.' ones(length(x),1)];
then utilized the normal equation, derived from the minimization of error
beta = inv(Xcombined.'*Xcombined)*(Xcombined.')*(y.')
great, now our line is defined as Y = Xcombined*beta. to draw a line, simply sample from some range of x and add the b term
Xplot = [[0:.1:5].' ones(length([0:.1:5].'),1)];
Yplot = Xplot*beta;
plot(Xplot, Yplot);
So why does polyfit work so poorly? well, I cant say for sure, but my hypothesis is that you need to transpose your x and y matrixies. I would guess that that would give you a much more reasonable line.
x = x.';
y = y.';
then try
p = polyfit(x,y,n)
I hope this helps. A wise man once told me (and as I learn every day), don't trust an algorithm you do not understand!
Here's some test code that may help someone else dealing with linear regression and least squares
%https://youtu.be/m8FDX1nALSE matlab code
%https://youtu.be/1C3olrs1CUw good video to work out by hand if you want to test
function [a0 a1] = rtlinreg(x,y)
x=x(:);
y=y(:);
n=length(x);
a1 = (n*sum(x.*y) - sum(x)*sum(y))/(n*sum(x.^2) - (sum(x))^2); %a1 this is the slope of linear model
a0 = mean(y) - a1*mean(x); %a0 is the y-intercept
end
x=[65,65,62,67,69,65,61,67]'
y=[105,125,110,120,140,135,95,130]'
[a0 a1] = rtlinreg(x,y); %a1 is the slope of linear model, a0 is the y-intercept
x_model =min(x):.001:max(x);
y_model = a0 + a1.*x_model; %y=-186.47 +4.70x
plot(x,y,'x',x_model,y_model)

How to perform indefinite integration of this function in MATLAB?

I need to perform the following operations as shown in the image. I need to calculate the value of function H for different inputs(x) using MATLAB.
I am giving the following command from Symbolic Math Toolbox
syms y t x;
f1=(1-exp(-y))/y;
f2=-t+3*int(f1,[0,t]);
f3=exp(f2);
H=int(f3,[0,x]);
but the value of 2nd integral i.e. integral in the function H can't be calculated and my output is of the form of
H =
int(exp(3*eulergamma - t - 3*ei(-t) + 3*log(t)), t, 0, x)
If any of you guys know how to evaluate this or have a different idea about this, please share it with me.
Directly Finding the Numerical Solution using integral:
Since you want to calculate H for different values of x, so instead of analytical solution, you can go for numerical solution.
Code:
syms y t;
f1=(1-exp(-y))/y; f2=-t+3*int(f1,[0,t]); f3=exp(f2);
H=integral(matlabFunction(f3),0,100) % Result of integration when x=100
Output:
H =
37.9044
Finding the Approximate Analytical Solution using Monte-Carlo Integration:
It probably is an "Elliptic Integral" and cannot be expressed in terms of elementary functions. However, you can find an approximate analytical solution using "Monte-Carlo Integration" according to which:
where f(c) = 1/n Σ f(xᵢ)
Code:
syms x y t;
f1=(1-exp(-y))/y; f2=-t+3*int(f1,[0,t]); f3=exp(f2);
f3a= matlabFunction(f3); % Converting to function handle
n = 1000;
t = x*rand(n,1); % Generating random numbers within the limits (0,x)
MCint(x) = x * mean(f3a(t)); % Integration
H= double(MCint(100)) % Result of integration when x=100
Output:
H =
35.2900
% Output will be different each time you execute it since it is based
% on generation of random numbers
Drawbacks of this approach:
Solution is not exact but approximated.
Greater the value of n, better the result and slower the code execution speed.
Read the documentation of matlabFunction, integral, Random Numbers Within a Specific Range, mean and double for further understanding of the code(s).

How to find if a value lies in between bounds of a histogram?

I have an empirical set of data (Hypothetically x=normrnd(10,3,1000,1);) which has a cumulative distribution function as follows:
I also have a set of data x1=[11,11.1,10.1]. I'd like to find the probability of finding the values of x1 if they came from the distribution x. If it was a continuous known function I could evaluate it exactly but I'd like to do it from the data I have. Any thoughts?
By hand I would find the value on the x axis and trace up to the line and across to the F(x) axis (see figure 1).
EDIT:
size(x1)
10,0000
So I now have found out how to get the data that plots F(x)
handles=cdfplot(X);
xdata=get(handles,'XData');
ydata=get(handles,'YData');
I think now it's a case of finding the location of x in an interval in xdata and subsequently the location in ydata.
e.g.
for i=1:length(x)
for j=1:length(xdata)
if x(i,1)<=xdata(jj,1)
X(i)=xdata(jj,1);
end
end
end
Y=ydata(X);
Is this the most elegant way?
There is a much more elegant way to do that using bsxfun. Also, you can just calculate empirical CDF using ecdf instead of cdfplot (unless you actually need the plot):
x = normrnd(10,3,1000,1);
[f_data, x_data] = ecdf(x);
x1 = [11, 11.1, 10.1];
idx = sum(bsxfun(#le, x_data(:)', x1(:)), 2);
y1 = f_data(idx);

Square wave function for Matlab

I'm new to programming in Matlab. I'm trying to figure out how to calculate the following function:
I know my code is off, I just wanted to start with some form of the function. I have attempted to write out the sum of the function in the program below.
function [g] = square_wave(n)
g = symsum(((sin((2k-1)*t))/(2k-1)), 1,n);
end
Any help would be much appreciated.
Update:
My code as of now:
function [yout] = square_wave(n)
syms n;
f = n^4;
df = diff(f);
syms t k;
f = 1; %//Define frequency here
funcSum = (sin(2*pi*(2*k - 1)*f*t) / (2*k - 1));
funcOut = symsum(func, v, start, finish);
xsquare = (4/pi) * symsum(funcSum, k, 1, Inf);
tVector = 0 : 0.01 : 4*pi; %// Choose a step size of 0.01
yout = subs(xsquare, t, tVector);
end
Note: This answer was partly inspired by a previous post I wrote here: How to have square wave in Matlab symbolic equation - However, it isn't quite the same, which is why I'm providing an answer here.
Alright, so it looks like you got the first bit of the question right. However, when you're multiplying things together, you need to use the * operator... and so 2k - 1 should be 2*k - 1. Ignoring this, you are symsuming correctly given that square wave equation. The input into this function is only one parameter only - n. What you see in the above equation is a Fourier Series representation of a square wave. A bastardized version of this theory is that you can represent a periodic function as an infinite summation of sinusoidal functions with each function weighted by a certain amount. What you see in the equation is in fact the Fourier Series of a square wave.
n controls the total number of sinusoids to add into the equation. The more sinusoids you have, the more the function is going to look like a square wave. In the question, they want you to play around with the value of n. If n becomes very large, it should start approaching what looks like to be a square wave.
The symsum will represent this Fourier Series as a function with respect to t. What you need to do now is you need to substitute values of t into this expression to get the output amplitude for each value t. They define that for you already where it's a vector from 0 to 4*pi with 1001 points in between.
Define this vector, then you'll need to use subs to substitute the time values into the symsum expression and when you're done, cast them back to double so that you actually get a numeric vector.
As such, your function should simply be this:
function [g] = square_wave(n)
syms t k; %// Define t and k
f = sin((2*k-1)*t)/(2*k-1); %// Define function
F = symsum(f, k, 1, n); %// Define Fourier Series
tVector = linspace(0, 4*pi, 1001); %// Define time points
g = double(subs(F, t, tVector)); %// Get numeric output
end
The first line defines t and k to be symbolic because t and k are symbolic in the expression. Next, I'll define f to be the term inside the summation with respect to t and k. The line after that defines the actual sum itself. We use f and sum with respect to k as that is what the summation calls for and we sum from 1 up to n. Last but not least, we define a time vector from 0 to 4*pi with 1001 points in between and we use subs to substitute the value of t in the Fourier Series with all values in this vector. The result should be a 1001 vector which I then cast to double to get a numerical result and we get your desired output.
To show you that this works, we can try this with n = 20. Do this in the command prompt now:
>> g = square_wave(20);
>> t = linspace(0, 4*pi, 1001);
>> plot(t, g);
We get:
Therefore, if you make n go higher... so 200 as they suggest, you'll see that the wave will eventually look like what you expect from a square wave.
If you don't have the Symbolic Math Toolbox, which symsum, syms and subs relies on, we can do it completely numerically. What you'll have to do is define a meshgrid of points for pairs of t and n, substitute each pair into the sequence equation for the Fourier Series and sum up all of the results.
As such, you'd do something like this:
function [g] = square_wave(n)
tVector = linspace(0, 4*pi, 1001); %// Define time points
[t,k] = meshgrid(tVector, 1:n); %// Define meshgrid
f = sin((2*k-1).*t)./(2*k-1); %// Define Fourier Series
g = sum(f, 1); %// Sum up for each time point
end
The first line of code defines our time points from 0 to 4*pi. The next line of code defines a meshgrid of points. How this works is that for t, each column defines a unique time point, so the first column is 200 zeroes, up to the last column which is a column of 200 4*pi values. Similarly for k, each row denotes a unique n value so the first row is 1001 1s, followed by 1001 2s, up to 1001 1s. The implications with this is now each column of t and k denotes the right (t,n) pairs to compute the output of the Fourier series for each time that is unique to that column.
As such, you'd simply use the sequence equation and do element-wise multiplication and division, then sum along each individual column to finally get the square wave output. With the above code, you will get the same result as above, and it'll be much faster than symsum because we're doing it numerically now and not doing it symbolically which has a lot more computational overhead.
Here's what we get when n = 200:
This code with n=200 ran in milliseconds whereas the symsum equivalent took almost 2 minutes on my machine - Mac OS X 10.10.3 Yosemite, 16 GB RAM, Intel Core i7 2.3 GHz.