Assessing parameter bias in simulation

set.seed(123456)
reps <- 500 # no. of repetitions
par.est <- matrix(NA, nrow= reps, ncol=2) # empty matrix to store the estimates
b0 <- .2 # true value for the intercept
b1 <- .5 # true value for the slope
n <- 1000 # sample size
X <- runif(n, -1, 1) # create a sample of n obs on the independent variable x
for (i in 1:reps){ # start of the loop
Y <- b0 +b1*X + rnorm(n,0,1) # the true DGP, with N(0,1) error
model <-lm(Y~X) # estimate the OLS model
par.est[i,1] <- model$coef[1] # put the estimate for the intercept in the 1st column
par.est[i, 2] <- model$coef[2] # put the estimates for the coefficient of X in the 2nd column
}
Can someone show me how to assess the bias in the estimates of the intercept and the slope?
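One common reading of "bias" here is the average estimate across repetitions minus the true value. The sketch below redoes the simulation in plain Python under that assumption. The helper names `ols_fit` and `simulate_bias` are made up for illustration, and `n` and `reps` are reduced to keep it fast. In the R code above the same quantity would be `colMeans(par.est) - c(b0, b1)`.

```python
import random
import statistics

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate_bias(b0=0.2, b1=0.5, n=200, reps=300, seed=123456):
    """Monte Carlo bias of the OLS estimates: mean(estimate) - true value."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]  # X is drawn once, as in the R code
    est0, est1 = [], []
    for _ in range(reps):
        y = [b0 + b1 * xi + rng.gauss(0, 1) for xi in x]  # true DGP with N(0,1) error
        intercept, slope = ols_fit(x, y)
        est0.append(intercept)
        est1.append(slope)
    bias0 = statistics.fmean(est0) - b0
    bias1 = statistics.fmean(est1) - b1
    # Monte Carlo standard errors, to judge whether a bias is distinguishable from 0
    se0 = statistics.stdev(est0) / reps ** 0.5
    se1 = statistics.stdev(est1) / reps ** 0.5
    return (bias0, se0), (bias1, se1)

(bias0, se0), (bias1, se1) = simulate_bias()
print(f"intercept bias {bias0:+.4f} (MC s.e. {se0:.4f})")
print(f"slope bias     {bias1:+.4f} (MC s.e. {se1:.4f})")
```

A bias that is small relative to its Monte Carlo standard error is consistent with an unbiased estimator, which is what OLS should give for this DGP.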

Related

approximate low rank matrix with weighted sum of rows

I'd like to approximate a given n x m matrix A with n >> m as a weighted sum W of some k rows B (ideally selected from A, but could also be arbitrary). The weights must sum up to 1 and need to be positive.
import numpy as np
n = 1000 # rows
m = 3 # columns
k = 2 # hidden rank
# create random matrix with rank k
A = np.random.rand(n, k).dot(np.random.rand(k, m))
# estimate hidden rank
u, s, vt = np.linalg.svd(A, full_matrices=False, compute_uv=True)
k_est = np.count_nonzero(~np.isclose(s, 0))
# truncate to k_est
B = np.diag(s[:k_est]) @ vt[..., :k_est, :]
W = u[..., :k_est]
# do some magic with B and W to come up with
assert np.all(W >= 0)
assert np.all(np.isclose(W.sum(1), 1))
assert np.all(np.isclose(A, W @ B))
I tried SVD, which is able to reproduce A as W @ B, but the weights are negative and don't sum up to 1.
From my gut feeling it seems like I'm searching for a convex hull of A, but with only k_est points.

Optimization under constraints

I have a question regarding optimization.
I have a matrix x with 3 columns and a certain number of rows (max 200). Each row represents a candidate. Column 1 contains a score (between 0 and 1), column 2 contains the kind of candidate (there are 10 kinds in total, labeled from 1 to 10), and column 3 contains the amount of each candidate. One thing to take into consideration: the amount can be NEGATIVE.
What I would like to do is select at most 35 of these candidates so as to maximize the sum of their scores (column 1), under the constraint that each kind accounts for at most 10%, computed in the following way: percentage of kind 1 = sum of the amounts of kind 1 divided by the sum of all amounts.
In the end, I would like to have a set of at most 35 candidates that satisfies the constraints and maximizes the sum of their scores.
Here is the code I have come up with so far, but I am struggling with the 10% constraint, as it does not seem to be taken into account:
rng('default');
clc;
clear;
n = 100;
maxSize = 35;
%%%TOP BASKET
nbCandidates = 100;
score = rand(100,1)/10+0.9;
quantity = rand(100,1)*100000;
type = ceil(rand(100,1)*10)
typeMask = zeros(n,10);
for i=1:10
typeMask(:,i) = type(:,1) == i;
end
fTop = -score;
intconTop = [1:1:n];
%Write the linear INEQUALITY constraints:
A = [ones(1,n);bsxfun(@times,typeMask,quantity)'/sum(type.*quantity)];
b = [maxSize;0.1*ones(10,1)];
%Write the linear EQUALITY constraints:
Aeq = [];
beq = [];
%Write the BOUND constraints:
lb = zeros(n,1);
ub = ones(n,1); % Enforces i1,i2,...in binary
x = intlinprog(fTop,intconTop,A,b,Aeq,beq,lb,ub);
I would be grateful for some advice on where I'm going wrong!
A linear program for your model might look something like this:
n is the number of candidates.
S[x] is candidate x's score.
A[i][x] is the amount of candidate x for kind i (A[i][x] can be positive or negative, like you said).
T[i] is the total amount of all candidates for kind i.
I[x] is 1 if element x is to be included, and 0 if element x is to be excluded.
The function f which you want to optimize is a function of S[x] and I[x]. You can think of S and I as n-dimensional vectors, so the function you want to optimize is their dot-product.
f() = DotProduct(I, S)
This is equivalent to the linear function I1 * S1 + I2 * S2 + ... + In * Sn.
We can formulate all of the constraints in this way to get a set of linear functions whose coefficients are the components of an n dimensional vector that we can dot with I, the parameters to optimize.
For the constraint that we can only take 35 elements at most, let C1() be a function which computes the total number of elements.
Then the first constraint can be formalized as C1() <= 35 and C1() is a linear function which can be computed thusly:
Let j be an n dimensional vector with each component equal to 1: j = <1,1,...,1>.
C1() = DotProduct(I, j)
So C1() <= 35 is a linear inequality equivalent to:
I1 * 1 + I2 * 1 + ... + In * 1 <= 35
I1 + I2 + ... + In <= 35
We need to add a slack variable x1 here to turn this into an equality relation:
I1 + I2 + ... + In + x1 = 35
For the constraint that we can only take 10% of each kind, we will have a function C2[i]() for each kind i (you said there are 10 in all). C2[i]() computes the amount taken for kind i given the candidates we have selected:
C2[1]() <= .1 * T1
C2[2]() <= .1 * T2
...
C2[10]() <= .1 * T10
We compute C2[i]() like this:
Let k be an n dimensional vector equal to <A[i][1], A[i][2], ..., A[i][n]>; each component is the amount of the corresponding candidate for kind i.
Then DotProduct(I, k) = I1 * A[i][1] + I2 * A[i][2] + ... + In * A[i][n] is the total amount we are taking of kind i given I, the vector which captures what elements we are including.
So C2[i]() = DotProduct(I, k)
Now that we know how to compute C2[i](), we need to add a slack variable to turn this into an equality relation:
C2[i]() + x[i + 1] = .1 * T[i]
Here x's subscript is [i + 1] because x1 is already used as a slack variable for the previous constraint.
In summary, the linear program would look like this (adding 11 slack variables x1, x2, ..., x11, one for each constraint that is an inequality):
Let:
V = <I1, I2, ..., In, x1, x2, ..., x11> (variables)
    |S1|
    |S2|
    |. |
    |. |
    |. |
P = |Sn|   (parameters of objective function)
    |0 |
    |0 |
    |. |
    |. |
    |. |
    |0 |
    |35    |
    |.1*T1 |
C = |.1*T2 |   (right-hand sides of constraining equality relations)
    |...   |
    |.1*T10|
     |1    |1    |...|1    |1|0|...|0|
     |A1,1 |A1,2 |...|A1,n |0|1|...|0|
CP = |A2,1 |A2,2 |...|A2,n |0|0|...|0|   (parameters of constraint functions)
     |...  |...  |...|...  |0|0|...|0|
     |A10,1|A10,2|...|A10,n|0|0|...|1|
Maximize:
V x P
Subject to:
CP x Transpose(V) = C
Hopefully this is clear, sorry for terrible formatting.
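To make the layout of CP, C and V concrete, here is a small sketch in plain Python that builds the constraint rows for a toy instance and checks that a given 0/1 selection, together with its slack values, satisfies CP x Transpose(V) = C. It follows the formulation in this answer (10% of T[i], the total amount of kind i over all candidates). The function names and the four-candidate data are invented for illustration, and no solver is involved.

```python
def build_constraints(amount, kind, n_kinds, max_items):
    """Build CP (coefficient rows) and C (right-hand sides).

    amount[x] and kind[x] describe candidate x; kinds are labeled 1..n_kinds.
    Columns are the n inclusion variables followed by one slack variable per row.
    """
    n = len(amount)
    n_rows = 1 + n_kinds
    # T[i]: total amount over all candidates of kind i
    totals = [sum(a for a, k in zip(amount, kind) if k == i)
              for i in range(1, n_kinds + 1)]
    cp = []
    # Row 0: I1 + ... + In + x1 = max_items
    cp.append([1.0] * n + [1.0 if r == 0 else 0.0 for r in range(n_rows)])
    # Row i: amount of kind i taken + slack = 0.1 * T[i]
    for i in range(1, n_kinds + 1):
        coeffs = [a if k == i else 0.0 for a, k in zip(amount, kind)]
        slack_cols = [1.0 if r == i else 0.0 for r in range(n_rows)]
        cp.append(coeffs + slack_cols)
    c = [float(max_items)] + [0.1 * t for t in totals]
    return cp, c

def slacks_for(selection, cp, c):
    """Slack values that make each row an equality for a 0/1 selection."""
    n = len(selection)
    return [rhs - sum(row[j] * selection[j] for j in range(n))
            for row, rhs in zip(cp, c)]

# Toy instance (made up): 4 candidates, 2 kinds, at most 2 picks.
amount = [50.0, 10.0, 5.0, 100.0]
kind = [1, 1, 2, 2]
cp, c = build_constraints(amount, kind, n_kinds=2, max_items=2)

selection = [0, 0, 1, 0]                   # pick only candidate 3
slack = slacks_for(selection, cp, c)
feasible = all(s >= 0 for s in slack)      # slack variables must be non-negative
print("slack:", slack, "feasible:", feasible)
```

A selection is feasible exactly when every slack value is non-negative. An integer programming solver would search over the 0/1 entries of I instead of checking one candidate selection by hand.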
I believe the MIP model can look like:
Here i are the data points and j indicates the type. For simplicity I assumed here every type has the same number of data points (i.e. Amount(i,j), Score(i,j) are matrices). It is easy to handle the more irregular case by restricting the summations.
The 10% rule is simply applied on the sum of the amounts. I hope that is the correct interpretation. Not sure if this is true if we have negative sums.

Matlab plot for exponential decay function

I have empirical data for 9 sets of patients. The data are in this format:
input = [10 -1 1
20 17956 1
30 61096 1
40 31098 1
50 18446 1
60 12969 1
95 7932 1
120 6213 1
188 4414 1
240 3310 1
300 3329 1
610 2623 1
1200 1953 1
1800 1617 1
2490 1559 1
3000 1561 1
3635 1574 1
4205 1438 1
4788 1448 1
];
calibrationfactor_wellcounter =1.841201569;
Here, the first column describes values of time and next one is concentration. As you can see, the concentration increases until a certain time and then decreases exponentially with increase in time.
If I plot these data, I obtain the following curve:
I would like to write a script that reproduces the behavior described above. Below is the script I have written, in which the concentration increases linearly up to a certain time and afterwards decays exponentially. However, when I plot this function I obtain a linear characteristic. Kindly let me know whether my logic is appropriate.
function c_o = Sample_function(td,t_max,a1,a2,a3,b1,b2,b3)
t =(0: 100 :5000); % time of the sample post injection in mins
c =(0 : 2275.3 :113765);
A_max= max(c);%Max value of Concentration (Peak of the curve)
c_o = zeros(size(t));
c_o(t>td & t<=t_max) = A_max*(t(t>td & t<=t_max)-td);
c_o(t>t_max)=(a1*exp(-b1*(t(t>t_max)-t_max)))+(a2*exp(-b2*(t(t>t_max)-t_max)))+(a3*exp(-b3*(t(t>t_max)-t_max)));
fprintf('plotting Data ...\n');
hold on;
%figure ;
plot(c_o,'erasemode','background');
xlabel('time of the sample in minutes ');
ylabel('Activity of the sample Ba/ml');
title (' Input function: Activity sample VS time ');
pause;
end
The figure I obtained is:
In the above plot the decay is linear instead of exponential. Please let me know how to obtain a 3rd-order decay. This is the line of code I have written to obtain it:
c_o(t>t_max)=(a1*exp(-b1*(t(t>t_max)-t_max)))+(a2*exp(-b2*(t(t>t_max)-t_max)))+(a3*exp(-b3*(t(t>t_max)-t_max)));
I've come up with a solution using the functionality of Matlab's Curve Fitting Toolbox. The fitting result looks very good. However, I've found that it strongly depends on the right choice of starting values for the parameters, which therefore have to be carefully chosen manually.
Starting from your variable input, let's define the independent and dependent variables for the fit, time and concentration,
t = input(:, 1);
c = input(:, 2);
and plot them:
plot(t, c, 'x')
axis([-100 5000 -2000 80000])
xlabel time
ylabel concentration
These data are to be modeled with a function with three pieces: 1) constantly 0 up to a time td, 2) linearly increasing between td and tmax, 3) decreasing as a sum of three different exponentials after time tmax. In addition, the function is continuous, so that the three pieces have to fit together seamlessly. The implementation of this model as a Matlab function:
function c = model(t, a1, a2, a3, b1, b2, b3, td, tmax)
c = zeros(size(t));
ind = (t > td) & (t < tmax);
c(ind) = (t(ind) - td) ./ (tmax - td) * (a1 + a2 + a3);
ind = (t >= tmax);
c(ind) = a1 * exp(-b1 * (t(ind) - tmax)) ...
+ a2 * exp(-b2 * (t(ind) - tmax)) + a3 * exp(-b3 * (t(ind) - tmax));
Model parameters appear to be treated internally by the Curve Fitting Toolbox as a vector ordered alphabetically by the parameter names, so to avoid confusion I sorted the parameters alphabetically in the definition of this function, too. a1 to a3 and b1 to b3 are the amplitudes and inverse time constants of the three exponentials, respectively.
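For readers who want to check the continuity claim outside Matlab, here is a scalar Python sketch of the same three-piece model. The parameter values are purely illustrative, not fitted. The point is that the linear ramp reaches a1 + a2 + a3 exactly at tmax, which is also the value of the exponential sum there.

```python
import math

def model(t, a1, a2, a3, b1, b2, b3, td, tmax):
    """0 up to td, linear rise until tmax, then a sum of three decaying exponentials."""
    if t <= td:
        return 0.0
    if t < tmax:
        # Linear ramp scaled so that it would hit a1 + a2 + a3 at t = tmax
        return (t - td) / (tmax - td) * (a1 + a2 + a3)
    dt = t - tmax
    return (a1 * math.exp(-b1 * dt)
            + a2 * math.exp(-b2 * dt)
            + a3 * math.exp(-b3 * dt))

# Illustrative parameter values (assumptions, not fit results)
params = dict(a1=20000, a2=20000, a3=20000,
              b1=0.01, b2=0.01, b3=0.01, td=10, tmax=30)

just_before = model(30 - 1e-9, **params)   # end of the linear piece
at_peak = model(30, **params)              # start of the exponential piece
print(just_before, at_peak)
```

The two printed values agree up to rounding, so the pieces join without a jump.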
Let's fit the model to the data:
ft = fittype('model(t, a1, a2, a3, b1, b2, b3, td, tmax)', 'independent', 't');
fo = fit(t, c, ft, ...
'StartPoint', [20000, 20000, 20000, 0.01, 0.01, 0.01, 10, 30], ...
'Lower', [0, 0, 0, 0, 0, 0, 0, 0])
As mentioned before, the fitting works well only if the algorithm gets decent starting values. Here I chose 20000 for the amplitudes a1 to a3, which is about one third of the maximum of the data; 0.01 for b1 to b3, corresponding to a time constant of about 100; 30, the time of the data maximum, for tmax; and 10 as a rough estimate of the starting time td.
The output of fit:
fo =
General model:
fo(t) = model(t, a1, a2, a3, b1, b2, b3, td, tmax)
Coefficients (with 95% confidence bounds):
a1 = 2510 (-2.48e+07, 2.481e+07)
a2 = 1.044e+04 (-7.393e+09, 7.393e+09)
a3 = 6.506e+04 (-4.01e+11, 4.01e+11)
b1 = 0.0001465 (7.005e-05, 0.0002229)
b2 = 0.01049 (0.006933, 0.01405)
b3 = 0.09134 (0.08623, 0.09644)
td = 17.97 (-3.396e+07, 3.396e+07)
tmax = 26.78 (-6.748e+07, 6.748e+07)
I can't decide whether these values make sense physiologically. The estimates also don't appear to be too well defined, since many of the confidence intervals are huge and actually include 0. The documentation isn't clear on this, but I assume the confidence bounds are nonsimultaneous, which means it is possible that the huge intervals simply indicate strong correlations between the estimates of different parameters.
Plotting the data together with the fitted model
plot(t, c, 'x')
hold all
ts = 0 : 5000;
plot(ts, model(ts, fo.a1, fo.a2, fo.a3, fo.b1, fo.b2, fo.b3, fo.td, fo.tmax))
axis([-100 5000 -2000 80000])
xlabel time
ylabel concentration
shows that the fit is excellent:
A close-up of the more interesting initial part:
Note that the estimated time and value of the true maximal concentration (27, 78000) depends only on the fit to the following decreasing part of the data, since the linear increase is characterized only by one data point, which does not pose a constraint.
The results indicate that the data are not sufficient to obtain precise estimates of the model parameters. You should consider either increasing the sampling rate of the data, particularly up to time 500, or decreasing the complexity of the model, e.g. by using a sum of only two exponentials, or both.
Try this code from this question:
x = input(:,1);
c = input(:,2);
c_0 = piecewiseFunction(x, max(c), td,t_max,a1,a2,a3,b1,b2,b3)
with:
function y = piecewiseFunction(x,A_max,td,t_max,a1,a2,a3,b1,b2,b3)
y = zeros(size(x));
for i = 1:length(x)
if x(i) < td
y(i) = 0;
elseif(x(i) < t_max)
y(i) = A_max*(x(i)-td);
else
y(i) = (a1*exp(-b1*(x(i)-t_max)))+(a2*exp(-b2*(x(i)-t_max)))+(a3*exp(-b3*(x(i)-t_max)));
end
end
end

Iterating over all integer vectors summing up to a certain value in MATLAB?

I would like to find a clean way so that I can iterate over all the vectors of positive integers of length, say n (called x), such that sum(x) == 100 in MATLAB.
I know it is an exponentially complex task. If the length is sufficiently small, say 2-3 I can do it by a for loop (I know it is very inefficient) but how about longer vectors?
Thanks in advance,
Here is a quick and dirty method that uses recursion. The idea is that to generate all vectors of length k that sum to n, you first generate the vectors of length k-1 that sum to n-i for each i=0..n, and then add an extra i to the end of each of these.
You could speed this up by pre-allocating x in each loop.
Note that the size of the output is (n + k - 1 choose n) rows and k columns.
function x = genperms(n, k)
if k == 1
x = n;
elseif n == 0
x = zeros(1,k);
else
x = zeros(0, k);
for i = 0:n
y = genperms(n-i,k-1);
y(:,end+1) = i;
x = [x; y];
end
end
Edit
As alluded to in the comments, this will run into memory issues for large n and k. A streaming solution is preferable, which generates the outputs one at a time. In a non-strict language like Haskell this is very simple -
genperms n k
| k == 1 = return [n]
| n == 0 = return (replicate k 0)
| otherwise = [i:y | i <- [0..n], y <- genperms (n-i) (k-1)]
viz.
>> mapM_ print $ take 10 $ genperms 100 30
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,99]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,98]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,97]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,96]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,95]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,94]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,93]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,92]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,91]
which runs virtually instantaneously - no memory issues to worry about.
In Python you could achieve something nearly as simple using generators and the yield keyword. In Matlab it is certainly possible, but I leave the translation up to you!
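As a rough illustration of the generator idea, here is a minimal Python sketch. The function name `compositions` is made up, and like the code above it produces non-negative entries. It yields one vector at a time, so nothing has to be held in memory.

```python
from itertools import islice
from math import comb

def compositions(n, k):
    """Yield every length-k tuple of non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for i in range(n + 1):
        # Fix the first entry to i, then recurse on the remaining k-1 entries
        for rest in compositions(n - i, k - 1):
            yield (i,) + rest

# Only the first few vectors are ever built, even for a large search space
first = list(islice(compositions(100, 30), 3))
for vec in first:
    print(vec)

# Sanity check on a small case against the (n + k - 1 choose n) count
count = sum(1 for _ in compositions(10, 3))
print(count, comb(10 + 3 - 1, 10))
```

If strictly positive entries are wanted, as in the question, one option is to iterate over `compositions(n - k, k)` and add 1 to every entry.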
This is one possible method to generate all vectors at once (will give memory problems for moderately large n):
s = 10; %// desired sum
n = 3; %// number of digits
vectors = cell(1,n);
[vectors{:}] = ndgrid(0:s); %// I assume by "integer" you mean non-negative int
vectors = cell2mat(cellfun(@(c) reshape(c,1,[]), vectors, 'uni', 0).');
vectors = vectors(:,sum(vectors)==s); %// each column is a vector
Now you can iterate over those vectors:
for vector = vectors %// take one column at each iteration
%// do stuff with the vector
end
To avoid memory problems it is better to generate each vector as needed, instead of generating all of them initially. The following approach iterates over all possible n-vectors in one for loop (regardless of n), rejecting those vectors whose sum is not the desired value:
s = 10; %// desired sum
n = 3; %// number of digits
for number = 0 : (s+1)^n-1 %// each digit ranges over 0..s, i.e. base s+1
vector = mod(floor(number ./ (s+1).^(n-1:-1:0)), s+1).'; %// column vector of n rows
if sum(vector) ~= s
continue %// reject that vector
end
%// do stuff with the vector
end

MATLAB: Using fmincon inside a for loop

I am a novice in programming. I have about 100 data sets with 2 columns each. I want to fit col1 data as a power function of col2 data i.e.
variable(col1) = parameter1 x (variable(col2))^ parameter2
(say a (x^b)).
Now I would like to use the optimization function fmincon and get the value of the two parameters. I want to repeat this for all the 100 data sets and hence would like to include this fmincon within a for loop in which I call the data from each data set one at a time. I have tried my best and I am unable to get how to do this.
Any suggestions?
Let's call the first column of data set k Xk and the second one Yk (each of size m x 1). If I understand your question correctly, then for each dataset k {Xk, Yk} you are looking for two scalars ak and bk such that
Yk == ak * Xk.^bk for all elements 1..m
Since there are more equations/constraints than parameters (m equations with only two parameters) we seek a least-squares solution.
Taking the log of both sides of the equation yields
log Yk == log ak + bk * log Xk
Defining new variables YYk <- log(Yk) and XXk <- log(Xk), we have a system of linear equations in log ak and bk -- this can be solved easily without fmincon or other optimization tools.
In fact, if we append to the column vector XXk another column of all ones (that is, XXk(:,2)=1) we can write our system in matrix form
XXk * [ bk ; log(ak) ] == YYk
Now for some Matlab code:
N = 100; % number of data sets
a = zeros( 1, N ); % pre allocate room for all ak
b = zeros( 1, N ); % pre allocate room for all bk
for k = 1 : N
% get the data here: Xk = ???, Yk = ???
XXk = log( Xk );
XXk(:,2) = 1; % add all ones column
YYk = log( Yk );
tmp = XXk \ YYk; % least-squares solution [bk; log(ak)]
a(k) = exp( tmp(2) );
b(k) = tmp( 1 );
end
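The same log-log least-squares idea can be sketched in plain Python. The helper name `fit_power_law` and the synthetic, noise-free data sets are assumptions made for illustration, and all x and y values must be positive for the logarithms to exist.

```python
import math

def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log(y) = log(a) + b*log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope in log-log space
    a = math.exp(my - b * mx)     # intercept in log-log space is log(a)
    return a, b

# Synthetic data sets generated from known (a, b) pairs
x = [1.0, 2.0, 4.0, 8.0, 16.0]
true_params = [(3.0, 0.5), (0.2, 2.0), (10.0, -1.5)]
datasets = [(x, [a * v ** b for v in x]) for a, b in true_params]

fits = [fit_power_law(xk, yk) for xk, yk in datasets]
for (a_true, b_true), (a_hat, b_hat) in zip(true_params, fits):
    print(f"true a={a_true}, b={b_true} -> fitted a={a_hat:.4f}, b={b_hat:.4f}")
```

Note that this minimizes the squared error of log(y), not of y itself, so with noisy data the result can differ from a direct nonlinear fit such as one obtained with fmincon.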