linear regression with feature normalization matlab code - matlab

I did it two ways, why is the first way(starting on line with mu=mean(X) not working? what's the difference?
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
% FEATURENORMALIZE(X) returns a normalized version of X where
% the mean value of each feature is 0 and the standard deviation
% is 1. This is often a good preprocessing step to do when
% working with learning algorithms.
% You need to set these values correctly
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
% ====================== YOUR CODE HERE ======================
% Instructions: First, for each feature dimension, compute the mean
% of the feature and subtract it from the dataset,
% storing the mean value in mu. Next, compute the
% standard deviation of each feature and divide
% each feature by it's standard deviation, storing
% the standard deviation in sigma.
%
% Note that X is a matrix where each column is a
% feature and each row is an example. You need
% to perform the normalization separately for
% each feature.
%
% Hint: You might find the 'mean' and 'std' functions useful.
%
%mu=mean(X)
%X_norm=X-mu;
%sigma=std(X_norm)
%X_norm(1)=X_norm(1)/sigma(1)
%X_norm(2)=X_norm(2)/sigma(2)
% Calculates mean and std dev for each feature
for i=1:size(mu,2)
mu(1,i) = mean(X(:,i));
sigma(1,i) = std(X(:,i));
X_norm(:,i) = (X(:,i)-mu(1,i))/sigma(1,i);
end
% ============================================================
end

The reason is because you try to subtract a vector from a matrix. mean(X) gives you a vector with the mean in the columns of X, dimension [1xC], and the X is dimension [RxC]. A way to solve this in a oneliner is
X = (X-repmat(mean(X,1),size(X,1),1))./repmat(std(X,0,1),size(X,1),1)

You need to loop through X. You can further verify the output of above code with normalize(X)
for i = 1: size(X, 2)
mu = mean(X(:, i));
sigma = std(X(:, i));
X_norm(:, i) = (X(:, i) - mu) ./ sigma
end

Related

Obtaining theoretical semivariogram from parametrized covariance function (STK)

I have been using the STK toolbox for a few days, for kriging of environmental parameter fields, i.e. in a geostatistical context.
I find the toolbox very well implemented and useful (big thanks to the authors!), and the kriging predictions I am getting through STK actually seem fine; however, I am finding myself unable to visualize a semivariogram model based on the STK output (i.e. estimated parameters for gaussian process / covariance functions).
I am attaching an example figure, showing the empirical semivariogram for a simple 1D test case and a Gaussian semivariogram model (as typically used in geostatistics, see also figure) fitted directly to that data. The figure further shows a semivariogram model based on STK output, i.e. using previously estimated model parameters (model.param from stk_param_estim) to get covariance K on a target grid of lag distances and then converting K to semivariance (according to the well-known relation semivar = K0-K where K0 is the covariance at zero lag). I am attaching a simple script to reproduce the figure and detailing the attempted conversion.
As you can see in the figure, this doesn’t do the trick. I have tried several other simple examples and STK datasets, but models obtained through STK vs direct fitting never agree, and in fact usually look much more different than in the example (i.e. the range often seems very different, in addition to the sill/sigma2; uncomment line 12 in the script to see another example). I have also attempted to input the converted STK parameters into the geostatistical model (also in the script), however, the output is identical to the result based on converting K above.
I’d be very thankful for your help!
Figure illustrating the lack of agreement between semivariograms based on direct fit vs conversion of STK output
% Code to reproduce the figure illustrating my problem of getting
% variograms from STK output. The only external functions needed are those
% included with STK.
% TEST DATA - This is simply a monotonic part of the normal pdf
nugget = 0;
X = [0:20]'; % coordinates
% X = [0:50]'; % uncomment this line to see how strongly the models can deviate for different test cases
V = normpdf(X./10+nugget,0,1); % observed values
covmodel = 'stk_gausscov_iso'; % covar model, part of STK toolbox
variomodel = 'stk_gausscov_iso_vario'; % variogram model, nested function
% GET STRUCTURE FOR THE SELECTED KRIGING (GAUSSIAN PROCESS) MODEL
nDim = size(X,2);
model = stk_model (covmodel, nDim);
model.lognoisevariance = NaN; % This makes STK fit nugget
% ESTIMATE THE PARAMETERS OF THE COVARIANCE FUNCTION
[param0, model.lognoisevariance] = stk_param_init (model, X, V); % Compute an initial guess for the parameters of the covariance function (param0)
model.param = stk_param_estim (model, X, V, param0); % Now model the covariance function
% EMPIRICAL SEMIVARIOGRAM (raw, binning removed for simplicity)
D = pdist(X)';
semivar_emp = 0.5.*(pdist(V)').^2;
% THEORETICAL SEMIVARIOGRAM FROM STK
% Target grid of lag distances
DT = [0:1:100]';
DT_zero = zeros(size(DT));
% Get covariance matrix on target grid using STK estimated pars
pairwise = true;
K = feval(model.covariance_type, model.param, DT, DT_zero, -1, pairwise);
% convert covariance to semivariance, i.e. G = C(0) - C(h)
sill = exp(model.param(1));
nugget = exp(model.lognoisevariance);
semivar_stk = sill - K + nugget; % --> this variable is then plotted
% TEST: FIT A GAUSSIAN VARIOGRAM MODEL DIRECTLY TO THE EMPIRICAL SEMIVARIOGRAM
f = #(par)mseval(par,D,semivar_emp,variomodel);
par0 = [10 10 0.1]; % initial guess for pars
[par,mse] = fminsearch(f, par0); % optimize
semivar_directfit = feval(variomodel, par, DT); % evaluate
% TEST 2: USE PARS FROM STK AS INPUT TO GAUSSIAN VARIOGRAM MODEL
par(1) = exp(model.param(1)); % sill, PARAM(1) = log (SIGMA ^ 2), where SIGMA is the standard deviation,
par(2) = sqrt(3)./exp(model.param(2)); % range, PARAM(2) = - log (RHO), where RHO is the range parameter. --- > RHO = exp(-PARAM(2))
par(3) = exp(model.lognoisevariance); % nugget
semivar_stkparswithvariomodel = feval(variomodel, par, DT);
% PLOT SEMIVARIOGRAM
figure(); hold on;
plot(D(:), semivar_emp(:),'.k'); % Observed variogram, raw
plot(DT, semivar_stk,'-b','LineWidth',2); % Theoretical variogram, on a grid
plot(DT, semivar_directfit,'--r','LineWidth',2); % Test direct fit variogram
plot(DT,semivar_stkparswithvariomodel,'--g','LineWidth',2); % Test direct fit variogram using pars from stk
legend('raw empirical semivariance (no binned data here for simplicity) ',...
'Gaussian cov model from STK, i.e. exp(Sigma2) - K + exp(lognoisevar)',...
'Gaussian semivariogram model (fitted directly to semivariance)',...
'Gaussian semivariogram model (using transformed params from STK)');
xlabel('Lag distance','Fontweight','b');
ylabel('Semivariance','Fontweight','b');
% NESTED FUNCTIONS
% Objective function for direct fit
function [mse] = mseval(par,D,Graw,variomodel)
Gmod = feval(variomodel, par, D);
mse = mean((Gmod-Graw).^2);
end
% Gaussian semivariogram model.
function [semivar] = stk_gausscov_iso_vario(par, D) %#ok<DEFNU>
% D : lag distance, c : sill, a : range, n : nugget
c = par(1); % sill
a = par(2); % range
if length(par) > 2, n = par(3); % nugget optional
else, n = 0; end
semivar = n + c .* (1 - exp( -3.*D.^2./a.^2 )); % Model
end
There is nothing wrong with the way you compute the semivariogram.
To understand the figure that you obtain, consider that:
The parameters of the model are estimated in STK using the (restricted) maximum likelihood method, not by least-squares fitting on the semi-variogram.
For very smooth stationary random fields observed over short intervals, you should not expect that the theoretical semivariogram will agree with the empirical semivariogram, with or without binning. The reason for this is that the observations, and thus the squared differences, are very correlated in this case.
To convince yourself of the second point, you can run the following script repeatedly:
% a smooth GP
model = stk_model (#stk_gausscov_iso, 1);
model.param = log ([1.0, 0.2]); % unit variance
x_max = 20; x_obs = x_max * rand (50, 1);
% Simulate data
z_obs = stk_generate_samplepaths (model, x_obs);
% Empirical semivariogram (raw, no binning)
h = (pdist (double (x_obs)))';
semivar_emp = 0.5 * (pdist (z_obs)') .^ 2;
% Model-based semivariogram
x1 = (0:0.01:x_max)';
x0 = zeros (size (x1));
K = feval (model.covariance_type, model.param, x0, x1, -1, true);
semivar_th = 1 - K;
% Figure
figure; subplot (1, 2, 1); plot (x_obs, z_obs, '.');
subplot (1, 2, 2); plot (h(:), semivar_emp(:),'.k'); hold on;
plot (x1, semivar_th,'-b','LineWidth',2);
legend ('empirical', 'model'); xlabel ('lag'); ylabel ('semivar');
Further questions on parameter estimation for Gaussian process models should probably be asked on Cross-Validated rather than Stack Overflow.

random number with p(x)= x^(-a) distribution

How can I generate integer random number within [a,b] with below distribution in MATLAB:
p(x)= x^(-a)
I want the distribution to be normalized.
For continuous distributions: Generate random values given a PDF
For discrete distributions, as later it was specified in the OP:
The same rationale can be used as for continuous distributions: inverse transform sampling.
So from mathematical point of view there is no difference, the Matlab implementation however is different. Here is a simple solution with your distribution function:
% for reproducibility
rng(333)
% OPTIONS
% interval endpoints
a = 4;
b = 20;
% number of required random draws
n = 1e4;
% CALCULATION
x = a:b;
% normalization constant
nc = sum(x.^(-a));
% if a and b are finite it is more convinient to have the pdf and cdf as vectors
pmf = 1/nc*x.^(-a);
% create cdf
cdf = cumsum(pmf);
% generate uniformly distributed random numbers from [0,1]
r = rand(n,1);
% use the cdf to get the x value to rs
R = nan(n,1);
for ii = 1:n
rr = r(ii);
if rr == 1
R(ii) = b;
else
idx = sum(cdf < rr) + 1;
R(ii) = x(idx);
end
end
%PLOT
% verfication plot
f = hist(R,x);
bar(x,f/sum(f))
hold on
plot(x, pmf, 'xr', 'Linewidth', 1.2)
xlabel('x')
ylabel('Probability mass')
legend('histogram of random values', 'analytical pdf')
Notes:
the code is general, just replace the pmf with your function;
it is strange that the same parameter a appears in the distribution function and in the interval too.

Matlab: Latin Hypercube

Is there a way to create a Latin Hypercube from a particular set of data? I have d(1,:) = 3*t +0.00167*randn(1,1000);. Is there a way for me to create a Latin Hypercube from the elements in d(1,:)?
Thanks a lot
An edit of the lhsnorm function can probably answer your question.
In matlab : edit lhsnorm :
function [X,z] = lhsnorm(mu,sigma,n,dosmooth)
%LHSNORM Generate a latin hypercube sample with a normal distribution
% X=LHSNORM(MU,SIGMA,N) generates a latin hypercube sample X of size
% N from the multivariate normal distribution with mean vector MU
% and covariance matrix SIGMA. X is similar to a random sample from
% the multivariate normal distribution, but the marginal distribution
% of each column is adjusted so that its sample marginal distribution
% is close to its theoretical normal distribution.
%
% X=LHSNORM(MU,SIGMA,N,'ONOFF') controls the amount of smoothing in the
% sample. If 'ONOFF' is 'off', each column has points equally spaced
% on the probability scale. In other words, each column is a permutation
% of the values G(.5/N), G(1.5/N), ..., G(1-.5/N) where G is the inverse
% normal cumulative distribution for that column''s marginal distribution.
% If 'ONOFF' is 'on' (the default), each column has points uniformly
% distributed on the probability scale. For example, in place of
% 0.5/N we use a value having a uniform distribution on the
% interval (0/N,1/N).
%
% [X,Z]=LHSNORM(...) also returns Z, the original multivariate normal
% sample before the marginals are adjusted to obtain X.
%
% See also LHSDESIGN, MVNRND.
% Reference: Stein, M. L. (1987). Large sample properties of simulations
% using Latin hypercube sampling. Technometrics, 29, 143-151. Correction,
% 32, 367.
% Copyright 1993-2010 The MathWorks, Inc.
% $Revision: 1.1.8.1 $ $Date: 2010/03/16 00:15:10 $
% Generate a random sample with a specified distribution and
% correlation structure -- in this case multivariate normal
z = mvnrnd(mu,sigma,n);
% Find the ranks of each column
p = length(mu);
x = zeros(size(z),class(z));
for i=1:p
x(:,i) = rank(z(:,i));
end
% Get gridded or smoothed-out values on the unit interval
if (nargin<4) || isequal(dosmooth,'on')
x = x - rand(size(x));
else
x = x - 0.5;
end
x = x / n;
% Transform each column back to the desired marginal distribution,
% maintaining the ranks (and therefore rank correlations) from the
% original random sample
for i=1:p
x(:,i) = norminv(x(:,i),mu(i), sqrt(sigma(i,i)));
end
X = x;
% -----------------------
function r=rank(x)
% Similar to tiedrank, but no adjustment for ties here
[sx, rowidx] = sort(x);
r(rowidx) = 1:length(x);
r = r(:);
In your case you already have your distribution z in the code above and you also have mu, sigma and 'n' (the size of your distribution), just replace them and you should be able to create your Latin Hypercube.
There is a function in matlab for the creation of latin hypercube samples: lhsdesign(which lets you specifiy your hypercube) or lhsnorm(which uses a normal distributed one) . both are found in the statistics toolbox.

bootstrap data in confidence interval MATLAb

I tried this as well:
plot(x(bootsam(:,100)),y(bootsam(:,100)), 'r*') but it was exactly the same to my data! I want to resample my data in 95% confidence interval .
But it seems this command bootstrp doesn't work alone, it needs some function or other commands to combine. Would you help me to figure it out?
I would like to generate some data randomly but behave like my function around the original data, I attached a plot which original data which are red and resampled data are in blue and green colors.
Generally, I would like to use bootstrap to find error for my best-fit parameters. I read in this book:
http://books.google.de/books?id=ekyupqnDFzMC&lpg=PA131&vq=bootstrap&hl=de&pg=PA130#v=onepage&q&f=false
other methods for error analysis my fitted parameters are appreciated.
I suggest you start this way and then adapt it to your case.
% One step at a time.
% Step 1: Suppose you generate a simple linear deterministic trend with
% noise from the standardized Gaussian distribution:
N = 1000; % number of points
x = [(1:N)', ones(N, 1)]; % x values
b = [0.15, 157]'; % parameters
y = x * b + 10 * randn(N, 1); % linear trend with noise
% Step 2: Suppose you want to fit y with a linear equation:
[b_hat, bint1] = regress(y, x); % estimate parameters with linear regression
y_fit = x * b_hat; % calculate fitted values
resid = y - y_fit; % calculate residuals
plot(x(:, 1), y, '.') % plot
hold on
plot(x(:, 1), y_fit, 'r', 'LineWidth', 5) % fitted values
% Step 3: use bootstrap approach to estimate the confidence interval of
% regression parameters
N_boot = 10000; % size of bootstrap
b_boot = bootstrp(N_boot, #(bootr)regress(y_fit + bootr, x), resid); % bootstrap
bint2 = prctile(b_boot, [2.5, 97.5])'; % percentiles 2.5 and 97.5, a 95% confidence interval
% The confidence intervals obtained with regress and bootstrp are
% practically identical:
bint1
bint2

matlab -- use of fminsearch with matrix

If I have a function for example :
k=1:100
func=#(s) sum(c(k)-exp((-z(k).^2./s)))
where c and z are matrices with same size (for example 1x100) , is there any way to use fminsearch to find the "s" value?
fminsearch needs an initial condition in the second parameter, not boundary conditions (though some of the options may support boundaries).
Just call
fminsearch(func,-0.5)
Where you saw examples passing in a vector, was a multidimensional search across multiple coefficients, and the vector was the initial value of each coefficient. Not limits on the search space.
You can also use
fminbnd(func, -0.5, 1);
which performs constrained minimization.
But I think you should minimize the norm of the error, not the sum (minimizing the sum leads to a large error magnitude -- very very negative).
If you have the Optimization Toolbox, then lsqnonlin could be useful.
I guess you would like to find argmin of your symbolic function, use
Index of max and min value in an array
OR
ARGMAX/ARGMIN by Marco Cococcioni:
function I = argmax(X, DIM)
%ARGMAX Argument of the maximum
% For vectors, ARGMAX(X) is the indix of the smallest element in X. For matrices,
% MAX(X) is a row vector containing the indices of the smallest elements from each
% column. This function is not supported for N-D arrays with N > 2.
%
% It is an efficient replacement to the use of [Y,I] = MAX(X,[],DIM);
% See ARGMAX_DEMO for a speed comparison.
%
% I = ARGMAX(X,DIM) operates along the dimension DIM (DIM can be
% either 1 or 2).
%
% When complex, the magnitude ABS(X) is used, and the angle
% ANGLE(X) is ignored. This function cannot handle NaN's.
%
% Example:
% clc
% disp('If X = [2 8 4; 7 3 9]');
% disp('then argmax(X,1) should be [2 1 2]')
% disp('while argmax(X,2) should be [2 3]''. Now we check it:')
% disp(' ');
% X = [2 8 4;
% 7 3 9]
% argmax(X,1)
% argmax(X,2)
%
% See also ARGMIN, ARGMAXMIN_MEX, ARGMAX_DEMO, MIN, MAX, MEDIAN, MEAN, SORT.
% Copyright Marco Cococcioni, 2009.
% $Revision: 1.0 $ $Date: 2009/02/16 19:24:01$
if nargin < 2,
DIM = 1;
end
if length(size(X)) > 2,
error('Function not provided for N-D arrays when N > 2.');
end
if (DIM ~=1 && DIM ~= 2),
error('DIM has to be either 1 or 2');
end
if any(isnan(X(:))),
error('Cannot handle NaN''s.');
end
if not(isreal(X)),
X = abs(X);
end
max_NOT_MIN = 1; % computes argmax
I = argmaxmin_mex(X, DIM, max_NOT_MIN);