Understanding T-SQL stdev, stdevp, var, and varp - tsql

I'm having a difficult time understand what these statistics functions do and how they work. I'm having an even more difficult time understanding how stdev works vs stdevp and the var equivelant. Can someone please break these down into dumb for me?

In statistics Standard Deviation and Variance are measures of how much a metric in a population deviate from the mean (usually the average.)
The Standard Deviation is defined as the square root of the Variance and the Variance is defined as the average of the squared difference from the mean, i.e.:
For a population of size n: x1, x2, ..., xn
with mean: xmean
Stdevp = sqrt( ((x1-xmean)^2 + (x2-xmean)^2 + ... + (xn-xmean)^2)/n )
When values for the whole population are not available (most of the time) it is customary to apply Bessel's correction to get a better estimate of the actual standard deviation for the whole population. Bessel's correction is merely dividing by n-1 instead of by n when computing the variance, i.e:
Stdev = sqrt( ((x1-xmean)^2 + (x2-xmean)^2 + ... + (xn-xmean)^2)/(n-1) )
Note that for large enough data-sets it won't really matter which function is used.
You can verify my answer by running the following T-SQL script:
-- temporary data set with values 2, 3, 4
declare #t table([val] int);
insert into #t values
(2),(3),(4);
select avg(val) as [avg], -- equals to 3.0
-- Estimation of the population standard devisation using a sample and Bessel's Correction:
-- ((x1 - xmean)^2 + (x2 - xmean)^2 + ... + (xn-xmean)^2)/(n-1)
stdev(val) as [stdev],
sqrt( (square(2-3.0) + square(3-3) + square(4-3))/2) as [stdev calculated], -- calculated with value 2, 3, 4
-- Population standard deviation:
-- ((x1 - xmean)^2 + (x2 - xmean)^2 + ... + (xn-xmean)^2)/n
stdevp(val) as [stdevp],
sqrt( (square(2-3.0) + square(3-3) + square(4-3))/3) as [stdevp calculated] -- calculated with value 2, 3, 4
from #t;
Further reading wikipedia articles for: standard deviation and Bessel's Correction.

STDDEV is used for computing the standard deviation of a data set. STDDEVP is used to compute the standard deviation of a population from which your data is a sample.
If your input is the entire population, then the population standard deviation is computed with STDDEV. More typically, your data set is a sample of a much larger population. In this case the standard deviation of the data set would not represent the true standard deviation of the population since it will usually be biased too low. A better estimate for the standard deviation of the population based on a sample is obtained with STDDEVP.
The situation with VAR and VARP is the same.
For a more thorough discussion of the topic, please see this Wikipedia article.

Related

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in the link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with a R package, I got similar results for the coefficients, but (very approximately) an opposite logLikelihood.
Hypothesis: The logLikelihood given by fitnlm in Matlab is in fact the negative LogLikelihood. (Note that this impairs consequently the BIC and AIC computation by Matlab)
Reasonning: in [1], the same problem is solved through two different approaches. ML-approach/ By defining the negative LogLikelihood and making an optimization with fminsearch. GLS-approach/ By using fitnlm.
The negative LogLikelihood after the ML-approach is:380
The negative LogLikelihood after the GLS-approach is:-406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = #(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = #(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = #(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = #(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also not that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
this is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.

How can I use skewnorm to produce a distribution with the specified skew?

I am trying to produce a random distribution where I control the mean, SD, skewness and kurtosis.
I can solve the mean and SD with some simple maths after the distribution is produced.
Kurtosis I am leaving on the shelf for the moment because it just seems too hard.
Skewness is today's problem.
import scipy.stats
def convert_to_alpha(s):
d=(np.pi/2*((abs(s)**(2/3))/(abs(s)**(2/3)+((4-np.pi)/2)**(2/3))))**0.5
a=((d)/((1-d**2)**.5))
return(a)
for skewness_expected in (.5, .9, 1.3):
alpha = convert_to_alpha(skewness_expected)
r = stats.skewnorm.rvs(alpha,size=10000)
print('Skewness expected:',skewness_expected)
print('Skewness obtained:',stats.skew(r))
print()
Skewness expected: 0.5
Skewness obtained: 0.47851348006629035
Skewness expected: 0.9
Skewness obtained: 0.8917020428586827
Skewness expected: 1.3
Skewness obtained: (1.2794406116842627+0.01780402125888404j)
I understand that the calculated skewness will generally not match the desired skewness - this is a random distribution, after all. But I am confused as to how I can get a distribution with a skewness > 1 without falling into complex number territory. The rvs method appears incapable of handling it, since the parameter alpha is an imaginary number whenever skewness > 1.
How can I fix it so that I can generate distributions with skewness > 1, but not have complex numbers creeping in?
[With credit to Warren Weckesser for pointing me at Wikipedia in order to write the convert_to_alpha function.]
Understand this thread is a year and a half old now, but I've run into this problem recently as well and it never seemed to get answered here. The further problem with converting between alpha from stats.skewnorm and the skewness statistic (excellent function to do that by the way) is that doing so will also alter the measures of central tendency for the distribution, which was problematic for my needs.
I've developed this based on the F-distribution (https://en.wikipedia.org/wiki/F-distribution). The end result of a lot of work is this function for which you specify the mean, SD and skewness required, and desired sample size. I can share the work behind it if anyone wishes. The output SD and skew become a little rough at extreme settings. Presumably because the F-distribution naturally sits around 1. It is also very problematic for skew values close to zero, in which case there would be no need for this function anyway.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def createSkewDist(mean, sd, skew, size):
# calculate the degrees of freedom 1 required to obtain the specific skewness statistic, derived from simulations
loglog_slope=-2.211897875506251
loglog_intercept=1.002555437670879
df2=500
df1 = 10**(loglog_slope*np.log10(abs(skew)) + loglog_intercept)
# sample from F distribution
fsample = np.sort(stats.f(df1, df2).rvs(size=size))
# adjust the variance by scaling the distance from each point to the distribution mean by a constant, derived from simulations
k1_slope = 0.5670830069364579
k1_intercept = -0.09239985798819927
k2_slope = 0.5823114978219056
k2_intercept = -0.11748300123471256
scaling_slope = abs(skew)*k1_slope + k1_intercept
scaling_intercept = abs(skew)*k2_slope + k2_intercept
scale_factor = (sd - scaling_intercept)/scaling_slope
new_dist = (fsample - np.mean(fsample))*scale_factor + fsample
# flip the distribution if specified skew is negative
if skew < 0:
new_dist = np.mean(new_dist) - new_dist
# adjust the distribution mean to the specified value
final_dist = new_dist + (mean - np.mean(new_dist))
return final_dist
'''EXAMPLE'''
desired_mean = 497.68
desired_skew = -1.75
desired_sd = 77.24
final_dist = createSkewDist(mean=desired_mean, sd=desired_sd, skew=desired_skew, size=1000000)
# inspect the plots & moments, try random sample
fig, ax = plt.subplots(figsize=(12,7))
sns.distplot(final_dist, hist=True, ax=ax, color='green', label='generated distribution')
sns.distplot(np.random.choice(final_dist, size=100), hist=True, ax=ax, color='red', hist_kws={'alpha':.2}, label='sample n=100')
ax.legend()
print('Input mean: ', desired_mean)
print('Result mean: ', np.mean(final_dist),'\n')
print('Input SD: ', desired_sd)
print('Result SD: ', np.std(final_dist),'\n')
print('Input skew: ', desired_skew)
print('Result skew: ', stats.skew(final_dist))
Input mean: 497.68
Result mean: 497.6799999999999
Input SD: 77.24
Result SD: 71.69030764848961
Input skew: -1.75
Result skew: -1.6724486459469905
The shape parameter of the skew-normal distribution is not the skewness of the distribution. Check out the wikipedia page for the skew normal distribution. The formulas in the table on the right give the expressions for the mean, variance, skewness, etc., in terms of the parameters. You can get these values from the skewnorm object with the stats() method.
For example, here's the skewness of the distribution with shape parameter 2:
In [46]: from scipy.stats import skewnorm, skew
In [47]: skewnorm.stats(2, moments='s')
Out[47]: array(0.45382556395938217)
Generate a couple samples and find the sample skewness:
In [48]: r = skewnorm.rvs(2, size=10000000)
In [49]: skew(r)
Out[49]: 0.4533209955299838
In [50]: r = skewnorm.rvs(2, size=10000000)
In [51]: skew(r)
Out[51]: 0.4536583726840712

MATLAB: How can I create autocorrelated data?

I'm looking to create a vector of autocorrelated data points in MATLAB, with the lag 1 higher than lag 2, and so on.
If I look at the lag 1 data pairs (1, 2), (3, 4), (5, 6), ..., then the correlation is relatively higher, but then at lag 2 it's reduced.
I found a way to do this in R
x <- filter(rnorm(1000), filter=rep(1,3), circular=TRUE)
However, I'm not sure how to do the same thing in MATLAB. Ideally I'd like to be able to fine tune exactly how autocorrelated the data is.
Math:
A group of standard models for autocorrelation in stationary time series are so called "auto regressive model" eg. an autoregressive model with 1 term is known as an AR(1) and is:
y_t = a + b*y_{t-1} + e_t
AR(1) sounds simplistic, but it turns it's a quite powerful tooll. Eg. an AR(p) with p autoregressive terms is actually an AR(1) on a p dimensional vector. (Check Wikipedia page.) Note also b=1, gives a non-stationary random walk.
A more intuitive way to write what's going on (in stationary case with |b| < 1) is define u = a / (1 - b) (turns out u is unconditional mean of AR(1)), then with some algebra:
y_t - u = b * ( y_{t-1} - u) + e_t
That is, the difference from the unconditional mean u gets hit with some decay term b and then a shock term e_t gets added. (you want -1<b<1 for stationarity)
Code:
Since e_t denotes the shock term, this is super easy to simulate. Eg. to simulate an AR(1):
a = 0; b = .4; sigma = 1; T = 1000;
y0 = a / (1 - b); %eg initialize to unconditional mean of stationary time series
y = zeros(T,1);
y(1) = a + b * y0 + randn() * sigma;
for t = 2:T
y(t) = a + b * y(t-1) + randn() * sigma;
end
This code isn't mean to be fast, but illustrative. An AR(1) model implies a certain type of correlation structure, but adding AR or MA terms, you can fit some pretty funky stuff. (MA is moving average model)
Can test sample autocorrelation with autocorr(y). For reference, the bible on time series mathematics is Hamilton's book Time Series Analysis.

How to estimate goodness-of-fit using scipy.odr?

I am fitting data with weights using scipy.odr but I don't know how to obtain a measure of goodness-of-fit or an R squared. Does anyone have suggestions for how to obtain this measure using the output stored by the function?
The res_var attribute of the Output is the so-called reduced Chi-square value for the fit, a popular choice of goodness-of-fit statistic. It is somewhat problematic for non-linear fitting, though. You can look at the residuals directly (out.delta for the X residuals and out.eps for the Y residuals). Implementing a cross-validation or bootstrap method for determining goodness-of-fit, as suggested in the linked paper, is left as an exercise for the reader.
The output of ODR gives both the estimated parameters beta as well as the standard deviation of those parameters sd_beta. Following p. 76 of the ODRPACK documentation, you can convert these values into a t-statistic with (beta - beta_0) / sd_beta, where beta_0 is the number that you're testing significance with respect to (often zero). From there, you can use the t-distribution to get the p-value.
Here's a working example:
import numpy as np
from scipy import stats, odr
def linear_func(B, x):
"""
From https://docs.scipy.org/doc/scipy/reference/odr.html
Linear function y = m*x + b
"""
# B is a vector of the parameters.
# x is an array of the current x values.
# x is in the same format as the x passed to Data or RealData.
#
# Return an array in the same format as y passed to Data or RealData.
return B[0] * x + B[1]
np.random.seed(0)
sigma_x = .1
sigma_y = .15
N = 100
x_star = np.linspace(0, 10, N)
x = np.random.normal(x_star, sigma_x, N)
# the true underlying function is y = 2*x_star + 1
y = np.random.normal(2*x_star + 1, sigma_y, N)
linear = odr.Model(linear_func)
dat = odr.Data(x, y, wd=1./sigma_x**2, we=1./sigma_y**2)
this_odr = odr.ODR(dat, linear, beta0=[1., 0.])
odr_out = this_odr.run()
# degrees of freedom are n_samples - n_parameters
df = N - 2 # equivalently, df = odr_out.iwork[10]
beta_0 = 0 # test if slope is significantly different from zero
t_stat = (odr_out.beta[0] - beta_0) / odr_out.sd_beta[0] # t statistic for the slope parameter
p_val = stats.t.sf(np.abs(t_stat), df) * 2
print('Recovered equation: y={:3.2f}x + {:3.2f}, t={:3.2f}, p={:.2e}'.format(odr_out.beta[0], odr_out.beta[1], t_stat, p_val))
Recovered equation: y=2.00x + 1.01, t=239.63, p=1.76e-137
One note of caution in using this approach on nonlinear problems, from the same ODRPACK docs:
"Note that for nonlinear ordinary least squares, the linearized confidence regions and intervals are asymptotically correct as n → ∞ [Jennrich, 1969]. For the orthogonal distance regression problem, they have been shown to be asymptotically correct as σ∗ → 0 [Fuller, 1987]. The difference between the conditions of asymptotic correctness can be explained by the fact that, as the number of observations increases in the orthogonal distance regression problem one does not obtain additional information for ∆. Note also that Vˆ is dependent upon the weight matrix Ω, which must be assumed to be correct, and cannot be confirmed from the orthogonal distance regression results. Errors in the values of wǫi and wδi that form Ω will have an adverse affect on the accuracy of Vˆ and its component parts. The results of a Monte Carlo experiment examining the accuracy
of the linearized confidence intervals for four different measurement error models is presented in [Boggs and Rogers, 1990b]. Those results indicate that the confidence regions and intervals for ∆ are not as accurate as those for β.
Despite its potential inaccuracy, the covariance matrix is frequently used to construct confidence regions and intervals for both nonlinear ordinary least squares and measurement error models because the resulting regions and intervals are inexpensive to compute, often adequate, and familiar to practitioners. Caution must be exercised when using such regions and intervals, however, since the validity of the approximation will depend on the nonlinearity of the model, the variance and distribution of the errors, and the data itself. When more reliable intervals and regions are required, other more accurate methods should be used. (See, e.g., [Bates and Watts, 1988], [Donaldson and Schnabel, 1987], and [Efron, 1985].)"
As mentioned by R. Ken, chi-square or variance of the residuals is one of the more
commonly used tests of goodness of fit. ODR stores the sum of squared
residuals in out.sum_square and you can verify yourself
that out.res_var = out.sum_square/degrees_freedom corresponds to what is commonly called reduced chi-square: i.e. the chi-square test result divided by its expected value.
As for the other very popular estimator of goodness of fit in linear regression, R squared and its adjusted version, we can define the functions
import numpy as np
def R_squared(observed, predicted, uncertainty=1):
""" Returns R square measure of goodness of fit for predicted model. """
weight = 1./uncertainty
return 1. - (np.var((observed - predicted)*weight) / np.var(observed*weight))
def adjusted_R(x, y, model, popt, unc=1):
"""
Returns adjusted R squared test for optimal parameters popt calculated
according to W-MN formula, other forms have different coefficients:
Wherry/McNemar : (n - 1)/(n - p - 1)
Wherry : (n - 1)/(n - p)
Lord : (n + p - 1)/(n - p - 1)
Stein : (n - 1)/(n - p - 1) * (n - 2)/(n - p - 2) * (n + 1)/n
"""
# Assuming you have a model with ODR argument order f(beta, x)
# otherwise if model is of the form f(x, a, b, c..) you could use
# R = R_squared(y, model(x, *popt), uncertainty=unc)
R = R_squared(y, model(popt, x), uncertainty=unc)
n, p = len(y), len(popt)
coefficient = (n - 1)/(n - p - 1)
adj = 1 - (1 - R) * coefficient
return adj, R
From the output of your ODR run you can find the optimal values for your model's parameters in out.beta and at this point we have everything we need for computing R squared.
from scipy import odr
def lin_model(beta, x):
"""
Linear function y = m*x + q
slope m, constant term/y-intercept q
"""
return beta[0] * x + beta[1]
linear = odr.Model(lin_model)
data = odr.RealData(x, y, sx=sigma_x, sy=sigma_y)
init = odr.ODR(data, linear, beta0=[1, 1])
out = init.run()
adjusted_Rsq, Rsq = adjusted_R(x, y, lin_model, popt=out.beta)

Calculate posterior distribution of unknown mis-classification with PRTools in MATLAB

I'm using the PRTools MATLAB library to train some classifiers, generating test data and testing the classifiers.
I have the following details:
N: Total # of test examples
k: # of
mis-classification for each
classifier and class
I want to do:
Calculate and plot Bayesian posterior distributions of the unknown probabilities of mis-classification (denoted q), that is, as probability density functions over q itself (so, P(q) will be plotted over q, from 0 to 1).
I have that (math formulae, not matlab code!):
Posterior = Likelihood * Prior / Normalization constant =
P(q|k,N) = P(k|q,N) * P(q|N) / P(k|N)
The prior is set to 1, so I only need to calculate the likelihood and normalization constant.
I know that the likelihood can be expressed as (where B(N,k) is the binomial coefficient):
P(k|q,N) = B(N,k) * q^k * (1-q)^(N-k)
... so the Normalization constant is simply an integral of the posterior above, from 0 to 1:
P(k|N) = B(N,k) * integralFromZeroToOne( q^k * (1-q)^(N-k) )
(The Binomial coefficient ( B(N,k) ) can be omitted though as it appears in both the likelihood and normalization constant)
Now, I've heard that the integral for the normalization constant should be able to be calculated as a series ... something like:
k!(N-k)! / (N+1)!
Is that correct? (I have some lecture notes with this series, but can't figure out if it is for the normalization constant integral, or for the overall distribution of mis-classification (q))
Also, hints are welcome as how to practically calculate this? (factorials are easily creating truncation errors right?) ... AND, how to practically calculate the final plot (the posterior distribution over q, from 0 to 1).
I really haven't done much with Bayesian posterior distributions ( and not for a while), but I'll try to help with what you've given. First,
k!(N-k)! / (N+1)! = 1 / (B(N,k) * (N + 1))
and you can calculate the binomial coefficients in Matlab with nchoosek() though it does say in the docs that there can be accuracy problems for large coefficients. How big are N and k?
Second, according to Mathematica,
integralFromZeroToOne( q^k * (1-q)^(N-k) ) = pi * csc((k-N)*pi) * Gamma(1+k)/(Gamma(k-N) * Gamma(2+N))
where csc() is the cosecant function and Gamma() is the gamma function. However, Gamma(x) = (x-1)! which we'll use in a moment. The problem is that we have a function Gamma(k-N) on the bottom and k-N will be negative. However, the reflection formula will help us with that so that we end up with:
= (N-k)! * k! / (N+1)!
Apparently, your notes were correct.
Let q be the probability of mis-classification. Then the probability that you would observe k mis-classifications in N runs is given by:
P(k|N,q) = B(N,k) q^k (1-q)^(N-k)
You need to then assume a suitable prior for q which is bounded between 0 and 1. A conjugate prior for the above is the beta distribution. If q ~ Beta(a,b) then the posterior is also a Beta distribution. For your info the posterior is:
f(q|-) ~ Beta(a+k,b+N-k)
Hope that helps.