Can someone help me with scipy.stats.chisquare? I do not have a statistical/mathematical background, and I am learning scipy.stats.chisquare with this data set from https://en.wikipedia.org/wiki/Chi-squared_test
The Wikipedia article gives the table below as an example, stating that the chi-squared value based on it is approximately 24.6. I am to use scipy.stats to verify this value and calculate the associated p-value.

              A    B    C    D
White collar  90   60  104   95
Blue collar   30   50   51   20
No collar     30   40   45   35
The most likely candidate I have found to help me is
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
As I am new to statistics and to scipy.stats.chisquare, I am just not sure of the best approach: how best to enter the data from the provided table into the arrays, and whether I need to supply the expected values from Wikipedia.
That data is a contingency table. SciPy has the function scipy.stats.chi2_contingency that applies the chi-square test to a contingency table. It is fundamentally just a regular chi-square test, but when applied to a contingency table, the expected frequencies are calculated under the assumption of independence (chi2_contingency does this for you), and the degrees of freedom depend on the number of rows and columns (chi2_contingency calculates this for you, too).
Here's how you can apply the chi-square test to that table:
import numpy as np
from scipy.stats import chi2_contingency
table = np.array([[90, 60, 104, 95],
                  [30, 50, 51, 20],
                  [30, 40, 45, 35]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 statistic: {chi2:.5g}")
print(f"p-value: {p:.5g}")
print(f"degrees of freedom: {dof}")
print("expected frequencies:")
print(expected)
Output:
chi2 statistic: 24.571
p-value: 0.00040984
degrees of freedom: 6
expected frequencies:
[[ 80.53846154 80.53846154 107.38461538 80.53846154]
[ 34.84615385 34.84615385 46.46153846 34.84615385]
[ 34.61538462 34.61538462 46.15384615 34.61538462]]
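If you want to connect this back to scipy.stats.chisquare itself, you can reproduce the same result by flattening the table and supplying the expected frequencies computed above. A minimal sketch, reusing table, expected and dof from the example (the ddof argument corrects chisquare's default k - 1 degrees of freedom down to the contingency-table value):
from scipy.stats import chisquare

k = table.size  # 12 cells
# chisquare assumes k - 1 degrees of freedom, the table has dof = 6,
# so pass ddof = (k - 1) - dof
stat, pval = chisquare(table.ravel(), f_exp=expected.ravel(), ddof=(k - 1) - dof)
print(stat, pval)  # should reproduce the chi2 statistic and p-value above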
So, first of all, I am new to neural networks (NNs).
As part of my PhD, I am trying to solve some problems through NNs.
For this, I have created a program that generates a data set made of a collection of input vectors (each with 63 elements) and their corresponding output vectors (each with 6 elements).
So, my program looks like this:
using Flux, LinearAlgebra; # Flux for the layers/training, LinearAlgebra for norm
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent the x, y, z coordinates), and the following 60 numbers, also normalized to unity, correspond to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # first of the 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); # ADAM optimizer with learning rate 0.01 and decay rates (0.9, 0.8)
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
    Flux.train!(loss,ps,datatrain,opt);
    push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly; however, it has come to my attention that as I train my model with an increasing data set (Nₜᵣ = 10, 15, 25, etc.), the loss function seems to increase. See the image below,
where y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activation functions.
I have not tried changing the optimization method.
Considerations: I need a training data set of nearly 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training data set correctly? Say, if every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NNs and Flux. How could I build a data set of 10000 I/O vectors differently? Can this be the issue? (I am very inclined to this.)
Can this behavior be related to the chosen activation functions? (I am not inclined to this.)
Can this behavior be related to the optimization algorithm? (I am not inclined to this.)
Am I training my model wrong? Is the iteration loop really iterations, or are they epochs? I am struggling to put into practice (and to differentiate) the concepts of "epochs" and "iterations".
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data point doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function, for instance the mean squared cost.
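For instance, a minimal sketch of that change in the question's own code (this assumes the Statistics standard library; the only edit is to the push! line inside the training loop):
using Statistics # for mean
# track the mean per-sample loss instead of the sum, so curves for
# different Nₜᵣ are directly comparable
push!(losses, mean(loss.(xtrain, ytrain)));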
I'd like to build a GP with marginalized hyperparameters.
I have seen that this is possible with the HMC sampler provided in gpflow from this notebook
However, when I tried to run the following code as a first step of this (NOTE: this is on GPflow 0.5, an older version), the returned samples are negative, even though the lengthscale and variance need to be positive (negative values would be meaningless).
import numpy as np
from matplotlib import pyplot as plt
import gpflow
from gpflow import hmc
X = np.linspace(-3, 3, 20)
Y = np.random.exponential(np.sin(X) ** 2)
Y = (Y - np.mean(Y)) / np.std(Y)
k = gpflow.kernels.Matern32(1, lengthscales=.2, ARD=False)
m = gpflow.gpr.GPR(X[:, None], Y[:, None], k)
m.kern.lengthscales.prior = gpflow.priors.Gamma(1., 1.)
m.kern.variance.prior = gpflow.priors.Gamma(1., 1.)
# don't want the likelihood variance to be a hyperparameter now, so keep it fixed
m.likelihood.variance = 1e-6
m.likelihood.variance.fixed = True
m.optimize(maxiter=1000)
samples = m.sample(500)
print(samples)
Output:
[[-0.43764571 -0.22753325]
[-0.50418501 -0.11070128]
[-0.5932655 0.00821438]
[-0.70217714 0.05077999]
[-0.77745654 0.09362291]
[-0.79404456 0.13649446]
[-0.83989415 0.27118385]
[-0.90355789 0.29589641]
...
I don't know HMC sampling in much detail, but I would expect the sampled posterior hyperparameters to be positive. I've checked the code, and it seems it may be related to the Log1pe transform, though I failed to figure it out myself.
Any hint on this?
It would be helpful if you specified which GPflow version you are using - especially given that, from the output you posted, it looks like you are using a really old version of GPflow (pre-1.0), and this is actually something that has been improved since.
What is happening here (in old GPflow) is that the sample() method returns a single S x P array, where S is the number of samples and P is the number of free parameters [e.g. for an M x M matrix parameter with a lower-triangular transform (such as the Cholesky of the covariance of the approximate posterior, q_sqrt), only M * (M + 1)/2 parameters are actually stored and optimised!]. These are the values in the unconstrained space, i.e. they can take any value whatsoever.
Transforms (see the gpflow.transforms module) provide the mapping between this unconstrained value (between plus/minus infinity) and the constrained value (e.g. gpflow.transforms.positive for lengthscales and variances).
In old GPflow, the model provides a get_samples_df() method that takes the S x P array returned by sample() and returns a pandas DataFrame with columns for all the trainable parameters, which would be what you want. Or, ideally, you would just use a recent version of GPflow, in which the HMC sampler directly returns the DataFrame!
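For illustration, a short sketch of that last step against the old (pre-1.0) API as described above, reusing m and samples from the question:
# old GPflow (pre-1.0): map the unconstrained samples through the
# parameter transforms to get the constrained values
sample_df = m.get_samples_df(samples)
print(sample_df.head())  # lengthscales and variance should now be positive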
I am trying to produce a random distribution where I control the mean, SD, skewness and kurtosis.
I can solve the mean and SD with some simple maths after the distribution is produced.
Kurtosis I am leaving on the shelf for the moment because it just seems too hard.
Skewness is today's problem.
import numpy as np
from scipy import stats

def convert_to_alpha(s):
    # invert the skew-normal skewness formula to get the shape parameter alpha
    d = (np.pi/2 * ((abs(s)**(2/3)) / (abs(s)**(2/3) + ((4 - np.pi)/2)**(2/3))))**0.5
    a = d / (1 - d**2)**0.5
    return a
for skewness_expected in (.5, .9, 1.3):
    alpha = convert_to_alpha(skewness_expected)
    r = stats.skewnorm.rvs(alpha, size=10000)
    print('Skewness expected:', skewness_expected)
    print('Skewness obtained:', stats.skew(r))
    print()
Skewness expected: 0.5
Skewness obtained: 0.47851348006629035
Skewness expected: 0.9
Skewness obtained: 0.8917020428586827
Skewness expected: 1.3
Skewness obtained: (1.2794406116842627+0.01780402125888404j)
I understand that the calculated skewness will generally not match the desired skewness - this is a random sample, after all. But I am confused as to how I can get a distribution with a skewness > 1 without falling into complex-number territory. The rvs method appears incapable of handling it, since the parameter alpha is a complex number whenever the requested skewness is > 1.
How can I fix it so that I can generate distributions with skewness > 1, but not have complex numbers creeping in?
[With credit to Warren Weckesser for pointing me at Wikipedia in order to write the convert_to_alpha function.]
I understand this thread is a year and a half old now, but I've run into this problem recently as well, and it never seemed to get answered here. The further problem with converting between alpha from stats.skewnorm and the skewness statistic (excellent function to do that, by the way) is that doing so will also alter the measures of central tendency of the distribution, which was problematic for my needs.
I've developed this based on the F-distribution (https://en.wikipedia.org/wiki/F-distribution). The end result of a lot of work is the function below, for which you specify the required mean, SD and skewness, and the desired sample size. I can share the work behind it if anyone wishes. The output SD and skew become a little rough at extreme settings, presumably because the F-distribution naturally sits around 1. It is also very problematic for skew values close to zero, in which case there would be no need for this function anyway.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def createSkewDist(mean, sd, skew, size):
    # calculate the first degrees-of-freedom parameter (df1) required to
    # obtain the specified skewness statistic, derived from simulations
    loglog_slope = -2.211897875506251
    loglog_intercept = 1.002555437670879
    df2 = 500
    df1 = 10**(loglog_slope*np.log10(abs(skew)) + loglog_intercept)

    # sample from the F distribution
    fsample = np.sort(stats.f(df1, df2).rvs(size=size))

    # adjust the variance by scaling the distance from each point to the
    # distribution mean by a constant, derived from simulations
    k1_slope = 0.5670830069364579
    k1_intercept = -0.09239985798819927
    k2_slope = 0.5823114978219056
    k2_intercept = -0.11748300123471256
    scaling_slope = abs(skew)*k1_slope + k1_intercept
    scaling_intercept = abs(skew)*k2_slope + k2_intercept
    scale_factor = (sd - scaling_intercept)/scaling_slope
    new_dist = (fsample - np.mean(fsample))*scale_factor + fsample

    # flip the distribution if the specified skew is negative
    if skew < 0:
        new_dist = np.mean(new_dist) - new_dist

    # shift the distribution mean to the specified value
    final_dist = new_dist + (mean - np.mean(new_dist))
    return final_dist
'''EXAMPLE'''
desired_mean = 497.68
desired_skew = -1.75
desired_sd = 77.24
final_dist = createSkewDist(mean=desired_mean, sd=desired_sd, skew=desired_skew, size=1000000)
# inspect the plots & moments, try random sample
fig, ax = plt.subplots(figsize=(12,7))
sns.distplot(final_dist, hist=True, ax=ax, color='green', label='generated distribution')
sns.distplot(np.random.choice(final_dist, size=100), hist=True, ax=ax, color='red', hist_kws={'alpha':.2}, label='sample n=100')
ax.legend()
print('Input mean: ', desired_mean)
print('Result mean: ', np.mean(final_dist),'\n')
print('Input SD: ', desired_sd)
print('Result SD: ', np.std(final_dist),'\n')
print('Input skew: ', desired_skew)
print('Result skew: ', stats.skew(final_dist))
Input mean: 497.68
Result mean: 497.6799999999999
Input SD: 77.24
Result SD: 71.69030764848961
Input skew: -1.75
Result skew: -1.6724486459469905
The shape parameter of the skew-normal distribution is not the skewness of the distribution. Check out the Wikipedia page for the skew-normal distribution. The formulas in the table on the right give the expressions for the mean, variance, skewness, etc., in terms of the parameters. You can get these values from the skewnorm object with the stats() method.
For example, here's the skewness of the distribution with shape parameter 2:
In [46]: from scipy.stats import skewnorm, skew
In [47]: skewnorm.stats(2, moments='s')
Out[47]: array(0.45382556395938217)
Generate a couple of samples and find the sample skewness:
In [48]: r = skewnorm.rvs(2, size=10000000)
In [49]: skew(r)
Out[49]: 0.4533209955299838
In [50]: r = skewnorm.rvs(2, size=10000000)
In [51]: skew(r)
Out[51]: 0.4536583726840712
I was wondering what function in numpy/scipy corresponded to pcacov() in MATLAB. If there isn't a corresponding one, what would be the best way to implement the function?
Thanks!
NumPy and SciPy don't have specific routines for PCA, but they do have the linear algebra primitives required to compute it. Any pca function in any language will basically be just a light wrapper around an eigenvalue or singular value decomposition, with different conventions regarding centering, normalization, meaning of matrix dimensions, and terms (eigenvectors, principal components, principal vectors, latent variables, etc. are all different names for the same thing, sometimes with slight variations).
So, for example, given a matrix X you can compute the PCA using the SVD:
import numpy as np
def pca(X):
    X_centered = X - X.mean(0)
    u, s, vt = np.linalg.svd(X_centered)
    evals = s[::-1] ** 2 / (X.shape[0] - 1)
    evecs = vt[::-1].T
    return evals, evecs
np.random.seed(0)
X = np.random.rand(100, 3)
evals, evecs = pca(X)
print(evals)
# [ 0.06820946 0.08738236 0.09858988]
print(evecs)
# [[-0.49659797 0.4567562 -0.73808145]
# [ 0.34847559 0.88371847 0.31242029]
# [ 0.79495611 -0.10205609 -0.59802118]]
If you have a covariance matrix, you can compute the PCA using an eigenvalue decomposition:
def pcacov(C):
    return np.linalg.eigh(C)
C = np.cov(X.T)
evals, evecs = pcacov(C)
print(evals)
# [ 0.06820946 0.08738236 0.09858988]
print(evecs)
# [[-0.49659797 -0.4567562 -0.73808145]
# [ 0.34847559 -0.88371847 0.31242029]
# [ 0.79495611 0.10205609 -0.59802118]]
The results are the same, up to a sign in the eigenvector columns.
Now, I've used a particular set of conventions here regarding whether datapoints are in rows or columns, how the covariance is normalized, etc. and those details vary from implementation to implementation of PCA. So the Matlab code might give different results because it's using different conventions internally. But under the hood, it's doing something very similar to the computations used above.
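If you specifically want output ordered like MATLAB's pcacov, which sorts the components by decreasing variance, here is a small sketch wrapping the eigh-based version above (pcacov_desc is my own name for it, not a NumPy or MATLAB function):
def pcacov_desc(C):
    # eigh returns eigenvalues in ascending order; reverse the order to
    # mimic MATLAB's convention of decreasing component variance
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    return evecs[:, order], evals[order]

coeff, latent = pcacov_desc(C)  # C = np.cov(X.T) as above
print(latent)  # eigenvalues now sorted largest first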
I have some data points with errors in both the x and y coordinates. I therefore want to use Python's ODR tool to compute the best-fit slope and the error on this slope. I have tried it on my actual data but did not get good results, so I first tried to use ODR with a simple example, as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.odr import *
def linear_func(B, x):
    return B[0]*x + B[1]
x_data=np.array([0.0, 1.0, 2.0, 3.0])
y_data=np.array([0.0, 1.0, 2.0, 3.0])
x_err=np.array([1.0, 1.0, 1.0, 1.0])
y_err=np.array([5.0, 5.0, 5.0, 5.0])
linear=Model(linear_func)
data=RealData(x_data, y_data, sx=x_err, sy=y_err)
odr=ODR(data, linear, beta0=[1.0, 0.0])
out=odr.run()
out.pprint()
The pprint() line gives:
Beta: [ 1. 0.]
Beta Std Error: [ 0. 0.]
Beta Covariance: [[ 5.20000039 -7.80000026]
[ -7.80000026 18.1999991 ]]
Residual Variance: 0.0
Inverse Condition #: 0.0315397386692
Reason(s) for Halting:
Sum of squares convergence
The resulting Beta values are shown to be 1.0 and 0.0, which I would expect. But why are the standard errors, Beta Std Error, both zero if my errors on the data points are quite large? Can anyone offer some insight?
I see no discrepancy here. Your example model fits your data perfectly, so the weights you pass with the data do not matter. Moreover, your initial guess beta0=[1.0, 0.0] is a parameter vector giving an optimal solution, so the ODR machinery cannot find an iterative improvement of the parameters and quits after zero iterations. The associated errors are zero because, for the given data, the solution found is infinitely better than any other possible solution: the sum of squares at B=[1, 0] is zero.
To see what actually happens inside the ODR.run() function, add odr.set_iprint(init=2, iter=2, final=2) before you run the regression. In particular, the following output confirms that ODR reaches the stopping condition immediately:
--- STOPPING CONDITIONS:
INFO = 1 ==> SUM OF SQUARES CONVERGENCE.
NITER = 0 (NUMBER OF ITERATIONS)
Note that the errors will not be zero, and NITER will be positive, if either your x_data differs from y_data or beta0 does not match the optimal solution. In that case, the errors returned by ODR will be nonzero, although they may still be very small.
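To illustrate, here is a variation of the question's example with slightly perturbed y values (my own numbers, not from the original post), so the line no longer fits exactly:
# same setup as before, but with noisy y_data so the fit is not exact
y_data = np.array([0.1, 0.9, 2.2, 2.9])
data = RealData(x_data, y_data, sx=x_err, sy=y_err)
odr = ODR(data, linear, beta0=[1.0, 0.0])
odr.set_iprint(init=2, iter=2, final=2)  # print the stopping conditions
out = odr.run()
out.pprint()  # Beta Std Error entries should now be nonzero, with NITER > 0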