Minimal p-value for scipy.stats.pearsonr - scipy

I am running scipy.stats.pearsonr on my data, and I get
(0.9672434106763087, 0.0)
It is reasonable that the r-value is high and the p-value is very low.
However, p is obviously not 0, so I would like to know what p=0.0 means. Is it p<10^-10, p<10^-100 or what is the limit?

As pointed out by MB-F in the comments, the p-value is calculated analytically.
In the code for version 0.19.1, you can isolate that part of the code and plot the p-value as a function of r:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc

r = np.linspace(-1, 1, 1000) * (1 - 1e-10)  # stay just inside (-1, 1) to avoid division by zero
for n in [10, 100, 1000]:
    df = n - 2
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    prob = betainc(0.5*df, 0.5, df / (df + t_squared))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
plt.show()
The current stable version 1.9.3 uses a different formula
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr

r = np.linspace(-1, 1, 1000) * (1 - 1e-10)
for n in [10, 100, 1000]:
    ab = 0.5*n - 1  # shape parameters of the null distribution of r (beta on [-1, 1])
    prob = 2 * btdtr(ab, ab, 0.5 * (1 - abs(r)))  # two-sided p-value
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
plt.show()
but it yields the same results.
You can see that with 1000 points and a correlation that high, the p-value drops below the smallest positive double-precision number, so it is reported as 0.0.
The beta distribution
Scipy provides a collection of probability distributions, among them, the beta distribution.
The line
prob = 2 * btdtr(ab, ab, 0.5 * (1 - abs(r)))
could be replaced by
from scipy.stats import beta
prob = 2 * beta(ab, ab).cdf(0.5 * (1 - abs(r)))
In the scipy.stats.beta documentation you can find much more information about the distribution.
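If you want an actual number instead of 0.0, one option (not part of the answer above, just a sketch assuming mpmath is installed) is to evaluate the same regularized incomplete beta expression with arbitrary-precision arithmetic:

from mpmath import mp, mpf, betainc as mp_betainc

mp.dps = 50  # 50 significant digits of working precision

def pearsonr_pvalue(r, n):
    # two-sided p-value of Pearson's r under the null hypothesis:
    # r follows a beta(n/2 - 1, n/2 - 1) distribution on [-1, 1]
    ab = mpf(n) / 2 - 1
    x = (1 - abs(mpf(r))) / 2
    return 2 * mp_betainc(ab, ab, 0, x, regularized=True)

print(pearsonr_pvalue(0.9672434106763087, 1000))  # tiny, but no longer rounded to 0

This mirrors the beta-distribution formula above; for large n the result is far below the smallest positive float64, which is why scipy reports exactly 0.0.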

Related

Is the inplace operation with scipy gaussian_filter1d safe?

Here is the sample code I wrote to examine this issue.
It can be seen that in this case we get the same result, but I want to know whether it is safe to compute in place with other options (scipy version, arguments, ...).
import numpy as np
from scipy.ndimage import gaussian_filter1d

X = np.random.normal(0, 1, size=[64, 1024, 2048])

# filter X in place, axis by axis
OPX = X.copy()
for axis, sigma in zip([-2, -1], [3, 7]):
    gaussian_filter1d(OPX, sigma, axis, output=OPX)

# same filtering, but writing into a separate output array
OPY, OPZ = X.copy(), X.copy()
for axis, sigma in zip([-2, -1], [3, 7]):
    gaussian_filter1d(OPY, sigma, axis, output=OPZ)
    OPY, OPZ = OPZ, OPY

(OPX == OPY).all()  # True
python 3.7.15
scipy 1.7.3
numpy 1.21.6
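A slightly broader check along the same lines, just as a sketch (it only shows equality for the tested cases and cannot prove safety in general), compares the in-place call against an out-of-place reference over a few sigmas, axes and boundary modes:

import numpy as np
from scipy.ndimage import gaussian_filter1d

X = np.random.normal(0, 1, size=[16, 128, 256])  # smaller array so the loop stays fast
for axis in [-2, -1]:
    for sigma in [1, 3, 7]:
        for mode in ['reflect', 'nearest', 'wrap']:
            inplace = X.copy()
            gaussian_filter1d(inplace, sigma, axis=axis, mode=mode, output=inplace)
            reference = gaussian_filter1d(X, sigma, axis=axis, mode=mode)
            assert np.array_equal(inplace, reference), (axis, sigma, mode)
print('in-place and out-of-place results matched for all tested parameters')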

To fit Linear regression Model with and without intercept in python

I need to fit linear regression Model 1: y = β1·x1 + ε and Model 2: y = β0 + β1·x1 + ε to the data x1 = [0, 1, 2, 3, 4],
y = [1, 2, 3, 2, 1]. My objective is to find the
coefficients, the squared error loss, the absolute error loss, and the L1.5 loss for both models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.formula.api as smf

x1 = [0, 1, 2, 3, 4]
y = [1, 2, 3, 2, 1]
Would you please show me some way to get these?
The first method doesn't use the formula API.
import statsmodels.api as sm
import numpy as np

x1 = np.array([0, 1, 2, 3, 4])
y = np.array([1, 2, 3, 2, 1])
x1 = x1[:, None]  # transform into a (5, 1) array
res = sm.OLS(y, x1).fit()
print(res.summary())
If you want to use the formula interface, you need to build a DataFrame; the regression is then "y ~ x1" (the formula API includes a constant by default; add "- 1" to the right-hand side of the formula if you want to drop it).
import statsmodels.formula.api as smf
import pandas as pd
x1 = [0,1,2,3,4]
y = [1,2,3,2,1]
data = pd.DataFrame({"y":y,"x1":x1})
res = smf.ols("y ~ x1", data).fit()
print(res.summary())
The formula version (which includes the intercept by default) produces
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.000
Model: OLS Adj. R-squared: -0.333
Method: Least Squares F-statistic: 4.758e-16
Date: Wed, 17 Mar 2021 Prob (F-statistic): 1.00
Time: 22:11:40 Log-Likelihood: -5.6451
No. Observations: 5 AIC: 15.29
Df Residuals: 3 BIC: 14.51
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.8000 0.748 2.405 0.095 -0.582 4.182
x1 0 0.306 0 1.000 -0.972 0.972
==============================================================================
Omnibus: nan Durbin-Watson: 1.429
Prob(Omnibus): nan Jarque-Bera (JB): 0.375
Skew: 0.344 Prob(JB): 0.829
Kurtosis: 1.847 Cond. No. 4.74
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To include an intercept in the non-formula API, you can simply use
res_constant = sm.OLS(y, sm.add_constant(x1)).fit()
You can use sklearn's LinearRegression.
For the model without an intercept (i.e. to force the fit through the origin), simply set the parameter fit_intercept=False.
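As a rough sketch of how the coefficients and the requested losses could be collected for both models with sklearn, interpreting the L1.5 loss as the sum of |residual|**1.5 (note that LinearRegression always minimizes the squared error, so the other losses are merely evaluated at the least-squares fit):

import numpy as np
from sklearn.linear_model import LinearRegression

x1 = np.array([0, 1, 2, 3, 4]).reshape(-1, 1)
y = np.array([1, 2, 3, 2, 1])

# Model 1: y = b1*x1 (no intercept); Model 2: y = b0 + b1*x1 (with intercept)
for name, fit_intercept in [('Model 1 (no intercept)', False), ('Model 2 (with intercept)', True)]:
    model = LinearRegression(fit_intercept=fit_intercept).fit(x1, y)
    resid = y - model.predict(x1)
    print(name)
    print('  coefficients:', model.intercept_, model.coef_)
    print('  squared error loss:', np.sum(resid**2))
    print('  absolute error loss:', np.sum(np.abs(resid)))
    print('  L1.5 loss:', np.sum(np.abs(resid)**1.5))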

scipy.special yields fluctuating result for confluent hypergeometric function

The scipy implementation of the confluent hypergeometric function gives me wrong results. Here is a minimal example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
x=np.arange(0,1,.001)
f=special.hyp1f1(30,60,-1/x)
plt.scatter(x,f,s=.05)
When I run it, it produces the following plot:
[plot: output of scipy.special.hyp1f1]
I wonder if there is a way to fix these fluctuations, which are definitely not correct. In fact, the function should be strictly positive in that range.
Starting from the explanation at scipy.special.hyp1f1, here is an attempt to approximate the function with a polynomial (the truncated power series of the function).
Apparently, hyp1f1(-1/x) works nicely between x=0 and about x=0.2. Note that at exactly x=0 the function isn't properly defined. The approximation with a 5th degree polynomial is much too large for x<0.4. With an 80th degree polynomial, the approximation seems correct starting at x>0.025, but it quickly gets out of bounds for smaller x. (With more than 90 terms the polynomial can't be calculated in this way anymore.)
Probably the best solution would be to use a high degree polynomial for x>=0.1 and the original hyp1f1 when x is smaller; a sketch of that combination is given after the plots below.
import matplotlib.pyplot as plt
import numpy as np
from scipy import special

x = np.linspace(0.001, 1, 1000)
f = special.hyp1f1(30, 60, -1 / x)
plt.scatter(x, f, s=1, color='r', label='hyp1f1')
for terms in range(80, 1, -10):
    k10 = np.arange(terms)
    c10 = special.poch(30, k10) / (special.poch(60, k10) * special.factorial(k10))
    poly10 = np.poly1d(c10[::-1])
    plt.scatter(x, poly10(-1 / x), s=1, label=f'{terms} terms', color=plt.cm.Set1(terms / 80))
plt.ylim(-3.5, 3.7)
plt.legend(scatterpoints=10, ncol=3)
plt.show()
[zoomed-in view of the plot above]
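Following that suggestion, here is a sketch of the piecewise combination; the cutoff of 0.1 is just the value suggested above, not something tuned carefully:

import numpy as np
import matplotlib.pyplot as plt
from scipy import special

x = np.linspace(0.001, 1, 1000)
cutoff = 0.1  # hyp1f1 below the cutoff, the truncated series above it

# 80-term truncated power series of 1F1(30; 60; z), evaluated at z = -1/x
k = np.arange(80)
c = special.poch(30, k) / (special.poch(60, k) * special.factorial(k))
poly = np.poly1d(c[::-1])

# np.where evaluates both branches, so expect harmless overflow warnings
# from the polynomial at very small x; those values are discarded
f = np.where(x < cutoff, special.hyp1f1(30, 60, -1 / x), poly(-1 / x))
plt.scatter(x, f, s=1)
plt.show()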

gaussian process regression in multiple dimensions with GPflow

I would like to perform some multivariate regression using Gaussian process regression as implemented in GPflow, using version 2.
Installed with pip install gpflow==2.0.0rc1
Below is some example code that generates some 2D data, then attempts to fit it using GPR, and finally computes the difference
between the true input data and the GPR prediction.
Eventually I would like to extend to higher dimensions
and do tests against a validation set to check for over-fitting
and experiment with other kernels and "Automatic Relevance Determination"
but understanding how to get this to work is the first step.
Thanks!
The following code snippet will work in a Jupyter notebook.
import gpflow
import numpy as np
import matplotlib
from gpflow.utilities import print_summary
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 6)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def gen_data(X, Y):
    """
    Make some fake data.
    X, Y are np.ndarrays with shape (N,) where
    N is the number of samples.
    """
    ys = []
    for x0, x1 in zip(X, Y):
        y = x0 * np.sin(x0*10)
        y = x1 * np.sin(x0*10)
        y += 1
        ys.append(y)
    return np.array(ys)
# generate some fake data
x = np.linspace(0, 1, 20)
X, Y = np.meshgrid(x, x)
X = X.ravel()
Y = Y.ravel()
z = gen_data(X, Y)
#note X.shape, Y.shape and z.shape
#are all (400,) for this case.
# if you would like to plot the data you can do the following
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X, Y, z, s=100, c='k')
# had to set this
# to avoid the following error
# tensorflow.python.framework.errors_impl.InvalidArgumentError: Cholesky decomposition was not successful. The input might not be valid. [Op:Cholesky]
gpflow.config.set_default_positive_minimum(1e-7)
# setup the kernel
k = gpflow.kernels.Matern52()
# set up GPR model
# I think the shape of the independent data
# should be (400, 2) for this case
XY = np.column_stack([[X, Y]]).T
print(XY.shape) # this will be (400, 2)
m = gpflow.models.GPR(data=(XY, z), kernel=k, mean_function=None)
# optimise hyper-parameters
opt = gpflow.optimizers.Scipy()
def objective_closure():
    return -m.log_marginal_likelihood()

opt_logs = opt.minimize(objective_closure,
                        m.trainable_variables,
                        options=dict(maxiter=100))
# predict training set
mean, var = m.predict_f(XY)
print(mean.numpy().shape)
# (400, 400)
# I would expect this to be (400,)
# If it was then I could compute the difference
# between the true data and the GPR prediction
# `diff = mean - z`
# but because the shape is not as expected this of course
# won't work.
The shape of z must be (N, 1), whereas in your case it is (N,). However, this is a missing check in GPflow and not your fault.
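A minimal sketch of that fix, reusing the variable names from the question:

Z = z.reshape(-1, 1)  # column vector of targets, shape (400, 1)
m = gpflow.models.GPR(data=(XY, Z), kernel=k, mean_function=None)
# ... optimise exactly as before ...
mean, var = m.predict_f(XY)
print(mean.numpy().shape)  # now (400, 1) rather than (400, 400)
diff = mean.numpy().ravel() - z  # difference between GPR prediction and the true data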

keep the scaling while drawing a weighted networkx graph

When I draw a weighted networkx graph, it does not really represent the real weights in terms of distance. I was curious whether there is a parameter that I am missing or some other problem.
So I started by making a simulated dataset as follows:
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import euclidean
import networkx as nx
from scipy.spatial.distance import pdist, squareform, cdist
# data generation
data = vstack((rand(5,2) + array([12,12]),rand(5,2)))
a = pdist(data, 'euclidean')
def givexy(index1D, VectorLength):
    # map a condensed-distance index to a pair of node indices
    return [index1D % VectorLength, index1D // VectorLength]
import matplotlib.pyplot as plt
plt.plot(data[:,0], data[:,1], 'o')
plt.show()
Then I calculate the Euclidean distance among all pairs and use the distance as the weight:
G = nx.empty_graph(1)
for cnt, item in enumerate(a):
    print(cnt)
    G.add_edge(givexy(cnt, 10)[0], givexy(cnt, 10)[1], weight=item, length=0)
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos)
edge_labels = dict([((u, v), "%.2f" % d['weight'])
                    for u, v, d in G.edges(data=True)])
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
#~ nx.draw(G,pos,edge_labels=edge_labels)
plt.show()
exit()
You might get a different plot - for some reason the layout is random. My main problem is the distance between nodes: for example, the distance between nodes 4 and 8 is 0.82, but it looks longer than the distance between nodes 7 and 0.
Any hint?
Thank you.
The spring layout doesn't explicitly use the weights as distances; higher-weight edges just tend to come out shorter.
If you want to specify the positions explicitly, though, you can do that:
from numpy import vstack, array
from numpy.random import rand
from scipy.spatial.distance import euclidean, pdist
import networkx as nx
import matplotlib.pyplot as plt

# data generation
data = vstack((rand(5,2) + array([12,12]), rand(5,2)))
a = pdist(data, 'euclidean')

def givexy(index1D, VectorLength):
    return [index1D % VectorLength, index1D // VectorLength]

plt.plot(data[:,0], data[:,1], 'o')

G = nx.Graph()
for cnt, item in enumerate(a):
    print(cnt)
    G.add_edge(givexy(cnt, 10)[0], givexy(cnt, 10)[1], weight=item, length=0)

# use the original data coordinates as node positions
pos = {}
for node, row in enumerate(data):
    pos[node] = row
nx.draw_networkx(G, pos)
plt.savefig('drawing.png')
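A different option, not in the answer above and only available in newer networkx versions, is to let the layout itself try to honour the pairwise distances via the Kamada-Kawai algorithm, which accepts a dict of target distances; a sketch:

# build a two-level dict of target distances from the edge weights
dist = {u: {} for u in G.nodes()}
for u, v, d in G.edges(data=True):
    if u != v:  # skip self-loops created by the index mapping above
        dist[u][v] = d['weight']
        dist[v][u] = d['weight']
pos_kk = nx.kamada_kawai_layout(G, dist=dist)
nx.draw_networkx(G, pos_kk)
plt.savefig('drawing_kamada_kawai.png')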