Apparently missing scipy function - scipy

When I run the script below, which is taken from the scipy documentation, I get the error "AttributeError: module 'scipy.integrate' has no attribute 'cumulative_trapezoid'". When I run dir(integrate), 'cumulative_trapezoid' does not appear in the list of functions.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
x = np.linspace(-2, 2, num=20)
y = x
y_int = integrate.cumulative_trapezoid(y, x, initial=0)
plt.plot(x, y_int, 'ro', x, y[0] + 0.5 * x**2, 'b-')
plt.show()

cumulative_trapezoid was added to SciPy in version 1.6. You can check which version you have with import scipy; print(scipy.__version__). In older versions of SciPy, the function is called cumtrapz.
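If the script has to run on both older and newer SciPy versions, a small compatibility shim is one option. A minimal sketch, assuming cumtrapz is still available as the pre-1.6 spelling (it was kept as a deprecated alias for several releases):
from scipy import integrate
try:
    cumulative_trapezoid = integrate.cumulative_trapezoid
except AttributeError:
    # fall back to the old name on SciPy < 1.6
    cumulative_trapezoid = integrate.cumtrapz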

scipy.special yields fluctuating result for confluent hypergeometric function

The scipy implementation of the confluent hypergeometric function gives me wrong results. Here is a minimal example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
x = np.arange(0, 1, .001)
f = special.hyp1f1(30, 60, -1 / x)
plt.scatter(x, f, s=.05)
When I run it, it produces the following plot:
[plot: output of scipy.special.hyp1f1, showing large fluctuations]
I wonder if there is a way to fix these fluctuations, which are definitely not correct. In fact, the function should be strictly positive in that range.
Starting from the explanation at scipy.special.hyp1f1, here is an attempt to approximate the function with a polynomial.
Apparently, hyp1f1(-1/x) works nicely between x=0 and about x=0.2. Note that at exactly x=0, the function isn't properly defined. The approximation with a 5th degree polynomial is much too large for x<0.4. With an 80th degree polynomial, the approximation seems correct starting at x>0.025 but quickly gets out of bounds for smaller x. (With more than 90 terms the polynomial can't be calculated in this way anymore.)
Probably the best solution would be to use a high degree polynomial for x>=0.1 and the original hyp1f1 when x is smaller; a sketch of that hybrid follows the plot below.
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
x = np.linspace(0.001, 1, 1000)
f = special.hyp1f1(30, 60, -1 / x)
plt.scatter(x, f, s=1, color='r', label='hyp1f1')
for terms in range(80, 1, -10):
    # partial sums of the series: coefficients poch(30,k) / (poch(60,k) * k!)
    k10 = np.arange(terms)
    c10 = special.poch(30, k10) / (special.poch(60, k10) * special.factorial(k10))
    poly10 = np.poly1d(c10[::-1])
    plt.scatter(x, poly10(-1 / x), s=1, label=f'{terms} terms', color=plt.cm.Set1(terms / 80))
plt.ylim(-3.5, 3.7)
plt.legend(scatterpoints=10, ncol=3)
plt.show()
Zoomed in: [zoomed-in view of the plot above]
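Here is a rough sketch of that hybrid, assuming the 80-term series and the x=0.1 cutoff suggested above (both values are assumptions and worth tuning against your accuracy needs):
import numpy as np
from scipy import special
# coefficients of the 1F1(30, 60, t) series: poch(30,k) / (poch(60,k) * k!)
k = np.arange(80)
c = special.poch(30, k) / (special.poch(60, k) * special.factorial(k))
poly = np.poly1d(c[::-1])
def hyp1f1_hybrid(x):
    x = np.asarray(x, dtype=float)
    # polynomial partial sum where it is accurate, scipy's hyp1f1 elsewhere
    return np.where(x >= 0.1, poly(-1 / x), special.hyp1f1(30, 60, -1 / x))
x = np.linspace(0.001, 1, 1000)
f = hyp1f1_hybrid(x)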

Curve fitting of sine function in python using scipy is not yielding desired output

I'm trying to fit a sine function to my data. No errors are shown, but it doesn't seem to work.
from scipy.optimize import curve_fit as cf
import numpy as np
import matplotlib.pyplot as plt
def sin_fun(x, a, b):
    return a * np.sin(b * x)
# xdata and ydata are my measured arrays
p_opt, p_cov = cf(sin_fun, xdata, ydata)
print(p_opt)
plt.plot(xdata, sin_fun(xdata, *p_opt))
plt.scatter(xdata, ydata)
plt.show()
This is the output I am getting:
I have simulated your data. There are two problems with your code as to why it isn't doing what you want. First, your sin_fun needs a y-offset parameter, otherwise the function will always be symmetrical about y = 0. Secondly, the fit works better if you can provide curve_fit with a reasonable initial guess, using the p0 argument. Have a look here:
from scipy.optimize import curve_fit as cf
import numpy as np
from matplotlib import pyplot as plt
# simulate your data
xdata = np.linspace(0, 25000, 256)
ydata = 15000 * np.sin(xdata / 2000) + 22000
# add some noise
ydata += np.random.rand(xdata.size) * 2000
# sin function needs a y-offset -> c
def sin_fun(x, a, b, c):
    return a * np.sin(b * x) + c
# need a reasonable guess -> note that the guess is not quite right but curve_fit still works
p_opt, p_cov = cf(sin_fun, xdata, ydata, p0=(10000, 1/2500, 15000))
print(p_opt)
plt.plot(xdata, sin_fun(xdata, *p_opt))
plt.plot(xdata, ydata, 'r.', ms=1)
plt.show()
With these fixes you can get a good fit. You could also add a phase parameter to your function to help fit other sinusoids.
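As a minimal sketch of that phase parameter, building on the code above (the initial phase guess of 0 is just an assumption; pick one that roughly matches your data):
# d shifts the sinusoid horizontally (phase)
def sin_fun(x, a, b, c, d):
    return a * np.sin(b * x + d) + c
p_opt, p_cov = cf(sin_fun, xdata, ydata, p0=(10000, 1/2500, 15000, 0))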

gaussian process regression in multiple dimensions with GPflow

I would like to perform some multivariate regression using Gaussian process regression as implemented in GPflow version 2, installed with pip install gpflow==2.0.0rc1. Below is some example code that generates some 2D data, attempts to fit it with GPR, and finally computes the difference between the true input data and the GPR prediction. Eventually I would like to extend to higher dimensions, do tests against a validation set to check for over-fitting, and experiment with other kernels and "Automatic Relevance Determination", but understanding how to get this to work is the first step. Thanks!
The following code snippet will run in a Jupyter notebook.
import gpflow
import numpy as np
import matplotlib
from gpflow.utilities import print_summary
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 6)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def gen_data(X, Y):
    """
    Make some fake data.
    X, Y are np.ndarrays with shape (N,) where
    N is the number of samples.
    """
    ys = []
    for x0, x1 in zip(X, Y):
        y = x0 * np.sin(x0 * 10)
        y = x1 * np.sin(x0 * 10)
        y += 1
        ys.append(y)
    return np.array(ys)
# generate some fake data
x = np.linspace(0, 1, 20)
X, Y = np.meshgrid(x, x)
X = X.ravel()
Y = Y.ravel()
z = gen_data(X, Y)
# note X.shape, Y.shape and z.shape
# are all (400,) for this case.
# if you would like to plot the data you can do the following
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X, Y, z, s=100, c='k')
# had to set this
# to avoid the following error
# tensorflow.python.framework.errors_impl.InvalidArgumentError: Cholesky decomposition was not successful. The input might not be valid. [Op:Cholesky]
gpflow.config.set_default_positive_minimum(1e-7)
# set up the kernel
k = gpflow.kernels.Matern52()
# set up the GPR model
# I think the shape of the independent data
# should be (400, 2) for this case
XY = np.column_stack([[X, Y]]).T
print(XY.shape)  # this will be (400, 2)
m = gpflow.models.GPR(data=(XY, z), kernel=k, mean_function=None)
# optimise hyper-parameters
opt = gpflow.optimizers.Scipy()
def objective_closure():
    return -m.log_marginal_likelihood()
opt_logs = opt.minimize(objective_closure,
                        m.trainable_variables,
                        options=dict(maxiter=100))
# predict training set
mean, var = m.predict_f(XY)
print(mean.numpy().shape)
# (400, 400)
# I would expect this to be (400,)
# If it was then I could compute the difference
# between the true data and the GPR prediction
# `diff = mean - z`
# but because the shape is not as expected this of course
# won't work.
The shape of z must be (N, 1), whereas in your case it is (N,). However, this is a missing check in GPflow and not your fault.
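So a minimal fix, following that observation, is to give z a trailing axis before building the model; predict_f should then return a mean of shape (400, 1). A sketch against the code above (re-run the optimisation afterwards as before):
z = gen_data(X, Y).reshape(-1, 1)   # (400, 1) instead of (400,)
m = gpflow.models.GPR(data=(XY, z), kernel=k, mean_function=None)
mean, var = m.predict_f(XY)
print(mean.numpy().shape)           # (400, 1)
diff = mean.numpy().ravel() - z.ravel()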

Minimal p-value for scipy.stats.pearsonr

I am running scipy.stats.pearsonr on my data, and I get
(0.9672434106763087, 0.0)
It is reasonable that the r-value is high and the p-value is very low.
However, p is obviously not 0, so I would like to know what p=0.0 means. Is it p<10^-10, p<10^-100 or what is the limit?
As pointed out by @MB-F in the comments, it is calculated analytically.
In the code for version 0.19.1, you can isolate that part of the code and plot the p-value as a function of r:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc
r = np.linspace(-1, 1, 1000) * (1 - 1e-10)
for n in [10, 100, 1000]:
    df = n - 2
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    prob = betainc(0.5 * df, 0.5, df / (df + t_squared))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
plt.show()
The current stable version 1.9.3 uses a different formula
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr
r = np.linspace(-1, 1, 1000) * (1 - 1e-10)
for n in [10, 100, 1000]:
    # under the null hypothesis, r follows a beta(n/2-1, n/2-1) distribution on [-1, 1]
    ab = 0.5 * n - 1
    prob = 2 * btdtr(ab, ab, 0.5 * (1 - abs(r)))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
plt.show()
But it yields the same results.
You can see that with 1000 points and a correlation as high as yours, the p-value falls below the smallest positive floating-point number, so it underflows to 0.0.
The beta distribution
Scipy provides a collection of probability distributions, among them the beta distribution.
The line
prob = 2 * btdtr(ab, ab, 0.5 * (1 - abs(r)))
could be replaced by
from scipy.stats import beta
prob = 2 * beta(ab, ab).cdf(0.5 * (1 - abs(r)))
and the distribution object gives you access to much more information about it.
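If you need the order of magnitude of such a tiny p-value rather than a hard 0.0, the distribution object can work in log space. A sketch using logcdf (which scipy.stats distributions provide) to sidestep the underflow:
import numpy as np
from scipy.stats import beta
n = 1000
r = 0.9672434106763087
ab = 0.5 * n - 1
# log of the two-sided p-value: log(2) + log of the CDF in the tail
log10_p = (np.log(2) + beta(ab, ab).logcdf(0.5 * (1 - abs(r)))) / np.log(10)
print(log10_p)  # roughly the exponent of the p-value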

keep the scaling while drawing a weighted networkx graph

When I draw a weighted networkx graph, the drawing does not really represent the weights in terms of distance. I was curious whether there is a parameter I am missing or some other problem.
I started by making a simulated dataset as follows:
from numpy import vstack, array
from numpy.random import rand
from scipy.spatial.distance import pdist
import networkx as nx
import matplotlib.pyplot as plt
# data generation: two clusters of 5 random points, far apart
data = vstack((rand(5, 2) + array([12, 12]), rand(5, 2)))
a = pdist(data, 'euclidean')
def givexy(index1D, VectorLength):
    # map a condensed-distance index to a pair of node labels
    return [index1D % VectorLength, index1D // VectorLength]
plt.plot(data[:, 0], data[:, 1], 'o')
plt.show()
Then I calculate the Euclidean distance between all pairs and use the distance as the edge weight:
G = nx.empty_graph(1)
for cnt, item in enumerate(a):
    print(cnt)
    G.add_edge(givexy(cnt, 10)[0], givexy(cnt, 10)[1], weight=item, length=0)
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos)
edge_labels = dict([((u, v), "%.2f" % d['weight'])
                    for u, v, d in G.edges(data=True)])
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
#~ nx.draw(G,pos,edge_labels=edge_labels)
plt.show()
You might get a different plot each run, because the spring layout is randomized. My main problem is the distance between nodes: for example, the distance between nodes 4 and 8 is 0.82, but it looks longer than the distance between nodes 7 and 0.
Any hint? Thank you.
The spring layout doesn't explicitly use the weights as distances; in general, higher-weight edges just come out shorter. If you want to specify the positions explicitly, you can do that:
from numpy import vstack, array
from numpy.random import rand
from scipy.spatial.distance import pdist
import networkx as nx
import matplotlib.pyplot as plt
# data generation
data = vstack((rand(5, 2) + array([12, 12]), rand(5, 2)))
a = pdist(data, 'euclidean')
def givexy(index1D, VectorLength):
    return [index1D % VectorLength, index1D // VectorLength]
plt.plot(data[:, 0], data[:, 1], 'o')
G = nx.Graph()
for cnt, item in enumerate(a):
    G.add_edge(givexy(cnt, 10)[0], givexy(cnt, 10)[1], weight=item, length=0)
# place each node at its original data coordinates
pos = {}
for node, row in enumerate(data):
    pos[node] = row
nx.draw_networkx(G, pos)
plt.savefig('drawing.png')
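If the goal is a layout whose edge lengths track the weights themselves, networkx also offers the Kamada-Kawai layout, which treats edge weights as target distances. A sketch, assuming a networkx version new enough to include nx.kamada_kawai_layout (it postdates the version current when this question was asked):
# Kamada-Kawai tries to make geometric edge lengths match the 'weight' values
pos = nx.kamada_kawai_layout(G, weight='weight')
nx.draw_networkx(G, pos)
plt.savefig('drawing_kk.png')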