Is the inplace operation with scipy gaussian_filter1d safe? - scipy

Here is the sample code I wrote to examine this issue.
It can be seen that in this case we get the same result, but I want to know if it is safe to compute inplace with other options (scipy version, augment, ...).
import numpy as np
from scipy.ndimage import gaussian_filter1d
X = np.random.normal(0, 1, size=[64, 1024, 2048])
OPX = X.copy()
for axis, sigma in zip([-2, -1], [3, 7]):
gaussian_filter1d(OPX, sigma, axis, output=OPX)
OPY, OPZ = X.copy(), X.copy()
for axis, sigma in zip([-2, -1], [3, 7]):
gaussian_filter1d(OPY, sigma, axis, output=OPZ)
OPY, OPZ = OPZ, OPY
(OPX == OPY).all() # True
python 3.7.15
scipy 1.7.3
numpy 1.21.6

Related

Detecting Patterns in dataset using clustering algorithm

I have a dataset that contains a cross pattern. How can I filter off this cross-like pattern?
Tried DBScan but it didn't work effectively. Also, can't use any cluster which needs to specify number of clusters as the data cleaning needs to be automated.
Sorry, forget SVM, that's for classification. Honestly, I didn't ready your question carefully the first time. I just re-read what you posted originally. Try Mean Shift for automatically detecting the optimal number of clusters. Here's an example. Hopefully you can adapt it for your specific use.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1], [1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.2)
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using
bandwidth = estimate_bandwidth(X, quantile=0.6, n_samples=5000)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
my_members = labels == k
cluster_center = cluster_centers[k]
plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Or, try this.
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import make_blobs
# We will be using the make_blobs method
# in order to generate our own data.
clusters = [[2, 2, 2], [7, 7, 7], [5, 13, 13]]
X, _ = make_blobs(n_samples = 150, centers = clusters,
cluster_std = 0.60)
# After training the model, We store the
# coordinates for the cluster centers
ms = MeanShift()
ms.fit(X)
cluster_centers = ms.cluster_centers_
# Finally We plot the data points
# and centroids in a 3D graph.
fig = plt.figure()
ax = fig.add_subplot(111, projection ='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker ='o')
ax.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
cluster_centers[:, 2], marker ='x', color ='red',
s = 300, linewidth = 5, zorder = 10)
plt.show()
There are a few clustering methodologies that help you choose the optimal number of clusters automatically. Check out the link below for some ideas of how to move forward with your project.
https://scikit-learn.org/stable/modules/clustering.html

GPflow - GP classification with 1-dim Linear kernel fits poorly for 2 dimension data

Following the issue #1435, I have an additional question to how to use GPflow.
I replicate the issue in an additional kernel: https://github.com/avalonhse/BayesNotebook/blob/master/Issue_2_GPFlow_Linear_Classification.ipynb
My purpose is fitting an additive kernel to a 2-dimensional data (squared exponential in dimension 1 and linear kernel in dimension 2). Following the instruction of #1435, I have been successfully fitting the model with kernel gpflow.kernels.Linear(variance= 0.1).
Linear kernel
However, when I use the kernel gpflow.kernels.Linear(active_dims=1,variance= 0.01) as I original planned, the model is not fitted. I used the GPy with same kernel as a reference then the result looks reasonable.
1-dim GPFlow kernel
import numpy as np
X = np.array([[ 9.96578428, 60.],[ 9.96578428, 40.],[ 9.96578428, 20.],
[10.96578428, 30.],[11.96578428, 40.],[12.96578428, 50.],
[12.96578428, 70.],[8.96578428, 30. ],[ 7.96578428, 40.],
[ 6.96578428, 50.],[ 6.96578428, 30.],[ 6.96578428, 10.],
[11.4655664 , 71.],[ 8.56605404, 63.],[12.41574177, 69.],
[10.61562964, 48.],[ 7.61470984, 51.],[ 9.31514956, 45.]])
Y = np.array([[1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0.]]).T
# plotting
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
def plot(X,Y):
mask = Y[:, 0] == 1
plt.figure(figsize=(6, 6))
plt.plot(X[mask, 0], X[mask, 1], "oC0", mew=0, alpha=0.5)
plt.ylim(-10, 100)
plt.xlim(5, 15)
_ = plt.plot(X[np.logical_not(mask), 0], X[np.logical_not(mask), 1], "oC1", mew=0, alpha=0.5)
plot(X,Y)
# Evaluate real function and the predicted probability
res = 500
xx, yy = np.meshgrid(np.linspace(5, 15, res),
np.linspace(- 10, 120, res))
Xplot = np.vstack((xx.flatten(), yy.flatten())).T
# Code followed the Notebook : https://gpflow.readthedocs.io/en/develop/notebooks/basics/classification.html
import tensorflow as tf
import tensorflow_probability as tfp
import gpflow
from gpflow.utilities import print_summary, set_trainable, to_default_float
gpflow.config.set_default_summary_fmt("notebook")
def testGPFlow(k):
m = gpflow.models.VGP(
(X, Y),
kernel= k,
likelihood=gpflow.likelihoods.Bernoulli()
)
print("\n ########### Model before optimzation ########### \n")
print_summary(m)
print("\n ########### Model after optimzation ########### \n")
opt = gpflow.optimizers.Scipy()
res = opt.minimize(
m.training_loss, variables=m.trainable_variables, options=dict(maxiter=2500), method="L-BFGS-B"
)
print(' Message: ' + str(res.message) + '\n Status = ' + str(res.status) + '\n Number of iterations = ' + str(res.nit))
print_summary(m)
means, _ = m.predict_y(Xplot) # here we only care about the mean
y_prob = means.numpy().reshape(*xx.shape)
print("Fitting model using GPFlow")
plot(X,Y)
_ = plt.contour(
xx,
yy,
y_prob,
[0.5], # plot the p=0.5 contour line only
colors="k",
linewidths=1.8,
zorder=100,
)
k = gpflow.kernels.Linear(active_dims=[1],variance= 0.01)
testGPFlow(k)
k = gpflow.kernels.Linear(variance= 1)
testGPFlow(k)
The GPy code is for reference only to suggest how a fitted model should be. I am aware that GPy and GPflow use different methods. My question is why GPflow model does not fit when I specify the Linear kernel in 1 dimension.
Thanks for posting this question, Hoang, and for using GPflow.
When you specify input_dim in Gpy, you are telling the algorithm to act on two dimensions. Active_dims in GPflow behaves differently. It specifies which dimensions you want the kernel to act on. 'active_dims = 1' is telling GPflow to apply your linear kernel to only the y dimension.
Since you want your kernel to act on both x and y dimensions, you should specify active_dims = [0,1] rather than just 'active_dims = 1.' When I run your code with this fix, I get a result identical to GPy's result:

gaussian process regression in multiple dimensions with GPflow

I would like to perform some multivariant regression using gaussian process regression as implemented in GPflow using version 2.
Installed with pip install gpflow==2.0.0rc1
Below is some example code that generates some 2D data and then attempts to fit it with using GPR and the finally computes the difference
between the true input data and the GPR prediction.
Eventually I would like to extend to higher dimensions
and do tests against a validation set to check for over-fitting
and experiment with other kernels and "Automatic Relevance Determination"
but understanding how to get this to work is the first step.
Thanks!
The following code snippet will work in a jupyter notebook.
import gpflow
import numpy as np
import matplotlib
from gpflow.utilities import print_summary
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 6)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def gen_data(X, Y):
"""
make some fake data.
X, Y are np.ndarrays with shape (N,) where
N is the number of samples.
"""
ys = []
for x0, x1 in zip(X,Y):
y = x0 * np.sin(x0*10)
y = x1 * np.sin(x0*10)
y += 1
ys.append(y)
return np.array(ys)
# generate some fake data
x = np.linspace(0, 1, 20)
X, Y = np.meshgrid(x, x)
X = X.ravel()
Y = Y.ravel()
z = gen_data(X, Y)
#note X.shape, Y.shape and z.shape
#are all (400,) for this case.
# if you would like to plot the data you can do the following
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X, Y, z, s=100, c='k')
# had to set this
# to avoid the following error
# tensorflow.python.framework.errors_impl.InvalidArgumentError: Cholesky decomposition was not successful. The input might not be valid. [Op:Cholesky]
gpflow.config.set_default_positive_minimum(1e-7)
# setup the kernel
k = gpflow.kernels.Matern52()
# set up GPR model
# I think the shape of the independent data
# should be (400, 2) for this case
XY = np.column_stack([[X, Y]]).T
print(XY.shape) # this will be (400, 2)
m = gpflow.models.GPR(data=(XY, z), kernel=k, mean_function=None)
# optimise hyper-parameters
opt = gpflow.optimizers.Scipy()
def objective_closure():
return - m.log_marginal_likelihood()
opt_logs = opt.minimize(objective_closure,
m.trainable_variables,
options=dict(maxiter=100)
)
# predict training set
mean, var = m.predict_f(XY)
print(mean.numpy().shape)
# (400, 400)
# I would expect this to be (400,)
# If it was then I could compute the difference
# between the true data and the GPR prediction
# `diff = mean - z`
# but because the shape is not as expected this of course
# won't work.
The shape of z must be (N, 1), whereas in your case it is (N,). However, this is a missing check in GPflow and not your fault.

How to use scipy signal for MIMO systems

I am looking for a way to simulate the output of a signal for various input signals. To be more precise, I have a system defined by its transfer function H that takes one input and has one output. I generated several signals (stored in a numpy array). What I would like to do, is get the response of the system, to each input signal whithout using a for loop. Is there a way to proceed? Below is the code I wrote so far.
from __future__ import division
import numpy as np
from scipy import signal
nbr_inputs = 5
t_in = np.arange(0,10,0.2)
dim = (nbr_inputs, len(t_in))
x = np.cumsum(np.random.normal(0,2e-3, dim), axis=1)
H = signal.TransferFunction([1, 3, 3], [1, 2, 1])
t_out, y, _ = signal.lsim(H, x[0], t_in) # here, I would just like to simply write x
thanks for your help
This is not a MIMO system, it is a SISO system but you have multiple inputs.
You can create a MIMO system and apply your inputs all at once which will be computed channel by channel but simultaneously. Moreover, you can't use scipy.signal.lsim for MIMO systems yet. You can use other options such as python-control (if you have slycot extension otherwise again no MIMO) or harold if you have Python 3.6 or greater (disclaimer: I'm the author).
import numpy as np
from harold import *
import matplotlib.pyplot
nbr_inputs = 5
t_in = np.arange(0,10,0.2)
dim = (nbr_inputs, len(t_in))
x = np.cumsum(np.random.normal(0,2e-3, dim), axis=1)
# Forming a 1x5 system, common denominator will be completed automatically
H = Transfer([[[1, 3, 3]]*nbr_inputs], [1, 2, 1])
The keyword per_channel=True applies first input to first channel, second input to second and so on. Otherwise combined response is returned. You can check the shapes by playing around with it to see what I mean.
# Notice it is x.T below -> input shape = <num samples>, <num inputs>
y, t = simulate_linear_system(H, x.T, t_in, per_channel=True)
plt.plot(t, y)
This gives

Minimal p-value for scipy.stats.pearsonr

I am running scipy.stats.pearsonr on my data, and I get
(0.9672434106763087, 0.0)
It is reasonable that the r-value is high and the p-value is very low.
However, p is obviously not 0, so I would like to know what p=0.0 means. Is it p<10^-10, p<10^-100 or what is the limit?
As pointed out by #MB-F in the comments it is calculated analytically.
In the code for the version 0.19.1, you could isolate that part of the code and plot the p-value in terms of r
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc
r = np.linspace(-1, 1, 1000)*(1-1e-10);
for n in [10, 100, 1000]:
df = n - 2
t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
prob = betainc(0.5*df, 0.5, df/(df+t_squared))
plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
The current stable version 1.9.3 uses a different formula
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr
r = np.linspace(-1, 1, 1000)*(1-1e-10);
for n in [10, 100, 1000]:
ab = 0.5*n
prob = btdtr(ab, ab, 0.5*(1-abs(r)))
plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
But yield the same results.
You can see that if you have 1000 points and your correlation, the p value will be less than the minimum floating value.
The beta distribution
Scipy provides a collection of probability distributions, among them, the beta distribution.
The line
prob = btdtr(ab, ab, 0.5*(1-abs(r)))
could be replaced by
from scipy.stats import beta
prob = beta(ab, ab).cdf(0.5*(1-abs(r)))
There you can get much more information about it.