How to fit a linear regression model with and without an intercept in Python - linear-regression

I need to fit two linear regression models, Model 1: y = β1·x1 + ε (no intercept) and Model 2: y = β0 + β1·x1 + ε, to the data x1 = [0, 1, 2, 3, 4], y = [1, 2, 3, 2, 1]. My objective is to find the coefficients, the squared error loss, the absolute error loss, and the L1.5 loss for both models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.formula.api as smf
x1 = [0, 1, 2, 3, 4]
y = [1, 2, 3, 2, 1]
Would you please show me a way to get these?

This first method doesn't use the formula API; because no constant is added, it fits the model without an intercept.
import statsmodels.api as sm
import numpy as np
x1 = np.array([0,1,2,3,4])
y = np.array([1,2,3,2,1])
x1 = x1[:, None] # Transform into a (5, 1) array
res = sm.OLS(y,x1).fit()
print(res.summary())
If you want to use the formula interface, you need to build a DataFrame; the regression is then "y ~ x1" (the formula API includes a constant by default; to fit without one, add "- 1" or "+ 0" to the right-hand side of the formula).
import statsmodels.formula.api as smf
import pandas as pd
x1 = [0,1,2,3,4]
y = [1,2,3,2,1]
data = pd.DataFrame({"y":y,"x1":x1})
res = smf.ols("y ~ x1", data).fit()
print(res.summary())
Either approach prints a summary in this format; the one below comes from the formula version, which includes an intercept:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.000
Model: OLS Adj. R-squared: -0.333
Method: Least Squares F-statistic: 4.758e-16
Date: Wed, 17 Mar 2021 Prob (F-statistic): 1.00
Time: 22:11:40 Log-Likelihood: -5.6451
No. Observations: 5 AIC: 15.29
Df Residuals: 3 BIC: 14.51
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.8000 0.748 2.405 0.095 -0.582 4.182
x1 0 0.306 0 1.000 -0.972 0.972
==============================================================================
Omnibus: nan Durbin-Watson: 1.429
Prob(Omnibus): nan Jarque-Bera (JB): 0.375
Skew: 0.344 Prob(JB): 0.829
Kurtosis: 1.847 Cond. No. 4.74
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To include an intercept in the non-formula API, you can simply use
res_constant = sm.OLS(y, sm.add_constant(x1)).fit()
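Conversely, to drop the intercept in the formula API, a one-line sketch (reusing the DataFrame data built above) is:
# "- 1" removes the constant from the formula, giving the no-intercept model
res_no_const = smf.ols("y ~ x1 - 1", data).fit()
print(res_no_const.params)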

You can use sklearn's LinearRegression.
For the model without an intercept (i.e., forcing the fit through the origin), simply set the parameter fit_intercept=False.
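For example, here is a minimal sketch (not part of the original answers) that fits both models with scikit-learn and evaluates the three losses the question asks about; it interprets the L1.5 loss as the sum of |residual|^1.5, and the losses reported are those of the least-squares fits, not of fits that minimize each loss:
import numpy as np
from sklearn.linear_model import LinearRegression

x1 = np.array([0, 1, 2, 3, 4]).reshape(-1, 1)
y = np.array([1, 2, 3, 2, 1])

# Model 1: no intercept (fit through the origin); Model 2: with intercept
m1 = LinearRegression(fit_intercept=False).fit(x1, y)
m2 = LinearRegression(fit_intercept=True).fit(x1, y)

for name, m in [("Model 1", m1), ("Model 2", m2)]:
    resid = y - m.predict(x1)
    print(name, "coef:", m.coef_, "intercept:", m.intercept_)
    print("  squared error loss :", np.sum(resid ** 2))
    print("  absolute error loss:", np.sum(np.abs(resid)))
    print("  L1.5 loss          :", np.sum(np.abs(resid) ** 1.5))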

Related

Solving for a nonlinear Hamiltonian using SciPy's fsolve

I am solving for a nonlinear Hamiltonian of a dimer, which consists of 2 complex wavefunctions, using SciPy's iterative root solver. The complex initial guess therefore has to be encoded into real and imaginary parts, which turns the initial 2x1 matrix equation into a 4x1 matrix equation.
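To illustrate the encoding I mean (the helper names pack and unpack below are just for illustration), a complex vector can be mapped to a real one and back like this:
import numpy as np

def pack(psi):
    # complex (2,) -> real (4,): [Re psi0, Re psi1, Im psi0, Im psi1]
    return np.concatenate([psi.real, psi.imag])

def unpack(x):
    # real (4,) -> complex (2,)
    return x[:2] + 1j * x[2:]

psi = np.array([1 + 2j, 3 + 4j])
assert np.allclose(unpack(pack(psi)), psi)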
What I've tried :
from scipy.optimize import fsolve
import math
import numpy as np
import cmath
from math import pi
k = np.linspace(-pi, pi, 100)
v = 1
w = 1
H = np.zeros((2, 2), dtype=complex)
H[0, 1] = v+w*cmath.exp(-1j*k)
H[1, 0] = v+w*cmath.exp(1j*k)
E=np.zeros((2,2), )
E[0,1]=-(v**2+w**2+2*v*w*math.cos(k))**0.5
E[1,0]=(v**2+w**2+2*v*w*math.cos(k))**0.5
def f(x):
    psi = np.zeros((2,1),float)
    psi[0]=np.isreal(x[0])+1j*np.iscomplex(x[1])
    psi[1]=np.isreal(x[2])+1j*np.iscomplex(x[3])
    f=(H-E)*psi
    return f
m = fsolve(f,np.array([[1],[2],[3],[4]]))
print (m.x)
The error message that I'm getting is as follows
TypeError: only length-1 arrays can be converted to Python scalars #for line 12
I need help to rectify this mistake.

Python sklearn- gaussian.mixture how to get the samples/points in each clusters

I am using a GMM to cluster my dataset into K groups. The model runs well, but I can't find a way to get the raw data points belonging to each cluster. Can you suggest some ideas to solve this problem? Thank you so much.
You can do it like this (look at d0, d1, & d2).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn import datasets
from sklearn.mixture import GaussianMixture
# load the iris dataset
iris = datasets.load_iris()
# select first two columns
X = iris.data[:, 0:2]
# turn it into a dataframe
d = pd.DataFrame(X)
# plot the data
plt.scatter(d[0], d[1])
gmm = GaussianMixture(n_components = 3)
# Fit the GMM model for the dataset
# which expresses the dataset as a
# mixture of 3 Gaussian Distribution
gmm.fit(d)
# Assign a label to each sample
labels = gmm.predict(d)
d['labels']= labels
d0 = d[d['labels']== 0]
d1 = d[d['labels']== 1]
d2 = d[d['labels']== 2]
# here is a possible solution for you:
d0
d1
d2
# plot three clusters in same plot
plt.scatter(d0[0], d0[1], c ='r')
plt.scatter(d1[0], d1[1], c ='yellow')
plt.scatter(d2[0], d2[1], c ='g')
# print the converged log-likelihood value
print(gmm.lower_bound_)
# print the number of iterations needed
# for the log-likelihood value to converge
print(gmm.n_iter_)
# it needed 8 iterations for the log-likelihood to converge.
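If you want the raw rows of the original array rather than the DataFrame slices, a short sketch continuing from the snippet above (so X and labels are assumed to already exist) is to index with boolean masks:
# boolean masks pick out the raw samples assigned to each component
cluster0 = X[labels == 0]
cluster1 = X[labels == 1]
cluster2 = X[labels == 2]
print(cluster0.shape, cluster1.shape, cluster2.shape)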

gaussian process regression in multiple dimensions with GPflow

I would like to perform some multivariate regression using Gaussian process regression as implemented in GPflow version 2,
installed with pip install gpflow==2.0.0rc1.
Below is some example code that generates some 2D data, attempts to fit it with GPR, and finally computes the difference
between the true input data and the GPR prediction.
Eventually I would like to extend this to higher dimensions, test against a validation set to check for over-fitting, and experiment with other kernels and "Automatic Relevance Determination", but understanding how to get this to work is the first step.
Thanks!
The following code snippet will work in a Jupyter notebook.
import gpflow
import numpy as np
import matplotlib
from gpflow.utilities import print_summary
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 6)
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def gen_data(X, Y):
    """
    make some fake data.
    X, Y are np.ndarrays with shape (N,) where
    N is the number of samples.
    """
    ys = []
    for x0, x1 in zip(X, Y):
        y = x0 * np.sin(x0*10)
        y = x1 * np.sin(x0*10)
        y += 1
        ys.append(y)
    return np.array(ys)
# generate some fake data
x = np.linspace(0, 1, 20)
X, Y = np.meshgrid(x, x)
X = X.ravel()
Y = Y.ravel()
z = gen_data(X, Y)
#note X.shape, Y.shape and z.shape
#are all (400,) for this case.
# if you would like to plot the data you can do the following
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X, Y, z, s=100, c='k')
# had to set this
# to avoid the following error
# tensorflow.python.framework.errors_impl.InvalidArgumentError: Cholesky decomposition was not successful. The input might not be valid. [Op:Cholesky]
gpflow.config.set_default_positive_minimum(1e-7)
# setup the kernel
k = gpflow.kernels.Matern52()
# set up GPR model
# I think the shape of the independent data
# should be (400, 2) for this case
XY = np.column_stack([[X, Y]]).T
print(XY.shape) # this will be (400, 2)
m = gpflow.models.GPR(data=(XY, z), kernel=k, mean_function=None)
# optimise hyper-parameters
opt = gpflow.optimizers.Scipy()
def objective_closure():
    return -m.log_marginal_likelihood()

opt_logs = opt.minimize(objective_closure,
                        m.trainable_variables,
                        options=dict(maxiter=100))
# predict training set
mean, var = m.predict_f(XY)
print(mean.numpy().shape)
# (400, 400)
# I would expect this to be (400,)
# If it was then I could compute the difference
# between the true data and the GPR prediction
# `diff = mean - z`
# but because the shape is not as expected this of course
# won't work.
The shape of z must be (N, 1), whereas in your case it is (N,). However, this is a missing check in GPflow and not your fault.
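A minimal fix, continuing from the question's snippet (so z, XY, and k are assumed to exist), is to reshape the targets to a column vector before constructing the model:
# reshape the targets from (400,) to (400, 1)
Z = z.reshape(-1, 1)
m = gpflow.models.GPR(data=(XY, Z), kernel=k, mean_function=None)
mean, var = m.predict_f(XY)
print(mean.numpy().shape)  # now (400, 1), so diff = mean.numpy().ravel() - z works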

Minimal p-value for scipy.stats.pearsonr

I am running scipy.stats.pearsonr on my data, and I get
(0.9672434106763087, 0.0)
It is reasonable that the r-value is high and the p-value is very low.
However, p is obviously not 0, so I would like to know what p=0.0 means. Is it p<10^-10, p<10^-100 or what is the limit?
As pointed out by @MB-F in the comments, it is calculated analytically.
In the code for version 0.19.1, you can isolate that part of the code and plot the p-value as a function of r:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc
r = np.linspace(-1, 1, 1000)*(1 - 1e-10)
for n in [10, 100, 1000]:
    df = n - 2
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    prob = betainc(0.5*df, 0.5, df/(df + t_squared))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
The current stable version, 1.9.3, uses a different formula:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr
r = np.linspace(-1, 1, 1000)*(1 - 1e-10)
for n in [10, 100, 1000]:
    ab = 0.5*n
    prob = btdtr(ab, ab, 0.5*(1 - abs(r)))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
but it yields the same results.
You can see that with 1000 points and your correlation coefficient, the p-value is smaller than the smallest representable positive float, so it is reported as 0.0.
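For instance, a quick check (a sketch using the beta-based formula above with n = 1000 and the reported r) shows the analytic p-value underflowing to exactly 0.0:
import numpy as np
from scipy.stats import beta

n, r = 1000, 0.9672434106763087
ab = 0.5*n
p = beta(ab, ab).cdf(0.5*(1 - abs(r)))
print(p)                     # 0.0 -- the true value underflows
print(np.finfo(float).tiny)  # ~2.2e-308, smallest normal positive double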
The beta distribution
SciPy provides a collection of probability distributions, among them the beta distribution.
The line
prob = btdtr(ab, ab, 0.5*(1-abs(r)))
could be replaced by
from scipy.stats import beta
prob = beta(ab, ab).cdf(0.5*(1-abs(r)))
There you can get much more information about it.

Chi squared test

I have written code in MATLAB for a chi-squared test. I wish to obtain p-values such as 0.897 or 0.287 and so on, but my results are far too small. Below is my code:
pd = fitdist(sample, 'weibull');
[h,p,st] = chi2gof(sample,'CDF',pd)
I've also tried using the AD test, with similar results:
dist = makedist('Weibull', 'a',A, 'b',B);
[h,p,ad,cv] = adtest(sample, 'Distribution',dist)
Below is a histogram of the data with a fitted Weibull density function (Weibull parameters are A=4.0420 and B=2.0853)
When the p-value is less than a predetermined significance level (the default is 5%, or 0.05), the null hypothesis is rejected (which in your case means that the sample did not come from a Weibull distribution).
The chi2gof function's first output variable h denotes the test result, where h=1 means that the test rejects the null hypothesis at the specified significance level.
Example:
sample = rand(1000,1); % sample from Uniform distribution
pd = fitdist(sample, 'weibull');
[h,p,st] = chi2gof(sample, 'CDF',pd, 'Alpha',0.05)
The test clearly rejects H0 and concludes that the data did not come from a Weibull distribution:
h =
1 % 1: H1 (alternate hypo), 0: H0 (null hypo)
p =
2.8597e-27 % note that p << 0.05
st =
chi2stat: 141.1922
df: 7
edges: [0.0041 0.1035 0.2029 0.3023 0.4017 0.5011 0.6005 0.6999 0.7993 0.8987 0.9981]
O: [95 92 92 97 107 110 102 95 116 94]
E: [53.4103 105.6778 130.7911 136.7777 129.1428 113.1017 93.1844 72.8444 54.3360 110.7338]
Next let's try that again with a conforming sample:
>> sample = wblrnd(0.5, 2, [1000,1]); % sample from a Weibull distribution
>> pd = fitdist(sample, 'weibull')
pd =
WeibullDistribution
Weibull distribution
A = 0.496413 [0.481027, 0.512292]
B = 2.07314 [1.97524, 2.17589]
>> [h,p] = chi2gof(sample, 'CDF',pd, 'Alpha',0.05)
h =
0
p =
0.7340
The test now clearly passes with a high p-value.
EDIT:
Looking at the histogram you've shown, it does look like the data follows a Weibull distribution, although there might be outliers (look at the right side of the histogram), which might explain why you are getting bad p-values. Consider preprocessing your data to handle extreme outliers.
Here is an example where I simulate outlier values:
% 5000 samples from a Weibull distribution
pd = makedist('Weibull', 'a',4.0420, 'b',2.0853);
sample = random(pd, [5000 1]);
%sample = wblrnd(4.0420, 2.0853, [5000 1]);
% add 20 outlier instances
sample(1:20) = [rand(10,1)+15; rand(10,1)+25];
% hypothesis tests using original distribution
[h,p,st] = chi2gof(sample, 'CDF',pd, 'Alpha',0.05)
[h,p,ad,cv] = adtest(sample, 'Distribution',pd)
% hypothesis tests using empirical distribution
[h,p,st] = chi2gof(sample, 'CDF',fitdist(sample,'Weibull'))
[h,p,ad,cv] = adtest(sample, 'Distribution', 'Weibull')
% show histogram
histfit(sample, 20, 'Weibull')
% chi-squared test
h =
1
p =
0.0382
st =
chi2stat: 8.4162
df: 3
edges: [0.1010 2.6835 5.2659 7.8483 25.9252]
O: [1741 2376 764 119]
E: [1.7332e+03 2.3857e+03 788.6020 92.5274]
% AD test
h =
1
p =
1.2000e-07
ad =
Inf
cv =
2.4924
The outliers are causing the distribution tests to fail (the null hypothesis is rejected). Still, I couldn't reproduce getting a NaN p-value (you might want to check this related question on Stats.SE about getting NaN p-values).