This is a follow-up to my previous question here. I'm trying to fit my data from this csv file with scipy.stats.skewnorm, but I can't get it working right:
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import skewnorm
df = pd.read_csv('astro_data.csv')
x = df['delta z']
number_bins = 50
fig, ax = plt.subplots()
h, edges, _ = ax.hist(x, alpha = 0.5,
density = False,
bins = number_bins)
a_est, loc_est, scale_est = skewnorm.fit(x)
ax.plot(x, skewnorm.pdf(x, a_est, loc_est, scale_est), 'r-', lw=5, alpha=0.6, label='skewnorm pdf')
Can anyone see how I can fix this?
EDIT: when I change to density=True, the result is this:
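One possible reading of the mismatch in the question above: with density=False the histogram shows counts while skewnorm.pdf integrates to 1, and plotting against the unsorted column x makes the line jump back and forth. Below is a minimal sketch, assuming synthetic data in place of astro_data.csv (the sample, the seed, and the names grid, bin_width and pdf_as_counts are made up for illustration).

```python
import numpy as np
from scipy.stats import skewnorm

# Synthetic stand-in for df['delta z']; the real CSV is not available here.
rng = np.random.default_rng(0)
x = skewnorm.rvs(4.0, loc=0.0, scale=1.0, size=2000, random_state=rng)

number_bins = 50
counts, edges = np.histogram(x, bins=number_bins)
bin_width = edges[1] - edges[0]

a_est, loc_est, scale_est = skewnorm.fit(x)

# Evaluate the fitted pdf on a sorted, evenly spaced grid rather than on x itself.
grid = np.linspace(x.min(), x.max(), 400)
pdf = skewnorm.pdf(grid, a_est, loc_est, scale_est)

# With density=False the bars are counts, so scale the pdf by N * bin width.
pdf_as_counts = pdf * x.size * bin_width

# ax.hist(x, bins=number_bins) and ax.plot(grid, pdf_as_counts) would then overlay.
```

With density=True the unscaled pdf can be plotted against grid directly.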
I have converted my categorical data into columns using dummy variables, then performed the train/test split, trained the model, and tested it on the test data. Since the test data is already in the format the model understands, it predicts without any issues. But when I want to make a prediction for completely new data, creating dummy variables for the new data does not work well. How is this generally done?
Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv('salary_prediction_usa_finance_job_v2.csv')
df_columns = df.columns
degree = pd.get_dummies(df.degree, prefix='degree', drop_first=True)
masters = pd.get_dummies(df.masters, prefix='masters').iloc[:, 1:]
prof_member = pd.get_dummies(df.professional_membership, prefix='professional_membership', drop_first=True)
df = pd.concat([df, degree,masters,prof_member], axis=1)
df = df.drop(['degree','masters','professional_membership'], axis=1)
X = df.drop('salary_per_year', axis=1)
y = df['salary_per_year']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
new_data = {'degree':['yes'], 'masters':['no'], 'professional_membership':['no'], 'years_experience':[10],'audit_experience':[4], 'IT_skill_rate':[6], 'Size_of_the_company_worked':[3]}
single_df = pd.DataFrame(data=new_data)
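One common way to handle this is to encode the new row and then align its columns to the columns the model was trained on. The sketch below assumes a toy training frame in place of the salary CSV; the values and the names train, cat_cols, train_columns, new_row and new_X are made up for illustration.

```python
import pandas as pd

# Toy training frame standing in for the salary CSV (values are made up).
train = pd.DataFrame({
    'degree': ['yes', 'no', 'yes', 'no'],
    'masters': ['no', 'no', 'yes', 'yes'],
    'years_experience': [10, 3, 7, 5],
})
cat_cols = ['degree', 'masters']

X_train = pd.get_dummies(train, columns=cat_cols, drop_first=True)
train_columns = X_train.columns  # remember the exact layout the model saw

new_row = pd.DataFrame({'degree': ['yes'], 'masters': ['no'], 'years_experience': [10]})

# Encode WITHOUT drop_first (a single row has only one level per column),
# then align to the training columns; levels absent from new_row become 0.
new_X = pd.get_dummies(new_row, columns=cat_cols).reindex(columns=train_columns, fill_value=0)
```

Note that reindex silently drops categories that never appeared in training. If that matters, sklearn.preprocessing.OneHotEncoder fitted on the training data is an alternative that stores the categories for you.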
This h5 file contains the values of an analytical function on a regular 3D grid. I got very poor interpolation results using RegularGridInterpolator here. Now I want to test the scipy.interpolate.Rbf interpolator on my data set. Can anyone help me do that? I had a look at the documentation of this interpolator but didn't understand it properly.
I created the h5 file like this:
import numpy as np
from numpy import gradient
import h5py
from scipy.interpolate import Rbf
def f(x, y, z):
    return -1 / np.sqrt(x**2 + y**2 + z**2)
#grid
x = np.linspace(0, 100, 32) # since the boxsize is 320 Mpc/h
y = np.linspace(0, 100, 32)
z = np.linspace(0, 100, 32)
mesh_data = f(*np.meshgrid(x, y, z, indexing='ij', sparse=True))
#create h5 file
h5file = h5py.File('analytic.h5', 'w')
h5file.create_dataset('/x', data=x)
h5file.create_dataset('/y', data=y)
h5file.create_dataset('/z', data=z)
h5file.create_dataset('/mesh_data', data=mesh_data)
h5file.close()
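A minimal sketch of how Rbf could be applied to this kind of data, under two assumptions that differ from the code above: the grid is coarsened to 8 points per axis, and it starts at 1 instead of 0 so that the singularity of f at the origin is not sampled. Rbf solves a dense N x N system, so the full 32**3 grid would need a matrix of roughly 8 GB in float64.

```python
import numpy as np
from scipy.interpolate import Rbf

def f(x, y, z):
    return -1 / np.sqrt(x**2 + y**2 + z**2)

# Coarse grid that avoids the singularity at the origin (both choices are assumptions).
x = np.linspace(1, 100, 8)
y = np.linspace(1, 100, 8)
z = np.linspace(1, 100, 8)
X, Y, Z = np.meshgrid(x, y, z, indexing='ij')
values = f(X, Y, Z)

# Rbf takes scattered points, so the grid is flattened to 1-D coordinate arrays.
rbf = Rbf(X.ravel(), Y.ravel(), Z.ravel(), values.ravel(), function='multiquadric')

# Evaluate at an off-grid point and compare with the analytic value.
p = (12.3, 45.6, 78.9)
approx = float(rbf(*p))
exact = float(f(*p))
```

For larger point sets, scipy.interpolate.RBFInterpolator with its neighbors argument may be a better fit, since it avoids the full dense system.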
I have an h5 file containing regular-grid data. I have code that gives me the interpolated value at any given (x, y, z) point, using RegularGridInterpolator. Now I want to make a plot to check whether the interpolation is correct, but I don't understand how to do that. Can anyone help me, please? Here is my code:
import numpy as np
import h5py
from scipy.interpolate import RegularGridInterpolator
f = h5py.File('file.h5', 'r')
list(f.keys())
dset = f[u'data']
dset.shape
dset[()].shape  # dset.value was removed in h5py 3.0; dset[()] reads the full array
dset[0:63,0:63,0:63]
x = np.linspace(-10, 320, 64)
y = np.linspace(-10, 320, 64)
z = np.linspace(-10, 320, 64)
my_interpolating_function = RegularGridInterpolator((x, y, z), dset[()])
pts = np.array([7.36970468e-09, -4.54271563e-09, 1.51802701e-09])
my_interpolating_function(pts)
The output of the interpolation is array([5.45534467e-10])
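One way to check an interpolator is to sample it along a line that passes between grid nodes and compare against reference values. The sketch below assumes a smooth test function (the hypothetical known) in place of the contents of file.h5, since the real data is not available. With real data, the reference could instead come from grid nodes held out of the interpolator.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def known(x, y, z):
    # Smooth test function standing in for the data in file.h5 (an assumption).
    return np.sin(x / 50.0) + np.cos(y / 80.0) * z / 320.0

x = np.linspace(-10, 320, 64)
y = np.linspace(-10, 320, 64)
z = np.linspace(-10, 320, 64)
data = known(*np.meshgrid(x, y, z, indexing='ij'))
interp = RegularGridInterpolator((x, y, z), data)

# Sample along a straight line through the box, between grid nodes.
t = np.linspace(0.0, 300.0, 200)
line = np.column_stack([t, t, t])
interpolated = interp(line)
reference = known(t, t, t)
max_abs_error = np.max(np.abs(interpolated - reference))

# plt.plot(t, reference) and plt.plot(t, interpolated, '--') would show both curves.
```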
I am running scipy.stats.pearsonr on my data, and I get
(0.9672434106763087, 0.0)
It is reasonable that the r-value is high and the p-value is very low.
However, p is obviously not 0, so I would like to know what p=0.0 means. Is it p<10^-10, p<10^-100 or what is the limit?
As pointed out by @MB-F in the comments, the p-value is calculated analytically.
In the code for version 0.19.1, you can isolate that part and plot the p-value as a function of r:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc
r = np.linspace(-1, 1, 1000)*(1-1e-10)
for n in [10, 100, 1000]:
    df = n - 2
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    prob = betainc(0.5*df, 0.5, df/(df+t_squared))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
The current stable version 1.9.3 uses a different formula
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr
r = np.linspace(-1, 1, 1000)*(1-1e-10)
for n in [10, 100, 1000]:
    ab = 0.5*n - 1
    # two-sided p-value: twice the lower tail of a symmetric beta distribution
    prob = 2*btdtr(ab, ab, 0.5*(1-abs(r)))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.legend()
plt.grid()
Both yield the same results.
You can see that with 1000 points and your correlation, the p-value is smaller than the smallest positive floating-point number, so it is reported as 0.0.
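To put rough numbers on that limit, here is a sketch assuming n = 1000 (the question does not state the sample size). It evaluates the p-value in double precision, compares it with the floating-point limits, and estimates its order of magnitude in log space from the leading term of the incomplete beta function. The estimate is only a rough order of magnitude, not an exact p-value.

```python
import numpy as np
from scipy.special import betainc, betaln

r = 0.9672434106763087
n = 1000          # assumed sample size; the question does not state it
df = n - 2

x = 1.0 - r**2    # equals df / (df + t**2) with t**2 = r**2 * df / (1 - r**2)
p_float = betainc(0.5 * df, 0.5, x)   # underflows in double precision

smallest_normal = np.finfo(float).tiny        # about 2.2e-308
smallest_subnormal = np.nextafter(0.0, 1.0)   # about 4.9e-324

# Leading term of the incomplete beta function, evaluated in log space:
# I_x(a, b) ~ x**a * (1 - x)**b / (a * B(a, b)) for small x.
a, b = 0.5 * df, 0.5
log_p = a * np.log(x) + b * np.log1p(-x) - np.log(a) - betaln(a, b)
log10_p = log_p / np.log(10)   # rough order of magnitude only
```

So p = 0.0 here means "below roughly 1e-308 to 1e-324", and under the n = 1000 assumption the log-space estimate puts the value hundreds of orders of magnitude below that.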
The beta distribution
Scipy provides a collection of probability distributions, among them, the beta distribution.
The call
btdtr(ab, ab, 0.5*(1-abs(r)))
could be replaced by
from scipy.stats import beta
beta(ab, ab).cdf(0.5*(1-abs(r)))
There you can get much more information about it.
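As a numerical cross-check of the two formulations, the sketch below compares the t-based expression with the beta-CDF expression on a small range of r values. For the two to agree, it takes the beta parameter as n/2 - 1 and doubles the CDF to get a two-sided p-value. It uses scipy.special.betainc rather than btdtr, which computes the same regularized incomplete beta function and, as far as I know, is the safer choice on newer SciPy releases.

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import beta

r = np.linspace(-0.9, 0.9, 19)
n = 10
df = n - 2

# t-based form: regularized incomplete beta at 1 - r**2.
p_t = betainc(0.5 * df, 0.5, 1.0 - r**2)

# Beta-CDF form; note the parameter n/2 - 1 and the factor 2 for a two-sided p-value.
ab = 0.5 * n - 1
p_beta = 2 * beta(ab, ab).cdf(0.5 * (1 - np.abs(r)))

# The same CDF without going through scipy.stats.
p_beta_special = 2 * betainc(ab, ab, 0.5 * (1 - np.abs(r)))
```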
The answer to this question appears relevant to my problem, however, it applies for ax.bar() instead of ax.vlines.
Matplotlib DateFormatter for axis label not working
The code below uses ax1.vlines(x, l, h, colors='k') and ax2.vlines(x, 0, v, colors='k') to plot vertical price and volume bars in a stock chart. The horizontal axis is defined by a numpy array x = 0, 1, 2, 3, ... etc. I have datetime objects in array d, but if I change the calls to ax1.vlines(d, l, h, colors='k') and ax2.vlines(d, 0, v, colors='k'), an error is thrown. So d is defined but not used in the code below (the referenced lines work with x but not with d).
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(PATH+ticker+EXT, usecols=[0,2,3,4,5], header=None,
engine='python',skiprows=skr,skipfooter=skf)
d = pd.to_datetime(df[0]) # numpy array date
h = df[2].values # numpy array high
l = df[3].values # numpy array low
c = df[4].values # numpy array close
v = df[5].values # numpy array volume
x = np.arange(len(d))
# Draw Chart to White Background
ax1_y_label = ticker
fig1 = plt.figure()
fig1.set_size_inches(WIDE,TALL)
fig1.set_dpi(DTPI)
fig1.autofmt_xdate()
ax1 = plt.subplot2grid((5,4), (0,0), rowspan=4, colspan=4)
ax1.set_ylabel(ax1_y_label)
ax1.grid(True)
ax1.vlines(x, l, h, colors='k')
ax1.hlines(c, x, x+0.3, color='k')
ax2 = plt.subplot2grid((5,4), (4,0), sharex=ax1, rowspan=1, colspan=4)
ax2.set_ylabel(ax2_y_label)
ax2.grid(True)
ax2.vlines(x, 0, v, colors='k')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax2.spines['right'].set_visible(False)
plt.setp(ax1.get_xticklabels(), visible=False)
plt.setp(ax1.get_yticklabels(), visible=False)
plt.setp(ax2.get_yticklabels(), visible=False)
plt.subplots_adjust(hspace=.01)
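One approach that may help with the question above is to convert the datetimes to Matplotlib's float date numbers before calling vlines, then tell the axis to format those floats as dates. The sketch below uses made-up prices in place of the CSV columns. I am not certain which Matplotlib version the question was run on, and newer versions may accept datetimes in vlines directly.

```python
import datetime
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Made-up prices standing in for the CSV columns.
d = [datetime.datetime(2020, 1, 1) + datetime.timedelta(days=i) for i in range(5)]
l = np.array([9.0, 9.5, 9.2, 9.8, 10.1])
h = l + 1.0
c = l + 0.6

x = mdates.date2num(d)  # floats counting days, which vlines accepts

fig, ax = plt.subplots()
bars = ax.vlines(x, l, h, colors='k')
ax.hlines(c, x, x + 0.3, colors='k')  # 0.3 is now 0.3 days

ax.xaxis_date()  # interpret the float axis as dates
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
fig.autofmt_xdate()
```

Because the x values are now in days, the close tick width of 0.3 corresponds to 0.3 days and may need adjusting for intraday data.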