scipy.stats skewness does not provide correct skewness results - scipy

I noticed that the skewness returned by scipy.stats is not correct. pandas.skew() actually provides better results.
I am currently trying to replicate a classic paper, Expected Stock Returns and Volatility by French & Schwert. I use S&P 500 data from 1928 to 1984. I follow the paper's formula for the standard deviation of the return, and I am able to get the same results for the mean and the std dev of the std dev.
However, when I use the scipy.stats.skew function on the std dev of the S&P return, I cannot get a number at all: the function returns "nan", where it should clearly return a value.
I switched to pandas.skew(), and it returned the correct value, as in the paper.
Clearly, something is wrong with the scipy.stats.skew() function.
Results by scipy.stats.skew():
['Adj Close_gspc', 'Adj Close_gspc_lag', 'SP_Return', 'SP_Return_square',
'SP_Return_lag', 'SP_varianceMon', 'SP_varianceMon_sqrRoot']
array([ 0.6922229 , 0.69186265, -0.11292165, 4.23571807, -1.9556035 ,
5.39873607, nan])
Results by pandas.skew():
Adj Close_gspc 0.693745
Adj Close_gspc_lag 0.693384
SP_Return -0.113170
SP_Return_square 4.245033
SP_Return_lag -1.959904
SP_varianceMon 5.410609
SP_varianceMon_sqrRoot 2.800919
dtype: float64

You haven't provided enough information or sample code to reproduce the nan that you get. One likely cause, sketched after the example below: scipy.stats.skew propagates NaN values by default (nan_policy='propagate'), while the pandas skew() method skips them, so a single missing value in the column makes scipy return nan.
To make scipy.stats.skew compute the same value as the skew() method in Pandas, add the argument bias=False; pandas applies the bias correction by default, while scipy does not.
Here's an example.
First, the imports:
In [21]: import numpy as np
In [22]: import pandas as pd
In [23]: from scipy.stats import skew
Generate some data:
In [24]: np.random.seed(8675309)
In [25]: x = np.random.weibull(0.2, size=15)
Compute the skew with scipy and with Pandas:
In [26]: skew(x, bias=False)
Out[26]: 3.7582525674514544
In [27]: pd.Series(x).skew()
Out[27]: 3.7582525674514544
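And a minimal sketch of the NaN behaviour mentioned above (the data here is invented; whether it matches your column is an assumption):
In [28]: y = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
In [29]: skew(y, bias=False)                     # scipy propagates the NaN
Out[29]: nan
In [30]: skew(y, bias=False, nan_policy='omit')  # drop NaNs, like pandas
Out[30]: 0.0
In [31]: pd.Series(y).skew()                     # pandas skips NaN by default
Out[31]: 0.0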

Related

Similar function to scipy.stats.zscore but based on another "sample"

I have 2 datasets which describe the same process, and I expect the same general range of values. What I would like to do is use scipy.stats.zscore on one dataset, but instead of using that sample's mean and standard deviation, I would like to use the mean and standard deviation from the other dataset. Is there an equivalent function?
It sounds like you want scipy.stats.zmap.
In [141]: import numpy as np
In [142]: from scipy.stats import zmap
In [143]: olddata = np.array([3.67, 4.01, 3.60, 5.36, 3.65, 2.01, 2.75, 4.43, 2.74, 3.89, 3.60])
In [144]: newdata = np.array([1.0, 2.4, 2.5, 3.25, 5.6])
In [145]: zmap(newdata, olddata)
Out[145]: array([-3.05378533, -1.41573956, -1.29873629, -0.42121177, 2.32836506])
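Under the hood, zmap just standardizes the first array with the mean and standard deviation of the second (population std, ddof=0, by default). A quick sanity-check sketch:
In [146]: manual = (newdata - olddata.mean()) / olddata.std(ddof=0)
In [147]: np.allclose(zmap(newdata, olddata), manual)
Out[147]: True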

How can I get the numbers for the correlation matrix from Pandas Profiling

I really like the heatmap, but what I need are the numbers behind the heatmap (AKA correlation matrix).
Is there an easy way to extract the numbers?
It was a bit hard to track down, but it starts from the documentation, specifically the report structure. Digging into the function get_correlation_items(summary) and looking at its usage in the source leads to a call that loops over each of the correlation types in the summary. To obtain that summary object, looking up the caller shows it is get_report_structure(summary), and tracing where the summary argument comes from shows it is simply the report's description_set property.
Given the above, we can now do the following using version 2.9.0:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(
np.random.rand(100, 5),
columns=["a", "b", "c", "d", "e"]
)
profile = ProfileReport(df, title="StackOverflow", explorative=True)
correlations = profile.description_set["correlations"]
print(correlations.keys())
dict_keys(['pearson', 'spearman', 'kendall', 'phi_k'])
To see a specific correlation do:
correlations["phi_k"]["e"]
a 0.000000
b 0.112446
c 0.289983
d 0.000000
e 1.000000
Name: e, dtype: float64
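Each entry is an ordinary pandas DataFrame (the Series output above is one of its columns), so, assuming you just want the raw numbers, you can pull them out or persist them directly; the filename below is only an example:
phi_k_matrix = correlations["phi_k"]            # a pandas DataFrame
print(phi_k_matrix.values)                      # the bare numbers as a NumPy array
phi_k_matrix.to_csv("phi_k_correlations.csv")   # example filename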

Please help debug my call scipy library for Kolmogorov-Smirnov Test

I am completing an assignment but cannot get the right results from a Kolmogorov-Smirnov test for a small sample of observations against a 'norm' distribution.
I have set up a minimal sample in a Jupyter notebook with the expected kstest results, tried running it in several environments, and reviewed the call for hours. The answer key says my ks_value and p_value are wildly wrong, but I cannot see my error.
- The sample I have is from the test run in the answer key. It is a 1d array, a valid input option.
- The sample mean and standard deviation I compute look right.
- If I change ddof it makes a small difference (the hint is to use ddof=0).
- 'norm' is a valid distribution for kstest.
The library documentation is at
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html#scipy-stats-kstest
Any ideas or comments?
Would you expect a sample = [0.37, 0.27, 0.69, 0.56, 0.26] compared to a normal distribution to have a 'KS test statistic' of 0.64 or 0.24, and a 'p-value' of 0.02 or 0.94?
TIA
import pandas as pd
import numpy as np
from scipy.stats import kstest
sample = [0.37, 0.27, 0.69, 0.56, 0.26]
normal_args = (np.mean(sample), np.std(sample, ddof=0))
print('mean', normal_args[0])
print('std', normal_args[1])
ks_value, p_value = kstest(sample, 'norm', normal_args )
print('ks_value', ks_value)
print('p_value', p_value)
print('')
print('#####posted solution')
print('expected ks_value = 0.63919407')
print('expected p_value = 0.01650327')
mean 0.43000000000000005
std 0.1688786546606764
ks_value 0.23881183701141995
p_value 0.9379686201081335
####posted solution
expected ks_value = 0.63919407
expected p_value = 0.01650327
My bad. A new-guy mistake.
The function signature defines the 3rd argument as args=(). I had passed the 3rd argument positionally. Changing the call to
ks_value, p_value = kstest(sample, 'norm', args=normal_args)
yields the correct response.
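For anyone hitting the same thing: args is the tuple of parameters (here loc and scale) handed to the reference distribution, and leaving it out compares the sample against a standard normal. A sketch of both calls:
import numpy as np
from scipy.stats import kstest

sample = [0.37, 0.27, 0.69, 0.56, 0.26]
normal_args = (np.mean(sample), np.std(sample, ddof=0))

print(kstest(sample, 'norm'))                    # vs the standard normal N(0, 1)
print(kstest(sample, 'norm', args=normal_args))  # vs N(sample mean, sample std)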

PySpark PCA: avoiding NotConvergedException

I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) with PCA via pyspark.ml as follows:
1) Named my columns as one list:
features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in{'indi_nbr','label'}]).columns
2) Imported the necessary libraries
from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector
3) Collapsed the features to a DenseVector
indi_feat = indi_prep_df.rdd.map(lambda x: (x[0], x[-1], DenseVector(x[1:-2]))).toDF(['indi_nbr','label','features'])
4) Dropped everything but the features to retain index:
dftest = indi_feat.drop('indi_nbr','label')
5) Instantiated the PCA object
dfPCA = PCAML(k=3, inputCol="features", outputCol="pcafeats")
6) And attempted to fit the model
PCAout = dfPCA.fit(dftest)
But my model fails to converge (error below).
Things I've tried:
- Mean-filling or zero-filling NA and Null values (as appropriate)
- Reducing the number of features (to 25, then I switched to SKlearn's PCA)
Py4JJavaError: An error occurred while calling o2242.fit.
: breeze.linalg.NotConvergedException:
at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:110)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:40)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.svd$.apply(svd.scala:23)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:389)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
My configuration is 50 executors with 6 GB per executor, so I don't think it's a matter of not having enough resources (and I don't see anything about resources in the error).
My input factors are a mixture of percentages, integers, and 2-decimal floats, all positive and all ordinal. Could that be causing difficulty with convergence?
I had no trouble getting scikit-learn's PCA to converge, and quickly, once I converted the PySpark DataFrame to a pandas DataFrame.
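One avenue worth trying, as a sketch rather than a verified fix: it assumes the very different scales of the percentage and integer columns are what stalls the underlying SVD, and it uses the VectorAssembler that is imported above but never used (raw_features is a column name invented here):
from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.feature import VectorAssembler, StandardScaler

# assemble the feature columns (the `features` name list from step 1)
assembler = VectorAssembler(inputCols=features, outputCol="raw_features")
assembled = assembler.transform(indi_prep_df)

# scale to zero mean / unit variance so no column dominates the SVD
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)

dfPCA = PCAML(k=3, inputCol="features", outputCol="pcafeats")
PCAout = dfPCA.fit(scaled.select("features"))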

Low alpha for NLTK agreement using MASI distance

I'm getting a very low value for Krippendorff's alpha when I calculate agreement in NLTK using MASI as the distance function.
Three coders (Inky, Blinky, and Sue) are instructed to assign topic labels (love, gifts, slime, or gaming) to two texts (text01 and text02), based on what the texts are about. Each text can be about more than one topic, so coders may assign each text more than one label. The data and the code used to make the calculations are shown below:
import nltk
from nltk.metrics import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

# (coder, item, label)
data = [('inky', 'text01', frozenset(['love', 'gifts'])),
        ('blinky', 'text01', frozenset(['love', 'gifts'])),
        ('sue', 'text01', frozenset(['love', 'gifts'])),
        ('inky', 'text02', frozenset(['slime', 'gaming'])),
        ('blinky', 'text02', frozenset(['slime'])),
        ('sue', 'text02', frozenset(['slime', 'gaming']))]

jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
masi_task = nltk.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(data)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
    print()
When I run the code, I get the following results:
Statistics for dataset using <function jaccard_distance at 0x09D26DB0>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.7272727272727273
Kappa: 0.7777777777777777
Multi-Kappa: 0.7499999999999999
Alpha: 0.75
Statistics for dataset using <function masi_distance at 0x09D26DF8>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.8172727272727272
Kappa: 0.8511111111111113
Multi-Kappa: 0.8324999999999998
Alpha: -1.5
My question is, why is the alpha so low when using the MASI distance function compared to Jaccard?
I was unable to reproduce the problem and got the correct value of Krippendorff's alpha with MASI distance when running the provided code (Python 3.5.2, NumPy 1.18.2, NLTK 3.4.5). The most probable explanation is that you need to update NLTK.
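To check which version you are running (and upgrade if it is older):
import nltk
print(nltk.__version__)  # upgrade with: pip install --upgrade nltk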