I am trying to compare categorical data from 2 groups.
Yes No
GrpA: [152, 220]
GrpB: [187, 350]
However, I am getting different P value results when using different methods:
count = [152, 220]
nobs = [187, 350]
import statsmodels
import scipy.stats
# USING STATSMODELS PACKAGE:
res = statsmodels.stats.proportion.proportions_chisquare(count, nobs)
print("P value =", res[1])
res = statsmodels.stats.proportion.proportions_ztest(count, nobs)
print("P value =", res[1])
# USING SCIPY.STATS PACKAGE:
res = scipy.stats.chi2_contingency([count, nobs], correction=True)
print("P value =", res[1])
res = scipy.stats.chi2_contingency([count, nobs], correction=False)
print("P value =", res[1])
Output is:
P value using proportions_chisquare = 1.037221289479458e-05
P value using proportions_ztest= 1.0372212894794536e-05
P value using chi2_contingency with correction= 0.0749218380702875
P value using chi2_contingency without correction= 0.06421435896354544
First 2 are identical (and highly significant), but they are different from last 2 (non-signficant).
Why are the results different? Which is the correct method to do this analysis?
Related
Hello StackOverflowers.
I have a pyspark dataframe that consists of a time_column and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a KS test of each snapshot value with the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F
def KS_for_one_snapshot(temp_df, snapshots_list, j, var = "values"):
sample1=temp_df.filter(F.col("snapshot")==snapshots_list[j])
sample2=temp_df.filter(F.col("snapshot")==snapshots_list[j-1]) # pick the last snapshot as the one to compare with
if (sample1.count() == 0 or sample2.count() == 0 ):
ks_value = -1 # previously "0 observations" which gave type error
else:
ks_value, p_value = ks_2samp( np.array(sample1.select(var).collect()).reshape(-1)
, np.array(sample2.select(var).collect()).reshape(-1)
, alternative="two-sided"
, mode="auto")
return ks_value
results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1 ):
results.append(KS_for_one_snapshot(df, snapshots_list, j+1))
results
But the data in reality is huge so it takes forever. I am using databricks and pyspark, so I wonder what would be a more efficient way to run it by avoiding the for loop and utilizing the available workers.
I tried to do it by using a udf but in vain.
Any ideas?
PS. you can generate the data with the following code.
from random import randint
df = (spark.createDataFrame( range(1,1000), T.IntegerType())
.withColumn('snapshot' ,F.array(F.lit("2005-01-31"), F.lit("2005-02-28"),F.lit("2005-03-30") ).getItem((F.rand()*3).cast("int")))
.withColumn('values', F.rand()).drop('value')
)
Update:
I tried the following by using an UDF.
var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias("value_list_2"))
windowSpec = Window.orderBy("snapshot")
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag("snapshot", 1).over(Window.orderBy('snapshot'))).filter('snapshot_2 is not NULL')
data_input_final = data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
if (len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0 ):
ks_value = -1 # previously "0 observations" which gave type error
else:
print('something')
ks_value, p_value = ks_2samp( sample_in_list_1
, sample_in_list_2
, alternative="two-sided"
, mode="auto")
return ks_value
import pyspark.sql.types as T
KS_one_snapshot_general_udf = udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select( KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
Which works fine if the dataset (per snapshot) is small. But If I increase the number of rows then I end up with an error.
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
I want to obtain parameters of multiple integral ():
'''
from sympy import *
z = symbols('z', real=True)
nu = Function('nu', real=True, positive=True)(z)
xx1 = nu.integrate((z,0,z))
xx2 = xx1.integrate((z,0,1))
xx1
xx2
xx1.cancel
xx2.cancel
'''
Integral and its structure in jupyterlab shown
Then with "wild"-type variables "Wi" I try to obtain parameters of integrals
''''
mm1 = Integral(W1,(W2,W3,W4))
mm2 = Integral(W1,(W2,W3,W4),(W5,W6,W7))
mm1
mm2
rr1 = xx1.match(mm1)
rr2 = xx2.match(mm2)
rr1
rr2.type()
''''
matching result
It works for single integral but doesmt for multiple. Why?
SECOND question is: why integration variable "z" is not obtained to "W2"?
THIRD question: why variable "z" is changed to symbol "_0" in "W1:nu(z)? How to do it right?
Recommended in comment '.list' doesn't work:
gives error message
Every expression has args -- all you are (apparently) try to do is capture the arguments in variables. Wild are not needed for this:
>>> W1,(W2,W3,W4),(W5,W6,W7) = xx2.args
>>> W1,(W2,W3,W4),(W5,W6,W7)
(nu(z), (z, 0, z), (z, 0, 1))
I followed the tutorial about Reading data with TF and made some tries myself. Now, the problem is that my tests show duplicate data in the batches I created when reading data from a CSV.
My code looks like this:
# -*- coding: utf-8 -*-
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import collections
import numpy as np
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
class XICSDataSet:
def __init__(self, height=20, width=195, batch_size=1000, noutput=15):
self.depth = 1
self.height = height
self.width = width
self.batch_size = batch_size
self.noutput = noutput
def trainingset_files_reader(self, data_dir, nfiles):
fnames = [os.path.join(data_dir, "test%d"%i) for i in range(nfiles)]
filename_queue = tf.train.string_input_producer(fnames, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [[.0],[.0],[.0],[.0],[.0]]
data_tuple = tf.decode_csv(value, record_defaults=record_defaults, field_delim = ' ')
features = tf.pack(data_tuple[:-self.noutput])
label = tf.pack(data_tuple[-self.noutput:])
depth_major = tf.reshape(features, [self.height, self.width, self.depth])
min_after_dequeue = 100
capacity = min_after_dequeue + 30 * self.batch_size
example_batch, label_batch = tf.train.shuffle_batch([depth_major, label], batch_size=self.batch_size, capacity=capacity,
min_after_dequeue=min_after_dequeue)
return example_batch, label_batch
with tf.Graph().as_default():
ds = XICSDataSet(2, 2, 3, 1)
im, lb = ds.trainingset_files_reader(filename, 1)
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
tf.train.start_queue_runners(sess=sess)
for i in range(1000):
lbs = sess.run([im, lb])[1]
_, nu = np.unique(lbs, return_counts=True)
if np.array_equal(nu, np.array([1, 1, 1])) == False:
print('Not unique elements found in a batch!')
print(lbs)
I tried with different batch sizes, different number of files, different values of capacity and min_after_dequeue, but I always get the problem. In the end, I would like to be able to read data from only one file, creating batches and shuffling the examples.
My files, created ad hoc for this test, have 5 lines each representing samples, and 5 columns. The last column is meant to be the label for that sample. These are just random numbers. I'm using only 10 files just to test this out.
The default behavior for tf.train.string_input_producer(fnames) is to produce an infinite number of copies of the elements in fnames. Therefore, since your tf.train.shuffle_batch() capacity is larger than the total number of elements in your input files (5 elements per file * 10 files = 50 elements), and the min_after_dequeue is also larger than the number of elements, the queue will contain at least two full copies of the input data before the first batch is produced. As a result, it is likely that some batches will contain duplicate data.
If you only want to process each example once, you can set an explicit num_epochs=1 when creating the tf.train.string_input_producer(). For example:
def trainingset_files_reader(self, data_dir, nfiles):
fnames = [os.path.join(data_dir, "test%d" % i) for i in range(nfiles)]
filename_queue = tf.train.string_input_producer(
fnames, shuffle=False, num_epochs=1)
# ...
class Matrix:
def __init__(self, nr, nc):
self.NRows = nr
self.NCols = nc
self.data = [ [0]*self.NCols for r in range(self.NRows) ]
def max(self, other):
""" return: a matrix with as many rows as the shorter of self and other and as many columns as the narrower of self and other.
Each entry of the returned matrix should be the larger (the max) of the corresponding entries in self and other.
"""
minrows = min(other.NRows, self.NRows)
mincols = min(other.NCols, self.NCols)
M = Matrix(minrows, mincols)
for i in range(minrows):
for j in range(mincols):
M.data[i][j] = max(self.data[i][j], other[i][j])
return M
This code give Traceback in max when tested and output said:
...in max
M.data[i][j] = max(self.data[i][j], other[i][j])
AttributeError: Matrix instance has no attribute '__getitem__'
How to get rid of this error? Where I made mistake?. Please help someone.
You should use this :
M.data[i][j] = max(self.data[i][j], other.data[i][j])
instead of
M.data[i][j] = max(self.data[i][j], other[i][j])
I'm cythonizing a skript that containes scipy.stats.norm() function for calculation of implied vola.
Instead of scipy.stats.norm() I use scipy.special.ndtr() since this is somewhat faster. However, when profiling my script most time (50 from 125sec) is still spent within this function, in particular within _distn_infrastructure.py:1610(cdf).
That's the function:
def cdf(self, x, *args, **kwds):
"""
in class rv_continuous(rv_generic):
Cumulative distribution function of the given RV.
Parameters
----------
x : array_like
quantiles
arg1, arg2, arg3,... : array_like
The shape parameter(s) for the distribution (see docstring of the
instance object for more information)
loc : array_like, optional
location parameter (default=0)
scale : array_like, optional
scale parameter (default=1)
Returns
-------
cdf : ndarray
Cumulative distribution function evaluated at `x`
"""
args, loc, scale = self._parse_args(*args, **kwds)
x, loc, scale = map(asarray, (x, loc, scale))
args = tuple(map(asarray, args))
x = (x-loc)*1.0/scale
cond0 = self._argcheck(*args) & (scale > 0)
cond1 = (scale > 0) & (x > self.a) & (x < self.b)
cond2 = (x >= self.b) & cond0
cond = cond0 & cond1
output = zeros(shape(cond), 'd')
place(output, (1-cond0)+np.isnan(x), self.badvalue)
place(output, cond2, 1.0)
if any(cond): # call only if at least 1 entry
goodargs = argsreduce(cond, *((x,)+args))
place(output, cond, self._cdf(*goodargs))
if output.ndim == 0:
return output[()]
return output
However, I neighter see any function that does the actual cdf calculation nor a call of a another function that actually does it. I tried to print the output of this function via inserting
print output
before
return output
however, the print command is highlighted as wrong syntax and when running the skript there is no print of the output. How to go from here? I somehow need to speed up the norm-CDF calculation.