Similar function to scipy.stats.zscore but base on another "sample" - scipy

I have 2 datasets which describe the same process and I expect the same general range of values. So I would like to do is use scipy.stats.zscore on the one dataset but instead of using the sample mean and standard deviation, I would like to use the mean and standard deviation from the other dataset. Is there such an equivalent function?

It sounds like you want scipy.stats.zmap.
In [141]: import numpy as np
In [142]: from scipy.stats import zmap
In [143]: olddata = np.array([3.67, 4.01, 3.60, 5.36, 3.65, 2.01, 2.75, 4.43, 2.74, 3.89, 3.60])
In [144]: newdata = np.array([1.0, 2.4, 2.5, 3.25, 5.6])
In [145]: zmap(newdata, olddata)
Out[145]: array([-3.05378533, -1.41573956, -1.29873629, -0.42121177, 2.32836506])

Related

How to get support numbers in pyspark.ml like we get in sklearn classification report?

I am using Pyspark and able to get metrices like accuracy, f1, precison and recall from MulticlassClassificationEvaluator but I am not sure how to get the support numbers like we get in classification report for sklearn. rfc_pred in my case has the group of each class that I run in loop. So, will rfc_pred.count() do the trick?
Below is my current code:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(rfc_pred, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(rfc_pred, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedRecall"})

scipy stats skewness is not correctly provide skewness results

I noticed that the skewness returned from scipy stats is not correct. Pandas.skew() actually provide better results.
I am recently trying to duplicate a classic paper, Expected Stock Returns and Volatility by French&Schwert. I use S&P500 data from 1928 to 1984. I follow the formula in the paper for standard deviation of the return and I am able to get the same result for mean, std dev of std dev.
However, when I use scipy.stats.skew function, I can't not get any number of the std dev of the sp return. The function return "nan", where clearly it should return a value.
I switch to Pandas.skew(). it returned me the correct value as in the paper.
Clearly, something is wrong with the scipy.stats.skew() function.
scipy.stats.skew()
pandas.skew()
Results by Scipy.stats.skew()
['Adj Close_gspc', 'Adj Close_gspc_lag', 'SP_Return', 'SP_Return_square',
'SP_Return_lag', 'SP_varianceMon', 'SP_varianceMon_sqrRoot']
array([ 0.6922229 , 0.69186265, -0.11292165, 4.23571807, -1.9556035 ,
5.39873607, nan])
results by pandas:
Adj Close_gspc 0.693745
Adj Close_gspc_lag 0.693384
SP_Return -0.113170
SP_Return_square 4.245033
SP_Return_lag -1.959904
SP_varianceMon 5.410609
SP_varianceMon_sqrRoot 2.800919
dtype: float64
You haven't provided enough information or sample code to reproduce the nan that you get.
To make scipy.stats.skew compute the same value as the skew() method in Pandas, add the argument bias=False.
Here's an example.
First, the imports:
In [21]: import numpy as np
In [22]: import pandas as pd
In [23]: from scipy.stats import skew
Generate some data:
In [24]: np.random.seed(8675309)
In [25]: x = np.random.weibull(0.2, size=15)
Compute the skew with scipy and with Pandas:
In [26]: skew(x, bias=False)
Out[26]: 3.7582525674514544
In [27]: pd.Series(x).skew()
Out[27]: 3.7582525674514544

Please help debug my call scipy library for Kolmogorov-Smirnov Test

I am completing an assignment but can not get the right results from a kolmogorov smirnov test for a small sample of observations against a 'norm' distribution.
I have setup a minimal sample in a jupyter notebook with expected kstest results and tried running this in several environment and reviewed the call for hours. Answer key says my ks_value and p_value are wildly wrong.
But, I cannot see my error.
the sample I have is from the test run in the answer key. it is a 1d array, a valid input option.
sample mean and standard deviation I compute look right
if I change ddof it makes a small difference (hint is to use ddof=0)
norm is a valid distribution for the kstest
library documentation is at
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html#scipy-stats-kstest
Any ideas or comments?
Would you expect a sample = [0.37, 0.27, 0.69, 0.56, 0.26] compared to a normal distribution to have
'KS test statistic' of 0.64 or 0.24
and
'p-value' of 0.02 or 0.94
TIA
import pandas as pd
import numpy as np
from scipy.stats import kstest
sample = [0.37, 0.27, 0.69, 0.56, 0.26]
normal_args = (np.mean(sample), np.std(sample, ddof=0))
print('mean', normal_args[0])
print('std', normal_args[1])
ks_value, p_value = kstest(sample, 'norm', normal_args )
print('ks_value', ks_value)
print('p_value', p_value)
print('')
print('#####posted solution')
print('expected ks_value = 0.63919407')
print('expected p_value = 0.01650327')
mean 0.43000000000000005
std 0.1688786546606764
ks_value 0.23881183701141995
p_value 0.9379686201081335
####posted solution
expected ks_value = 0.63919407
expected p_value = 0.01650327
My bad. A new guy mistake.
The function calls defines the 3rd argument as "args=()". I had put the 3rd argument in treating the input as a positional. Changing the call to
ks_value, p_value = kstest(sample, 'norm', args=(normal_args) )
yields the correct response.

How to implement exponentially decay learning rate in Keras by following the global steps

Look at the following example
# encoding: utf-8
import numpy as np
import pandas as pd
import random
import math
from keras import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam, RMSprop
from keras.callbacks import LearningRateScheduler
X = [i*0.05 for i in range(100)]
def step_decay(epoch):
initial_lrate = 1.0
drop = 0.5
epochs_drop = 2.0
lrate = initial_lrate * math.pow(drop,
math.floor((1+epoch)/epochs_drop))
return lrate
def build_model():
model = Sequential()
model.add(Dense(32, input_shape=(1,), activation='relu'))
model.add(Dense(1, activation='linear'))
adam = Adam(lr=0.5)
model.compile(loss='mse', optimizer=adam)
return model
model = build_model()
lrate = LearningRateScheduler(step_decay)
callback_list = [lrate]
for ep in range(20):
X_train = np.array(random.sample(X, 10))
y_train = np.sin(X_train)
X_train = np.reshape(X_train, (-1,1))
y_train = np.reshape(y_train, (-1,1))
model.fit(X_train, y_train, batch_size=2, callbacks=callback_list,
epochs=1, verbose=2)
In this example, the LearningRateSchedule does not change the learning rate at all because in each iteration of ep, epoch=1. Thus the learning rate is just const (1.0, according to step_decay). In fact, instead of setting epoch>1 directly, I have to do outer loop as shown in the example, and insider each loop, I just run 1 epoch. (This is the case when I implement deep reinforcement learning, instead of supervised learning).
My question is how to set an exponentially decay learning rate in my example and how to get the learning rate in each iteration of ep.
You can actually pass two arguments to the LearningRateScheduler.
According to Keras documentation, the scheduler is
a function that takes an epoch index as input (integer, indexed from
0) and current learning rate and returns a new learning rate as output
(float).
So, basically, simply replace your initial_lr with a function parameter, like so:
def step_decay(epoch, lr):
# initial_lrate = 1.0 # no longer needed
drop = 0.5
epochs_drop = 2.0
lrate = lr * math.pow(drop,math.floor((1+epoch)/epochs_drop))
return lrate
The actual function you implement is not exponential decay (as you mention in your title) but a staircase function.
Also, you mention your learning rate does not change inside your loop. That's true because you set model.fit(..., epochs=1,...) and your epochs_drop = 2.0 at the same time. I am not sure this is your desired case or not. You are providing a toy example and it's not clear in that case.
I would like to add the more common case where you don't mix a for loop with fit() and just provide a different epochs parameter in your fit() function. In this case you have the following options:
First of all keras provides a decaying functionality itself with the predefined optimizers. For example in your case Adam() the actual code is:
lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
which is not exactly exponential either and it's somehow different than tensorflow's one. Also, it's used only when decay > 0.0 as it's obvious.
To follow the tensorflow convention of exponential decay you should implement:
decayed_learning_rate = learning_rate * ^ (global_step / decay_steps)
Depending on your needs you could choose to implement a Callback subclass and define a function within it (see 3rd bullet below) or use LearningRateScheduler which is actually exactly this with some checking: a Callback subclass which updates the learning rate at each epoch end.
If you want a finer handling of your learning rate policy (per batch for example) you would have to implement your subclass since as far as I know there is no implemented subclass for this task. The good part is that it's super easy:
Create a subclass
class LearningRateExponentialDecay(Callback):
and add the __init__() function which will initialize your instance with all needed parameters and also create a global_step variables to keep track of the iterations (batches):
def __init__(self, init_learining_rate, decay_rate, decay_steps):
self.init_learining_rate = init_learining_rate
self.decay_rate = decay_rate
self.decay_steps = decay_steps
self.global_step = 0
Finally, add the actual function inside the class:
def on_batch_begin(self, batch, logs=None):
actual_lr = float(K.get_value(self.model.optimizer.lr))
decayed_learning_rate = actual_lr * self.decay_rate ^ (self.global_step / self.decay_steps)
K.set_value(self.model.optimizer.lr, decayed_learning_rate)
self.global_step += 1
The really cool part is the if you want the above subclass to update every epoch you could use on_epoch_begin(self, epoch, logs=None) which nicely has epoch as parameter to it's signature. This case is even easier as you could skip global step altogether (no need to keep track of it now unless you want a fancier way to apply your decay) and use epoch in it's place.

PySpark PCA: avoiding NotConvergedException

I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) using PCA through the ml.linalg method as follows:
1) Named my columns as one list:
features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in{'indi_nbr','label'}]).columns
2) Imported the necessary libraries
from pyspark.ml.feature import PCA as PCAML
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector
3) Collapsed the features to a DenseVector
indi_feat = indi_prep_df.rdd.map(lambda x: (x[0], x[-1], DenseVector(x[1:-2]))).toDF(['indi_nbr','label','features'])
4) Dropped everything but the features to retain index:
dftest = indi_feat.drop('indi_nbr','label')
5) Instantiated the PCA object
dfPCA = PCAML(k=3, inputCol="features", outputCol="pcafeats")
6) And attempted to fit the model
PCAout = dfPCA.fit(dftest)
But my model fails to converge (error below).
Things I've tried:
- Mean-filling or zero-filling NA and Null values (as appropriate)
- Reducing the number of features (to 25, then I switched to SKlearn's PCA)
Py4JJavaError: An error occurred while calling o2242.fit.
: breeze.linalg.NotConvergedException:
at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:110)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:40)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.svd$.apply(svd.scala:23)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:389)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:99)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:70)
My configuration is for 50 executors with 6GB/executor, so I don't think it's a matter of not having enough resources (and I don't see anything about resources here).
My input factors are a mixture of percentages, integers and 2-decimal floats, all positive and all ordinal. Could that be causing difficulty with convergence?
I had no trouble with the SKLearn method converging, and quickly, once I converted the PySpark DF to a Pandas DF.