Testing a trading system on bootstrap samples using Arch library in python - confidence-interval

I am trying to test the hypothesis that a trading strategy outperforms buy and hold. My original data consists of 1261 return observations, which serve as the sample for the bootstrap.
I want to know whether I have applied it correctly.
import pandas as pd
from arch.bootstrap import CircularBlockBootstrap

def back_test_series(x):
    df = pd.DataFrame(x, columns=['Close'])
    return df.Close

bs = CircularBlockBootstrap(40, sample_return)
results = bs.apply(back_test_series, 2500)
Above, sample_return is the sample containing 2761 returns on the actual data. I created 2500 bootstrapped samples of 2761 observations each,
and then computed cumulative returns to obtain price time series:
time_series = []
for simu in results:
    df = pd.DataFrame(simu, columns=["Close"])
    df['Close'] = (1 + df['Close']).cumprod()
    time_series.append(df)
Finally, I ran my backtest on the price series obtained from the bootstrap:
final_results = []
for simulation in time_series:
    x = Backtesting.scrip_backtest(simulation)
    final_results.append(x)
Backtesting.scrip_backtest is my trading strategy, which returns statistics such as the buy-and-hold CAGR, the strategy CAGR, and the standard deviation of strategy returns.
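For context, here is a minimal sketch of how I plan to turn final_results into a bootstrap confidence interval for the outperformance; the 'strategy_cagr' and 'bh_cagr' keys are just placeholders for whatever scrip_backtest actually returns:
import numpy as np

# Difference between strategy and buy-and-hold CAGR in each bootstrap sample
outperformance = np.array(
    [res['strategy_cagr'] - res['bh_cagr'] for res in final_results]
)

# Simple percentile bootstrap confidence interval for the outperformance
lower, upper = np.percentile(outperformance, [2.5, 97.5])
print(f"95% bootstrap CI for outperformance: [{lower:.4f}, {upper:.4f}]")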
My questions: Can I use the bootstrap in this way? Should I use MovingBlockBootstrap or CircularBlockBootstrap?
Is it correct to run the trading strategy on bootstrapped time series as described above?

Related

Using tbl_regression with imputed data/pooled regression models

I've had great success using the gtsummary::tbl_regression function to display regression model results. However, I can't see how to use tbl_regression with pooled regression models from imputed data sets, and I'd really like to.
I don't have a reproducible example handy; I just wanted to see if anyone else has found a way to work with, say, mids objects created by the mice package in tbl_regression.
In the current development version of gtsummary, it's possible to summarize models estimated on imputed data from the mice package. Here's an example:
# install dev version of gtsummary
remotes::install_github("ddsjoberg/gtsummary")
library(gtsummary)
packageVersion("gtsummary")
#> [1] ‘1.3.5.9012’
# impute the data
df_imputed <- mice::mice(trial, m = 2)
# build the model
imputed_model <- with(df_imputed, lm(age ~ marker + grade))
# present beautiful table with gtsummary
tbl_regression(imputed_model)
#> pool_and_tidy_mice: Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
Created on 2020-12-16 by the reprex package (v0.3.0)
It's important to note that you pass the mice model object to tbl_regression() BEFORE you pool the results. The tbl_regression() function needs access to the individual models in order to correctly identify the reference row and variable labels (among other things). Internally, the tidying function used on the mice model will first pool the results, then tidy the results. The code used for this process is printed to the console for transparency (as seen in the example above).

Customize metric visualization in MLFlow UI when using mlflow.tensorflow.autolog()

I'm trying to integrate MLflow into my project. Because I'm using tf.keras.fit_generator() for training, I take advantage of mlflow.tensorflow.autolog() (docs here) to enable automatic logging of metrics and parameters:
import mlflow
import tensorflow as tf

model = Unet()
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
metrics = [IOUScore(threshold=0.5), FScore(threshold=0.5)]
model.compile(optimizer, customized_loss, metrics)

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("model.h5", save_weights_only=True, save_best_only=True, mode='min'),
    tf.keras.callbacks.TensorBoard(log_dir='./logs', profile_batch=0, update_freq='batch'),
]

train_dataset = Dataset(src_dir=SOURCE_DIR)
train_data_loader = DataLoader(train_dataset, BATCH_SIZE, shuffle=True)

with mlflow.start_run():
    mlflow.tensorflow.autolog()
    mlflow.log_param("batch_size", BATCH_SIZE)
    model.fit_generator(
        train_data_loader,
        steps_per_epoch=len(train_data_loader),
        epochs=EPOCHS,
        callbacks=callbacks
    )
I expected something like this (just a demonstration taken from the docs):
However, after the training finished, this is what I got:
How can I configure so that the metric plot will update and display its value at each epoch instead of just showing the latest value?
After searching around, I found this issue related to my problem above. Actually, all my metrics were logged only once per training run (instead of once per epoch, as I intuitively expected). The reason is that I didn't specify the every_n_iter parameter in mlflow.tensorflow.autolog(), which indicates how many 'iterations' must pass before MLflow logs the metrics (see the docs). So, changing my code to:
mlflow.tensorflow.autolog(every_n_iter=1)
fixed the problem.
P.S.: Remember that in TF 2.x an 'iteration' is an epoch (in TF 1.x it's a batch).
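For completeness, here is a sketch of how the corrected call slots into the training snippet above (same variable names as in the question):
with mlflow.start_run():
    mlflow.tensorflow.autolog(every_n_iter=1)  # log metrics after every epoch in TF 2.x
    mlflow.log_param("batch_size", BATCH_SIZE)
    model.fit_generator(
        train_data_loader,
        steps_per_epoch=len(train_data_loader),
        epochs=EPOCHS,
        callbacks=callbacks
    )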

Transforming dates in tensorflow or tensorflow extended

I am working with Tensorflow Extended, preprocessing data and among this data are date values (e.g. values of the form 16-04-2019). I need to apply some preprocessing to this, like the difference between two dates and extracting the day, month and year from it.
For example, I could need to have the difference in days between 01-04-2019 and 16-04-2019, but this difference could also span days, months or years.
Now, just using Python scripts this is easy to do, but I am wondering if it is also possible to do this with Tensorflow? It's important for my use case to do this within Tensorflow, because the transform needs to be done in the graph format so that I can serve the model with the transformations inside the pipeline.
I am using Tensorflow 1.13.1, Tensorflow Extended and Python 2.7 for this.
Posting from a similar issue on the tft GitHub. Here's a way to do it:
import tensorflow_addons as tfa
import tensorflow as tf

@tf.function(experimental_follow_type_hints=True)
def fn_seconds_since_1970(date_time: tf.string, date_format: str = "%Y-%m-%d %H:%M:%S %Z"):
    # Parse the timestamp string into seconds since the Unix epoch, entirely in-graph
    seconds_since_1970 = tfa.text.parse_time(date_time, date_format, output_unit='SECOND')
    seconds_since_1970 = tf.cast(seconds_since_1970, dtype=tf.int64)
    return seconds_since_1970

string_date_tensor = tf.constant("2022-04-01 11:12:13 UTC")
seconds_since_1970 = fn_seconds_since_1970(string_date_tensor)

seconds_in_hour, hours_in_day = tf.constant(3600, dtype=tf.int64), tf.constant(24, dtype=tf.int64)

hours_since_1970 = seconds_since_1970 / seconds_in_hour
hours_since_1970 = tf.cast(hours_since_1970, tf.int64)
hour_of_day = hours_since_1970 % hours_in_day

days_since_1970 = seconds_since_1970 / (seconds_in_hour * hours_in_day)
days_since_1970 = tf.cast(days_since_1970, tf.int64)
day_of_week = (days_since_1970 + 4) % 7  # Jan 1st 1970 was a Thursday (4); Sunday is 0

print(f"On {string_date_tensor.numpy().decode('utf-8')}, {seconds_since_1970} seconds had elapsed since 1970.")
My two cents on the broader underlying issue: here the task is computing time differences, and we want to do these computations on tensors. The first question then becomes: what are the units of these tensors? That is a question of granularity. The next question is: what data types are involved? You likely start with a string and end with a numeric. The question after that is whether there is a "native" TensorFlow function that can do this. Enter TensorFlow Addons!
Just as we try to optimize training by doing everything as tensor operations within the graph, we also need to optimize "getting to the graph". I have seen how datetime handling works with Python functions here, and I would do everything I could to avoid going into Python-function land, as the code becomes complex and performance suffers as well. It's a lose-lose in my opinion.
PS - This op is not yet implemented on Windows as per this, maybe because it only returns Unix timestamps :)
I had a similar problem. The issue is caused by an if-check within TFX that doesn't take date types into account. As far as I've been able to figure out, there are two options:
Preprocess the date column and cast it to an int (e.g. by calling toordinal() on each element) before reading it into TFX (see the sketch after this list).
Edit the TFX function that checks types to account for date-like types and cast them to ordinal on the fly.
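A minimal sketch of the first option, assuming the raw data is read with pandas and the date column is called 'date' (both names are assumptions about your setup):
import pandas as pd

# Parse the date column and replace it with ordinal integers before TFX ingestion
df = pd.read_csv("input.csv", parse_dates=["date"])
df["date"] = df["date"].map(lambda d: d.toordinal())
df.to_csv("input_ordinal.csv", index=False)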
For the second option, you can navigate to venv/lib/python3.7/site-packages/tfx/components/example_gen/utils.py and look for the function dict_to_example. You can add a datetime check there like so:
def dict_to_example(instance: Dict[Text, Any]) -> tf.train.Example:
  """Converts dict to tf example."""
  feature = {}
  for key, value in instance.items():
    # TODO(jyzhao): support more types.
    if isinstance(value, datetime.datetime):  # <---- Check here
      value = value.toordinal()
    if value is None:
      feature[key] = tf.train.Feature()
    ...
value will become an int, and the int will be handled and cast to a Tensorflow type later on in the function.
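For reference, a toordinal() value round-trips cleanly, so the original date can still be recovered downstream if needed; a small standalone sketch:
import datetime

ordinal = datetime.date(2019, 4, 16).toordinal()
recovered = datetime.date.fromordinal(ordinal)  # datetime.date(2019, 4, 16)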

Incorrect evaluations on Random Forest in Pyspark

I am running a prediction using Logistic Regression and Random Forest on telecom churn data set.
Please find here the code snippet from my notebook:
data=spark.read.csv("D:\Shashank\CBA\Pyspark\Telecom_Churn_Data_SingTel.csv", header=True, inferSchema=True)
data.show(3)
This link shows the kind of data I am dealing with at a high level.
data=data.drop("State").drop("Area Code").drop("Phone Number")
from pyspark.ml.feature import StringIndexer, VectorAssembler
intlPlanIndex = StringIndexer(inputCol="International Plan", outputCol="International Plan Index")
voiceMailPlanIndex = StringIndexer(inputCol="Voice mail Plan", outputCol="Voice mail Plan Index")
churnIndex = StringIndexer(inputCol="Churn", outputCol="label")
othercols=["Account Length", "Num of Voice mail Messages","Total Day Minutes", "Total Day Calls", "Total day Charge","Total Eve Minutes","Total Eve Calls","Total Eve Charge","Total Night Minutes","Total Night Calls ","Total Night Charge","Total International Minutes","Total Intl Calls","Total Intl Charge","Number Customer Service calls "]
assembler = VectorAssembler(inputCols= ['International Plan Index'] + ['Voice mail Plan Index'] + othercols, outputCol="features")
(train, test) = data.randomSplit([0.8,0.2])
from pyspark.ml.classification import LogisticRegression
lrObj = LogisticRegression(labelCol='label', featuresCol='features')
from pyspark.ml.pipeline import Pipeline
pipeline = Pipeline(stages=[intlPlanIndex, voiceMailPlanIndex, churnIndex, assembler, lrObj])
lrModel = pipeline.fit(train)
prediction_train = lrModel.transform(train)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
lr_Evaluator = MulticlassClassificationEvaluator()
lr_Evaluator.evaluate(prediction_train)
This image shows the result of the evaluation using Logistic Regression.
I then repeat the same steps using a Random Forest classification model, and it evaluates to 94.4%.
My result is sort of like this:
Link to my Random Forest evaluation result
Everything looks OK until now.
But I got curious to see how things are actually being predicted, so I printed the values of my predictions using the code below:
selected = prediction_1.select("features", "Label", "Churn", "prediction")
for row in selected.collect():
    print(row)
The result I get looks something like the screenshot below:
Link to image that shows the 2 results printed out for manual analysis
I then copied both cells, as shown in the above link, into a comparison tool to check whether my predicted values differ. (I expected some difference, since the evaluation for Random Forest turned out to be better.)
But the comparison showed that the predictions are identical. Yet the evaluation results differ: 83.6% with Logistic Regression and 94.4% with Random Forest.
Why is there no difference in the two sets of predictions generated by the two models, when the final evaluation using MulticlassClassificationEvaluator gives me different scores?
You seem to be interested in metricName="accuracy":
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
For more info, refer to the official documentation.
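As a side note, MulticlassClassificationEvaluator defaults to metricName="f1" when no metric is specified, so it can help to request both metrics explicitly for each model. A rough sketch, where lr_predictions and rf_predictions are assumed names for the transformed DataFrames from the two pipelines:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

for name, preds in [("LogisticRegression", lr_predictions),
                    ("RandomForest", rf_predictions)]:
    # Default label/prediction columns ("label", "prediction") match the pipelines above
    f1 = MulticlassClassificationEvaluator(metricName="f1").evaluate(preds)
    acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(preds)
    print(f"{name}: f1={f1:.4f}, accuracy={acc:.4f}")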
This question is no longer relevant, since I am now able to see the difference in the predictions, which is in line with the accuracy reported for each model.
The question came up because the data I copied from my Jupyter notebook was incomplete.
Thanks, and I appreciate your time.

Calculations in table based on variable names in matlab

I am trying to find a better solution for calculations using data stored in a table. I have a large table with many variables (100+), from which I select a smaller sub-table with only two observations and their difference, for a smaller selection of variables. The resulting table looks, for example, like this:
             air        bbs         bri
           _________   ________    _________
    test1    12.451      0.549       3.6987
    test2    10.2        0.47        3.99
    diff      2.251      0.078999   -0.29132
Now, I need to multiply the ‘diff’ row by various coefficients that differ between variables. I can get the required result with the following code:
T(4,:) = array2table([T.air(3)*0.2*0.25, T.bbs(3)*0.1*0.25, T.bri(3)*0.7*0.6/2]);
However, I need a more flexible solution, since the selection of variables will differ between applications. I was thinking that a better solution might be to use either varfun or rowfun with a specific function that assigns the correct coefficients/equations based on variable names:
T(4,:) = varfun(@func, T(3,:), 'InputVariables', {'air' 'bbs' 'bri'});
or
T(4,:) = rowfun(@func, T(3,:), 'OutputVariableNames', T.Properties.VariableNames);
However, the current solution I have is similarly inflexible as the basic calculation above:
function [air_out, bbs_out, bri_out] = func(air, bbs, bri)
    air_out = air*0.2*0.25;
    bbs_out = bbs*0.1*0.25;
    bri_out = bri*0.7*0.6/2;
since I need to define every input/output variable. What I need is to assign, inside the function, coefficients/equations for every variable, and for the function to apply them only to the variables that are present in the specific sub-table.
Any suggestions?