Incorrect evaluations on Random Forest in PySpark

I am running a prediction using Logistic Regression and Random Forest on a telecom churn data set.
Here is the code snippet from my notebook:
data=spark.read.csv("D:\Shashank\CBA\Pyspark\Telecom_Churn_Data_SingTel.csv", header=True, inferSchema=True)
data.show(3)
This link shows the kind of data I am dealing with at a high level.
data=data.drop("State").drop("Area Code").drop("Phone Number")
from pyspark.ml.feature import StringIndexer, VectorAssembler
intlPlanIndex = StringIndexer(inputCol="International Plan", outputCol="International Plan Index")
voiceMailPlanIndex = StringIndexer(inputCol="Voice mail Plan", outputCol="Voice mail Plan Index")
churnIndex = StringIndexer(inputCol="Churn", outputCol="label")
othercols=["Account Length", "Num of Voice mail Messages","Total Day Minutes", "Total Day Calls", "Total day Charge","Total Eve Minutes","Total Eve Calls","Total Eve Charge","Total Night Minutes","Total Night Calls ","Total Night Charge","Total International Minutes","Total Intl Calls","Total Intl Charge","Number Customer Service calls "]
assembler = VectorAssembler(inputCols= ['International Plan Index'] + ['Voice mail Plan Index'] + othercols, outputCol="features")
(train, test) = data.randomSplit([0.8,0.2])
from pyspark.ml.classification import LogisticRegression
lrObj = LogisticRegression(labelCol='label', featuresCol='features')
from pyspark.ml.pipeline import Pipeline
pipeline = Pipeline(stages=[intlPlanIndex, voiceMailPlanIndex, churnIndex, assembler, lrObj])
lrModel = pipeline.fit(train)
prediction_train = lrModel.transform(train)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
lr_Evaluator = MulticlassClassificationEvaluator()
lr_Evaluator.evaluate(prediction_train)
This image shows the evaluation result using Logistic Regression.
I then repeat the same steps using a Random Forest classification model, and it evaluates to 94.4%. My result is sort of like this:
Link to my Random Forest evaluation result
Everything looks ok until now.
But I got curious to see how things are actually being predicted, so I print the values of my predictions using the code below:
selected = prediction_1.select("features", "Label", "Churn", "prediction")
for row in selected.collect():
    print(row)
The result I get looks like the screenshot below:
Link to image that shows the 2 results printed out for manual analysis
I then copy both the cells shown in the above link into a comparison tool to see if my predicted values are different. (I expect there to be some difference, since the evaluation for Random Forest turned out to be better.)
But the comparison in every tool showed that the predictions are the same. Yet the evaluation results differ: 83.6% for Logistic Regression and 94.4% for Random Forest.
Why is there no difference in the two sets of predictions I have generated from two different models, when the evaluation using MulticlassClassificationEvaluator gives me different scores?

You seem to be interested in metricName="accuracy":
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
For more info, refer to the official documentation.
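Note that when no metricName is given, MulticlassClassificationEvaluator defaults to "f1", so the two numbers may not even be the same metric. As a rough sketch (lrModel is the fitted pipeline from the question; rfModel stands in for the corresponding Random Forest pipeline, which isn't shown; test is the held-out split), you can evaluate both models under the same metric and sanity-check their predictions side by side:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# lrModel / rfModel are placeholders for the two fitted pipeline models
lr_pred = lrModel.transform(test)
rf_pred = rfModel.transform(test)

for metric in ["f1", "accuracy"]:
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName=metric)
    print(metric, "LR:", evaluator.evaluate(lr_pred), "RF:", evaluator.evaluate(rf_pred))

# If the two models really produced identical predictions, the per-class
# prediction counts should also match exactly.
lr_pred.groupBy("prediction").count().show()
rf_pred.groupBy("prediction").count().show()
If the metrics still differ while the predictions look identical, check that both evaluations really ran on the same dataframe and label column.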

This question is no longer relevant, since I am now able to see the difference in the predictions, which is in line with the accuracy reported by each model. The question came up because the data I copied from my Jupyter notebook was incomplete.
Thanks, and I appreciate your time.

Related

Testing a trading system on bootstrap samples using Arch library in python

I am trying to test a hypothesis about the outperformance of a trading strategy over buy and hold. I have the original data's returns, containing 1261 observations, as a sample to be used for the bootstrap.
I want to know if I have applied it correctly.
import pandas as pd

def back_test_series(x):
    df = pd.DataFrame(x, columns=['Close'])
    return df.Close
from arch.bootstrap import CircularBlockBootstrap
bs = CircularBlockBootstrap(40, sample_return)
results = bs.apply(back_test_series, 2500)
Above, sample_return is the sample containing 2761 returns on actual data. I created 2500 bootstrapped samples containing 2761 observations each.
and then created cumulative returns to get a price time series:
time_series = []
for simu in results:
    df = pd.DataFrame(simu, columns=["Close"])
    df['Close'] = (1 + df['Close']).cumprod()
    time_series.append(df)
and finally ran my backtest on the price series obtained from the bootstrap:
final_results = []
for simulation in time_series:
    x = Backtesting.scrip_backtest(simulation)
    final_results.append(x)
Backtesting.scrip_backtest is my trading strategy, which returns stats like buy-and-hold CAGR, strategy CAGR, and the standard deviation of strategy returns.
My question is: can I use the bootstrap in this way? Should I use MovingBlockBootstrap or CircularBlockBootstrap?
Is it correct to run a trading strategy on bootstrapped time series as described above?

K-means in PySpark running infinitely in Jupyter notebook, works fine in Zeppelin notebook

I am running a k-means algorithm in pyspark:
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
import numpy as np
kmeans_modeling = KMeans(k = 3, seed = 0)
model = kmeans_modeling.fit(data.select("parameters"))
The data is a pyspark sql dataframe: pyspark.sql.dataframe.DataFrame
However, the algorithm is running infinitely (it is taking much, much longer than it should for the amount of data in the dataframe).
Does anyone know what could be causing the algorithm to behave like this? I ran this exact code for a different dataframe of the same type, and everything worked fine.
The dataset I used before (that worked) had 72020 rows and 35 columns, and the present dataset has 60297 rows and 31 columns, so it is not a size-related problem. The data was normalized in both cases, but I assume the problem has to be in the data treatment. Can anyone help me with this? If any other information is needed let me know in the comments and I will answer or edit the question.
EDIT:
This is what I can show about creating the data:
aux1 = temp.filter("valflag = 0")
sample = spark.read.option("header", "true").option("delimiter", ",").csv("gs://LOCATION.csv").select("id")
data_pre = aux1.join(sample, sample["sample"] == aux1["id"], "leftanti").drop("sample")
data_pre.createOrReplaceTempView("data_pre")
data_pre = spark.table("data_pre")
data_pre = data.withColumn(col, functions.col(col).cast("double"))
data_pre = data_pre.na.fill(0)
data = vectorization_function(df = data_pre, inputCols = inputCols, outputCol = "parameters")
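vectorization_function is a helper that is not shown here; assuming it simply wraps a VectorAssembler (the exact implementation below is an assumption, not the code used above), it would look roughly like this:
from pyspark.ml.feature import VectorAssembler

def vectorization_function(df, inputCols, outputCol):
    # Assemble the numeric input columns into a single vector column
    # that KMeans can consume via data.select("parameters").
    # (In Spark 2.4+, handleInvalid="skip" or "keep" controls what happens
    # to rows with nulls/NaNs; the default is to raise an error.)
    assembler = VectorAssembler(inputCols=inputCols, outputCol=outputCol)
    return assembler.transform(df)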
EDIT 2: I cannot provide additional information about the data, but I have now realized that the algorithm runs without problem in a Zeppelin notebook but is not working in a Jupyter notebook; I have edited the tags and title accordingly. Does anyone know why this could be happening?
Here is some documentation about running clustering jobs in Spark.
https://spark.apache.org/docs/latest/ml-clustering.html
Here is the corresponding guide for the older RDD-based MLlib API, which is very similar.
https://spark.apache.org/docs/latest/mllib-clustering.html

Using tbl_regression with imputed data/pooled regression models

I've had great success using the gtsummary::tbl_regression function to display regression model results. I can't see how to use tbl_regression with pooled regression models from imputed data sets, however, and I'd really like to.
I don't have a reproducible example handy, I just wanted to see if anyone else has found a way to work with, say, mids objects created by the mice package in tbl_regression.
In the current development version of gtsummary, it's possible to summarize models estimated on imputed data from the mice package. Here's an example
# install dev version of gtsummary
remotes::install_github("ddsjoberg/gtsummary")
library(gtsummary)
packageVersion("gtsummary")
#> [1] ‘1.3.5.9012’
# impute the data
df_imputed <- mice::mice(trial, m = 2)
# build the model
imputed_model <- with(df_imputed, lm(age ~ marker + grade))
# present beautiful table with gtsummary
tbl_regression(imputed_model)
#> pool_and_tidy_mice: Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
Created on 2020-12-16 by the reprex package (v0.3.0)
It's important to note that you pass the mice model object to tbl_regression() BEFORE you pool the results. The tbl_regression() function needs access to the individual models in order to correctly identify the reference row and variable labels (among other things). Internally, the tidying function used on the mice model will first pool the results, then tidy the results. The code used for this process is printed to the console for transparency (as seen in the example above).

Transforming dates in tensorflow or tensorflow extended

I am working with Tensorflow Extended, preprocessing data and among this data are date values (e.g. values of the form 16-04-2019). I need to apply some preprocessing to this, like the difference between two dates and extracting the day, month and year from it.
For example, I could need to have the difference in days between 01-04-2019 and 16-04-2019, but this difference could also span days, months or years.
Now, just using Python scripts this is easy to do, but I am wondering if it is also possible to do this with Tensorflow? It's important for my use case to do this within Tensorflow, because the transform needs to be done in the graph format so that I can serve the model with the transformations inside the pipeline.
I am using Tensorflow 1.13.1, Tensorflow Extended and Python 2.7 for this.
Posting from a similar issue on the tft GitHub.
Here's a way to do it:
import tensorflow_addons as tfa
import tensorflow as tf

@tf.function(experimental_follow_type_hints=True)
def fn_seconds_since_1970(date_time: tf.string, date_format: str = "%Y-%m-%d %H:%M:%S %Z"):
    seconds_since_1970 = tfa.text.parse_time(date_time, date_format, output_unit='SECOND')
    seconds_since_1970 = tf.cast(seconds_since_1970, dtype=tf.int64)
    return seconds_since_1970

string_date_tensor = tf.constant("2022-04-01 11:12:13 UTC")
seconds_since_1970 = fn_seconds_since_1970(string_date_tensor)

seconds_in_hour, hours_in_day = tf.constant(3600, dtype=tf.int64), tf.constant(24, dtype=tf.int64)

hours_since_1970 = seconds_since_1970 / seconds_in_hour
hours_since_1970 = tf.cast(hours_since_1970, tf.int64)
hour_of_day = hours_since_1970 % hours_in_day

days_since_1970 = seconds_since_1970 / (seconds_in_hour * hours_in_day)
days_since_1970 = tf.cast(days_since_1970, tf.int64)
day_of_week = (days_since_1970 + 4) % 7  # Jan 1st 1970 was a Thursday (4); Sunday is 0

print(f"On {string_date_tensor.numpy().decode('utf-8')}, {seconds_since_1970} seconds had elapsed since 1970.")
My two cents on the broader underlying issue: the question here is computing time differences, and we want to do these computations on tensors. Then the question becomes: what are the units of these tensors? This is a question of granularity. The next question is: what data types are involved? You likely start with a string and end with a numeric. Then the next question becomes: is there a "native" TensorFlow function that can do this? Enter TensorFlow Addons!
Just like we try to optimize training by doing everything as tensor operations within the graph, we similarly need to optimize "getting to the graph". I have seen the way datetime would work with Python functions here, and I would do everything I could to avoid going into Python-function land, as the code becomes complex and the performance suffers as well. It's a lose-lose in my opinion.
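For contrast, a minimal sketch of the Python-function route mentioned above (the date format string is an assumption): it works, but every element crosses the graph/Python boundary through tf.py_function, which is what the paragraph above recommends avoiding.
import datetime
import tensorflow as tf

def _py_parse(date_string):
    # Plain-Python parsing; runs outside the TensorFlow graph.
    dt = datetime.datetime.strptime(date_string.numpy().decode("utf-8"), "%d-%m-%Y")
    return int((dt - datetime.datetime(1970, 1, 1)).total_seconds())

seconds = tf.py_function(_py_parse, inp=[tf.constant("16-04-2019")], Tout=tf.int64)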
PS - This op is not yet implemented on windows as per this, maybe because it only returns unix timestamps :)
I had a similar problem. The issue was caused by an if-check within TFX that doesn't take date types into account. As far as I've been able to figure out, there are two options:
Preprocess the date column and cast it to an int (e.g. by calling toordinal() on each element) before reading it into TFX (see the sketch at the end of this answer)
Edit the TFX function that checks types to account for date-like types and cast them to ordinal on the fly.
You can navigate to venv/lib/python3.7/site-packages/tfx/components/example_gen/utils.py and look for the function dict_to_example. You can add a datetime check there like so:
def dict_to_example(instance: Dict[Text, Any]) -> tf.train.Example:
    """Converts dict to tf example."""
    feature = {}
    for key, value in instance.items():
        # TODO(jyzhao): support more types.
        if isinstance(value, datetime.datetime):  # <---- Check here
            value = value.toordinal()
        if value is None:
            feature[key] = tf.train.Feature()
        ...
value will become an int, and the int will be handled and cast to a Tensorflow type later on in the function.
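For the first option, a minimal sketch of preprocessing the date column before it reaches TFX; the column name, date format, and file paths below are placeholders:
import pandas as pd

df = pd.read_csv("input.csv")
# Parse the date strings (e.g. "16-04-2019") and replace them with ordinal day
# numbers, so ExampleGen only ever sees a plain integer column.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y").map(lambda d: d.toordinal())
df.to_csv("preprocessed.csv", index=False)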

MLlib classification example stops in stage 1

EDIT:
I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing the HashingTF to numFeatures = 9, then 13, then created one for each. Then the program stopped at "count at DataValidators.scala:38" just like before.
Completed Jobs (4):
count at <console>:21 (spamFeatures)
count at <console>:23 (hamFeatures)
count at <console>:28 (trainingData.count())
first at GeneralizedLinearAlgorithm at <console>:34 (val model = lrLearner.run(trainingData))
1) Why are the features being counted by lines, when in the code each email is being split by spaces (" ")?
2) Two things I see that differ between my code and Gabriel's code:
a) I don't have anything about a logger, but that shouldn't be an issue...
b) My files are located on HDFS (hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt); once again that shouldn't be an issue, but I'm not sure if there's something I'm missing...
3) How long should I let it run for? I've let it run for at least 10 minutes with local[2].
I'm guessing at this point it might be some sort of issue with my Spark/MLlib setup? Is there an even simpler program I can run to see if there is a setup issue with MLlib? I have been able to run other Spark streaming/SQL jobs before...
Thanks!
[reposted from spark community]
Hello Everyone,
I am trying to run this MLlib example from Learning Spark:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
Things I'm doing differently:
1) instead of their spam.txt and normal.txt I have text files with 200 words...nothing huge at all and just plain text, with periods, commas, etc.
3) I've used numFeatures = 200, 1000 and 10,000
Error: I keep getting stuck when I try to run the model (based on details from the web UI below):
val model = new LogisticRegressionWithSGD().run(trainingData)
It will freeze on something like this:
[Stage 1:==============> (1 + 0) / 4]
Some details from webui:
org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I am not sure what I am doing wrong...any help is much appreciated, thank you!
Thanks for this question; I wasn't aware of these examples, so I downloaded and tested them. What I see is that the git repository contains files with a lot of HTML code. It works, but you will end up adding 100 features, which is possibly why you're not getting consistent results, since your own files contain far fewer features. To test that this works without HTML code, I removed the HTML from spam.txt and ham.txt as follows:
ham.txt=
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you
the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in
advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I
haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I
want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
spam.txt=
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to
send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the
rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to
this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really
cheap and never have to ...
Then use the modified MLlib.scala below. Make sure you have log4j referenced in your project to redirect output to a file instead of the console. You basically need to run it twice: in the first run, watch the output that prints the number of features in spam and ham; you can then set the correct number of features (instead of 100). I used 5.
package com.oreilly.learningsparkexamples.scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger

object MLlib {

  private val logger = Logger.getLogger("MLlib")

  def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory", "1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)

    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")

    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)

    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println("features in ham " + ham.count())

    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)

    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ..."
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))

    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")

    sc.stop()
  }
}
When I run this, the output I get is:
features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the
other ... : 0.0
I had the same problem with Spark 1.5.2 on my local cluster.
My program stopped on "count at DataValidators.scala:40".
Resolved it by running Spark with "spark-submit --master local".
I had a similar problem with Spark 1.5.2 on my local cluster. My program stopped on "count at DataValidators.scala:40". I was caching my training features; removing the caching (just not calling the cache function) resolved it. I'm not sure of the actual cause, though.