Inconsistent behavior of pyspark code depending on order of line execution - pyspark

I am running the code in Jupyter on EMR with PySpark 3.3.0.
I have two dataframes that I have preprocessed with the pyspark.ml.feature functions (OneHotEncoder, StringIndexer, VectorAssembler). The first dataframe, let's call it df_good, has 5 features; the second dataframe, let's call it df_bad, omits 2 of the features from df_good. The underlying dataset used to generate the two dataframes is the same, and the code that generates them is identical (other than the two features not being included in the VectorAssembler inputCols for df_bad).
Below is the code I am using to train the model:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.classification import LogisticRegression
def split_array(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)
def train_model(df):
    train_df = df.selectExpr("label as label", "features as features")
    logit = LogisticRegression()
    logit = logit.setFamily("multinomial")
    logit_mod = logit.fit(train_df)
    df = logit_mod.transform(df)
    df = df.withColumn("pred", split_array(F.col("probability"))[0])
    return df
Here is where things get weird.
If I run the code below it works and runs in 10-20 seconds each:
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
If I change the order, the code completely hangs on df_bad:
df_bad = spark.read.parquet("<s3_location_bad>")
df_bad = train_model(df_bad)
df_bad.select(F.sum("pred")).show()
df_good = spark.read.parquet("<s3_location_good>")
df_good = train_model(df_good)
df_good.select(F.sum("pred")).show()
The data is unchanged, the code is the same, the behavior is always the same.
Any thoughts are appreciated.

Related

Getting model parameters from regression in Pyspark in efficient way for large data

I have created a function that applies OLS regression and returns only the model parameters. I used groupby and applyInPandas, but it is taking too much time. Is there a more efficient way to do this?
Note: I did not actually need groupby, since all features have many levels, but because I cannot use applyInPandas without it, I created a dummy column 'group' with the constant value 1.
Code
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
pdf = pd.DataFrame({
    'x': [3, 6, 2, 0, 1, 5, 2, 3, 4, 5],
    'y': [0, 1, 2, 0, 1, 5, 2, 3, 4, 5],
    'z': [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]})
df = sqlContext.createDataFrame(pdf)
result_schema = StructType([
    StructField('index', StringType()),
    StructField('coef', DoubleType())
])
def ols(pdf):
    y_column = ['z']
    x_column = ['x', 'y']
    y = pdf[y_column]
    X = pdf[x_column]
    model = sm.OLS(y, X).fit()
    param_table = pd.DataFrame(model.params, columns=['coef']).reset_index()
    return param_table
# Add a constant column so that groupby can be applied
df = df.withColumn('group', lit(1))
# Apply the function to the single group
data = df.groupby('group').applyInPandas(ols, schema=result_schema)
Final output sample
index coef
x 0.183246073
y 0.770680628
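One possible way around the single-group applyInPandas call, sketched here under the assumption that the number of features is small: OLS coefficients depend on the data only through sums of products (the entries of X'X and X'z), and sums are cheap to compute as distributed aggregates. The sketch below uses plain Python lists in place of the Spark DataFrame, and the function name ols_two_features is made up for illustration. It solves the 2x2 normal equations for the sample data from the question.

```python
# Sample data copied from the question.
x = [3, 6, 2, 0, 1, 5, 2, 3, 4, 5]
y = [0, 1, 2, 0, 1, 5, 2, 3, 4, 5]
z = [2, 1, 0, 0, 0.5, 2.5, 3, 4, 5, 6]

def ols_two_features(x, y, z):
    """Solve (X'X) b = X'z for the model z ~ b_x * x + b_y * y (no intercept)."""
    # Sufficient statistics: plain sums of products, one pass over the rows.
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxz = sum(a * c for a, c in zip(x, z))
    syz = sum(b * c for b, c in zip(y, z))
    # Cramer's rule for the 2x2 system [[sxx, sxy], [sxy, syy]] b = [sxz, syz].
    det = sxx * syy - sxy * sxy
    return {
        "x": (syy * sxz - sxy * syz) / det,
        "y": (sxx * syz - sxy * sxz) / det,
    }

coefs = ols_two_features(x, y, z)
print(coefs)
```

On a Spark DataFrame the five sums could be computed in one aggregation and only the tiny linear system solved on the driver; with more features, the same idea needs a general linear solver instead of Cramer's rule.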

How to get support numbers in pyspark.ml like we get in sklearn classification report?

I am using PySpark and am able to get metrics like accuracy, F1, precision and recall from MulticlassClassificationEvaluator, but I am not sure how to get the support numbers that sklearn's classification report provides. In my case rfc_pred holds the predictions for one group of classes at a time, and I run it in a loop. So, will rfc_pred.count() do the trick?
Below is my current code:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(rfc_pred, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(rfc_pred, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(rfc_pred, {evaluator.metricName: "weightedRecall"})
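For reference, the support in sklearn's classification report is the number of rows whose true label equals each class, independent of the prediction. The sketch below shows that counting rule on a small made-up list of (label, prediction) pairs standing in for the rows of rfc_pred; the data and the function name support_per_class are hypothetical.

```python
from collections import Counter

# Hypothetical (label, prediction) pairs standing in for the rows of rfc_pred.
rows = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 2), (2, 0), (2, 2)]

def support_per_class(rows):
    """Count how many rows carry each true label, ignoring the prediction."""
    return dict(Counter(label for label, _ in rows))

support = support_per_class(rows)
print(support)
```

On a Spark DataFrame the same numbers would come from grouping by the label column and counting, for example rfc_pred.groupBy("label").count(). A bare rfc_pred.count() gives the total number of rows, which equals the support only if rfc_pred contains a single class.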

Percentile rank in pyspark using QuantileDiscretizer

I am wondering if it's possible to obtain the result of percent_rank using the QuantileDiscretizer transformer in PySpark.
I am trying to avoid computing percent_rank over the entire column, because it generates the following warning:
WARN WindowExec: No Partition Defined for Window operation!
Moving all data to a single partition, this can cause serious performance degradation.
My approach is to first apply QuantileDiscretizer and then normalize the bucket index to [0,1]:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.ml.feature import QuantileDiscretizer
from scipy.stats import gamma
X1 = gamma.rvs(0.2, size=1000)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
# Exact percent_rank over the whole column, used as the reference
df = df.withColumn("perc_rank", F.percent_rank().over(Window.orderBy("x")))
df = QuantileDiscretizer(numBuckets=df.count() + 1,
                         inputCol="x",
                         outputCol="q_discretizer").fit(df).transform(df)
agg_values = df.agg(F.max("q_discretizer").alias("maxval"),
                    F.min("q_discretizer").alias("minval")).collect()[0]
xmax, xmin = agg_values["maxval"], agg_values["minval"]
# Min-max normalize the bucket index with a column expression instead of a UDF
df = df.withColumn("perc_discretizer", (F.col("q_discretizer") - xmin) / (xmax - xmin))
df = df.withColumn("error", F.round(F.abs(F.col("perc_discretizer") - F.col("perc_rank")), 6))
df.select(F.max("error")).show()
df.show(5)
However, the error seems to grow as the number of data points increases, so I am not sure this is the right way to do it.
Is it possible to use QuantileDiscretizer to obtain the percent_rank?
Alternatively, is there an efficient way to compute percent_rank over an entire column?
You can use the following to avoid the warning message:
X1 = gamma.rvs(0.2, size=10)
df = spark.createDataFrame(pd.DataFrame(X1, columns=["x"]))
df = df.withColumn("dummyCol", F.lit("some_val"))
win = Window.partitionBy("dummyCol").orderBy("x")
df = df.withColumn("perc_rank", F.percent_rank().over(win)).drop("dummyCol")
Nonetheless, the data would still be moved to a single worker. I don't think there is a better alternative that avoids the shuffle here, since the complete column needs to be rank-ordered.
If you have multiple windows over the same column, you can try pre-partitioning the data and then applying the ranking functions.
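To make explicit why the whole column has to be ordered, here is a minimal pure-Python sketch of the percent_rank definition, (rank - 1) / (n - 1), where tied values share the rank of their first occurrence in sorted order. It illustrates the formula only and says nothing about how Spark executes the window.

```python
def percent_rank(values):
    """Return (rank - 1) / (n - 1) for each value; ties share the lowest rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    ordered = sorted(values)
    first_pos = {}
    for i, v in enumerate(ordered):
        # i is rank - 1 for the first occurrence of v in sorted order.
        first_pos.setdefault(v, i)
    return [first_pos[v] / (n - 1) for v in values]

ranks = percent_rank([10.0, 20.0, 20.0, 40.0, 50.0])
print(ranks)
```

Every output value depends on how many rows in the entire column are smaller, which is why an exact result cannot be computed independently on each partition.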

Split one row into multiple rows of dataframe

I want to convert one row of a dataframe into multiple rows. If the hour is the same, the row should not be split; if the hours differ, the row should be split into multiple rows according to the difference between the hours. A solution using either DataFrame functions or a Hive query is fine.
Input Table or Dataframe
Expected Output Table or Dataframe
Please help me find a way to get the expected output.
The easiest solution for such a simple schema is to use Dataset.flatMap after defining case classes for the input and output schema.
A simple UDF solution would return a sequence, which you can then expand with functions.explode. That is far less clean and efficient than using flatMap.
Last but not least, you could create your own table-generating UDF, but that would be extreme overkill for this problem.
You can implement your own logic inside the map operation and use flatMap to achieve this.
The following is a crude implementation of the solution; you can improve it as needed.
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.time.{Duration, LocalDateTime}
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer
import sparkSession.sqlContext.implicits._
val df = Seq(("john", "2/9/2018", "2/9/2018 5:02", "2/9/2018 5:12"),
  ("smit", "3/9/2018", "3/9/2018 6:12", "3/9/2018 8:52"),
  ("rick", "4/9/2018", "4/9/2018 23:02", "5/9/2018 2:12")
).toDF("UserName", "Date", "start_time", "end_time")
val rdd = df.rdd.map(row => {
  val result = new ArrayBuffer[Row]()
  // formatter1 parses the input; formatter2 writes zero-padded minutes
  val formatter1 = DateTimeFormatter.ofPattern("d/M/yyyy H:m")
  val formatter2 = DateTimeFormatter.ofPattern("d/M/yyyy H:mm")
  val d1 = LocalDateTime.parse(row.getAs[String]("start_time"), formatter1)
  val d2 = LocalDateTime.parse(row.getAs[String]("end_time"), formatter1)
  // Compare hour boundaries (date included) rather than the raw hour fields
  val h1 = d1.truncatedTo(ChronoUnit.HOURS)
  val h2 = d2.truncatedTo(ChronoUnit.HOURS)
  if (h1 == h2) result += row
  else {
    val hoursDiff = Duration.between(h1, h2).toHours.toInt
    // First piece: from the start time to the top of the next hour
    result += Row.fromSeq(Seq(
      row.getAs[String]("UserName"),
      row.getAs[String]("Date"),
      row.getAs[String]("start_time"),
      d1.plus(1, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
    // Middle pieces: one full hour each
    for (index <- 1 until hoursDiff) {
      result += Row.fromSeq(Seq(
        row.getAs[String]("UserName"),
        row.getAs[String]("Date"),
        d1.plus(index, ChronoUnit.HOURS).withMinute(0).format(formatter2),
        d1.plus(1 + index, ChronoUnit.HOURS).withMinute(0).format(formatter2)))
    }
    // Last piece: from the top of the final hour to the end time
    result += Row.fromSeq(Seq(
      row.getAs[String]("UserName"),
      row.getAs[String]("Date"),
      d2.withMinute(0).format(formatter2),
      row.getAs[String]("end_time")))
  }
  result
}).flatMap(_.toIterator)
rdd.collect.foreach(println)
and finally, your result is as follows:
[john,2/9/2018,2/9/2018 5:02,2/9/2018 5:12]
[smit,3/9/2018,3/9/2018 6:12,3/9/2018 7:00]
[smit,3/9/2018,3/9/2018 7:00,3/9/2018 8:00]
[smit,3/9/2018,3/9/2018 8:00,3/9/2018 8:52]
[rick,4/9/2018,4/9/2018 23:02,5/9/2018 0:00]
[rick,4/9/2018,5/9/2018 0:00,5/9/2018 1:00]
[rick,4/9/2018,5/9/2018 1:00,5/9/2018 2:00]
[rick,4/9/2018,5/9/2018 2:00,5/9/2018 2:12]
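The splitting rule itself can be isolated from Spark. Below is a minimal sketch in plain Python (not the answer's Scala) that cuts a start/end interval at each hour boundary; the function name split_by_hour is made up for illustration, and the timestamps are the "smit" row from the sample data, read as day/month/year.

```python
from datetime import datetime, timedelta

def split_by_hour(start, end):
    """Split [start, end] into consecutive pieces that each stay within one clock hour."""
    pieces = []
    cursor = start
    while True:
        # Top of the hour that follows the current cursor position.
        next_hour = cursor.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        if next_hour >= end:
            pieces.append((cursor, end))
            return pieces
        pieces.append((cursor, next_hour))
        cursor = next_hour

pieces = split_by_hour(datetime(2018, 9, 3, 6, 12), datetime(2018, 9, 3, 8, 52))
for piece_start, piece_end in pieces:
    print(piece_start, piece_end)
```

A function like this could serve as the body of a flatMap or of a UDF whose result is exploded, as the answers above suggest.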

Deep decision tree in PySpark

I am using PySpark for machine learning and I want to train a decision tree classifier, a random forest, and gradient-boosted trees. I want to try out different maximum depth values and select the best one via grid search and cross-validation. However, Spark tells me that DecisionTree currently only supports maxDepth <= 30. What is the reason for limiting it to 30? Is there a way to increase it? I am using it with text data and my feature vectors are TF-IDF vectors, so I want to try higher values for the maximum depth. Below is sample code from the Spark website with some modifications:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label",
                             outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel",
                            featuresCol="indexedFeatures", numTrees=500)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction",
                               outputCol="predictedLabel",
                               labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])
paramGrid_rf = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [50, 100, 150, 250, 300]) \
    .build()
crossval_rf = CrossValidator(estimator=pipeline,
                             estimatorParamMaps=paramGrid_rf,
                             evaluator=BinaryClassificationEvaluator(),
                             numFolds=5)
cvModel_rf = crossval_rf.fit(trainingData)
The code above gives me the error message below.
Py4JJavaError: An error occurred while calling o12383.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50.
From https://forums.databricks.com/questions/12300/for-decision-trees-is-the-current-maxdepth-limited.html
...the current implementation imposes a restriction of maxDepth <= 30:
https://github.com/apache/spark/blob/ca6955858cec868c878a2fd8528dbed0ef9edd3f/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L137
You could ask the Spark developers to raise that limit.
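One plausible explanation for the number 30, stated here as an assumption rather than something taken from the linked sources: if tree nodes are numbered in heap order (root = 1, children of node i are 2i and 2i + 1) and the ids are stored as signed 32-bit integers, then depth 30 is the deepest level whose ids still fit. The arithmetic can be checked in a few lines; max_node_index is a made-up helper name.

```python
INT_MAX = 2**31 - 1  # largest signed 32-bit integer (a Java/Scala Int)

def max_node_index(depth):
    """Largest 1-based heap-order index among nodes at the given depth (root is depth 0)."""
    return 2 ** (depth + 1) - 1

fits_30 = max_node_index(30) <= INT_MAX
fits_31 = max_node_index(31) <= INT_MAX
print(max_node_index(30), fits_30, fits_31)
```

Under that assumption, raising the limit would mean changing how node ids are stored, not just relaxing the check. A tree that deep would also have up to about a billion leaves, so in practice values in the grid should stay at or below 30.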