Failed to Save XGBoost Model in EMR 6.7 - scala

I am trying to train an Xgboost model in EMR 6.7 and save the model to S3 (Xgboost version is 1.3.1 and Spark version is 3.2.1). But when saving the model this error occurs: java.lang.NoSuchMethodError: org.json4s.JsonDSL$.pair2Assoc(Lscala/Tuple2;Lscala/Function1;)Lorg/json4s/JsonDSL$JsonAssoc;
I've tried the higher versions of xgboost but there was error while training.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StringIndexer
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val data = Seq((9,1),(8,1),(8,1),(7,1),(8,0),(2,0),(1,0),(3,0))
val rdd = spark.sparkContext.parallelize(data)
val df_train = rdd.toDF("input","label")
val stringIndexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex").fit(df_train)
val labelTransformed = stringIndexer.transform(df_train).drop("label")
val vectorAssembler = new VectorAssembler().setInputCols(Array("input")).setOutputCol("features_vector")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features_vector", "labelIndex")
val xgbParam = Map("eta" -> 0.1f,
"max_depth" -> 2,
"objective" -> "multi:softprob",
"num_class" -> 2,
"num_round" -> 100,
"num_workers" -> 2)
val xgbClassifier = new XGBoostClassifier(xgbParam).
setFeaturesCol("features_vector").
setLabelCol("labelIndex")
val xgbClassificationModel = xgbClassifier.fit(xgbInput)
xgbClassificationModel.write.overwrite().save("path")
Can you help me about that please?

Related

Spark ML throws exception for Decision Tree classification: Column features must be of type numeric but was actually of type struct

I am trying to create a Spark ML model with the Decision Tree Classifier to perform classification , but I am getting an error saying the features in my training set should be of type numeric instead of type struct.
Here is the minimal reproducible example that I tried:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml._
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
val df8 = Seq(
("2022-08-22 10:00:00",417.7,419.97,419.97,417.31,"nothing"),
("2022-08-22 11:30:00",417.35,417.33,417.46,416.77,"buy"),
("2022-08-22 13:00:00",417.55,417.68,418.04,417.48,"sell"),
("2022-08-22 14:00:00",417.22,417.8,421.13,416.83,"sell")
)
val df77 = spark.createDataset(df8).toDF("30mins_date","30mins_close","30mins_open","30mins_high","30mins_low", "signal")
val assembler_features = new VectorAssembler()
.setInputCols(Array("30mins_close","30mins_open","30mins_high","30mins_low"))
.setOutputCol("features")
val output2 = assembler_features.transform(df77)
val indexer = new StringIndexer()
.setInputCol("signal")
.setOutputCol("signalIndex")
val indexed = indexer.fit(output2).transform(output2)
val assembler_label = new VectorAssembler()
.setInputCols(Array("signalIndex"))
.setOutputCol("signalIndexV")
val output = assembler_label.transform(indexed)
val dt = new DecisionTreeClassifier()
.setLabelCol("features")
.setFeaturesCol("signalIndexV")
val Array(trainingData, testData) = output.select("features", "signalIndexV").randomSplit(Array(0.7, 0.3))
val model = dt.fit(trainingData)
Output error:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type numeric but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:78)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:54)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:73)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema(Classifier.scala:43)
at org.apache.spark.ml.classification.ClassifierParams.validateAndTransformSchema$(Classifier.scala:39)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:51)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:34)
at org.apache.spark.ml.classification.DecisionTreeClassifier.org$apache$spark$ml$tree$DecisionTreeClassifierParams$$super$validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema(treeParams.scala:245)
at org.apache.spark.ml.tree.DecisionTreeClassifierParams.validateAndTransformSchema$(treeParams.scala:241)
at org.apache.spark.ml.classification.DecisionTreeClassifier.validateAndTransformSchema(DecisionTreeClassifier.scala:46)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:177)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:133)
... 61 elided
I tried above code in spark-shell environment:
spark v 3.3.1
scala v 2.12.15
Here is what trainingData looks like
+-----------------------------+------------+
|features |signalIndexV|
+-----------------------------+------------+
|[417.7,419.97,419.97,417.31] |[2.0] |
|[417.35,417.33,417.46,416.77]|[1.0] |
|[417.55,417.68,418.04,417.48]|[0.0] |
|[417.22,417.8,421.13,416.83] |[0.0] |
+-----------------------------+------------+
So what did I do wrong ? How can I convert column features into numeric type ?

EMR delta lake is giving "org.apache.spark.sql.AnalysisException: unresolved operator 'TimeTravel RelationV2"

Im trying to use delta lake on top of EMR and I see an error whenever I try to run "restoreToVersion"
Using:
delta-storage-1.2.1.jar
delta-core_2.12-1.2.1.jar
emr-6.6.0
Hadoop distribution:Amazon 3.2.1
import io.delta.tables._
import io.delta.implicits._
val columns = Seq("language","users_count")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val rdd = spark.sparkContext.parallelize(data)
val dfFromRDD1 = rdd.toDF()
val dfFromRDD1 = rdd.toDF("language","users_count")
dfFromRDD1.write.mode("append").delta("/tmp/dummy7")
val deltaTable = DeltaTable.forPath(spark, "/tmp/dummy8/")
val fullHistoryDF = deltaTable.history()
fullHistoryDF.show()
dfFromRDD1.write.mode("append").delta("/tmp/dummy8/")
val fullHistoryDF = deltaTable.history()
fullHistoryDF.show()
deltaTable.restoreToVersion(0)

type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.rdd.RDD

I am new to scala and mllib and I have been getting the following error. Please let me know if anyone has been able to resolve something similar.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
.
.
.
val conf = new SparkConf().setMaster("local").setAppName("SampleApp")
val sContext = new SparkContext(conf)
val sc = SparkSession.builder().master("local").appName("SampleApp").getOrCreate()
val sampleData = sc.read.json("input/sampleData.json")
val clusters = KMeans.train(sampleData, 10, 10)
val WSSSE = clusters.computeCost(sampleData)
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sContext, "target/org/apache/spark/KMeansExample/KMeansModel")
this above line gives an error as:
type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried:
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(20)
val model = kmeans.fit(sampleData)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
This gives the error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
Available fields: address, attributes, business_id
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
at org.apache.spark.ml.util.SchemaUtils$.validateVectorCompatibleColumn(SchemaUtils.scala:119)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:96)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:285)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:382)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:341)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
I have been referring to https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html
Edit
Using setFeaturesCol()
import org.apache.spark.ml.clustering.KMeans
val assembler = new VectorAssembler()
.setInputCols(Array("is_open", "review_count", "stars"))
.setOutputCol("features")
val output = assembler.transform(sampleData).select("features")
val kmeans = new KMeans().setK(20).setFeaturesCol("features")
val model = kmeans.fit(output)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
This gives a different error still:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.getSimpleName(Ljava/lang/Class;)Ljava/lang/String;
at org.apache.spark.ml.util.Instrumentation.logPipelineStage(Instrumentation.scala:52)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:350)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
Thanks.
Use the scala pipeline
val assembler = new VectorAssembler()
.setInputCols(Array("feature1",feature2","feature3"))
.setOutputCol("assembled_features")
val scaler = new StandardScaler()
.setInputCol("assembled_features")
.setOutputCol("features")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().setK(2).setSeed(1L)
// create the pipeline
val pipeline = new Pipeline()
.setStages(Array(assembler, scaler, kmeans))
// Fit the model
val clussterModel = pipeline.fit(train)

Use saved Spark mllib decision tree binary classification model to predict on new data

I am using Spark version 2.2.0 and scala version 2.11.8.
I created and saved a decision tree binary classification model using following code:
package...
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession
object DecisionTreeClassification {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local[*]")
.appName("Decision Tree")
.getOrCreate()
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/xyz.txt")
// Split the data into training and test sets (20% held out for testing)
val splits = data.randomSplit(Array(0.8, 0.2))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
impurity, maxDepth, maxBins)
// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification tree model:\n ${model.toDebugString}")
// Save and load model
model.save(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
val sameModel = DecisionTreeModel.load(sparkSession.sparkContext, "target/tmp/myDecisionTreeClassificationModel")
// $example off$
sparkSession.sparkContext.stop()
}
}
Now, I want to predict a label (0 or 1) for a new data using this saved model. I am new to Spark, can anybody please let me know how to do that?
I found answer to this question so I thought I should share it if someone is looking for the answer to similar question
To make prediction for new data simply add few lines before stopping the spark session:
val newData = MLUtils.loadLibSVMFile(sparkSession.sparkContext, "path/to/file/abc.txt")
val newDataPredictions = newData.map
{ point =>
val newPrediction = model.predict(point.features)
(point.label, newPrediction)
}
newDataPredictions.foreach(f => println("Predicted label", f._2))

RandomForestClassifier was given input with invalid label column error in Apache Spark

I am trying to find Accuracy using 5-fold cross validation using Random Forest Classifier Model in SCALA. But i am getting the following error while running:
java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
Getting the above error at line---> val cvModel = cv.fit(trainingData)
The code which i used for cross validation of data set using random forest is as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(41).toDouble,
Vectors.dense(parts(0).split(',').map(_.toDouble)))
}
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val trainingData = training.toDF()
val testData = test.toDF()
val nFolds: Int = 5
val NumTrees: Int = 5
val rf = new
RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(NumTrees)
val pipeline = new Pipeline()
.setStages(Array(rf))
val paramGrid = new ParamGridBuilder()
.build()
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("precision")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val cvModel = cv.fit(trainingData)
val results = cvModel.transform(testData)
.select("label","prediction").collect
val numCorrectPredictions = results.map(row =>
if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size
println("Test set accuracy: %.3f".format(accuracy))
Can any one please explain what is the mistake in the above code.
RandomForestClassifier, same as many other ML algorithms, require specific metadata to be set on the label column and labels values to be integral values from [0, 1, 2 ..., #classes) represented as doubles. Typically this is handled by an upstream Transformers like StringIndexer. Since you convert labels manually metadata fields are not set and classifier cannot confirm that these requirements are satisfied.
val df = Seq(
(0.0, Vectors.dense(1, 0, 0, 0)),
(1.0, Vectors.dense(0, 1, 0, 0)),
(2.0, Vectors.dense(0, 0, 1, 0)),
(2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")
val rf = new RandomForestClassifier()
.setFeaturesCol("features")
.setNumTrees(5)
rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...
You can either re-encode label column using StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("label_idx")
.fit(df)
rf.setLabelCol("label_idx").fit(indexer.transform(df))
or set required metadata manually:
val meta = NominalAttribute
.defaultAttr
.withName("label")
.withValues("0.0", "1.0", "2.0")
.toMetadata
rf.setLabelCol("label_meta").fit(
df.withColumn("label_meta", $"label".as("", meta))
)
Note:
Labels created using StringIndexer depend on the frequency not value:
indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
PySpark:
In Python metadata fields can be set directly on the schema:
from pyspark.sql.types import StructField, DoubleType
StructField(
"label", DoubleType(), False,
{"ml_attr": {
"name": "label",
"type": "nominal",
"vals": ["0.0", "1.0", "2.0"]
}}
)