Rewrite an Apache Spark Pipeline to use an existing model - scala

I have a Pipeline (see the pipelineBefore method) that:
Preprocesses the data
Trains a model
Gets a prediction
I have since delegated model training elsewhere, so now I only need to preprocess the data and get the prediction result (see pipelineAfter).
How can I refactor the code to use an existing model via the Pipeline API instead of invoking the transformers manually?
Clarification: I need to integrate a plain model, e.g. org.apache.spark.ml.classification.LogisticRegression, not a previously trained org.apache.spark.ml.PipelineModel.
private def pipelineBefore: org.apache.spark.sql.DataFrame = {
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")
println("Pipeline example. Training dataframe before preprocessing")
training.show()
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "spark hadoop spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
// Make predictions on test documents.
val predictionResult = model.transform(test)
println("Pipeline example. Prediction result")
predictionResult.show()
return predictionResult
}
private def pipelineAfter: org.apache.spark.sql.DataFrame = {
// Given a valid model trained on a preprocessed DataFrame
val trainedModel = getTrainedModel()
// Preprocess a test dataset
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "spark hadoop spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
//HOW TO ADOPT A PIPELINE API HERE ?
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val tokenizedTestData = tokenizer.transform(test)
val hashedTestData = hashingTF.transform(tokenizedTestData)
println("Preprocessed test data")
hashedTestData.show()
// Make predictions on the test dataset.
val predictionResult = trainedModel.transform(hashedTestData)
println("Prediction result")
predictionResult.show()
return predictionResult
}

You need to serialize your preprocessing pipeline if you want to use it later with another model. In your example:
private def pipelineBefore: org.apache.spark.sql.DataFrame = {
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")
println("Pipeline example. Training dataframe before preprocessing")
training.show()
// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
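// Keep only the preprocessing stages here; the model is trained elsewhere.
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF))
// Fit the pipeline to get a PipelineModel, which is what PipelineModel.load expects to read back.
val transformPipeline = pipeline.fit(training)
// Save your pipeline transformations
transformPipeline.write.overwrite().save("/tmp/path")
// ....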
}
Then you need to load it:
private def pipelineAfter: org.apache.spark.sql.DataFrame = {
// Given a valid model trained, for example a LR model
// You can use pipeline model to load your model too
val trainedModel : LogisticRegressionModel = ???
// val trainedModel = PipelineModel.load("path_to_your_model")
// Preprocess a test dataset
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "spark hadoop spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
//HOW TO ADOPT A PIPELINE API HERE ?
// Path where you stored the transform pipeline
val transformPipeline = PipelineModel.load("/tmp/path")
val hashedTestData = transformPipeline.transform(test)
// Make predictions on the test dataset.
val predictionResult = trainedModel.transform(hashedTestData)
println("Prediction result")
predictionResult.show()
return predictionResult
}
Check the Spark doc to see more details about this.
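An alternative that avoids the second manual transform step can be sketched as follows. It relies on the fact that a Pipeline whose stages are all Transformers trains nothing when fit is called: it only validates the schema and wraps the stages into a PipelineModel. The local SparkSession and the small in-line training set are assumptions for illustration, standing in for the externally trained model.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("existing-model-pipeline").getOrCreate()
import spark.implicits._

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Stand-in for the externally trained model (assumption: it was fit on the same "features" column).
val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")
val trainedModel: LogisticRegressionModel = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
  .fit(hashingTF.transform(tokenizer.transform(training)))

val test = Seq((4L, "spark i j k"), (5L, "l m n")).toDF("id", "text")

// All three stages are Transformers, so fit() does no training here;
// it checks the schema against `test` and returns a PipelineModel wrapping the stages.
val stages: Array[PipelineStage] = Array(tokenizer, hashingTF, trainedModel)
val pipelineModel: PipelineModel = new Pipeline().setStages(stages).fit(test)

val predictionResult = pipelineModel.transform(test)
predictionResult.select("id", "text", "prediction").show()
```

The resulting PipelineModel can be saved and loaded like any other, so preprocessing and prediction travel together as one artifact.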

How to use scala SparkML model to do multiple predictions in loop

I just started learning Spark 3.3 for Scala to do some regressions.
I was able to create, fit and test a model, but I got stuck trying to predict different subsets of a DataFrame by looping through one of its columns to filter the data.
This is my main function (I'm testing everything here).
The last line is what I'm trying to achieve, but I'm getting a "Task not serializable" error:
def test() = {
val data_key = "./data/EAD_HILIC_PFP_Com.csv"
val df = spark.read
.option("multiLine", true)
.option("header", "true")
.option("inferSchema", "true")
.csv(data_key)
val df2 = df.withColumn("PRECURSORMZ", $"PRECURSORMZ".cast("double").as("PRECURSORMZ"))
// to get list of samples from data
val labelDF = df2.select("mix_label").distinct
val (lrModel, test) = createModel(df2, labelDF)
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")
println(s"Root Mean Squared Error (RMSE) = ${lrModel.summary.rootMeanSquaredError}")
println(s"R^2 = ${lrModel.summary.r2}")
val predictions = lrModel.transform(test)
val rmse = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("rmse")
val r2 = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("r2")
println(s"Root Mean Squared Error (RMSE) on test data (${labelDF.head.get(0)}) = " + rmse.evaluate(predictions))
println(s"R^2 on test data (${labelDF.head.get(0)}) = " + r2.evaluate(predictions))
labelDF.foreach(label => process(df2, label, lrModel, rmse, r2))
}
The createModel function does what it says: it creates and fits a linear regression model:
def createModel(df2: DataFrame, labelDF: DataFrame): (LinearRegressionModel, DataFrame) = {
val first = labelDF.head.get(0)
//istds for single sample (mix_label)
val istdsDF = df2
.filter('mix_label === first)
.select($"PRECURSORMZ", $"Average_mz")
df2.show()
istdsDF.show()
val assembler = new VectorAssembler()
.setInputCols(istdsDF.drop("msms", "mix_label").columns)
.setOutputCol("features")
val (train, test) = train_test_split(istdsDF, assembler)
val lr = new LinearRegression()
.setLabelCol("PRECURSORMZ")
.setFeaturesCol("features")
val lrModel = lr.fit(train)
(lrModel, test)
}
def train_test_split(data: DataFrame, assembler: VectorAssembler): (DataFrame, DataFrame) = {
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 30)
(assembler.transform(train), assembler.transform(test))
}
Thanks for any help
EDIT 1: adding process function:
def process(otherDF: DataFrame, otherLabel: String, lrModel: LinearRegressionModel): Unit = {
val assembler = new VectorAssembler()
.setInputCols(otherDF.drop("msms", "mix_label").columns)
.setOutputCol("features")
val others = assembler.transform(otherDF)
others.show()
val rmse = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("rmse")
val r2 = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("r2")
val otherPreds = lrModel.transform(others)
println(s"Root Mean Squared Error (RMSE) on other data (with label '${otherLabel}') = " + rmse.evaluate(otherPreds))
println(s"R^2 on other data (with label '${otherLabel}') = " + r2.evaluate(otherPreds))
}
StackTrace here: https://pastebin.com/xhUpqvnx
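A likely cause, assuming the stack trace points at the labelDF.foreach call: DataFrame.foreach ships its closure to the executors, and that closure captures df2, the model and the evaluators, which are meant to be used on the driver. A minimal sketch of the usual workaround is to collect the distinct labels to the driver and loop over a local array. The toy rows, the choice of Average_mz as the only feature, and the evaluateLabel helper are hypothetical and not taken from the asker's dataset.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("per-label-eval").getOrCreate()
import spark.implicits._

// Toy stand-in for the CSV: two samples ("mix_label") with one feature column.
val df2 = Seq(
  ("A", 100.0, 100.1), ("A", 200.0, 200.2), ("A", 300.0, 299.9), ("A", 400.0, 400.3),
  ("B", 150.0, 150.2), ("B", 250.0, 249.8), ("B", 350.0, 350.1)
).toDF("mix_label", "PRECURSORMZ", "Average_mz")

val assembler = new VectorAssembler()
  .setInputCols(Array("Average_mz"))
  .setOutputCol("features")

// Fit on the first sample only, as createModel does in the question.
val lrModel: LinearRegressionModel = new LinearRegression()
  .setLabelCol("PRECURSORMZ")
  .setFeaturesCol("features")
  .fit(assembler.transform(df2.filter(col("mix_label") === "A")))

val rmse = new RegressionEvaluator()
  .setLabelCol("PRECURSORMZ")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

// Hypothetical helper: runs on the driver for one label at a time.
def evaluateLabel(data: DataFrame, label: String): Double = {
  val subset = assembler.transform(data.filter(col("mix_label") === label))
  rmse.evaluate(lrModel.transform(subset))
}

// collect() brings the (small) list of labels to the driver, so this is a plain Scala loop
// and no DataFrame, model or evaluator has to be serialized into a task.
val labels: Array[String] = df2.select("mix_label").distinct().as[String].collect()
val rmseByLabel: Map[String, Double] = labels.map(l => l -> evaluateLabel(df2, l)).toMap
rmseByLabel.foreach { case (l, e) => println(s"RMSE for label '$l' = $e") }
```

This only suits a label list small enough to hold on the driver, which is typically the case for distinct sample names.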

Mix Spark MLlib and Spark NLP in a pipeline

In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?
When I try to use both in a pipeline I get:
myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
Regards,
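You need to add a Finisher to your Spark NLP pipeline. It converts the Spark NLP annotations into plain arrays of strings, which is what CountVectorizer expects. Try this: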
val documentAssembler =
new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector =
new SentenceDetector().setInputCols("document").setOutputCol("sentences")
val tokenizer =
new Tokenizer().setInputCols("sentences").setOutputCol("token")
val stemmer = new Stemmer()
.setInputCols("token")
.setOutputCol("stem")
val finisher = new Finisher()
.setInputCols("stem")
.setOutputCols("token_features")
.setOutputAsArray(true)
.setCleanAnnotations(false)
val cv = new CountVectorizer()
.setInputCol("token_features")
.setOutputCol("features")
val pipeline = new Pipeline()
.setStages(
Array(
documentAssembler,
sentenceDetector,
tokenizer,
stemmer,
finisher,
cv
))
val data =
Seq("Peter Pipers employees are picking pecks of pickled peppers.")
.toDF("text")
val model = pipeline.fit(data)
val df = model.transform(data)
output:
+--------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------+
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+--------------------------------------------------------------------+

Use dataframes for Decision tree classifier in spark with string fields

I have managed to get my Decision Tree classifier work for the RDD-based API, but now I am trying to switch to the Dataframes-based API in Spark.
I have a dataset like this (but with many more fields) :
country, destination, duration, label
Belgium, France, 10, 0
Bosnia, USA, 120, 1
Germany, Spain, 30, 0
First I load my csv file in a dataframe :
val data = session.read
.format("org.apache.spark.csv")
.option("header", "true")
.csv("/home/Datasets/data/dataset.csv")
Then I transform string columns into numerical columns
val stringColumns = Array("country", "destination")
val index_transformers = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
Then I assemble all my features into one single vector, using VectorAssembler, like this :
val assembler = new VectorAssembler()
.setInputCols(Array("country_index", "destination_index", "duration_index"))
.setOutputCol("features")
I split my data into training and test :
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
Then I create my DecisionTree Classifier
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
Then I use a pipeline to make all the transformations
val pipeline = new Pipeline()
.setStages(Array(index_transformers, assembler, dt))
I train my model and use it for predictions :
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
But I get some errors I don't understand.
When I run my code like that, I get this error:
[error] found : Array[org.apache.spark.ml.feature.StringIndexer]
[error] required: org.apache.spark.ml.PipelineStage
[error] .setStages(Array(index_transformers, assembler,dt))
So what I did is that I added a pipeline right after the index_transformers val, and right before val assembler :
val index_pipeline = new Pipeline().setStages(index_transformers)
val index_model = index_pipeline.fit(data)
val df_indexed = index_model.transform(data)
Then I use my new df_indexed DataFrame for the training and test sets, and I remove index_transformers from the pipeline, leaving only assembler and dt:
val Array(trainingData, testData) = df_indexed.randomSplit(Array(0.7, 0.3))
val pipeline = new Pipeline()
.setStages(Array(assembler,dt))
And I get this error :
Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.
It basically says I am using VectorAssembler on a String column, whereas I told it to use df_indexed, which now has numerical *_index columns. It doesn't seem to use them in the VectorAssembler, and I just don't understand why.
Thank you
EDIT
Now I have almost managed to get it work :
val data = session.read
.format("org.apache.spark.csv")
.option("header", "true")
.csv("/home/hvfd8529/Datasets/dataOINIS/dataset.csv")
val stringColumns = Array("country_index", "destination_index", "duration_index")
val stringColumns_index = stringColumns.map(c => s"${c}_index")
val index_transformers = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
val assembler = new VectorAssembler()
.setInputCols(stringColumns_index)
.setOutputCol("features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setImpurity("entropy")
.setMaxBins(1000)
.setMaxDepth(15)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels())
val stages = index_transformers :+ assembler :+ labelIndexer :+ dt :+ labelConverter
val pipeline = new Pipeline()
.setStages(stages)
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "indexedFeatures").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("accuracy = " + accuracy)
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + treeModel.toDebugString)
Except that now I get this error:
value labels is not a member of org.apache.spark.ml.feature.StringIndexer
I don't understand it, as I am following the examples in the Spark docs.
Should be:
val pipeline = new Pipeline()
.setStages(index_transformers ++ Array(assembler, dt): Array[PipelineStage])
What I did for my first problem :
val stages = index_transformers :+ assembler :+ labelIndexer :+ rf :+ labelConverter
val pipeline = new Pipeline()
.setStages(stages)
For my second issue with the label, I needed to call .fit(data), which returns a StringIndexerModel (the class that actually has labels), like this:
val labelIndexer = new StringIndexer()
.setInputCol("label_fraude")
.setOutputCol("indexedLabel")
.fit(data)
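Both fixes can be put together in one small sketch. The toy rows mirror the sample rows in the question, with duration already numeric and a fourth made-up row added; the local SparkSession is an assumption for illustration. On Spark 3.x, labelsArray(0) should be the non-deprecated equivalent of labels.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Assumption: a local session and in-line data instead of the CSV file.
val spark = SparkSession.builder().master("local[*]").appName("dt-string-fields").getOrCreate()
import spark.implicits._

val data = Seq(
  ("Belgium", "France", 10.0, "0"),
  ("Bosnia", "USA", 120.0, "1"),
  ("Germany", "Spain", 30.0, "0"),
  ("Belgium", "USA", 90.0, "1")
).toDF("country", "destination", "duration", "label")

// One StringIndexer per string feature column.
val stringColumns = Array("country", "destination")
val indexers = stringColumns.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_index")
}

// Indexed string columns plus the numeric column go into the feature vector.
val assembler = new VectorAssembler()
  .setInputCols(stringColumns.map(c => s"${c}_index") :+ "duration")
  .setOutputCol("features")

// fit() turns the StringIndexer (an Estimator) into a StringIndexerModel, which exposes `labels`.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Typing the result as Array[PipelineStage] lets the Array[StringIndexer] be concatenated with the other stages.
val stages: Array[PipelineStage] =
  indexers ++ Array[PipelineStage](assembler, labelIndexer, dt, labelConverter)

val predictions = new Pipeline().setStages(stages).fit(data).transform(data)
predictions.select("label", "predictedLabel").show()
```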

Multiclass Classification Evaluator field does not exist error - Apache Spark

I am new to Spark and trying a basic classifier in Scala.
I'm trying to get the accuracy, but when using MulticlassClassificationEvaluator it gives the error below:
Caused by: java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:76)
at com.classifier.classifier_app.App$.<init>(App.scala:90)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
The code is as below:
val conf = new SparkConf().setMaster("local[*]").setAppName("Classifier")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("Email Classifier")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
val spamInput = "TRAIN_00000_0.eml" //files to train model
val normalInput = "TRAIN_00002_1.eml"
val spamData = spark.read.textFile(spamInput)
val normalData = spark.read.textFile(normalInput)
case class Feature(index: Int, value: String)
val indexer = new StringIndexer()
.setInputCol("value")
.setOutputCol("label")
val regexTokenizer = new RegexTokenizer()
.setInputCol("value")
.setOutputCol("cleared")
.setPattern("\\w+").setGaps(false)
val remover = new StopWordsRemover()
.setInputCol("cleared")
.setOutputCol("filtered")
val hashingTF = new HashingTF()
.setInputCol("filtered").setOutputCol("features")
.setNumFeatures(100)
val nb = new NaiveBayes()
val indexedSpam = spamData.map(x=>Feature(0, x))
val indexedNormal = normalData.map(x=>Feature(1, x))
val trainingData = indexedSpam.union(indexedNormal)
val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF, nb))
val model = pipeline.fit(trainingData)
model.write.overwrite().save("myNaiveBayesModel")
val spamTest = spark.read.textFile("TEST_00009_0.eml")
val normalTest = spark.read.textFile("TEST_00000_1.eml")
val sameModel = PipelineModel.load("myNaiveBayesModel")
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
Console.println("Spam Test")
val predictionSpam = sameModel.transform(spamTest).select("prediction")
predictionSpam.foreach(println(_))
val accuracy = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracy)
Console.println("Normal Test")
val predictionNorm = sameModel.transform(normalTest).select("prediction")
predictionNorm.foreach(println(_))
val accuracyNorm = evaluator.evaluate(predictionNorm)
println("Accuracy Normal: " + accuracyNorm)
The error occurs when initializing the MulticlassClassificationEvaluator. How should the column names be specified? Any help is appreciated.
The error is in this line:
val predictionSpam = sameModel.transform(spamTest).select("prediction")
After the select, your DataFrame contains only the prediction column and no label column, so the evaluator cannot find the field it was told to read.
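A minimal sketch of the fix is to keep the label column when selecting. The small DataFrame below is a hypothetical stand-in for the output of sameModel.transform(...), which carries both the label and the prediction; the local SparkSession is likewise an assumption.

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("evaluator-label-col").getOrCreate()
import spark.implicits._

// Hypothetical transform output: true label and predicted label side by side.
val transformed = Seq(
  ("mail a", 0.0, 0.0),
  ("mail b", 1.0, 1.0),
  ("mail c", 1.0, 0.0),
  ("mail d", 0.0, 0.0)
).toDF("value", "label", "prediction")

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

// Keep the label column: selecting only "prediction" is what triggers
// 'Field "label" does not exist'.
val accuracy = evaluator.evaluate(transformed.select("label", "prediction"))
println(s"Accuracy: $accuracy")
```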

RandomForestClassifier was given input with invalid label column error in Apache Spark

I am trying to compute accuracy using 5-fold cross-validation with a Random Forest classifier model in Scala, but I am getting the following error at runtime:
java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
The error is raised at this line: val cvModel = cv.fit(trainingData)
The code I used for cross-validation of the data set with the random forest is as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(41).toDouble,
Vectors.dense(parts(0).split(',').map(_.toDouble)))
}
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)
val trainingData = training.toDF()
val testData = test.toDF()
val nFolds: Int = 5
val NumTrees: Int = 5
val rf = new
RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(NumTrees)
val pipeline = new Pipeline()
.setStages(Array(rf))
val paramGrid = new ParamGridBuilder()
.build()
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("precision")
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val cvModel = cv.fit(trainingData)
val results = cvModel.transform(testData)
.select("label","prediction").collect
val numCorrectPredictions = results.map(row =>
if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size
println("Test set accuracy: %.3f".format(accuracy))
Can anyone please explain what the mistake is in the above code?
RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and label values to be integral values from [0, 1, 2, ..., #classes) represented as doubles. Typically this is handled by an upstream Transformer such as StringIndexer. Since you convert the labels manually, the metadata fields are not set and the classifier cannot confirm that these requirements are satisfied.
val df = Seq(
(0.0, Vectors.dense(1, 0, 0, 0)),
(1.0, Vectors.dense(0, 1, 0, 0)),
(2.0, Vectors.dense(0, 0, 1, 0)),
(2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")
val rf = new RandomForestClassifier()
.setFeaturesCol("features")
.setNumTrees(5)
rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...
You can either re-encode label column using StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("label_idx")
.fit(df)
rf.setLabelCol("label_idx").fit(indexer.transform(df))
or set required metadata manually:
import org.apache.spark.ml.attribute.NominalAttribute
val meta = NominalAttribute
.defaultAttr
.withName("label")
.withValues("0.0", "1.0", "2.0")
.toMetadata
rf.setLabelCol("label_meta").fit(
df.withColumn("label_meta", $"label".as("", meta))
)
Note:
Labels created using StringIndexer are ordered by frequency, not by value:
indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
PySpark:
In Python metadata fields can be set directly on the schema:
from pyspark.sql.types import StructField, DoubleType
StructField(
"label", DoubleType(), False,
{"ml_attr": {
"name": "label",
"type": "nominal",
"vals": ["0.0", "1.0", "2.0"]
}}
)