MulticlassClassificationEvaluator "Field does not exist" error - Apache Spark - Scala

I am new to Spark and trying a basic classifier in Scala.
I'm trying to get the accuracy, but when I use MulticlassClassificationEvaluator it gives the error below:
Caused by: java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:76)
at com.classifier.classifier_app.App$.<init>(App.scala:90)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
The code is as below:
val conf = new SparkConf().setMaster("local[*]").setAppName("Classifier")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("Email Classifier")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
val spamInput = "TRAIN_00000_0.eml" //files to train model
val normalInput = "TRAIN_00002_1.eml"
val spamData = spark.read.textFile(spamInput)
val normalData = spark.read.textFile(normalInput)
case class Feature(index: Int, value: String)
val indexer = new StringIndexer()
.setInputCol("value")
.setOutputCol("label")
val regexTokenizer = new RegexTokenizer()
.setInputCol("value")
.setOutputCol("cleared")
.setPattern("\\w+").setGaps(false)
val remover = new StopWordsRemover()
.setInputCol("cleared")
.setOutputCol("filtered")
val hashingTF = new HashingTF()
.setInputCol("filtered").setOutputCol("features")
.setNumFeatures(100)
val nb = new NaiveBayes()
val indexedSpam = spamData.map(x=>Feature(0, x))
val indexedNormal = normalData.map(x=>Feature(1, x))
val trainingData = indexedSpam.union(indexedNormal)
val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF, nb))
val model = pipeline.fit(trainingData)
model.write.overwrite().save("myNaiveBayesModel")
val spamTest = spark.read.textFile("TEST_00009_0.eml")
val normalTest = spark.read.textFile("TEST_00000_1.eml")
val sameModel = PipelineModel.load("myNaiveBayesModel")
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
Console.println("Spam Test")
val predictionSpam = sameModel.transform(spamTest).select("prediction")
predictionSpam.foreach(println(_))
val accuracy = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracy)
Console.println("Normal Test")
val predictionNorm = sameModel.transform(normalTest).select("prediction")
predictionNorm.foreach(println(_))
val accuracyNorm = evaluator.evaluate(predictionNorm)
println("Accuracy Normal: " + accuracyNorm)
The error occurs when evaluate is called on the MulticlassClassificationEvaluator (see the stack trace above). How should the column names be specified? Any help is appreciated.

The error comes from this line:
val predictionSpam = sameModel.transform(spamTest).select("prediction")
After the select, your DataFrame contains only the prediction column and no label column, so MulticlassClassificationEvaluator cannot find the "label" field it was configured with when evaluate is called.
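A minimal sketch of the fix this implies (assuming the label column the pipeline produces on the test data is the one you want to score against): keep the label column next to the prediction instead of selecting prediction alone.
// sketch: keep both columns the evaluator needs
val predictionSpam = sameModel.transform(spamTest)   // output still has "label" and "prediction"
predictionSpam.select("label", "prediction").show()
val accuracy = evaluator.evaluate(predictionSpam)    // the evaluator looks up both columns by name
println("Accuracy Spam: " + accuracy)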

Related

How to use a Scala SparkML model to do multiple predictions in a loop

I just started learning Spark 3.3 with Scala to do some regressions.
I was able to create, fit and test a model, but I got stuck trying to predict different subsets of a DataFrame by looping through one of its columns to filter the data.
This is my main function (I'm testing everything here). The last line is what I'm trying to achieve, but it raises a "Task not serializable" error:
def test() = {
val data_key = "./data/EAD_HILIC_PFP_Com.csv"
val df = spark.read
.option("multiLine", true)
.option("header", "true")
.option("inferSchema", "true")
.csv(data_key)
val df2 = df.withColumn("PRECURSORMZ", $"PRECURSORMZ".cast("double").as("PRECURSORMZ"))
// to get list of samples from data
val labelDF = df2.select("mix_label").distinct
val (lrModel, test) = createModel(df2, labelDF)
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")
println(s"Root Mean Squared Error (RMSE) = ${lrModel.summary.rootMeanSquaredError}")
println(s"R^2 = ${lrModel.summary.r2}")
val predictions = lrModel.transform(test)
val rmse = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("rmse")
val r2 = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("r2")
println(s"Root Mean Squared Error (RMSE) on test data (${labelDF.head.get(0)}) = " + rmse.evaluate(predictions))
println(s"R^2 on test data (${labelDF.head.get(0)}) = " + r2.evaluate(predictions))
labelDF.foreach(label => process(df2, label, lrModel, rmse, r2))
}
The createModel function does what its name says, creating and fitting a linear regression model:
def createModel(df2: DataFrame, labelDF: DataFrame): (LinearRegressionModel, DataFrame) = {
val first = labelDF.head.get(0)
//istds for single sample (mix_label)
val istdsDF = df2
.filter('mix_label === first)
.select($"PRECURSORMZ", $"Average_mz")
df2.show()
istdsDF.show()
val assembler = new VectorAssembler()
.setInputCols(istdsDF.drop("msms", "mix_label").columns)
.setOutputCol("features")
val (train, test) = train_test_split(istdsDF, assembler)
val lr = new LinearRegression()
.setLabelCol("PRECURSORMZ")
.setFeaturesCol("features")
val lrModel = lr.fit(train)
(lrModel, test)
}
def train_test_split(data: DataFrame, assembler: VectorAssembler): (DataFrame, DataFrame) = {
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 30)
(assembler.transform(train), assembler.transform(test))
}
Thanks for any help
EDIT 1: adding process function:
def process(otherDF: DataFrame, otherLabel: String, lrModel: LinearRegressionModel): Unit = {
val assembler = new VectorAssembler()
.setInputCols(otherDF.drop("msms", "mix_label").columns)
.setOutputCol("features")
val others = assembler.transform(otherDF)
others.show()
val rmse = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("rmse")
val r2 = new RegressionEvaluator()
.setLabelCol("PRECURSORMZ")
.setPredictionCol("prediction")
.setMetricName("r2")
val otherPreds = lrModel.transform(others)
println(s"Root Mean Squared Error (RMSE) on other data (with label '${otherLabel}') = " + rmse.evaluate(otherPreds))
println(s"R^2 on other data (with label '${otherLabel}') = " + r2.evaluate(otherPreds))
}
StackTrace here: https://pastebin.com/xhUpqvnx
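For reference, the usual way around this kind of "Task not serializable" failure is to keep the loop on the driver: Dataset.foreach ships its closure to the executors, and the DataFrame, model and evaluators captured by process cannot be used there. A sketch, assuming mix_label is a string column and using the three-argument process from EDIT 1:
// sketch: collect the distinct labels to the driver, then loop there
val labels: Array[String] = labelDF.collect().map(_.getString(0))   // assumes mix_label is a string column
labels.foreach { label =>
  process(df2, label, lrModel)   // plain driver-side loop; the Spark actions run inside process
}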

Getting a "Task not serializable" error when trying to create a Decision Tree using Spark with Scala, executing on my local machine

I am trying to create a fraud transaction detector using Spark with Scala. My code works fine with normal Spark logic. However, when I try the decision tree approach I get a "Task not serializable" error. From my understanding, I get this error when I fit my training data with the pipeline. I tried a few solutions, like extending Serializable and converting my indexed data back to strings, but none of them worked.
Can someone please help me understand what I am doing wrong?
Below is my code
package com.vinspark.frauddetection
object FraudDet_ML extends Serializable{
def main(args:Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val spark=SparkSession
.builder()
.appName("FraudDetection")
.master("local[*]")
.config("spark.sql.warehouse.dir","file:///C:/temp")
.getOrCreate()
import spark.sqlContext.implicits._
var df=spark.read.format("csv").option("header", "true").option("mode",
"DROPMALFORMED").option("inferSchema", "true").load("../PS_20174392719_1491204439457_log.csv")
df= df.withColumn("orgDiff", col("newbalanceOrig") -
col("oldbalanceOrg")).withColumn("destDiff",
col("newbalanceDest") - col("oldbalanceDest"))
df= df.withColumn("label",
when((col("oldbalanceOrg") <=56900 && col("type")=="TRANSFER" && col("newbalanceDest") <= 105)
||(col("oldbalanceOrg") >56900 && col("newbalanceOrig")<=12)
||(col("oldbalanceOrg") >56900 && col("newbalanceOrig")>12 && col("amount")>1160000),
1)
.otherwise(0)
)
df.createOrReplaceTempView("transaction")
val indexer = new StringIndexer().setInputCol("type").setOutputCol("typeIndexed")
println("indexer created")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel").fit(df)
val splitDataSet: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] =
  df.randomSplit(Array(0.8, 0.2), 12345L)
val trainDataFrame = splitDataSet(0)
val testDataFrame = splitDataSet(1)
val train = splitDataSet(0)
val test = splitDataSet(1)
println("train: "+train.count()+" test: "+test.count())
val va = new VectorAssembler().setInputCols(Array("typeIndexed", "amount", "oldbalanceOrg",
"newbalanceOrig", "oldbalanceDest", "newbalanceDest", "orgDiff",
"destDiff")).setOutputCol("features")
println("vector assembler created")
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features").setSeed(54321).setMaxDepth(5)
println("decision tree created")
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(indexer,labelIndexer, va, dt))
println("pipeline created")
val pipeConst=pipeline
val model = pipeConst.fit(train)
val model1 = model
println("model created")
val prediction=model1.transform(test)
println("indexer created")
prediction.collect().foreach(println)
}
}

type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.rdd.RDD

I am new to Scala and MLlib, and I have been getting the following error. Please let me know if anyone has been able to resolve something similar.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
.
.
.
val conf = new SparkConf().setMaster("local").setAppName("SampleApp")
val sContext = new SparkContext(conf)
val sc = SparkSession.builder().master("local").appName("SampleApp").getOrCreate()
val sampleData = sc.read.json("input/sampleData.json")
val clusters = KMeans.train(sampleData, 10, 10)
val WSSSE = clusters.computeCost(sampleData)
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sContext, "target/org/apache/spark/KMeansExample/KMeansModel")
The above line gives this error:
type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried:
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(20)
val model = kmeans.fit(sampleData)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
This gives the error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
Available fields: address, attributes, business_id
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
at org.apache.spark.ml.util.SchemaUtils$.validateVectorCompatibleColumn(SchemaUtils.scala:119)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:96)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:285)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:382)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:341)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
I have been referring to https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html
Edit
Using setFeaturesCol()
import org.apache.spark.ml.clustering.KMeans
val assembler = new VectorAssembler()
.setInputCols(Array("is_open", "review_count", "stars"))
.setOutputCol("features")
val output = assembler.transform(sampleData).select("features")
val kmeans = new KMeans().setK(20).setFeaturesCol("features")
val model = kmeans.fit(output)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
This gives a different error still:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.getSimpleName(Ljava/lang/Class;)Ljava/lang/String;
at org.apache.spark.ml.util.Instrumentation.logPipelineStage(Instrumentation.scala:52)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:350)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
Thanks.
Use a Spark ML Pipeline:
val assembler = new VectorAssembler()
.setInputCols(Array("feature1",feature2","feature3"))
.setOutputCol("assembled_features")
val scaler = new StandardScaler()
.setInputCol("assembled_features")
.setOutputCol("features")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().setK(2).setSeed(1L)
// create the pipeline
val pipeline = new Pipeline()
.setStages(Array(assembler, scaler, kmeans))
// Fit the model
val clusterModel = pipeline.fit(train)
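For example (a sketch, assuming train is the raw DataFrame containing the three input columns), the fitted pipeline model can then transform data and feed ClusteringEvaluator directly, since the scaler stage writes the "features" column that KMeans and the evaluator read by default:
// sketch: predict with the fitted pipeline and compute the silhouette, as in the question
val predictions = clusterModel.transform(train)
val evaluator   = new ClusteringEvaluator()   // defaults: featuresCol = "features", predictionCol = "prediction"
val silhouette  = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")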

Use DataFrames for a Decision Tree classifier in Spark with string fields

I have managed to get my Decision Tree classifier to work with the RDD-based API, but now I am trying to switch to the DataFrame-based API in Spark.
I have a dataset like this (but with many more fields):
country, destination, duration, label
Belgium, France, 10, 0
Bosnia, USA, 120, 1
Germany, Spain, 30, 0
First I load my CSV file into a DataFrame:
val data = session.read
.format("org.apache.spark.csv")
.option("header", "true")
.csv("/home/Datasets/data/dataset.csv")
Then I transform string columns into numerical columns
val stringColumns = Array("country", "destination")
val index_transformers = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
Then I assemble all my features into a single vector using VectorAssembler, like this:
val assembler = new VectorAssembler()
.setInputCols(Array("country_index", "destination_index", "duration_index"))
.setOutputCol("features")
I split my data into training and test sets:
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
Then I create my DecisionTree Classifier
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
Then I use a pipeline to make all the transformations
val pipeline = new Pipeline()
.setStages(Array(index_transformers, assembler, dt))
I train my model and use it for predictions:
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
But I get some errors I don't understand.
When I run my code like that, I get this error:
[error] found : Array[org.apache.spark.ml.feature.StringIndexer]
[error] required: org.apache.spark.ml.PipelineStage
[error] .setStages(Array(index_transformers, assembler,dt))
So what I did was add a pipeline right after the index_transformers val, and right before val assembler:
val index_pipeline = new Pipeline().setStages(index_transformers)
val index_model = index_pipeline.fit(data)
val df_indexed = index_model.transform(data)
and I used my new df_indexed DataFrame as the training and test set, and I removed index_transformers from my pipeline, keeping assembler and dt:
val Array(trainingData, testData) = df_indexed.randomSplit(Array(0.7, 0.3))
val pipeline = new Pipeline()
.setStages(Array(assembler,dt))
And I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.
It basically says I'm using VectorAssembler on strings, whereas I told it to use df_indexed, which now has numerical _index columns, but it doesn't seem to pick them up in the VectorAssembler, and I just don't understand.
Thank you
EDIT
Now I have almost managed to get it to work:
val data = session.read
.format("org.apache.spark.csv")
.option("header", "true")
.csv("/home/hvfd8529/Datasets/dataOINIS/dataset.csv")
val stringColumns = Array("country_index", "destination_index", "duration_index")
val stringColumns_index = stringColumns.map(c => s"${c}_index")
val index_transformers = stringColumns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
val assembler = new VectorAssembler()
.setInputCols(stringColumns_index)
.setOutputCol("features")
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("features")
.setImpurity("entropy")
.setMaxBins(1000)
.setMaxDepth(15)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels())
val stages = index_transformers :+ assembler :+ labelIndexer :+ dt :+ labelConverter
val pipeline = new Pipeline()
.setStages(stages)
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "indexedFeatures").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("accuracy = " + accuracy)
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + treeModel.toDebugString)
except that now I have an error saying this:
value labels is not a member of org.apache.spark.ml.feature.StringIndexer
and I don't understand why, as I am following the examples in the Spark docs.
Should be:
val pipeline = new Pipeline()
.setStages(index_transformers ++ Array(assembler, dt): Array[PipelineStage])
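This works because index_transformers is itself an Array[StringIndexer]: it cannot be nested as a single element inside Array(...), and once concatenated with ++ the result has to be widened to the Array[PipelineStage] that setStages expects. An equivalent sketch:
// requires: import org.apache.spark.ml.{Pipeline, PipelineStage}
// sketch: build the stage array explicitly as Array[PipelineStage]
val stages: Array[PipelineStage] = index_transformers ++ Array[PipelineStage](assembler, dt)
val pipeline = new Pipeline().setStages(stages)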
What I did for my first problem:
val stages = index_transformers :+ assembler :+ labelIndexer :+ rf :+ labelConverter
val pipeline = new Pipeline()
.setStages(stages)
For my second issue, with label, I needed to call .fit(data), like this:
val labelIndexer = new StringIndexer()
.setInputCol("label_fraude")
.setOutputCol("indexedLabel")
.fit(data)
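In other words, labels lives on the fitted StringIndexerModel returned by fit, not on the StringIndexer estimator itself, which is why the earlier .setLabels(labelIndexer.labels) failed. A minimal sketch using the column names from the edit:
// sketch: fit the label indexer first, then hand its labels to IndexToString
val labelIndexerModel = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)                              // returns a StringIndexerModel
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexerModel.labels)    // labels is an Array[String] on the fitted model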

How to use feature extraction with DStream in Apache Spark

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for all the data to arrive (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it doesn't matter much to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
val spark: SparkSession = SparkSession.builder.getOrCreate
val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
(rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
val df = spark.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
val rawFeatures = hashingTF.transform(df)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(rawFeatures)
val rescaledData = idfModel.transform(rawFeatures)
import spark.implicits._
rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I am just trying to scrape things together, and I understand that since I am not waiting for data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?
I used the advice from the comments and split the procedure into two runs:
one that calculates the IDF model and saves it to a file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
val session: SparkSession = SparkSession.builder.getOrCreate
val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedDf)
idfModel.write.save(idfModelFile.getAbsolutePath)
}
and one that reads the IDF model from the file and simply runs it on all incoming data
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")