I'm looking into creating a pipeline to run logistic regression in Spark, and I've run into the question of whether there is a way to extend or bypass the "Tokenizer" object.
Essentially, the problem is that the tokenizer is not nearly precise enough for the vectors I'm trying to create (stemming, lemmatization, bi-grams, etc.), yet every example of a Spark pipeline I see looks something along the lines of:
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
Must I have a tokenizer stage? Would it be trivial to extend the Tokenizer class to do the string modifications I want? Any help would be highly appreciated!
I found a pretty decent example of extending the tokenizer class right here. It should give a pretty good roadmap of what needs to be overridden for proper tokenization.
https://github.com/fyrz/spark-java-text-classifier/blob/master/src/main/java/org/fyrz/textclassifier/tokenizer/SparkLuceneTokenizer.java
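If you want to stay in Scala rather than the Java/Lucene approach linked above, here is a minimal sketch of a custom pipeline stage built on UnaryTransformer. The class name MyTokenizer and the bigram logic are placeholders, not the linked code; plug in whatever stemming/lemmatization library you actually need.
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
// Hypothetical custom tokenizer: lower-cases, splits on non-word characters,
// and emits bi-grams in addition to unigrams.
class MyTokenizer(override val uid: String)
  extends UnaryTransformer[String, Seq[String], MyTokenizer] {
  def this() = this(Identifiable.randomUID("myTokenizer"))
  override protected def createTransformFunc: String => Seq[String] = { text =>
    val words = text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
    val bigrams = words.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toSeq
    words ++ bigrams
  }
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")
  override protected def outputDataType: DataType = ArrayType(StringType, containsNull = false)
  override def copy(extra: ParamMap): MyTokenizer = defaultCopy(extra)
}
Since HashingTF only reads the output column, a stage like this drops straight into the pipeline above in place of Tokenizer, e.g. new MyTokenizer().setInputCol("text").setOutputCol("words").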
I tried to implement this example of multinomial logistic regression, but some of the members being used are not recognized, probably due to a version mismatch. This part of the code:
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
  println(s"label $label: $rate")
}
None of these members of LogisticRegressionTrainingSummary are recognized, particularly falsePositiveRateByLabel in the given example, as well as other members used later in the code: truePositiveRateByLabel, precisionByLabel, ...
When I go to the implementation I can't find any similar members that I could use instead (I am on mllib 2.11). What am I missing?
You are correct, this is a versioning issue. The GitHub code example you have given is for the current master branch of Spark, where there have been some major changes in this part of the API.
What you have been following is what the code in Spark 2.3 will look like. However, at this time, that version is not yet stable or available for download. This is what the version 2.2 branch of the same code example looks like:
val training = spark
  .read
  .format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")
spark.stop()
In other words, the methods you are trying to use are not yet implemented in your Spark version.
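If you need per-label metrics on Spark 2.2 in the meantime, one possible workaround (just a sketch, reusing lrModel and training from the snippet above) is the RDD-based MulticlassMetrics:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
// Compute per-label metrics from the predictions ("prediction" and "label" are the
// default columns produced by the model's transform()).
val predictionAndLabels = lrModel.transform(training)
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))
val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach { label =>
  println(s"label $label: FPR = ${metrics.falsePositiveRate(label)}, " +
    s"TPR = ${metrics.truePositiveRate(label)}, precision = ${metrics.precision(label)}")
}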
I've been trying to learn/use Scala for machine learning and to do that I need to convert string variables to an index of dummies.
The way I've done it is with the StringIndexer in Scala. Before running I've used df.na.fill("missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or something else I should be checking? I used printSchema to filter only on the string columns to get the list of columns I needed to run StringIndexer on.
val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you train the model with one part of the data and then test it with another part, and there are two different methods to reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a dataframe; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the one used for training and returns a dataframe. In other words, you should add .transform(newDf1) after fit(newDf1).
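A minimal sketch of that fix, reusing newDf1 and the categorical pipeline from your question:
// Fit the indexing pipeline, then transform the same dataframe to get the *_index columns.
val indexerModel = categorical.fit(newDf1)        // a PipelineModel, not a DataFrame
val indexedDf    = indexerModel.transform(newDf1) // DataFrame with the new index columns
indexedDf.select("highest_tier_nm", "highest_tier_nm_index").show(5)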
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct dataframe is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
It works for me when running locally; however, if you still get an error, you could try calling cache() after replacing the nulls and then performing an action to make sure all transformations are applied.
Hope it helps!
I am using Spark with Scala. I want to do different preprocessing on my data. Is there a way for CrossValidator to take multiple models (also with ParamMaps) to get the best model out of these two?
e.g. What I want to do is:
val discretizer = new QuantileDiscretizer()
  .setInputCol("column1")
  .setOutputCol("column1disc")
  .setNumBuckets(5)
val normalizer = new Normalizer()
  .setInputCol("column1")
  .setOutputCol("column1norm")
val lr1 = new LinearRegression()
  .setFeaturesCol(discretizer.getOutputCol)
  .setMaxIter(10)
val lr2 = new LinearRegression()
  .setFeaturesCol(normalizer.getOutputCol)
  .setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(discretizer, normalizer, lr1, lr2))
Now I want my CrossValidator to pick the best of the two models, lr1 and lr2. This is just a small example; I want to extend it to multiple such possibilities, with ParamMaps too.
You should be able to evaluate these models against each other using a custom estimator, as in "How to use CrossValidator to choose between different models".
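If you do not want to write a custom estimator, a simpler (if less elegant) alternative is to cross-validate each candidate pipeline separately and keep the better one. The sketch below assumes the variables from your example (discretizer, normalizer, lr1, lr2) plus a training dataframe called data; the r2 metric (larger is better) is just an illustration.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// One pipeline per candidate preprocessing + model combination.
val candidates = Seq(
  new Pipeline().setStages(Array(discretizer, lr1)),
  new Pipeline().setStages(Array(normalizer, lr2))
)
val evaluator = new RegressionEvaluator().setMetricName("r2") // larger is better
// Cross-validate every candidate and keep the model with the best average metric.
val fitted = candidates.map { pipeline =>
  new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(new ParamGridBuilder().build()) // add per-candidate ParamMaps here
    .setNumFolds(3)
    .fit(data)
}
val best = fitted.maxBy(_.avgMetrics.max)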
I have trained a Spark Multilayer Perceptron Classifier to detect spam messages and would like to use it in a webservice in combination with the Play Framework.
My solution (see below) spawns an embedded local spark cluster, loads the model and classifies messages. Is there a way to use the model without an embedded Spark cluster?
Spark has some dependencies that clash with the Play Framework dependencies. I thought there might be a way to run the model in classification mode without starting an embedded spark cluster.
My second question is whether I can classify a single message without putting it in a DataFrame first.
Application Loader:
lazy val sparkSession: SparkSession = {
  val conf: SparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Classifier")
    .set("spark.ui.enabled", "false")
  val session = SparkSession.builder()
    .config(conf)
    .getOrCreate()
  applicationLifecycle.addStopHook { () ⇒
    Future { session.stop() }
  }
  session
}
lazy val model: PipelineModel = {
  sparkSession // force the lazy session to initialize before loading the model
  CrossValidatorModel.load("mpc-model").bestModel.asInstanceOf[PipelineModel]
}
Classification service (model and spark session are injected):
val messageDto = Seq(MessageSparkDto(
  sender = message.sender.email.value,
  text = featureTransformer.cleanText(text).value,
  messagelength = text.value.length,
  isMultimail = featureTransformer.isMultimail(message.sender.email)
))
val messageDf = messageDto.toDS()
model.transform(messageDf).head().getAs[Double]("prediction") match {
  case 1.0 ⇒ MessageEvaluationResult(MessageClass.Spam)
  case _   ⇒ MessageEvaluationResult(MessageClass.NonSpam)
}
Edit: As pointed out in the comments, one solution could be to convert the model to PMML and then use another engine to load it and use it for classification. That sounds to me like a lot of overhead as well. Does anyone have experience with running Spark in local mode with minimal overhead and dependencies in order to use the ML classifiers?
Although I like the solution proposed in the linked post, the following might also be possible. You could of course copy the model to the server onto which you will deploy the webservice, install a Spark "cluster" with a single machine on it, and put spark-jobserver on top of it to handle the requests and access Spark. That would be the no-brainer solution and should work if your model does not need a lot of computational power.
I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a spark.ml equivalent of sklearn's pipeline.named_steps feature?
I found this answer which gives two options. The first works if I take the k-means model out of my pipeline and fit it separately, but that kinda defeats the purpose of a pipeline. The second option doesn't work - I get error: value getModel is not a member of org.apache.spark.ml.PipelineModel.
EDIT: Example pipeline:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.Pipeline
// create example dataframe
val sentenceData = sqlContext.createDataFrame(Seq(
  Tuple1("Hi I heard about Spark"),
  Tuple1("I wish Java could use case classes"),
  Tuple1("K-means models are neat")
)).toDF("sentence")
// initialize pipeline stages
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val kmeans = new KMeans()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
// fit the pipeline
val fitKmeans = pipeline.fit(sentenceData)
So now fitKmeans is of type org.apache.spark.ml.PipelineModel. My question is: how do I access the cluster centers calculated by the k-means model contained within this pipeline? As noted above, when the k-means model is fit outside of a pipeline, this can be done with its clusterCenters member.
Answering my own question...I finally stumbled on an example deep in the spark.ml docs that shows how to do this using the stages member of the PipelineModel class. So for the example I posted above, in order to access the k-means cluster centers, do:
val centers = fitKmeans.stages(2).asInstanceOf[KMeansModel].clusterCenters
where fitKmeans is a PipelineModel and 2 is the index of the k-means model in the array of pipeline stages.
Reference: the last line of most of the examples on this page.
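A small variation on the same idea, in case the stage index ever changes: look the fitted model up by type instead of hard-coding the position (a sketch reusing fitKmeans and the KMeansModel import from above).
// Find the fitted KMeansModel among the pipeline stages instead of using a fixed index.
val kmeansModel = fitKmeans.stages.collectFirst {
  case m: KMeansModel => m
}.getOrElse(sys.error("no KMeansModel stage in this pipeline"))
val centers = kmeansModel.clusterCenters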