How to extract best parameters from a CrossValidatorModel - Scala

I want to find the parameters from ParamGridBuilder that produce the best model in a CrossValidator in Spark 1.4.x.
In the Pipeline Example in the Spark documentation, different parameters (numFeatures, regParam) are added via ParamGridBuilder in the Pipeline. Then the best model is built with the following line of code:
val cvModel = crossval.fit(training.toDF)
Now, I want to know which parameters (numFeatures, regParam) from the ParamGridBuilder produced the best model.
I already tried the following commands without success:
cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()
Any help?
Thanks in advance,

One method to get a proper ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidatorModel

implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
      .zip(cvModel.avgMetrics)
      .maxBy(_._2)
      ._1
  }
}
When run on the CrossValidatorModel trained in the Pipeline Example you cited, this gives:
scala> println(cvModel.bestEstimatorParamMap)
{
hashingTF_2b0b8ccaeeec-numFeatures: 100,
logreg_950a13184247-regParam: 0.1
}

Alternatively, you can cast the best model to a PipelineModel and read the parameters directly from its stages:
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages
val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)
val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)

To print everything in paramMap, you actually don't have to call parent:
cvModel.bestModel.extractParamMap()
To answer OP's question, to get a single best parameter, for example regParam:
cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))

This is how you get the chosen parameters:
println(cvModel.bestModel.getMaxIter)
println(cvModel.bestModel.getRegParam)
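Note that in Scala, cvModel.bestModel is typed as Model[_], so these getters usually need a cast to the concrete model class first. A minimal sketch, assuming the CrossValidator was fitted on a plain LogisticRegression (not a Pipeline):
import org.apache.spark.ml.classification.LogisticRegressionModel

// Cast the generic best model to the concrete model type before reading params
val bestLrModel = cvModel.bestModel.asInstanceOf[LogisticRegressionModel]
println(bestLrModel.getMaxIter)
println(bestLrModel.getRegParam)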

This Java code should work:
cvModel.bestModel().parent().extractParamMap()
You can translate it to Scala. The parent() method returns an estimator, from which you can get the best params.
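A rough Scala equivalent of that Java call, as a sketch (for a non-pipeline estimator, the returned ParamMap includes the tuned hyperparameters):
// parent returns the estimator that produced the best model;
// its ParamMap holds the parameter values that were used
val bestParams = cvModel.bestModel.parent.extractParamMap()
println(bestParams)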

This answer is in PySpark. This is the ParamGridBuilder():
paramGrid = ParamGridBuilder().addGrid(
    hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
    lr.regParam, [0.1, 0.01, 0.001]
).build()
There are 3 stages in the pipeline. It seems we can access the parameters as follows:
for stage in cv_model.bestModel.stages:
    print 'stage: {}'.format(stage)
    print stage.params
    print '\n'
stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]
stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]
stage: LogisticRegression_451b8c8dbef84ecab7a9
[]
However, no parameters are listed for the last stage, LogisticRegression.
We can still get numFeatures from the HashingTF stage, and the weights and intercept from the logistic regression model, like the following:
cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])
Full exploration:
http://kuanliang.github.io/2016-06-07-SparkML-pipeline/
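For completeness, the same stage-by-stage inspection can be written in Scala. A sketch, assuming a fitted CrossValidatorModel named cvModel over the HashingTF + LogisticRegression pipeline from the Spark docs example (in Spark 1.x the coefficient vector is called weights; newer releases call it coefficients):
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.classification.LogisticRegressionModel

// Cast the best model to a PipelineModel and inspect its stages
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
println(bestPipeline.stages(1).asInstanceOf[HashingTF].getNumFeatures)

val lrModel = bestPipeline.stages(2).asInstanceOf[LogisticRegressionModel]
println(lrModel.intercept)
println(lrModel.weights) // called coefficients in Spark 2.x+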

I am working with Spark Scala 1.6.x. Here is a full example of how to set up and fit a CrossValidator and then return the value of the parameter used to get the best model (assuming that training.toDF gives a DataFrame ready to be used):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Instantiate a LogisticRegression object
val lr = new LogisticRegression()
// Instantiate a ParamGrid with different values for the 'RegParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()
// Set up and fit the CrossValidator on the training set, using a MulticlassClassificationEvaluator as the evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)
// Getting the value of the 'RegParam' used to get the best model
val bestModel = cvModel.bestModel // Getting the best model
val paramReference = bestModel.getParam("regParam") // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.get(paramReference) // Getting the value of this parameter
print(paramValue) // In my case : 0.001
You can do the same for any parameter or any other type of model.
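For example, the same pattern for a parameter that was not part of the grid; a sketch using maxIter, where getOrDefault falls back to the default value if the parameter was never explicitly set:
// getParam returns the Param reference; getOrDefault resolves either the
// explicitly set value or the parameter's default
val maxIterRef = bestModel.getParam("maxIter")
val maxIterValue = bestModel.getOrDefault(maxIterRef)
println(maxIterValue)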

If you are using Java, see what this call shows in the debugger:
bestModel.parent().extractParamMap()

Building on the solution of @macfeliga, a one-liner that works for pipelines:
cvModel.bestModel.asInstanceOf[PipelineModel]
.stages.foreach(stage => println(stage.extractParamMap))

This SO thread kinda answers the question.
In a nutshell, you need to cast each object to its supposed-to-be class.
For the case of CrossValidatorModel, the following is what I did:
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel
// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)
// To get the parameters of the best model
(
reloadedCvModel.bestModel
.asInstanceOf[PipelineModel]
.stages(1)
.asInstanceOf[RandomForestRegressionModel]
.extractParamMap()
)
In the example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index is 1 for my model.
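If you prefer not to hard-code the stage index, a small sketch that pattern-matches over the stages instead:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel

// Find the first RandomForestRegressionModel stage and extract its params
val rfParams = reloadedCvModel.bestModel
  .asInstanceOf[PipelineModel]
  .stages
  .collectFirst { case rf: RandomForestRegressionModel => rf.extractParamMap() }

println(rfParams) // Some({...}) if such a stage exists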

For me, the @orangeHIX solution is perfect:
val cvModel = cv.fit(training)
val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]
cvMejorModelo.parent.extractParamMap()
res86: org.apache.spark.ml.param.ParamMap =
{
als_08eb64db650d-alpha: 0.05,
als_08eb64db650d-checkpointInterval: 10,
als_08eb64db650d-coldStartStrategy: drop,
als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-implicitPrefs: false,
als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-itemCol: product,
als_08eb64db650d-maxIter: 10,
als_08eb64db650d-nonnegative: false,
als_08eb64db650d-numItemBlocks: 10,
als_08eb64db650d-numUserBlocks: 10,
als_08eb64db650d-predictionCol: prediction,
als_08eb64db650d-rank: 1,
als_08eb64db650d-ratingCol: rating,
als_08eb64db650d-regParam: 0.1,
als_08eb64db650d-seed: 1994790107,
als_08eb64db650d-userCol: user
}

Related

How do we extract named entities in Scala using any NLP library

I have a huge text file and I have to extract only the named entities from this file. I am using the Scala language and a Databricks cluster for this.
val input = sc.textFile("....Mypath...").flatMap(line => line.split("""\W+"""))
val namedEnt = something(input)
Can anyone tell me what to code to get the named entities?
If you convert your input to a DataFrame (ex: .toDF), this is how you can get the Named Entities out:
Just an example of Spark NLP installation
spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0
Actual example:
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
// make sure you are using the latest release 2.4.x
// Download and load the pre-trained pipeline that has NER in English
// Full list: https://github.com/JohnSnowLabs/spark-nlp-models
val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")
// Transform your DataFrame into a new DataFrame that has a NER column
val annotation = pipeline.transform(inputDF)
// This would look something like this:
/*
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| sentence| token| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/
// This is where the results for entities are:
annotation.select("entities.result").show
Let me know if you have any questions or problems with your input data and I'll update my answer.
References:
https://github.com/JohnSnowLabs/spark-nlp
https://github.com/JohnSnowLabs/spark-nlp-models
https://github.com/JohnSnowLabs/spark-nlp-workshop
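As a side note, a minimal sketch (assuming a SparkSession named spark) of how the text file from the question could be turned into the inputDF used above; the pretrained pipeline expects the raw text in a column named "text":
import spark.implicits._

// Read the raw file and put each line into a "text" column
val inputDF = spark.sparkContext
  .textFile("....Mypath...") // path elided as in the question
  .toDF("text")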

Scala - how to use variables in a multi-line string literal

I want to use the value of the 'myActionID' variable. How do I do that?
If I pass a static value like "actionId": 1368201 to myActionID then it works, but if I use "actionId": ${actionIdd} it gives an error.
Here's the relevant code:
class LaunchWorkflow_Act extends Simulation {
  val scenarioRepeatCount = 1
  val userCount = 1
  val myActionID = "13682002351"

  val scn = scenario("LaunchMyFile")
    .repeat(scenarioRepeatCount) {
      exec(session => session.set("counter", (globalVar.getAndIncrement + " " + timeStamp.toString())))
        .exec(http("LaunchRequest")
          .post("""/api/test""")
          .headers(headers_0)
          .body(StringBody(
            """{ "actionId": ${myActionID} ,
            "jConfig": "{\"wflow\":[{\"Wflow\":{\"id\": \"13500145349\"},\"inherit-variables\": true,\"workflow-context-variable\": [{\"variable-name\": \"externalFilePath\",\"variable-value\": \"/var/nem/nem/media/mount/assets/Test.mp4\"},{\"variable-name\": \"Name\",\"variable-value\": \"${counter}\"}]}]}"
            }""")))
        .pause(pause)
    }

  setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
Everything works fine if I put the value 13682002351 instead of myActionID. While executing this script in Gatling I am getting this error:
ERROR i.g.http.action.HttpRequestAction - 'httpRequest-3' failed to execute: No attribute named 'myActionID' is defined
Scala has various mechanisms for String Interpolation (see docs), which can be used to embed variables in strings. All of them can be used in conjunction with the triple quotes """ used to create multi-line strings.
In this case, you can use:
val counter = 12
val myActionID = "13682002351"
val str = s"""{
"actionId": $myActionID ,
"jConfig": "{\"wflow\":[{\"Wflow\":{\"id\": \"13500145349\"},\"inherit-variables\": true,\"workflow-context-variable\": [{\"variable-name\": \"externalFilePath\",\"variable-value\": \"/var/nem/nem/media/mount/assets/Test.mp4\"},{\"variable-name\": \"Name\",\"variable-value\": \"${counter}\"}]}]}"
}"""
Notice the s prepended to the string literal, and the dollar sign prepended to the variable names.
Using an s-interpolated String, we can do this easily:
s"""Hello Word , Welcome Back!
How are you doing ${userName}"""
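For clarity, a minimal standalone sketch showing that the s prefix is what triggers the substitution; without it, the ${...} text is left verbatim in the string:
val myActionID = "13682002351"
val interpolated = s"""{ "actionId": $myActionID }"""    // value is substituted
val notInterpolated = """{ "actionId": $myActionID }"""  // left as literal text
println(interpolated)    // { "actionId": 13682002351 }
println(notInterpolated) // { "actionId": $myActionID }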

Passing arguments between Gatling scenarios and simulation

I'm currently creating some Gatling simulations to test a REST API. I don't really understand Scala.
I've created a scenario with several exec and pause steps:
object MyScenario {
val ccData = ssv("cardcode_fr.csv").random
val nameData = ssv("name.csv").random
val mobileData = ssv("mobile.csv").random
val emailData = ssv("email.csv").random
val itemData = ssv("item_fr.csv").random
val scn = scenario("My use case")
.feed(ccData)
.feed(nameData)
.feed(mobileData)
.feed(emailData)
.feed(itemData)
.exec(
http("GetCustomer")
.get("/rest/customers/${CardCode}")
.headers(Headers.headers)
.check(
status.is(200)
)
)
.pause(3, 5)
.exec(
http("GetOffers")
.get("/rest/offers")
.queryParam("customercode", "${CardCode}")
.headers(Headers.headers)
.check(
status.is(200)
)
)
}
And I have a simple Simulation:
class MySimulation extends Simulation {
setUp(MyScenario.scn
.inject(
constantUsersPerSec (1 ) during (1)))
.protocols(EsbHttpProtocol.httpProtocol)
.assertions(
global.successfulRequests.percent.is(100))
}
The application I'm trying to simulate is a multi-location mobile app, so I've prepared a set of sample data for each locale (US, FR, IT...).
My REST API handles all the locales, therefore I want to make the simulation concurrently execute several instances of MyScenario, each with a different locale sample, to simulate the global load.
Is it possible to execute my simulation without having to create/duplicate the scenario and change the val ccData = ssv("cardcode_fr.csv").random for each one?
Also, each locale has its own load, how can I create a simulation that takes a single scenario and executes it several times concurrently with a different load and feeders?
Thanks in advance.
From what you've said, I think this may be a good approach:
Start by grouping your data in such a way that you can look up each item you want to send based on the current locale. For this, I would recommend using a Map that matches a locale string (such as "FR") to the item that matches that locale for the field you're looking to fill in. Then, at the start of each iteration of the scenario, you just pick which locale you want to use for the current iteration from a list. It would look something like this:
val locales = List("US", "FR", "IT")
val names = Map("US" -> "John", "FR" -> "Pierre", "IT" -> "Guillame")
object MyScenario {
  // Pick a random locale from your list (a Random instance is needed here)
  val rand = new scala.util.Random
  val random_index = rand.nextInt(locales.length)
  val currentLocale = locales(random_index)
  // This line gets the name
  val name = names(currentLocale)
  // Do the rest of your logic here
}
This is a very simplified example - you'll have to figure out how you actually want to retrieve the data from files and put it into a Map structure, as I assume you don't want to hard code every item for every field into your code.
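A minimal sketch of that lookup idea in plain Scala (the file names here are hypothetical, following the cardcode_fr.csv pattern from the question):
import scala.util.Random

val locales = List("us", "fr", "it")
// Map each locale to its feeder file, e.g. cardcode_fr.csv, cardcode_us.csv, ...
val ccFilesByLocale: Map[String, String] =
  locales.map(loc => loc -> s"cardcode_$loc.csv").toMap

// Pick a random locale for the current iteration and look up its file
val currentLocale = locales(Random.nextInt(locales.length))
val ccFile = ccFilesByLocale(currentLocale)
println(s"Using feeder file: $ccFile")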

String category names with Wisp scala plotting library

I am trying to plot bars for String categories with Wisp. In other words, I have a string (key) and a count (value) in my repl and I want to have a bar chart of the count versus the key.
I don't know if something easy exists. I went as far as the following hack:
val plot = bar(topWords.map(_._2).toList)
val axisType: com.quantifind.charts.highcharts.AxisType.Type = "category"
val newPlot = plot.copy(xAxis = plot.xAxis.map {
axisArray => axisArray.map { _.copy(axisType = Option(axisType),
categories = Option(topWords.map(_._1))) }
})
but I don't know if it works, because I can't find a way to visualize newPlot. Or maybe adding a method to the Wisp source implementing the above is the way to go?
Thanks for any help.
PS : I don't have the reputation to create the wisp tag, but I would have...
Update: in wisp 0.0.5 or later this will be supported directly:
column(List(("alpha", 1), ("omega", 5), ("zeta", 8)))
====
Thanks for trying out wisp (I am the author). I think a problem that you may have encountered is by naming your variable plot, you overrode access to the plot method defined by Highcharts, which allows you to plot any Highchart object. (now that I see it, naming your plot "plot" is an unfortunately natural thing to do!)
I'm filing this as a github issue, as category names are very common.
This works for me. I used column for vertical bars:
import com.quantifind.charts.Highcharts._
val topWords = Array(("alpha", 14), ("beta", 23), ("omega", 18))
val numberedColumns = column(topWords.map(_._2).toList)
val axisType: com.quantifind.charts.highcharts.AxisType.Type = "category"
val namedColumns = numberedColumns.copy(xAxis = numberedColumns.xAxis.map {
axisArray => axisArray.map { _.copy(axisType = Option(axisType),
categories = Option(topWords.map(_._1))) }
})
plot(namedColumns)
producing this visualization:
(You could also consider creating one bar at a time, and use the legend to name them)

Get the first elements (take function) of a DStream

I'm looking for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function on a DStream (as there is on an RDD) // dstream.take(2) !!!
Could someone give me an idea of how to do it? Thanks.
You can use the transform method on the DStream object, take n elements of the input RDD and save them to a list, then filter the original RDD to keep only the elements contained in this list. This returns a new DStream containing n elements.
val n = 10
val partOfResult = dstream.transform(rdd => {
  val list = rdd.take(n)
  rdd.filter(list.contains)
})
partOfResult.print
The previously suggested solution did not compile for me, as the take() method returns an Array, which is not serializable, and thus Spark Streaming will fail with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform(rdd => {
  rdd.filter(rdd.take(n).toList.contains)
})
partOfResult.print
Sharing a Java-based solution that is working for me. The idea is to use a custom function, which selects the top row from a sorted RDD.
someData.transform(
rdd ->
{
JavaRDD<CryptoDto> result =
rdd.keyBy(Recommendations.volumeAsKey)
.sortByKey(new CryptoComparator()).values().zipWithIndex()
.map(row ->{
CryptoDto purchaseCrypto = new CryptoDto();
purchaseCrypto.setBuyIndicator(row._2 + 1L);
purchaseCrypto.setName(row._1.getName());
purchaseCrypto.setVolume(row._1.getVolume());
purchaseCrypto.setProfit(row._1.getProfit());
purchaseCrypto.setClose(row._1.getClose());
return purchaseCrypto;
}
).filter(Recommendations.selectTopInSortedRdd);
return result;
}).print();
The custom function selectTopInSortedRdd looks like this:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Boolean call(CryptoDto value) throws Exception {
        if (value.getBuyIndicator() == 1L) {
            System.out.println("Value of buyIndicator :" + value.getBuyIndicator());
            return true;
        } else {
            return false;
        }
    }
};
It basically compares all incoming elements, and returns true only for the first record from the sorted RDD.
This always seems to be an issue with DStreams as well as regular RDDs.
If you don't want to (or can't) use .take() (especially with DStreams), you can think outside the box here and just use reduce instead. That is a valid function for both DStreams and RDDs.
Think about it. If you use reduce like this (Python example):
.reduce( lambda x, y : x)
Then what happens is: For every 2 elements you pass in, always return only the first. So if you have a million elements in your RDD or DStream it will shrink it to one element in the end which is the very first one in your RDD or DStream.
Simple and clean.
However, keep in mind that .reduce() does not take order into consideration. You can easily overcome this with a custom function instead.
Example: Let's assume your data looks like this x = (1, [1,2,3]) and y = (2, [1,2]). A tuple x where the 2nd element is a list. If you are sorting by the longest list for example then your code could look like below maybe (adapt as needed):
def your_reduce(x, y):
    if len(x[1]) > len(y[1]):
        return x
    else:
        return y

yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly you will get '(1, [1,2,3])' as that has the longer list. There you go!
This has caused me some headaches in the past until I finally tried this. Hopefully this helps.
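Since most of this thread is in Scala, here is a rough Scala sketch of the same reduce idea, assuming an RDD of (Int, List[Int]) pairs as in the example above:
import org.apache.spark.rdd.RDD

// Collapse the RDD to a single element by always keeping the first argument
// (note: as with the Python version, the order is not guaranteed)
def firstElement(rdd: RDD[(Int, List[Int])]): (Int, List[Int]) =
  rdd.reduce((x, y) => x)

// Mirror the your_reduce function: keep the pair with the longest list
def longestList(rdd: RDD[(Int, List[Int])]): (Int, List[Int]) =
  rdd.reduce((x, y) => if (x._2.length > y._2.length) x else y)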