Setting the scalePosWeight parameter for the Spark XGBoost model in a CV grid - Scala

I am trying to tune my XGBoost model on Spark using Scala. My XGBoost parameter grid is as follows:
val xgbParamGrid = (new ParamGridBuilder()
.addGrid(xgb.maxDepth, Array(8, 16))
.addGrid(xgb.minChildWeight, Array(0.5, 1, 2))
.addGrid(xgb.alpha, Array(0.8, 0.9, 1))
.addGrid(xgb.lambda, Array(0.8, 1, 2))
.addGrid(xgb.scalePosWeight, Array(1, 5, 9))
.addGrid(xgb.subSample, Array(0.5, 0.8, 1))
.addGrid(xgb.eta, Array(0.01, 0.1, 0.3, 0.5))
.build())
The call to the cross validator is as follows:
val evaluator = (new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("prediction")
.setMetricName("areaUnderPR"))
val cv = (new CrossValidator()
.setEstimator(pipeline_model_xgb)
.setEvaluator(evaluator)
.setEstimatorParamMaps(xgbParamGrid)
.setNumFolds(10))
val xgb_model = cv.fit(train)
I am getting the following error just for the scalePosWeight parameter:
error: type mismatch;
found : org.apache.spark.ml.param.DoubleParam
required: org.apache.spark.ml.param.Param[AnyVal]
Note: Double <: AnyVal (and org.apache.spark.ml.param.DoubleParam <:
org.apache.spark.ml.param.Param[Double]), but class Param is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
.addGrid(xgb.scalePosWeight, Array(1, 5, 9))
^
Based on my search, the message "You may wish to define T as +T instead" is common but I am not sure how to fix this here. Thanks for your help!

I ran into the same issue when setting the array for minChildWeight, where the array was composed of Int values only. The solution that worked (for both scalePosWeight and minChildWeight) is to pass an array of Doubles:
.addGrid(xgb.scalePosWeight, Array(1.0, 5.0, 9.0))
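For completeness, here is a sketch of the full grid with the offending arrays written as Doubles (assuming the same xgb estimator as above); scalePosWeight and minChildWeight are DoubleParams, so their grids must be Array[Double], while maxDepth is left as in the original question:
val xgbParamGrid = (new ParamGridBuilder()
  .addGrid(xgb.maxDepth, Array(8, 16))
  .addGrid(xgb.minChildWeight, Array(0.5, 1.0, 2.0))
  .addGrid(xgb.alpha, Array(0.8, 0.9, 1.0))
  .addGrid(xgb.lambda, Array(0.8, 1.0, 2.0))
  .addGrid(xgb.scalePosWeight, Array(1.0, 5.0, 9.0)) // Double literals, not Ints
  .addGrid(xgb.subSample, Array(0.5, 0.8, 1.0))
  .addGrid(xgb.eta, Array(0.01, 0.1, 0.3, 0.5))
  .build())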

Related

Scala reduction with non-primitive lists

The following function calculates the Euclidean distance between two 2D points in Scala:
import scala.math.{pow, sqrt}

def euclideanDist(pt1: List[Double], pt2: List[Double]): Double =
  sqrt(pow(pt1(0) - pt2(0), 2) + pow(pt1(1) - pt2(1), 2))
I would like to write a perimeter function that accumulates the distances between consecutive points in a list of points (or a ListBuffer).
For instance,
val arr:ListBuffer[List[Double]] = ListBuffer(List(0, 0), List(0,1), List(1,1), List(1,0), List(0, 0))
perimeter(arr)
should give the output as 4.
This is what I tried:
def perimeter(arr: ListBuffer[List[Double]]): Double =
arr.reduceLeft(euclideanDist)
On execution, the compiler throws this error
Name: Unknown Error
Message: <console>:43: error: type mismatch;
found : (List[Double], List[Double]) => Double
required: (Any, List[Double]) => Any
arr.reduceLeft(euclideanDist)
^
<console>:43: error: type mismatch;
found : Any
required: Double
arr.reduceLeft(euclideanDist)
^
StackTrace:
I could go imperative and do the whole thing with a for-loop, but I would like to know if this can be solved more simply, the Scala way.
What about this:
val arr:List[List[Double]] = List(List(0, 0), List(0,1), List(1,1), List(1,0), List(0, 0))
arr.sliding(2).map{case List(a,b) => euclideanDist(a,b)}.sum
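Putting it together, a complete version might look like this (a sketch; points are assumed to be two-element List[Double]s as in the question, and a Seq pattern is used so it also works when the input is a ListBuffer):
import scala.math.{pow, sqrt}

def euclideanDist(pt1: List[Double], pt2: List[Double]): Double =
  sqrt(pow(pt1(0) - pt2(0), 2) + pow(pt1(1) - pt2(1), 2))

// Sum the distances between each pair of consecutive points.
// collect (rather than map) skips any degenerate window with fewer than two points.
def perimeter(points: Seq[List[Double]]): Double =
  points.sliding(2).collect { case Seq(a, b) => euclideanDist(a, b) }.sum

perimeter(List(List(0, 0), List(0, 1), List(1, 1), List(1, 0), List(0, 0))) // 4.0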

Scala: How to multiply a List of Lists by value

I am currently studying Scala and working with lists of lists. I want to multiply a list by an element (for example, 1).
However, I get the following error:
identifier expected but integer constant found
Current code:
def multiply[A](listOfLists:List[List[A]]):List[List[A]] =
if (listOfLists == Nil) Nil
else -1 * listOfLists.head :: multiply(listOfLists.tail)
val tt = multiply[List[3,4,5,6];List[4,5,6,7,8]]
print(tt);;
There are a few issues with your code:
In general, you can't perform arithmetic on unconstrained generic types; somehow, you have to communicate any supported arithmetic operations.
Multiplying by 1 will typically have no effect anyway.
As already pointed out, you don't declare List instances using square brackets (they're used for declaring generic type arguments).
The arguments you're passing to multiply are two separate lists (using an invalid semicolon separator instead of a comma), not a list of lists.
In the if clause the return value is Nil, which matches the stated return type of List[List[A]]. However the else clause is trying to perform a calculation which is multiplying List instances (not the contents of the lists) by an Int. Even if this made sense, the resulting type is clearly not a List[List[A]]. (This also makes it difficult for me to understand exactly what it is you're trying to accomplish.)
Here's a version of your code that corrects the above, assuming that you're trying to multiply each member of the inner lists by a particular factor:
// Multiply every element in a list of lists by the specified factor, returning the
// resulting list of lists.
//
// Should work for any primitive numeric type (Int, Double, etc.). For custom value types,
// you will need to declare an `implicit val` of type Numeric[YourCustomType] with an
// appropriate implementation of the `Numeric[T]` trait. If in scope, the appropriate
// num value will be identified by the compiler and passed to the function automatically.
def multiply[A](ll: List[List[A]], factor: A)(implicit num: Numeric[A]): List[List[A]] = {
// Numeric[T] trait defines a times method that we use to perform the multiplication.
ll.map(_.map(num.times(_, factor)))
}
// Sample use: Multiply every value in the list by 5.
val tt = multiply(List(List(3, 4, 5, 6), List(4, 5, 6, 7, 8)), 5)
println(tt)
This should result in the following output:
List(List(15, 20, 25, 30), List(20, 25, 30, 35, 40))
However, it might be that you're just trying to multiply together all of the values in the lists. This is actually a little more straightforward (note the different return type):
def multiply[A](ll: List[List[A]])(implicit num: Numeric[A]): A = ll.flatten.product
// Sample use: Multiply all values in all lists together.
val tt = multiply(List(List(3, 4, 5, 6), List(4, 5, 6, 7, 8)))
println(tt)
This should result in the following output:
2419200
I'd recommend you read a good book on Scala. There's a lot of pretty sophisticated stuff going on in these examples, and it would take too long to explain it all here. A good start would be Programming in Scala, Third Edition by Odersky, Spoon & Venners. That will cover List[A] operations such as map, flatten and product as well as implicit function arguments and implicit val declarations.
To make numeric operations available for type A, you can use a context bound to associate A with scala.math.Numeric, which provides methods such as times and fromInt to carry out the necessary multiplication in this use case:
def multiply[A: Numeric](listOfLists: List[List[A]]): List[List[A]] = {
val num = implicitly[Numeric[A]]
import num._
if (listOfLists == Nil) Nil else
listOfLists.head.map(times(_, fromInt(-1))) :: multiply(listOfLists.tail)
}
multiply( List(List(3, 4, 5, 6), List(4, 5, 6, 7, 8)) )
// res1: List[List[Int]] = List(List(-3, -4, -5, -6), List(-4, -5, -6, -7, -8))
multiply( List(List(3.0, 4.0), List(5.0, 6.0, 7.0)) )
// res2: List[List[Double]] = List(List(-3.0, -4.0), List(-5.0, -6.0, -7.0))
For more details about context bounds, here's a relevant SO link.
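As a side note, the context bound A: Numeric is just syntactic sugar for an extra implicit parameter list, so the version above is roughly equivalent to:
// Desugared form: the Numeric[A] instance is passed explicitly as an implicit parameter.
def multiply[A](listOfLists: List[List[A]])(implicit num: Numeric[A]): List[List[A]] = {
  import num._
  if (listOfLists == Nil) Nil
  else listOfLists.head.map(times(_, fromInt(-1))) :: multiply(listOfLists.tail)
}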

In Spark-Scala, how to copy Array of Lists into DataFrame?

I am familiar with Python and I am learning Spark-Scala.
I want to build a DataFrame whose structure is described by this syntax:
// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
(1.1, Vectors.dense(1.1, 0.1)),
(0.2, Vectors.dense(1.0, -1.0)),
(3.0, Vectors.dense(1.3, 1.0)),
(1.0, Vectors.dense(1.2, -0.5))
)).toDF("label", "features")
I got the above syntax from this URL:
http://spark.apache.org/docs/latest/ml-pipeline.html
Currently my data is in an array which I pulled out of a DataFrame:
val my_a = gspc17_df.collect().map{row => Seq(row(2),Vectors.dense(row(3).asInstanceOf[Double],row(4).asInstanceOf[Double]))}
The structure of my array is very similar to the above DF:
my_a: Array[Seq[Any]] =
Array(
List(-1.4830674013266898, [-0.004192832940431825,-0.003170667657263393]),
List(-0.05876766500768526, [-0.008462913654529357,-0.006880595828929472]),
List(1.0109273250546658, [-3.1816797620416693E-4,-0.006502619326182358]))
How to copy data from my array into a DataFrame which has the above structure?
I tried this syntax:
val my_df = spark.createDataFrame(my_a).toDF("label","features")
Spark barked at me:
<console>:105: error: inferred type arguments [Seq[Any]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
<console>:105: error: type mismatch;
found : scala.collection.mutable.WrappedArray[Seq[Any]]
required: Seq[A]
val my_df = spark.createDataFrame(my_a).toDF("label","features")
^
scala>
The first problem here is that you use a List to store row data. List is a homogeneous data structure, and since the only common type of row(2) (which is Any) and DenseVector is Any (Object), you end up with a Seq[Any].
The next issue is the use of row(2) at all. Since Row is effectively a collection of Any, this operation doesn't return any useful type, and the result couldn't be stored in a DataFrame without providing an explicit Encoder.
From a more Spark-ish perspective, it is not a good approach either: collecting just to transform data shouldn't need any comment, and mapping over Rows just to create Vectors doesn't make much sense either.
Assuming that there is no type mismatch, you can use a VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(df.columns(3), df.columns(4)))
.setOutputCol("features")
assembler.transform(df).select(df.columns(2), "features")
or, if you really want to handle this manually, a UDF:
import org.apache.spark.sql.functions.{col, udf}

// Vectors is the same linalg Vectors already used in the question's code.
val toVec = udf((x: Double, y: Double) => Vectors.dense(x, y))
df.select(col(df.columns(2)), toVec(col(df.columns(3)), col(df.columns(4))))
In general I would strongly recommend getting familiar with Scala before you start using it with Spark.

Spark: Summary statistics

I am trying to use Spark summary statistics as described at: https://spark.apache.org/docs/1.1.0/mllib-statistics.html
According to the Spark docs:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.mllib.linalg.DenseVector
val observations: RDD[Vector] = ... // an RDD of Vectors
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
I have a problem building the observations: RDD[Vector] object. Here is what I tried:
scala> val data:Array[Double] = Array(1, 2, 3, 4, 5)
data: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val v = new DenseVector(data)
v: org.apache.spark.mllib.linalg.DenseVector = [1.0,2.0,3.0,4.0,5.0]
scala> val observations = sc.parallelize(Array(v))
observations: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] = ParallelCollectionRDD[3] at parallelize at <console>:19
scala> val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
<console>:21: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Vector, but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
Questions:
1) How should I cast DenseVector to Vector?
2) In the real program, instead of an array of doubles, I have to get statistics on a collection that I get from an RDD using:
def countByKey(): Map[K, Long]
//Count the number of elements for each key, and return the result to the master as a Map.
So I have to do:
myRdd.countByKey().values.map(_.toDouble)
This does not make much sense, because instead of working with RDDs I now have to work with regular Scala collections, which at some point stop fitting into memory. All the advantages of Spark's distributed computation are lost.
How do I solve this in a scalable manner?
Update
In my case I have:
val cnts: org.apache.spark.rdd.RDD[Int] = prodCntByCity.map(_._2) // get product counts only
val doubleCnts: org.apache.spark.rdd.RDD[Double] = cnts.map(_.toDouble)
How to convert doubleCnts into observations: RDD[Vector] ?
1) You don't need to cast, you just need a type ascription:
val observations = sc.parallelize(Array(v: Vector))
2) Use aggregateByKey (map all the keys to 1 and reduce by summing) rather than countByKey.
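A minimal sketch of that idea, assuming a hypothetical pair RDD named pairs standing in for your keyed data; this keeps the counts distributed instead of pulling them onto the driver:
// Count the elements per key without collecting to the driver (unlike countByKey).
val countsByKey = pairs.aggregateByKey(0L)(
  (acc, _) => acc + 1L, // within a partition: add 1 per element
  _ + _                 // across partitions: sum the partial counts
)
The result is still an RDD of (key, count) pairs, so the counts can be mapped into vectors for colStats without ever materializing them locally.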
DenseVector has a compressed method, so you can change the RDD[DenseVector] to an RDD[Vector] like this:
val st = observations.map(x => x.compressed)
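To address the update directly, one way (just a sketch, reusing the doubleCnts RDD from the question) is to wrap each Double in a one-element vector, so colStats sees a single column:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Each count becomes a 1-dimensional vector; colStats then summarizes that single column.
val observations: org.apache.spark.rdd.RDD[Vector] = doubleCnts.map(d => Vectors.dense(d))
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)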

Type inference still needs enhancement; any better suggestion for this example?

For instance, in Clojure:
user=> (map #(* % 2) (concat [1.1 3] [5 7]))
(2.2 6 10 14)
but in Scala:
scala> List(1.1,3) ::: List(5, 7) map (_ * 2)
<console>:6: error: value * is not a member of AnyVal
List(1.1,3) ::: List(5, 7) map (_ * 2)
^
Here ::: produces a List[AnyVal], and then the map fails. Can this be coded more intuitively, like the Clojure example above?
Here are your individual lists:
scala> List(1.1,3)
res1: List[Double] = List(1.1, 3.0)
scala> List(5, 7)
res2: List[Int] = List(5, 7)
The computed least upper bound (LUB) of Double and Int, needed to capture the type of the new list containing elements of both argument lists passed to :::, is AnyVal. AnyVal also includes Boolean, for example, so it has no numeric operations defined.
As Randall has already said, the common supertype of Double and Int is AnyVal, which is inferred in this case. The only way I could make your example work is to add a type parameter to the second list:
scala> List[Double](1.1,3) ::: List(5, 7) map (_ * 2)
:6: error: value * is not a member of AnyVal
List[Double](1.1,3) ::: List(5, 7) map (_ * 2)
scala> List(1.1,3) ::: List[Double](5, 7) map (_ * 2)
res: List[Double] = List(2.2, 6.0, 10.0, 14.0)
I guess that in the latter case the implicit conversion from Int to Double is applied. I'm not sure why it is not applied when adding the type parameter to the first list, however.
The first list is of type List[Double] because of literal type widening. Scala sees the literals and notes that, even though they are of different types, they can be unified by widening some of them. If there were no type widening, then the most common superclass, AnyVal, would be adopted.
List(1.1 /* Double literal */, 3 /* Int literal */)
The second list is clearly List[Int], though explicitly asking for Double results in type widening of the literals.
List(5 /* Int literal */, 7 /* Int literal */)
Now, it's important to note that type widening is something that happens at compile time. The compiled code will not contain any Int 3, only Double 3.0. Once a list has been created, however, it is no longer possible to do type widening, because the stored objects are, in fact, different.
So, once you concatenate the two lists, the resulting element type will be a common superclass of Double and Int, namely AnyVal. As a result of Java interoperability, however, AnyVal cannot contain any useful methods (such as numeric operators).
I do wonder what Clojure does internally. Does it convert integers into doubles when concatenating? Or does it store everything as Object (like Scala does) but has smarter math operators?
I see two ways to make it work: parentheses around the second part,
scala> List (1.1, 3) ::: (List (5, 7) map (_ * 2))
res6: List[AnyVal] = List(1.1, 3, 10, 14)
or explicit Doubles everywhere:
scala> List (1.1, 3.) ::: List (5., 7.) map (_ * 2)
res9: List[Double] = List(2.2, 6.0, 10.0, 14.0)
which is of course semantically different.
Why not just,
(List(1.1,3) ::: List(5, 7)).asInstanceOf[List[Double]] map (_ * 2)
?
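A cast-free alternative (just a sketch) is to widen the Int list explicitly before mapping, which avoids relying on a runtime cast:
// Convert the Int elements to Double up front so the concatenated list is List[Double].
(List(1.1, 3.0) ::: List(5, 7).map(_.toDouble)) map (_ * 2)
// res: List[Double] = List(2.2, 6.0, 10.0, 14.0)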