MatrixEntry not iterable when processing CoordinateMatrix... PySpark MLlib

I'm trying to execute this line on a CoordinateMatrix...
test = test.entries.map(lambda (i, j, v): (j, (i, v)))
where the equivalent in Scala seems to work but fails in PySpark. The error I get when the line executes...
'MatrixEntry' object is not iterable
And confirming that I am working with a CoordinateMatrix...
>>> test = test_coord.entries
>>> test.first()
MatrixEntry(0, 0, 7.0)
Anyone know what might be off?

Suppose test is a CoordinateMatrix, then:
test.entries.map(lambda e: (e.j, (e.i, e.value)))
A side note: you can't unpack a tuple in a lambda's parameter list in Python 3 at all (it is a syntax error), and in Python 2 that unpacking only works when the argument is iterable, which a MatrixEntry is not. That is where the "'MatrixEntry' object is not iterable" error comes from, so map(lambda (x, y, z): ...) is not going to work here in any case.
Example:
from pyspark.mllib.linalg.distributed import CoordinateMatrix
test = CoordinateMatrix(sc.parallelize([(1,2,3), (4,5,6)]))
test.entries.collect()
# [MatrixEntry(1, 2, 3.0), MatrixEntry(4, 5, 6.0)]
test.entries.map(lambda e: (e.j, (e.i, e.value))).collect()
# [(2L, (1L, 3.0)), (5L, (4L, 6.0))]

Related

Setting the scalePosWeight parameter for the Spark xgBoost model in a CV grid

I am trying to tune my xgBoost model on Spark using Scala. My XGb parameter grid is as follows:
val xgbParamGrid = (new ParamGridBuilder()
  .addGrid(xgb.maxDepth, Array(8, 16))
  .addGrid(xgb.minChildWeight, Array(0.5, 1, 2))
  .addGrid(xgb.alpha, Array(0.8, 0.9, 1))
  .addGrid(xgb.lambda, Array(0.8, 1, 2))
  .addGrid(xgb.scalePosWeight, Array(1, 5, 9))
  .addGrid(xgb.subSample, Array(0.5, 0.8, 1))
  .addGrid(xgb.eta, Array(0.01, 0.1, 0.3, 0.5))
  .build())
The call to the cross validator is as follows:
val evaluator = (new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")
  .setMetricName("areaUnderPR"))
val cv = (new CrossValidator()
  .setEstimator(pipeline_model_xgb)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(xgbParamGrid)
  .setNumFolds(10))
val xgb_model = cv.fit(train)
I am getting the following error just for the scalePosWeight parameter:
error: type mismatch;
found : org.apache.spark.ml.param.DoubleParam
required: org.apache.spark.ml.param.Param[AnyVal]
Note: Double <: AnyVal (and org.apache.spark.ml.param.DoubleParam <:
org.apache.spark.ml.param.Param[Double]), but class Param is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
.addGrid(xgb.scalePosWeight, Array(1, 5, 9))
^
Based on my search, the message "You may wish to define T as +T instead" is common but I am not sure how to fix this here. Thanks for your help!
I ran into the same issue when setting the Array for minChildWeight, where the array was composed of Int values only. The solution that worked (for both scalePosWeight and minChildWeight) is to pass an Array of Doubles:
.addGrid(xgb.scalePosWeight, Array(1.0, 5.0, 9.0))
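The underlying cause is type inference on the array literal: Array(1, 5, 9) is an Array[Int], so the compiler tries to reconcile the Param[Double] with Int values by inferring T as AnyVal, and because Param is invariant in T the call cannot type-check (that is exactly what the error message says). A minimal sketch of how the literals are typed (the value names are just for illustration):
val allInts = Array(1, 5, 9)        // Array[Int]: pushes T towards AnyVal, hence the mismatch
val doubles = Array(1.0, 5.0, 9.0)  // Array[Double]: lines up with the DoubleParam
val mixed   = Array(0.8, 0.9, 1)    // Array[Double]: the Int literal widens to Double,
                                    // which is why the other addGrid lines compiled
With all-Double literals, T is inferred as Double and the addGrid call resolves as expected.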

How can I generate a list of n unique elements picked from a set?

How to generate a list of n unique values (Gen[List[T]]) from a set of values (not generators) using ScalaCheck? This post uses Gen[T]* instead of a set of values, and I can't seem to rewrite it to make it work.
EDIT
At @Jubobs' request I now shamefully display what I have tried so far, revealing my utter novice status at using ScalaCheck :-)
I simply tried to replace the gs: Gen[T]* repeated parameter with a Set in what @Eric wrote as a solution here:
def permute[T](n: Int, gs: Set[T]): Gen[Seq[T]] = {
  val perm = Random.shuffle(gs.toList)
  for {
    is <- Gen.pick(n, 1 until gs.size)
    xs <- Gen.sequence[List[T], T](is.toList.map(perm(_)))
  } yield xs
}
but is.toList.map(perm(_)) got underlined with red, with IntelliJ IDEA telling me that "You should read ScalaCheck API first before blind (although intuitive) trial and error", or maybe "Type mismatch, expected: Traversable[Gen[T]], actual List[T]", I can't remember.
I also tried several other ways, most of which I find ridiculous (and thus not worthy of posting) in hindsight, with the most naive being using @Eric's (otherwise useful and neat) solution as-is:
val g = for (i1 <- Gen.choose(0, myList1.length - 1);
             i2 <- Gen.choose(0, myList2.length - 1))
  yield new MyObject(myList1(i1), myList2(i2))
val pleaseNoDuplicatesPlease = permute(4, g, g, g, g, g)
After some testing I saw that pleaseNoDuplicatesPlease in fact contained duplicates, at which point I weighed my options of having to read through ScalaCheck API and understand a whole lot more than I do now (which will inevitably, and gradually come), or posting my question at StackOverflow (after carefully searching whether similar questions exist).
Gen.pick is right up your alley:
scala> import org.scalacheck.Gen
import org.scalacheck.Gen
scala> val set = Set(1,2,3,4,5,6)
set: scala.collection.immutable.Set[Int] = Set(5, 1, 6, 2, 3, 4)
scala> val myGen = Gen.pick(5, set).map { _.toList }
myGen: org.scalacheck.Gen[scala.collection.immutable.List[Int]] = org.scalacheck.Gen$$anon$3#78693eee
scala> myGen.sample
res0: Option[scala.collection.immutable.List[Int]] = Some(List(5, 6, 2, 3, 4))
scala> myGen.sample
res1: Option[scala.collection.immutable.List[Int]] = Some(List(1, 6, 2, 3, 4))
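If you want this as a Gen[List[Int]] with a quick property that confirms the uniqueness, a small sketch along the same lines (the property itself is only for illustration):
import org.scalacheck.{Gen, Prop}
val set = Set(1, 2, 3, 4, 5, 6)
// pick 5 distinct elements from the set and turn them into a List
val myGen: Gen[List[Int]] = Gen.pick(5, set).map(_.toList)
// sanity check: every sampled list is free of duplicates
val allUnique = Prop.forAll(myGen) { xs => xs.distinct.size == xs.size }
allUnique.check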

too many arguments for method apply: (i: Int)String in class Array

I have an RDD (r2Join1) which holds the following data
(100,(102|1001,201))
(100,(102|1001,200))
(100,(103|1002,201))
(100,(103|1002,200))
(150,(151|1003,204))
I want to transform this to the following
(102, (1001, 201))
(102, (1001, 200))
(103, (1002, 201))
(103, (1002, 200))
(151, (1003, 204))
i.e., I want to transform (k, (v1|v2, v3)) to (v1, (v2, v3)).
I did the following:
val m2 = r2Join1.map({case (k, (v1, v2)) => val p: Array[String] = v1.split("\\|") (p(0).toLong, (p(1).toLong, v2.toLong))})
I get the following error
error: too many arguments for method apply: (i: Int)String in class Array
I am new to Spark & Scala. Please let me know how this error can be resolved.
The code looks like it might be off in other areas, but without the rest I can't be sure. At minimum, this should get you moving: you need either a semicolon after your split or to put the two statements on separate lines.
val p: Array[String] = v1.split("\\|"); (p(0).toLong, (p(1).toLong, v2.toLong))
Without the semicolon the compiler is interpreting it as:
v1.split("\\|").apply(p(0).toLong...)
where apply acts as an indexer of the array in this case.
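Put together, a fixed version of the map (a sketch, assuming r2Join1 holds (String, (String, String)) pairs shaped like the sample data) could look like:
val m2 = r2Join1.map { case (k, (v1, v2)) =>
  // split "102|1001" into Array("102", "1001")
  val p: Array[String] = v1.split("\\|")
  // reshape (k, (v1a|v1b, v2)) into (v1a, (v1b, v2)) as Longs
  (p(0).toLong, (p(1).toLong, v2.toLong))
}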

Spark nested loop and RDD transformations

I am looking for example code that implements a nested loop in Spark. I am looking for the following functionality.
Given an RDD data1 = sc.parallelize(range(10)) and another dataset data2 = sc.parallelize(['a', 'b', 'c']), I am looking for something which will pick each 'key' from data2, append each 'value' from data1 to create a list of key-value pairs that look, perhaps in internal memory, something like [(a,1), (a, 2), (a, 3), ..., (c, 8), (c, 9)], and then do a reduce by key using a simple reducer function, say lambda x, y: x+y.
From the logic described above, the expected output is
(a, 45)
(b, 45)
(c, 45)
My attempt
data1 = sc.parallelize(range(10))
data2 = sc.parallelize(['a', 'b', 'c'])
f = lambda x: data2.map(lambda y: (y, x))
data1.map(f).reduceByKey(lambda x, y: x+y)
The obtained error
Exception: It appears that you are attempting to broadcast an RDD or
reference an RDD from an action or transformation. RDD transformations
and actions can only be invoked by the driver, not inside of other
transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x)
is invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I am a complete newbie at this, so any help is highly appreciated!
OS Information
I am running this on a standalone Spark installation on Linux. Details available if relevant.
Here is a potential solution. I am not too happy with it, though, because it doesn't represent a true for loop.
data1 = sc.parallelize(range(10))
data2 = sc.parallelize(['a', 'b', 'c'])
data2.cartesian(data1).reduceByKey(lambda x, y: x+y).collect()
gives
[('a', 45), ('c', 45), ('b', 45)]

How to explain that "Set(someList: _*)" gives the same result as "Set(someList).flatten"

I found a piece of code I wrote some time ago using _* to create a flattened set from a list of objects.
The real line of code is a bit more complex, and as I didn't remember exactly why that was there, it took a bit of experimentation to understand the effect, which is actually very simple, as seen in the following REPL session:
scala> val someList = List("a","a","b")
someList: List[java.lang.String] = List(a, a, b)
scala> val x = Set(someList: _*)
x: scala.collection.immutable.Set[java.lang.String] = Set(a, b)
scala> val y = Set(someList).flatten
y: scala.collection.immutable.Set[java.lang.String] = Set(a, b)
scala> x == y
res0: Boolean = true
Just as a reference of what happens without flatten:
scala> val z = Set(someList)
z: scala.collection.immutable.Set[List[java.lang.String]] = Set(List(a, a, b))
As I can't remember where I got that idiom from, I'd like to hear what is actually happening there and whether there is any consequence in going one way or the other (besides the readability impact).
P.S.: Maybe as an effect of the overuse of the underscore in Scala (IMHO), it is kind of difficult to find documentation about some of its use cases, especially when it comes together with a symbol commonly used as a wildcard in most search engines.
_* means "expand this collection as if its elements were written here literally", so
val x = Set(Seq(1,2,3,4): _*)
is the same as
val x = Set(1,2,3,4)
Set(someList), on the other hand, treats someList as a single argument, so the resulting set contains the list itself as its only element.
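To see where the ascription fits, here is a small varargs sketch (the sum method is just for illustration, not something from the question):
def sum(xs: Int*): Int = xs.sum   // inside the method, xs is a Seq[Int]
val nums = List(1, 2, 3)
sum(1, 2, 3)    // elements written out literally
sum(nums: _*)   // ": _*" expands the list into the varargs slot
// sum(nums)    // does not compile: a List[Int] is not an Int
Set.apply is declared with a varargs parameter in the same way, so Set(someList: _*) goes through exactly this mechanism.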
To look up funky symbols, you could use SymbolHound.